
Python Data Processing in Practice: Multi-Class Text Classification with Scikit-Learn

[Overview] This is a technical post by data scientist Susan Li on the business applications of multi-class text classification, with a detailed walkthrough of the steps involved in classifying text with Scikit-Learn. Scikit-Learn is a powerful data analysis toolkit that handles many such tasks, including consumer complaint classification, spam filtering, and sentiment analysis. Using consumer complaints as the running example, the post covers problem definition, data exploration, imbalanced classes, text representation, classifier training, model selection, and model evaluation, showing in detail how Scikit-Learn is used at each step.

The Zhuanzhi content team has also released an extended version of this tutorial that uses PySpark for multi-class text classification on large datasets.

Multi-Class Text Classification with Scikit-Learn


There are many applications of text classification in the business world. For example, news stories are typically organized by topic; content or products are often tagged by category; and users can be grouped into cohorts based on how they discuss a product or brand online.

However, the vast majority of text classification articles and tutorials on the internet deal with binary classification, such as spam filtering (spam vs. non-spam) or sentiment analysis (positive vs. negative). In most cases, our real-world problems are more complex than that. So that is what we will do today: classify consumer finance complaints into 12 pre-defined classes. The data can be downloaded from data.gov[1].

We use [Python](https://www.python.org/) and [Jupyter Notebook](http://jupyter.org/) to develop our system, relying on Scikit-Learn for the machine learning components.

Problem Statement

Our problem is a supervised text classification problem, and our goal is to investigate which supervised machine learning methods are best suited to solve it.

Given a new complaint, we want to assign it to one of 12 categories. The classifier assumes that each new complaint is assigned to one and only one category. This is a multi-class text classification problem. I can't wait to see what we can achieve!

Data Exploration

Before diving into training machine learning models, we should first look at a few examples, as well as the number of complaints in each class:

import pandas as pd

df = pd.read_csv("Consumer_Complaints.csv")
df.head()

For this project we only need two of the columns: "Product" and "Consumer complaint narrative". The "Consumer complaint narrative" column is our input, and "Product" is our output, i.e. the class of the input.

We will remove the rows with missing values in the "Consumer complaint narrative" column, and add a column that encodes the product as an integer, because categorical variables are often better represented by integers than by strings.

We also create a couple of dictionaries for future use.

After cleaning up, these are the first five rows of the data:

from io import StringIO

col = ["Product", "Consumer complaint narrative"]
df = df[col]
df = df[pd.notnull(df["Consumer complaint narrative"])]
df.columns = ["Product", "Consumer_complaint_narrative"]
df["category_id"] = df["Product"].factorize()[0]
category_id_df = df[["Product", "category_id"]].drop_duplicates().sort_values("category_id")
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[["category_id", "Product"]].values)
df.head()

Imbalanced Classes

We see that the number of complaints per product is imbalanced. Consumers' complaints are skewed towards Debt collection, Credit reporting, and Mortgage.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
df.groupby("Product").Consumer_complaint_narrative.count().plot.bar(ylim=0)
plt.show()

When we encounter such problems, standard approaches tend to run into trouble. Conventional algorithms are often biased towards the majority classes and do not take the data distribution into account. In the worst case, the minority classes are treated as outliers and ignored. For some use cases, such as fraud detection or cancer prediction, we would need to carefully configure our model or artificially balance the dataset, for example by undersampling the majority class or oversampling the minority classes[2].

In our case of learning from imbalanced data, however, the majority classes might well be the ones of greatest interest. We would like a classifier that gives high prediction accuracy on the majority classes while maintaining reasonable accuracy on the minority classes.
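(As an aside, if we did want to balance the data, a minimal sketch of random oversampling with plain pandas could look like the following; upsampling every class to the size of the largest one is an illustrative assumption of mine, not something the original tutorial does.)

# Illustrative only: randomly oversample every class up to the size of the
# largest class. The tutorial itself keeps the data as-is.
max_count = df["Product"].value_counts().max()
df_balanced = (df.groupby("Product", group_keys=False)
                 .apply(lambda g: g.sample(max_count, replace=True, random_state=0)))
print(df_balanced["Product"].value_counts())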

Text Representation

Classifiers and learning algorithms cannot process text documents in their raw form, as most of them expect numerical feature vectors of fixed size rather than raw text documents of variable length. Therefore, in a preprocessing step, the text is converted into a more manageable feature representation.

One common approach to extracting features from text is the bag of words model: for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into account, but the order in which they occur is ignored.
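As a quick toy illustration of the bag of words idea (this snippet is my own addition, not part of the original tutorial), CountVectorizer builds exactly this kind of word-count representation:

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up mini "complaints", for illustration only.
docs = ["the bank charged an overdraft fee",
        "the card charged an annual fee"]
vect = CountVectorizer()
counts = vect.fit_transform(docs)
print(vect.get_feature_names())  # learned vocabulary (get_feature_names_out in newer scikit-learn)
print(counts.toarray())          # one row of word counts per document; word order is discarded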

Specifically, for each term in our dataset, we will calculate a measure called Term Frequency, Inverse Document Frequency, abbreviated to tf-idf. We will use sklearn.feature_extraction.text.TfidfVectorizer to calculate a tf-idf vector for each document:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm="l2", encoding="latin-1",
                        ngram_range=(1, 2), stop_words="english")
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
labels = df.category_id
features.shape

(4569, 12633)

Now, each of the 4569 documents is represented by 12633 features, corresponding to the tf-idf scores of different unigrams and bigrams.
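For reference, with the settings above (sublinear tf, smooth idf, which is scikit-learn's default, and L2 normalization), the weight of a term $t$ in a document $d$ follows scikit-learn's documented formula, roughly:

$$\mathrm{tfidf}(t,d) = \bigl(1 + \log \mathrm{tf}(t,d)\bigr) \cdot \left(\log\frac{1+n}{1+\mathrm{df}(t)} + 1\right)$$

where $n$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$; each document vector is then rescaled to unit L2 norm.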

We can use sklearn.feature_selection.chi2 to find the terms that are most correlated with each category:

from sklearn.feature_selection import chi2
import numpy as np

N = 2
for Product, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(" ")) == 1]
    bigrams = [v for v in feature_names if len(v.split(" ")) == 2]
    print("# '{}':".format(Product))
    print("  . Most correlated unigrams:\n. {}".format("\n. ".join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n. {}".format("\n. ".join(bigrams[-N:])))

# 'Bank account or service':
  . Most correlated unigrams:
    . bank
    . overdraft
  . Most correlated bigrams:
    . overdraft fees
    . checking account
# 'Consumer Loan':
  . Most correlated unigrams:
    . car
    . vehicle
  . Most correlated bigrams:
    . vehicle xxxx
    . toyota financial
# 'Credit card':
  . Most correlated unigrams:
    . citi
    . card
  . Most correlated bigrams:
    . annual fee
    . credit card
# 'Credit reporting':
  . Most correlated unigrams:
    . experian
    . equifax
  . Most correlated bigrams:
    . trans union
    . credit report
# 'Debt collection':
  . Most correlated unigrams:
    . collection
    . debt
  . Most correlated bigrams:
    . collect debt
    . collection agency
# 'Money transfers':
  . Most correlated unigrams:
    . wu
    . paypal
  . Most correlated bigrams:
    . western union
    . money transfer
# 'Mortgage':
  . Most correlated unigrams:
    . modification
    . mortgage
  . Most correlated bigrams:
    . mortgage company
    . loan modification
# 'Other financial service':
  . Most correlated unigrams:
    . dental
    . passport
  . Most correlated bigrams:
    . help pay
    . stated pay
# 'Payday loan':
  . Most correlated unigrams:
    . borrowed
    . payday
  . Most correlated bigrams:
    . big picture
    . payday loan
# 'Prepaid card':
  . Most correlated unigrams:
    . serve
    . prepaid
  . Most correlated bigrams:
    . access money
    . prepaid card
# 'Student loan':
  . Most correlated unigrams:
    . student
    . navient
  . Most correlated bigrams:
    . student loans
    . student loan
# 'Virtual currency':
  . Most correlated unigrams:
    . handles
    . https
  . Most correlated bigrams:
    . xxxx provider
    . money want

The results above look quite reasonable.

Multi-Class Classifier: Features and Design

After the data transformation above, we now have the features and labels for all documents, and it is time to train a classifier. There are many algorithms we can use for this kind of problem; here we begin with a Naive Bayes classifier:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df["Consumer_complaint_narrative"],
                                                    df["Product"], random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

After fitting the model on the training set, let's make a few predictions:

print(clf.predict(count_vect.transform(["This company refuses to provide me verification

and validation of debt per my right under the FDCPA. I do not believe this debt is mine."

])))

[『Debt collection』]

df[df["Consumer_complaint_narrative"] =="This company refuses to provide me verification

and validation of debt per my right under the FDCPA. I do not believe this debt is mine."]

print(clf.predict(count_vect.transform(["I am disputing the inaccurate information the

Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16

and Chex Systems only deleted the items that I mentioned in the letter and not all the

items that were actually listed on the police report. In other words they wanted me to

say word for word to them what items were fraudulent. The total disregard of the police

report and what accounts that it states that are fraudulent. If they just had paid a

little closer attention to the police report I would not been in this position now and

they would n"t have to research once again. I would like the reported information to be

removed : XXXX XXXX XXXX"])))

[『Credit reporting』]

df[df["Consumer_complaint_narrative"] =="I am disputing the inaccurate information the

Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16

and Chex Systems only deleted the items that I mentioned in the letter and not all the

items that were actually listed on the police report. In other words they wanted me to say

word for word to them what items were fraudulent. The total disregard of the police report

and what accounts that it states that are fraudulent. If they just had paid a little closer

attention to the police report I would not been in this position now and they would n"t have

to research once again. I would like the reported information to be removed : XXXX XXXX XXXX

"]

Not bad: both predictions came out right.
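As a side note (my own sketch, not from the original post), these three steps can be chained into a single scikit-learn Pipeline, so that raw complaint strings can be fed to fit and predict directly:

from sklearn.pipeline import Pipeline

# Equivalent to the manual CountVectorizer -> TfidfTransformer -> MultinomialNB steps above.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(X_train, y_train)
print(text_clf.predict(["This company refuses to provide me verification and validation of debt per my right under the FDCPA."]))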

Model Selection

We are now ready to experiment with different machine learning models, evaluate their accuracy, and identify any potential issues.

We will benchmark the following four models:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring="accuracy", cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=["model_name", "fold_idx", "accuracy"])

import seaborn as sns

sns.boxplot(x="model_name", y="accuracy", data=cv_df)
sns.stripplot(x="model_name", y="accuracy", data=cv_df,
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()

cv_df.groupby("model_name").accuracy.mean()

model_name

LinearSVC: 0.822890

LogisticRegression: 0.792927

MultinomialNB: 0.688519

RandomForestClassifier: 0.443826

Name: accuracy, dtype: float64

LinearSVC and Logistic Regression perform better than the other two classifiers, with LinearSVC having a slight edge at a median accuracy of around 82%.
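To see how stable these scores are across folds, we can also aggregate the standard deviation (a small addition of mine, using the cv_df DataFrame built above):

# Mean and fold-to-fold standard deviation of accuracy, per model.
print(cv_df.groupby("model_name").accuracy.agg(["mean", "std"]))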

Model Evaluation

Continuing with our best model (LinearSVC), we will look at the confusion matrix and show the discrepancies between predicted and actual labels.

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features,
    labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(conf_mat, annot=True, fmt="d",
            xticklabels=category_id_df.Product.values,
            yticklabels=category_id_df.Product.values)
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()

The vast majority of predictions end up on the diagonal (predicted label = actual label), which is what we want. However, there are a number of misclassifications, and it is worth seeing what caused them:

from IPython.display import display

for predicted in category_id_df.category_id:
    for actual in category_id_df.category_id:
        if predicted != actual and conf_mat[actual, predicted] >= 10:
            print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual],
                                                                 id_to_category[predicted],
                                                                 conf_mat[actual, predicted]))
            display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]]
                    [["Product", "Consumer_complaint_narrative"]])
            print("")

As you can see, some of the misclassified complaints touch on more than one topic (for example, complaints that involve both a credit card and a credit report). This sort of ambiguity will always cause some errors.

Next, we again look at the terms most strongly associated with each category, this time using the coefficients of the fitted LinearSVC model:

model.fit(features, labels)

N = 2
for Product, category_id in sorted(category_to_id.items()):
    indices = np.argsort(model.coef_[category_id])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in reversed(feature_names) if len(v.split(" ")) == 1][:N]
    bigrams = [v for v in reversed(feature_names) if len(v.split(" ")) == 2][:N]
    print("# '{}':".format(Product))
    print("  . Top unigrams:\n. {}".format("\n. ".join(unigrams)))
    print("  . Top bigrams:\n. {}".format("\n. ".join(bigrams)))

# 'Bank account or service':
  . Top unigrams:
    . bank
    . account
  . Top bigrams:
    . debit card
    . overdraft fees
# 'Consumer Loan':
  . Top unigrams:
    . vehicle
    . car
  . Top bigrams:
    . personal loan
    . history xxxx
# 'Credit card':
  . Top unigrams:
    . card
    . discover
  . Top bigrams:
    . credit card
    . discover card
# 'Credit reporting':
  . Top unigrams:
    . equifax
    . transunion
  . Top bigrams:
    . xxxx account
    . trans union
# 'Debt collection':
  . Top unigrams:
    . debt
    . collection
  . Top bigrams:
    . account credit
    . time provided
# 'Money transfers':
  . Top unigrams:
    . paypal
    . transfer
  . Top bigrams:
    . money transfer
    . send money
# 'Mortgage':
  . Top unigrams:
    . mortgage
    . escrow
  . Top bigrams:
    . loan modification
    . mortgage company
# 'Other financial service':
  . Top unigrams:
    . passport
    . dental
  . Top bigrams:
    . stated pay
    . help pay
# 'Payday loan':
  . Top unigrams:
    . payday
    . loan
  . Top bigrams:
    . payday loan
    . pay day
# 'Prepaid card':
  . Top unigrams:
    . prepaid
    . serve
  . Top bigrams:
    . prepaid card
    . use card
# 'Student loan':
  . Top unigrams:
    . navient
    . loans
  . Top bigrams:
    . student loan
    . sallie mae
# 'Virtual currency':
  . Top unigrams:
    . https
    . tx
  . Top bigrams:
    . money want
    . xxxx provider

They are consistent with our expectations.

Finally, we print out the classification report for each class:

from sklearn import metrics

print(metrics.classification_report(y_test, y_pred, target_names=df["Product"].unique()))

Source code:

https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb

[1] https://catalog.data.gov/dataset/consumer-complaint-database

[2] https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

Reference:

https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

-END-
