Tutorial | How to Implement Multi-Class Text Classification with Scikit-Learn

Technology 03-07

Selected from towardsdatascience

Author: Susan Li

Compiled by 機器之心

Contributors: 程耀彤, 黃小天

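Most text classification articles and tutorials on the internet deal with binary classification; the problem addressed here is more complex. The author builds the system with Python and Jupyter Notebook and uses Scikit-Learn to classify consumer finance complaints into 12 predefined categories. The GitHub address of the project follows.

GitHub: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb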

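Text classification has many applications in business. For example, news stories are typically organized by topic; content or products are often tagged by category; users can be grouped into cohorts based on how they talk about a product or brand online...

However, the vast majority of text classification articles and tutorials on the internet cover binary text classification, such as spam filtering and sentiment analysis. In most cases, real-world problems are more complicated. So that is what we are going to do today: classify consumer finance complaints into 12 predefined categories.

We use Python and Jupyter Notebook to develop the system, relying on Scikit-Learn for the machine learning components. If you would like to see a PySpark implementation, read the next article.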

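Problem Formulation

This is a supervised text classification problem, and our goal is to investigate which supervised machine learning methods are best suited to solving it.

When a new complaint comes in, we want to assign it to one of the 12 categories. The classifier assumes that each new complaint is assigned to one and only one category. This is a multi-class text classification problem. I can't wait to see what we can achieve!

Data Exploration

Before diving into training machine learning models, we should first look at some examples and at the number of complaints in each class: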

import pandas as pd

df = pd.read_csv("Consumer_Complaints.csv")
df.head()

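For this project, we need only two columns: "Product" and "Consumer complaint narrative".

Input: Consumer_complaint_narrative

Example: "My credit report contains outdated information that I have previously disputed and that has still not been removed; it is more than seven years old and does not meet credit reporting requirements."

Output: Product

Example: Credit reporting

We will remove rows with missing values in the "Consumer complaint narrative" column and add a column that encodes the product as an integer, because categorical variables are often better represented by integers than by strings.

We also create a couple of dictionaries for future use.

After cleaning up, these are the first five rows of the data we will be working on: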

# Keep only the two columns we need and drop rows without a narrative.
col = ["Product", "Consumer complaint narrative"]
df = df[col]
df = df[pd.notnull(df["Consumer complaint narrative"])].copy()
df.columns = ["Product", "Consumer_complaint_narrative"]

# Encode each product as an integer id.
df["category_id"] = df["Product"].factorize()[0]

# Lookup tables between product names and integer ids.
category_id_df = df[["Product", "category_id"]].drop_duplicates().sort_values("category_id")
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[["category_id", "Product"]].values)
df.head()
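To make the cleaning step concrete, here is a minimal sketch on a made-up four-row table. The rows are invented for illustration and only the column names follow the article; it shows how factorize() numbers products in order of first appearance and how the two lookup dictionaries are built.

```python
import pandas as pd

# Toy data standing in for the complaints file (all values are made up).
toy = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Mortgage", "Student loan"],
    "Consumer_complaint_narrative": ["text a", None, "text c", "text d"],
})

# Drop rows whose narrative is missing, then encode products as integers.
toy = toy[pd.notnull(toy["Consumer_complaint_narrative"])].copy()
toy["category_id"] = toy["Product"].factorize()[0]

# One row per (product, id) pair, used to build both lookup dictionaries.
pairs = toy[["Product", "category_id"]].drop_duplicates().sort_values("category_id")
category_to_id = dict(pairs.values)
id_to_category = dict(pairs[["category_id", "Product"]].values)

print(category_to_id)
print(id_to_category)
```

The "Credit card" row disappears because its narrative is missing, so only two products receive an id here.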

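Imbalanced Classes

We see that the number of complaints per product is imbalanced. Consumers' complaints are concentrated on debt collection, credit reporting, and mortgages.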

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
df.groupby("Product").Consumer_complaint_narrative.count().plot.bar(ylim=0)
plt.show()

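When we encounter such problems, we are bound to have difficulties solving them with standard algorithms. Conventional algorithms are often biased toward the majority class and do not take the data distribution into account. In the worst case, minority classes are treated as outliers and ignored. For some cases, such as fraud detection or cancer prediction, we would need to carefully configure our model or artificially balance the dataset, for example by undersampling or oversampling each class.

However, in our case of learning from imbalanced data, the majority classes are the ones we are most interested in. We want a classifier that gives high prediction accuracy on the majority classes while maintaining reasonable accuracy for the minority classes. Therefore, we will leave the data as it is.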
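Before deciding whether to rebalance, it can help to put a number on the imbalance. The sketch below uses an invented label column (the counts are not from the real dataset) and computes the share of each class and the ratio between the largest and smallest class.

```python
import pandas as pd

# Hypothetical label column; the counts are invented for illustration.
labels = pd.Series(
    ["Debt collection"] * 6 + ["Mortgage"] * 3 + ["Virtual currency"] * 1
)

# Absolute and relative class sizes, largest class first.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Ratio between the most and the least frequent class.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(imbalance_ratio)
```

If the minority classes mattered more than they do here, one option in Scikit-Learn would be passing class_weight="balanced" to estimators that accept it, such as LinearSVC or LogisticRegression; the article does not do this and keeps the data unchanged.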

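Text Representation

Classifiers and learning algorithms cannot directly process text documents in their original form, because most of them expect fixed-size numerical feature vectors rather than raw text documents of variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

A common approach for extracting features from text is the bag-of-words model: for each document (a complaint narrative in our case), the presence, and often the frequency, of words is taken into account, while the order in which they occur is ignored.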
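As a small illustration of the bag-of-words idea, the following sketch builds count vectors for two invented sentences that contain the same words in a different order. It uses CountVectorizer with default settings, which is an assumption made for this example rather than a step from the article's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents with the same words in a different order.
docs = ["the bank charged a fee", "a fee the bank charged"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

# Word order is discarded, so both documents map to the same count vector.
same_vector = bool((counts[0] == counts[1]).all())

print(sorted(vectorizer.vocabulary_))
print(counts)
print(same_vector)
```

Note that the default tokenizer drops single-character tokens such as "a", so they never reach the vocabulary.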

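Specifically, for each term in our dataset, we will calculate a measure called term frequency-inverse document frequency, abbreviated tf-idf. We will use sklearn.feature_extraction.text.TfidfVectorizer to calculate a tf-idf vector for each consumer complaint narrative:

sublinear_tf is set to True to use a logarithmic form of the term frequency.

min_df is the minimum number of documents a word must appear in to be kept.

norm is set to l2 to ensure that all our feature vectors have a Euclidean norm of 1.

ngram_range is set to (1, 2) to indicate that we want to consider both unigrams and bigrams.

stop_words is set to "english" to remove all common stop words ("a", "the", ...) and reduce the number of noisy features.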

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm="l2", encoding="latin-1", ngram_range=(1, 2), stop_words="english")
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
labels = df.category_id
features.shape

(4569, 12633)

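Now, each of the 4569 consumer complaint narratives is represented by 12633 features, which are the tf-idf scores of different unigrams and bigrams.

We can use sklearn.feature_selection.chi2 to find the terms that are most correlated with each product: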

from sklearn.feature_selection import chi2
import numpy as np

N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Chi-squared score of every feature against "is this product or not".
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in recent scikit-learn releases.
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(" ")) == 1]
    bigrams = [v for v in feature_names if len(v.split(" ")) == 2]
    print("# '{}':".format(Product))
    print("  . Most correlated unigrams:\n. {}".format("\n. ".join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n. {}".format("\n. ".join(bigrams[-N:])))

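# 'Bank account or service':
  . Most correlated unigrams:
. bank
. overdraft
  . Most correlated bigrams:
. overdraft fees
. checking account
# 'Consumer Loan':
  . Most correlated unigrams:
. car
. vehicle
  . Most correlated bigrams:
. vehicle xxxx
. toyota financial
# 'Credit card':
  . Most correlated unigrams:
. citi
. card
  . Most correlated bigrams:
. annual fee
. credit card
# 'Credit reporting':
  . Most correlated unigrams:
. experian
. equifax
  . Most correlated bigrams:
. trans union
. credit report
# 'Debt collection':
  . Most correlated unigrams:
. collection
. debt
  . Most correlated bigrams:
. collect debt
. collection agency
# 'Money transfers':
  . Most correlated unigrams:
. wu
. paypal
  . Most correlated bigrams:
. western union
. money transfer
# 'Mortgage':
  . Most correlated unigrams:
. modification
. mortgage
  . Most correlated bigrams:
. mortgage company
. loan modification
# 'Other financial service':
  . Most correlated unigrams:
. dental
. passport
  . Most correlated bigrams:
. help pay
. stated pay
# 'Payday loan':
  . Most correlated unigrams:
. borrowed
. payday
  . Most correlated bigrams:
. big picture
. payday loan
# 'Prepaid card':
  . Most correlated unigrams:
. serve
. prepaid
  . Most correlated bigrams:
. getting money
. prepaid card
# 'Student loan':
  . Most correlated unigrams:
. student
. navient
  . Most correlated bigrams:
. student loans
. student loan
# 'Virtual currency':
  . Most correlated unigrams:
. handles
. https
  . Most correlated bigrams:
. xxxx provider
. want money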

They all make sense, don't they?

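Multi-Class Classifier: Features and Design

To train supervised classifiers, we first transformed the "Consumer complaint narrative" into a vector of numbers. We explored vector representations such as TF-IDF weighted vectors.

With this vector representation of the text, we can train supervised classifiers to take unseen "Consumer complaint narrative" texts and predict the "Product" they belong to.

After all the data transformation above, we now have all the features and labels, and it is time to train the classifiers. There are many algorithms we can use for this type of problem.

Naive Bayes classifier: the variant most suitable for word counts is the multinomial one: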

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df["Consumer_complaint_narrative"], df["Product"], random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

After fitting on the training set, let's make some predictions.

print(clf.predict(count_vect.transform(["This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."])))

['Debt collection']

df[df["Consumer_complaint_narrative"] == "This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."]

print(clf.predict(count_vect.transform(["I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"])))

['Credit reporting']

df[df["Consumer_complaint_narrative"] == "I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"]

Not too shabby!
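One detail worth noting: the predictions above pass raw counts from count_vect to a classifier that was fit on tf-idf vectors. A possible way to keep training and prediction consistent is to chain the steps in a Scikit-Learn Pipeline, so the same transformations are applied every time. The sketch below is not the author's code; the training sentences, the step names, and the query are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set with two of the product categories.
train_texts = [
    "debt collector keeps calling me about a debt",
    "collection agency reported a debt I do not owe",
    "credit report shows an account that is not mine",
    "equifax credit report has wrong information",
]
train_labels = [
    "Debt collection",
    "Debt collection",
    "Credit reporting",
    "Credit reporting",
]

# Counts -> tf-idf -> multinomial Naive Bayes, applied identically
# when fitting and when predicting.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
text_clf.fit(train_texts, train_labels)

prediction = text_clf.predict(["collector calling about debt"])
print(prediction)
```

With a pipeline, predict() accepts raw strings directly, which removes the need to remember which transformers were applied during training.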

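Model Selection

We are now ready to experiment with different machine learning models, evaluate their accuracy, and find the source of any potential issues.

We will benchmark the following four models:

Logistic Regression

(Multinomial) Naive Bayes

Linear Support Vector Machine

Random Forest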

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5

# Collect one (model, fold, accuracy) row per cross-validation fold.
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring="accuracy", cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=["model_name", "fold_idx", "accuracy"])

import seaborn as sns

sns.boxplot(x="model_name", y="accuracy", data=cv_df)
sns.stripplot(x="model_name", y="accuracy", data=cv_df, size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()

cv_df.groupby("model_name").accuracy.mean()

model_name
LinearSVC                 0.822890
LogisticRegression        0.792927
MultinomialNB             0.688519
RandomForestClassifier    0.443826
Name: accuracy, dtype: float64

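LinearSVC and Logistic Regression perform better than the other two classifiers, with LinearSVC having a slight advantage at a median accuracy of around 82%.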
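Since the text refers to the median while the table reports the mean, it may be useful to compute several summary statistics per model at once. The sketch below uses a hand-written table in the same shape as cv_df; the model names and accuracy values are placeholders, not results from the article.

```python
import pandas as pd

# Placeholder fold accuracies for two hypothetical models.
entries = [
    ("ModelA", 0, 0.80), ("ModelA", 1, 0.84),
    ("ModelB", 0, 0.70), ("ModelB", 1, 0.74),
]
toy_cv_df = pd.DataFrame(entries, columns=["model_name", "fold_idx", "accuracy"])

# Mean, median and spread of the fold accuracies for each model.
summary = toy_cv_df.groupby("model_name").accuracy.agg(["mean", "median", "std"])
print(summary)
```

Looking at the standard deviation alongside the mean gives a rough idea of whether the gap between two models is larger than the fold-to-fold variation.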

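Model Evaluation

Continuing with our best model (LinearSVC), we will look at the confusion matrix and show the discrepancies between predicted and actual labels.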

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(conf_mat, annot=True, fmt="d",
            xticklabels=category_id_df.Product.values, yticklabels=category_id_df.Product.values)
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()

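As we hoped, the vast majority of predictions end up on the diagonal (predicted label = actual label). However, there are still a number of misclassifications, and it might be interesting to see what causes them: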

from IPython.display import display

for predicted in category_id_df.category_id:
    for actual in category_id_df.category_id:
        # Only show confusions that occur at least 10 times.
        if predicted != actual and conf_mat[actual, predicted] >= 10:
            print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual], id_to_category[predicted], conf_mat[actual, predicted]))
            display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][["Product", "Consumer_complaint_narrative"]])
            print("")

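As you can see, some of the misclassified complaints touch on more than one subject (for example, complaints involving both credit cards and credit reports). This kind of error will always happen.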
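The indexing conf_mat[actual, predicted] used above relies on the convention that rows are actual labels and columns are predicted labels. A small sketch with invented labels makes that convention explicit:

```python
from sklearn.metrics import confusion_matrix

# Invented labels for three classes (0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Diagonal entries are correct predictions.
correct = cm.trace()

# cm[0, 1] counts samples of actual class 0 that were predicted as class 1.
zero_as_one = cm[0, 1]
print(correct, zero_as_one)
```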

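Next, we look at the terms that the fitted model weights most strongly for each category, ranking features by the LinearSVC coefficients (coef_):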

model.fit(features, labels)
N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Sort features by their coefficient for this category.
    indices = np.argsort(model.coef_[category_id])
    # get_feature_names() was removed in recent scikit-learn releases.
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in reversed(feature_names) if len(v.split(" ")) == 1][:N]
    bigrams = [v for v in reversed(feature_names) if len(v.split(" ")) == 2][:N]
    print("# '{}':".format(Product))
    print("  . Top unigrams:\n       . {}".format("\n       . ".join(unigrams)))
    print("  . Top bigrams:\n       . {}".format("\n       . ".join(bigrams)))

# 'Bank account or service':
  . Top unigrams:
       . bank
       . account
  . Top bigrams:
       . debit card
       . overdraft fees
# 'Consumer Loan':
  . Top unigrams:
       . vehicle
       . car
  . Top bigrams:
       . personal loan
       . history xxxx
# 'Credit card':
  . Top unigrams:
       . card
       . discover
  . Top bigrams:
       . credit card
       . discover card
# 'Credit reporting':
  . Top unigrams:
       . equifax
       . transunion
  . Top bigrams:
       . xxxx account
       . trans union
# 'Debt collection':
  . Top unigrams:
       . debt
       . collection
  . Top bigrams:
       . account credit
       . time provided
# 'Money transfers':
  . Top unigrams:
       . paypal
       . transfer
  . Top bigrams:
       . money transfer
       . send money
# 'Mortgage':
  . Top unigrams:
       . mortgage
       . escrow
  . Top bigrams:
       . loan modification
       . mortgage company
# 'Other financial service':
  . Top unigrams:
       . passport
       . dental
  . Top bigrams:
       . stated pay
       . help pay
# 'Payday loan':
  . Top unigrams:
       . payday
       . loan
  . Top bigrams:
       . payday loan
       . pay day
# 'Prepaid card':
  . Top unigrams:
       . prepaid
       . serve
  . Top bigrams:
       . prepaid card
       . use card
# 'Student loan':
  . Top unigrams:
       . navient
       . loan
  . Top bigrams:
       . student loan
       . sallie mae
# 'Virtual currency':
  . Top unigrams:
       . https
       . tx
  . Top bigrams:
       . want money
       . xxxx provider

They are consistent with our expectations.

Finally, we print out the classification report for each class:

from sklearn import metrics

print(metrics.classification_report(y_test, y_pred, target_names=df["Product"].unique()))
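To read such a report, it helps to know how the per-class numbers are derived. The following sketch runs classification_report on four invented labels and requests a dictionary via output_dict=True so that individual values can be inspected; the labels are illustrative only.

```python
from sklearn.metrics import classification_report

# Invented labels: one Mortgage complaint is misclassified as Credit card.
y_actual = ["Mortgage", "Mortgage", "Credit card", "Credit card"]
y_guess = ["Mortgage", "Credit card", "Credit card", "Credit card"]

report = classification_report(y_actual, y_guess, output_dict=True)

# Precision: share of predictions for a class that were right.
# Recall: share of a class's true samples that were found.
mortgage_precision = report["Mortgage"]["precision"]
mortgage_recall = report["Mortgage"]["recall"]
overall_accuracy = report["accuracy"]

print(mortgage_precision, mortgage_recall, overall_accuracy)
```

Here every "Mortgage" prediction is correct, so precision is perfect, but only one of the two actual mortgage complaints is found, so recall is one half.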

Original article: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

This article was compiled by 機器之心; please contact the official account for permission to reprint.
