A First Look at Machine Learning with scikit-learn

Knowledge 04-01


Why

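Machine learning already has many mature applications in image and speech recognition, for example:

Image recognition, such as face recognition

Machine translation

Speech input

So how does machine learning actually do this? This article works through digit recognition, one of the simpler image recognition tasks, to find out.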

What

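scikit-learn is a fully open-source Python library for data mining and data analysis that wraps many machine learning algorithms. It makes it easy to train the SVM algorithm it provides and apply it to real-world scenarios, solving problems where the amount of data is not very large.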
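As a rough illustration of what an SVM setting such as gamma influences (the script later in this article passes gamma=0.001): scikit-learn's svm.SVC uses an RBF kernel by default, which scores the similarity of two samples as exp(-gamma * squared distance). The sketch below is plain Python written for illustration only, not scikit-learn's implementation; the function name rbf_kernel and the sample vectors are assumptions.

```python
import math


def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature vectors, for illustration only.
sample_a = [0.0, 0.0]
sample_b = [3.0, 4.0]  # squared distance from sample_a is 25

# Identical samples have similarity 1; the value shrinks as samples move apart.
print(rbf_kernel(sample_a, sample_a, gamma=0.001))
print(rbf_kernel(sample_a, sample_b, gamma=0.001))
# A larger gamma makes the similarity fall off faster with distance.
print(rbf_kernel(sample_a, sample_b, gamma=0.1))
```

A small gamma means distant samples still count as fairly similar, while a large gamma makes each training sample influence only its close neighbours.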

The example below demonstrates how to use the digit image dataset provided by sklearn to recognize the actual digits the images represent.

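The final figure shows four training digits with their labels in the top row and four test digits with their predictions in the bottom row.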

How

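Installing scikit-learn

According to the official NumPy documentation, the official source releases and packages are hosted on SourceForge. Download and install the latest release of each of the following from there:

SciPy: Scientific Library for Python

Numerical Python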

Once the dependencies are in place, install sklearn with the following command:

pip install -U scikit-learn

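A handwritten digit recognition example

The official scikit-learn documentation includes an example that trains an algorithm on a dataset bundled with sklearn and uses it to recognize handwritten digits.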

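Preparation

Install sklearn by following the steps in the first half of this article.

Install dateutil, a dependency of matplotlib:

pip install python-dateutil

Install pyparsing, a dependency of matplotlib:

pip install pyparsing

Install the plotting library matplotlib.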
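Before running the example, it can help to confirm that the packages above are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_modules and the REQUIRED_MODULES list are assumptions made for illustration. Note that the import names differ from two of the pip package names.

```python
import importlib.util

# Import names, not pip names:
# scikit-learn is imported as sklearn, python-dateutil as dateutil.
REQUIRED_MODULES = ["sklearn", "dateutil", "pyparsing", "matplotlib"]


def missing_modules(module_names):
    """Return the names in module_names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported missing, install it with pip as described above and run the check again.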

Python code:

# -*- coding: utf-8 -*-
"""
================================
Recognizing hand-written digits
================================
An example showing how the scikit-learn can be used to recognize images of
hand-written digits.
This example is commented in the
:ref:`tutorial section of the user manual <introduction>`.
"""
# Print the documentation above. In Python, a """ multi-line string placed at
# the top of a module, class, or function is treated as its docstring, a
# self-documenting mechanism from which documentation can be generated.
print(__doc__)

# Author information from the official scikit-learn example
# Author: Gael Varoquaux <gael dot varoquaux at normalesup dot org>
# License: BSD 3 clause

# Import the plotting library
import matplotlib.pyplot as plt

# Import sklearn's bundled datasets and the machine learning modules
from sklearn import datasets, svm, metrics

# Load sklearn's bundled handwritten digits dataset
digits = datasets.load_digits()

# The data we are interested in consists of 8x8 images made of grayscale cells.
# If we were working from image files directly, we would need pylab.imread to
# load them, and every image would have to be in the same 8x8 format.
# For the images in this dataset, digits.target gives the digit each represents.
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis("off")
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    plt.title("Training: %i" % label)

# To use a classifier, each 8x8 image must be flattened into a one-dimensional
# array, which turns digits.images into a (samples, features) matrix.
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
print(digits.images[0])
print(data[0])

# Create a classifier. The gamma value here is fixed by hand; techniques such as
# grid search and cross validation can be used to find a better value.
# The link below has an example that computes gamma itself:
# http://efavdb.com/machine-learning-with-wearable-sensors/
classifier = svm.SVC(gamma=0.001)

# Train the classifier on the first half of the data.
# Integer division (//) keeps the slice index an int under Python 3.
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

# Use the trained classifier to recognize the second half of the data
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

# Print the classifier's settings and a report comparing the expected values
# with the values actually recognized
print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

# Plot some handwritten digit images together with the recognized values
images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis("off")
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    plt.title("Prediction: %i" % prediction)

plt.show()
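To make the confusion matrix printed by the script easier to read, here is a minimal pure-Python sketch of what such a matrix counts: each row is a true digit and each column is the predicted digit, so correct predictions land on the diagonal. This is an illustration rather than scikit-learn's implementation, and the labels are hypothetical, not results from the digits dataset.

```python
def confusion_matrix(expected, predicted, n_classes):
    """Count how often each true class (row) was predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(expected, predicted):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for three classes, for illustration only.
expected = [0, 1, 1, 2, 2, 2]
predicted = [0, 1, 2, 2, 2, 1]

matrix = confusion_matrix(expected, predicted, n_classes=3)
for row in matrix:
    print(row)
print("accuracy:", accuracy(matrix))
```

Off-diagonal entries show which digits get mistaken for which others, which is often more informative than a single accuracy figure.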

The following code prints the conversion mentioned in the comments, from an 8x8 digit matrix to a flat array:

from sklearn import datasets

# Load the bundled digits dataset, as in the script above
digits = datasets.load_digits()

n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
print("Number picture matrix:")
print(digits.images[0])
print("flatten array:")
print(data[0])

Its output is:

Number picture matrix:
[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]
flatten array:
[  0.   0.   5.  13.   9.   1.   0.   0.   0.   0.  13.  15.  10.  15.   5.
   0.   0.   3.  15.   2.   0.  11.   8.   0.   0.   4.  12.   0.   0.   8.
   8.   0.   0.   5.   8.   0.   0.   9.   8.   0.   0.   4.  11.   0.   1.
  12.   7.   0.   0.   2.  14.   5.  10.  12.   0.   0.   0.   0.   6.  13.
  10.   0.   0.   0.]
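The flattening shown in this output simply places the rows of the image one after another (row-major order, which is NumPy's default for reshape). The sketch below reproduces that step in plain Python using the pixel values printed above; the helper name flatten is an assumption made for illustration.

```python
# The first 8x8 digit image, copied from the output above.
image = [
    [0, 0, 5, 13, 9, 1, 0, 0],
    [0, 0, 13, 15, 10, 15, 5, 0],
    [0, 3, 15, 2, 0, 11, 8, 0],
    [0, 4, 12, 0, 0, 8, 8, 0],
    [0, 5, 8, 0, 0, 9, 8, 0],
    [0, 4, 11, 0, 1, 12, 7, 0],
    [0, 2, 14, 5, 10, 12, 0, 0],
    [0, 0, 6, 13, 10, 0, 0, 0],
]


def flatten(matrix):
    """Concatenate the rows of a 2-D list into one flat list (row-major order)."""
    return [value for row in matrix for value in row]


flat = flatten(image)
print(len(flat))    # number of features per sample
print(flat[8:16])   # features 8..15 are the second row of the image
```

Each image therefore becomes one row of 64 features, which is the (samples, features) layout the classifier expects.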