Recognizing CAPTCHAs with KNN

Knowledge 02-09


Source: 邱康singasong

https://segmentfault.com/a/1190000006070219

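Preface

Some time ago I built a campus social app. One piece of its logic used the user's academic affairs system to confirm that the user really was a current university student. The basic idea was to take the user's account and password and confirm the information with a crawler. Many academic affairs systems, however, use CAPTCHAs. At the time, our server downloaded the CAPTCHA and forwarded it to the client, the user typed it in, and it was submitted to the server together with the account and password; the server then simulated a login to the academic affairs system to check whether the user could log in. The CAPTCHA clearly dashed our hopes of fast verification, but there was no alternative then. Recently I read some machine learning material and felt that the very simple CAPTCHAs most schools use should be breakable with something like KNN, so I sorted out my thoughts and rolled up my sleeves.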

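Analysis

Our school's CAPTCHA looks like this:

It is simply a set of characters that are rotated and overlaid with some faint noise. To recognize it we reverse that process: first binarize the image to remove the noise, then split out the individual characters, and finally rotate each one to a standard orientation. From these processed images we pick templates. Each new CAPTCHA is then processed in the same way and compared with the templates, and the template at the smallest distance gives the answer (this is the idea of KNN; this article uses K=1). The steps are explained one by one below.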
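As a rough illustration of the nearest-template idea, here is a minimal Python sketch of KNN over tiny binary images. It is not the article's code (which is MATLAB); the names `hamming` and `knn_predict` and the 3x3 toy templates are made up for the example. For 0/1 images the Hamming distance equals the sum of absolute pixel differences, and with k=1 the prediction is simply the label of the closest template.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where two equal-sized 0/1 images differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def knn_predict(image, templates, k=1):
    """templates: list of (label, image). Returns the majority label among the k nearest."""
    ranked = sorted(templates, key=lambda t: hamming(image, t[1]))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 3x3 "glyphs" (hypothetical, far smaller than real character tiles)
TEMPLATES = [
    ("I", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    ("L", [[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
    ("T", [[1, 1, 1], [0, 1, 0], [0, 1, 0]]),
]

noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]  # an "L" with one pixel flipped
print(knn_predict(noisy_l, TEMPLATES))
```

With several templates per character, raising k turns the decision into a vote among the closest templates; the article stays with k=1.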

Collecting CAPTCHAs

First we need a large number of CAPTCHAs, which we collect with a crawler. The code is as follows:

# -*- coding: UTF-8 -*-
# Python 2: urllib2 and cookielib were renamed in Python 3.
# The original import line also listed urllib, string and Image, which this script never uses.
import urllib2
import cookielib


def getchk(number):
    # Create the cookie jar and install an opener that uses it
    cookie = cookielib.LWPCookieJar()
    cookieSupport = urllib2.HTTPCookieProcessor(cookie)
    opener = urllib2.build_opener(cookieSupport, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    # First request to the academic affairs system, to obtain a cookie
    # Pretend to be a browser
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
    }
    req0 = urllib2.Request(
        url="http://mis.teach.ustc.edu.cn",
        headers=headers  # request headers
    )
    # Catch HTTP errors
    try:
        result0 = urllib2.urlopen(req0)
    except urllib2.HTTPError as e:
        print(e.code)

    # Extract the cookies as name=value pairs. They are joined once, after the loop;
    # joining inside the loop would turn the list into a string and fail on a second cookie.
    getcookie = []
    for item in cookie:
        getcookie.append(item.name + "=" + item.value)
    getcookie = "; ".join(getcookie)

    # Update the headers
    headers["Origin"] = "http://mis.teach.ustc.edu.cn"
    headers["Referer"] = "http://mis.teach.ustc.edu.cn/userinit.do"
    headers["Content-Type"] = "application/x-www-form-urlencoded"
    headers["Cookie"] = getcookie

    # The ./source/ directory must already exist
    for i in range(number):
        req = urllib2.Request(
            url="http://mis.teach.ustc.edu.cn/randomImage.do?date='1469451446894'",
            headers=headers  # request headers
        )
        response = urllib2.urlopen(req)
        status = response.getcode()
        picData = response.read()
        if status == 200:
            localPic = open("./source/" + str(i) + ".jpg", "wb")
            localPic.write(picData)
            localPic.close()
        else:
            print("failed to get Check Code ")


if __name__ == "__main__":
    getchk(500)

This downloads 500 CAPTCHAs into the source directory, as shown in the figure:
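The crawler above targets Python 2. For readers on Python 3, the following is a hedged sketch of the same flow using `urllib.request` and `http.cookiejar`. The host and the `randomImage.do` path come from the article (the `date` query string is left out here); the function names, the `timeout` parameter and the `JSESSIONID` cookie name in the demo are assumptions for illustration, and the site may no longer respond. Because `HTTPCookieProcessor` resends cookies automatically, the manual `Cookie` header is not needed; `cookie_header` is included only to show the name=value join.

```python
import os
import urllib.request
from http.cookiejar import CookieJar
from types import SimpleNamespace

BASE_URL = "http://mis.teach.ustc.edu.cn"   # host named in the article
CAPTCHA_URL = BASE_URL + "/randomImage.do"  # path named in the article


def cookie_header(cookies):
    """Join cookies into one 'name=value; name=value' header value."""
    return "; ".join("{}={}".format(c.name, c.value) for c in cookies)


def fetch_captchas(number, out_dir="./source", timeout=10):
    """Download `number` CAPTCHA images into out_dir; returns how many were saved."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    opener.open(BASE_URL, timeout=timeout).close()  # first visit sets the session cookie
    os.makedirs(out_dir, exist_ok=True)
    saved = 0
    for i in range(number):
        with opener.open(CAPTCHA_URL, timeout=timeout) as response:
            if response.status != 200:
                continue
            with open(os.path.join(out_dir, "{}.jpg".format(i)), "wb") as f:
                f.write(response.read())
            saved += 1
    return saved


# Offline check of the header helper with stand-in cookie objects (hypothetical names)
demo = [SimpleNamespace(name="JSESSIONID", value="abc"), SimpleNamespace(name="lang", value="zh")]
print(cookie_header(demo))
```

Calling `fetch_captchas(500)` would mirror `getchk(500)`; it is deliberately not called here because it needs network access.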

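Binarization

MATLAB's rich set of image processing functions saves us a lot of time. We iterate over the source folder, binarize every CAPTCHA image, and store the processed images in the bw directory. The code is as follows: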

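mydir = './source/';
bw = './bw/';
% Make sure the folder path ends with a separator. The separator character did not
% survive in the source text; '/' is used here to match the paths above.
if mydir(end) ~= '/'
    mydir = [mydir, '/'];
end
DIRS = dir([mydir, '*.jpg']);  % match by extension
n = length(DIRS);
for i = 1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir, DIRS(i).name));
        img = rgb2gray(img);  % convert to grayscale
        img = im2bw(img);     % binarize
        name = strcat(bw, DIRS(i).name);
        imwrite(img, name);
    end
end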

The result is shown in the figure:
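To make the binarization step concrete outside MATLAB, here is a small pure-Python sketch. It approximates what `rgb2gray` followed by `im2bw` at its default level of 0.5 does: compute a weighted luma per pixel, then mark pixels brighter than half of full scale as 1. The function names and the toy 2x3 image are assumptions for the example, and the weights are the rounded BT.601 luma coefficients.

```python
def to_gray(pixel):
    """Luma of an (R, G, B) pixel in 0..255, using rounded BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(rgb_image, level=0.5):
    """Return a 0/1 image: 1 where luma exceeds level * 255, else 0."""
    cutoff = level * 255
    return [[1 if to_gray(p) > cutoff else 0 for p in row] for row in rgb_image]


WHITE, BLACK, FAINT = (255, 255, 255), (0, 0, 0), (200, 200, 210)
image = [
    [WHITE, BLACK, WHITE],
    [FAINT, BLACK, WHITE],  # FAINT stands in for a light noise pixel
]
print(binarize(image))  # [[1, 0, 1], [1, 0, 1]]
```

The faint pixel lands on the background side of the threshold, which is why a plain global threshold is enough to drop light noise of this kind.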

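Segmentation

mydir = './bw/';
letter = './letter/';
% The separator character was lost in the source text; '/' matches the paths above.
if mydir(end) ~= '/'
    mydir = [mydir, '/'];
end
DIRS = dir([mydir, '*.jpg']);  % match by extension
n = length(DIRS);
% The tile size did not survive in the source text. tileSize is a placeholder
% assumption and must be set to match the real CAPTCHA (four tiles side by side).
tileSize = 20;
for i = 1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir, DIRS(i).name));
        img = im2bw(img);  % binarize
        img = 1 - img;     % invert so that characters become connected components, which makes noise removal easier
        for ii = 0:3
            % Split one CAPTCHA into four tileSize-by-tileSize character images.
            % imcrop's rectangle is [xmin ymin width height] and includes both end
            % pixels, so a width of tileSize-1 yields tileSize columns.
            region = [ii*tileSize + 1, 1, tileSize - 1, tileSize - 1];
            subimg = imcrop(img, region);
            imlabel = bwlabel(subimg);
            if max(max(imlabel)) > 1  % more than one component, so there is noise to remove
                stats = regionprops(imlabel, 'Area');
                area = cat(1, stats.Area);
                maxindex = find(area == max(area));
                area(maxindex) = 0;
                secondindex = find(area == max(area));
                imindex = ismember(imlabel, secondindex);
                % Remove the second-largest component: noise cannot be larger than
                % the character, so the second-largest component is noise
                subimg(imindex == 1) = 0;
            end
            name = strcat(letter, DIRS(i).name(1:length(DIRS(i).name)-4), '_', num2str(ii), '.jpg');
            imwrite(subimg, name);
        end
    end
end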

The result is shown in the figure:
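The segmentation and noise-removal idea can also be sketched in pure Python. This is an illustration, not a port: the MATLAB code removes only the second-largest connected component, whereas this sketch keeps only the largest one, which also handles tiles with several noise specks. The names `split_tiles`, `components` and `keep_largest` are made up for the example, and 8-connectivity is used because that is `bwlabel`'s default.

```python
from collections import deque


def split_tiles(image, count=4):
    """Cut a 0/1 image into `count` equal-width tiles, left to right."""
    width = len(image[0]) // count
    return [[row[i * width:(i + 1) * width] for row in image] for i in range(count)]


def components(image):
    """8-connected components of the 1-pixels, each as a list of (row, col)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    found = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, blob = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            found.append(blob)
    return found


def keep_largest(image):
    """Return a copy in which only the largest component remains."""
    blobs = components(image)
    cleaned = [[0] * len(image[0]) for _ in image]
    if blobs:
        for y, x in max(blobs, key=len):
            cleaned[y][x] = 1
    return cleaned


tile = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],  # the lone pixel on the right is noise
    [1, 1, 0, 0, 0],
]
print(keep_largest(tile))
```

Keeping only the largest component would be wrong for glyphs made of several disconnected parts, so it is a design choice that fits this kind of single-stroke CAPTCHA font rather than a general rule.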

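Rotation

Next comes rotation. What standard should we rotate to? From observation, these characters are rotated by no more than 60 degrees, so within plus or minus 60 degrees we simply rotate each character to the angle at which its width is smallest. The code is as follows: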

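% mydir and rotate are not defined in the source listing; these values are inferred
% from the previous step's output folder and the rotate folder named in the text.
mydir = './letter/';
rotate = './rotate/';
if mydir(end) ~= '/'
    mydir = [mydir, '/'];
end
DIRS = dir([mydir, '*.jpg']);  % match by extension
n = length(DIRS);
for i = 1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir, DIRS(i).name));
        img = im2bw(img);
        % The initial value was lost in the source text; Inf accepts the first measured width.
        minwidth = Inf;
        for angle = -60:60
            imgr = imrotate(img, angle, 'bilinear', 'crop');  % 'crop' keeps the image size unchanged
            imlabel = bwlabel(imgr);
            stats = regionprops(imlabel, 'Area');
            area = cat(1, stats.Area);
            maxindex = find(area == max(area));
            imindex = ismember(imlabel, maxindex);  % 1 inside the largest component
            [~, x] = find(imindex);
            width = max(x) - min(x);
            if width < minwidth
                minwidth = width;
                imgrr = imgr;  % narrowest rotation seen so far
            end
        end
        name = strcat(rotate, DIRS(i).name);
        imwrite(imgrr, name);
    end
end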

The result is shown in the figure. In total, 2000 character images are stored in the rotate folder.
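The "rotate until the character is narrowest" rule can be checked on a toy example. The sketch below is an assumption-laden simplification: instead of resampling the image as `imrotate` does, it rotates the coordinates of the foreground pixels and measures their horizontal extent, and it measures all foreground points rather than only the largest component. The function names are made up, and the sign of the angle need not match `imrotate`'s convention.

```python
import math


def foreground_points(image):
    """(x, y) coordinates of the 1-pixels in a 0/1 image."""
    return [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v == 1]


def width_after_rotation(points, angle_deg):
    """Horizontal extent of the points after rotating them by angle_deg about the origin."""
    a = math.radians(angle_deg)
    xs = [x * math.cos(a) - y * math.sin(a) for x, y in points]
    return max(xs) - min(xs)


def best_angle(points, limit=60):
    """Integer angle in [-limit, limit] that gives the narrowest horizontal extent."""
    return min(range(-limit, limit + 1), key=lambda d: width_after_rotation(points, d))


# A straight stroke tilted 30 degrees away from the vertical axis
tilt = math.radians(30)
stroke = [(t * math.sin(tilt), t * math.cos(tilt)) for t in range(-10, 11)]
angle = best_angle(stroke)
print(angle, round(width_after_rotation(stroke, angle), 6))
```

The horizontal extent does not depend on the rotation center, so rotating about the origin is enough for the search. Minimum width is only a heuristic: it recovers the upright pose for tall, narrow glyphs but can pick a tilted pose for wide ones.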

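Template selection

Now pick a set of templates from the rotate folder, covering every character. A character can have several template images, because even after all the preceding processing there is no guarantee that a character ends up in only one form; picking several is what ensures coverage. The chosen template images are stored in the samples folder. This step is time-consuming and laborious, so you may want to ask classmates for help. See the figure: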
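Since coverage is the goal of this step, a quick check of which characters still lack a template can save some manual effort. The sketch below assumes that each template file is named so that its first character is its label, which is what the test code appears to rely on when it reads the predicted character from the template's file name. The file names and the helper names are hypothetical.

```python
from collections import defaultdict


def label_of(filename):
    """Label convention assumed here: the first character of the file name."""
    return filename[0]


def group_templates(filenames):
    """Map each label to the list of template files that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        groups[label_of(name)].append(name)
    return dict(groups)


def missing_labels(filenames, alphabet):
    """Characters of `alphabet` that have no template yet, sorted."""
    have = {label_of(name) for name in filenames}
    return sorted(set(alphabet) - have)


files = ["E_1.jpg", "E_2.jpg", "F_1.jpg", "5_1.jpg"]  # hypothetical file names
print(group_templates(files))
print(missing_labels(files, "EF53"))
```

Counting templates per label in the same way also shows which easily confused characters are thinly covered.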

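Testing

The test code first applies the operations above to each test CAPTCHA, then compares each character with the chosen templates and takes the character of the template with the smallest difference as the prediction. The code is as follows: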

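% The template with the smallest difference is taken as the answer
mydir = './test/';
samples = './samples/';
% The separator character was lost in the source text; '/' matches the paths above.
if mydir(end) ~= '/'
    mydir = [mydir, '/'];
end
if samples(end) ~= '/'
    samples = [samples, '/'];
end
DIRS = dir([mydir, '*.jpg']);     % test CAPTCHAs, matched by extension
DIRS1 = dir([samples, '*.jpg']);  % templates, matched by extension
n = length(DIRS);  % total number of CAPTCHA images
singleerror = 0;   % number of misrecognized characters
uniterror = 0;     % number of CAPTCHAs with at least one error
tileSize = 20;     % placeholder assumption: the tile size was lost in the source text
for i = 1:n
    if ~DIRS(i).isdir
        realcodes = DIRS(i).name(1:4);  % the file name starts with the true characters
        fprintf('Actual CAPTCHA characters: %s\n', realcodes);
        img = imread(strcat(mydir, DIRS(i).name));
        img = rgb2gray(img);
        img = im2bw(img);
        img = 1 - img;  % invert so that characters become connected components
        subimgs = [];
        for ii = 0:3
            % The source comment asks why this splits evenly: imcrop includes both
            % end pixels, so a width of tileSize-1 yields tileSize columns.
            region = [ii*tileSize + 1, 1, tileSize - 1, tileSize - 1];
            subimg = imcrop(img, region);
            imlabel = bwlabel(subimg);
            if max(max(imlabel)) > 1  % more than one component, so there is noise
                stats = regionprops(imlabel, 'Area');
                area = cat(1, stats.Area);
                maxindex = find(area == max(area));
                area(maxindex) = 0;
                secondindex = find(area == max(area));
                imindex = ismember(imlabel, secondindex);
                subimg(imindex == 1) = 0;  % remove the second-largest component
            end
            subimgs = [subimgs; subimg];  % stack the cleaned tiles vertically
        end
        codes = [];
        for ii = 0:3
            % The region values were lost in the source text, and the source listing crops
            % from img, which would discard the noise removal above. Cropping the ii-th
            % cleaned tile from subimgs is an assumed fix.
            region = [1, ii*tileSize + 1, tileSize - 1, tileSize - 1];
            subimg = imcrop(subimgs, region);
            minwidth = Inf;  % initial value lost in the source text
            for angle = -60:60
                imgr = imrotate(subimg, angle, 'bilinear', 'crop');  % 'crop' keeps the image size unchanged
                imlabel = bwlabel(imgr);
                stats = regionprops(imlabel, 'Area');
                area = cat(1, stats.Area);
                maxindex = find(area == max(area));
                imindex = ismember(imlabel, maxindex);  % 1 inside the largest component
                [~, x] = find(imindex);
                width = max(x) - min(x);
                if width < minwidth
                    minwidth = width;
                    imgrr = imgr;
                end
            end
            mindiffv = 1000000;
            for j = 1:length(DIRS1)
                imgsample = imread(strcat(samples, DIRS1(j).name));
                imgsample = im2bw(imgsample);
                diffv = abs(imgsample - imgrr);
                alldiffv = sum(sum(diffv));  % sum of absolute pixel differences
                if alldiffv < mindiffv
                    mindiffv = alldiffv;
                    code = DIRS1(j).name;
                    code = code(1);  % the template's character is the first character of its file name
                end
            end
            codes = [codes, code];
        end
        fprintf('Recognized CAPTCHA characters: %s\n', codes);
        num = codes - realcodes;
        num = length(find(num ~= 0));
        singleerror = singleerror + num;
        if num > 0
            uniterror = uniterror + 1;
        end
        fprintf('Errors: %d\n', num);
    end
end
fprintf('\n-----Summary-----\n\n');
fprintf('Characters tested: %d\n', n*4);
fprintf('Character errors: %d\n', singleerror);
fprintf('Per-character accuracy: %.2f%%\n', (1 - singleerror/(n*4))*100);
fprintf('CAPTCHA images tested: %d\n', n);
fprintf('CAPTCHA images with errors: %d\n', uniterror);
fprintf('Probability of a fully correct CAPTCHA: %.2f%%\n', (1 - uniterror/n)*100);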

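Results:

Actual CAPTCHA characters: 2B4E
Recognized CAPTCHA characters: 2B4F
Errors: 1
Actual CAPTCHA characters: 4572
Recognized CAPTCHA characters: 4572
Errors: 0
Actual CAPTCHA characters: 52CY
Recognized CAPTCHA characters: 52LY
Errors: 1
Actual CAPTCHA characters: 83QG
Recognized CAPTCHA characters: 85QG
Errors: 1
Actual CAPTCHA characters: 9992
Recognized CAPTCHA characters: 9992
Errors: 0
Actual CAPTCHA characters: A7Y7
Recognized CAPTCHA characters: A7Y7
Errors: 0
Actual CAPTCHA characters: D993
Recognized CAPTCHA characters: D995
Errors: 1
Actual CAPTCHA characters: F549
Recognized CAPTCHA characters: F5A9
Errors: 1
Actual CAPTCHA characters: FMC6
Recognized CAPTCHA characters: FMLF
Errors: 2
Actual CAPTCHA characters: R4N4
Recognized CAPTCHA characters: R4N4
Errors: 0

-----Summary-----

Characters tested: 40
Character errors: 7
Per-character accuracy: 82.50%
CAPTCHA images tested: 10
CAPTCHA images with errors: 6
Probability of a fully correct CAPTCHA: 40.00%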

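The per-character accuracy is fairly high, but the whole-CAPTCHA accuracy is still poor. Looking at the results, the misrecognized characters are the easily confused ones, such as E and F, C and L, 5 and 3, and 4 and A. What we can do is add more samples to the template set, in the hope of reducing the confusion as much as possible.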
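The gap between the two accuracies follows from the arithmetic: a CAPTCHA counts as correct only when all four characters are right. The sketch below recomputes both figures from the ten (actual, recognized) pairs of the first run and compares the whole-CAPTCHA accuracy with the rough estimate p**4 that would hold if the four characters failed independently (an assumption, not something the article claims). The function name `evaluate` is made up for the example.

```python
def evaluate(pairs):
    """pairs: list of (actual, predicted) strings of equal length."""
    char_total = sum(len(actual) for actual, _ in pairs)
    char_errors = sum(a != p for actual, pred in pairs for a, p in zip(actual, pred))
    wrong_images = sum(actual != pred for actual, pred in pairs)
    return {
        "char_errors": char_errors,
        "wrong_images": wrong_images,
        "char_accuracy": 1 - char_errors / char_total,
        "captcha_accuracy": 1 - wrong_images / len(pairs),
    }


FIRST_RUN = [
    ("2B4E", "2B4F"), ("4572", "4572"), ("52CY", "52LY"), ("83QG", "85QG"), ("9992", "9992"),
    ("A7Y7", "A7Y7"), ("D993", "D995"), ("F549", "F5A9"), ("FMC6", "FMLF"), ("R4N4", "R4N4"),
]
stats = evaluate(FIRST_RUN)
print(stats)
# Independence estimate of the whole-CAPTCHA accuracy
print(round(stats["char_accuracy"] ** 4, 3))
```

With a per-character accuracy of 82.5%, the independence estimate is about 46%, close to the observed 40%, so even a modest per-character gain has a large effect on the whole-CAPTCHA rate.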

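After adding a few dozen more samples, I ran the test again. Results:

Actual CAPTCHA characters: 2B4E
Recognized CAPTCHA characters: 2B4F
Errors: 1
Actual CAPTCHA characters: 4572
Recognized CAPTCHA characters: 4572
Errors: 0
Actual CAPTCHA characters: 52CY
Recognized CAPTCHA characters: 52LY
Errors: 1
Actual CAPTCHA characters: 83QG
Recognized CAPTCHA characters: 83QG
Errors: 0
Actual CAPTCHA characters: 9992
Recognized CAPTCHA characters: 9992
Errors: 0
Actual CAPTCHA characters: A7Y7
Recognized CAPTCHA characters: A7Y7
Errors: 0
Actual CAPTCHA characters: D993
Recognized CAPTCHA characters: D993
Errors: 0
Actual CAPTCHA characters: F549
Recognized CAPTCHA characters: F5A9
Errors: 1
Actual CAPTCHA characters: FMC6
Recognized CAPTCHA characters: FMLF
Errors: 2
Actual CAPTCHA characters: R4N4
Recognized CAPTCHA characters: R4N4
Errors: 0

-----Summary-----

Characters tested: 40
Character errors: 5
Per-character accuracy: 87.50%
CAPTCHA images tested: 10
CAPTCHA images with errors: 4
Probability of a fully correct CAPTCHA: 60.00%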

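Both the per-character accuracy and the whole-CAPTCHA accuracy improved. It seems reasonable to expect that accuracy will keep rising as the number of templates grows.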

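Summary

This method scales poorly and only suits simple CAPTCHAs; something like 12306's is out of the question.

In short, there is still a long way to go, and I will keep improving this method bit by bit.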
