Python Web Scraping in Practice: Analyzing Douban Reviews of Wolf Warrior 2 (《戰狼2》)

Knowledge · 04-19

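I only recently started learning Python, so I built a small project for practice. A few days ago I watched Wolf Warrior 2 and noticed that it ranked first among the films currently showing on Douban (see the accompanying screenshot), so I decided to analyze its Douban reviews.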

Goal Overview

The project has three steps:

Fetch the web page data

Clean the data

Display the result as a word cloud

The Python version used is 3.5.

1. Fetching the Web Page Data

The first step is to request the page, using Python's urllib library. The code is as follows:

    from urllib import request

    resp = request.urlopen("https://movie.douban.com/nowplaying/hangzhou/")
    html_data = resp.read().decode("utf-8")

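Here https://movie.douban.com/nowp... is Douban's page of films currently showing; you can open it in a browser to take a look. html_data is a string variable holding the page's HTML source. Run print(html_data) to inspect it; the accompanying screenshot shows the output.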
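Some servers turn away requests that carry urllib's default User-Agent, and a call without a timeout can hang for a long time. The sketch below is one way to guard against both. The names build_request and fetch_html, the header value, and the 10-second timeout are illustrative assumptions rather than part of the original code, and whether Douban actually requires such a header is not confirmed here.

```python
from urllib import request

# Assumed header value for illustration; any browser-like string would do.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}


def build_request(url, headers=None):
    """Wrap a URL in a Request object that carries explicit headers."""
    return request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8 (performs a real network call)."""
    req = build_request(url)
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_html("https://movie.douban.com/nowplaying/hangzhou/") would then take the place of the two-line urlopen and read sequence shown earlier.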

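The second step is to parse the HTML we fetched and extract the data we need.

In Python, the BeautifulSoup library handles HTML parsing. (Note: if the library is not installed, run pip install beautifulsoup4 to install it; this is the package that provides the bs4 module.) The basic call looks like this:

    BeautifulSoup(html, "html.parser")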

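The first argument is the HTML to extract data from, and the second names the parser to use. After that, find_all() reads the contents of the matching HTML tags.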

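The HTML contains a great many tags, so which ones should we read? The simplest approach is to open the HTML source of the page being scraped, find which tag holds the data we need, and read that tag (see the accompanying screenshot).

The screenshot shows that the data we want starts at the div id="nowplaying" tag, which contains each film's title, rating, lead actors, and other information. The corresponding code is: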

    from bs4 import BeautifulSoup as bs

    soup = bs(html_data, "html.parser")
    nowplaying_movie = soup.find_all("div", id="nowplaying")
    nowplaying_movie_list = nowplaying_movie[0].find_all("li", class_="list-item")

Here nowplaying_movie_list is a list. Run print(nowplaying_movie_list[0]) to see what an entry contains; the accompanying screenshot shows the output.

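The screenshot shows that the data-subject attribute holds the film's ID and that the alt attribute of the img tag holds the film's title, so we use these two attributes to get each film's ID and name. (Note: the ID is needed later to open the film's short-review page, which is why we extract it.) The code is: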

    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict["id"] = item["data-subject"]
        for tag_img_item in item.find_all("img"):
            nowplaying_dict["name"] = tag_img_item["alt"]
        nowplaying_list.append(nowplaying_dict)

The list nowplaying_list now holds the ID and name of each film currently showing. Run print(nowplaying_list) to check it; the accompanying screenshot shows the output.
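To experiment with the same extraction without network access or third-party packages, here is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser. The sample markup, the second film ID, and the NowPlayingParser class are made up for illustration. Only the tag and attribute names (li with class list-item, data-subject, and the alt attribute of img) come from the walkthrough above.

```python
from html.parser import HTMLParser

# Made-up markup that mimics the structure described in the article.
SAMPLE_HTML = """
<div id="nowplaying">
  <ul class="lists">
    <li class="list-item" data-subject="26363254">
      <ul>
        <li class="poster"><img src="p1.jpg" alt="戰狼2"></li>
        <li class="stitle">戰狼2</li>
      </ul>
    </li>
    <li class="list-item" data-subject="10000001">
      <ul>
        <li class="poster"><img src="p2.jpg" alt="Sample Movie"></li>
      </ul>
    </li>
  </ul>
</div>
"""


class NowPlayingParser(HTMLParser):
    """Collect {"id", "name"} dicts from li.list-item elements."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None  # dict for the list-item being read, if any
        self._li_depth = 0    # nesting depth of <li> inside the current item

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            if self._current is None:
                classes = (attrs.get("class") or "").split()
                if "list-item" in classes:
                    self._current = {"id": attrs.get("data-subject")}
                    self._li_depth = 1
            else:
                self._li_depth += 1  # a nested <li> inside the item
        elif tag == "img" and self._current is not None:
            self._current["name"] = attrs.get("alt")

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self._li_depth -= 1
            if self._li_depth == 0:  # the outer list-item has closed
                self.movies.append(self._current)
                self._current = None


parser = NowPlayingParser()
parser.feed(SAMPLE_HTML)
print(parser.movies)
```

The depth counter is there because each list-item may contain nested li elements, so the item is finished only when its own closing tag is reached.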

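The result matches what the Douban page shows, so we now have the information for the current films. Next we analyze the short reviews. For example, the short-review URL for Wolf Warrior 2 is:

https://movie.douban.com/subject/26363254/comments?start=0&limit=20

Here 26363254 is the film's ID, and start=0 means the listing starts from comment 0, that is, the first comment.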
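The URL pattern above can be captured in a small helper. This is a sketch under the assumption that each page holds 20 comments, as the limit=20 parameter suggests; the function name build_comments_url and its parameters are hypothetical.

```python
from urllib.parse import urlencode


def build_comments_url(movie_id, page, page_size=20):
    """Return the short-review URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    # Page 1 starts at comment 0, page 2 at comment 20, and so on.
    query = urlencode({"start": (page - 1) * page_size, "limit": page_size})
    return "https://movie.douban.com/subject/{}/comments?{}".format(movie_id, query)


print(build_comments_url("26363254", 1))
print(build_comments_url("26363254", 3))
```

Building the query string with urlencode avoids hand-assembling "?" and "&" separators.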

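Next we parse that page. Opening the HTML source of the short-review page, we find that the review data sits inside div tags whose class is comment (see the accompanying screenshot).

So we parse those tags with the following code: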

    requrl = ("https://movie.douban.com/subject/" + nowplaying_list[0]["id"]
              + "/comments" + "?" + "start=0" + "&limit=20")
    resp = request.urlopen(requrl)
    html_data = resp.read().decode("utf-8")
    soup = bs(html_data, "html.parser")
    comment_div_lits = soup.find_all("div", class_="comment")

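At this point the comment_div_lits list holds the HTML of every div tag with the comment class. The screenshot also shows that each viewer's review text sits inside a p tag.

So we parse the HTML in comment_div_lits one step further: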

    eachCommentList = []
    for item in comment_div_lits:
        comment_text = item.find_all("p")[0].string
        if comment_text is not None:
            eachCommentList.append(comment_text)

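Run print(eachCommentList) to view the contents of eachCommentList; it holds the reviews we want, as the accompanying screenshot shows.

We have now scraped review data for a film currently showing on Douban. Next come data cleaning and the word cloud.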
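The steps so far read a single page of reviews. A rough sketch of gathering several pages into one flat list follows. The names collect_comments and fetch_page are hypothetical, and fake_fetch stands in for a real network call so that the snippet runs offline.

```python
import time


def collect_comments(fetch_page, pages, delay=0.0):
    """Call fetch_page(1), ..., fetch_page(pages) and merge the results."""
    comments = []
    for page in range(1, pages + 1):
        # extend (not append) keeps the result a flat list of strings
        comments.extend(fetch_page(page))
        if delay and page < pages:
            time.sleep(delay)  # optional pause between requests
    return comments


# Stand-in fetcher for illustration: two fake reviews per page.
def fake_fetch(page):
    return ["review {}-a".format(page), "review {}-b".format(page)]


all_comments = collect_comments(fake_fetch, pages=3)
print(len(all_comments))
```

Passing the fetcher in as an argument keeps the paging logic testable without touching the network.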

2. Cleaning the Data

To make cleaning easier, we first join the items in the list into a single string:

    comments = ""
    for comment in eachCommentList:
        comments = comments + str(comment).strip()

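Run print(comments) to check the result (see the accompanying screenshot).

All the reviews are now one string, but it still contains plenty of punctuation and similar characters. They are useless for word-frequency counting, so we remove them with a regular expression. In Python, regular expressions are provided by the re module: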

    import re

    # Match runs of Chinese characters (U+4E00 to U+9FA5); everything else is dropped
    pattern = re.compile(r"[\u4e00-\u9fa5]+")
    filterdata = re.findall(pattern, comments)
    cleaned_comments = "".join(filterdata)

Run print(cleaned_comments) to check again (see the accompanying screenshot).

The punctuation is gone and the data is much "cleaner".
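To see what the character-range filter keeps and discards, here is a small self-contained sketch run on a made-up review string. The helper name keep_chinese is hypothetical. Note that the range U+4E00 to U+9FA5 drops digits and Latin letters along with punctuation.

```python
import re

# Runs of characters in the range U+4E00 to U+9FA5
CJK_RUN = re.compile(r"[\u4e00-\u9fa5]+")


def keep_chinese(text):
    """Return only the Chinese characters of text, joined together."""
    return "".join(CJK_RUN.findall(text))


sample = "太燃了！！吳京666，good"
print(keep_chinese(sample))  # prints 太燃了吳京
```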

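Because we want to count word frequencies, we first need to segment the Chinese text into words. I use the jieba segmenter here. If it is not installed, run pip install jieba in a console. (Note: pip list shows which libraries are installed.) The code is as follows: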

    import jieba  # Chinese word segmentation package
    import pandas as pd

    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({"segment": segment})

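pandas is loaded here to hold the segmentation result in a DataFrame (jieba itself does not depend on pandas). Run words_df.head() to see the segmented words; the accompanying screenshot shows the output.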

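The screenshot shows that the data contains function words (stop words) such as "看", "太", and "的". These words are frequent in almost any text and carry little meaning of their own, so we remove them.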

I keep the stop words in a stopwords.txt file and filter our data against it. (Note: searching Baidu for stopwords.txt turns up downloadable copies of such a file.) The stop-word removal code is:

    # quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t",
                            names=["stopword"], encoding="utf-8")
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

Run words_df.head() again to check the result; as the accompanying screenshot shows, the stop words have been removed.
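The same filtering idea can be sketched with plain Python sets, which makes the logic easy to follow without pandas. This is an illustrative alternative, not the article's method. The helper names are hypothetical, the sample tokens are made up, and io.StringIO stands in for an actual stopwords.txt file with one word per line.

```python
import io


def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stop words, preserving order."""
    return [token for token in tokens if token not in stopwords]


# Stand-in for open("stopwords.txt", encoding="utf-8")
sample_file = io.StringIO("的\n太\n看\n")
stopwords = load_stopwords(sample_file)

tokens = ["電影", "的", "太", "好看", "看"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)  # prints ['電影', '好看']
```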

Next comes the word-frequency count:

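    # Count how often each word occurs. groupby(...).size() works across pandas
    # versions, whereas passing a renaming dict such as {"計數": numpy.size} to
    # agg() on a single column raises an error in pandas 1.0 and later.
    words_stat = words_df.groupby("segment").size().reset_index(name="計數")
    words_stat = words_stat.sort_values(by=["計數"], ascending=False)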

Run words_stat.head() to view the result (see the accompanying screenshot).

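Because the steps above scraped only the first page of reviews, the data is rather thin. The complete code at the end scrapes 10 pages of reviews, so its results are more meaningful.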
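As a cross-check on the pandas result, the same word count can be sketched with collections.Counter from the standard library. The function name count_words and the sample tokens are made up for illustration.

```python
from collections import Counter


def count_words(tokens, top_n=None):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common(top_n)


sample_tokens = ["吳京", "電影", "吳京", "愛國", "吳京", "電影"]
print(count_words(sample_tokens))
# prints [('吳京', 3), ('電影', 2), ('愛國', 1)]
```

The resulting pairs can be turned into a dict with dict(count_words(tokens)) if a word-to-count mapping is needed.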

3. Displaying the Word Cloud

The code is as follows:

    import matplotlib
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud  # word cloud package

    # IPython/Jupyter magic that renders plots inline; remove this line when
    # running as a plain script.
    %matplotlib inline
    matplotlib.rcParams["figure.figsize"] = (10.0, 5.0)

    # Set the font, the background color, and the maximum font size
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)

    # Map each of the 1000 most frequent words to its count
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

    # fit_words takes a dict in wordcloud 1.3 and later; older releases
    # expected a list of (word, count) tuples instead.
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)

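Here simhei.ttf specifies the font. You can find it by searching Baidu for simhei.ttf; after downloading it, place it in the program's root directory. The accompanying figure shows the resulting image.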

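That wraps up the walkthrough of the whole project. I am still a beginner who has not been using Python for long, so the code is far from polished. This is also my first technical blog post, so the writing is somewhat wordy; please bear with me, and please point out anything that is wrong. I plan to write up future small projects on the blog in the same way to share with everyone. Finally, here is the complete code.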

Complete Code

    # coding: utf-8
    __author__ = "hang"

    import re
    import warnings
    from urllib import request

    import jieba  # Chinese word segmentation package
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd
    from bs4 import BeautifulSoup as bs
    from wordcloud import WordCloud  # word cloud package

    warnings.filterwarnings("ignore")

    # IPython/Jupyter magic that renders plots inline; remove this line when
    # running as a plain script.
    %matplotlib inline
    matplotlib.rcParams["figure.figsize"] = (10.0, 5.0)


    # Parse the now-playing page and return a list of {"id", "name"} dicts
    def getNowPlayingMovie_list():
        resp = request.urlopen("https://movie.douban.com/nowplaying/hangzhou/")
        html_data = resp.read().decode("utf-8")
        soup = bs(html_data, "html.parser")
        nowplaying_movie = soup.find_all("div", id="nowplaying")
        nowplaying_movie_list = nowplaying_movie[0].find_all("li", class_="list-item")
        nowplaying_list = []
        for item in nowplaying_movie_list:
            nowplaying_dict = {}
            nowplaying_dict["id"] = item["data-subject"]
            for tag_img_item in item.find_all("img"):
                nowplaying_dict["name"] = tag_img_item["alt"]
            nowplaying_list.append(nowplaying_dict)
        return nowplaying_list


    # Scrape one page (20 entries) of short reviews for a film
    def getCommentsById(movieId, pageNum):
        eachCommentList = []
        if pageNum <= 0:
            return eachCommentList  # invalid page number: nothing to fetch
        start = (pageNum - 1) * 20
        requrl = ("https://movie.douban.com/subject/" + movieId
                  + "/comments" + "?" + "start=" + str(start) + "&limit=20")
        print(requrl)
        resp = request.urlopen(requrl)
        html_data = resp.read().decode("utf-8")
        soup = bs(html_data, "html.parser")
        comment_div_lits = soup.find_all("div", class_="comment")
        for item in comment_div_lits:
            comment_text = item.find_all("p")[0].string
            if comment_text is not None:
                eachCommentList.append(comment_text)
        return eachCommentList


    def main():
        # Fetch the first 10 pages of reviews for the first film in the list
        commentList = []
        NowPlayingMovie_list = getNowPlayingMovie_list()
        for i in range(10):
            num = i + 1
            commentList_temp = getCommentsById(NowPlayingMovie_list[0]["id"], num)
            # extend keeps commentList a flat list of review strings
            commentList.extend(commentList_temp)

        # Join the list into a single string
        comments = ""
        for comment in commentList:
            comments = comments + str(comment).strip()

        # Keep only Chinese characters, which removes punctuation
        pattern = re.compile(r"[\u4e00-\u9fa5]+")
        filterdata = re.findall(pattern, comments)
        cleaned_comments = "".join(filterdata)

        # Segment the Chinese text with jieba
        segment = jieba.lcut(cleaned_comments)
        words_df = pd.DataFrame({"segment": segment})

        # Remove stop words (quoting=3 is csv.QUOTE_NONE)
        stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t",
                                names=["stopword"], encoding="utf-8")
        words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

        # Count word frequencies; size() works across pandas versions, whereas a
        # renaming dict passed to agg() raises an error in pandas 1.0 and later
        words_stat = words_df.groupby("segment").size().reset_index(name="計數")
        words_stat = words_stat.sort_values(by=["計數"], ascending=False)

        # Display the word cloud; fit_words takes a dict in wordcloud 1.3 and later
        wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
        word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
        wordcloud = wordcloud.fit_words(word_frequence)
        plt.imshow(wordcloud)


    # Entry point
    main()