Scrapy Tutorial: Scraping the Article Lists of the Top 3000 Bloggers on cnblogs (博客園)

Knowledge 06-13

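I. The top-3000 ranking page

1) Go to the cnblogs home page and find the points ranking list. This gives us the blog addresses of the top 3000 bloggers. The word cloud analysis later showed that many of them have since moved to personal blogs.

2) Analyze the page structure: each td is one blogger.

The first element, a small tag, is the rank.

The second element, an a tag, holds the nickname and the address of the blog's home page. The user name is cut out of that address.

The fourth element, again a small tag (small[2] in the XPath), holds the number of posts and the score. Splitting the string yields each value in turn.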
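The string handling described in step 2 can be tried without a crawler. What follows is a minimal sketch, not the author's code: the helper name `parse_cell` and the sample strings are assumptions. In particular, the three-field shape of the bracketed text is inferred only from the spider taking fields 0 and 2 after splitting on commas.

```python
def parse_cell(rank_text, href, stats_text):
    """Split the raw strings of one ranking cell into clean fields.

    All three inputs are hypothetical samples of what the XPath
    queries might return; the real page text may differ.
    """
    # "1." -> "1": the rank is the piece before the trailing dot.
    top = rank_text.split(".")[-2].strip()
    # ".../someuser/" -> "someuser": second-to-last piece of the URL.
    user_name = href.split("/")[-2].strip()
    # "(120, 2017-06-12, 34567)" -> ["120", " 2017-06-12", " 34567"]
    parts = stats_text.lstrip("(").rstrip(")").split(",")
    total = parts[0].strip()
    score = parts[2].strip()
    return top, user_name, total, score


print(parse_cell("1.", "http://www.cnblogs.com/someuser/",
                 "(120, 2017-06-12, 34567)"))
# -> ('1', 'someuser', '120', '34567')
```

All values come back as strings, which is why the spider converts the rank and score with int() before storing them.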

3) Code: use XPath to get the tags and their content. Once the address of the blog's home page is known, send a request to it.

def parse(self, response):
    # Each td of the ranking table describes one blogger.
    for i in response.xpath('//table[@width="90%"]//td'):
        # The rank text ends with a dot; keep the part before it.
        top = i.xpath(
            "./small[1]/text()").extract()[0].split(".")[-2].strip()
        nickName = i.xpath("./a[1]//text()").extract()[0].strip()
        # The user name is the last path segment of the blog address.
        userName = i.xpath(
            "./a[1]/@href").extract()[0].split("/")[-2].strip()
        # Bracketed, comma-separated values: post count first, score third.
        totalAndScore = i.xpath(
            "./small[2]//text()").extract()[0].lstrip("(").rstrip(")").split(",")
        total = totalAndScore[0].strip()
        score = totalAndScore[2].strip()
        # Request the blog's home page and pass the blogger's data along.
        yield scrapy.Request(
            i.xpath("./a[1]/@href").extract()[0],
            meta={"page": 1, "top": top, "nickName": nickName,
                  "userName": userName, "score": score},
            callback=self.parse_page)

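II. Each blogger's post list pages

1) Page structure: analysis shows that the a tag of every post has an id containing "TitleUrl", which is enough to get the address of each post. For paging, append default.aspx?page=2 to the blog address and keep incrementing page.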
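The paging rule can be sketched as a small standalone function, following the default.aspx?page=N form used in the spider code. The function name `next_page_url` and the sample addresses are illustrative assumptions, not part of the original project.

```python
def next_page_url(current_url, page):
    """Build the address of a given list page from the current page address."""
    # Drop any existing "default.aspx?page=N" suffix to recover the blog root.
    base = current_url.split("default.aspx?")[0]
    return base + "default.aspx?page=" + str(page)


# From the blog home page to page 2, then from page 2 to page 3.
print(next_page_url("http://www.cnblogs.com/someuser/", 2))
print(next_page_url("http://www.cnblogs.com/someuser/default.aspx?page=2", 3))
```

This assumes the blog address ends with a slash, as the home page links in the ranking list appear to.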

2) Code: the "[置顶]" (pinned) marker is stripped from the titles.

def parse_page(self, response):
    # Strip any "default.aspx?page=N" suffix to recover the blog's base address.
    urlArr = response.url.split("default.aspx?")
    if len(urlArr) > 1:
        baseUrl = urlArr[-2]
    else:
        baseUrl = response.url
    # Every post title link has an id containing "TitleUrl".
    links = response.xpath('//a[contains(@id,"TitleUrl")]')
    for i in links:
        item = CnblogsItem()
        item["top"] = int(response.meta["top"])
        item["nickName"] = response.meta["nickName"]
        item["userName"] = response.meta["userName"]
        item["score"] = int(response.meta["score"])
        item["pageLink"] = response.url
        # cnblogs pages are in Simplified Chinese, so match the simplified marker.
        item["title"] = i.xpath(
            "./text()").extract()[0].replace(u"[置顶]", "").strip()
        item["articleLink"] = i.xpath("./@href").extract()[0]
        yield item
    # A page without posts means we are past the last page; otherwise go on.
    if len(links) > 0:
        page = response.meta["page"] + 1
        yield scrapy.Request(
            baseUrl + "default.aspx?page=" + str(page),
            meta={"page": page, "top": response.meta["top"],
                  "nickName": response.meta["nickName"],
                  "userName": response.meta["userName"],
                  "score": response.meta["score"]},
            callback=self.parse_page)

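3) The content of each post is not scraped here. It is easy to add: analyze the post page, send one more request per post, and pick out the div whose id is cnblogs_post_body.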

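III. Storing the data in MongoDB

There is nothing difficult in this part. Remember to install pymongo with pip install pymongo. In total there are more than 800,000 articles.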

from cnblogs.items import CnblogsItem
import pymongo


class CnblogsPipeline(object):

    def __init__(self):
        client = pymongo.MongoClient(host="127.0.0.1", port=27017)
        dbName = client["cnblogs"]
        # MongoDB creates the collection on the first insert,
        # so no explicit creation step is needed.
        self.table = dbName["articles"]

    def process_item(self, item, spider):
        if isinstance(item, CnblogsItem):
            # insert_one replaces Collection.insert, which pymongo 4 removed.
            self.table.insert_one(dict(item))
        return item

IV. Proxy and the Model class

Using a proxy in Scrapy is simple: write a custom downloader middleware and set the proxy IP and port in it.

def process_request(self, request, spider):
    request.meta["proxy"] = "http://117.143.109.173:80"
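Neither the MongoDB pipeline nor a proxy middleware takes effect until it is registered in the project's settings.py. The fragment below is a sketch under assumptions: the module paths follow Scrapy's default project layout for a project named cnblogs, and `ProxyMiddleware` is a hypothetical name for the class that holds the process_request method above. The numbers are ordering priorities and can be adjusted.

```python
# settings.py (fragment)

# Assumed path: the default pipelines.py of a project named "cnblogs".
ITEM_PIPELINES = {
    "cnblogs.pipelines.CnblogsPipeline": 300,
}

# "ProxyMiddleware" is a hypothetical class name; use the name of the class
# that actually defines process_request in your middlewares module.
DOWNLOADER_MIDDLEWARES = {
    "cnblogs.middlewares.ProxyMiddleware": 543,
}
```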

The Model class holds the corresponding fields.

import scrapy


class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Rank
    top = scrapy.Field()
    nickName = scrapy.Field()
    userName = scrapy.Field()
    # Score
    score = scrapy.Field()
    # Address of the list page the article was found on
    pageLink = scrapy.Field()
    # Article title
    title = scrapy.Field()
    # Article link
    articleLink = scrapy.Field()

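V. Word cloud analysis with wordcloud

Run a word cloud analysis over each blogger's article titles and save the result as an image. For how to use wordcloud, see the articles on cnblogs.

Two threads are used here: one produces the segmented text, the other generates the word cloud images. Generating a word cloud takes roughly one second per image.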
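The two-thread arrangement is a producer/consumer pair sharing a queue. The sketch below shows only that pattern, with the database, jieba and wordcloud replaced by stand-ins, so every function name and the sample data are assumptions. It uses Python 3's queue module (named Queue in Python 2) and a sentinel value so the consumer knows when to stop.

```python
import threading
from queue import Queue

DONE = None  # sentinel telling the consumer that no more work is coming


def produce(queue, titles_by_author):
    """Stand-in for the segmentation thread: one text blob per author."""
    for author, titles in titles_by_author.items():
        queue.put((author, " ".join(titles)))
    queue.put(DONE)


def consume(queue, results):
    """Stand-in for the image thread: here it only counts words."""
    while True:
        job = queue.get()  # blocks until the producer has queued something
        if job is DONE:
            break
        author, text = job
        results[author] = len(text.split())


def run(titles_by_author):
    queue, results = Queue(), {}
    workers = [
        threading.Thread(target=produce, args=(queue, titles_by_author)),
        threading.Thread(target=consume, args=(queue, results)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


print(run({"alice": ["scrapy intro", "mongo tips"], "bob": ["word cloud"]}))
# -> {'alice': 4, 'bob': 2}
```

A sentinel avoids having the consumer count on a fixed number of jobs, which would leave it waiting forever if the producer skips an author.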

# coding=utf-8
import os
import re
import threading
from queue import Queue

import jieba
import pymongo
from wordcloud import WordCloud


class MyThread(threading.Thread):
    """Thread that simply runs func(*args)."""

    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        self.func(*self.args)


def getTitle(queue, table):
    """Producer thread: collect and segment each blogger's titles."""
    for j in range(1, 3001):
        docs = list(table.find({"top": j}, {"title": 1, "top": 1, "nickName": 1}))
        if not docs:
            continue
        txt = ""
        for i in docs:
            txt += str(i["title"]) + "\n"
            name = i["nickName"]
            top = i["top"]
        txt = " ".join(jieba.cut(txt))
        queue.put((txt, name, top))
    # Ranks without articles are skipped above, so signal the end explicitly.
    queue.put(None)


def getImg(queue, word):
    """Consumer thread: render one word cloud image per queued blogger."""
    while True:
        get = queue.get()
        if get is None:
            break
        word.generate(get[0])
        # Remove characters that are not allowed in Windows file names.
        name = re.sub(r'[<>/\\|:"*?]', "", get[1])
        word.to_file("wordcloudimgs/" + str(get[2]) + "-" + name + ".jpg")
        print(get[1] + " generated")


def main():
    client = pymongo.MongoClient(host="127.0.0.1", port=27017)
    dbName = client["cnblogs"]
    table = dbName["articles"]
    wc = WordCloud(
        font_path="msyh.ttc", background_color="#ccc", width=600, height=600)
    if not os.path.exists("wordcloudimgs"):
        os.mkdir("wordcloudimgs")
    queue = Queue()
    threads = [
        MyThread(getImg, (queue, wc)),
        MyThread(getTitle, (queue, table)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    main()

VI. Full source code

https://github.com/hao15239129517/cnblogs
