Scrapy in Depth: Middleware
Author: zarten, a front-line internet engineer
Profile: zhihu.com/people/zarten
Published by Python開發; contact the author for reprint authorization.
Overview
Downloader Middleware
As shown at positions 4 and 5 in the diagram above, downloader middleware is a hook framework for processing Scrapy's requests and responses. It can globally modify parameters such as the proxy IP, headers, and so on.
To use a downloader middleware you must activate it, by setting the DOWNLOADER_MIDDLEWARES dict in settings.py in a format like the following:
```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.Custom_A_DownloaderMiddleware': 543,
    'myproject.middlewares.Custom_B_DownloaderMiddleware': 643,
    'myproject.middlewares.Custom_C_DownloaderMiddleware': None,
}
```
The smaller the number, the closer the middleware sits to the engine; the larger the number, the closer to the downloader. So middlewares with smaller numbers get their process_request() called earlier, and those with larger numbers get their process_response() called earlier. To disable a middleware, simply set its value to None.
Writing a custom downloader middleware
Sometimes we need to write downloader middleware of our own, for example to use a proxy or change the user-agent. To handle requests, implement process_request(request, spider); to handle responses, implement process_response(request, response, spider); for exception handling, implement process_exception(request, exception, spider).
process_request(request, spider)
This method is called every time Scrapy issues a request. It can return one of:
1. None
2. a Response object
3. a Request object
4. raise IgnoreRequest
Returning None is the most common case; the crawl simply continues. For the other return values, see the Scrapy documentation.
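The return contract above can be sketched with plain stand-in classes (StubRequest/StubResponse are assumptions standing in for scrapy's real Request/Response, so the sketch runs without a Scrapy project):

```python
# Hedged sketch of process_request's return contract: None continues the
# chain, a Response short-circuits the downloader.

class StubRequest:
    def __init__(self, url):
        self.url = url
        self.meta = {}

class StubResponse:
    def __init__(self, url, body):
        self.url = url
        self.body = body

class CacheShortCircuitMiddleware:
    """Return None to let the request continue down the chain, or a
    Response to skip the downloader entirely (e.g. a cache hit)."""

    def __init__(self, cache):
        self.cache = cache  # maps url -> cached body

    def process_request(self, request, spider):
        body = self.cache.get(request.url)
        if body is not None:
            return StubResponse(request.url, body)  # short-circuit: serve from cache
        return None  # continue: next middleware, then the downloader

mw = CacheShortCircuitMiddleware({'http://example.com/': '<html>cached</html>'})
hit = mw.process_request(StubRequest('http://example.com/'), spider=None)
miss = mw.process_request(StubRequest('http://example.com/new'), spider=None)
print(hit.body)  # <html>cached</html>
print(miss)      # None
```

In a real middleware the returned object would be a scrapy.http.Response; the control flow is the same.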
For example, the two middlewares below change the user-agent and the proxy IP respectively.
User-agent middleware
```python
from faker import Faker

class UserAgent_Middleware():

    def process_request(self, request, spider):
        f = Faker()
        agent = f.firefox()  # random Firefox user-agent string
        request.headers['User-Agent'] = agent
```
Proxy IP middleware
```python
import requests

class Proxy_Middleware():

    def process_request(self, request, spider):
        try:
            xdaili_url = spider.settings.get('XDAILI_URL')
            r = requests.get(xdaili_url)
            proxy_ip_port = r.text
            request.meta['proxy'] = 'https://' + proxy_ip_port
        except requests.exceptions.RequestException:
            print('Failed to fetch a proxy IP from xdaili!')
            spider.logger.error('Failed to fetch a proxy IP from xdaili!')
```
A third example uses headless Chrome via Selenium to render pages (the CHROME_PATH and CHROME_DRIVER_PATH constants are assumed to come from your own configuration module):
```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from scrapy.http import HtmlResponse
# assumption: CHROME_PATH and CHROME_DRIVER_PATH live in your own config module
from .settings import CHROME_PATH, CHROME_DRIVER_PATH

class ChromeDownloaderMiddleware(object):

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')  # run without a visible window
        if CHROME_PATH:
            options.binary_location = CHROME_PATH
        if CHROME_DRIVER_PATH:
            # initialize the Chrome driver with an explicit chromedriver path
            self.driver = webdriver.Chrome(chrome_options=options,
                                           executable_path=CHROME_DRIVER_PATH)
        else:
            self.driver = webdriver.Chrome(chrome_options=options)  # initialize the Chrome driver

    def __del__(self):
        self.driver.close()

    def process_request(self, request, spider):
        try:
            print('Chrome driver begin...')
            self.driver.get(request.url)  # fetch the page content
            return HtmlResponse(url=request.url, body=self.driver.page_source,
                                request=request, encoding='utf-8',
                                status=200)  # return the rendered HTML
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request,
                                encoding='utf-8', status=500)
        finally:
            print('Chrome driver end...')
```
process_response(request, response, spider)
This method is called when a response for a request comes back. It can return one of:
1. A Response object: it is handed to the next middleware's process_response().
2. A Request object: the middleware chain stops and the returned Request is rescheduled for download.
3. Raising IgnoreRequest: the Request.errback callback is invoked to handle it; if nothing handles it, the request is ignored.
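A minimal sketch of the first two options, again with plain stand-in classes (assumed to expose .status and .meta) rather than scrapy's real objects:

```python
# Hedged sketch of process_response: return the Request to reschedule it,
# or the Response to pass it on down the chain.

class StubRequest:
    def __init__(self, url):
        self.url = url
        self.meta = {}

class StubResponse:
    def __init__(self, url, status):
        self.url = url
        self.status = status

class RetryOn503Middleware:
    """On a 503, return the Request so it is rescheduled for download;
    otherwise return the Response so the next middleware sees it."""

    def process_response(self, request, response, spider):
        if response.status == 503:
            request.meta['retried'] = True  # mark it so retries could be capped
            return request   # chain stops; request is rescheduled
        return response      # handed to the next middleware's process_response()

mw = RetryOn503Middleware()
req = StubRequest('http://example.com/')
ok = mw.process_response(req, StubResponse(req.url, 200), spider=None)
retry = mw.process_response(req, StubResponse(req.url, 503), spider=None)
print(ok.status)     # 200
print(retry is req)  # True
```

Note that Scrapy's built-in RetryMiddleware already implements retry-on-status; this sketch only illustrates the return contract.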
process_exception(request, exception, spider)
This method is called when the download handler or process_request() raises an exception (including an IgnoreRequest exception). It usually returns None, which lets the exception continue to be processed.
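The alternative to returning None is returning a Response, which stops exception processing and feeds that response onward. A sketch with a stand-in StubResponse (an assumption replacing scrapy's real Response):

```python
# Hedged sketch of process_exception: a Response return value means
# "handled", None means "keep propagating the exception".

class StubRequest:
    def __init__(self, url):
        self.url = url

class StubResponse:
    def __init__(self, url, status):
        self.url = url
        self.status = status

class TimeoutFallbackMiddleware:
    """Turn a timeout into a synthetic 504 response; let every other
    exception keep propagating to the remaining middlewares / errback."""

    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            return StubResponse(request.url, 504)  # handled: stop the chain
        return None  # unhandled: exception processing continues

mw = TimeoutFallbackMiddleware()
handled = mw.process_exception(StubRequest('http://example.com/'), TimeoutError(), spider=None)
unhandled = mw.process_exception(StubRequest('http://example.com/'), ValueError(), spider=None)
print(handled.status)  # 504
print(unhandled)       # None
```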
from_crawler(cls, crawler)
This class method is usually the entry point for accessing settings and signals.
```python
@classmethod
def from_crawler(cls, crawler):
    return cls(
        mysql_host=crawler.settings.get('MYSQL_HOST'),
        mysql_db=crawler.settings.get('MYSQL_DB'),
        mysql_user=crawler.settings.get('MYSQL_USER'),
        mysql_pw=crawler.settings.get('MYSQL_PW'),
    )
```
Scrapy's built-in downloader middleware
The following are Scrapy's default downloader middlewares:
```python
# priority values are Scrapy's documented defaults
DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
```
For details on Scrapy's built-in middlewares, see the official documentation.
Spider Middleware
As shown in the first figure of this article, spider middleware processes responses as well as the items and Requests generated by the spider.
To enable a spider middleware, you must first turn it on in the settings:
```python
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
```
The smaller the number, the closer to the engine and the earlier process_spider_input() runs; the larger the number, the closer to the spider and the earlier process_spider_output() runs. Use None to disable a middleware.
Writing a custom spider middleware
process_spider_input(response, spider)
This method is called when a response passes through the spider middleware; it should return None.
process_spider_output(response, result, spider)
This method is called with the result the spider returns after processing a response. It must return an iterable of Request or Item objects; usually the result itself is returned.
process_spider_exception(response, exception, spider)
This method is called when the spider or spider middleware raises an exception. It returns either None or an iterable of Request, dict, or Item objects.
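A short sketch of process_spider_output tying these methods together. Items are plain dicts here (one of the item forms Scrapy accepts); anything that is not a dict is treated as a Request and passed through (an assumption for this sketch):

```python
# Hedged sketch of a custom spider middleware that filters the spider's
# output, dropping items below a price threshold.

class DropCheapItemsMiddleware:
    """Drop items whose 'price' is below the threshold; pass everything
    else (including Requests) through unchanged."""

    def __init__(self, min_price=10):
        self.min_price = min_price

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, dict) and obj.get('price', 0) < self.min_price:
                continue  # drop this item
            yield obj     # keep items and Requests

mw = DropCheapItemsMiddleware(min_price=10)
scraped = [{'price': 5}, {'price': 20}, {'price': 10}]
kept = list(mw.process_spider_output(None, scraped, None))
print(kept)  # [{'price': 20}, {'price': 10}]
```

Because process_spider_output must return an iterable of Requests/items, a generator like this drops in directly once registered in SPIDER_MIDDLEWARES.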