How to Use PyQuery, a Powerful Tool for Python Web Scraping

Knowledge 06-08


Source: xiaomayi2012

segmentfault.com/a/1190000005182997


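Preface

Do you find XPath syntax a little obscure and hard to remember?

Do you find BeautifulSoup's syntax a little awkward and hard to follow?

Are you still wrestling with regular expressions, only to be driven mad because you left out a single dot?

Do you already have some front-end background and know CSS selectors, yet keep mixing them up with other, odder selector syntaxes?

Well then, good news for front-end developers: PyQuery is here. The name will immediately remind you of jQuery, and if you are familiar with jQuery, PyQuery is the natural choice for parsing documents. That goes for me too!

PyQuery is a Python library modeled closely on jQuery. Its syntax is almost identical to jQuery's, so there is no need to memorize yet another set of odd methods.

Could anything really be this convenient? I can hardly wait!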

Installation

With a tool this handy, install it right away:

pip install pyquery

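References

This article is based on the official documentation. For more detail, go to the official documentation itself, which is the most authoritative source.

Current version: 1.2.4 (2016/3/24)

Official documentation (https://pythonhosted.org/pyquery/)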

Overview

pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jquery. pyquery uses lxml for fast xml and html manipulation. This is not (or at least not yet) a library to produce or interact with javascript code. I just liked the jquery API and I missed it in python so I told myself "Hey let's make jquery in python". This is the result. It can be used for many purposes, one idea that I might try in the future is to use it for templating with pure http templates that you modify using pyquery. I can also be used for web scrapping or for theming applications with Deliverance.

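In short: pyquery lets you run jQuery-style queries on XML documents, and its API is kept as close to jQuery's as possible. pyquery uses lxml under the hood, which makes XML and HTML processing fast.

The library is not (or at least not yet) meant for producing or interacting with JavaScript code. The author simply liked the jQuery API and wanted the same thing in Python.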

Initialization

There are four ways to initialize a PyQuery object.

(1) From a string

from pyquery import PyQuery as pq
doc = pq("<html></html>")

pq accepts HTML code directly as its argument. After this call, doc plays the role of the $ symbol in jQuery.

(2) From lxml.etree

from pyquery import PyQuery as pq
from lxml import etree
doc = pq(etree.fromstring("<html></html>"))

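You can also parse the markup with lxml's etree first and hand the result to pq. Note that etree.fromstring is a strict XML parser and raises an error on malformed input; to have incomplete or sloppy HTML repaired into a complete, well-structured tree, use lxml's HTML parser instead, for example etree.HTML.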

(3) From a URL

from pyquery import PyQuery as pq
doc = pq(url="http://www.baidu.com")

This fetches the page for you, much like requesting the link with urllib2 (urllib.request in Python 3) and parsing the returned HTML.

(4) From a file

from pyquery import PyQuery as pq
doc = pq(filename="hello.html")

You can pass the path of a local file directly.

Quick tour

Let's use a local file as the example. We pass in a file named hello.html with the following contents:

<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>

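Then write the following program:

from pyquery import PyQuery as pq

# Load and parse the local file
doc = pq(filename="hello.html")
# Inner HTML of the root element
print(doc.html())
print(type(doc))
# Select every <li> element
li = doc("li")
print(type(li))
# Text of all selected elements, joined by spaces
print(li.text())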

Output

    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>

<class 'pyquery.pyquery.PyQuery'>
<class 'pyquery.pyquery.PyQuery'>
first item second item third item fourth item fifth item

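Think back to jQuery's syntax: the results are just what you would expect there, aren't they?

One thing worth noticing: after initialization the object's type is PyQuery, and after filtering it once with a selector the result is still a PyQuery object. That is exactly how jQuery behaves, and it could hardly be better. Now think about what BeautifulSoup and XPath give you back: a list, an object that cannot itself be filtered a second time (with BeautifulSoup or XPath syntax, that is).

Compared with that, I simply love PyQuery!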

Attribute operations

You can work with PyQuery using exactly the same syntax as jQuery.

from pyquery import PyQuery as pq
p = pq('<p id="hello" class="hello"></p>')('p')

attr