Top 10 Cities for Python Jobs: Is Yours on the List?

Knowledge 07-24


Source: a self-submitted contribution by lemon (WeChat official account: Python數據之道)


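Preface

Once job postings have been scraped from Zhaopin (智聯招聘), the next question is how to analyse the content and extract useful information.

This article builds on the data collected in the previous article, "5分鐘掌握智聯招聘網站爬取並保存到MongoDB資料庫" (scraping Zhaopin and saving the results to MongoDB in five minutes). It analyses the postings returned for the keyword "python" and derives, among other things, the ten cities in China with the most Python job postings.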

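I. Main analysis steps

Read the data

Clean the data

Analyse how the number of postings is distributed across China's major cities

Analyse monthly salaries nationwide

Build a word cloud from the job requirement descriptions to find the most frequent keywords

Pick two cities and, for each, analyse the monthly salary distribution and the requirements word cloud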

II. Detailed analysis

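import pymongo
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline  # Jupyter magic; omit when running as a plain script

plt.style.use("ggplot")

# Make matplotlib display Chinese characters correctly
plt.rcParams["font.sans-serif"] = ["SimHei"]  # set the default font
plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign "-" from rendering as a box in saved figures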

1 Reading the data

client = pymongo.MongoClient("localhost")
db = client["zhilian"]
table = db["python"]

columns = ["zwmc", "gsmc", "zwyx", "gbsj", "gzdd", "fkl", "brief", "zw_link", "_id", "save_date"]

# url_set = set([records["zw_link"] for records in table.find()])
# print(url_set)

df = pd.DataFrame([records for records in table.find()], columns=columns)

# Meaning of the abbreviated column names:
# columns_update = ["job title",
#                   "company name",
#                   "monthly salary",
#                   "posting date",
#                   "work location",
#                   "feedback rate",
#                   "job summary",
#                   "posting URL",
#                   "_id",
#                   "date saved"]
# df.columns = columns_update

print("Total rows: {}".format(df.shape[0]))
df.head()

The result is shown in Figure 1.

2 Cleaning the data

2.1 Converting string dates to datetime

df["save_date"] = pd.to_datetime(df["save_date"])
print(df["save_date"].dtype)
# df["save_date"]

datetime64[ns]
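As a small illustration of why the conversion is worthwhile, the sketch below applies `pd.to_datetime` to a few made-up date strings. The sample values and their `YYYY-MM-DD` format are assumptions for illustration, not data from the crawl.

```python
import pandas as pd

# Hypothetical sample; the real column comes from the MongoDB collection.
sample = pd.DataFrame({"save_date": ["2017-05-15", "2017-05-16", "2017-05-17"]})

# Parse the strings into pandas datetime values.
sample["save_date"] = pd.to_datetime(sample["save_date"])

# Once parsed, comparisons and the .dt accessor work on real dates.
latest = sample["save_date"].max()
weekdays = sample["save_date"].dt.day_name().tolist()
print(latest.date(), weekdays)
```

With real datetime values, filtering by date range or grouping by day becomes a one-liner instead of string manipulation.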

2.2 Keeping only salaries in the "XXXX-XXXX" format

df_clean = df[["zwmc", "gsmc", "zwyx", "gbsj", "gzdd", "fkl", "brief", "zw_link", "save_date"]]

# Keep only rows whose monthly salary has the form "XXXX-XXXX", to simplify the later analysis
df_clean = df_clean[df_clean["zwyx"].str.contains(r"\d+-\d+", regex=True)]
print("Total rows: {}".format(df_clean.shape[0]))
# df_clean.head()

Total rows: 22605
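To make the filter concrete, here is a minimal sketch using only the standard `re` module. The sample salary strings are hypothetical; `re.search` is used because it behaves like `Series.str.contains(..., regex=True)` does for each element.

```python
import re

# Hypothetical salary strings of the kinds a job site might return.
samples = ["8000-16000", "15001-20000", "面議", "1000元以下", "100000以上"]

# One or more digits, a hyphen, one or more digits.
pattern = re.compile(r"\d+-\d+")
kept = [s for s in samples if pattern.search(s)]
print(kept)  # ['8000-16000', '15001-20000']

# Without the backslashes the pattern looks for literal "d" characters
# and matches none of the samples.
broken = re.compile("d+-d+")
kept_broken = [s for s in samples if broken.search(s)]
print(kept_broken)  # []
```

Note that the pattern only requires a digit-hyphen-digit run somewhere in the string; anchoring it with `^` and `$` would be stricter.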

2.3 Splitting the salary field into its lower and upper bounds

# http://stackoverflow.com/questions/14745022/pandas-dataframe-how-do-i-split-a-column-into-two
# http://stackoverflow.com/questions/20602947/append-column-to-pandas-dataframe
# df_temp.loc[: ,"zwyx_min"],df_temp.loc[: , "zwyx_max"] = df_temp.loc[: , "zwyx"].str.split("-",1).str  # raises a warning

# Split on the first "-"; expand=True yields one column per part.
# (Unpacking the .str accessor directly is deprecated or removed in recent pandas releases.)
zwyx_split = df_clean.loc[:, "zwyx"].str.split("-", n=1, expand=True)
s_min, s_max = zwyx_split[0], zwyx_split[1]

df_min = pd.DataFrame(s_min)
df_min.columns = ["zwyx_min"]
df_max = pd.DataFrame(s_max)
df_max.columns = ["zwyx_max"]

df_clean_concat = pd.concat([df_clean, df_min, df_max], axis=1)

# df_clean["zwyx_min"].astype(int)
df_clean_concat["zwyx_min"] = pd.to_numeric(df_clean_concat["zwyx_min"])
df_clean_concat["zwyx_max"] = pd.to_numeric(df_clean_concat["zwyx_max"])
# print(df_clean["zwyx_min"].dtype)

print(df_clean_concat.dtypes)
df_clean_concat.head()

The output is shown in Figure 2.
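The numeric conversion matters because the split pieces are still text. The following sketch, on made-up salary strings, splits a `zwyx`-style column and compares how the lower bound sorts before and after `pd.to_numeric`. The sample data and the `bounds_text` / `bounds` names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical salary strings in the "min-max" format.
sample = pd.DataFrame({"zwyx": ["8000-16000", "15001-20000", "6001-8000"]})

# expand=True returns a DataFrame with one column per piece of the split.
bounds_text = sample["zwyx"].str.split("-", n=1, expand=True)
bounds_text.columns = ["zwyx_min", "zwyx_max"]

# The pieces are still strings, so sorting compares them character by character.
text_order = bounds_text["zwyx_min"].sort_values().tolist()

# After pd.to_numeric the same column sorts by value.
bounds = bounds_text.apply(pd.to_numeric)
numeric_order = bounds["zwyx_min"].sort_values().tolist()

print(text_order)     # ['15001', '6001', '8000']
print(numeric_order)  # [6001, 8000, 15001]
```

Sorting or binning the text version would silently produce a misleading salary trend, which is why the conversion comes before any sorting or plotting.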

Sort the records by minimum monthly salary:

df_clean_concat.sort_values("zwyx_min", inplace=True)
# df_clean_concat.tail()

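Check whether the scraped data contains duplicates:

# Rows whose posting URL has already appeared earlier in the frame
print(df_clean_concat[df_clean_concat.duplicated("zw_link") == True])

Empty DataFrame
Columns: [zwmc, gsmc, zwyx, gbsj, gzdd, fkl, brief, zw_link, save_date, zwyx_min, zwyx_max]
Index: []

As the output shows, the data contains no duplicates.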
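For comparison, here is a sketch of what the same check looks like when duplicates are present, and one way to drop them. The three sample rows and the `example.com` URLs are invented for illustration.

```python
import pandas as pd

# Hypothetical postings; the first and third share the same link.
sample = pd.DataFrame({
    "zwmc": ["Python Developer", "Data Analyst", "Python Developer"],
    "zw_link": [
        "http://example.com/job/1",
        "http://example.com/job/2",
        "http://example.com/job/1",
    ],
})

# duplicated() marks the second and later occurrences of each link.
dup_mask = sample.duplicated("zw_link")
print(sample[dup_mask])

# drop_duplicates() keeps the first occurrence of each link.
deduplicated = sample.drop_duplicates("zw_link", keep="first")
print(len(deduplicated))
```

Keying the check on the posting URL rather than on the title avoids discarding distinct postings that merely share a job title.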

3 Analysing postings nationwide

3.1 Distribution of postings across major cities

# from IPython.core.display import display, HTML

ADDRESS = ["北京", "上海", "廣州", "深圳", "天津", "武漢", "西安", "成都", "大連",
           "長春", "瀋陽", "南京", "濟南", "青島", "杭州", "蘇州", "無錫", "寧波",
           "重慶", "鄭州", "長沙", "福州", "廈門", "哈爾濱", "石家莊", "合肥", "惠州",
           "太原", "昆明", "煙台", "佛山", "南昌", "貴陽", "南寧"]

df_city = df_clean_concat.copy()

# Work locations are written in several ways; Beijing, for example, also appears as "北京-朝陽區".
# Normalise them by substitution, here with pandas' replace() method.
for city in ADDRESS:
    df_city["gzdd"] = df_city["gzdd"].replace([(city + ".*")], [city], regex=True)

# Restrict the analysis to the major cities
df_city_main = df_city[df_city["gzdd"].isin(ADDRESS)]

df_city_main_count = df_city_main.groupby("gzdd")[["zwmc", "gsmc"]].count()
df_city_main_count["gsmc"] = df_city_main_count["gsmc"] / (df_city_main_count["gsmc"].sum())
df_city_main_count.columns = ["number", "percentage"]

# Sort by number of postings
df_city_main_count.sort_values("number", ascending=False, inplace=True)

# Add a helper column with the city and its percentage, for use as a plot label later
df_city_main_count["label"] = df_city_main_count.index + " " + ((df_city_main_count["percentage"] * 100).round()).astype("int").astype("str") + "%"
print(type(df_city_main_count))

# The ten cities with the most postings
print(df_city_main_count.head(10))

<class 'pandas.core.frame.DataFrame'>
        number  percentage  label
gzdd
北京      6936    0.315948    北京 32%
上海      3213    0.146358    上海 15%
深圳      1908    0.086913    深圳 9%
成都      1290    0.058762    成都 6%
杭州      1174    0.053478    杭州 5%
廣州      1167    0.053159    廣州 5%
南京       826    0.037626    南京 4%
鄭州       741    0.033754    鄭州 3%
武漢       552    0.025145    武漢 3%
西安       473    0.021546    西安 2%
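The normalise-then-count idea can be tried on a handful of rows. This is a minimal sketch with invented location strings and a shortened, two-city stand-in for `ADDRESS`; it uses `value_counts(normalize=True)` as a compact alternative to the groupby and division used in the article.

```python
import pandas as pd

# Shortened stand-in for ADDRESS and a few hypothetical location strings.
cities = ["北京", "上海"]
locations = pd.Series(["北京-朝陽區", "北京", "上海-浦東新區", "北京-海淀區", "保定"])

# Collapse "city-district" variants to the bare city name.
normalised = locations.copy()
for city in cities:
    normalised = normalised.replace(city + ".*", city, regex=True)

# Keep only the listed cities, then count and compute each city's share.
main = normalised[normalised.isin(cities)]
number = main.value_counts()
share = main.value_counts(normalize=True)
print(number)
print(share)
```

As in the article, the shares are relative to the postings in the listed cities only, not to every posting scraped.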

Plot the result:

from matplotlib import cm

label = df_city_main_count["label"]
sizes = df_city_main_count["number"]

# Set the figure size
# (numeric literals marked "assumed" were lost from the source; the values shown are placeholders)
fig, axes = plt.subplots(figsize=(10, 6), ncols=2)  # figsize assumed
ax1, ax2 = axes.ravel()

colors = cm.PiYG(np.arange(len(sizes)) / len(sizes))
# colormaps: Paired, autumn, rainbow, gray,spring,Darks

# There are too many cities to show labels and percentages on the pie itself
patches, texts = ax1.pie(sizes, labels=None, shadow=False, startangle=0, colors=colors)  # startangle assumed

ax1.axis("equal")
ax1.set_title("Distribution of job postings", loc="center")

# ax2 only shows the legend
ax2.axis("off")
ax2.legend(patches, label, loc="center left", fontsize=9)  # fontsize assumed

plt.savefig("job_distribute.jpg")
plt.show()

The output is the pie chart below.

3.2 Monthly salary distribution (nationwide)

from matplotlib.ticker import FormatStrFormatter

# Numeric literals marked "assumed" were lost from the source; the values shown are placeholders.
fig, (ax1, ax2) = plt.subplots(figsize=(10, 8), nrows=2)  # figsize assumed

x_pos = list(range(df_clean_concat.shape[0]))
y1 = df_clean_concat["zwyx_min"]

ax1.plot(x_pos, y1)
ax1.set_title("Trend of min monthly salary in China", size=14)  # size assumed
ax1.set_xticklabels([])
ax1.set_ylabel("min monthly salary(RMB)")

bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]
# density=True replaces the normed argument, which current matplotlib no longer accepts
counts, bins, patches = ax2.hist(y1, bins, density=True, histtype="bar", facecolor="g", rwidth=0.8)
ax2.set_title("Hist of min monthly salary in China", size=14)  # size assumed
ax2.set_yticklabels([])
# ax2.set_xlabel("min monthly salary(RMB)")

# http://stackoverflow.com/questions/6352740/matplotlib-label-each-bin
ax2.set_xticks(bins)  # use the bin edges as xticks
ax2.set_xticklabels(bins, rotation=-90)  # orientation of the xticklabels (angle assumed)

# Label the raw counts and the percentages below the x-axis...
bin_widths = np.diff(bins)
bin_centers = 0.5 * bin_widths + bins[:-1]
for count, width, x in zip(counts, bin_widths, bin_centers):
    # # Label the raw counts
    # ax2.annotate(str(count), xy=(x, 0), xycoords=("data", "axes fraction"),
    #              xytext=(0, -70), textcoords="offset points", va="top", ha="center", rotation=-90)

    # Label the percentages; density * bin width is the share of postings in the bin,
    # which stays correct even though the last bin is wider than the others
    percent = "%0.0f%%" % (100 * float(count) * width)
    ax2.annotate(percent, xy=(x, 0), xycoords=("data", "axes fraction"),
                 xytext=(0, -40), textcoords="offset points", va="top", ha="center",
                 rotation=-90, color="b", size=14)  # xytext, rotation and size assumed

fig.savefig("salary_quanguo_min.jpg")

The output is shown in the figure below.
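The percentage labels depend on how the histogram is normalised, which is easy to get wrong when the last bin is much wider than the rest. The sketch below, on invented salaries and a shortened bin list, uses `np.histogram` to compare three ways of computing each bin's share; all sample values are assumptions.

```python
import numpy as np

# Hypothetical minimum salaries and a shortened bin list with a wide last bin.
salaries = np.array([4000, 5000, 7000, 8000, 8000, 10000, 13000, 30000])
bins = [3000, 6000, 9000, 12000, 15000, 100000]

# Raw counts per bin give the share directly.
counts, edges = np.histogram(salaries, bins=bins)
percent_from_counts = 100 * counts / counts.sum()

# A density histogram must be multiplied by the bin width to get the same share.
density, _ = np.histogram(salaries, bins=bins, density=True)
percent_from_density = 100 * density * np.diff(edges)

# Dividing densities by their sum understates wide bins.
naive = 100 * density / density.sum()

print(counts)
print(percent_from_counts)
print(percent_from_density)
print(naive)
```

With equal-width bins all three calculations agree; they diverge only when bin widths differ, as they do for a catch-all top bin.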

Next, the same analysis of the monthly salary distribution with some extreme values excluded:

# The comparison operator was lost from the source; "<=" is assumed here
df_zwyx_adjust = df_clean_concat[df_clean_concat["zwyx_min"] <= 20000]

# Numeric literals marked "assumed" were lost from the source; the values shown are placeholders.
fig, (ax1, ax2) = plt.subplots(figsize=(10, 8), nrows=2)  # figsize assumed

x_pos = list(range(df_zwyx_adjust.shape[0]))
y2 = df_zwyx_adjust["zwyx_min"]

ax1.plot(x_pos, y2)
ax1.set_title("Trend of min monthly salary in China (adjust)", size=14)  # size assumed
ax1.set_xticklabels([])
ax1.set_ylabel("min monthly salary(RMB)")

bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000]
# density=True replaces the normed argument, which current matplotlib no longer accepts
counts, bins, patches = ax2.hist(y2, bins, density=True, histtype="bar", facecolor="g", rwidth=0.8)
ax2.set_title("Hist of min monthly salary in China (adjust)", size=14)  # size assumed
ax2.set_yticklabels([])
# ax2.set_xlabel("min monthly salary(RMB)")

# http://stackoverflow.com/questions/6352740/matplotlib-label-each-bin
ax2.set_xticks(bins)  # use the bin edges as xticks
ax2.set_xticklabels(bins, rotation=-90)  # orientation of the xticklabels (angle assumed)

# Label the raw counts and the percentages below the x-axis...
bin_widths = np.diff(bins)
bin_centers = 0.5 * bin_widths + bins[:-1]
for count, width, x in zip(counts, bin_widths, bin_centers):
    # # Label the raw counts
    # ax2.annotate(str(count), xy=(x, 0), xycoords=("data", "axes fraction"),
    #              xytext=(0, -70), textcoords="offset points", va="top", ha="center", rotation=-90)

    # Label the percentages; density * bin width is the share of postings in the bin
    percent = "%0.0f%%" % (100 * float(count) * width)
    ax2.annotate(percent, xy=(x, 0), xycoords=("data", "axes fraction"),
                 xytext=(0, -40), textcoords="offset points", va="top", ha="center",
                 rotation=-90, color="b", size=14)  # xytext, rotation and size assumed

fig.savefig("salary_quanguo_min_adjust.jpg")

The output is shown in the figure below.

3.3 Skill requirements

brief_list = list(df_clean_concat["brief"])
brief_str = "".join(brief_list)
print(type(brief_str))
# print(brief_str)
# with open("brief_quanguo.txt", "w", encoding="utf-8") as f:
#     f.write(brief_str)

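Build a word cloud from the collected job requirements. The script reads brief_quanguo.txt, the file written by the commented-out lines above: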

# -*- coding: utf-8 -*-
"""
Created on Wed May 17 2017

@author: lemon
"""

import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import os
import PIL.Image as Image
import numpy as np

with open("brief_quanguo.txt", "rb") as file:
    # Read the file contents
    text = file.read()

# First segment the text with jieba, a Chinese word-segmentation tool
wordlist = jieba.cut(text, cut_all=False)  # cut_all: True is full mode, False is accurate mode
wordlist_space_split = " ".join(wordlist)

d = os.path.dirname(__file__)
alice_coloring = np.array(Image.open(os.path.join(d, "colors.png")))
my_wordcloud = WordCloud(background_color="#F0F8FF", max_words=100, mask=alice_coloring,
                         max_font_size=300, random_state=42).generate(wordlist_space_split)  # random_state value assumed (lost from the source)

image_colors = ImageColorGenerator(alice_coloring)

# recolor() updates the cloud in place using the colours of the mask image
my_wordcloud.recolor(color_func=image_colors)
plt.imshow(my_wordcloud)  # display the word cloud as an image
plt.axis("off")  # hide the axes
plt.show()

my_wordcloud.to_file(os.path.join(d, "brief_quanguo_colors_cloud.png"))

The result is as follows.
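A word cloud shows relative frequency only visually. To read off the actual top keywords, the space-separated token string can also be counted directly. The sketch below uses only the standard library; the token string stands in for `wordlist_space_split`, and both it and the stopword set are invented for illustration.

```python
from collections import Counter

# Stand-in for wordlist_space_split: tokens separated by single spaces.
tokens_text = "python django development experience python linux development python mysql"

# Hypothetical stopwords: frequent words that say nothing about skills.
stopwords = {"experience"}

# Count the remaining tokens and list the most frequent ones.
words = [w for w in tokens_text.split() if w not in stopwords]
frequencies = Counter(words)
top = frequencies.most_common(2)
print(top)  # [('python', 3), ('development', 2)]
```

The same counts could be passed to the word cloud library instead of raw text, which makes the stopword filtering explicit.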

4 Beijing

4.1 Monthly salary distribution

df_beijing = df_clean_concat[df_clean_concat["gzdd"].str.contains("北京.*", regex=True)]
df_beijing.to_excel("zhilian_kw_python_bj.xlsx")
print("Total rows: {}".format(df_beijing.shape[0]))
# df_beijing.head()

Total rows: 6936

Reusing the code from the nationwide analysis, the monthly salary distribution looks like this:

4.2 Skill requirements

brief_list_bj = list(df_beijing["brief"])
brief_str_bj = "".join(brief_list_bj)
print(type(brief_str_bj))
# print(brief_str_bj)
# with open("brief_beijing.txt", "w", encoding="utf-8") as f:
#     f.write(brief_str_bj)

The word cloud is as follows:

5 Changsha

5.1 Monthly salary distribution

df_changsha = df_clean_concat[df_clean_concat["gzdd"].str.contains("長沙.*", regex=True)]
# df_changsha = pd.DataFrame(df_changsha, ignore_index=True)
df_changsha.to_excel("zhilian_kw_python_cs.xlsx")
print("Total rows: {}".format(df_changsha.shape[0]))
# df_changsha.tail()

Total rows: 280

Reusing the code from the nationwide analysis, the monthly salary distribution looks like this:

5.2 Skill requirements

brief_list_cs = list(df_changsha["brief"])
brief_str_cs = "".join(brief_list_cs)
print(type(brief_str_cs))
# print(brief_str_cs)
# with open("brief_changsha.txt", "w", encoding="utf-8") as f:
#     f.write(brief_str_cs)

The word cloud is as follows:
