5 Short Code Snippets for Easy Data Visualization (Python + Matplotlib)

Knowledge 05-01

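This article is about Matplotlib, a powerful Python visualization library. Five short code snippets cover scatter plots, line plots, histograms, bar plots, and box plots. Each function is only about ten lines long, so it hardly gets simpler than this.

Data visualization is a major part of a data scientist's work. In the early stages of a project, you usually carry out exploratory data analysis (EDA) to understand the data and gain insight into it. Visualization helps make relationships in the data clearer and easier to grasp, especially for large, high-dimensional datasets.

Likewise, at the end of a project it is important to present the final results in a clear, concise, and compelling way, because the audience is often a non-technical client who can follow the results only if they are presented that way.

Matplotlib is a popular Python library that makes data visualization easy. However, setting up the data, parameters, and figure every time you start plotting for a new project is tedious. In this article we look at five data visualization methods and write some quick, simple functions for them with Matplotlib.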

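First, take a look at this big chart, which guides you to the right visualization method for different situations:

Choosing the appropriate data visualization technique for a given situation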

Scatter Plot

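A scatter plot is great for showing the relationship between two variables, because you can see the raw distribution of the data directly. You can also see how different groups of data relate to each other simply by giving each group its own color, as in the first figure below. What if you want to visualize the relationship among three variables? No problem: just use one more parameter, such as point size, to encode the third variable, as in the second figure below.

Scatter plot with color groupings

Adding a new dimension: circle size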

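Now for the code. First import Matplotlib's pyplot module under the alias plt. Call plt.subplots() to create a new figure. Pass the x-axis and y-axis data to the function as the arrays x_data and y_data; the function then hands these arrays, along with the other parameters, to ax.scatter() to draw the scatter plot. We can also set the point size, color, and alpha transparency, and even put the y-axis on a logarithmic scale. Finally, the title and axis labels are set. This one function handles the plot end to end.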

import matplotlib.pyplot as plt
import numpy as np

def scatterplot(x_data, y_data, x_label="", y_label="", title="", color="r", yscale_log=False):
    # Create the plot object
    _, ax = plt.subplots()

    # Plot the data, set the size (s), color and transparency (alpha)
    # of the points. The value of s was lost in the source text;
    # 10 is an assumed value.
    ax.scatter(x_data, y_data, s=10, color=color, alpha=0.75)

    # Optionally switch the y-axis to a logarithmic scale
    if yscale_log:
        ax.set_yscale("log")

    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
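The figures described earlier encode a group by color and a third variable by point size, which the scatterplot() function does not do on its own. Below is a minimal sketch of one way to get both. The function name grouped_scatter, its parameters, and the synthetic data are assumptions made for illustration and are not part of the original article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def grouped_scatter(x, y, groups, sizes, x_label="", y_label="", title=""):
    """Scatter plot with one color per group and marker area taken from `sizes`.

    Hypothetical helper, written for illustration only.
    """
    x, y, groups, sizes = map(np.asarray, (x, y, groups, sizes))
    fig, ax = plt.subplots()
    for name in np.unique(groups):
        mask = groups == name
        # Each scatter() call picks the next color from the default color cycle
        ax.scatter(x[mask], y[mask], s=sizes[mask], alpha=0.75, label=str(name))
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend(loc="best")
    return fig, ax


# Synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = 2 * x + rng.normal(size=60)
groups = np.repeat(["A", "B", "C"], 20)
sizes = rng.uniform(10, 200, size=60)  # third variable, shown as marker area

fig, ax = grouped_scatter(x, y, groups, sizes, "x", "y", "Grouped scatter")
```

Returning fig and ax, rather than nothing, lets the caller save or adjust the figure afterwards; that is a design choice of this sketch.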

Line Plot

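A line plot is the best choice when one variable changes substantially as another variable changes (that is, the two have high covariance), because it makes the relationship easy to see. In the figure below, for example, we can clearly see that the percentage of bachelor's degrees awarded to women in different majors varies a great deal over time.

Plotted as a scatter plot, the same data points would cluster together and look cluttered, making it hard to see what the data means. A line plot suits this case because it essentially gives a quick summary of the covariance of the two variables (percentage of women and time). Again, color can be used to distinguish multiple groups.

Percentage of bachelor's degrees awarded to women (USA)

The code is similar to the scatter plot code, with only minor changes to the parameters.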

def lineplot(x_data, y_data, x_label="", y_label="", title=""):
    # Create the plot object
    _, ax = plt.subplots()

    # Plot the line, set the linewidth (lw), color and
    # transparency (alpha) of the line. The lw and alpha values were
    # lost in the source text; 2 and 1 are assumed values.
    ax.plot(x_data, y_data, lw=2, color="#539caf", alpha=1)

    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
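The text mentions using color to distinguish several groups on one line plot, while lineplot() draws a single line. Here is a minimal sketch of a multi-series variant. The name multi_lineplot, the dictionary-based interface, and all data values are assumptions for illustration; the numbers are invented and are not the degree statistics from the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def multi_lineplot(x_data, y_series, x_label="", y_label="", title=""):
    """Draw one line per entry of `y_series`, a dict mapping label -> y values.

    Hypothetical helper, written for illustration only.
    """
    fig, ax = plt.subplots()
    for label, y_data in y_series.items():
        # Each plot() call picks the next color from the default color cycle
        ax.plot(x_data, y_data, lw=2, alpha=0.9, label=label)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend(loc="best")
    return fig, ax


# Invented values, for illustration only
years = np.arange(1970, 2011, 10)  # 1970, 1980, 1990, 2000, 2010
series = {
    "Series A": [10, 20, 35, 45, 50],
    "Series B": [60, 58, 55, 57, 59],
}

fig, ax = multi_lineplot(years, series, "Year", "Percentage", "Two series over time")
```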

Histogram

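A histogram is useful for viewing (or discovering) the distribution of data. The figure below is a histogram of the share of the population at each IQ level. It clearly shows the concentration toward the center and where the median lies, and that the data follows a normal distribution. Using bars (rather than a scatter plot) makes the relative differences between the frequencies of the bins easy to see. Binning (discretizing the data) also helps reveal the bigger picture of the distribution; if we plotted every undiscretized data point, there would probably be a lot of noise, making it hard to see the true distribution.

Normally distributed IQ

Below is the code for creating a histogram with Matplotlib. Two parameters deserve attention. The first is n_bins, which controls how many bins the histogram has. More bins give finer detail but may introduce noise that obscures the overall distribution; fewer bins give more of a bird's-eye view, a fuller picture of what is going on without the finer details. The second is cumulative, a Boolean that selects whether the histogram is cumulative, that is, whether its shape corresponds to the probability density function (PDF) or the cumulative distribution function (CDF).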

def histogram(data, n_bins, cumulative=False, x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # ax.hist() takes the number of bins through its `bins` argument
    ax.hist(data, bins=n_bins, cumulative=cumulative, color="#539caf")
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
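To make the effect of the two parameters concrete, the sketch below draws the same data with few bins, with many bins, and as a cumulative histogram. It calls ax.hist() directly rather than the function above so that it is self-contained. The synthetic scores and the choice of 5 and 50 bins are assumptions for illustration; density=True is added in the third panel so that the cumulative heights end at 1, as a CDF does.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic IQ-like scores, invented for illustration
rng = np.random.default_rng(42)
scores = rng.normal(loc=100, scale=15, size=1000)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Few bins: a coarse, bird's-eye view of the distribution
coarse_counts, _, _ = axes[0].hist(scores, bins=5, color="#539caf")
axes[0].set_title("5 bins")

# Many bins: more detail, but also more noise
fine_counts, _, _ = axes[1].hist(scores, bins=50, color="#539caf")
axes[1].set_title("50 bins")

# Cumulative and normalized: bar heights rise towards 1, like an empirical CDF
cum_heights, _, _ = axes[2].hist(
    scores, bins=50, cumulative=True, density=True, color="#539caf"
)
axes[2].set_title("Cumulative, normalized")

fig.tight_layout()
```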

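What if we want to compare the distributions of two variables in our data? You might think you have to make two separate histograms and put them side by side to compare them. There is a better way: overlay the histograms with different transparency. In the figure below, the uniform distribution is drawn with a transparency of 0.5 so that the normal distribution behind it remains visible. This lets the reader view both distributions on the same figure.

Overlaid histograms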

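The code for overlaid histograms needs to set up a few things:

Set the horizontal range to accommodate both variable distributions;

Compute and set the bin width from this range and the desired number of bins;

Draw one of the variables with a higher transparency so that both distributions can be shown on one figure.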

# Overlay two histograms to compare them
def overlaid_histogram(data1, data2, n_bins=0, data1_name="", data1_color="#539caf", data2_name="", data2_color="#7663b0", x_label="", y_label="", title=""):
    # Set the bounds for the bins so that the two distributions are fairly compared.
    # The default bin count was lost in the source text; 10 is an assumed value.
    max_nbins = 10
    data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]

    # n_bins == 0 means "use the default number of bins"
    if n_bins == 0:
        n_bins = max_nbins
    # Shared bin edges over the common range; each bin is
    # (data_range[1] - data_range[0]) / n_bins wide
    bins = np.linspace(data_range[0], data_range[1], n_bins + 1)

    # Create the plot. The alpha of the first histogram was lost in the
    # source text; 1 (fully opaque) is an assumed value.
    _, ax = plt.subplots()
    ax.hist(data1, bins=bins, color=data1_color, alpha=1, label=data1_name)
    ax.hist(data2, bins=bins, color=data2_color, alpha=0.75, label=data2_name)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc="best")
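The key point of the overlay is that both histograms use the same bin edges, so bars at the same x position are directly comparable. Below is a minimal, self-contained sketch of that idea using np.linspace to build the shared edges. The synthetic normal and uniform samples, the bin count of 10, and the alpha values are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic samples, invented for illustration
rng = np.random.default_rng(1)
normal_data = rng.normal(0, 1, 500)
uniform_data = rng.uniform(-3, 3, 500)

# Step 1: a horizontal range that covers both samples
low = min(normal_data.min(), uniform_data.min())
high = max(normal_data.max(), uniform_data.max())

# Step 2: shared bin edges; n_bins bins need n_bins + 1 edges
n_bins = 10
edges = np.linspace(low, high, n_bins + 1)

# Step 3: draw both, the one in front more transparent
fig, ax = plt.subplots()
counts_normal, _, _ = ax.hist(normal_data, bins=edges, alpha=1.0, label="normal")
counts_uniform, _, _ = ax.hist(uniform_data, bins=edges, alpha=0.5, label="uniform")
ax.legend(loc="best")
```

Because the edges run from the overall minimum to the overall maximum, every data point of both samples falls into some bin.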

Bar Plot

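Bar plots are most effective for visualizing categorical data with few (fewer than 10) categories. With too many categories, the bars crowd together, look cluttered, and become hard to read. Bar plots suit categorical data for two reasons: the height of the bars makes differences between categories easy to see, and the categories are easy to tell apart and can even be given different colors. We will look at three types of bar plot: regular, grouped, and stacked. Refer to the code for the details as we go along.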

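The regular bar plot is shown in the figure below. In the barplot() function, x_data gives the positions on the x-axis and y_data gives the bar heights on the y-axis. error_data holds the standard deviations, which are passed as yerr to draw an error bar centered on the top of each bar.

The grouped bar plot, shown in the figure below, lets us compare multiple categorical variables. The figure shows two relationships: first, how the score varies by group (G1, G2, and so on), and second, how it varies by gender, which is encoded by color. In the code, y_data_list is a list of lists, where each sublist represents one group. We assign x positions to the bars, loop through the sublists, and draw each one in its own color to produce the grouped bar plot.

The stacked bar plot is great for visualizing categorical data that has sub-categories. The figure below uses a stacked bar plot to show daily server load statistics. Stacking the servers in different colors lets us compare them and see which server carries the most load each day and how large that load is. The code follows the same style as the grouped bar plot: we loop through each group, but this time we draw the new bars on top of the old ones rather than beside them.

Here is the code for the three types of bar plot: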

def barplot(x_data, y_data, error_data, x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Draw bars, position them in the center of the tick mark on the x-axis
    ax.bar(x_data, y_data, color="#539caf", align="center")
    # Draw error bars to show standard deviation, set ls to "none"
    # to remove the line between points. The lw and capthick values were
    # lost in the source text; 2 is an assumed value for both.
    ax.errorbar(x_data, y_data, yerr=error_data, color="#297083", ls="none", lw=2, capthick=2)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
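The error bars need one mean and one standard deviation per category. As a rough sketch of where those numbers might come from, the example below computes them from raw samples with NumPy and draws them in a single call using the yerr argument of ax.bar(), instead of a separate ax.errorbar() call. The sample values, the category names, and the use of the sample standard deviation (ddof=1) are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Raw samples per category, invented for illustration
samples = {
    "A": [4, 5, 6],
    "B": [7, 9, 11],
    "C": [2, 2, 5],
}

categories = list(samples)
means = [np.mean(values) for values in samples.values()]
# Sample standard deviation (ddof=1); an assumption, the article does not say which one it uses
stds = [np.std(values, ddof=1) for values in samples.values()]

fig, ax = plt.subplots()
# yerr draws an error bar centered on the top of each bar; capsize adds the end caps
ax.bar(categories, means, yerr=stds, capsize=4, color="#539caf", ecolor="#297083")
ax.set_ylabel("Mean value")
ax.set_title("Bar heights are means, error bars are standard deviations")
```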

def stackedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    # y_data_names must contain one name per category in y_data_list
    _, ax = plt.subplots()
    # Running total of the bar heights already drawn at each x position
    bottom = np.zeros(len(x_data))
    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # The bottom of each bar is the top of everything stacked beneath it,
        # which is zero for the first category
        ax.bar(x_data, y_data_list[i], color=colors[i], bottom=bottom, align="center", label=y_data_names[i])
        bottom = bottom + np.asarray(y_data_list[i])
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc="upper right")
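When more than two categories are stacked, each bar has to start at the sum of all the categories beneath it, not just at the height of the previous one. A small worked sketch of that calculation with NumPy follows; the numbers are invented for illustration.

```python
import numpy as np

# Three categories (rows) measured at three x positions (columns); invented values
y_data_list = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

# Top of each bar: the running total down the stack
tops = np.cumsum(y_data_list, axis=0)
# Bottom of each bar: the running total before adding the bar itself
bottoms = tops - y_data_list

print(bottoms)
# [[0 0 0]
#  [1 2 3]
#  [5 7 9]]
print(tops[-1])  # total height of each stack
# [12 15 18]
```

The third category starts at 5, 7, and 9, the sum of the first two, whereas using only the previous category's heights (4, 5, 6) would make it overlap the second one.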

def groupedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    # x_data must be numeric positions; y_data_names must contain one name per category
    _, ax = plt.subplots()
    # Total width for all bars at one x location
    total_width = 0.8
    # Width of each individual bar
    ind_width = total_width / len(y_data_list)
    # Offsets of the bar centers; this centers each cluster of bars about the x tick mark
    alteration = (np.arange(len(y_data_list)) - (len(y_data_list) - 1) / 2) * ind_width

    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # Move the bar to the right on the x-axis so it doesn't
        # overlap with previously drawn ones
        ax.bar(np.asarray(x_data) + alteration[i], y_data_list[i], color=colors[i], label=y_data_names[i], width=ind_width)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc="upper right")
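The only subtle part of a grouped bar plot is the horizontal offset of each category within a cluster. The sketch below works through one way to compute offsets that are symmetric about the tick mark and then draws the bars. The group names, scores, and three-category layout are invented for illustration, and the offset formula is this sketch's own choice rather than something taken from the article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Invented scores: 3 categories (rows) in 4 groups (columns)
group_names = ["G1", "G2", "G3", "G4"]
y_data_list = np.array([
    [20, 35, 30, 35],
    [25, 32, 34, 20],
    [22, 28, 31, 27],
])
n_categories = len(y_data_list)

total_width = 0.8                      # width of a whole cluster
ind_width = total_width / n_categories  # width of one bar

# Bar-center offsets, symmetric about zero: for 3 bars this is -w, 0, +w
offsets = (np.arange(n_categories) - (n_categories - 1) / 2) * ind_width

x = np.arange(len(group_names))  # numeric positions of the clusters
fig, ax = plt.subplots()
for i in range(n_categories):
    ax.bar(x + offsets[i], y_data_list[i], width=ind_width, label=f"Category {i + 1}")

# Put the group names on the cluster centers
ax.set_xticks(x)
ax.set_xticklabels(group_names)
ax.legend(loc="upper right")
```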

Box Plot

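Histograms, covered earlier, are great for visualizing the distribution of a variable. But what if we need more information? Perhaps we want a clear view of the standard deviation. Or perhaps the median is very different from the mean: does that mean there are many outliers, or is the distribution itself skewed toward one end?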

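This is where the box plot comes in, because it shows all of the above. The bottom and top of the box are the first and third quartiles (the 25th and 75th percentiles of the data), and the line inside the box is the second quartile (the median). The whiskers, the T-shaped lines extending above and below the box, show the range of the data. With Matplotlib's default settings, each whisker ends at the most extreme data point within 1.5 times the interquartile range of the box, and points beyond that are drawn individually as outliers.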
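To make the quartiles and whiskers concrete, here is a small worked sketch that computes the numbers a box plot displays for one group, using np.percentile. The whisker rule used here, 1.5 times the interquartile range, is Matplotlib's documented default (whis=1.5); the data values are invented for illustration.

```python
import numpy as np

# One group of values, invented for illustration; 30 is a deliberate outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

# Whisker limits under the 1.5 * IQR rule
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits
whisker_low = data[data >= lower_limit].min()
whisker_high = data[data <= upper_limit].max()

# Everything outside the limits is drawn as an individual outlier point
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1 9
print(outliers)                   # [30]
```

Here the mean (7.5) sits well above the median (5.5), and the box plot would show why: a single outlier at 30 rather than a skew of the whole distribution.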

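Because a box is drawn for each group or variable, box plots are easy to set up. x_data is a list of the groups or variables, and each value in x_data corresponds to one column of values (a column vector) in y_data. Matplotlib's boxplot() function draws one box for each column of y_data, so all that remains is to set the various style parameters.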

def boxplot(x_data, y_data, base_color="#539caf", median_color="#297083", x_label="", y_label="", title=""):
    _, ax = plt.subplots()

    # Draw boxplots, specifying desired style
    ax.boxplot(
        y_data,
        # patch_artist must be True to control box fill
        patch_artist=True,
        # Properties of median line
        medianprops={"color": median_color},
        # Properties of box
        boxprops={"color": base_color, "facecolor": base_color},
        # Properties of whiskers
        whiskerprops={"color": base_color},
        # Properties of whisker caps
        capprops={"color": base_color},
    )

    # By default, the tick label starts at 1 and increments by 1 for
    # each box drawn. This sets the labels to the ones we want
    ax.set_xticklabels(x_data)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)

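So there you have five quick and easy data visualizations with Matplotlib. Wrapping functionality in functions always makes code easier to write and read. I hope you found this article useful and learned something from it. If you did, feel free to give it a like.

Original article: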

https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f
