Migrating to Python 3 with Pleasure


Compiled by: Python開發者 - 衝動老少年
English original: Alex Rogozhnikov

http://python.jobbole.com/89031/

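A guide to Python 3 features for data scientists

Python has become a mainstream language for machine learning and for other scientific fields that work heavily with data. It offers numerous deep learning frameworks and a well-established set of tools for data processing and visualization.

However, the Python ecosystem still lives in both Python 2 and Python 3, and Python 2 remains in use among data scientists. By the end of 2019 the scientific Python stack will stop supporting Python 2. As for numpy, after 2018 any new feature release will support only Python 3.

To make the transition less painful, I have collected a set of Python 3 features that you may find useful.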

Image source: Dario Bertini's post (Toptal)

Better path handling with pathlib

pathlib is part of the Python 3 standard library and saves you from chaining os.path.join calls:

    from pathlib import Path

    dataset = "wiki_images"
    datasets_root = Path("/path/to/datasets/")

    train_path = datasets_root / dataset / "train"
    test_path = datasets_root / dataset / "test"

    for image_path in train_path.iterdir():
        with image_path.open() as f:  # note, open is a method of Path object
            # do something with an image
            pass

Previously it was always tempting to use string concatenation (concise, but obviously bad for readability). With pathlib the code is safe, concise, and readable.

pathlib.Path also comes with a set of methods and properties that every Python newcomer previously had to search for (here p is a Path object):

    p.exists()
    p.is_dir()
    p.parts
    p.with_name("sibling.png")  # only change the name, but keep the folder
    p.with_suffix(".jpg")  # only change the extension, but keep the folder and the name
    p.chmod(mode)
    p.rmdir()

pathlib should save you a lot of time; see the docs and the reference for details.
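To try these methods without touching real data, here is a minimal sketch that builds a throwaway folder with tempfile. The folder layout and the file name cat.png are made up for the illustration and are not from the original article.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical dataset layout, created only for this demonstration
    train_path = Path(tmp) / "wiki_images" / "train"
    train_path.mkdir(parents=True)
    (train_path / "cat.png").write_bytes(b"not a real image")

    image_path = next(train_path.iterdir())
    image_exists = image_path.exists()
    renamed = image_path.with_name("sibling.png").name    # same folder, new name
    converted = image_path.with_suffix(".jpg").name       # same name, new extension
    last_parts = image_path.parts[-3:]                    # path components as a tuple
```

Note that with_name and with_suffix only build new Path objects; nothing on disk is renamed.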

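Type hinting is now part of the language

[Figure: example of type hinting in PyCharm]

Python is no longer just a language for small scripts. Today's data pipelines include many steps, each involving different frameworks (and sometimes very different logic).

Type hinting was introduced to help with the growing complexity of programs, so that machines can help with code verification. Previously, different modules used their own custom ways to specify types in docstrings (hint: PyCharm can convert old docstrings to the new type hinting).

As a simple example, the following code may work with different types of data (that is what we like about the Python data stack):

    def repeat_each_entry(data):
        """ Each entry in the data is doubled
        """
        index = numpy.repeat(numpy.arange(len(data)), 2)
        return data[index]

This code works, for example, for numpy.array (including multidimensional arrays), astropy.Table and astropy.Column, bcolz, cupy, mxnet.ndarray, and others.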

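This code will also run for pandas.Series, but in the wrong way:

    repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5]))  # returns Series with Nones inside

And this was only two lines of code. Imagine how unpredictable the behavior of a complex system can be when a single function may misbehave like this. In large systems it helps a lot to state which types each method expects, so that you are warned when a function receives an argument of an unexpected type.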

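    def repeat_each_entry(data: Union[numpy.ndarray, bcolz.carray]):
        ...

If you have a significant codebase, hinting tools such as MyPy are likely to become part of your continuous integration pipeline. The online tutorial "Putting Type Hints to Work" by Daniel Pyrathon gives a good introduction.

Side note: unfortunately, hinting is not yet powerful enough to provide fine-grained typing for ndarrays or tensors, but we may have it soon, and that would be a great feature for data science.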

Type hinting → type checking at runtime

By default, function annotations do not influence how your code runs; they merely document its intent.

However, you can enforce type checking at runtime with tools such as enforce, which can help with debugging (there are many cases in which type hinting alone does not catch the problem).

    @enforce.runtime_validation
    def foo(text: str) -> None:
        print(text)

    foo("Hi")  # ok
    foo(5)     # fails


    @enforce.runtime_validation
    def any2(x: List[bool]) -> bool:
        return any(x)

    any([False, True, False])   # True
    any2([False, True, False])  # True

    any(["False"])   # True
    any2(["False"])  # fails

    any([False, None])   # False
    any2([False, None])  # fails

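Other uses of function annotations

As mentioned above, annotations do not influence code execution; they provide meta-information that you can use however you wish.

For instance, measurement units are a common pain point in scientific work. The astropy package provides a simple decorator that checks the units of input quantities and converts the output to the required units.

    # Python 3
    from astropy import units as u

    @u.quantity_input()
    def frequency(speed: u.meter / u.s, wavelength: u.m) -> u.terahertz:
        return speed / wavelength

    frequency(speed=300_000 * u.km / u.s, wavelength=555 * u.nm)
    # output: 540.5405405405404 THz, frequency of green visible light

If you process tabular scientific data in Python (not necessarily huge amounts of it), you should give astropy a try.

You can also define your own application-specific decorators to control or convert inputs and outputs in the same manner.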
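As a rough sketch of such an application-specific decorator, the following checks arguments against their annotations at call time. The names check_annotations and scale are hypothetical, and the check is deliberately naive: it handles plain classes only and skips typing generics such as List[bool], so it is not a replacement for enforce or astropy's decorator.

```python
import functools
import inspect

def check_annotations(func):
    """Raise TypeError when an argument is not an instance of its annotated class."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = signature.parameters[name].annotation
            if expected is inspect.Parameter.empty:
                continue  # parameter has no annotation
            # only plain classes are checked; typing generics are skipped
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)
    return wrapper

@check_annotations
def scale(value: float, factor: int) -> float:
    return value * factor

scaled = scale(2.0, 3)   # passes the check
```

Because the check is a strict isinstance test, scale(2, 3) would be rejected as well (an int is not a float); a real tool would need a more forgiving rule for numeric types.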

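Matrix multiplication with @

Let's implement one of the simplest machine learning models, linear regression with L2 regularization (also known as ridge regression):

    # l2-regularized linear regression: || AX - b ||^2 + alpha * ||x||^2 -> min

    # Python 2
    X = np.linalg.inv(np.dot(A.T, A) + alpha * np.eye(A.shape[1])).dot(A.T.dot(b))
    # Python 3
    X = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ (A.T @ b)

Code with @ is more readable and easier to translate between deep learning frameworks: the same expression X @ W + b[None, :] for a single-layer perceptron works in numpy, cupy, pytorch, tensorflow (and other frameworks that operate on tensors).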
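The reason one expression works across frameworks is that @ simply calls the __matmul__ method of the left operand. A minimal sketch with a toy, pure-Python Matrix class (hypothetical, written only for this illustration and far slower than numpy) shows the mechanism:

```python
class Matrix:
    """Tiny dense matrix used only to illustrate the @ operator."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # each output entry is the dot product of a row of self and a column of other
        columns = list(zip(*other.rows))
        return Matrix(
            [sum(a * b for a, b in zip(row, column)) for column in columns]
            for row in self.rows
        )

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows

A = Matrix([[1, 2], [3, 4]])
identity = Matrix([[1, 0], [0, 1]])
swapped = A @ Matrix([[0, 1], [1, 0]])   # multiplying by a permutation swaps the columns
```

Any library that defines __matmul__ on its array type gets the same syntax, which is what makes the perceptron expression portable.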

Globbing with **

Recursive folder globbing is not easy in Python 2, even though the third-party glob2 module was written to overcome this. A recursive flag is supported since Python 3.5:

    import glob

    # Python 2
    found_images = (
        glob.glob("/path/*.jpg")
        + glob.glob("/path/*/*.jpg")
        + glob.glob("/path/*/*/*.jpg")
        + glob.glob("/path/*/*/*/*.jpg")
        + glob.glob("/path/*/*/*/*/*.jpg")
    )

    # Python 3
    found_images = glob.glob("/path/**/*.jpg", recursive=True)

A better option in Python 3 is pathlib (minus one import!):

    # Python 3
    found_images = pathlib.Path("/path/").glob("**/*.jpg")
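To check that both variants find the same files, here is a small self-contained sketch. The nested folders a and a/b and the file names are invented for the demonstration and live in a temporary directory.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # hypothetical nested layout: one image per level
    for folder in (root, root / "a", root / "a" / "b"):
        folder.mkdir(parents=True, exist_ok=True)
        (folder / "img.jpg").touch()
    (root / "a" / "notes.txt").touch()   # must not be matched

    with_glob = sorted(glob.glob(str(root / "**" / "*.jpg"), recursive=True))
    with_pathlib = sorted(str(p) for p in root.glob("**/*.jpg"))
```

In both APIs the ** component also matches zero directories, so the image directly under the root is included.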

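print is now a function

Yes, code now needs those annoying parentheses, but there are some advantages:

Simpler syntax for using a file descriptor:

    print >>sys.stderr, "critical error"      # Python 2
    print("critical error", file=sys.stderr)  # Python 3

Printing tab-separated values without str.join:

    # Python 3
    print(*array, sep="\t")
    print(batch, epoch, loss, accuracy, time, sep="\t")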

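Overriding or redefining the output of print:

    # Python 3
    _print = print  # store the original print function
    def print(*args, **kargs):
        pass  # do something useful, e.g. store output to some file

In jupyter it is useful to log each output to a separate file (to track what happened after you got disconnected), and you can now override print to do so.

Below you can see a context manager, built with the contextmanager decorator, that temporarily overrides the behavior of print:

    @contextlib.contextmanager
    def replace_print():
        import builtins
        _print = print  # saving old print function
        # or use some other function here
        builtins.print = lambda *args, **kwargs: _print("new printing", *args, **kwargs)
        try:
            yield
        finally:
            builtins.print = _print  # restore even if the body raises

    with replace_print():
        ...  # code here will invoke the other print function

This approach is not recommended, because it can introduce subtle problems.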
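If the goal is only to capture output, a less invasive option than replacing builtins.print is contextlib.redirect_stdout from the standard library. This is a different technique from the one above, shown here as a minimal sketch together with the file= and sep= arguments mentioned earlier:

```python
import contextlib
import io

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("batch", 1, "loss", 0.25, sep="\t")   # captured instead of shown on screen

print("critical error", file=buffer)            # explicit target via file=
captured = buffer.getvalue()
```

redirect_stdout swaps sys.stdout for the duration of the block, so it affects every print that does not pass file= explicitly.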

print can take part in list comprehensions and other language constructs:

    # Python 3
    result = process(x) if is_valid(x) else print("invalid item: ", x)

Underscores in numeric literals (thousands separator)

PEP 515 introduced underscores in numeric literals. Since Python 3.6, underscores can be used to group digits visually in integer, floating-point, and complex literals.

    # grouping decimal numbers by thousands
    one_million = 1_000_000

    # grouping hexadecimal addresses by words
    addr = 0xCAFE_F00D

    # grouping bits into nibbles in a binary literal
    flags = 0b_0011_1111_0100_1110

    # same, for string conversions
    flags = int("0b_1111_0000", 2)
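The same PEP also added "_" as a grouping option in format specifications, so grouped numbers can be printed as well as parsed. A short sketch:

```python
one_million = 1_000_000
parsed = int("1_000_000")           # underscores are accepted when parsing, too
grouped = f"{one_million:_}"        # '_' as a grouping option in the format spec
hex_grouped = f"{0xCAFEF00D:_X}"    # hex output is grouped every four digits
```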

f-strings for simple and reliable formatting

The default formatting system is flexible, but that flexibility is not what data experiments need. The resulting code is either too verbose or too fragile when anything changes.

Typical data science code repeatedly prints some logging information in a fixed format. A common pattern looks like this:

    # Python 2
    print("{batch:3} {epoch:3} / {total_epochs:3} accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}".format(
        batch=batch, epoch=epoch, total_epochs=total_epochs,
        acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
        avg_time=time / len(data_batch)
    ))

    # Python 2 (too error-prone during fast modifications, please avoid):
    print("{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}".format(
        batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
        time / len(data_batch)
    ))

Sample output:

    120  12 / 300 accuracy: 0.8180±0.4649 time: 56.60

f-strings, also known as formatted string literals, were introduced in Python 3.6:

    # Python 3.6+
    print(f"{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}")

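They are also handy when writing queries or generating code fragments:

    query = f"INSERT INTO STATION VALUES (13, {city!r}, {state!r}, {latitude}, {longitude})"

Important: !r is not SQL escaping. Do not forget to escape the values (or pass them as query parameters) to prevent SQL injection attacks.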
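One way to avoid manual escaping altogether is to let the database driver bind the values. The sketch below uses the standard sqlite3 module with an in-memory database; the table schema and the sample row are assumptions made up for the example, since the article does not define the STATION table.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE station (id INTEGER, city TEXT, state TEXT, latitude REAL, longitude REAL)"
)

# made-up sample row; the apostrophe would break a naively formatted query
city, state, latitude, longitude = "O'Fallon", "IL", 38.6, 89.9

# '?' placeholders let the driver handle quoting, so no manual escaping is needed
connection.execute(
    "INSERT INTO station VALUES (?, ?, ?, ?, ?)",
    (13, city, state, latitude, longitude),
)
rows = connection.execute(
    "SELECT city, state FROM station WHERE id = ?", (13,)
).fetchall()
connection.close()
```

f-strings remain a good fit for the fixed parts of a query, such as a table name chosen from a hard-coded list.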

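Explicit difference between 'true division' and 'floor division'

For data science this is definitely a handy change (though, I believe, not for systems programming).

    data = pandas.read_csv("timing.csv")
    velocity = data["distance"] / data["time"]

In Python 2 the result depends on whether 'time' and 'distance' (measured, say, in seconds and meters) are stored as integers. In Python 3 the result is correct in both cases, because the result of division is a float.

Another case is floor division, which is now an explicit operation:

    n_gifts = money // gift_price  # correct for int and float arguments

Note that this applies both to built-in types and to custom types provided by data packages (e.g. numpy or pandas).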
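A few concrete values make the two operators easy to compare. The numbers below are arbitrary and chosen only for illustration:

```python
true_ratio = 7 / 2                   # float result even for int operands
floor_int = 7 // 2                   # floor division keeps the int type
floor_float = 7.5 // 2               # floor division also works on floats
floor_negative = -7 // 2             # rounds toward negative infinity, not toward zero
n_gifts, change = divmod(100, 30)    # quotient and remainder in one call
```

The behavior for negative operands is worth remembering: floor division is not the same as truncation.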

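Strict ordering

    # All these comparisons are illegal in Python 3
    3 < "3"
    2 < None
    (3, 4) < (3, None)
    (4, 5) < [4, 5]

    # False in both Python 2 and Python 3
    (4, 5) == [4, 5]

It prevents accidental sorting of instances of different types:

    sorted([2, "1", 3])  # invalid for Python 3, in Python 2 returns [2, 3, "1"]

It helps to spot problems that arise when processing raw data.

Side note: check for None properly (this is needed in both versions of Python):

    if a is not None:
        pass

    if a:  # WRONG check for None
        pass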
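When mixed data really has to be sorted, Python 3 makes you state the rule explicitly through a key function. A minimal sketch, with made-up values standing in for a raw data column:

```python
values = [2, "1", 3]   # mixed types, e.g. read from a raw CSV column

try:
    sorted(values)
    mixed_sort_failed = False
except TypeError:
    mixed_sort_failed = True

# one explicit choice: compare everything as integers
as_numbers = sorted(values, key=int)

# None needs its own rule; here missing values are placed last
scores = [0.7, None, 0.2]
none_last = sorted(scores, key=lambda s: (s is None, 0.0 if s is None else s))
```

The key only decides the order; the original objects (including the string "1") are returned unchanged.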

Unicode for natural language processing (NLP)

    s = "您好"
    print(len(s))
    print(s[:2])

Output:

    Python 2:
    6
    ��

    Python 3:
    2
    您好

    x = u"со"
    x += "co"  # ok
    x += "со"  # fail

Python 2 fails here, while Python 3 works as expected (because I used Russian letters in the strings).

In Python 3, str is a unicode string, which makes NLP on non-English text much more convenient.

There are other funny things, for example:

    "a" < type < u"a"  # Python 2: True
    "a" < u"a"         # Python 2: False

    from collections import Counter
    Counter("Möbelstück")

Python 2: Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})

Python 3: Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})

All of this can be handled properly in Python 2, but Python 3 is friendlier.
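The difference between the two Counter results comes down to characters versus bytes. The following sketch makes that explicit in Python 3 by encoding the string by hand; the escapes are used only so that the example does not depend on the encoding of this page.

```python
from collections import Counter

text = "M\u00f6belst\u00fcck"        # 'Möbelstück' written with escapes
encoded = text.encode("utf-8")       # bytes, roughly what a Python 2 str held

n_characters = len(text)             # counts code points
n_bytes = len(encoded)               # counts bytes; ö and ü take two bytes each in UTF-8
char_counts = Counter(text)
round_trip = encoded.decode("utf-8")
```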

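dicts and **kwargs preserve order

In CPython 3.6+ dicts behave like OrderedDict by default (and this is guaranteed by the language since Python 3.7). Order is therefore preserved in dict comprehensions and in other operations such as json serialization and deserialization.

    import json
    x = {str(i): i for i in range(5)}
    json.loads(json.dumps(x))
    # Python 2
    {u"1": 1, u"0": 0, u"3": 3, u"2": 2, u"4": 4}
    # Python 3
    {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4}

The same applies to **kwargs (in Python 3.6+): they are kept in the order in which they appear in the call. Order is crucial in data pipelines, and previously we had to write it in a cumbersome way:

    from torch import nn

    # Python 2
    model = nn.Sequential(OrderedDict([
        ("conv1", nn.Conv2d(1, 20, 5)),
        ("relu1", nn.ReLU()),
        ("conv2", nn.Conv2d(20, 64, 5)),
        ("relu2", nn.ReLU())
    ]))

    # Python 3.6+, how it *can* be done, not supported right now in pytorch
    model = nn.Sequential(
        conv1=nn.Conv2d(1, 20, 5),
        relu1=nn.ReLU(),
        conv2=nn.Conv2d(20, 64, 5),
        relu2=nn.ReLU(),
    )

Did you notice? Uniqueness of names is also checked automatically.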
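Both properties, ordering and the uniqueness check, can be observed without any framework. In this sketch build_pipeline is a hypothetical helper and the layer names are plain strings rather than real layers:

```python
import json

def build_pipeline(**steps):
    """Return step names in the order the caller wrote them (hypothetical helper)."""
    return list(steps)

order = build_pipeline(conv1="Conv2d", relu1="ReLU", conv2="Conv2d", relu2="ReLU")

# a json round trip keeps the insertion order of the keys as well
round_trip_keys = list(json.loads(json.dumps({str(i): i for i in range(5)})))

try:
    # a repeated keyword is rejected when the call is compiled, before the function runs
    eval('build_pipeline(conv1="a", conv1="b")')
    duplicate_rejected = False
except SyntaxError:
    duplicate_rejected = True
```

eval is used here only so that the invalid call can sit inside a runnable example; written directly in a file, the repeated keyword would stop the whole file from compiling.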

Iterable unpacking

    # handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
    model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)

    # picking two last values from a sequence
    *prev, next_to_last, last = values_history

    # This also works with any iterables, so if you have a function that yields e.g. qualities,
    # below is a simple way to take only last two values from a list
    *prev, next_to_last, last = iter_train(args)
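A runnable variant of the same idea, with iter_qualities as a made-up generator standing in for a training loop:

```python
def iter_qualities():
    """Hypothetical generator standing in for a training loop."""
    yield from [0.61, 0.70, 0.74, 0.79]

first, *middle, last = [1, 2, 3, 4, 5]
*_, next_to_last, final = iter_qualities()   # works on generators, consuming them fully
head, *rest = "ab"                           # the starred name always receives a list
only, *empty = [42]                          # the list may be empty
```

Keep in mind that the starred name collects everything else into a list, so for very long iterators this holds all skipped items in memory.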

Default pickle engine provides better compression for arrays

    # Python 2
    import cPickle as pickle
    import numpy
    print len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
    # result: 23691675

    # Python 3
    import pickle
    import numpy
    len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
    # result: 8000162

Three times less space, and it is much faster. In fact, similar compression (but not speed) is achievable with the protocol=2 parameter, but users typically ignore this option (or are simply not aware of it).
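The effect of the protocol can be reproduced with the standard library alone. This sketch pickles a plain list of floats rather than a numpy array, so the exact sizes and ratio will differ from the numbers quoted above; it only shows the direction of the difference.

```python
import pickle
import random

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1_000)]

size_text = len(pickle.dumps(data, protocol=0))     # ASCII-based, the Python 2 default
size_binary = len(pickle.dumps(data, protocol=2))   # binary, also available in Python 2
size_default = len(pickle.dumps(data))              # whatever this interpreter defaults to
restored = pickle.loads(pickle.dumps(data))
```

Protocol 0 writes each float as text, while the binary protocols store it in 8 bytes, which is where most of the saving comes from.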

Safer comprehensions

    labels = <initial_value>
    predictions = [model.predict(data) for data, labels in dataset]

    # labels are overwritten in Python 2
    # labels are not affected by comprehension in Python 3
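The scoping rule can be checked directly. In this sketch the dataset is a made-up list of (data, label) pairs; note that a plain for loop still rebinds outer names in Python 3, only comprehensions got their own scope.

```python
labels = "initial value"
dataset = [("x1", 0), ("x2", 1)]   # hypothetical (data, label) pairs

# 'labels' inside the comprehension is local to it
inputs = [data for data, labels in dataset]

# a plain for loop still leaks its loop variables into the enclosing scope
for data, label_in_loop in dataset:
    pass
```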

Super, simply super()

In Python 2, super(...) was a frequent source of mistakes in code.

    # Python 2
    class MySubClass(MySuperClass):
        def __init__(self, name, **options):
            super(MySubClass, self).__init__(name="subclass", **options)

    # Python 3
    class MySubClass(MySuperClass):
        def __init__(self, name, **options):
            super().__init__(name="subclass", **options)

More on super and the method resolution order can be found on StackOverflow.
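A small runnable hierarchy shows the zero-argument form working across several levels. The class names Base, Child, and Logged are invented for the example:

```python
class Base:
    def __init__(self, name, **options):
        self.name = name
        self.options = options

class Child(Base):
    def __init__(self, name, **options):
        # zero-argument form: no class name to keep in sync after a rename
        super().__init__(name=f"{name} (subclass)", **options)

class Logged(Child):
    def __init__(self, name, **options):
        super().__init__(name, verbose=True, **options)

model = Logged("net", depth=3)
mro_names = [cls.__name__ for cls in Logged.__mro__]   # the order super() follows
```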

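Better IDE suggestions with variable annotations

The most enjoyable thing about programming in languages like Java and C# is that the IDE can make very good suggestions, because the type of each identifier is known before the program runs.

In Python this is hard to achieve, but variable annotations help you to

write your expectations in a clear form

get good suggestions from the IDE

[Figure: an example of variable annotations in PyCharm] This works even when the functions you use are not annotated (e.g. for backward compatibility).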

Multiple unpacking

Here is how you merge two dicts now:

    x = dict(a=1, b=2)
    y = dict(b=3, d=4)
    # Python 3.5+
    z = {**x, **y}
    # z = {"a": 1, "b": 3, "d": 4}, note that value for `b` is taken from the latter dict.

See this thread on StackOverflow for a comparison with Python 2.

The same approach also works for lists, tuples, and sets (a, b, c are any iterables):

    [*a, *b, *c]  # list, concatenating
    (*a, *b, *c)  # tuple, concatenating
    {*a, *b, *c}  # set, union

Functions also support this for *args and **kwargs:

    # Python 3.5+
    do_something(**{**default_settings, **custom_settings})

    # Also possible, this code also checks there is no intersection between keys of dictionaries
    do_something(**first_args, **second_args)
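The two call styles differ in how they treat overlapping keys, which the following sketch demonstrates. do_something here is a hypothetical function that simply returns its keyword arguments, and the settings are made up:

```python
default_settings = {"kernel": "rbf", "degree": 3}
custom_settings = {"degree": 2, "gamma": 4}

def do_something(**settings):
    """Hypothetical consumer that simply returns what it received."""
    return settings

# merging first: the later dict silently wins
merged = do_something(**{**default_settings, **custom_settings})

try:
    # unpacking both directly: the duplicated 'degree' key is an error
    do_something(**default_settings, **custom_settings)
    overlap_detected = False
except TypeError:
    overlap_detected = True

combined = [*range(2), *"ab", *{"k": 1}]   # any iterables; a dict contributes its keys
```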

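Future-proof APIs with keyword-only arguments

Consider this snippet:

    model = sklearn.svm.SVC(2, "poly", 2, 4, 0.5)

Obviously, the author of this code has not yet picked up the Python coding style (most probably having just jumped from C++ or Rust). Unfortunately, this is not just a question of taste, because changing the argument order (adding or deleting arguments) in SVC will break this code. In particular, sklearn from time to time reorders or renames the numerous parameters of its algorithms to provide a consistent API, and each such refactoring may break code.

In Python 3, library authors may require explicitly named parameters by using *:

    class SVC(BaseSVC):
        def __init__(self, *, C=1.0, kernel="rbf", degree=3, gamma="auto", coef0=0.0, ... )

Users now have to name the parameters explicitly, e.g. sklearn.svm.SVC(C=2, kernel="poly", degree=2, gamma=4, coef0=0.5).

This mechanism provides a great combination of reliability and flexibility for APIs.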
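The mechanism is easy to try on a toy function. fit_svc below is a hypothetical stand-in with keyword-only parameters, not the real sklearn constructor:

```python
def fit_svc(*, C=1.0, kernel="rbf", degree=3):
    """Hypothetical stand-in for a constructor with keyword-only parameters."""
    return {"C": C, "kernel": kernel, "degree": degree}

params = fit_svc(C=2, kernel="poly")

try:
    fit_svc(2, "poly")          # positional use is rejected
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Since callers can never rely on position, the library author is free to reorder or insert parameters after the bare * without breaking anyone.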

Minor: constants in the math module

    # Python 3
    math.inf  # "largest" number
    math.nan  # not a number

    max_quality = -math.inf  # no more magic initial values!

    for model in trained_models:
        max_quality = max(max_quality, compute_quality(model, data))
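A self-contained version of the loop, with made-up quality scores in place of trained models, plus one caveat about nan that is easy to trip over:

```python
import math

qualities = [0.61, 0.74, 0.69]      # made-up scores for illustration

max_quality = -math.inf
for quality in qualities:
    max_quality = max(max_quality, quality)

nan_equals_itself = math.nan == math.nan   # nan never compares equal, even to itself
nan_detected = math.isnan(math.nan)        # the reliable way to test for nan
```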

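Minor: single integer type

Python 2 provides two basic integer types: int (a 64-bit signed integer) and long for arbitrary-precision arithmetic (quite confusing after C++).

Python 3 has a single int type, which incorporates arbitrary-precision arithmetic.

Here is how you check that a value is an integer:

    isinstance(x, numbers.Integral)  # Python 2, the canonical way
    isinstance(x, (long, int))       # Python 2
    isinstance(x, int)               # Python 3, easier to remember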
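A quick check in Python 3 confirms that values beyond 64 bits are still ordinary ints. The last line is a small caveat of my own rather than a point from the article:

```python
import numbers

big = 2 ** 100                                    # exceeds 64 bits, still a plain int
is_int = isinstance(big, int)
is_integral = isinstance(big, numbers.Integral)   # the abstract check still works
bits = big.bit_length()
bool_is_int = isinstance(True, int)               # bool subclasses int, which can surprise type checks
```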

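Other stuff

Enums are theoretically useful, but:

string-typing is already widely adopted in the Python data stack

Enums do not seem to interplay with numpy or with the categorical type from pandas

Coroutines also sound very promising for data pipelining (see the slides by David Beazley), but I have not yet seen them adopted in the wild.

Python 3 has a stable ABI.

Python 3 supports unicode identifiers (so ω = Δφ / Δt is fine), but you had better stick to good old ASCII names.

Some libraries, e.g. jupyterhub (jupyter in the cloud), django, and recent ipython, support only Python 3, so features that sound useless to you are useful for libraries that you will probably want to use at some point.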

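Migration problems specific to data science (and how to resolve them)

Support for nested arguments was dropped:

    map(lambda x, (y, z): x, z, dict.items())

However, nested unpacking still works perfectly well in comprehensions:

    {x: z for x, (y, z) in d.items()}

In general, comprehensions are also easier to 'translate' between Python 2 and Python 3.

map() returns an iterator, and .keys(), .values(), .items() return views rather than lists. The main problems with iterators are:

no trivial slicing

they cannot be iterated twice

Almost all of these problems are resolved by converting the result to a list.

In case of trouble, see Python FAQ: How do I port to Python 3?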
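The two iterator pitfalls listed above, and the usual workarounds, can be seen in a few lines. The dictionary contents are arbitrary sample data:

```python
from itertools import islice

squares = map(lambda v: v * v, range(5))
first_pass = list(squares)           # converting to a list gives slicing and re-iteration
second_pass = list(squares)          # the map iterator itself is already exhausted

d = {"a": 1, "b": 2, "c": 3}
keys = d.keys()                      # a view: re-iterable, but not indexable
try:
    keys[0]
    view_indexable = True
except TypeError:
    view_indexable = False

first_two = list(islice(d.items(), 2))   # "slicing" without building the full list
```

islice is a useful alternative to list(...) when the underlying iterator is long and only a prefix is needed.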

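Main problems for teaching machine learning and data science with Python

Course authors should spend time early on explaining what an iterator is and why it cannot be sliced, concatenated, multiplied, or iterated twice like a string (and how to deal with that).

I think most course authors would happily skip these details, but now that is hardly possible.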

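Conclusion

Python 2 and Python 3 have co-existed for almost 10 years, but we should move to Python 3 now.

After moving to a Python 3-only codebase, research and production code should become shorter, more readable, and safer.

Right now most libraries support both Python versions. I can't wait for the bright moment when packages drop support for Python 2 and enjoy the new language features.

Subsequent migrations are promised to be smoother: "we will never do this kind of backwards-incompatible change again".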

References

Key differences between Python 2.7 and Python 3.x

Python FAQ: How do I port to Python 3?

10 awesome features of Python that you can't use because you refuse to upgrade to Python 3

Trust me, python 3.3 is better than 2.7 (video)

Python 3 for scientists
