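Tutorial | Training a Reinforcement Learning Agent on Chrome's Dino Run: Top Score Above 4,000

Technology, 06-03

Selected from Paperspace

Author: Ravi Munde

Compiled by 機器之心 (Synced)

Contributor: Panda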

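Reinforcement learning is one of the hottest research directions in artificial intelligence today, and its progress on game-playing agents has been especially striking. Ravi Munde, a master's student at Northeastern University, recently published a write-up of how he built a reinforcement learning agent for Dino Run. Dino Run is a hidden mini-game in the Chrome browser: when your browser loses its network connection, the little dinosaur appears on screen, and pressing the ↑ arrow key starts the game.

DeepMind's 2013 paper "Playing Atari with Deep Reinforcement Learning" introduced a new deep learning model for reinforcement learning and showed that it could learn different control policies for Atari 2600 games using only raw pixels as input. In this tutorial I implement that paper with Keras. I start with the basics of reinforcement learning and then dive into the code for a hands-on understanding.

AI playing Dino Run

I started this project in early March 2018 and got some good results. However, that CPU-only system could not learn more features; a powerful GPU improves performance significantly.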

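There are a number of steps and concepts to understand before we get to a working model.

Steps:

Build a two-way interface between the browser (JavaScript) and the model (Python)

Capture and preprocess images

Train the model

Evaluate

Source code: https://github.com/Paperspace/DinoRunTutorial.git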

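Getting started

To train and play the game as-is, clone this GitHub repository after setting up the environment:

git clone https://github.com/Paperspace/DinoRunTutorial.git

Then work in the Jupyter notebook

Reinforcement Learning Dino Run.ipynb

Make sure you run init_cache() first to initialize the file system structure.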

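Reinforcement learning

A child learning to walk

The term may be new to many, but every one of us learned to walk using the idea of reinforcement learning (RL), and our brains still work this way. A reward system is the basis of any RL algorithm. Back to the child-walking analogy: a positive reward might be applause from the parents or a piece of candy; a negative reward would be no candy. The child also has to learn to stand before starting to walk. In AI terms, the main goal of the agent (here, our Dino) is to maximize a numerical reward by performing a particular sequence of actions in the environment. The biggest challenge in RL is the absence of supervision (labeled data) to guide the agent. It has to explore and learn on its own. The agent starts by acting randomly, observes the reward each action produces, and learns to predict the best possible action when it faces a similar state of the environment.

A vanilla reinforcement learning framework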

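Q-learning

Q-learning is an RL technique in which we try to approximate a function that yields an action-selection policy for any sequence of environment states. It is a model-free form of RL that maintains a table of Q-values indexed by state and action, updated from the rewards received. A sample Q-table gives an idea of how the data is structured. In our case the states are game screenshots and the actions are do nothing and jump [0, 1].

A sample Q-table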
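To make the table structure concrete, here is a minimal sketch of a tabular Q-learning update in plain Python. The state names, the learning rate ALPHA and the discount factor GAMMA are illustrative assumptions, not values from the original project, which replaces the table with a neural network.

```python
from collections import defaultdict

ALPHA = 0.5        # learning rate (assumed for illustration)
GAMMA = 0.9        # discount factor (assumed for illustration)
ACTIONS = [0, 1]   # 0 => do nothing, 1 => jump

# Q-table: maps (state, action) to an estimated value, defaulting to 0.0
q_table = defaultdict(float)

def q_update(state, action, reward, next_state, terminal):
    """Apply one Q-learning update and return the new Q-value."""
    if terminal:
        target = reward
    else:
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
    return q_table[(state, action)]

def best_action(state):
    """Return the action with the highest Q-value in this state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

# Hypothetical states: jumping near a cactus survives, doing nothing crashes
q_update("cactus_near", 1, 0.1, "clear", terminal=False)
q_update("cactus_near", 0, -1, "clear", terminal=True)
```

After these two updates the table prefers jumping in the hypothetical "cactus_near" state, which is the behavior the rewards are meant to encourage.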

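We use a deep neural network to solve this as a regression problem and then choose the action with the highest predicted Q-value. For a detailed treatment of Q-learning, see this excellent article by Tambet Matiisen: https://ai.intel.com/demystifying-deep-reinforcement-learning/. You can also refer to my earlier article for all the Q-learning hyperparameters: https://medium.com/acing-ai/how-i-build-an-ai-to-play-dino-run-e37f37bdf153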
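For reference, the regression target in the standard deep Q-learning formulation can be written as follows. The symbols (transition s, a, r, s', discount factor γ, network weights θ) are standard notation rather than names taken from the original code; the formula matches the mean squared error loss and the reward + GAMMA * max(Q) target used in the training listing of this tutorial.

```latex
y =
\begin{cases}
r & \text{if } s' \text{ is terminal} \\
r + \gamma \max_{a'} Q(s', a'; \theta) & \text{otherwise}
\end{cases}
\qquad
L(\theta) = \bigl( y - Q(s, a; \theta) \bigr)^2
```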

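Setup

First, set up the environment needed for training.

1. Choose a virtual machine (VM): we need a full desktop environment in which we can capture screenshots and use them for training. I chose a Paperspace ML-in-a-box (MLIAB) Ubuntu image. The advantage of MLIAB is that it comes with Anaconda and many other machine learning libraries preinstalled.

ML-in-a-box (MLIAB)

2. Configure and install Keras to use the GPU

We need to install Keras and the GPU version of TensorFlow. Paperspace VMs have these preinstalled; if yours does not, install them with: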

pip install keras
pip install tensorflow

Also make sure the GPU is recognized. Run the following Python code; you should see the available GPU devices:

from keras import backend as K
K.tensorflow_backend._get_available_gpus()

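3. Install dependencies

Selenium: pip install selenium

OpenCV: pip install opencv-python

Download Chromedriver: http://chromedriver.chromium.org

Game framework

You can launch the game by pointing your browser at chrome://dino or simply by unplugging the network cable. Another option is to extract the game from Chromium's open-source repository, which is useful if we want to modify the game code.

Our model is written in Python and the game is built in JavaScript. We need some interfacing tools for the two to communicate.

Selenium, a popular browser automation tool, is used to send actions to the browser and to read game parameters such as the current score.

Now that we have an interface for sending actions to the game, we also need a mechanism for capturing the game screen.

Selenium and OpenCV gave the best performance for screen capture and image preprocessing respectively, reaching a frame rate of 6-7 FPS.

We only need 4 FPS, so that is sufficient.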
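One way to check a frame-rate figure like this on your own machine is to time repeated calls to the capture function. The sketch below is an assumption about how such a measurement could be made, not the method used in the original project; fake_capture is a hypothetical stand-in that you would replace with the real screen grab.

```python
import time

def measure_fps(capture, n_frames=20):
    """Call capture() n_frames times and return the achieved frames per second."""
    start = time.perf_counter()
    for _ in range(n_frames):
        capture()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed

def fake_capture():
    # Hypothetical stand-in for a real screen grab; sleeps about 10 ms per frame
    time.sleep(0.01)

fps = measure_fps(fake_capture)
meets_target = fps >= 4  # the tutorial needs roughly 4 FPS
```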

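Game module

This module implements the interface between Python and JavaScript. The snippet below should give you a glimpse of what it does.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# chrome_driver_path and game_url are defined elsewhere in the notebook

class Game:
    def __init__(self):
        self._driver = webdriver.Chrome(executable_path=chrome_driver_path)
        self._driver.set_window_position(x=-10, y=0)
        self._driver.get(game_url)

    def restart(self):
        self._driver.execute_script("Runner.instance_.restart()")

    def press_up(self):
        # Selenium 3 style call; Selenium 4 replaces it with find_element(By.TAG_NAME, "body")
        self._driver.find_element_by_tag_name("body").send_keys(Keys.ARROW_UP)

    def get_score(self):
        # digits is a JavaScript array of digit characters, e.g. [1, 0, 0] for a score of 100
        score_array = self._driver.execute_script("return Runner.instance_.distanceMeter.digits")
        score = "".join(score_array)
        return int(score)

    def get_crashed(self):
        # Not shown in the original excerpt but called by the agent module below;
        # assumed here to read the game's Runner.instance_.crashed flag
        return self._driver.execute_script("return Runner.instance_.crashed")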
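The score conversion in get_score can be tried without a browser. This is a small sketch that assumes the digits arrive as a list of digit strings (or integers); the helper name digits_to_score is hypothetical. An empty list would raise ValueError, so a real caller may want to guard against that.

```python
def digits_to_score(digits):
    """Join a list of digits, e.g. ["0", "0", "1", "2", "3"], into an integer score."""
    return int("".join(str(d) for d in digits))

score = digits_to_score(["0", "0", "1", "2", "3"])
```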

Agent module

The agent module wraps all the interfacing. We use it to control the Dino and to read the agent's status in the environment.

class DinoAgent:
    def __init__(self, game):  # takes game as input for taking actions
        self._game = game
        self.jump()  # to start the game, we need to jump once

    def is_crashed(self):
        return self._game.get_crashed()

    def jump(self):
        self._game.press_up()

Game state module

To send actions to the module and get the state that the environment transitions to as a result, we use the game state module. It simplifies the process by receiving and performing actions, deciding the reward, and returning the experience tuple.

class Game_sate:
    def __init__(self, agent, game):
        self._agent = agent
        self._game = game

    def get_state(self, actions):
        score = self._game.get_score()
        reward = 0.1  # survival reward
        is_over = False  # game over
        if actions[1] == 1:  # else do nothing
            self._agent.jump()
        image = grab_screen(self._game._driver)
        if self._agent.is_crashed():
            self._game.restart()
            reward = -1
            is_over = True
        return image, reward, is_over  # return the experience tuple
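The reward rule in get_state (+0.1 for every step survived, -1 on a crash) can be isolated and exercised without a browser. The following is a sketch under that reading of the listing; the function names are hypothetical.

```python
def step_reward(crashed, survival_reward=0.1, crash_penalty=-1):
    """Return (reward, is_over) for one step, mirroring the rule in get_state."""
    if crashed:
        return crash_penalty, True
    return survival_reward, False

def episode_return(crash_flags):
    """Sum rewards over per-step crash flags, stopping at the first crash."""
    total = 0.0
    for crashed in crash_flags:
        reward, is_over = step_reward(crashed)
        total += reward
        if is_over:
            break
    return total

# Three surviving steps followed by a crash
total = episode_return([False, False, False, True])
```

With these values a crash costs as much as ten surviving steps earn, so the undiscounted return of a short episode is negative.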

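Image pipeline

Image capture

There are many ways to capture the game screen, such as using the PIL and MSS Python libraries to take a screenshot of the entire screen and crop the region of interest. The biggest drawback, however, is sensitivity to screen resolution and window position. Fortunately, the game uses an HTML canvas, so we can easily get a base64-encoded image using JavaScript. We run this script through Selenium.

// JavaScript code to get the image data from the canvas
var canvas = document.getElementsByClassName("runner-canvas")[0];
var img_data = canvas.toDataURL();
return img_data;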

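Image extracted from the canvas

import base64
from io import BytesIO

import numpy as np
from PIL import Image

def grab_screen(_driver=None):
    # getbase64Script holds the JavaScript shown above
    image_b64 = _driver.execute_script(getbase64Script)
    # toDataURL() returns "data:image/png;base64,<payload>"; keep only the payload
    image_b64 = image_b64.split(",", 1)[-1]
    screen = np.array(Image.open(BytesIO(base64.b64decode(image_b64))))
    image = process_img(screen)  # processing image as required
    return image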
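canvas.toDataURL() returns a data URL, that is, a string of the form data:image/png;base64,<payload>, so the prefix has to be dropped before base64 decoding. The sketch below shows that step in isolation using only the standard library; the helper name data_url_to_bytes is hypothetical, and the fake URL carries just the 8-byte PNG signature rather than a real screenshot.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def data_url_to_bytes(data_url):
    """Strip an optional "data:...;base64," prefix and decode the payload."""
    payload = data_url.split(",", 1)[-1]
    return base64.b64decode(payload)

# Build a fake data URL whose payload is only the PNG signature
fake_url = "data:image/png;base64," + base64.b64encode(PNG_SIGNATURE).decode("ascii")
decoded = data_url_to_bytes(fake_url)
```

Splitting on the first comma is safe here because the base64 alphabet does not contain a comma, and a string without any prefix passes through unchanged.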

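Image processing

The raw captured image has a resolution of about 600×150 with 3 (RGB) channels. We intend to use 4 consecutive screenshots as a single input to the model, which makes a single input as large as 600×150×3×4. That is computationally expensive, and not all of the features are useful for playing the game. So we use the OpenCV library to resize, crop, and process the image. The final processed input is only 80×80 pixels and has a single (grayscale) channel.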
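The size reduction is easy to quantify. This short calculation uses only the dimensions quoted above; the variable names are mine.

```python
raw_values = 600 * 150 * 3 * 4        # width x height x RGB channels x stacked frames
processed_values = 80 * 80 * 1 * 4    # 80x80 grayscale x stacked frames
reduction = raw_values / processed_values
```

That is 1,080,000 values per input before processing against 25,600 after, roughly a 42-fold reduction.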

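import cv2

def process_img(image):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # convert to grayscale
    image = image[:300, :500]  # crop the region of interest
    image = cv2.resize(image, (80, 80))  # resize to the 80x80 input described in the text
    return image

Image processing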

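Model architecture

So we have the input and a way to use the model's output to play the game; now let's look at the model architecture.

We use three convolutional layers connected in series, then flatten them into a dense layer and an output layer. The CPU-only model did not include pooling layers because I had removed many features, and adding pooling would have caused significant loss of already sparse features. With a GPU, however, we can accommodate more features without any drop in frame rate.

Max pooling layers significantly improve the processing of a dense feature set.

Model architecture

Our output layer consists of two neurons, each representing the maximum predicted reward for one action. We then choose the action with the maximum reward (Q-value).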

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense
from keras.optimizers import Adam

# img_cols, img_rows, img_channels, ACTIONS and LEARNING_RATE are defined elsewhere in the notebook

def buildmodel():
    print("Now we build the model")
    model = Sequential()
    model.add(Conv2D(32, (8, 8), padding="same", strides=(4, 4),
                     input_shape=(img_cols, img_rows, img_channels)))  # 80*80*4
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Activation("relu"))
    model.add(Conv2D(64, (4, 4), strides=(2, 2), padding="same"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Activation("relu"))
    model.add(Conv2D(64, (3, 3), strides=(1, 1), padding="same"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Activation("relu"))
    model.add(Flatten())
    model.add(Dense(512))
    model.add(Activation("relu"))
    model.add(Dense(ACTIONS))  # one output (Q-value) per action
    adam = Adam(lr=LEARNING_RATE)
    model.compile(loss="mse", optimizer=adam)
    print("We finish building the model")
    return model
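To see how quickly the strided convolutions and pooling layers shrink the input, the spatial size can be traced by hand. The sketch below assumes an 80×80 input and the usual size rules: a convolution with padding="same" gives ceil(size / stride), and 2×2 max pooling with its default stride and no padding gives floor((size - 2) / 2) + 1. The helper names are hypothetical.

```python
import math

def conv_same(size, stride):
    """Output size of a convolution with padding="same"."""
    return math.ceil(size / stride)

def pool_valid(size, pool=2):
    """Output size of max pooling with pool size = stride = pool and no padding."""
    return (size - pool) // pool + 1

def trace_shapes(size=80):
    """Follow one spatial dimension through the three conv + pool stages."""
    shapes = [size]
    for stride in (4, 2, 1):          # strides of the three Conv2D layers
        size = conv_same(size, stride)
        shapes.append(size)
        size = pool_valid(size)
        shapes.append(size)
    return shapes

shapes = trace_shapes(80)
flattened = shapes[-1] ** 2 * 64      # the last conv layer has 64 filters
```

Under these assumptions the feature map is already 1×1 after the third pooling layer, so the 512-unit dense layer receives only 64 values. This illustrates why pooling can be costly when the features are sparse.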

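Training

Here is what happens during the training phase:

Start with no action and get the initial state (s_t)

Observe gameplay for OBSERVATION number of steps

Predict and perform an action

Store the experience in Replay Memory

Sample a random batch from Replay Memory and train the model on it

Restart if the game is over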
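The store-and-sample steps above can be sketched on their own with the standard library. Note that the capacity limit is an assumption added for illustration: the listing in this tutorial uses an unbounded deque(). The class name ReplayMemory and the string states are hypothetical.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, terminal) tuples."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)  # oldest entries are dropped automatically

    def store(self, transition):
        self._buffer.append(transition)

    def sample(self, batch_size):
        # Copy to a list so that sampling by index is fast
        return random.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)

memory = ReplayMemory(capacity=5)
for step in range(8):
    memory.store((f"s{step}", 0, 0.1, f"s{step + 1}", False))
batch = memory.sample(3)
```

Sampling uniformly from past transitions, rather than training on consecutive frames, breaks up the strong correlation between neighboring steps.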

The code for this is a little lengthy but fairly simple to understand.

import random
from collections import deque

import numpy as np

# ACTIONS, OBSERVATION, INITIAL_EPSILON and BATCH are defined elsewhere in the notebook

def trainNetwork(model, game_state):
    # store the previous observations in replay memory
    D = deque()  # experience replay memory
    # get the first state by doing nothing
    do_nothing = np.zeros(ACTIONS)
    do_nothing[0] = 1  # 0 => do nothing, 1 => jump
    x_t, r_0, terminal = game_state.get_state(do_nothing)  # get next step after performing the action
    # stack 4 copies of the first frame as a placeholder input: 1 x rows x cols x 4
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)
    s_t = s_t.reshape(1, s_t.shape[0], s_t.shape[1], s_t.shape[2])

    OBSERVE = OBSERVATION
    epsilon = INITIAL_EPSILON
    t = 0
    while True:  # endless running
        loss = 0
        Q_sa = 0
        action_index = 0
        r_t = 0  # reward at t
        a_t = np.zeros([ACTIONS])  # action at t

        q = model.predict(s_t)  # input a stack of 4 images, get the prediction
        max_Q = np.argmax(q)  # choosing index with maximum q value
        action_index = max_Q
        a_t[action_index] = 1  # 0 => do nothing, 1 => jump

        # run the selected action and observe the next state and reward
        x_t1, r_t, terminal = game_state.get_state(a_t)
        x_t1 = x_t1.reshape(1, x_t1.shape[0], x_t1.shape[1], 1)  # 1 x rows x cols x 1
        # put the new image at the front of the input stack and drop the oldest one
        s_t1 = np.append(x_t1, s_t[:, :, :, :3], axis=3)

        D.append((s_t, action_index, r_t, s_t1, terminal))  # store the transition

        # only train if done observing; sample a minibatch to train on
        if t > OBSERVE:
            trainBatch(random.sample(D, BATCH))
        s_t = s_t1
        t += 1
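The listing initializes epsilon from INITIAL_EPSILON, but this excerpt always takes the highest-valued action. A common use for such a value in Q-learning is epsilon-greedy exploration, where the agent acts randomly with probability epsilon and epsilon is reduced over time. The sketch below shows that idea under my own assumptions; the function names, final_epsilon and the decay step are hypothetical and not taken from the excerpt.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def anneal(epsilon, final_epsilon, step):
    """Decrease epsilon by a fixed step, never going below final_epsilon."""
    return max(final_epsilon, epsilon - step)

greedy_choice = epsilon_greedy([0.2, 0.9], epsilon=0.0)   # always exploits
explore_choice = epsilon_greedy([0.2, 0.9], epsilon=1.0)  # always explores
```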

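Note that we sample 32 random experiences from the replay memory and train on them as a batch. The reasons are the unbalanced action distribution in the game structure and the need to avoid overfitting.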

def trainBatch(minibatch):
    # model, ACTIONS and GAMMA are assumed to be in scope, as in the notebook
    loss = 0
    first_state = minibatch[0][0]
    # allocate the batch once, before the loop: e.g. 32 x rows x cols x 4
    inputs = np.zeros((len(minibatch), first_state.shape[1], first_state.shape[2], first_state.shape[3]))
    targets = np.zeros((inputs.shape[0], ACTIONS))  # 32, 2
    for i in range(0, len(minibatch)):
        state_t = minibatch[i][0]  # 4D stack of images
        action_t = minibatch[i][1]  # this is the action index
        reward_t = minibatch[i][2]  # reward at state_t due to action_t
        state_t1 = minibatch[i][3]  # next state
        terminal = minibatch[i][4]  # whether the agent died or survived due to the action

        inputs[i:i + 1] = state_t
        targets[i] = model.predict(state_t)  # predicted q values
        Q_sa = model.predict(state_t1)  # predict q values for next step

        if terminal:
            targets[i, action_t] = reward_t  # if terminated, only equals reward
        else:
            targets[i, action_t] = reward_t + GAMMA * np.max(Q_sa)
    # train once on the filled batch
    loss += model.train_on_batch(inputs, targets)
    return loss
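The target construction in trainBatch can be checked in isolation with plain Python lists. This is a sketch of the same rule (reward only on a terminal step, otherwise reward plus the discounted best next Q-value); the function names and the default gamma of 0.99 are assumptions for illustration.

```python
def q_target(reward, next_q_values, terminal, gamma=0.99):
    """Target for the taken action: r if terminal, else r + gamma * max Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

def build_targets(predicted_q, action, reward, next_q_values, terminal, gamma=0.99):
    """Copy the predicted Q-values and overwrite only the entry for the taken action."""
    targets = list(predicted_q)
    targets[action] = q_target(reward, next_q_values, terminal, gamma)
    return targets

crash_targets = build_targets([0.3, 0.4], 1, -1, [9, 9], terminal=True)
alive_target = q_target(0.1, [1.0, 2.0], terminal=False, gamma=0.5)
```

Only the entry for the action actually taken is overwritten. The other entries keep the network's own predictions, so they contribute zero error and the update affects just the observed action.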