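Deep Reinforcement Learning in Detail: Showcasing the New Features of TensorFlow 2.0 (with Code)

News 01-21

[新智元 editor's note] Since TensorFlow officially announced the new features of its 2.0 release, many people may have been left somewhat confused. Blogger Roman Ring therefore wrote an overview article that demonstrates the features of TensorFlow 2.0 concretely by implementing a deep reinforcement learning algorithm.

Practice, as the saying goes, is where real understanding comes from.

The features of TensorFlow 2.0 were announced a while ago, but many people are probably still in the dark about them.

In this tutorial, the author uses deep reinforcement learning (DRL) to showcase the upcoming TensorFlow 2.0 features, specifically by implementing an advantage actor-critic (A2C) agent to solve the classic CartPole-v0 environment.

Although the goal of the article is to showcase TensorFlow 2.0, the author first covers the DRL side, including a brief overview of the field.

In fact, since the main focus of the 2.0 release is making developers' lives easier, i.e. ease of use, now is a great time to get into DRL with TensorFlow.

The complete code for this article is available at:

GitHub: https://github.com/inoryy/tensorflow2-deep-reinforcement-learning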

Google Colab：https://colab.research.google.com/drive/12QvW7VZSzoaF-Org-u-N6aiTdBN5ohNA

Setup

Since TensorFlow 2.0 is still experimental, I recommend installing it in a separate (virtual) environment. I prefer Anaconda, so I'll use it for illustration:

> conda create -n tf2 python=3.6
> source activate tf2
> pip install tf-nightly-2.0-preview # tf-nightly-gpu-2.0-preview for GPU version

Let's quickly verify that everything works as expected:

>>> import tensorflow as tf
>>> print(tf.__version__)
1.13.0-dev20190117
>>> print(tf.executing_eagerly())
True

Don't worry about the 1.13.x version; it just means this is an early preview. What matters here is that we are in eager mode by default!

>>> print(tf.reduce_sum([1, 2, 3, 4, 5]))
tf.Tensor(15, shape=(), dtype=int32)

If you are not yet familiar with eager mode, the gist is that computation is executed at runtime rather than through a pre-compiled graph. You can read more about it in the TensorFlow documentation:

https://www.tensorflow.org/tutorials/eager/eager_basics
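
To make the distinction between the two execution styles more tangible, here is a toy analogy in plain Python. It is only a sketch: the Deferred class is a hypothetical helper invented for this illustration and says nothing about how TensorFlow is actually implemented.

```python
# Toy analogy only: this is NOT how TensorFlow works internally.


class Deferred:
    """A node in a tiny deferred-computation graph (hypothetical helper)."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self):
        # Evaluate the inputs first, then apply this node's function.
        args = [x.run() if isinstance(x, Deferred) else x for x in self.inputs]
        return self.fn(*args)


values = [1, 2, 3, 4, 5]

# "Eager" style: the result exists as soon as the line executes.
eager_total = sum(values)

# "Graph" style: first describe the computation, then run it explicitly.
graph_total = Deferred(sum, values)
graph_doubled = Deferred(lambda x: 2 * x, graph_total)

print(eager_total)          # 15
print(graph_doubled.run())  # 30
```

In the deferred style nothing is computed until run() is called, which is roughly the role a session played in earlier TensorFlow versions.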

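Deep Reinforcement Learning

Generally speaking, reinforcement learning is a high-level framework for solving sequential decision-making problems. An RL agent navigates an environment by taking actions based on some observations, and receives rewards as a result. Most RL algorithms work by maximizing the sum of rewards an agent collects over a trajectory.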
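
The interaction loop described above can be sketched without any RL library. The CoinFlipEnv class below is a made-up toy environment, and its reset/step interface is an assumption chosen only to resemble the Gym-style API used later in this article.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # a single dummy observation

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps += 1
        done = self.steps >= self.episode_length
        return 0, reward, done


def run_episode(env, policy):
    """Roll out one trajectory and return the sum of collected rewards."""
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward


def always_one(obs):
    """A trivial policy that ignores the observation and always guesses 1."""
    return 1


env = CoinFlipEnv()
episode_reward = run_episode(env, always_one)
print(episode_reward)
```

The quantity returned by run_episode is the trajectory reward sum that most RL algorithms try to maximize.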

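The output of an RL-based algorithm is typically a policy: a function that maps states to actions. A valid policy can be as simple as a hard-coded no-op action. A stochastic policy is represented as a conditional probability distribution over actions given a state.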
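
As a rough illustration of both kinds of policy, the sketch below defines a hard-coded no-op policy and a stochastic policy stored as a table of conditional probabilities. The state names, action names and probabilities are all invented for this example.

```python
import random


def noop_policy(state):
    """Deterministic policy: ignore the state and always return the no-op action."""
    return "noop"


# Stochastic policy as a table of conditional probabilities P(action | state).
# States, actions and numbers are made up for illustration.
POLICY_TABLE = {
    "pole_leaning_left": {"push_left": 0.8, "push_right": 0.2},
    "pole_leaning_right": {"push_left": 0.1, "push_right": 0.9},
}


def stochastic_policy(state, rng):
    """Sample an action from the conditional distribution for the given state."""
    actions = list(POLICY_TABLE[state])
    weights = [POLICY_TABLE[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(noop_policy("pole_leaning_left"))
print(stochastic_policy("pole_leaning_left", rng))
```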

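Actor-Critic Methods

RL algorithms are often grouped by the objective function they optimize. Value-based methods such as DQN work by reducing the error of the expected state-action values.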
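
To show what "reducing the error" of a state-action value can look like, here is a minimal tabular sketch of a one-step temporal-difference update. It is not DQN itself, which applies a similar idea with a neural network; the table contents, alpha and gamma are made-up values.

```python
def td_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.5):
    """Move Q(state, action) towards the one-step bootstrapped target."""
    target = reward + gamma * max(q_table[next_state].values())
    td_error = target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return td_error


# Made-up two-state table for illustration.
q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 2.0, "right": 1.0},
}
error = td_update(q_table, "s0", "right", reward=1.0, next_state="s1")
print(error, q_table["s0"]["right"])  # 2.0 1.0
```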

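Policy gradient methods optimize the policy itself directly by adjusting its parameters, typically via gradient descent. Computing the gradient exactly is usually intractable, so it is typically estimated with Monte Carlo methods.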
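
A small example may help here. The sketch below estimates the policy gradient for a two-armed bandit with a softmax policy by sampling actions, and compares it with the closed-form gradient that happens to be available in this tiny case. The bandit, its rewards and the sample count are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_policy_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo estimate of d E[reward] / d logits for a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
        reward = rewards[action]
        for k in range(len(logits)):
            # gradient of log pi(action) w.r.t. logit k is 1[action == k] - pi(k)
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            grad[k] += reward * grad_log_prob / num_samples
    return grad


def exact_policy_gradient(logits, rewards):
    """Closed form for comparison: pi(k) * (reward(k) - expected reward)."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


logits, rewards = [0.0, 0.0], [1.0, 0.0]
estimate = mc_policy_gradient(logits, rewards, num_samples=20000, rng=random.Random(0))
exact = exact_policy_gradient(logits, rewards)
print(estimate, exact)
```

Moving the logits along the estimated gradient raises the probability of the rewarding action; in practice this is usually phrased as gradient descent on the negated objective.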

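The most popular approach is a hybrid of the two: actor-critic methods, where the agent's policy is optimized through policy gradients, while a value-based method is used as a bootstrap for the expected value estimates.

Deep Actor-Critic Methods

While much of the fundamental RL theory was developed for the tabular case, modern RL is almost exclusively done with function approximators such as artificial neural networks. Specifically, an RL algorithm is considered "deep" if the policy and value functions are approximated with deep neural networks.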

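(Asynchronous) Advantage Actor-Critic

Over the years, a number of improvements have been added to address the sample efficiency and stability of the learning process.

First, gradients are weighted with returns: discounted future rewards, which somewhat alleviates the credit assignment problem and resolves theoretical issues with infinite timesteps.

Second, an advantage function is used instead of raw returns. The advantage is formed by the difference between the return and some baseline (e.g. a state-action estimate) and can be thought of as a measure of how good a given action is compared to some average.

Third, an additional entropy maximization term is used in the objective function to ensure the agent sufficiently explores various policies. In essence, entropy measures how random a probability distribution is, and it is maximized by the uniform distribution.

Finally, multiple workers are used in parallel to speed up sample gathering while helping to decorrelate the samples during training.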
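
The first three improvements can be illustrated with a few lines of plain Python. This is only a sketch with made-up rewards, baseline values and discount factor; it is not the implementation used later in the article.

```python
import math


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of a finished episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


rewards = [1.0, 1.0, 1.0]
value_estimates = [1.0, 1.0, 1.0]  # made-up baseline values V(s_t)

returns = discounted_returns(rewards, gamma=0.5)
# advantage = return - baseline
advantages = [g - v for g, v in zip(returns, value_estimates)]
print(returns)     # [1.75, 1.5, 1.0]
print(advantages)  # [0.75, 0.5, 0.0]

# The uniform distribution has the highest entropy for a given number of actions.
print(entropy([0.5, 0.5]) > entropy([0.9, 0.1]) > entropy([1.0, 0.0]))  # True
```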

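Incorporating all of these changes with deep neural networks, we arrive at two of the most popular modern algorithms: (asynchronous) advantage actor-critic, or A3C/A2C for short. The difference between the two is more technical than theoretical: as the name suggests, it boils down to how the parallel workers estimate their gradients and propagate them to the model.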

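With this I will wrap up our tour of DRL methods, as the focus of this blog post is on TensorFlow 2.0 features. Don't worry if you are still unsure about the subject; things should become clearer with the code examples.

Advantage Actor-Critic with TensorFlow 2.0

Let's see what it takes to implement the basis of many modern DRL algorithms: the actor-critic agent described in the previous section. For simplicity, we won't implement parallel workers, though most of the code supports them. Interested readers can treat that as an exercise.

As a testbed we will use the CartPole-v0 environment. It is somewhat simplistic, but it is still a great option to get started with.

Policy & Value via Keras Model API

First, let's create the policy and value estimate neural networks under a single model class: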

import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as kl

class ProbabilityDistribution(tf.keras.Model):
    def call(self, logits):
        # sample a random categorical action from given logits
        return tf.squeeze(tf.random.categorical(logits, 1), axis=-1)

class Model(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__("mlp_policy")
        # no tf.get_variable(), just simple Keras API
        self.hidden1 = kl.Dense(128, activation="relu")
        self.hidden2 = kl.Dense(128, activation="relu")
        self.value = kl.Dense(1, name="value")
        # logits are unnormalized log probabilities
        self.logits = kl.Dense(num_actions, name="policy_logits")
        self.dist = ProbabilityDistribution()

    def call(self, inputs):
        # inputs is a numpy array, convert to Tensor
        x = tf.convert_to_tensor(inputs, dtype=tf.float32)
        # separate hidden layers from the same input tensor
        hidden_logs = self.hidden1(x)
        hidden_vals = self.hidden2(x)
        return self.logits(hidden_logs), self.value(hidden_vals)

    def action_value(self, obs):
        # executes call() under the hood
        logits, value = self.predict(obs)
        action = self.dist.predict(logits)
        # a simpler option, will become clear later why we don't use it
        # action = tf.random.categorical(logits, 1)
        return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)

Then let's verify that the model works as expected:

import gym
env = gym.make("CartPole-v0")
model = Model(num_actions=env.action_space.n)
obs = env.reset()
# no feed_dict or tf.Session() needed at all
action, value = model.action_value(obs[None, :])
print(action, value) # [1] [-0.00145713]

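Things to note here:

Model layers and execution path are defined separately
There is no "input" layer; the model accepts raw numpy arrays
Two computation paths can be defined in one model via the functional API
A model can contain helper methods such as action sampling
In eager mode, everything works from raw numpy arrays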
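
For readers wondering what the action-sampling helper does conceptually, here is a framework-free sketch of drawing an action from logits. It only mirrors the idea behind tf.random.categorical; the function names and the logits are assumptions for this example.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized log probabilities into probabilities (numerically stable)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw one action index with probability proportional to exp(logit)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.0]  # made-up policy output for a two-action environment
counts = [0, 0]
for _ in range(1000):
    counts[sample_action(logits, rng)] += 1
print(softmax(logits), counts)
```

With these logits the first action has a probability of roughly 0.88, so it should dominate the counts.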

Random Agent

Now let's move on to the A2CAgent class. First, let's add a test method that runs through a full episode and returns the sum of rewards.

class A2CAgent:
    def __init__(self, model):
        self.model = model

    def test(self, env, render=True):
        obs, done, ep_reward = env.reset(), False, 0
        while not done:
            action, _ = self.model.action_value(obs[None, :])
            obs, reward, done, _ = env.step(action)
            ep_reward += reward
            if render:
                env.render()
        return ep_reward

Let's see how our model scores with randomly initialized weights:

agent = A2CAgent(model)
rewards_sum = agent.test(env)
print("%d out of 200" % rewards_sum) # 18 out of 200

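That is still far from optimal, so on to the training part!

Loss / Objective Function

As I described in the DRL overview section, an agent improves its policy through gradient descent based on some loss (objective) function. In actor-critic we train on three objectives: improving the policy with advantage-weighted gradients plus entropy maximization, and minimizing value estimate errors.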

import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {"value": 0.5, "entropy": 0.0001}
        self.model = model
        self.model.compile(
            optimizer=ko.RMSprop(learning_rate=0.0007),
            # define separate losses for policy logits and value estimate
            loss=[self._logits_loss, self._value_loss]
        )

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # value loss is typically MSE between value estimates and returns
        return self.params["value"] * kls.mean_squared_error(returns, value)

    def _logits_loss(self, acts_and_advs, logits):
        # a trick to input actions and advantages through same API
        actions, advantages = tf.split(acts_and_advs, 2, axis=-1)
        # sparse CE loss object: takes integer action indices and supports sample_weight
        # from_logits argument ensures transformation into normalized probabilities
        cross_entropy = kls.SparseCategoricalCrossentropy(from_logits=True)
        # policy loss is defined by policy gradients, weighted by advantages
        # note: we only calculate the loss on the actions we've actually taken
        actions = tf.cast(actions, tf.int32)
        policy_loss = cross_entropy(actions, logits, sample_weight=advantages)
        # entropy can be calculated as CE of the action probabilities with themselves
        probs = tf.nn.softmax(logits)
        entropy_loss = kls.categorical_crossentropy(probs, probs)
        # here signs are flipped because optimizer minimizes
        return policy_loss - self.params["entropy"] * entropy_loss

And we are done with the objective functions! Note how compact the code is: there are almost more comment lines than code itself.
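
To make the three terms concrete, here is a framework-free sketch that computes a combined loss for a tiny batch by hand. It mirrors the structure of the objective rather than the exact reduction Keras performs, and the a2c_loss helper with its arguments is hypothetical.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_loss(logits_batch, values, actions, returns,
             value_coef=0.5, entropy_coef=0.0001):
    """Combine policy, value and entropy terms, each averaged over the batch."""
    n = len(actions)
    policy_loss = value_loss = entropy = 0.0
    for logits, value, action, ret in zip(logits_batch, values, actions, returns):
        probs = softmax(logits)
        advantage = ret - value
        # cross-entropy of the taken action, weighted by its advantage
        policy_loss += -math.log(probs[action]) * advantage / n
        # squared error between the value estimate and the observed return
        value_loss += (ret - value) ** 2 / n
        entropy += sum(-p * math.log(p) for p in probs) / n
    # entropy is subtracted because the optimizer minimizes the total
    return policy_loss + value_coef * value_loss - entropy_coef * entropy


# One made-up sample: uniform policy, value estimate 0, observed return 1.
loss = a2c_loss(logits_batch=[[0.0, 0.0]], values=[0.0], actions=[0], returns=[1.0])
print(loss)
```

For this single sample the policy term and the entropy both equal ln 2 and the squared value error is 1, so the total is 0.9999 * ln 2 + 0.5.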

Agent Training Loop

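Finally, there is the training loop itself. It is a bit long but fairly straightforward: collect samples, calculate returns and advantages, and train the model on them.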

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {"value": 0.5, "entropy": 0.0001, "gamma": 0.99}
        # unchanged from previous section
        ...

    def train(self, env, batch_sz=32, updates=1000):
        # storage helpers for a single batch of data
        actions = np.empty((batch_sz,), dtype=np.int32)
        rewards, dones, values = np.empty((3, batch_sz))
        observations = np.empty((batch_sz,) + env.observation_space.shape)
        # training loop: collect samples, send to optimizer, repeat updates times
        ep_rews = [0.0]
        next_obs = env.reset()
        for update in range(updates):
            for step in range(batch_sz):
                observations[step] = next_obs.copy()
                actions[step], values[step] = self.model.action_value(next_obs[None, :])
                next_obs, rewards[step], dones[step], _ = env.step(actions[step])
                ep_rews[-1] += rewards[step]
                if dones[step]:
                    ep_rews.append(0.0)
                    next_obs = env.reset()
            _, next_value = self.model.action_value(next_obs[None, :])
            returns, advs = self._returns_advantages(rewards, dones, values, next_value)
            # a trick to input actions and advantages through same API
            acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
            # performs a full training step on the collected batch
            # note: no need to mess around with gradients, Keras API handles it
            losses = self.model.train_on_batch(observations, [acts_and_advs, returns])
        return ep_rews

    def _returns_advantages(self, rewards, dones, values, next_value):
        # next_value is the bootstrap value estimate of a future state (the critic)
        returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
        # returns are calculated as discounted sum of future rewards
        for t in reversed(range(rewards.shape[0])):
            returns[t] = rewards[t] + self.params["gamma"] * returns[t+1] * (1-dones[t])
        returns = returns[:-1]
        # advantages are returns - baseline, value estimates in our case
        advantages = returns - values
        return returns, advantages

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # unchanged from previous section
        ...

    def _logits_loss(self, acts_and_advs, logits):
        # unchanged from previous section
        ...
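
The role of the done flags and the bootstrap value in _returns_advantages may be easier to follow on a tiny hand-made batch. The sketch below reimplements the same recurrence with plain lists; the rewards, flags, value estimates and gamma are made-up numbers.

```python
def bootstrapped_returns(rewards, dones, next_value, gamma):
    """Discounted returns for a batch that may span several episodes."""
    # the extra last slot holds the critic's estimate for the state after the batch
    returns = [0.0] * len(rewards) + [next_value]
    for t in reversed(range(len(rewards))):
        # (1 - dones[t]) cuts the recurrence at episode boundaries
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    return returns[:-1]


rewards = [1.0, 1.0, 1.0]
dones = [0, 1, 0]         # the episode ended after the second step
values = [1.0, 0.5, 4.0]  # made-up critic estimates for each step

returns = bootstrapped_returns(rewards, dones, next_value=10.0, gamma=0.5)
advantages = [g - v for g, v in zip(returns, values)]
print(returns)     # [1.5, 1.0, 6.0]
print(advantages)  # [0.5, 0.5, 2.0]
```

The last step borrows from the bootstrap value (1 + 0.5 * 10), while the second step ignores everything after it because its episode has ended.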

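Training & Results

We are now all set to train our single-worker A2C agent on CartPole-v0! Training should only take a couple of minutes. Once it is done, you should see the agent successfully reach the target score of 200 out of 200.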

rewards_history = agent.train(env)
print("Finished training, testing...")
print("%d out of 200" % agent.test(env)) # 200 out of 200

In the source code I included some additional helpers that print out the rewards and losses of running episodes, as well as rewards_history.

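Static Computational Graph

With eager mode working this well, you might wonder whether static graph execution is still possible. Of course it is! Moreover, it takes just one additional line to enable it.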

with tf.Graph().as_default():
    print(tf.executing_eagerly()) # False
    model = Model(num_actions=env.action_space.n)
    agent = A2CAgent(model)
    rewards_history = agent.train(env)
    print("Finished training, testing...")
    print("%d out of 200" % agent.test(env)) # 200 out of 200

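One caveat is that during static graph execution we can't just have Tensors lying around, which is why we needed the trick with the ProbabilityDistribution class during model definition.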

One More Thing…

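Remember when I said TensorFlow runs in eager mode by default, and even proved it with a code snippet? Well, I lied.

If you use the Keras API to build and manage your models, it will attempt to compile them as static graphs under the hood. So what you end up with is the performance of static computational graphs combined with the flexibility of eager execution.

You can check your model's status via the model.run_eagerly flag. You can also force eager mode by setting this flag to True, though most of the time you probably won't need to: if Keras detects that there is no way around eager mode, it will back off on its own.

To illustrate that the model really is running as a static graph, here is a simple benchmark: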

# create a 100000 samples batch
env = gym.make("CartPole-v0")
obs = np.repeat(env.reset()[None, :], 100000, axis=0)

Eager Benchmark

%%time
model = Model(env.action_space.n)
model.run_eagerly = True
print("Eager Execution: ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)
_ = model(obs)
######## Results #######
Eager Execution: True
Eager Keras Model: True
CPU times: user 639 ms, sys: 736 ms, total: 1.38 s

Static Benchmark

%%time
with tf.Graph().as_default():
    model = Model(env.action_space.n)
    print("Eager Execution: ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)
    _ = model.predict(obs)
######## Results #######
Eager Execution: False
Eager Keras Model: False
CPU times: user 793 ms, sys: 79.7 ms, total: 873 ms

Default Benchmark

%%time
model = Model(env.action_space.n)
print("Eager Execution: ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)
_ = model.predict(obs)
######## Results #######
Eager Execution: True
Eager Keras Model: False
CPU times: user 994 ms, sys: 23.1 ms, total: 1.02 s

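As you can see, eager mode (1.38 s total) lags behind static mode (873 ms), and by default the model was indeed executing statically: the default run reports run_eagerly as False and takes 1.02 s.

Conclusion

Hopefully this article has been helpful in understanding both DRL and the upcoming TensorFlow 2.0. Note that TensorFlow 2.0 is still only a preview and everything is subject to change, so if there is something about TensorFlow you especially dislike (or like :)), let the developers know.

A recurring question is whether TensorFlow is better than PyTorch. Maybe, maybe not. Both are great libraries, so it is hard to say which one is better. If you are familiar with PyTorch, you may have noticed that TensorFlow 2.0 has not only caught up with it but also avoids some of the pitfalls of the PyTorch API.

Whoever comes out on top, the competition has been a net positive for developers on both sides, and I am excited to see what these frameworks will become in the future.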
