Study Notes TF021: Predictive Coding, Character-Level Language Modelling, ArXiv Abstracts

Latest: 06-05

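Predictive coding feeds a large number of sequences into an RNN and trains it to predict the next frame of a sequence. Language modelling predicts the likelihood of the next word in a sentence. Text is generated by sampling from the network's distribution over the next word: once training has finished, a seed word is fed into the RNN, the predicted next word is observed, the most likely word is fed back into the RNN, and the process repeats to produce new content. Predictive coding trains the network to compress all the important information of any sequence. To predict the next character of a language accurately, the network has to capture its grammar and rules.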
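The feed-the-prediction-back loop can be sketched without a neural network. The toy below is only an illustration: it swaps the RNN for a bigram count table (an assumption made to keep the example self-contained) and always appends the most likely next character. The names train_bigram and generate are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every character, which characters follow it."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, seed, length):
    """Feed the last character back in and append its most likely
    successor, `length` times (greedy decoding)."""
    text = seed
    for _ in range(length):
        followers = counts.get(text[-1])
        if not followers:
            break  # the last character was never followed by anything
        text += followers.most_common(1)[0][0]
    return text


counts = train_bigram("abababac")
sample = generate(counts, "a", 5)
print(sample)
```

A real language model conditions on the whole history through its recurrent state rather than on the last character alone, but the outer loop has the same shape.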

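With character-level language modelling the network learns not only how to combine words but also how to spell them. The input dimensionality is lower, unknown words no longer need special handling, and the network can even invent new words. Andrej Karpathy applied RNNs to character-level language modelling in 2015: https://github.com/karpathy/char-rnn .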

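ArXiv.org hosts research papers in computer science, mathematics, physics, biology and other fields, and provides a web-based API for retrieving publications.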

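To fetch abstracts from ArXiv for a given search query, the constructor first checks whether a dump file of previously fetched abstracts exists. If it does, the file is used directly and the ArXiv API is not called. To run a new query, delete or move the old dump file; a possible improvement would be to check whether the existing file matches the new categories and keywords. If there is no dump file, the constructor calls _fetch_all and writes the lines it yields to disk.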
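The cache-or-fetch decision can be sketched in isolation. In this minimal example, load_or_fetch and fake_fetch are hypothetical names, and fake_fetch stands in for the real API call so the example runs offline.

```python
import os
import tempfile


def load_or_fetch(filename, fetch):
    """Return cached lines if `filename` exists; otherwise call
    `fetch()` (a generator of strings), write one item per line,
    and read the file back."""
    if not os.path.isfile(filename):
        with open(filename, "w") as file_:
            for item in fetch():
                file_.write(item + "\n")
    with open(filename) as file_:
        return file_.readlines()


calls = []


def fake_fetch():
    calls.append(1)  # record that the "API" was hit
    yield "first abstract"
    yield "second abstract"


cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, "abstracts.txt")
first = load_or_fetch(path, fake_fetch)
second = load_or_fetch(path, fake_fetch)  # served from disk
print(len(calls), first == second)
```

The second call never reaches fake_fetch, which is why a changed query needs the old dump file removed first.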

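The search for machine learning papers is restricted to the categories Machine Learning, Neural and Evolutionary Computing, and Optimization and Control, and only returns results whose metadata contains one of the words neural, network or deep.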

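_fetch_all handles pagination. Each request returns a fixed number of abstracts, and an offset selects which page of results to fetch. _fetch_page receives the page size as a parameter. Setting that parameter very large in order to get all results in a single request would seriously hurt query efficiency; fetching page by page is more fault tolerant and reduces the load on the ArXiv API.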
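Offset-based paging can be sketched as a generator. This is an illustration under stated assumptions: fetch_page is a hypothetical callable that returns at most amount items starting at offset, and a list plays the role of the remote database.

```python
def fetch_all(fetch_page, total, page_size):
    """Yield every result by requesting one page at a time."""
    for offset in range(0, total, page_size):
        amount = min(page_size, total - offset)  # the last page may be short
        yield from fetch_page(amount, offset)


database = ["paper-{}".format(i) for i in range(7)]
requests_made = []


def fake_page(amount, offset):
    requests_made.append((amount, offset))
    return database[offset:offset + amount]


results = list(fetch_all(fake_page, total=len(database), page_size=3))
print(requests_made)
```

Each request is small and independent, so a failed page can be retried without refetching everything.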

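The results come back as XML, and the BeautifulSoup library is used to extract the abstracts. Install it with sudo -H pip3 install beautifulsoup4. The parser looks up each article's entry tag and reads the abstract text from the summary tag inside it.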
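The extraction step can be tried offline. The sketch below deliberately swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without extra packages, and the feed is a hand-written stand-in with the entry/summary structure described above, not a real API response. It also collapses all runs of whitespace, which is slightly stricter than replacing newlines with spaces.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace prefix

FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example paper</title>
    <summary>
  We study recurrent
  networks.
    </summary>
  </entry>
</feed>"""


def extract_summaries(xml_text):
    """Yield the whitespace-normalized summary of every entry."""
    root = ET.fromstring(xml_text)
    for entry in root.iter(ATOM + "entry"):
        text = entry.find(ATOM + "summary").text
        yield " ".join(text.split())


summaries = list(extract_summaries(FEED))
print(summaries)
```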

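With the task defined and a parser written to obtain the dataset, the next step is the predictive coding model, which predicts the next character of its input sequence. It has a single input, the sequence parameter of the constructor. A params object makes it possible to change important options and reproduce experiments. The initial parameter, None by default, is the initial internal activation of the recurrent layers. TensorFlow initializes the hidden state to a zero tensor, but the state has to be defined explicitly when sampling from the language model.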

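Data processing: the input data and the target sequences are built by introducing a one-step time shift. At time step t the input is s_t and the output is s_{t+1}. This is done by slicing the provided sequence, cutting off the first frame for the targets and the last frame for the inputs. tf.slice takes the sequence, a tuple of start indices for each dimension, and a tuple of sizes for each dimension. A size of -1 keeps all elements from the start index to the end of that dimension. Only the second dimension is of interest here.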
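The one-step shift is easy to see on plain lists. In this minimal sketch (split_sequence is a hypothetical helper), characters stand in for the one-hot frames so the alignment is readable.

```python
def split_sequence(sequence):
    """Return (data, target): data drops the last frame and target
    drops the first, so target[t] is the frame that follows data[t]."""
    data = [frames[:-1] for frames in sequence]
    target = [frames[1:] for frames in sequence]
    return data, target


batch = [list("deep")]  # one sequence in the batch
data, target = split_sequence(batch)
print(data, target)
```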

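The mask is a tensor of size batch_size x max_length whose entries are either 0 or 1, depending on whether the frame is used. The length property sums the mask along the time axis to obtain the length of each sequence. The mask and length properties are also valid for the data sequence, which conceptually has the same length as the target sequence. They are not computed on the data sequence, however, because it can still contain the actual last frame of a sequence, for which there is no next letter to predict. The last frame sliced off the data tensor is a padding frame for most sequences, not their actual last frame. This is why the mask is applied to the cost function.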
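A minimal sketch of how mask and length follow from one-hot targets, using nested lists instead of tensors. The helper names and the tiny three-symbol vocabulary are assumptions for illustration.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def mask_and_length(target):
    """target: batch x time x vocabulary nested lists.
    A frame counts as used when any of its components is non-zero."""
    mask = [[max(abs(x) for x in frame) for frame in seq] for seq in target]
    length = [sum(row) for row in mask]
    return mask, length


VOCAB = 3
PAD = [0.0] * VOCAB  # a padding frame is all zeros
target = [
    [one_hot(0, VOCAB), one_hot(2, VOCAB), PAD],  # 2 real frames
    [one_hot(1, VOCAB), PAD, PAD],                # 1 real frame
]
mask, length = mask_and_length(target)
print(mask, length)
```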

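The model returns both the prediction and the last recurrent activation. Previously only the prediction was returned; the last activation makes it possible to generate sequences efficiently. forward returns a tuple of the two tensors, and prediction and state exist only for convenient access from outside.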

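At every time step the model predicts the next letter from the vocabulary. This is a classification problem, so the cost function is the cross entropy, and the error rate of the character predictions is computed as well. The logprob property describes the probability the model assigns to the correct next letter, in log space. It is the negative cross entropy: transform to log space, then take the mean. Converting the result back to linear space gives the perplexity, which indicates how many options the model is effectively guessing between at each time step. A perfect model has perplexity 1, and a model that outputs the same probability for each of n classes has perplexity n. If the correct next letter is assigned zero probability the perplexity becomes infinite, so the predicted probabilities are clamped between a very small positive number and 1.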
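The three perplexity cases above can be checked numerically. This is a sketch in plain Python; the function name perplexity and the floor of 1e-10 are choices made for this illustration.

```python
import math


def perplexity(correct_probs, floor=1e-10):
    """correct_probs: probability the model gave to the true next
    character at each time step."""
    clipped = [min(max(p, floor), 1.0) for p in correct_probs]
    mean_log2 = sum(math.log2(p) for p in clipped) / len(clipped)
    return 2 ** -mean_log2


perfect = perplexity([1.0, 1.0, 1.0])  # always certain and right
uniform = perplexity([0.25] * 8)       # four equally likely classes
with_zero = perplexity([0.5, 0.0])     # clamping keeps this finite
print(perfect, uniform, with_zero)
```

Without the clamp, the zero probability in the last call would make math.log2 fail, which mirrors the infinite perplexity described above.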

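For fixed-length sequences, applying tf.reduce_mean to the result would be enough. For variable-length sequences, the values are multiplied by the mask to zero out padding frames and then aggregated along the frame size. Because each frame has only one element set, tf.reduce_sum collapses each frame into a scalar.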

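Next, the frames of each sequence are averaged by dividing by the actual sequence length. Using the maximum of the sequence length and 1 avoids a division by zero for empty sequences. Finally, tf.reduce_mean averages over the examples in the batch.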
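The masked averaging described above, sketched on nested lists rather than tensors. masked_average is a hypothetical name, and the third sequence is empty on purpose to exercise the division guard.

```python
def masked_average(values, mask):
    """values, mask: batch x time lists. Padding frames (mask 0) are
    ignored; each sequence is averaged over its own length, then the
    per-sequence means are averaged over the batch."""
    per_sequence = []
    for row, row_mask in zip(values, mask):
        length = max(sum(row_mask), 1)  # avoid dividing by zero
        total = sum(v * m for v, m in zip(row, row_mask))
        per_sequence.append(total / length)
    return sum(per_sequence) / len(per_sequence)


values = [[2.0, 4.0, 9.0], [6.0, 9.0, 9.0], [9.0, 9.0, 9.0]]
mask = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]
result = masked_average(values, mask)
print(result)
```

The 9.0 entries sit under mask 0 and never influence the result.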

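Training is started by constructing Training with get_params() and the query arguments and calling the resulting object. 20 epochs take about 1 hour: 20 epochs * 200 batches * 100 examples * 50 characters = 20M letters. The model converges at a perplexity of about 1.5 per letter. Since the perplexity is 2 raised to the average number of bits per letter, this corresponds to roughly log2(1.5) ≈ 0.58 bits per letter, which could be used for text compression. To compare with word-level language models, the average would have to be taken over the number of words instead; as a rough estimate, multiply the per-letter bit count by the average number of characters per word.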
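The arithmetic can be checked directly. The conversion from perplexity to bits assumes the base-2 definition used in the training code (perplexity = 2 ** -mean log2 probability), under which bits per letter equal log2 of the perplexity.

```python
import math

epochs, batches, examples, characters = 20, 200, 100, 50
letters_seen = epochs * batches * examples * characters

perplexity = 1.5
bits_per_letter = math.log2(perplexity)  # inverse of 2 ** bits

print(letters_seen, round(bits_per_letter, 3))
```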

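The trained model can now be used to generate new, similar sequences. The latest model checkpoint is loaded from disk, and placeholders are defined through which data is fed into the graph to generate new data.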

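The constructor creates an instance of the preprocessing class, which converts the sequence generated so far into NumPy vectors that can be fed into the graph. The sequence placeholder holds one sequence per batch, and the sequence length is 2, because the model uses all characters except the last as input and all characters except the first as targets. The last character of the current text and an arbitrary second character are fed into the model. The network makes its prediction for the first character; the second character only serves as the target value. The last activation of the recurrent network is fetched and used to initialize the network state for the next run. This goes through the model's initial state parameter; for the GRUCell used here, the state is a vector of size rnn_layers * rnn_hidden.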

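The __call__ method contains the logic for sampling a text sequence. Starting from a seed, it predicts one character at a time and feeds the current text into the network. The same preprocessing class converts the current text into a padded NumPy block for the network. Because the batch holds only one sequence and one output frame, only the prediction at index [0, 0] matters.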

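The _sample function samples from the softmax output. One option would be to take the single best prediction and feed it back as the next frame to generate the sequence. In practice the most likely next frame is not simply selected; instead the function samples randomly from the probability distribution the RNN outputs. Characters with a high output probability are more likely to be chosen, but characters with a low output probability can still be picked.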

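A temperature parameter T makes the predictions of the softmax output distribution more similar to each other or more different. To scale the outputs in linear space, the exponential applied by the softmax is first undone with the natural logarithm. Each value is then divided by the chosen temperature and the softmax function is applied again, which renormalizes the distribution.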
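A minimal sketch of temperature scaling in plain Python, following the log, divide, exponentiate, normalize steps above. It assumes every probability is strictly positive (a zero would break math.log); apply_temperature and sample are names chosen for this illustration.

```python
import math
import random


def apply_temperature(dist, temperature):
    """Rescale a probability distribution: log, divide by T,
    exponentiate, renormalize."""
    logits = [math.log(p) / temperature for p in dist]
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(dist, temperature, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    scaled = apply_temperature(dist, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


dist = [0.7, 0.2, 0.1]
sharper = apply_temperature(dist, 0.5)    # T < 1: the peak grows
flatter = apply_temperature(dist, 2.0)    # T > 1: closer to uniform
unchanged = apply_temperature(dist, 1.0)  # T = 1: same distribution
choice = sample(dist, 0.5, random.Random(0))
print(sharper, flatter, choice)
```

Low temperatures approach always picking the most likely character, while high temperatures approach uniform random text.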

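Calling Sampling(get_params())('We', 500) generates 500 characters starting from the seed 'We'. The generated text shows that the model captures the statistical dependencies within the data.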

import os

import requests
from bs4 import BeautifulSoup

from helpers import ensure_directory


class ArxivAbstracts:

    ENDPOINT = 'http://export.arxiv.org/api/query'
    PAGE_SIZE = 100

    def __init__(self, cache_dir, categories, keywords, amount=None):
        self.categories = categories
        self.keywords = keywords
        cache_dir = os.path.expanduser(cache_dir)
        ensure_directory(cache_dir)
        filename = os.path.join(cache_dir, 'abstracts.txt')
        # Only query the API when no cached dump exists.
        if not os.path.isfile(filename):
            with open(filename, 'w') as file_:
                for abstract in self._fetch_all(amount):
                    file_.write(abstract + '\n')
        with open(filename) as file_:
            self.data = file_.readlines()

    def _fetch_all(self, amount):
        page_size = type(self).PAGE_SIZE
        count = self._fetch_count()
        if amount:
            count = min(count, amount)
        for offset in range(0, count, page_size):
            print('Fetch papers {}/{}'.format(offset + page_size, count))
            # Pass the page offset here, not the total count.
            yield from self._fetch_page(page_size, offset)

    def _fetch_page(self, amount, offset):
        url = self._build_url(amount, offset)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        for entry in soup.find_all('entry'):
            text = entry.find('summary').text
            text = text.strip().replace('\n', ' ')
            yield text

    def _fetch_count(self):
        url = self._build_url(0, 0)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        count = int(soup.find('opensearch:totalresults').string)
        print(count, 'papers found')
        return count

    def _build_url(self, amount, offset):
        categories = ' OR '.join('cat:' + x for x in self.categories)
        keywords = ' OR '.join('all:' + x for x in self.keywords)
        url = type(self).ENDPOINT
        url += '?search_query=(({}) AND ({}))'.format(categories, keywords)
        # The arXiv API names its paging offset parameter `start`.
        url += '&max_results={}&start={}'.format(amount, offset)
        return url

import random

import numpy as np


class Preprocessing:

    VOCABULARY = \
        " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
        "\\^_abcdefghijklmnopqrstuvwxyz{}"

    def __init__(self, texts, length, batch_size):
        self.texts = texts
        self.length = length
        self.batch_size = batch_size
        # Map each vocabulary character to its one-hot index.
        self.lookup = {x: i for i, x in enumerate(self.VOCABULARY)}

    def __call__(self, texts):
        # One-hot encode a list of texts into a zero-padded batch.
        batch = np.zeros((len(texts), self.length, len(self.VOCABULARY)))
        for index, text in enumerate(texts):
            text = [x for x in text if x in self.lookup]
            assert 2 <= len(text) <= self.length
            for offset, character in enumerate(text):
                code = self.lookup[character]
                batch[index, offset, code] = 1
        return batch

    def __iter__(self):
        # Cut the texts into half-overlapping windows of equal length.
        windows = []
        for text in self.texts:
            for i in range(0, len(text) - self.length + 1, self.length // 2):
                windows.append(text[i: i + self.length])
        assert all(len(x) == len(windows[0]) for x in windows)
        # Yield shuffled batches forever.
        while True:
            random.shuffle(windows)
            for i in range(0, len(windows), self.batch_size):
                batch = windows[i: i + self.batch_size]
                yield self(batch)

import tensorflow as tf

from helpers import lazy_property


class PredictiveCodingModel:
    # Written against the pre-1.0 TensorFlow API (tf.mul, reduction_indices).

    def __init__(self, params, sequence, initial=None):
        self.params = params
        self.sequence = sequence
        self.initial = initial
        # Touch each lazy property once so the whole graph gets built.
        self.prediction
        self.state
        self.cost
        self.error
        self.logprob
        self.optimize

    @lazy_property
    def data(self):
        # Inputs: every frame except the last.
        max_length = int(self.sequence.get_shape()[1])
        return tf.slice(self.sequence, (0, 0, 0), (-1, max_length - 1, -1))

    @lazy_property
    def target(self):
        # Targets: every frame except the first.
        return tf.slice(self.sequence, (0, 1, 0), (-1, -1, -1))

    @lazy_property
    def mask(self):
        # 1 for used frames, 0 for all-zero padding frames.
        return tf.reduce_max(tf.abs(self.target), reduction_indices=2)

    @lazy_property
    def length(self):
        return tf.reduce_sum(self.mask, reduction_indices=1)

    @lazy_property
    def prediction(self):
        prediction, _ = self.forward
        return prediction

    @lazy_property
    def state(self):
        _, state = self.forward
        return state

    @lazy_property
    def forward(self):
        cell = self.params.rnn_cell(self.params.rnn_hidden)
        cell = tf.nn.rnn_cell.MultiRNNCell([cell] * self.params.rnn_layers)
        hidden, state = tf.nn.dynamic_rnn(
            inputs=self.data,
            cell=cell,
            dtype=tf.float32,
            initial_state=self.initial,
            sequence_length=self.length)
        vocabulary_size = int(self.target.get_shape()[2])
        prediction = self._shared_softmax(hidden, vocabulary_size)
        return prediction, state

    @lazy_property
    def cost(self):
        # Cross entropy; clipping avoids log(0).
        prediction = tf.clip_by_value(self.prediction, 1e-10, 1.0)
        cost = self.target * tf.log(prediction)
        cost = -tf.reduce_sum(cost, reduction_indices=2)
        return self._average(cost)

    @lazy_property
    def error(self):
        error = tf.not_equal(
            tf.argmax(self.prediction, 2), tf.argmax(self.target, 2))
        error = tf.cast(error, tf.float32)
        return self._average(error)

    @lazy_property
    def logprob(self):
        # Base-2 log probability assigned to the correct next character.
        logprob = tf.mul(self.prediction, self.target)
        logprob = tf.reduce_max(logprob, reduction_indices=2)
        logprob = tf.log(tf.clip_by_value(logprob, 1e-10, 1.0)) / tf.log(2.0)
        return self._average(logprob)

    @lazy_property
    def optimize(self):
        gradient = self.params.optimizer.compute_gradients(self.cost)
        if self.params.gradient_clipping:
            limit = self.params.gradient_clipping
            gradient = [
                (tf.clip_by_value(g, -limit, limit), v)
                if g is not None else (None, v)
                for g, v in gradient]
        optimize = self.params.optimizer.apply_gradients(gradient)
        return optimize

    def _average(self, data):
        # Zero out padding frames, average each sequence over its own
        # length (at least 1, to avoid dividing by zero for empty
        # sequences), then average over the batch.
        data *= self.mask
        length = tf.maximum(self.length, 1.0)
        data = tf.reduce_sum(data, reduction_indices=1) / length
        data = tf.reduce_mean(data)
        return data

    def _shared_softmax(self, data, out_size):
        max_length = int(data.get_shape()[1])
        in_size = int(data.get_shape()[2])
        weight = tf.Variable(tf.truncated_normal(
            [in_size, out_size], stddev=0.01))
        bias = tf.Variable(tf.constant(0.1, shape=[out_size]))
        # Flatten to apply same weights to all time steps.
        flat = tf.reshape(data, [-1, in_size])
        output = tf.nn.softmax(tf.matmul(flat, weight) + bias)
        output = tf.reshape(output, [-1, max_length, out_size])
        return output

import os
import re

import numpy as np
import tensorflow as tf

from helpers import overwrite_graph
from helpers import ensure_directory
from ArxivAbstracts import ArxivAbstracts
from Preprocessing import Preprocessing
from PredictiveCodingModel import PredictiveCodingModel


class Training:

    @overwrite_graph
    def __init__(self, params, cache_dir, categories, keywords, amount=None):
        self.params = params
        self.texts = ArxivAbstracts(cache_dir, categories, keywords, amount).data
        self.prep = Preprocessing(
            self.texts, self.params.max_length, self.params.batch_size)
        self.sequence = tf.placeholder(
            tf.float32,
            [None, self.params.max_length, len(self.prep.VOCABULARY)])
        self.model = PredictiveCodingModel(self.params, self.sequence)
        self._init_or_load_session()

    def __call__(self):
        print('Start training')
        self.logprobs = []
        batches = iter(self.prep)
        for epoch in range(self.epoch, self.params.epochs + 1):
            self.epoch = epoch
            for _ in range(self.params.epoch_size):
                self._optimization(next(batches))
            self._evaluation()
        return np.array(self.logprobs)

    def _optimization(self, batch):
        logprob, _ = self.sess.run(
            (self.model.logprob, self.model.optimize),
            {self.sequence: batch})
        if np.isnan(logprob):
            raise Exception('training diverged')
        self.logprobs.append(logprob)

    def _evaluation(self):
        # Save a checkpoint, then report the perplexity of the last epoch.
        self.saver.save(self.sess, os.path.join(
            self.params.checkpoint_dir, 'model'), self.epoch)
        perplexity = 2 ** -(sum(self.logprobs[-self.params.epoch_size:]) /
                            self.params.epoch_size)
        print('Epoch {:2d} perplexity {:5.4f}'.format(self.epoch, perplexity))

    def _init_or_load_session(self):
        self.sess = tf.Session()
        self.saver = tf.train.Saver()
        checkpoint = tf.train.get_checkpoint_state(self.params.checkpoint_dir)
        if checkpoint and checkpoint.model_checkpoint_path:
            path = checkpoint.model_checkpoint_path
            print('Load checkpoint', path)
            self.saver.restore(self.sess, path)
            # Resume from the epoch after the one encoded in the file name.
            self.epoch = int(re.search(r'-(\d+)$', path).group(1)) + 1
        else:
            ensure_directory(self.params.checkpoint_dir)
            print('Randomly initialize variables')
            self.sess.run(tf.initialize_all_variables())
            self.epoch = 1

from Training import Training
from get_params import get_params

Training(
    get_params(),
    cache_dir='./arxiv',
    categories=[
        'Machine Learning',
        'Neural and Evolutionary Computing',
        'Optimization and Control',
    ],
    keywords=[
        'neural',
        'network',
        'deep',
    ]
)()

import numpy as np
import tensorflow as tf

from helpers import overwrite_graph
from Preprocessing import Preprocessing
from PredictiveCodingModel import PredictiveCodingModel


class Sampling:

    @overwrite_graph
    def __init__(self, params):
        self.params = params
        self.prep = Preprocessing([], 2, self.params.batch_size)
        # One sequence of two frames: the last character and a dummy target.
        self.sequence = tf.placeholder(
            tf.float32, [1, 2, len(self.prep.VOCABULARY)])
        self.state = tf.placeholder(
            tf.float32, [1, self.params.rnn_hidden * self.params.rnn_layers])
        self.model = PredictiveCodingModel(
            self.params, self.sequence, self.state)
        self.sess = tf.Session()
        checkpoint = tf.train.get_checkpoint_state(self.params.checkpoint_dir)
        if checkpoint and checkpoint.model_checkpoint_path:
            tf.train.Saver().restore(
                self.sess, checkpoint.model_checkpoint_path)
        else:
            print('Sampling from untrained model.')
        print('Sampling temperature', self.params.sampling_temperature)

    def __call__(self, seed, length=100):
        text = seed
        state = np.zeros((1, self.params.rnn_hidden * self.params.rnn_layers))
        for _ in range(length):
            feed = {self.state: state}
            # The second character is arbitrary; it only fills the target slot.
            feed[self.sequence] = self.prep([text[-1] + '?'])
            prediction, state = self.sess.run(
                [self.model.prediction, self.model.state], feed)
            text += self._sample(prediction[0, 0])
        return text

    def _sample(self, dist):
        # Undo the softmax exponential, scale by temperature, renormalize.
        dist = np.log(dist) / self.params.sampling_temperature
        dist = np.exp(dist) / np.exp(dist).sum()
        choice = np.random.choice(len(dist), p=dist)
        choice = self.prep.VOCABULARY[choice]
        return choice

References:

TensorFlow for Machine Intelligence (Chinese edition: 《面向機器智能的TensorFlow實踐》)


TAG: 清醒瘋子
