Reading Data in TensorFlow

Knowledge 06-29

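I. Feeding data with placeholder + feed_dict
II. Using TFRecords to unify the input data format
0. Pros and cons of the TFRecords format
1. Converting data to a .tfrecords file
a. Getting the image paths and labels
b. Defining the encoding functions
c. Converting the image data and labels (or any other data to be saved) to TFRecords format
2. Reading and decoding .tfrecords files and generating batches
a. Specifying the list of .tfrecords files to read
b. Creating an input filename queue to manage the file list
c. Reading and decoding
3. Feeding batches into the computation graph for training, validation, testing, etc.
III. References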

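I. Feeding data with placeholder + feed_dict

A placeholder is TensorFlow's stand-in for a value that will be supplied later. You must specify the dtype of the value that will be fed to it, typically tf.float32. The actual data is then supplied through the optional feed_dict argument of sess.run(), e.g. sess.run(***, feed_dict={input: **})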
input = tf.placeholder(tf.float32, shape=[2], name="my_input")

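dtype: the data type of the value that will be fed to the placeholder. This argument is required, so that type mismatches are caught.
shape: the shape of the Tensor to be fed. It defaults to None, which means a Tensor of any shape is accepted.
name: as with any op, a name identifier can be given to tf.placeholder.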

import tensorflow as tf  # TensorFlow 1.x API

input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
output = tf.add(input1, input2)
with tf.Session() as sess:
    print(sess.run([output], feed_dict={input1: [7.], input2: [2.]}))
# Output: [array([ 9.], dtype=float32)]

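Note: Using None for one dimension of shape makes it easy to work with different batch sizes. During training the data is split into fairly small batches, while at test time all the data can be fed at once. Be careful, though: when the dataset is large, putting too much data into a single batch can exhaust memory.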
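One way to avoid the memory problem mentioned in the note is to evaluate a large dataset in chunks and combine the partial results. The following is only a minimal pure-Python sketch of that idea; iter_batches and mean_in_batches are hypothetical helper names, and the plain sum stands in for the sess.run(..., feed_dict=...) call that a real model would make for each chunk.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples` holding at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def mean_in_batches(samples, batch_size):
    """Compute the mean of `samples` without handling them all in one batch."""
    total, count = 0.0, 0
    for batch in iter_batches(samples, batch_size):
        # In a real model, this is where one batch would be fed through feed_dict.
        total += sum(batch)
        count += len(batch)
    return total / count


values = list(range(10))
batches = list(iter_batches(values, 4))
print(batches)                     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(mean_in_batches(values, 4))  # 4.5
```

Because the batch dimension of the placeholder is None, the last, smaller chunk can be fed without any change to the graph.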

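II. Using TFRecords to unify the input data format

0. Pros and cons of the TFRecords format

All data in a TFRecord file is stored in the tf.train.Example Protocol Buffer format. Its pros and cons are as follows:
Pros:
It unifies different raw data formats
It manages the different attributes more effectively, makes better use of memory, and is easier to copy and move
Cons:
The converted tfrecords file takes up a lot of space (in the approach below, the raw, uncompressed pixel bytes are stored)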
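To get a feel for the size cost, the arithmetic can be sketched as follows. This is only an illustration: the 48 x 160 size is borrowed from the resize step later in this article, while the 3 channels, the uint8 pixel type, and the dataset of 10,000 images are assumptions, and the small per-record overhead of the file format is ignored.

```python
def raw_image_bytes(height, width, channels, bytes_per_value=1):
    """Bytes needed to store one image as raw, uncompressed pixel values."""
    return height * width * channels * bytes_per_value


per_image = raw_image_bytes(48, 160, 3)  # one uint8 RGB image
dataset = per_image * 10_000             # hypothetical dataset of 10,000 images

print(per_image)                  # 23040
print(round(dataset / 2**20, 1))  # 219.7 (MiB)
```

A compressed JPEG or PNG of the same picture is usually much smaller than its raw pixel array, which is why the converted file tends to be larger than the original image folder.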

1. Converting data to a .tfrecords file


a. Getting the image paths and labels

# Get the image paths and labels for the later reading and conversion steps
def get_file(file_dir):
    """Get full image directory and corresponding labels
    Args:
        file_dir: file directory
    Returns:
        images: image directories, list, string
        labels: label, list, int
    """
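The body of get_file is not shown above. One possible way to fill it in is sketched below, under the assumption (not stated in the original) that the images are organized as file_dir/<class_name>/<image file>, with one subfolder per class. The accepted file extensions are likewise an assumption.

```python
import os


def get_file(file_dir):
    """Collect image paths and integer labels from `file_dir`.

    Assumed (hypothetical) layout: file_dir/<class_name>/<image file>
    """
    images, labels = [], []
    # Sort the class folders so the class-to-label mapping is reproducible.
    class_names = sorted(
        d for d in os.listdir(file_dir)
        if os.path.isdir(os.path.join(file_dir, d))
    )
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(file_dir, class_name)
        for fname in sorted(os.listdir(class_dir)):
            # Keep only files that look like images (assumed extensions).
            if fname.lower().endswith((".jpg", ".jpeg", ".png")):
                images.append(os.path.join(class_dir, fname))
                labels.append(label)
    return images, labels
```

With folders named cat and dog, for example, every image under cat gets label 0 and every image under dog gets label 1.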

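b. Defining the encoding functions

The tf.train.Example data structure contains a dictionary that maps feature names to feature values.

The feature name is a string
The feature value can be a list of byte strings (BytesList), a list of floats (FloatList), or a list of integers (Int64List). The functions below wrap a value in the matching tf.train.Feature so that it can be inserted into an Example proto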

# Wrapper for inserting int64 features into Example proto
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Wrapper for inserting bytes features into Example proto
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
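To see what these wrappers produce, the following minimal sketch builds one Example, serializes it, and parses it back with the protobuf FromString method. The sample values and the _float_feature helper are illustrative assumptions (the original defines only the int64 and bytes wrappers). The tf.train calls used here should be available in both TensorFlow 1.x and 2.x.

```python
import tensorflow as tf


def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _float_feature(value):
    # Hypothetical third wrapper, covering the FloatList case.
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


# Build one Example from made-up values.
example = tf.train.Example(features=tf.train.Features(feature={
    "label": _int64_feature(3),
    "image_raw": _bytes_feature(b"\x00\x01\x02"),
    "score": _float_feature(0.5),
}))

# Serialize to bytes (this is what gets written to a TFRecord file) and parse back.
serialized = example.SerializeToString()
decoded = tf.train.Example.FromString(serialized)

label = decoded.features.feature["label"].int64_list.value[0]
raw = decoded.features.feature["image_raw"].bytes_list.value[0]
score = decoded.features.feature["score"].float_list.value[0]
print(label, raw, score)
```

Each feature value is always a list, which is why the wrappers put the single value inside [value] and the readback takes element 0.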

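c. Converting the image data and labels (or any other data to be saved) to TFRecords format

Specify the save path and file name for the converted data
Create a writer instance, which is used later to write the serialized data
Store all the data in the tf.train.Example Protocol Buffer format:
    Get the total number of image samples
    Loop over the images and labels and read their contents: convert each image to a byte string; when a sample has several labels, convert the multi-label content to a byte string as well
    Use the encoding functions to convert all the data of one sample (image, label, etc.) into an Example Protocol Buffer
    Call the writer's write method to write the serialized Example Protocol Buffer to the TFRecords file
When all samples have been converted, call the writer's close method to finish writing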

import tensorflow as tf
import numpy as np
import os
import skimage.io as io

# Convert the image data and labels (or any other data to be saved) to TFRecords format
def convert_to_tfrecord(images, labels, save_dir, name):
    """convert all images and labels to one tfrecord file.
    Args:
        images: list of image directories, string type
        labels: list of labels, int type
        save_dir: the directory to save tfrecord file, e.g.: "/home/folder1/"
        name: the name of tfrecord file, string type, e.g.: "train"
    Return:
        no return
    """
    # Save path and file name for the converted data
    filename = os.path.join(save_dir, name + ".tfrecords")
    # Create a writer instance, used below to write the serialized data
    writer = tf.python_io.TFRecordWriter(filename)
    # Total number of image samples
    n_samples = len(labels)
    print("\nTransform start......")
    # Store all the data (including labels) in the tf.train.Example Protocol Buffer format
    for i in np.arange(n_samples):
        try:
            image = io.imread(images[i])  # read an image; the returned type must be an array!
            image_raw = image.tobytes()  # convert the image array to a byte string (tostring() is the older, deprecated name)
            label = int(labels[i])  # if a label is stored as a string, convert it to int
            # Build a tf.train.Example protocol buffer that stores the label and image data as named features
            example = tf.train.Example(features=tf.train.Features(feature={
                "label": _int64_feature(label),
                "image_raw": _bytes_feature(image_raw)}))
            # Write the serialized example to the TFRecord file with the writer's write method
            writer.write(example.SerializeToString())
        # Skip images that cannot be read
        except IOError as e:
            print("Could not read:", images[i])
            print("error: %s" % e)
            print("Skip it!\n")
    # Call the writer's close method to finish writing
    writer.close()
    print("Transform done!")
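One detail of the conversion is worth illustrating: the byte string holds only the raw pixel values, not the shape or dtype of the array, so both must be known again at decoding time. The NumPy sketch below shows this with a small made-up array standing in for an image; np.frombuffer plays the role that tf.decode_raw plays in the reading code.

```python
import numpy as np

# A tiny stand-in for an H x W x C uint8 image (values are made up).
image = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)
raw = image.tobytes()  # shape and dtype are NOT stored in these bytes

# Decoding yields a flat 1-D array; the shape has to be restored by hand.
flat = np.frombuffer(raw, dtype=np.uint8)
restored = flat.reshape(image.shape)

# Decoding with the wrong dtype silently gives a different array.
wrong = np.frombuffer(raw, dtype=np.uint16)

print(len(raw), flat.shape, wrong.shape)  # 24 (24,) (12,)
print(np.array_equal(restored, image))    # True
```

If the images in a dataset have different sizes, it can therefore help to store height, width, and depth as additional int64 features next to the image bytes.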

2. Reading and decoding .tfrecords files and generating batches

A typical pipeline for reading records from files has the following stages:

The list of filenames
Filename queue
Optional filename shuffling
Optional epoch limit
A Reader for the file format
A decoder for a record read by the reader
Optional preprocessing
Example queue


a. Specifying the list of .tfrecords files to read

# Three alternative ways to build the file list; use one of them
# Specify the file list directly
filenames = ["/path/to/train_dataset1.tfrecords", "/path/to/train_dataset2.tfrecords"]
# Get the file list with tf.train.match_filenames_once
filenames = tf.train.match_filenames_once(os.path.join(FLAGS.data_dir, "train_*.tfrecords"))
# Get the file list with Python's glob module (requires "import glob")
filenames = glob.glob(os.path.join(FLAGS.data_dir, "train_*.tfrecords"))
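The glob variant can be tried without TensorFlow. The sketch below creates a few empty files in a temporary directory and collects the ones that match the pattern. list_tfrecords is a hypothetical helper name, and the sorting is an addition for reproducibility, since glob.glob makes no promise about the order of its results.

```python
import glob
import os
import tempfile


def list_tfrecords(data_dir, pattern="train_*.tfrecords"):
    """Return the sorted list of files in `data_dir` that match `pattern`."""
    return sorted(glob.glob(os.path.join(data_dir, pattern)))


with tempfile.TemporaryDirectory() as data_dir:
    # Create empty placeholder files; only the names matter here.
    for name in ("train_1.tfrecords", "train_2.tfrecords", "test_1.tfrecords"):
        open(os.path.join(data_dir, name), "wb").close()
    found = [os.path.basename(p) for p in list_tfrecords(data_dir)]

print(found)  # ['train_1.tfrecords', 'train_2.tfrecords']
```

Unlike tf.train.match_filenames_once, glob returns an ordinary Python list right away, so nothing has to be initialized in the session before the list can be used.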

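b. Creating an input filename queue to manage the file list

Use tf.train.string_input_producer(filenames, shuffle=True, num_epochs=None) to create the input filename queue.
For background, see the article 十圖詳解tensorflow數據讀取機制 ("TensorFlow's data-reading mechanism explained in ten figures"). As its illustration shows, when the system detects the "end" of the queue it automatically raises an OutOfRange exception, and the program can finish once the caller catches that exception. In my understanding, though, the A, B, and C in that illustration should be .tfrecords files, i.e. entries like those in filenames above.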


tf.train.string_input_producer(
    string_tensor,
    num_epochs=None,
    shuffle=True,
    seed=None,
    capacity=32,
    shared_name=None,
    name=None,
    cancel_op=None
)
# Arguments
string_tensor: A 1-D string tensor with the strings to produce, such as filenames above
num_epochs: An integer (optional). If specified, string_input_producer produces each string from string_tensor num_epochs times before generating an OutOfRange error. If not specified, string_input_producer can cycle through the strings in string_tensor an unlimited number of times.
shuffle: Boolean. If true, the strings are randomly shuffled within each epoch.
capacity: An integer. Sets the queue capacity.
# Returns
A queue with the output strings. A QueueRunner for the Queue is added to the current Graph's QUEUE_RUNNER collection.
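The behavior described by num_epochs and shuffle can be imitated in plain Python, which may help in picturing what the filename queue does. This is only a rough model, not how TensorFlow implements it: filename_producer and drain are hypothetical names, a background thread fills a bounded queue.Queue, and a sentinel object stands in for the OutOfRange error.

```python
import queue
import random
import threading

_END = object()  # sentinel standing in for TensorFlow's OutOfRange signal


def filename_producer(filenames, num_epochs, shuffle=True, seed=None, capacity=32):
    """Start a thread that enqueues every filename once per epoch."""
    q = queue.Queue(maxsize=capacity)
    rng = random.Random(seed)

    def fill():
        for _ in range(num_epochs):
            epoch = list(filenames)
            if shuffle:
                rng.shuffle(epoch)  # reshuffled within each epoch
            for name in epoch:
                q.put(name)         # blocks while the queue is full
        q.put(_END)                 # epoch limit reached

    threading.Thread(target=fill, daemon=True).start()
    return q


def drain(q):
    """Read from the queue until the end signal arrives."""
    items = []
    while True:
        item = q.get()
        if item is _END:
            return items
        items.append(item)


names = drain(filename_producer(["A", "B", "C"], num_epochs=2, seed=0))
print(len(names))  # 6
```

Every filename appears exactly once in each block of three, but the order inside a block can differ from epoch to epoch.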

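c. Reading and decoding

Create a reader instance for reading the examples in the .tfrecords files
Call the reader's read method to read one example from the files in the filename queue, which returns the file name and a serialized Example Protocol Buffer
Following the feature spec, use the tf.parse_single_example() decoder to decode that serialized Example Protocol Buffer, which returns a dict (mapping feature keys to Tensor and SparseTensor values)
Use tf.decode_raw() to parse the byte string into the image's pixel array, and tf.cast() to convert the label's data type
Preprocess the image
Build a batcher with tf.train.shuffle_batch to produce one batch of data as input to the neural network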

def read_and_decode(filenames, batch_size, num_epochs=None):
    """read and decode tfrecord file, generate (image, label) batches
    Args:
        filenames: list of tfrecord file paths
        batch_size: number of images in each batch
        num_epochs: None, cycle through the strings in string_tensor an unlimited number of times
    Returns:
        image: 4D tensor - [batch_size, height, width, channel]
        label: 1D tensor - [batch_size]
    """
    # Creates a FIFO queue for holding the filenames until the reader needs them
    filename_queue = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True)
    # Create a reader instance for reading the examples in the TFRecord files
    reader = tf.TFRecordReader()
    # Call the reader's read method to read one example; it returns the file name and the serialized protocol buffer
    _, serialized_example = reader.read(filename_queue)
    # Parse the serialized example according to the feature spec
    img_features = tf.parse_single_example(
        serialized_example,
        features={
            "label": tf.FixedLenFeature([], tf.int64),
            "image_raw": tf.FixedLenFeature([], tf.string),
        })
    # Parse the byte string into the pixel array: Tensor("DecodeRaw:0", shape=(?,), dtype=uint8)
    # Note: the dtype given here must match the dtype the data had before it was converted to bytes,
    # otherwise decoding fails or yields wrong values
    image = tf.decode_raw(img_features["image_raw"], tf.uint8)
    # Tensor("Cast:0", shape=(), dtype=int32)
    label = tf.cast(img_features["label"], tf.int32)
    ################***** Preprocessing *****####################
    # Image preprocessing (resize, reshape, crop, flip, distortion, per_image_standardization ......)
    # decode_raw returns a 1-D tensor; tf.reshape restores the image dimensions (set_shape cannot change the rank)
    image = tf.reshape(image, [FLAGS.height, FLAGS.width, FLAGS.depth])
    image = tf.image.resize_images(image, [48, 160])  # resize all images to a common size ([height, width])
    ...
    ...
    ...
    ############***** Build the batcher that produces one batch of data *****##############
    # num_threads: number of threads that run the enqueue op (reading and preprocessing) in parallel; the queue provides the multithreading mechanism
    # capacity: maximum number of examples the queue can hold
    # min_after_dequeue: minimum number of elements left in the queue after a dequeue, which ensures the examples are well shuffled
    # Note: min_queue_examples is not defined in this snippet and must be set beforehand
    image_batch, label_batch = tf.train.shuffle_batch([image, label],
                                                      batch_size=batch_size,
                                                      num_threads=16,
                                                      capacity=min_queue_examples + 3 * batch_size,
                                                      min_after_dequeue=min_queue_examples)
    return image_batch, tf.reshape(label_batch, [batch_size])
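The roles of capacity and min_after_dequeue are easier to see in a small simulation. The generator below is a rough single-threaded model of the idea behind a shuffling queue, not TensorFlow's implementation: shuffle_batches is a hypothetical name, and an incomplete final batch is simply dropped.

```python
import random


def shuffle_batches(examples, batch_size, capacity, min_after_dequeue, seed=None):
    """Yield shuffled batches drawn from a bounded buffer."""
    if capacity <= min_after_dequeue:
        raise ValueError("capacity must be larger than min_after_dequeue")
    rng = random.Random(seed)
    source = iter(examples)
    buffer, batch = [], []
    exhausted = False
    while True:
        # Enqueue side: refill the buffer up to `capacity` while input remains.
        while not exhausted and len(buffer) < capacity:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        if not buffer:
            break
        # Dequeue side: draw random elements, but keep at least
        # `min_after_dequeue` buffered as long as more input can arrive.
        floor = 0 if exhausted else min_after_dequeue
        while len(buffer) > floor:
            batch.append(buffer.pop(rng.randrange(len(buffer))))
            if len(batch) == batch_size:
                yield batch
                batch = []


result = list(shuffle_batches(range(10), batch_size=2, capacity=5,
                              min_after_dequeue=3, seed=0))
print(len(result))  # 5
```

A larger min_after_dequeue means each draw picks from more candidates, so the batches are mixed more thoroughly, at the cost of more memory and a longer wait before the first batch is ready.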

3. Feeding batches into the computation graph for training, validation, testing, etc.

filenames = tf.train.match_filenames_once(os.path.join(FLAGS.data_dir, "train_*.tfrecords"))
image_batch, label_batch = read_and_decode(filenames, batch_size=BATCH_SIZE)
# tf.train.string_input_producer() defines a local variable for num_epochs (and tf.train.match_filenames_once()
# also stores its result in a local variable), so local variables must be initialized before use
init = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
with tf.Session() as sess:
    sess.run(init)
    # Create a tf.train.Coordinator() to coordinate the work of the threads
    coord = tf.train.Coordinator()
    # The queues only start to fill after tf.train.start_queue_runners() is called
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        # Run FLAGS.iteration batches
        for itr in range(FLAGS.iteration):
            # just plot one batch size
            image, label = sess.run([image_batch, label_batch])
            plot_images(image, label)
    except tf.errors.OutOfRangeError:
        print("Done training -- epoch limit reached")
    finally:
        coord.request_stop()  # tell the other threads to stop; coord.should_stop() now returns True
    # Wait for all threads to exit
    coord.join(threads)
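The request_stop / should_stop / join pattern used above can be pictured with a small pure-Python model. SimpleCoordinator below is a hypothetical stand-in built on threading.Event, not tf.train.Coordinator itself, and the workers only increment a counter where real queue runners would enqueue examples.

```python
import threading
import time


class SimpleCoordinator:
    """Minimal stand-in for the stop-signalling part of a coordinator."""

    def __init__(self):
        self._stop_event = threading.Event()

    def request_stop(self):
        self._stop_event.set()

    def should_stop(self):
        return self._stop_event.is_set()

    def join(self, threads):
        for t in threads:
            t.join()


def worker(coord, counter, lock):
    # Keep working until some thread asks everyone to stop.
    while not coord.should_stop():
        with lock:
            counter[0] += 1
        time.sleep(0.001)


coord = SimpleCoordinator()
counter, lock = [0], threading.Lock()
threads = [threading.Thread(target=worker, args=(coord, counter, lock))
           for _ in range(4)]
for t in threads:
    t.start()
try:
    time.sleep(0.05)      # stands in for the training loop
finally:
    coord.request_stop()  # signal all workers to exit
coord.join(threads)       # wait until they have actually exited
print("all workers stopped:", all(not t.is_alive() for t in threads))
```

Placing request_stop in the finally clause means the worker threads are told to exit even if the main loop ends with an exception.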
