
TensorFlow Datasets

王哲峰 / 2022-07-05



TensorFlow Dataset

The TensorFlow Datasets library

$ pip install tensorflow
$ pip install tensorflow-datasets
# tf.data, tf.data.Dataset, tf.data.Iterator
import tensorflow as tf

# tf.keras.datasets.<dataset_name>.load_data
from tensorflow.keras import datasets

# tfds.load
import tensorflow_datasets as tfds

TensorFlow Dataset API

tf.data

TensorFlow provides the tf.data module, a flexible set of dataset-building APIs that help construct data input pipelines quickly and efficiently. It is especially useful when the amount of data is very large.

The tf.data API introduces two new abstractions to TensorFlow: tf.data.Dataset and tf.data.Iterator.

tensorflow_datasets

TensorFlow Datasets (tensorflow_datasets) is a collection of datasets ready to use with TensorFlow or other Python machine learning frameworks such as Jax. All datasets are exposed as tf.data.Dataset, enabling easy-to-use, high-performance input pipelines.

tf.keras.datasets

Building a TensorFlow Dataset

tf.data.Dataset.from_tensor_slices

The most basic way to build a tf.data.Dataset is with tf.data.Dataset.from_tensor_slices()

import tensorflow as tf
import numpy as np

X = tf.constant([2013, 2014, 2015, 2016, 2017])
Y = tf.constant([12000, 14000, 15000, 16500, 17500])

dataset = tf.data.Dataset.from_tensor_slices((X, Y))
for x, y in dataset:
    print(x.numpy(), y.numpy())

tf.data.Dataset.from_tensor_slices and tf.keras.datasets.mnist.load_data

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

(train, train_label), (test, test_label) = tf.keras.datasets.mnist.load_data()
# add a channel dimension: [60000, 28, 28] -> [60000, 28, 28, 1]
train = np.expand_dims(
    train.astype(np.float32) / 255, 
    axis = -1
)
mnist_dataset = tf.data.Dataset.from_tensor_slices(
    (train, train_label)
)

for image, label in mnist_dataset.take(1):
    plt.title(label.numpy())
    plt.imshow(image.numpy()[:, :, 0])
    plt.show()

tensorflow_datasets

TensorFlow Datasets provides a collection of datasets that work with TensorFlow out of the box: it handles downloading and preparing the data and builds the tf.data.Dataset. Each dataset is implemented as a subclass of the abstract base class tfds.core.DatasetBuilder.

import tensorflow as tf
import tensorflow_datasets as tfds

# build tf.data.Dataset objects
dataset1 = tfds.load(
    "mnist", 
    split = "train", 
    shuffle_files = True
)
dataset2 = tfds.load(
    "mnist", 
    split = tfds.Split.TRAIN, 
    as_supervised = True
)

# build the input pipeline
dataset1 = dataset1 \
    .shuffle(1024) \
    .batch(32) \
    .prefetch(tf.data.experimental.AUTOTUNE)

for example in dataset1.take(1):
    image, label = example["image"], example["label"]

TFRecord

For datasets too large to fit entirely in memory, first convert the data to the TFRecord format and then load it with tf.data.TFRecordDataset().

TensorFlow built-in datasets

TensorFlow Datasets provides a collection of datasets that work with TensorFlow: it handles downloading and preparing the data and builds the tf.data.Dataset.

Each dataset is implemented as a subclass of the abstract base class tfds.core.DatasetBuilder.

Official documentation

Listing available datasets

import tensorflow as tf
import tensorflow_datasets as tfds

# all available datasets
print(tfds.list_builders()) 

['abstract_reasoning', 'aflw2k3d', 'amazon_us_reviews', 
'bair_robot_pushing_small', 'bigearthnet', 'binarized_mnist', 'binary_alpha_digits', 
'caltech101', 'caltech_birds2010', 'caltech_birds2011', 'cats_vs_dogs', 'celeb_a', 'celeb_a_hq', 'chexpert', 'cifar10', 'cifar100', 'cifar10_corrupted', 'clevr', 'cnn_dailymail', 'coco', 'coco2014', 'coil100', 'colorectal_histology', 'colorectal_histology_large', 'curated_breast_imaging_ddsm', 'cycle_gan', 
'deep_weeds', 'definite_pronoun_resolution', 'diabetic_retinopathy_detection', 'downsampled_imagenet', 'dsprites', 'dtd', 'dummy_dataset_shared_generator', 'dummy_mnist', 
'emnist', 'eurosat', 
'fashion_mnist', 'flores', 'food101', 
'gap', 'glue', 'groove', 
'higgs', 'horses_or_humans', 
'image_label_folder', 'imagenet2012', 'imagenet2012_corrupted', 'imdb_reviews', 'iris', 'kitti', 
'kmnist', 
'lfw', 'lm1b', 'lsun', 
'mnist', 'mnist_corrupted', 'moving_mnist', 'multi_nli', 
'nsynth', 
'omniglot', 'open_images_v4', 'oxford_flowers102', 'oxford_iiit_pet', 
'para_crawl', 'patch_camelyon', 'pet_finder', 'quickdraw_bitmap', 
'resisc45', 'rock_paper_scissors', 'rock_you', 
'scene_parse150', 'shapes3d', 'smallnorb', 'snli', 'so2sat', 'squad', 'stanford_dogs', 'stanford_online_products', 'starcraft_video', 'sun397', 'super_glue', 'svhn_cropped', 
'ted_hrlr_translate', 'ted_multi_translate', 'tf_flowers', 'titanic', 'trivia_qa', 
'uc_merced', 'ucf101', 
'visual_domain_decathlon', 'voc2007', 
'wikipedia', 'wmt14_translate', 'wmt15_translate', 'wmt16_translate', 'wmt17_translate', 'wmt18_translate', 'wmt19_translate', 'wmt_t2t_translate', 'wmt_translate', 
'xnli']

Built-in dataset categories

Building and loading built-in datasets

Example 1:

mnist_train = tfds.load(
    "mnist", 
    split = "train", 
    download = False, 
    data_dir = "~/.tensorflow_datasets/"
)
assert isinstance(mnist_train, tf.data.Dataset)
print(mnist_train)

Example 2: versioning

mnist_train = tfds.load(
    "mnist:1.*.*", 
    split = "train", 
    download = False, 
    data_dir = "~/.tensorflow_datasets/"
)
assert isinstance(mnist_train, tf.data.Dataset)
print(mnist_train)

Built-in dataset feature dictionaries

Every tensorflow_datasets dataset yields a feature dictionary that maps feature names to Tensor values. A typical dataset (such as MNIST) has two keys: "image" and "label".

Example:

import numpy as np
import matplotlib.pyplot as plt

mnist_train = tfds.load(
    "mnist", 
    split = "train", 
    download = False, 
    data_dir = "~/.tensorflow_datasets/"
)
for mnist_example in mnist_train.take(1):
   image, label = mnist_example["image"], mnist_example["label"]
   plt.imshow(
      image.numpy()[:, :, 0].astype(np.float32),
      cmap = plt.get_cmap("gray")
   )
   print("Label: %d" % label.numpy())
   plt.show()

DatasetBuilder

tfds.load (tensorflow_datasets.load) is really just a simple convenience wrapper around DatasetBuilder.

Example:

mnist_builder = tfds.builder("mnist")
mnist_builder.download_and_prepare()
mnist_train = mnist_builder.as_dataset(split = "train")
mnist_train

Built-in dataset input pipelines

Once you have a tf.data.Dataset object, you can use the tf.data API to define the rest of an input pipeline suitable for model training.

Example:

mnist_train = mnist_train \
    .repeat() \
    .shuffle(1024) \
    .batch(32)

# prefetch lets the input pipeline prepare the next batches while the model is training
mnist_train = mnist_train \
    .repeat() \
    .shuffle(1024) \
    .batch(32) \
    .prefetch(tf.data.experimental.AUTOTUNE)

Built-in dataset info

Example:

# method 1
mnist_builder = tfds.builder("mnist")
info = mnist_builder.info

print(info)
print(info.features)
print(info.features["label"].num_classes)
print(info.features["label"].names)
# method 2
mnist_test, info = tfds.load(
    "mnist", 
    split = "test", 
    with_info = True
)
print(info)

Built-in dataset visualization

Example:

fig = tfds.show_examples(mnist_test, info)

TensorFlow Dataset preprocessing

Dataset preprocessing API overview

tf.data.Dataset

The tf.data.Dataset class provides a number of dataset preprocessing methods, such as map(), batch(), shuffle(), and prefetch(); the most common ones are demonstrated below.

Sequence Preprocessing

Text Preprocessing

Image Preprocessing

Dataset processing examples

tf.data.Dataset.map()

Use tf.data.Dataset.map() to rotate every image by 90 degrees

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# data preprocessing function
def rot90(image, label):
   image = tf.image.rot90(image)
   return image, label

# data: build a tf.data.Dataset from the raw MNIST arrays
(train, train_label), (_, _) = tf.keras.datasets.mnist.load_data()
train = np.expand_dims(train.astype(np.float32) / 255, axis = -1)
mnist_dataset = tf.data.Dataset.from_tensor_slices((train, train_label))

# data preprocessing
mnist_dataset = mnist_dataset.map(rot90)

# data visual
for image, label in mnist_dataset:
   plt.title(label.numpy())
   plt.imshow(image.numpy()[:, :, 0])
   plt.show()

tf.data.Dataset.batch()

Use tf.data.Dataset.batch() to split the dataset into batches of size 4

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# data: build a tf.data.Dataset from the raw MNIST arrays
(train, train_label), (_, _) = tf.keras.datasets.mnist.load_data()
train = np.expand_dims(train.astype(np.float32) / 255, axis = -1)
mnist_dataset = tf.data.Dataset.from_tensor_slices((train, train_label))

# data preprocessing
mnist_dataset = mnist_dataset.batch(4)

# data visual
# images: [4, 28, 28, 1], labels: [4]
for images, labels in mnist_dataset:
   fig, axs = plt.subplots(1, 4)
   for i in range(4):
      axs[i].set_title(labels.numpy()[i])
      axs[i].imshow(images.numpy()[i, :, :, 0])
   plt.show()

tf.data.Dataset.shuffle()

Use tf.data.Dataset.shuffle() to shuffle the data before batching, with a buffer size of 10000

In general, if the ordering of the dataset is already fairly random, a small buffer is enough; otherwise a larger buffer is needed.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# data: build a tf.data.Dataset from the raw MNIST arrays
(train, train_label), (_, _) = tf.keras.datasets.mnist.load_data()
train = np.expand_dims(train.astype(np.float32) / 255, axis = -1)
mnist_dataset = tf.data.Dataset.from_tensor_slices((train, train_label))

# data preprocessing
mnist_dataset = mnist_dataset.shuffle(buffer_size = 10000).batch(4)

# data visual
# images: [4, 28, 28, 1], labels: [4]
for epoch in range(2):
   for images, labels in mnist_dataset:
      fig, axs = plt.subplots(1, 4)
      for i in range(4):
         axs[i].set_title(labels.numpy()[i])
         axs[i].imshow(images.numpy()[i, :, :, 0])
      plt.show()

tf.data.Dataset.prefetch()

Use tf.data.Dataset.prefetch() to parallelize data preparation and model training and improve throughput

import numpy as np
import tensorflow as tf

# data preprocessing function
def rot90(image, label):
   image = tf.image.rot90(image)
   return image, label

# data: build a tf.data.Dataset from the raw MNIST arrays
(train, train_label), (_, _) = tf.keras.datasets.mnist.load_data()
train = np.expand_dims(train.astype(np.float32) / 255, axis = -1)
mnist_dataset = tf.data.Dataset.from_tensor_slices((train, train_label))

# data preprocessing
# enable prefetching so batches are prepared while the model is training
mnist_dataset = mnist_dataset.prefetch(
    buffer_size = tf.data.experimental.AUTOTUNE
)

# run the map transformation in parallel on several CPU threads
mnist_dataset = mnist_dataset.map(
    map_func = rot90, 
    num_parallel_calls = 2
)
mnist_dataset = mnist_dataset.map(
    map_func = rot90, 
    num_parallel_calls = tf.data.experimental.AUTOTUNE
)

tf.data.Dataset for and iter

Accessing and using the elements of a tf.data.Dataset

# iterate with a for loop
dataset = tf.data.Dataset.from_tensor_slices((A, B, C, ...))
for a, b, c ... in dataset:
   pass

# or iterate with an explicit iterator
dataset = tf.data.Dataset.from_tensor_slices((A, B, C, ...))
it = iter(dataset)
a_0, b_0, c_0, ... = next(it)
a_1, b_1, c_1, ... = next(it)
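
A concrete, runnable version of the two iteration patterns above (the tensors X and Y are just illustrative):

import tensorflow as tf

X = tf.constant([2013, 2014, 2015, 2016, 2017])
Y = tf.constant([12000, 14000, 15000, 16500, 17500])
dataset = tf.data.Dataset.from_tensor_slices((X, Y))

# iterate with a for loop
for x, y in dataset:
    print(x.numpy(), y.numpy())

# iterate with an explicit iterator
it = iter(dataset)
x_0, y_0 = next(it)
x_1, y_1 = next(it)
print(x_0.numpy(), y_0.numpy(), x_1.numpy(), y_1.numpy())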

Images

tf.keras.preprocessing.image.ImageDataGenerator generates batches of tensor image data with real-time data augmentation

tf.keras.preprocessing.image.ImageDataGenerator(
    featurewise_center = False, # set the feature-wise mean of the data to 0
    samplewise_center = False,  # set the mean of each sample to 0
    featurewise_std_normalization = False, # divide inputs by the feature-wise std
    samplewise_std_normalization = False,  # divide each sample by its own std
    zca_whitening = False, # apply ZCA whitening
    zca_epsilon = 1e-06,  # epsilon for ZCA whitening
    rotation_range = 0,  # range (in degrees) for random rotations
    width_shift_range = 0.0,  # range for random horizontal shifts
    height_shift_range = 0.0,  # range for random vertical shifts
    brightness_range = None,  # range for random brightness
    shear_range = 0.0,  # shear intensity
    zoom_range = 0.0,  # range for random zoom
    channel_shift_range = 0.0,  # range for random channel shifts
    fill_mode = 'nearest',  # how points outside the boundaries are filled
    cval = 0.0, 
    horizontal_flip = False,  # randomly flip inputs horizontally
    vertical_flip = False,  # randomly flip inputs vertically
    rescale = None,
    preprocessing_function = None, 
    data_format = None, 
    validation_split = 0.0,
    dtype = None,
)

from keras.datasets import cifar10
from keras import utils
from keras.preprocessing.image import ImageDataGenerator

# model training parameters
num_classes = 10
data_augmentation = True
batch_size = 32
epochs = 20

# data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255
x_test /= 255
y_train = utils.to_categorical(y_train, num_classes = num_classes)
y_test = utils.to_categorical(y_test, num_classes = num_classes)

# model training (assumes `model` is a compiled Keras model defined elsewhere)
if not data_augmentation:
    print("Not using data augmentation.")
    model.fit(
        x_train, y_train,
        batch_size = batch_size,
        epochs = epochs,
        validation_data = (x_test, y_test),
        shuffle = True
    )
else:
    print("Using real-time data augmentation.")
    # This will do preprocessing and realtime data augmentation:
    datagen = ImageDataGenerator(
        featurewise_center = False,
        samplewise_center = False,
        featurewise_std_normalization = False,
        samplewise_std_normalization = False,
        zca_whitening = False,
        zca_epsilon = 1e-6,
        rotation_range = 0,
        width_shift_range = 0.1,
        height_shift_range = 0.1,
        shear_range = 0.,
        zoom_range = 0.,
        channel_shift_range = 0,
        fill_mode = "nearest",
        cval = 0.,
        horizontal_flip = True,
        vertical_flip = False,
        rescale = None,
        preprocessing_function = None,
        data_format = None,
        validation_split = 0.0
    )
    # compute any statistics needed for featurewise normalization
    datagen.fit(x_train)
    model.fit_generator(
        datagen.flow(x_train, y_train, batch_size = batch_size),
        epochs = epochs,
        validation_data = (x_test, y_test),
        workers = 4
    )

from keras.datasets import cifar10
from keras import utils
from keras.preprocessing.image import ImageDataGenerator

# model training parameters
batch_size = 32
epochs = 20
num_classes = 10
data_augmentation = True

# data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255
x_test /= 255
y_train = utils.to_categorical(y_train, num_classes = num_classes)
y_test = utils.to_categorical(y_test, num_classes = num_classes)

# model training (assumes `model` is a compiled Keras model defined elsewhere)
datagen = ImageDataGenerator(featurewise_center = True,
                             featurewise_std_normalization = True,
                             rotation_range = 20,
                             width_shift_range = 0.2,
                             height_shift_range = 0.2,
                             horizontal_flip = True)
# compute the statistics needed for featurewise normalization
datagen.fit(x_train)

for e in range(epochs):
      print("Epoch", e)
      batches = 0
      for x_batch, y_batch in datagen.flow(x_train, y_train, batch_size = batch_size):
         model.fit(x_batch, y_batch)
         batches += 1
         if batches >= len(x_train) / batch_size:
            break

# flow images from directories on disk (assumes `model` is a compiled Keras model defined elsewhere)
train_datagen = ImageDataGenerator(rescale = 1.0 / 255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1.0 / 255)

train_generator = train_datagen \
      .flow_from_directory("data/train",
                           target_size = (150, 150),
                           batch_size = 32,
                           class_mode = "binary")
validation_generator = test_datagen \
      .flow_from_directory("data/validation",
                           target_size = (150, 150),
                           batch_size = 32,
                           class_mode = "binary")

model.fit_generator(train_generator,
                    steps_per_epoch = 2000,
                    epochs = 50,
                    validation_data = validation_generator,
                    validation_steps = 800)

# we create two instances with the same arguments
# (assumes `images` and `masks` are numpy arrays of matching length and `model` is defined elsewhere)
data_gen_args = dict(featurewise_center=True,
                     featurewise_std_normalization=True,
                     rotation_range=90,
                     width_shift_range=0.1,
                     height_shift_range=0.1,
                     zoom_range=0.2)
image_datagen = ImageDataGenerator(**data_gen_args)
mask_datagen = ImageDataGenerator(**data_gen_args)

# Provide the same seed and keyword arguments to the fit and flow methods
seed = 1
image_datagen.fit(images, augment=True, seed=seed)
mask_datagen.fit(masks, augment=True, seed=seed)

image_generator = image_datagen.flow_from_directory(
      'data/images',
      class_mode=None,
      seed=seed)

mask_generator = mask_datagen.flow_from_directory(
      'data/masks',
      class_mode=None,
      seed=seed)

# combine generators into one which yields image and masks
train_generator = zip(image_generator, mask_generator)

model.fit_generator(
      train_generator,
      steps_per_epoch=2000,
      epochs=50)

Text

tf.data.TextLineDataset is usually used to build datasets from text files, where each line of the file becomes one example. This suits most line-based text data (for example, poems or error logs).

import os
import tensorflow as tf
import tensorflow_datasets as tfds

DIRECTORY_URL = "https://storage.googleapis.com/download.tensorflow.org/data/illiad/"
FILE_NAMES = ["cowper.txt", "derby.txt", "butler.txt"]
for name in FILE_NAMES:
   text_dir = tf.keras.utils.get_file(name, origin = DIRECTORY_URL + name)

def labeler(example, index):
   return example, tf.cast(index, tf.int64)

parent_dir = os.path.dirname(text_dir)
labeled_data_sets = []
for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration = False)

for ex in all_labeled_data.take(5):
    print(ex)

Unicode
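
The Unicode utilities live under tf.strings. A minimal sketch of decoding strings into Unicode code points (the sample strings are illustrative):

import tensorflow as tf

# decode UTF-8 strings into a ragged tensor of Unicode code points
sentences = tf.constant(["Hello", "你好"])
code_points = tf.strings.unicode_decode(sentences, input_encoding = "UTF-8")
print(code_points)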

TF.Text

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

Building data pipelines

Data pipelines can be built from NumPy arrays, Pandas DataFrames, Python generators, CSV files, text files, file paths, TFRecord files, and more.

Building a data pipeline from a NumPy array

import numpy as np
from sklearn import datasets
import tensorflow as tf


iris = datasets.load_iris()

iris_dataset = tf.data.Dataset.from_tensor_slices(
    (iris["data"], iris["target"])
)

for features, label in iris_dataset.take(5):
    print([features, label])
[
    <tf.Tensor: shape=(4,), dtype=float64, numpy=array([5.1, 3.5, 1.4, 0.2])>, 
    <tf.Tensor: shape=(), dtype=int64, numpy=0>
]
[
    <tf.Tensor: shape=(4,), dtype=float64, numpy=array([4.9, 3. , 1.4, 0.2])>, 
    <tf.Tensor: shape=(), dtype=int64, numpy=0>
]
[
    <tf.Tensor: shape=(4,), dtype=float64, numpy=array([4.7, 3.2, 1.3, 0.2])>, 
    <tf.Tensor: shape=(), dtype=int64, numpy=0>
]
[
    <tf.Tensor: shape=(4,), dtype=float64, numpy=array([4.6, 3.1, 1.5, 0.2])>, 
    <tf.Tensor: shape=(), dtype=int64, numpy=0>
]
[
    <tf.Tensor: shape=(4,), dtype=float64, numpy=array([5. , 3.6, 1.4, 0.2])>, 
    <tf.Tensor: shape=(), dtype=int64, numpy=0>
]

Building a data pipeline from a Pandas DataFrame

import pandas as pd
from sklearn import datasets
import tensorflow as tf

iris = datasets.load_iris()
iris_df = pd.DataFrame(
    iris["data"],
    columns = iris.feature_names,
)

iris_dataset = tf.data.Dataset.from_tensor_slices(
    (iris_df.to_dict("list"), iris["target"])
)

for features, label in iris_dataset.take(3):
    print([features, label])
[
    {
        'sepal length (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=5.1>, 
        'sepal width (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=3.5>, 
        'petal length (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=1.4>, 
        'petal width (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=0.2>
    }, 
    <tf.Tensor: shape=(), dtype=int64, numpy=0>
]
[
    {
        'sepal length (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=4.9>, 
        'sepal width (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=3.0>, 
        'petal length (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=1.4>, 
        'petal width (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=0.2>
    }, 
    <tf.Tensor: shape=(), dtype=int64, numpy=0>
]
[
    {
        'sepal length (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=4.7>, 
        'sepal width (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=3.2>, 
        'petal length (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=1.3>, 
        'petal width (cm)': <tf.Tensor: shape=(), dtype=float32, numpy=0.2>
    }, 
    <tf.Tensor: shape=(), dtype=int64, numpy=0>
]

Building a data pipeline from a Python generator
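
A minimal sketch of building a dataset from a Python generator with tf.data.Dataset.from_generator (assumes TensorFlow 2.4+ for output_signature; the generator itself is purely illustrative):

import numpy as np
import tensorflow as tf

def count_generator():
    # yield (features, label) pairs one at a time
    for i in range(5):
        yield np.array([i, i + 1], dtype = np.float32), i

ds = tf.data.Dataset.from_generator(
    count_generator,
    output_signature = (
        tf.TensorSpec(shape = (2,), dtype = tf.float32),
        tf.TensorSpec(shape = (), dtype = tf.int32),
    ),
)
for features, label in ds:
    print(features.numpy(), label.numpy())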


Building a data pipeline from CSV files
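
A minimal sketch using tf.data.experimental.make_csv_dataset; the file path and the "Survived" label column reuse the Titanic CSV referenced below and are assumptions:

import tensorflow as tf

ds = tf.data.experimental.make_csv_dataset(
    file_pattern = "./data/titanic/train.csv",  # assumed path
    batch_size = 8,
    label_name = "Survived",                    # assumed label column
    num_epochs = 1,
    ignore_errors = True,
)
for features, label in ds.take(1):
    print(label)
    for name, value in features.items():
        print(name, value.numpy())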


Building a data pipeline from text files

dataset = tf.data.TextLineDataset(
    filenames = [
        "./data/titanic/train.csv",
        "./data/titanic/test.csv",
    ],
).skip(1)  # skip the header row (note: across multiple files this only skips the very first line)

for line in dataset.take(5):
    print(line)
tf.Tensor(b'493,0,1,"Molson, Mr. Harry Markland",male,55.0,0,0,113787,30.5,C30,S', shape=(), dtype=string)
tf.Tensor(b'53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C', shape=(), dtype=string)
tf.Tensor(b'388,1,2,"Buss, Miss. Kate",female,36.0,0,0,27849,13.0,,S', shape=(), dtype=string)
tf.Tensor(b'192,0,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S', shape=(), dtype=string)
tf.Tensor(b'687,0,3,"Panula, Mr. Jaako Arnold",male,14.0,4,1,3101295,39.6875,,S', shape=(), dtype=string)

Building a data pipeline from file paths
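
A minimal sketch using tf.data.Dataset.list_files plus a map that loads and decodes each image; the glob pattern reuses the cats_vs_dogs layout from the TFRecord example below and is an assumption:

import tensorflow as tf

def load_image(path):
    # read the file and decode it as a JPEG image
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image)
    image = tf.image.resize(image, [256, 256]) / 255.0
    return image

ds = tf.data.Dataset.list_files("./datasets/cats_vs_dogs/train/*/*.jpg")  # assumed glob pattern
ds = ds.map(load_image, num_parallel_calls = tf.data.experimental.AUTOTUNE)

for image in ds.take(1):
    print(image.shape)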


Building a data pipeline from TFRecord files

The TFRecord format

TFRecord is TensorFlow's dataset storage format. Once a dataset is converted to TFRecord, TensorFlow can read and process it efficiently, which helps with large-scale model training.

A TFRecord file can be thought of as a list of serialized tf.train.Example records, where each tf.train.Example consists of a dictionary of tf.train.Feature values.

# dataset.tfrecords
[
{  # example 1 (tf.train.Example)
    'feature_1': tf.train.Feature,
    ...
    'feature_k': tf.train.Feature,
},
...
{  # example N (tf.train.Example)
    'feature_1': tf.train.Feature,
    ...
    'feature_k': tf.train.Feature,
}, 
]

Saving TFRecord files

Steps for saving a TFRecord file

To convert datasets of various formats into TFRecord, apply the following steps to each element of the dataset:

  1. Read the element into memory
  2. Convert the element into a tf.train.Example object
    • each tf.train.Example consists of a dictionary of tf.train.Feature values, so the Feature dictionary has to be built first
  3. Serialize the tf.train.Example object into a string and write it to the TFRecord file through a predefined tf.io.TFRecordWriter

TFRecord saving example

import tensorflow as tf
import os

# root
root_dir = "/Users/zfwang/project/machinelearning/deeplearning"
# project
project_path = os.path.join(root_dir, "deeplearning/src/tensorflow_src")
# model save
models_path = os.path.join(project_path, "save")
# data
cats_and_dogs_dir = os.path.join(root_dir, "datasets/cats_vs_dogs")
data_dir = os.path.join(root_dir, "datasets/cats_vs_dogs/cats_and_dogs_small")
# train data
train_dir = os.path.join(data_dir, "train")
train_cats_dir = os.path.join(train_dir, "cat")
train_dogs_dir = os.path.join(train_dir, "dog")
# tfrecord
tfrecord_file = os.path.join(cats_and_dogs_dir, "train.tfrecord")

# training data
train_cat_filenames = [os.path.join(train_cats_dir, filename) for filename in os.listdir(train_cats_dir)]
train_dog_filenames = [os.path.join(train_dogs_dir, filename) for filename in os.listdir(train_dogs_dir)]
train_filenames = train_cat_filenames + train_dog_filenames
train_labels = [0] * len(train_cat_filenames) + [1] * len(train_dog_filenames)

# read each image, build the tf.train.Feature dict and tf.train.Example, then serialize and write it to the TFRecord file
with tf.io.TFRecordWriter(tfrecord_file) as writer:
    for filename, label in zip(train_filenames, train_labels):
        # read the image into memory; image is a byte string
        image = open(filename, "rb").read()
        # build the tf.train.Feature dictionary
        feature = {
                # the image is a Bytes object
                "image": tf.train.Feature(bytes_list = tf.train.BytesList(value = [image])),
                "label": tf.train.Feature(int64_list = tf.train.Int64List(value = [label]))
        }
        # build the Example from the feature dict
        example = tf.train.Example(features = tf.train.Features(feature = feature))
        # serialize the Example and write it to the TFRecord file
        writer.write(example.SerializeToString())

Reading TFRecord files

Steps for reading a TFRecord file

  1. Read the raw TFRecord file through tf.data.TFRecordDataset to obtain a tf.data.Dataset object; at this point the tf.train.Example objects in the file have not yet been deserialized
  2. Apply tf.io.parse_single_example to each serialized tf.train.Example string via tf.data.Dataset.map in order to deserialize it

TFRecord reading example

import tensorflow as tf
import os
import matplotlib.pyplot as plt

# root
root_dir = "/Users/zfwang/project/machinelearning/deeplearning"
# data
cats_and_dogs_dir = os.path.join(root_dir, "datasets/cats_vs_dogs")
# tfrecord
tfrecord_file = os.path.join(cats_and_dogs_dir, "train.tfrecord")

def _parse_example(example_string):
   """
   Decode each serialized tf.train.Example in the TFRecord file
   """
   # define the Feature structure, which tells the decoder the type of each Feature
   feature_description = {
      "image": tf.io.FixedLenFeature([], tf.string),
      "label": tf.io.FixedLenFeature([], tf.int64)
   }
   feature_dict = tf.io.parse_single_example(example_string, feature_description)
   # decode the JPEG image
   feature_dict["image"] = tf.io.decode_jpeg(feature_dict["image"])
   return feature_dict["image"], feature_dict["label"]

# read the TFRecord file
raw_dataset = tf.data.TFRecordDataset(tfrecord_file)
dataset = raw_dataset.map(_parse_example)

for image, label in dataset:
   plt.title("cat" if label == 0 else "dog")
   plt.imshow(image.numpy())
   plt.show()

Other tf.io formats

tf.TensorArray

tf.TensorArray overview

In some network architectures, especially ones involving time series, a sequence of tensors may need to be stored in an array for further processing.

TensorFlow provides tf.TensorArray, a dynamic array that also works with the computational-graph features of TensorFlow.

import tensorflow as tf

@tf.function
def array_write_and_read():
   arr = tf.TensorArray(dtype = tf.float32, size = 3)
   arr = arr.write(0, tf.constant(0.0))
   arr = arr.write(1, tf.constant(1.0))
   arr = arr.write(2, tf.constant(2.0))
   arr_0 = arr.read(0)
   arr_1 = arr.read(1)
   arr_2 = arr.read(2)
   return arr_0, arr_1, arr_2

a, b, c = array_write_and_read()
print(a, b, c)

tf.TensorArray write

Because graph mode has to be supported, the return value of tf.TensorArray.write() must not be discarded. In graph execution mode, the array must be written in the following form so that the write is recorded as a graph operation and returned back into arr:

arr = arr.write(index, value)  # correct: keep the returned TensorArray
arr.write(index, value)        # incorrect: the write is lost if the return value is discarded

Improving pipeline performance

Training deep learning models is often very time-consuming. The time is mostly spent in two places: preparing the data and iterating on the model parameters. When data preparation is the bottleneck, a more efficient input pipeline can speed up training considerably.

Here are some techniques for building efficient data pipelines:

prefetch
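
A minimal sketch of prefetch, which overlaps data preparation with model training (tf.data.Dataset.range stands in for a real dataset):

import tensorflow as tf

ds = tf.data.Dataset.range(10)
ds = ds.prefetch(buffer_size = tf.data.experimental.AUTOTUNE)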

interleave
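
A minimal sketch of interleave, which reads several files in parallel and interleaves their records; the shard file names are assumptions:

import tensorflow as tf

filenames = ["./data/part-0.csv", "./data/part-1.csv"]  # assumed shard files
ds = tf.data.Dataset.from_tensor_slices(filenames).interleave(
    lambda path: tf.data.TextLineDataset(path).skip(1),
    cycle_length = 2,
    num_parallel_calls = tf.data.experimental.AUTOTUNE,
)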

map num_parallel_calls
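
A minimal sketch of map with num_parallel_calls, which runs the transformation on several CPU threads in parallel:

import tensorflow as tf

def square(x):
    return x * x

ds = tf.data.Dataset.range(1000)
ds = ds.map(square, num_parallel_calls = tf.data.experimental.AUTOTUNE)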

cache
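
A minimal sketch of cache, which stores the output of the preceding transformations in memory (or on disk when a filename is given) so later epochs skip the recomputation:

import tensorflow as tf

ds = tf.data.Dataset.range(1000).map(lambda x: x * x)
ds = ds.cache()                 # cache in memory
# ds = ds.cache("./cache_file") # or cache to a file on disk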

map batch
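
A minimal sketch of batching before map, so a vectorizable transformation runs once per batch instead of once per element:

import tensorflow as tf

ds = tf.data.Dataset.range(1000)
# element-wise: one map call per element
slow = ds.map(lambda x: x * x).batch(32)
# vectorized: one map call per batch of 32 elements
fast = ds.batch(32).map(lambda x: x * x)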

Feature columns

Feature columns are typically used for feature engineering on structured data; they are generally not used for image or text data.
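
A minimal sketch of a numeric feature column fed through a tf.keras.layers.DenseFeatures layer; the feature name and values are illustrative:

import tensorflow as tf

# a dictionary of structured features, e.g. as produced by a CSV pipeline
features = {"age": tf.constant([[25.0], [32.0], [47.0]])}

# describe how the raw feature is fed into the model
age_column = tf.feature_column.numeric_column("age")

# turn the feature dictionary into a dense input tensor
dense_layer = tf.keras.layers.DenseFeatures([age_column])
print(dense_layer(features))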