
TensorFlow 模型编译和训练

王哲峰 / 2022-07-15


目录

回调函数

模型训练

TensorFlow 训练模型通常有 3 种方法:

    使用 GPU 训练模型
        使用单 GPU 训练模型
        使用多 GPU 训练模型
    使用 TPU 训练模型

损失函数

一般来说,监督学习的目标函数由损失函数和正则化项组成,

$$Objective = Loss + Regularization$$

对于 Keras 模型,目标函数中的正则化项一般在各层中指定(例如通过 Dense 层的 kernel_regularizer、bias_regularizer 等参数),损失函数则在模型编译时指定
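
下面给出一个简单示意(网络结构和超参数均为随意选取的示例):正则化项通过层的 kernel_regularizer 参数指定,损失函数则在 compile() 时指定。

import tensorflow as tf
from tensorflow.keras import layers, regularizers

# 正则化项:在层中通过 kernel_regularizer 等参数指定(此处以 L2 正则化为例)
model = tf.keras.Sequential([
    layers.Dense(
        64,
        activation = "relu",
        kernel_regularizer = regularizers.l2(0.01),
        input_shape = (10,)
    ),
    layers.Dense(1)
])

# 损失函数:在模型编译时指定
model.compile(optimizer = "adam", loss = "mse")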

如果有需要,也可以自定义损失函数。自定义损失函数需要接收两个张量 y_true、y_pred 作为输入参数,并输出一个标量作为损失函数值

内置损失函数

内置损失函数的两种形式

内置的损失函数一般有类的实现和函数的实现两种形式。以类的形式为例,其构造参数大致如下(函数形式的用法见后文的对照示例):

tf.keras.losses.Class(
    from_logits = False,
    label_smoothing = 0,
    reduction = "auto",
    name = ""
)
import tensorflow as tf

# data
y_true = [[0., 1.], [0., 0.]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]

# reduction="auto" or "sum_over_batch_size"
bce = tf.keras.losses.BinaryCrossentropy()
bce(y_true,y_pred).numpy()

# 使用 sample_weight 对每个样本的损失加权
bce = tf.keras.losses.BinaryCrossentropy()
bce(y_true,y_pred,sample_weight = [1,0]).numpy()

# reduction=sum
bce = tf.keras.losses.BinaryCrossentropy(
    reduction = tf.keras.losses.Reduction.SUM
)
bce(y_true,y_pred).numpy()

# reduction=none
bce = tf.keras.losses.BinaryCrossentropy(
    reduction = tf.keras.losses.Reduction.NONE
)
bce(y_true,y_pred).numpy()
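
作为对照,同样的计算也可以用函数形式直接完成,下面是一个最小示意:

import tensorflow as tf

y_true = [[0., 1.], [0., 0.]]
y_pred = [[0.6, 0.4], [0.4, 0.6]]

# 函数形式:直接返回每个样本的损失,形状为 (batch_size,)
loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
print(loss.numpy())

# 如需得到标量损失,可自行做归约
print(tf.reduce_mean(loss).numpy())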

回归损失

二分类损失

多分类损失

概率损失
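
下面按上述类别举例列出一些常用的内置损失函数(不完全列举,具体以 tf.keras.losses 官方文档为准):

from tensorflow.keras import losses

# 回归损失
mse = losses.MeanSquaredError()         # 均方误差(函数形式:mean_squared_error)
mae = losses.MeanAbsoluteError()        # 平均绝对误差(函数形式:mean_absolute_error)
huber = losses.Huber()                  # Huber 损失,对离群点更稳健

# 二分类损失
bce = losses.BinaryCrossentropy()       # 二元交叉熵
hinge = losses.Hinge()                  # 合页损失

# 多分类损失
cce = losses.CategoricalCrossentropy()          # 类别交叉熵,标签为 one-hot 编码
scce = losses.SparseCategoricalCrossentropy()   # 稀疏类别交叉熵,标签为整数编码

# 概率损失
kld = losses.KLDivergence()             # KL 散度
poisson = losses.Poisson()              # 泊松损失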

创建自定义损失函数

如果有需要,也可以自定义损失函数。自定义损失函数需要接收两个张量 y_true、y_pred 作为输入参数,并输出一个标量作为损失函数值

类形式的损失函数

自定义损失函数需要继承 tf.keras.losses.Loss 类,重写 call 方法即可:call 方法接收真实值 y_true 和模型预测值 y_pred,输出两者之间按自定义规则计算出的损失值

Focal Loss 是一种对 binary_crossentropy 的改进损失函数形式。 它在样本不均衡和存在较多易分类的样本时相比 binary_crossentropy 具有明显的优势。 它有两个可调参数,alpha 参数和 gamma 参数。其中 alpha 参数主要用于衰减负样本的权重, gamma 参数主要用于衰减容易训练样本的权重。从而让模型更加聚焦在正样本和困难样本上。 这就是为什么这个损失函数叫做 Focal Loss,其数学表达式如下:

$$focal\_loss(y,p) = \begin{cases} -\alpha (1-p)^{\gamma}\log(p) & \text{if y = 1}\\ -(1-\alpha) p^{\gamma}\log(1-p) & \text{if y = 0} \end{cases} $$

import tensorflow as tf
from tensorflow.keras import losses

class FocalLoss(losses.Loss):

    def __init__(self,
                 gamma = 2.0,
                 alpha = 0.75,
                 name = "focal_loss"):
        super(FocalLoss, self).__init__(name = name)
        self.gamma = gamma
        self.alpha = alpha
    
    def call(self,y_true,y_pred):
        bce = tf.losses.binary_crossentropy(y_true,y_pred)
        p_t = (y_true * y_pred) + ((1 - y_true) * (1 - y_pred))
        alpha_factor = y_true * self.alpha + (1 - y_true) * (1 - self.alpha)
        modulating_factor = tf.pow(1.0 - p_t,self.gamma)
        loss = tf.reduce_sum(
            alpha_factor * modulating_factor * bce,
            axis = -1
        )
        return loss

函数形式的损失函数

import tensorflow as tf
from tensorflow.keras import losses

def focal_loss(gamma = 2.0,alpha = 0.75):
    def focal_loss_fixed(y_true,y_pred):
        bce = tf.losses.binary_crossentropy(y_true,y_pred)
        p_t = (y_true * y_pred) + ((1- y_true) * (1 - y_pred))
        alpha_factor = y_true * alpha + (1 - y_true) * (1 - alpha)
        modulating_factor = tf.pow(1.0 - p_t,gamma)
        loss = tf.reduce_sum(
            alpha_factor * modulating_factor * bce,
            axis = -1
        )
        return loss

    return focal_loss_fixed
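
自定义损失函数的用法与内置损失函数相同:在 compile() 中既可以传入类的实例,也可以传入函数。下面是一个简单示意(沿用上文定义的 FocalLoss 和 focal_loss,模型结构仅为演示用的假设):

import tensorflow as tf
from tensorflow.keras import layers

# 构造一个简单的二分类模型(仅用于演示)
model = tf.keras.Sequential([
    layers.Dense(16, activation = "relu", input_shape = (10,)),
    layers.Dense(1, activation = "sigmoid")
])

# 传入类形式的自定义损失函数
model.compile(loss = FocalLoss(gamma = 2.0, alpha = 0.75), optimizer = "adam")

# 或者传入函数形式的自定义损失函数
model.compile(loss = focal_loss(gamma = 2.0, alpha = 0.75), optimizer = "adam")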

损失函数的使用—compile() & fit()

通过实例化一个损失类创建损失函数

from tensorflow import keras
from tensorflow.keras import layers,losses

# 模型构建
model = keras.Sequential()
model.add(
    layers.Dense(
        64,
        kernel_initializer = "uniform",
        input_shape = (10,)
    )
)
model.add(layers.Activation("softmax"))

# 模型编译
model.compile(
    # 模型输出已经过 softmax,不是 logits,因此 from_logits 应为 False
    loss = losses.SparseCategoricalCrossentropy(from_logits = False),
    optimizer = "adam",
    metrics = ["acc"]
)

直接使用损失函数

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.losses import sparse_categorical_crossentropy

# 模型构建
model = keras.Sequential()
model.add(
    layers.Dense(
        64,
        kernel_initializer = "uniform",
        input_shape = (10,)
    )
)
model.add(layers.Activation("softmax"))

# 模型编译:直接传入损失函数(也可以传入字符串 "sparse_categorical_crossentropy")
model.compile(
    loss = sparse_categorical_crossentropy,
    optimizer = "adam",
    metrics = ["acc"]
)

损失函数的使用—单独使用

# 函数形式:直接计算每个样本的损失
tf.keras.losses.mean_squared_error(tf.ones((2, 2)), tf.zeros((2, 2)))

# 类形式:默认 reduction = "sum_over_batch_size",对 batch 求平均
loss_fn = tf.keras.losses.MeanSquaredError(reduction = "sum_over_batch_size")
loss_fn(tf.ones((2,2)),tf.zeros((2,2)))

loss_fn = tf.keras.losses.MeanSquaredError(reduction = "sum")
loss_fn(tf.ones((2,2)),tf.zeros((2,2)))

loss_fn = tf.keras.losses.MeanSquaredError(reduction = "none")
loss_fn(tf.ones((2,2)),tf.zeros((2,2)))

loss_fn = tf.keras.losses.mean_squared_error
loss_fn(tf.ones((2,2,)),tf.zeros((2,2)))

loss_fn = tf.keras.losses.MeanSquaredError()
loss_fn(tf.ones((2,2)),tf.zeros((2,2)))

TensorFlow 优化器

模型优化算法的选择直接关系到最终模型的性能,有时候效果不好, 未必是特征的问题或者模型设计的问题,很可能就是优化算法的问题

深度学习优化算法大概经历了以下的发展历程:

SGD -> SGDM -> NAG -> Adagrad -> Adadelta(RMSprop) -> Adam -> Nadam

对于一般的新手,优化器直接使用 Adam,并使用默认参数就可以了。 如果要追求评价指标效果,可能会偏爱前期使用 Adam 优化器快速下降, 后期使用 SGD 并精调优化器参数得到更好的结果
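
下面是一个两阶段训练的简单示意(模型与数据均为随意构造的占位示例,轮数和学习率仅供参考):先用 Adam 的默认参数快速下降,再重新编译换用 SGD 精调;重新 compile 只会重置优化器状态,不会重置已经训练好的权重。

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 随机构造的演示数据:100 个样本,10 维特征,3 个类别
x_train = np.random.random((100, 10)).astype("float32")
y_train = np.random.randint(0, 3, size = (100,))

model = keras.Sequential([
    layers.Dense(32, activation = "relu", input_shape = (10,)),
    layers.Dense(3, activation = "softmax")
])

# 第一阶段:Adam 默认参数快速下降
model.compile(optimizer = "adam", loss = "sparse_categorical_crossentropy", metrics = ["acc"])
model.fit(x_train, y_train, epochs = 5, verbose = 0)

# 第二阶段:换用 SGD 并精调学习率与动量,继续训练
model.compile(
    optimizer = keras.optimizers.SGD(learning_rate = 1e-3, momentum = 0.9),
    loss = "sparse_categorical_crossentropy",
    metrics = ["acc"]
)
model.fit(x_train, y_train, epochs = 5, verbose = 0)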

此外,目前也有一些前沿的优化算法,据称效果比 Adam 更好, 例如: LazyAdam、Look-ahead、RAdam、Ranger 等

TensorFlow 内置优化器

tf.keras.optimizers 子模块中,深度学习优化算法基本都有对应的类实现

from tensorflow.keras import optimizers

SGD

SGD 默认参数为纯 SGD;设置 momentum 参数不为 0 时即为 SGDM(考虑一阶动量);再设置 nesterov=True 即为 NAG(Nesterov Accelerated Gradient),NAG 在计算梯度时计算的是向前走一步所在位置的梯度

sgd = optimizers.SGD(lr = 0.01)
model.compile(loss, optimizer = sgd)
# or
model.compile(loss, optimizer = "sgd")

SGDM(TODO)

sgdm = optimizers.SGD(lr = 0.01, momentum = 0.1)
model.compile(loss, optimizer = sgdm)
# 注意:SGDM 没有单独的字符串别名,不能写成 optimizer = "sgdm"

NAG(TODO)

nag = optimizers.SGD(lr = 0.01, nesterov = True)
model.compile(loss, optimizer = nag)
# 注意:NAG 同样没有单独的字符串别名,不能写成 optimizer = "nag"

RMSprop

rmsprop = optimizers.RMSprop(lr = 0.001)
model.compile(loss, optimizer = rmsprop)
# or
model.compile(loss, optimizer = "rmsprop")

Adagrad

adagrad = optimizers.Adagrad(lr = 0.01)
model.compile(loss, optimizer = adagrad)
# or
model.compile(loss, optimizer = "adagrad")

Adadelta

adadelta = optimizers.Adadelta(lr = 1.0)
model.compile(loss, optimizer = adadelta)
# or
model.compile(loss, optimizer = "adadelta")

Adam

adam = optimizers.Adam(lr = 0.001)
model.compile(loss, optimizer = adam)
# or
model.compile(loss, optimizer = "adam")

Adamax(TODO)

adamax = optimizers.Adamax(lr = 0.02)
model.compile(loss, optimizer = adamax)
# or
model.compile(loss, optimizer = "adamax")

Nadam

nadam = optimizers.Nadam(lr = 0.002)
model.compile(loss, optimizer = nadam)
# or
model.compile(loss, optimizer = "nadam")

TensorFlow 优化器使用方法

初始化优化器时会创建一个变量 optimizer.iterations 用于记录迭代的次数。 因此优化器和 tf.Variable 一样,一般需要在 @tf.function 外创建
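
下面是一个最小示意(其中的变量 w 和损失函数均为随意构造):优化器在 @tf.function 外创建,optimizer.iterations 会随每次 apply_gradients 调用自动加一。

import tensorflow as tf

# 优化器(及其内部的 iterations 变量)在 @tf.function 外创建
optimizer = tf.keras.optimizers.SGD(learning_rate = 0.1)
w = tf.Variable(5.0)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        loss = tf.square(w)
    grad = tape.gradient(loss, w)
    optimizer.apply_gradients([(grad, w)])
    # iterations 记录 apply_gradients 被调用的次数
    tf.print("iterations =", optimizer.iterations, "w =", w)

for _ in range(3):
    train_step()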

模型编译和拟合

from tensorflow import keras
from tensorflow.keras import layers

# model
model = keras.Sequential()
model.add(layers.Dense(64, kernel_initializer = "uniform", input_shape = (10,)))
model.add(layers.Activation("softmax"))

# model compile
opt = keras.optimizers.Adam(learning_rate = 0.01)
model.compile(loss = "categorical_crossentropy", optimizer = opt)
# or
# model.compile(loss = "categorical_crossentropy", optimizer = "adam")

自定义迭代训练

# Instantiate an optimizer
# 假设 model、loss_fn、dataset 均已在别处定义,这里仅演示自定义训练循环的写法
optimizer = tf.keras.optimizers.Adam()

# Iterate over the batches of a dataset
for x, y in dataset:
    # open a GradientTape
    with tf.GradientTape() as tape:
        # Forward pass
        logits = model(x)
        
        # Loss value for this batch
        loss_value = loss_fn(y, logits)

    # Get gradients of loss wrt the weights
    gradients = tape.gradient(loss_value, model.trainable_weights)

    # Update the weights of the model
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))

学习率衰减和调度

lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate = 1e-2,
    decay_steps = 10000,
    decay_rate = 0.9
)
optimizer = keras.optimizers.SGD(learning_rate = lr_schedule)

TensorFlow 优化器使用示例

optimizer.apply_gradients

求 $f(x) = a \times x^{2} + b \times x + c$ 的最小值

import tensorflow as tf
from tensorflow.keras import optimizers

# 自变量
x = tf.Variable(0.0, name = "x", dtype = tf.float32)

# 优化器
optimizer = optimizers.SGD(learning_rate = 0.01)

@tf.function
def minimizef():
    a = tf.constant(1.0)
    b = tf.constant(-2.0)
    c = tf.constant(1.0)
    while tf.constant(True):
        with tf.GradientTape() as tape:
            y = a * tf.pow(x, 2) + b * x + c
        dy_dx = tape.gradient(y, x)
        optimizer.apply_gradients(grads_and_vars = [(dy_dx, x)])

        # 迭代终止条件
        if tf.abs(dy_dx) < tf.constant(0.00001):
            break
        
        # 每 100 次迭代打印日志
        if tf.math.mod(optimizer.iterations, 100) == 0:
            tf.print("step =", optimizer.iterations)
            tf.print("x =", x)
            tf.print("")
    y = a * tf.pow(x, 2) + b * x + c

    return y

tf.print(f"y = {minimizef()}")
tf.print(f"x = {x}")

optimizer.minimize

求 $f(x) = a \times x^{2} + b \times x + c$ 的最小值

import tensorflow as tf
from tensorflow.keras import optimizers

# 自变量
x = tf.Variable(0.0, name = "x", dtype = tf.float32)

# 优化器
optimizer = optimizers.SGD(learning_rate = 0.01)

def f():
    a = tf.constant(1.0)
    b = tf.constant(-2.0)
    c = tf.constant(1.0)
    y = a * tf.pow(x, 2) + b * x + c
    return y

@tf.function
def train(epoch = 1000):
    for _ in tf.range(epoch):
        optimizer.minimize(f, [x])
    tf.print(f"epoch = {optimizer.iterations}")
    return f()

train(1000)
tf.print(f"y = {f()}")
tf.print(f"x = {x}")

model.compile 和 model.fit

求 $f(x) = a \times x^{2} + b \times x + c$ 的最小值

import tensorflow as tf
from tensorflow.keras import optimizers

tf.keras.backend.clear_session()


# 模型构建
class FakeModel(tf.keras.models.Model):

    def __init__(self, a, b, c):
        super(FakeModel, self).__init__()
        self.a = a
        self.b = b
        self.c = c
    
    def build(self):
        self.x = tf.Variable(0.0, name = "x")
        self.built = True
    
    def call(self, features):
        loss = self.a * (self.x) ** 2 + self.b * (self.x) + self.c
        return (tf.ones_like(features) * loss)

model = FakeModel(
    tf.constant(1.0),
    tf.constant(-2.0),
    tf.constant(1.0)
)
model.build()
model.summary()

# 损失函数
def myloss(y_true, y_pred):
    return tf.reduce_mean(y_pred)

# 优化器(SGDM)
sgd = optimizers.SGD(
    lr = 0.01, 
    decay = 1e-6, 
    momentum = 0.9, 
    nesterov = True,
)

# 编译模型
model.compile(
    loss = myloss, 
    optimizer = sgd,
)

# 模型训练
history = model.fit(
    tf.zeros((100, 2)),
    tf.ones(100),
    batch_size = 1, 
    epochs = 1000,
)

# 模型结果
tf.print(f"x = {model.x}")
tf.print(f"loss = {model(tf.constant(0.0))}")

TensorFlow 优化器核心 API

apply_gradients

Optimizer.apply_gradients(
    grads_and_vars, name=None, experimental_aggregate_gradients=True
)
grads = tape.gradient(loss, vars)
grads = tf.distribute.get_replica_context().all_reduce("sum", grads)

# Processing aggregated gradients.
optimizer.apply_gradients(zip(grads, vars), experimental_aggregate_gradients = False)

weights_property

import tensorflow as tf

tf.keras.optimizers.Optimizer.weights

get_weights

Optimizer.get_weights()
import numpy as np
import tensorflow as tf

# 模型优化器
opt = tf.keras.optimizers.RMSprop()

# 模型构建、编译
m = tf.keras.models.Sequential()
m.add(tf.keras.layers.Dense(10))
m.compile(opt, loss = "mse")

# 数据
data = np.arange(100).reshape(5, 20)
labels = np.zeros(5)

# 模型训练
print("Training")
results = m.fit(data, labels)
print(opt.get_weights())

set_weights

Optimizer.set_weights(weights)
import numpy as np
import tensorflow as tf

# 模型优化器
opt = tf.keras.optimizers.RMSprop()

# 模型构建、编译
m = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
m.compile(opt, loss = "mse")

# 数据        
data = np.arange(100).reshape(5, 20)
labels = np.zeros(5)

# 模型训练
print("Training")
results = m.fit(data, labels)

# 优化器新权重
new_weights = [
    np.array(10),       # 优化器的迭代次数
    np.ones([20, 10]),  # 优化器的状态变量
    np.zeros([10])      # 优化器的状态变量
]
opt.set_weights(new_weights)
opt.iterations

TensorFlow 优化器共有参数

control gradient clipping

from tensorflow.keras import optimizers

# All parameter gradients will be clipped to
# a maximum norm of 1.
sgd = optimizers.SGD(lr = 0.01, clipnorm = 1)

# All parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
sgd = optimizers.SGD(lr = 0.01, clipvalue = 0.5)

参考资料