I am testing the pretraining example from Chapter 15 of Aurélien Géron's book "Hands-On Machine Learning with Scikit-Learn and TensorFlow". The code appears on his GitHub page: here -- see the example in the "Unsupervised pretraining" section.
Pretraining a network with weights taken from a previously trained encoder should help it train. To check this, I modified Aurélien's code slightly so that it prints the error after every mini-batch, and I reduced the batch size. I did this in order to see the error at the very start of training, where the effect of the pretrained weights should be most visible. I expected the pretrained network to start with a lower error (compared to the network without pretraining), since it starts from pretrained weights. However, pretraining seems to make training slower.
Does anyone know why this happens?
The first few lines of output (with pretraining) are:
0 Train accuracy after each mini-batch: 0.08
0 Train accuracy after each mini-batch: 0.24
0 Train accuracy after each mini-batch: 0.32
0 Train accuracy after each mini-batch: 0.2
0 Train accuracy after each mini-batch: 0.32
0 Train accuracy after each mini-batch: 0.26
0 Train accuracy after each mini-batch: 0.32
0 Train accuracy after each mini-batch: 0.5
0 Train accuracy after each mini-batch: 0.58
0 Train accuracy after each mini-batch: 0.48
0 Train accuracy after each mini-batch: 0.54
0 Train accuracy after each mini-batch: 0.48
0 Train accuracy after each mini-batch: 0.5
0 Train accuracy after each mini-batch: 0.56
0 Train accuracy after each mini-batch: 0.64
0 Train accuracy after each mini-batch: 0.56
0 Train accuracy after each mini-batch: 0.68
0 Train accuracy after each mini-batch: 0.62
0 Train accuracy after each mini-batch: 0.74
0 Train accuracy after each mini-batch: 0.78
As you can see, the initial accuracy is quite low. By contrast, when using the initial (randomly initialized) weights, i.e. without pretraining, the initial accuracy is actually higher:
0 Train accuracy after each mini-batch: 0.62
0 Train accuracy after each mini-batch: 0.5
0 Train accuracy after each mini-batch: 0.52
0 Train accuracy after each mini-batch: 0.38
0 Train accuracy after each mini-batch: 0.56
0 Train accuracy after each mini-batch: 0.56
0 Train accuracy after each mini-batch: 0.6
0 Train accuracy after each mini-batch: 0.7
0 Train accuracy after each mini-batch: 0.72
0 Train accuracy after each mini-batch: 0.86
0 Train accuracy after each mini-batch: 0.86
0 Train accuracy after each mini-batch: 0.8
0 Train accuracy after each mini-batch: 0.82
0 Train accuracy after each mini-batch: 0.84
0 Train accuracy after each mini-batch: 0.88
0 Train accuracy after each mini-batch: 0.9
0 Train accuracy after each mini-batch: 0.82
0 Train accuracy after each mini-batch: 0.9
0 Train accuracy after each mini-batch: 0.84
0 Train accuracy after each mini-batch: 0.98
0 Train accuracy after each mini-batch: 0.96
In other words, pretraining appears to slow training down, which is the opposite of what it is supposed to do!
The code I modified is:
import numpy as np
import sys
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data


def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
def train_stacked_autoencoder():
    reset_graph()

    # Load the dataset to use
    mnist = input_data.read_data_sets("/tmp/data/")

    n_inputs = 28 * 28
    n_hidden1 = 300
    n_hidden2 = 150  # codings
    n_hidden3 = n_hidden1
    n_outputs = n_inputs

    learning_rate = 0.01
    l2_reg = 0.0001

    activation = tf.nn.elu
    regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
    initializer = tf.contrib.layers.variance_scaling_initializer()

    X = tf.placeholder(tf.float32, shape=[None, n_inputs])

    weights1_init = initializer([n_inputs, n_hidden1])
    weights2_init = initializer([n_hidden1, n_hidden2])
    weights3_init = initializer([n_hidden2, n_hidden3])
    weights4_init = initializer([n_hidden3, n_outputs])

    weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
    weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
    weights3 = tf.Variable(weights3_init, dtype=tf.float32, name="weights3")
    weights4 = tf.Variable(weights4_init, dtype=tf.float32, name="weights4")

    biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
    biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
    biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
    biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

    hidden1 = activation(tf.matmul(X, weights1) + biases1)
    hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
    hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
    outputs = tf.matmul(hidden3, weights4) + biases4

    reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))

    optimizer = tf.train.AdamOptimizer(learning_rate)

    with tf.name_scope("phase1"):
        phase1_outputs = tf.matmul(hidden1, weights4) + biases4  # bypass hidden2 and hidden3
        phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
        phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
        phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
        phase1_training_op = optimizer.minimize(phase1_loss)

    with tf.name_scope("phase2"):
        phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
        phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
        phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
        train_vars = [weights2, biases2, weights3, biases3]
        phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars)  # freeze hidden1

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

    training_ops = [phase1_training_op, phase2_training_op]
    reconstruction_losses = [phase1_reconstruction_loss, phase2_reconstruction_loss]
    n_epochs = [4, 4]
    batch_sizes = [150, 150]

    use_cached_results = True

    # Train both phases
    if not use_cached_results:
        with tf.Session() as sess:
            init.run()
            for phase in range(2):
                print("Training phase #{}".format(phase + 1))
                for epoch in range(n_epochs[phase]):
                    n_batches = mnist.train.num_examples // batch_sizes[phase]
                    for iteration in range(n_batches):
                        print("\r{}%".format(100 * iteration // n_batches), end="")
                        sys.stdout.flush()
                        X_batch, y_batch = mnist.train.next_batch(batch_sizes[phase])
                        sess.run(training_ops[phase], feed_dict={X: X_batch})
                    loss_train = reconstruction_losses[phase].eval(feed_dict={X: X_batch})
                    print("\r{}".format(epoch), "Train MSE:", loss_train)
                    saver.save(sess, "./my_model_one_at_a_time.ckpt")
            loss_test = reconstruction_loss.eval(feed_dict={X: mnist.test.images})
            print("Test MSE (uncached method):", loss_test)

    # Train both phases, but in this case we cache the frozen layer outputs
    if use_cached_results:
        with tf.Session() as sess:
            init.run()
            for phase in range(2):
                print("Training phase #{}".format(phase + 1))
                if phase == 1:
                    hidden1_cache = hidden1.eval(feed_dict={X: mnist.train.images})
                for epoch in range(n_epochs[phase]):
                    n_batches = mnist.train.num_examples // batch_sizes[phase]
                    for iteration in range(n_batches):
                        print("\r{}%".format(100 * iteration // n_batches), end="")
                        sys.stdout.flush()
                        if phase == 1:
                            # Phase 2 - use the cached output from hidden layer 1
                            indices = np.random.permutation(mnist.train.num_examples)
                            hidden1_batch = hidden1_cache[indices[:batch_sizes[phase]]]
                            feed_dict = {hidden1: hidden1_batch}
                            sess.run(training_ops[phase], feed_dict=feed_dict)
                        else:
                            # Phase 1
                            X_batch, y_batch = mnist.train.next_batch(batch_sizes[phase])
                            feed_dict = {X: X_batch}
                            sess.run(training_ops[phase], feed_dict=feed_dict)
                    loss_train = reconstruction_losses[phase].eval(feed_dict=feed_dict)
                    print("\r{}".format(epoch), "Train MSE:", loss_train)
                    saver.save(sess, "./my_model_cache_frozen.ckpt")
            loss_test = reconstruction_loss.eval(feed_dict={X: mnist.test.images})
            print("Test MSE (cached method):", loss_test)
def unsupervised_pretraining():
    reset_graph()

    # Load the dataset to use
    mnist = input_data.read_data_sets("/tmp/data/")

    n_inputs = 28 * 28
    n_hidden1 = 300
    n_hidden2 = 150
    n_outputs = 10

    learning_rate = 0.01
    l2_reg = 0.0005

    activation = tf.nn.elu
    regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
    initializer = tf.contrib.layers.variance_scaling_initializer()

    X = tf.placeholder(tf.float32, shape=[None, n_inputs])
    y = tf.placeholder(tf.int32, shape=[None])

    weights1_init = initializer([n_inputs, n_hidden1])
    weights2_init = initializer([n_hidden1, n_hidden2])
    weights3_init = initializer([n_hidden2, n_outputs])

    weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
    weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
    weights3 = tf.Variable(weights3_init, dtype=tf.float32, name="weights3")

    biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
    biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
    biases3 = tf.Variable(tf.zeros(n_outputs), name="biases3")

    hidden1 = activation(tf.matmul(X, weights1) + biases1)
    hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
    logits = tf.matmul(hidden2, weights3) + biases3

    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    reg_loss = regularizer(weights1) + regularizer(weights2) + regularizer(weights3)
    loss = cross_entropy + reg_loss

    optimizer = tf.train.AdamOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    init = tf.global_variables_initializer()
    pretrain_saver = tf.train.Saver([weights1, weights2, biases1, biases2])
    saver = tf.train.Saver()

    n_epochs = 4
    batch_size = 50
    n_labeled_instances = 2000

    pretraining = True

    # Regular training (without pretraining):
    if not pretraining:
        with tf.Session() as sess:
            init.run()
            for epoch in range(n_epochs):
                n_batches = n_labeled_instances // batch_size
                for iteration in range(n_batches):
                    # print("\r{}%".format(100 * iteration // n_batches), end="")
                    # sys.stdout.flush()
                    indices = np.random.permutation(n_labeled_instances)[:batch_size]
                    X_batch, y_batch = mnist.train.images[indices], mnist.train.labels[indices]
                    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
                    accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                    print("\r{}".format(epoch), "Train accuracy after each mini-batch:", accuracy_val)
                    sys.stdout.flush()
                accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                print("\r{}".format(epoch), "Train accuracy after all batches:", accuracy_val, end=" ")
                saver.save(sess, "./my_model_supervised.ckpt")
            accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
            print("Test accuracy (without pretraining):", accuracy_val)

    # Now reuse the first two layers of the autoencoder we pretrained:
    if pretraining:
        training_op = optimizer.minimize(loss, var_list=[weights3, biases3])  # Freeze layers 1 and 2 (optional)
        with tf.Session() as sess:
            init.run()
            pretrain_saver.restore(sess, "./my_model_cache_frozen.ckpt")
            for epoch in range(n_epochs):
                n_batches = n_labeled_instances // batch_size
                for iteration in range(n_batches):
                    # print("\r{}%".format(100 * iteration // n_batches), end="")
                    # sys.stdout.flush()
                    indices = np.random.permutation(n_labeled_instances)[:batch_size]
                    X_batch, y_batch = mnist.train.images[indices], mnist.train.labels[indices]
                    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
                    accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                    print("\r{}".format(epoch), "Train accuracy after each mini-batch:", accuracy_val)
                    sys.stdout.flush()
                accuracy_val = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                print("\r{}".format(epoch), "Train accuracy after all batches:", accuracy_val, end=" ")
                saver.save(sess, "./my_model_supervised_pretrained.ckpt")
            accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
            print("Test accuracy (with pretraining):", accuracy_val)
if __name__ == "__main__":
    # Seed the random number generator
    np.random.seed(42)
    tf.set_random_seed(42)

    # Fit a multi-layer autoencoder and save the weights
    # - this part is from Aurelien Geron's Ch 15, "Training one Autoencoder at a time in a single graph" example
    train_stacked_autoencoder()

    # Fit a network, using the weights previously saved for pretraining
    # - this part is from Aurelien Geron's Ch 15, "Unsupervised pretraining" example
    unsupervised_pretraining()

Posted on 2018-05-02 09:08:34
Note: I haven't read Aurélien Géron's tutorials, but I have read the book.
From an intuitive standpoint, I can convince myself that training could indeed be slower for a pretrained model. In other words, it makes sense that the rate at which the error decreases (or the accuracy improves) might be lower. The fact that the training accuracy itself is lower is (at least to me) somewhat more complicated, and probably case-specific.
"However, pretraining seems to make training slower."
With a pretrained model, we are essentially taking a set of weights that has already been (at least partially) optimized for one problem. Those weights aim to solve that problem based on the data they received, which means they expect inputs drawn from a particular distribution. You have frozen the first two layers with the following line:
if pretraining:
    training_op = optimizer.minimize(loss, var_list=[weights3, biases3])
Intuitively, freezing two layers (three, in your case) restricts the model.
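The effect of freezing can be sketched framework-agnostically (a minimal NumPy toy, not the questioner's TensorFlow model; all values are illustrative): with some coordinates frozen, gradient descent can only search a restricted subspace and may be unable to reach the loss it could reach with all parameters free.

```python
import numpy as np

# Toy quadratic loss with a unique minimum at w = target.
target = np.array([1.0, -2.0, 0.5])

def loss(w):
    return float(np.sum((w - target) ** 2))

def grad(w):
    return 2.0 * (w - target)

def train(w, trainable_mask, lr=0.1, steps=200):
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad(w) * trainable_mask  # frozen coordinates get no update
    return w

w0 = np.zeros(3)
full = train(w0, np.array([1.0, 1.0, 1.0]))    # everything trainable
frozen = train(w0, np.array([0.0, 0.0, 1.0]))  # first two coordinates frozen

# Full training reaches the optimum; the frozen run is stuck with the
# error contributed by the two frozen coordinates.
```

Masking gradients is the same idea as passing `var_list=[weights3, biases3]` in the question's code: only the last layer's parameters ever receive updates.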
Here is a somewhat contrived analogy that I use to explain the situation to myself. Imagine we have a juggler who can already juggle three balls, and we now want him to learn to juggle a fourth. At the same time, we ask an amateur to learn to juggle, also with four balls. Before measuring how quickly each of them learns, we decide to tie one of the juggler's hands behind his back. So the juggler already knows some tricks, but he is also constrained to some degree while learning. In my view, the amateur may well learn faster (relatively speaking), partly because there is more for them to learn, but also because they have more freedom to explore the parameter space, i.e. they can use both arms freely.
In an optimization setting, one might imagine that the pretrained model already sits at a point on the loss surface where the gradient along certain dimensions is very small (remember, we have a high-dimensional search space). This ultimately means it cannot make rapid changes to its outputs while backpropagating the error, because the weight updates are proportional to those already small gradients.
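This intuition can be sketched with a toy anisotropic quadratic (illustrative NumPy only; the curvatures and starting points are made up): a start whose remaining error lies along a nearly flat direction descends far more slowly than a start with the same initial loss along a steep direction.

```python
import numpy as np

# Toy loss L(w) = sum(curv * w**2): nearly flat along w0, steep along w1.
curv = np.array([0.001, 1.0])

def loss(w):
    return float(np.sum(curv * w ** 2))

def descend(w, lr=0.4, steps=100):
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2.0 * curv * w  # plain gradient descent
    return w

# "Pretrained" start: all remaining error lies in the flat direction.
flat_start = np.array([2.0, 0.0])
# "Random" start with the same initial loss, but in the steep direction.
steep_start = np.array([0.0, np.sqrt(loss(flat_start))])

w_flat = descend(flat_start)
w_steep = descend(steep_start)
# After 100 steps the flat-direction start has barely improved,
# while the steep-direction start is essentially at the optimum.
```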
...OK -- that may sound somewhat plausible, but it only addresses the slower learning -- what about the actual training accuracy being lower than for the randomly initialized model?
"I expected the pretrained network to start with a lower error (compared to the network without pretraining)."
Here I tend to agree with you. In the optimal case, we could take a pretrained model, use the initial layers as they are, and merely fine-tune the final layers. In some cases, however, this approach simply may not work.
Consulting the literature, a possible explanation can be found in the abstract of the paper "How transferable are features in deep neural networks?" (Yosinski et al.):
Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the intended target task, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected.
I find the second issue particularly interesting and relevant to your setting, because you effectively have only three layers. Consequently, you are not allowing much freedom for fine-tuning, and the last layer is very likely highly dependent on its relationship with the preceding layer.
What you may be seeing by using the pretrained model is that the final model generalizes better. This may come at the cost of lower test accuracy on the particular dataset you trained on.
Here is another idea, summarized by the amazing (and free) Stanford CS231n course:
Learning rates. It's common to use a smaller learning rate for the ConvNet weights that are being fine-tuned, in comparison to the (randomly initialized) weights of the new linear classifier that computes the class scores of your new dataset.
In your code, the learning rate appears to be fixed at 0.01 for all phases of training. This is something you could experiment with: make it smaller for the pretrained layers, or simply start with a lower global learning rate.
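In TF 1.x one way to do this would be two optimizers with different learning rates, each given its own var_list and combined with tf.group; the effect itself can be shown framework-agnostically. A minimal NumPy sketch (the split into a "pretrained" weight and a new "head" weight, and all values, are illustrative):

```python
import numpy as np

# Per-parameter learning rates: index 0 mimics a pretrained layer weight,
# index 1 a freshly initialized classifier head weight.
target = np.array([1.0, -1.0])  # optimum of a simple quadratic loss

def grad(w):
    return 2.0 * (w - target)

def train(lr_pre, lr_head, steps=100):
    w = np.zeros(2)
    lrs = np.array([lr_pre, lr_head])  # element-wise learning rates
    for _ in range(steps):
        w -= lrs * grad(w)
    return w

w = train(lr_pre=0.001, lr_head=0.05)
# The head converges quickly while the pretrained weight moves cautiously,
# largely preserving what it learned during pretraining.
```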
Here is an introduction to transfer learning that may give you more ideas about why/where you might make different modeling decisions.
https://datascience.stackexchange.com/questions/31100