Using the AzureML service, how do I get correct per-epoch loss and accuracy curves for Keras deep learning with Horovod across multiple nodes?
The loss-vs-epochs plot for Keras deep learning with Horovod on AzureML looks wrong.
Training a CNN with Keras/Horovod (2 GPUs) and the AMLS SDK produces a strange graph.


Posted on 2019-08-11 14:11:03
It looks like you may effectively be training two models, and the gradient averaging across the different nodes is worth checking. Can you share more of your training script — are you wrapping your optimizer in DistributedOptimizer like this:
# Horovod: adjust learning rate based on number of GPUs.
opt = keras.optimizers.Adadelta(1.0 * hvd.size())

# Horovod: add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)

Also, you really only need one machine doing the logging, so you would normally attach an AzureML logger on rank 0 only, like this:
class LogToAzureMLCallback(tf.keras.callbacks.Callback):
    def on_batch_end(self, batch, logs=None):
        Run.get_context().log('acc', logs['acc'])

    def on_epoch_end(self, epoch, logs=None):
        Run.get_context().log('epoch_acc', logs['acc'])

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0)
]

# Horovod: save checkpoints only on worker 0 and only log to AzureML from worker 0.
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
    callbacks.append(LogToAzureMLCallback())

model.fit(x_train, y_train,
          batch_size=batch_size,
          callbacks=callbacks,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

Posted on 2019-08-10 01:58:43
How are you logging these metrics? From the graph, it looks like there are two interleaved sets of data points.
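If every Horovod worker logs to the same AzureML run, the run receives one point per worker per step, so the plotted series alternates between the two workers' curves and zigzags. A minimal sketch of that effect in pure Python (no AzureML; the function name and sample values are hypothetical, for illustration only):

```python
# Simulate two Horovod workers each logging per-batch accuracy to the
# same metric stream. The shared run appends points in arrival order,
# so the merged series alternates between the two workers' curves.
def merged_metric_stream(worker_curves):
    """Interleave per-step metrics the way a shared run would receive them."""
    stream = []
    for step_values in zip(*worker_curves):
        stream.extend(step_values)  # one point per worker per step
    return stream

rank0 = [0.50, 0.60, 0.70, 0.80]  # worker 0's accuracy on its data shard
rank1 = [0.40, 0.55, 0.65, 0.75]  # worker 1's accuracy on a different shard
print(merged_metric_stream([rank0, rank1]))
# The stream zigzags between the two curves instead of rising smoothly.
```

Guarding the logging callback with `if hvd.rank() == 0:` (as in the other answer) keeps only one worker's curve in the run, which removes the interleaving.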
https://stackoverflow.com/questions/57432959