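I am trying to debug my TensorFlow code, which suddenly produces a NaN loss after about 30 epochs. You can find my specific problem and what I have tried in this SO question.
I monitored the weights of all layers for every mini-batch during training and found that the weights suddenly jump to NaN, even though all weight values were below 1 in the previous iteration (I have set kernel_constraint to max_norm with a limit of 1). This makes it hard to work out which operation is the culprit.
PyTorch has a neat debugging tool, torch.autograd.detect_anomaly, which raises an error at any backward computation that produces a NaN value and shows the traceback. This makes the code easy to debug.
Is there something similar in TensorFlow? If not, can you suggest a way to debug this?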
Posted on 2021-10-13 10:25:19
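There is indeed a similar debugging tool in TensorFlow. See tf.debugging.check_numerics.
It can be used to track down tensors that produce inf or nan values during training. As soon as such a value is found, TensorFlow raises an InvalidArgumentError.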
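As a minimal sketch of how such a check might be wired in, the snippet below wraps one intermediate tensor. The functions checked_log and first_bad_message are hypothetical names used only for illustration, and tf.math.log stands in for whatever layer is suspected, since it yields NaN for negative inputs.

```python
import tensorflow as tf


def checked_log(x):
    """Hypothetical stand-in for a layer output: log(x) is NaN for x < 0."""
    y = tf.math.log(x)
    # check_numerics returns y unchanged when every element is finite and
    # raises tf.errors.InvalidArgumentError when y contains NaN or Inf.
    # Using the returned tensor keeps the check on the data path.
    return tf.debugging.check_numerics(y, "log output is producing nans!")


def first_bad_message(values):
    """Return the error message if the check fires, otherwise None."""
    try:
        checked_log(tf.constant(values, dtype=tf.float32))
        return None
    except tf.errors.InvalidArgumentError as err:
        return err.message


if __name__ == "__main__":
    print(first_bad_message([1.0, 2.0]))   # finite values, no error
    print(first_bad_message([-1.0, 2.0]))  # log(-1) is NaN, so the check fires
```

Placing one such call after each suspicious layer narrows down where the first non-finite value appears, at the cost of one extra op per checked tensor.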
tf.debugging.check_numerics(LayerN, "LayerN is producing nans!")
If the tensor LayerN contains NaNs, you will get an error like this:
Traceback (most recent call last):
  File "trainer.py", line 506, in <module>
    worker.train_model()
  File "trainer.py", line 211, in train_model
    l, tmae = train_step(*batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LayerN is producing nans! : Tensor had NaN values
https://stackoverflow.com/questions/69517347
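When it is not obvious which tensor to wrap, a closer analogue to torch.autograd.detect_anomaly may be tf.debugging.enable_check_numerics, which instruments floating-point ops globally instead of one tensor at a time. The sketch below is only an illustration: find_bad_op and square_log are hypothetical helpers, and the check is switched off again afterwards because it slows execution down.

```python
import tensorflow as tf


def square_log(x):
    """Hypothetical computation: the Log op yields NaN for negative x."""
    return tf.math.square(tf.math.log(x))


def find_bad_op(fn, *args):
    """Run fn(*args) with global numerics checking enabled.

    Returns the error message when some op outputs NaN or Inf,
    otherwise None.
    """
    tf.debugging.enable_check_numerics()
    try:
        fn(*args)
        return None
    except tf.errors.InvalidArgumentError as err:
        return err.message
    finally:
        # Restore normal execution; the global check adds overhead.
        tf.debugging.disable_check_numerics()


if __name__ == "__main__":
    print(find_bad_op(square_log, tf.constant(2.0)))   # finite result
    print(find_bad_op(square_log, tf.constant(-1.0)))  # message names the failing op
```

In a real training script the enable call would typically sit once near the top, before the model and any tf.function are built, so that the ops created afterwards are instrumented.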