Question: TPU suddenly stops training

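Stack Overflow user

Asked 2019-11-18 15:05:19

Answers: 2 | Views: 743 | Followers: 0 | Votes: 0

I am trying to train a Transformer model on a Google Cloud TPU, following the instructions in the official tutorial. Loading the data works fine, and after running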

t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problem=translate_ende_wmt32k_packed \
  --train_steps=500000 \
  --eval_steps=3000 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --use_tpu=True \
  --cloud_tpu_name=$TPU_NAME

training does start as expected, and the output looks roughly like this:

I1118 14:48:18.978163 140580835792320 tpu_estimator.py:2307] global_step/sec: 15.2942
INFO:tensorflow:examples/sec: 978.827                                                                                             
I1118 14:48:18.978595 140580835792320 tpu_estimator.py:2308] examples/sec: 978.827                                                
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.                                               
I1118 14:48:18.979720 140580835792320 tpu_estimator.py:600] Enqueue next (100) batch(es) of data to infeed.                       
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.                                                                
I1118 14:48:18.979935 140580835792320 tpu_estimator.py:604] Dequeue next (100) batch(es) of data from outfeed.
I1118 14:48:24.292932 140577566803712 transport.py:157] Attempting refresh to obtain initial access_token                         
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-8 in state READY, and health HEALTHY.                                         
W1118 14:48:24.353135 140577566803712 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-8 in state READY, and health HEALTHY.
INFO:tensorflow:loss = 1.8486812, step = 113800 (6.536 sec)                                                                       
I1118 14:48:25.512768 140580835792320 basic_session_run_hooks.py:260] loss = 1.8486812, step = 113800 (6.536 sec)                 
INFO:tensorflow:global_step/sec: 15.2986                                                                 
I1118 14:48:25.514695 140580835792320 tpu_estimator.py:2307] global_step/sec: 15.2986                                             
INFO:tensorflow:examples/sec: 979.11                                                                                              
I1118 14:48:25.515115 140580835792320 tpu_estimator.py:2308] examples/sec: 979.11                                
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.                                                                   
I1118 14:48:25.516618 140580835792320 tpu_estimator.py:600] Enqueue next (100) batch(es) of data to infeed.                       
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.                                       
I1118 14:48:25.516829 140580835792320 tpu_estimator.py:604] Dequeue next (100) batch(es) of data from outfeed.                    
INFO:tensorflow:Outfeed finished for iteration (388, 47)                                                                          
I1118 14:48:28.761935 140577575196416 tpu_estimator.py:279] Outfeed finished for iteration (388, 47)       
INFO:tensorflow:loss = 1.5237397, step = 113900 (6.573 sec)                                                                       
I1118 14:48:32.086134 140580835792320 basic_session_run_hooks.py:260] loss = 1.5237397, step = 113900 (6.573 sec)

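However, after a non-deterministic number of iterations (sometimes fewer than 25k, sometimes more than 400k, sometimes never), training suddenly stops. There is no error message, but no further progress is made. In that case, I get the following output: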

I1120 13:40:33.828651 140684764419520 tpu_estimator.py:2307] global_step/sec: 16.3988
INFO:tensorflow:examples/sec: 1049.52
I1120 13:40:33.829339 140684764419520 tpu_estimator.py:2308] examples/sec: 1049.52
INFO:tensorflow:Enqueue next (1000) batch(es) of data to infeed.
I1120 13:40:33.830607 140684764419520 tpu_estimator.py:600] Enqueue next (1000) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1000) batch(es) of data from outfeed.
I1120 13:40:33.830862 140684764419520 tpu_estimator.py:604] Dequeue next (1000) batch(es) of data from outfeed.
INFO:tensorflow:Outfeed finished for iteration (7, 0)
I1120 13:40:34.267921 140681504278272 tpu_estimator.py:279] Outfeed finished for iteration (7, 0)
I1120 13:40:39.989195 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:40:40.056418 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:41:10.124164 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:41:10.177670 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:41:40.259634 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:41:40.309398 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:42:10.377460 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health UNKNOWN.
W1120 13:42:10.431982 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health UNKNOWN.
I1120 13:42:40.508342 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:42:40.567739 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:43:10.638391 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:43:10.694900 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:43:40.763782 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:43:40.810777 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:44:10.889873 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:44:10.942733 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:44:41.011034 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:44:41.066553 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.

Note that the reported health was UNKNOWN once, which may or may not be related to this problem.
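One way to confirm such a stall from a saved log is to compare the timestamp of the last line that indicates progress with the timestamp of the last line overall. The following is a minimal sketch, not part of the original post. It assumes the glog-style prefix seen in the logs above, and the list of "progress" messages is my own assumption based on those logs. The log prefix carries no year, so `year` is a parameter.

```python
import re
from datetime import datetime, timedelta

# glog-style prefix as seen in the logs above, for example:
# "I1120 13:40:33.828651 140684764419520 tpu_estimator.py:2307] message"
LOG_LINE = re.compile(
    r"^[IWEF](\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{6}) \d+ ([\w.]+):\d+\] (.*)$"
)

# Messages treated as evidence of training progress (an assumption).
PROGRESS_MARKERS = ("global_step/sec", "loss = ", "Outfeed finished")


def parse_line(line, year=2019):
    """Return (timestamp, source_file, message), or None for other lines."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    month, day, hour, minute, second, micro = (int(g) for g in match.groups()[:6])
    timestamp = datetime(year, month, day, hour, minute, second, micro)
    return timestamp, match.group(7), match.group(8)


def stall_duration(lines, year=2019):
    """Time between the last progress line and the last timestamped line.

    Returns None if no progress line was found.
    """
    last_progress = None
    last_seen = None
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue  # e.g. the duplicate "INFO:tensorflow:..." lines
        timestamp, _source, message = parsed
        last_seen = timestamp
        if any(marker in message for marker in PROGRESS_MARKERS):
            last_progress = timestamp
    if last_progress is None:
        return None
    return last_seen - last_progress
```

Feeding it the lines of a captured log file (for example `stall_duration(open(path))`, where `path` is wherever the output was saved) gives a `timedelta`. A value that keeps growing while only `TPUPollingThread` lines arrive matches the behaviour described above.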

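To resume training, I have to stop the training process and run the training command again. It then loads the latest checkpoint and continues training until it eventually stops again.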
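The manual stop-and-rerun cycle could be automated with a small supervisor. This is a sketch of a workaround, not a fix for the underlying problem, and it is not from the original post. The function names, the progress markers, and the timeouts are all assumptions. It treats a line containing `global_step/sec` or `loss = ` as progress, because the `TPUPollingThread` messages keep arriving even while training is stuck.

```python
import subprocess
import sys
import threading
import time

# Substrings treated as evidence of progress (an assumption based on the logs above).
PROGRESS_MARKERS = ("global_step/sec", "loss = ")


def run_until_stalled(cmd, stall_seconds, poll_seconds=5.0):
    """Run cmd and echo its output; stop it if no progress line appears in time.

    Returns the exit code, or None if the process was stopped as stalled.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_progress = [time.monotonic()]  # one-element list so the thread can update it

    def pump():
        for line in proc.stdout:
            sys.stdout.write(line)
            if any(marker in line for marker in PROGRESS_MARKERS):
                last_progress[0] = time.monotonic()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    stalled = False
    while proc.poll() is None:
        if time.monotonic() - last_progress[0] > stall_seconds:
            stalled = True
            proc.terminate()  # ask the process to exit first
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()  # force it if it does not react
                proc.wait()
            break
        time.sleep(poll_seconds)
    reader.join()
    return None if stalled else proc.returncode


def supervise(cmd, stall_seconds, max_restarts=10, poll_seconds=5.0):
    """Rerun cmd after each detected stall, up to max_restarts restarts."""
    for _ in range(max_restarts + 1):
        code = run_until_stalled(cmd, stall_seconds, poll_seconds)
        if code is not None:
            return code
        print("No training progress detected; restarting.", file=sys.stderr)
    raise RuntimeError("still stalling after %d restarts" % max_restarts)
```

Here `cmd` would be the `t2t-trainer` invocation from above as a list of arguments, with the shell variables already expanded. The timer starts when the process is launched, so `stall_seconds` has to be comfortably longer than start-up, checkpoint restore, and any evaluation pause. Otherwise a healthy run gets restarted.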

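I run the training command inside a tmux session, so this should not be caused by connection problems between me and Google. In fact, I can close all windows entirely and connect to the running training session from another PC.

I have seen the question "TPU training freezes in the middle of training", but I am using a predefined model and my bucket is defined in the same region (TPU in us-central1-a, storage bucket in us-central1).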
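The co-location condition mentioned here can be written down as a tiny check. This is only an illustrative sketch. Both helper names are hypothetical, and it assumes zone names of the form `<region>-<letter>` and a single-region bucket (a multi-region location such as `US` would not match).

```python
def region_of_zone(zone):
    """Derive the region from a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or len(suffix) != 1 or not suffix.isalpha():
        raise ValueError("unexpected zone name: %r" % zone)
    return region


def bucket_is_colocated(tpu_zone, bucket_location):
    """True if the bucket location equals the TPU's region (case-insensitive)."""
    return region_of_zone(tpu_zone).lower() == bucket_location.lower()
```

With the values from this question, `bucket_is_colocated("us-central1-a", "us-central1")` holds, so region mismatch is unlikely to explain the hang.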

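Edit: In case this is relevant, I am currently on the one-month free trial that I got by applying to the TensorFlow Research Cloud program. Maybe those cluster nodes are less stable than the paid ones?

Edit2: Maybe this is related to the GitHub issue "TPU dies after 3hrs (e.g. with no 'health' state)" (and the follow-up)? Note that the issue has been closed, but the answer given there seems unrelated to the problem. Also, I checked the file /usr/local/lib/python2.7/dist-packages/tensorflow_core/python/tpu/preempted_hook.py on my cloud VM, and both linked changes are already integrated.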

google-compute-engine

tpu

google-cloud-tpu

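2 Answers

Stack Overflow user

Accepted answer

Answered 2019-12-12 14:12:24

This was reported as a bug on GitHub (#1, #2) and has since been fixed. If the error still occurs, reply to the second GitHub issue. Note that you may need to recreate the TPU; just restarting it may not be enough.

Votes: 0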

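Stack Overflow user

Answered 2019-11-20 04:30:43

I had the same problem when training on a TPU from TFRC. As the warning suggests, there seems to be a problem with the connection between the TPU and Google Cloud, even though we followed the instructions.

I tried several solutions:

Remove the gcloud config folder: rm -rf ~/.config/gcloud
Update the gcloud SDK: gcloud components update
Give the TPU access to the Cloud Storage bucket via IAM.

The TPU hang still occurs, but less frequently. I hope this helps in your case, or that you can find a general solution.

Votes: 0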

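Original page content provided by Stack Overflow. The Chinese version was machine-translated by Tencent Cloud's Xiaowei IT-domain engine.

Original link: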

https://stackoverflow.com/questions/58917482