
Q: Python program for distributed training of deep learning models with Horovod on a GPU cluster

Stack Overflow user

Asked 2020-07-11 21:15:16

Answers 1 · Views 171 · Followers 0 · Votes 2

I am trying to run some sample Python 3 code from https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html on a Databricks GPU cluster (one driver and two workers).

Databricks environment:

 ML 6.6, scala 2.11, Spark 2.4.5, GPU

It is meant for distributed training of deep learning models.

I just tried a very simple example:

 from sparkdl import HorovodRunner
 hr = HorovodRunner(np=2)

 def train():
   print('in train')
   import tensorflow as tf
   print('after import tf')
   hvd.init()
   print('done')

 hr.run(train)

However, the command keeps running without making any progress. The only output is:

HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you
can adjust the log level in your train method. Or you can set driver_log_verbosity to
'log_callback_only' and use a HorovodRunner log callback on the first worker to get concise
progress updates.
The global names read or written to by the pickled function are {'print', 'hvd'}.
The pickled object size is 1444 bytes.

### How to enable Horovod Timeline? ###
HorovodRunner has the ability to record the timeline of its activity with Horovod Timeline. To
record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location of the
timeline file to be created. You can then open the timeline file using the chrome://tracing
facility of the Chrome browser.

Am I missing something, or do I need to set something up to make this work?

Thanks.

deep-learning

gpu

databricks

horovod

distributed-training

1 Answer

Stack Overflow user

Answered 2022-04-21 13:20:54

Your code does not contain any actual training. You may have better luck starting from a more complete example:

https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/mnist-pytorch.html
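One more thing that may be worth ruling out: the log in the question reports `hvd` as a global name read by the pickled function, yet the snippet as posted never imports it. The sketch below is only an illustration under stated assumptions, not a confirmed fix for the hang. It assumes `horovod.tensorflow` is the intended module, moves the import inside `train()`, and adds a hypothetical helper, `unresolved_globals`, that uses the standard-library `dis` module to list global names a function loads but that are defined neither in its module nor in the builtins. The `launch` wrapper and its `num_workers` parameter are also made up for illustration.

```python
import builtins
import dis


def train():
    # Import inside the function so each worker process resolves the
    # modules itself instead of relying on notebook-level globals.
    import tensorflow as tf  # noqa: F401  (kept from the question's snippet)
    import horovod.tensorflow as hvd  # assumption: the TensorFlow flavour of Horovod

    hvd.init()
    print(f"worker {hvd.rank()} of {hvd.size()} initialized")


def unresolved_globals(func):
    """Hypothetical helper: global names `func` loads that are neither
    defined in its module nor available as builtins."""
    loaded = {
        ins.argval
        for ins in dis.get_instructions(func)
        if ins.opname == "LOAD_GLOBAL"
    }
    return loaded - set(func.__globals__) - set(dir(builtins))


def launch(num_workers=2):
    # Only meaningful on a Databricks ML runtime where sparkdl is installed.
    from sparkdl import HorovodRunner

    HorovodRunner(np=num_workers).run(train)
```

Calling `unresolved_globals(train)` before `launch()` gives a quick local check. For a function that calls `hvd.init()` without `hvd` being imported anywhere, the helper reports `hvd`. If the notebook had already imported `hvd` at the top level, the name would resolve and the cause of the hang would have to be looked for elsewhere, for example in the cluster configuration.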

Votes 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/62854672
