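I have been trying to get started with Google Cloud's AI Platform.
My own models are written in PyTorch, so I chose to start with PyTorch. I think it goes without saying why I want to use a GPU.
I have tried to follow the instructions exactly and used the sample code provided. However, I still run into an error. I can create a job without any problem, but the job eventually fails with the following error: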
RuntimeError: CUDA error: no kernel image is available for execution on the device
I am fairly new to PyTorch and completely new to GCP, so I don't know how to resolve this. Any help would be greatly appreciated.
Full traceback:
The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 123, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 119, in main
    experiment.run(args)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 132, in run
    train(sequential_model, train_loader, criterion, optimizer, epoch)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 37, in train
    for batch_index, data in enumerate(train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 347, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 387, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: CUDA error: no kernel image is available for execution on the device

Posted on 2021-02-18 23:20:23
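I have managed to reproduce the same issue with pytorch-gpu.1-6.
As a workaround, it works with pytorch-gpu.1-4.
I guess something in the code changed as of 1.6, but I am not sure what, since I am not familiar with PyTorch.
Also, the code to select the device does not appear to have changed up to version 1.7, compared with our sample code.
Furthermore, our GPU basic tier, the Nvidia Tesla K80, appears to be supported by CUDA 9.0, 9.2, 10.0, or 11.
In any case, I have created a public issue on the issue tracker so that the AI Platform engineering team can investigate it.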
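
One way to narrow this down: "no kernel image is available" generally means the installed PyTorch binary contains no compiled kernels that the GPU can run, which is a separate question from whether the CUDA toolkit version supports the card. The Tesla K80 has compute capability 3.7. The sketch below compares a device's compute capability against the architectures a PyTorch build reports. The helper names `parse_arch` and `build_supports_device` are hypothetical, made up for this illustration, and the matching rule is a simplification of CUDA's compatibility rules. Whether this is the actual cause of the failure on AI Platform is not confirmed by this thread.

```python
import re


def parse_arch(arch):
    """Split a string such as 'sm_37' into ('sm', 3, 7); None if unrecognized."""
    match = re.match(r"^(sm|compute)_(\d{2,})", arch)
    if match is None:
        return None
    kind, digits = match.groups()
    # The last digit is the minor version, everything before it is the major.
    return kind, int(digits[:-1]), int(digits[-1])


def build_supports_device(capability, arch_list):
    """Return True if any listed architecture can run on the given device.

    Simplified rule (an assumption of this sketch):
    - 'sm_XY' binaries run on devices with the same major version and a
      minor version >= Y.
    - 'compute_XY' PTX can be JIT-compiled for any device with capability
      >= X.Y.
    """
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue
        kind, arch_major, arch_minor = parsed
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    # torch.cuda.get_arch_list() only exists in newer PyTorch releases,
    # so its presence is checked before calling it.
    if (
        torch is not None
        and torch.cuda.is_available()
        and hasattr(torch.cuda, "get_arch_list")
    ):
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("Device capability:", capability)
        print("Architectures in this build:", arch_list)
        print("Build supports device:", build_supports_device(capability, arch_list))
    else:
        print("PyTorch with CUDA (and get_arch_list) is not available here.")
```

Running this inside the training container would show whether the build in a given image lists an architecture compatible with 3.7. If it does not, staying on an image that works (pytorch-gpu.1-4, as noted above) or using a newer GPU type would be consistent with the workaround described here.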
https://stackoverflow.com/questions/66104507