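I am using Google Colab to train an image recognition algorithm with TensorFlow 1.15. I have uploaded all the required files to Google Drive and got the code running up to the point where the shuffle buffer is being filled. However, I then get a "^C" in the output and I don't know what is happening.
Note: I previously tried training the algorithm on my PC and did not delete the checkpoint files generated by that earlier training session. Could that be the problem?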
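To see whether files from the earlier session are actually present, the training directory can be listed before launching train.py. This is a minimal sketch, not part of the original notebook: the helper name `find_checkpoint_files` is hypothetical, and the file patterns assume the usual TF1 naming (`checkpoint` and `model.ckpt-*`). It only reports what is there; it does not establish that leftover checkpoints cause the "^C".

```python
import glob
import os


def find_checkpoint_files(train_dir):
    """Return sorted paths in train_dir that look like TF1 checkpoint files.

    The patterns are an assumption based on the usual TF1 naming scheme.
    """
    patterns = ["checkpoint", "model.ckpt*"]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(train_dir, pattern)))
    return sorted(found)


if __name__ == "__main__":
    # "training/" is the --train_dir value used in the train.py command.
    for path in find_checkpoint_files("training/"):
        print(path)
```

If the list is not empty, moving those files to a backup folder before a fresh run would rule this cause in or out.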
Code:
!pip install --upgrade pip
!pip install --upgrade protobuf
!pip install tensorflow-gpu==1.15
import tensorflow as tf
print(tf.__version__)
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at {}'.format(device_name))
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()
from google.colab import drive
#Mount the drive
drive.mount('/content/gdrive')
#Change to working tensorflow directory on the drive
%cd '/content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/'
!apt-get install protobuf-compiler python-pil python-lxml python-tk
!pip install Cython
%cd /content/gdrive/My Drive/weeds/tensorflow_models/models/research/
!protoc object_detection/protos/*.proto --python_out=.
import os
os.environ['PYTHONPATH'] += ':/content/gdrive/My Drive/weeds/tensorflow_models/models/research/:/content/gdrive/My Drive/weeds/tensorflow_models/models/research/slim'
!python setup.py build
!python setup.py install
import time, psutil
Start = time.time() - psutil.boot_time()
Left = 12*3600 - Start
print('Time remaining for this session is: ', Left/3600)
!pip install tf_slim
%cd /content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/
os.environ['PYTHONPATH'] += ':/content/gdrive/My Drive/weeds/tensorflow_models/models/research/:/content/gdrive/My Drive/weeds/tensorflow_models/models/research/slim'
!python train.py --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_coco.config --logtostderr
The process ends here, but it should go on to train the model and report "global step" progress.
2020-10-18 22:42:45.587477: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 168 of 2048
2020-10-18 22:42:55.668973: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 334 of 2048
2020-10-18 22:43:06.067869: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 379 of 2048
2020-10-18 22:43:15.705090: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 503 of 2048
2020-10-18 22:43:26.781151: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 576 of 2048
2020-10-18 22:43:38.120069: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 640 of 2048
2020-10-18 22:43:45.813089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 708 of 2048
2020-10-18 22:43:58.071040: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 752 of 2048
2020-10-18 22:44:07.506961: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 828 of 2048
2020-10-18 22:44:16.355753: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 908 of 2048
2020-10-18 22:44:25.922348: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 960 of 2048
INFO:tensorflow:global_step/sec: 0
I1018 22:44:34.783342 140291121678080 supervisor.py:1099] global_step/sec: 0
2020-10-18 22:44:36.327813: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1036 of 2048
2020-10-18 22:44:45.651473: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1151 of 2048
2020-10-18 22:44:55.554234: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1186 of 2048
2020-10-18 22:45:05.648568: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1242 of 2048
2020-10-18 22:45:15.644396: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1313 of 2048
2020-10-18 22:45:25.551708: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1386 of 2048
2020-10-18 22:45:35.549003: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1458 of 2048
2020-10-18 22:45:45.648835: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1531 of 2048
2020-10-18 22:45:55.643920: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1602 of 2048
2020-10-18 22:46:05.559702: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1674 of 2048
2020-10-18 22:46:15.547609: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1746 of 2048
2020-10-18 22:46:25.645939: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1819 of 2048
INFO:tensorflow:global_step/sec: 0
I1018 22:46:35.052108 140291121678080 supervisor.py:1099] global_step/sec: 0
2020-10-18 22:46:35.645583: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1891 of 2048
2020-10-18 22:46:45.553851: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1962 of 2048
^C
What can I do to fix this? The training process works fine on my PC (NVIDIA GeForce RTX); I just need the extra computing power that Google Colab provides.
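Posted on 2020-10-27 04:02:51
I can't run your code because it depends on files I don't have. But I can tell you that this is probably because you are using TF 1 on a GPU, and on Colab downgrading is not easy when a GPU is involved.
For example, I don't see anywhere in your code where you downgrade CUDA to the version you need, like this: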
!wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
!dpkg -i cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
!apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
!apt-get update
!apt-get install cuda=9.0.176-1
You can check the CUDA version with !nvcc --version.
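The nvcc check can also be done from Python, so that a notebook cell can stop early when the toolkit version is not the expected one. This is a rough sketch under stated assumptions: the helper names `parse_nvcc_release` and `installed_cuda_release` are made up for illustration, and the parser assumes the usual "release X.Y" wording in the nvcc output. Which version to compare against depends on the TensorFlow build; the published tested configurations for tensorflow-gpu 1.15 list CUDA 10.0, so the 9.0 in the commands above should be read as an example of the procedure.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the release string (e.g. '10.0') from `nvcc --version` text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run `nvcc --version` and return the parsed release, or None if nvcc fails."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("CUDA release reported by nvcc:", installed_cuda_release())
```

Comparing the returned string with the version the installed TensorFlow build expects gives a quick sanity check before starting a long training run.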
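Colab is also not very quick at downgrading the TensorFlow version. You may need to restart the runtime several times.
I suggest that you port your code to TensorFlow 2.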
https://stackoverflow.com/questions/64419191