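I have two NVIDIA GPUs in my machine, but I am not using them.
I have three NN training runs going on my machine. When I try to start a fourth script, it fails with the following error: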
my_user@my_machine:~/my_project/training_my_project$ python3 my_project.py
Traceback (most recent call last):
  File "my_project.py", line 211, in <module>
    load_data(
  File "my_project.py", line 132, in load_data
    tx = tf.convert_to_tensor(data_x, dtype=tf.float32)
  File "/home/my_user/.local/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/my_user/.local/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Failed to allocate scratch buffer for device 0
my_user@my_machine:~/my_project/training_my_project$
How can I solve this issue?
Here is my RAM usage:
my_user@my_machine:~/my_project/training_my_project$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15947        6651        3650          20        5645        8952
Swap:          2047         338        1709
my_user@my_machine:~/my_project/training_my_project$
Here is my CPU usage:
my_user@my_machine:~$ top -i
top - 12:46:12 up 79 days, 21:14, 2 users, load average: 4,05, 3,82, 3,80
Tasks: 585 total, 2 running, 583 sleeping, 0 stopped, 0 zombie
%Cpu(s): 11,7 us, 1,6 sy, 0,0 ni, 86,6 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
MiB Mem : 15947,7 total, 3638,3 free, 6662,7 used, 5646,7 buff/cache
MiB Swap: 2048,0 total, 1709,4 free, 338,6 used. 8941,6 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2081821 my_user   20   0   48,9g   2,5g 471076 S 156,1  15,8   1832:54 python3
2082196 my_user   20   0   48,8g   2,6g 467708 S 148,5  16,8   1798:51 python3
2076942 my_user   20   0   47,8g   1,6g 466916 R 147,5  10,3   2797:51 python3
   1594 gdm       20   0 3989336  65816  31120 S   0,7   0,4  38:03.14 gnome-shell
     93 root      rt   0       0      0      0 S   0,3   0,0   0:38.42 migration/13
   1185 root     -51   0       0      0      0 S   0,3   0,0   3925:59 irq/54-nvidia
2075861 root      20   0       0      0      0 I   0,3   0,0   1:30.17 kworker/22:0-events
2076418 root      20   0       0      0      0 I   0,3   0,0   1:38.65 kworker/1:0-events
2085325 root      20   0       0      0      0 I   0,3   0,0   1:17.15 kworker/3:1-events
2093002 root      20   0       0      0      0 I   0,3   0,0   1:00.05 kworker/23:0-events
2100000 root      20   0       0      0      0 I   0,3   0,0   0:45.78 kworker/2:2-events
2104688 root      20   0       0      0      0 I   0,3   0,0   0:33.08 kworker/9:0-events
2106767 root      20   0       0      0      0 I   0,3   0,0   0:25.16 kworker/20:0-events
2115469 root      20   0       0      0      0 I   0,3   0,0   0:01.98 kworker/11:2-events
2115470 root      20   0       0      0      0 I   0,3   0,0   0:01.96 kworker/12:2-events
2115477 root      20   0       0      0      0 I   0,3   0,0   0:01.95 kworker/30:1-events
2116059 my_user   20   0   23560   4508   3420 R   0,3   0,0   0:00.80 top
Here is my TF configuration:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"] = "99" # Use both gpus for training.
import sys, random
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
import numpy as np
from lxml import etree, objectify
# <editor-fold desc="GPU">
# resolve GPU related issues.
try:
    physical_devices = tf.config.list_physical_devices('GPU')
    for gpu_instance in physical_devices:
        tf.config.experimental.set_memory_growth(gpu_instance, True)
except Exception as e:
    pass
# END of try
# </editor-fold>
Please note that the commented-out lines really are commented out.
Relevant source code:
def load_data(fname: str, class_index: int, feature_start_index: int, **selection):
    i = 0
    file = open(fname)
    if "top_n_lines" in selection:
        lines = [next(file) for _ in range(int(selection["top_n_lines"]))]
    elif "random_n_lines" in selection:
        tmp_lines = file.readlines()
        lines = random.sample(tmp_lines, int(selection["random_n_lines"]))
    else:
        lines = file.readlines()
    data_x, data_y = [], []
    for l in lines:
        row = l.strip().split()
        x = [float(ix) for ix in row[feature_start_index:]]
        y = encode(row[class_index])
        data_x.append(x)
        data_y.append(y)
    # END for l in lines
    num_rows = len(data_x)
    given_fraction = selection.get("validation_part", 1.0)
    if given_fraction > 0.9999:
        valid_x, valid_y = data_x, data_y
    else:
        n = int(num_rows * given_fraction)
        data_x, data_y = data_x[n:], data_y[n:]
        valid_x, valid_y = data_x[:n], data_y[:n]
    # END of if-else block
    tx = tf.convert_to_tensor(data_x, np.float32)
    ty = tf.convert_to_tensor(data_y, np.float32)
    vx = tf.convert_to_tensor(valid_x, np.float32)
    vy = tf.convert_to_tensor(valid_y, np.float32)
    return tx, ty, vx, vy
# END of the function
Posted on 2022-02-10 17:53:01
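Using multiple GPUs
If you are developing on a system with a single GPU, you can simulate multiple GPUs with virtual devices. This makes it easy to test multi-GPU setups without requiring additional resources.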
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Create 2 virtual GPUs with 1GB memory each
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
             tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Note: virtual devices cannot be modified after they have been initialized.
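The `memory_limit` values above are given in megabytes, and the limits of all logical devices carved out of one physical GPU have to fit within that GPU's memory. As a rough sketch of that arithmetic, the hypothetical helper below divides a memory budget evenly. `split_memory_mb` and its `reserve_mb` parameter are illustrative names, not part of TensorFlow.

```python
from typing import List


def split_memory_mb(total_mb: int, n_devices: int, reserve_mb: int = 0) -> List[int]:
    """Return equal per-device memory limits (in MB) that fit within total_mb.

    Hypothetical helper for illustration; reserve_mb is headroom left unallocated.
    """
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    usable_mb = total_mb - reserve_mb
    if usable_mb < n_devices:
        raise ValueError("not enough memory to give each device at least 1 MB")
    # Integer division keeps the sum of the limits at or below the usable budget.
    return [usable_mb // n_devices] * n_devices


limits = split_memory_mb(total_mb=2048, n_devices=2)
print(limits)  # [1024, 1024], the two limits used in the snippet above
```

Each value could then be passed as `memory_limit` to a `tf.config.LogicalDeviceConfiguration`.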
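Once multiple logical GPUs are available to the runtime, you can use them with tf.distribute.Strategy or with manual placement.
The best practice for using multiple GPUs is tf.distribute.Strategy. Here is a simple example: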
tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpus)
with strategy.scope():
    inputs = tf.keras.layers.Input(shape=(1,))
    predictions = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
    model.compile(loss='mse',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))
This program will run a copy of your model on each GPU, splitting the input data between them, also known as "data parallelism".
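To make "splitting the input data between them" concrete, here is a plain-Python sketch of how one global batch could be divided among replicas. It only illustrates the idea: `split_batch` is a hypothetical helper and does not reproduce how `tf.distribute.MirroredStrategy` shards data internally.

```python
from typing import List, Sequence


def split_batch(batch: Sequence[float], num_replicas: int) -> List[List[float]]:
    """Divide one global batch into contiguous, near-equal shards, one per replica."""
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    base, extra = divmod(len(batch), num_replicas)
    shards, start = [], 0
    for replica in range(num_replicas):
        # The first `extra` replicas take one additional example.
        size = base + (1 if replica < extra else 0)
        shards.append(list(batch[start:start + size]))
        start += size
    return shards


global_batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
for replica_id, shard in enumerate(split_batch(global_batch, num_replicas=2)):
    print(f"replica {replica_id}: {shard}")
```

In data-parallel training, each replica computes gradients on its own shard, and the gradients are combined before the shared weights are updated.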
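Posted on 2022-02-09 19:45:36
The RAM complaint is not about your system RAM (call it CPU RAM). It is about your GPU RAM.
By default, the moment TF loads, it allocates almost all of the GPU RAM for itself (a small fraction is left over because of page-size effects).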
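Since the question says the GPUs are not meant to be used at all, one option that is not part of this answer is to hide them from the process, so that TensorFlow never tries to allocate GPU memory. A minimal sketch, assuming the standard CUDA environment variable `CUDA_VISIBLE_DEVICES` (`hide_gpus` is a hypothetical helper name):

```python
import os


def hide_gpus() -> None:
    """Hide all CUDA devices from this process.

    The variable has to be set before TensorFlow initializes CUDA;
    setting it before the import is the safe choice.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()
# Import TensorFlow only after the variable is set, e.g.:
#   import tensorflow as tf
#   print(tf.config.list_physical_devices("GPU"))  # check that no GPU is listed
```

Calling `tf.config.set_visible_devices([], 'GPU')` before any GPU has been initialized is an in-process alternative.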
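Your sample makes TF allocate GPU RAM dynamically, but it can still end up using all of the GPU RAM. Use the snippet from the docs below, which calls set_logical_device_configuration, to put a hard cap on the GPU RAM of each process. You will probably want to change 1024 to 8096 or something similar.
FYI, use nvidia-smi to monitor your GPU memory usage.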
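As a rough sketch of that monitoring step, the snippet below parses the CSV output of `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`. The helper names (`parse_gpu_memory`, `query_gpu_memory`) are hypothetical, and the sample string is made up for illustration rather than taken from the machine in the question.

```python
import subprocess
from typing import Dict, List

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> List[Dict[str, int]]:
    """Parse 'index, used, total' lines (MiB) into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        gpus.append({"index": index, "used_mib": used,
                     "total_mib": total, "free_mib": total - used})
    return gpus


def query_gpu_memory() -> List[Dict[str, int]]:
    """Run nvidia-smi and return the parsed memory usage of every GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Made-up sample in the same format, so the parser can be tried without a GPU.
sample = "0, 7900, 8192\n1, 12, 8192\n"
for gpu in parse_gpu_memory(sample):
    print(gpu)
```

On a machine with the NVIDIA driver installed, `query_gpu_memory()` could be called before starting another training process to see how much memory is still free on each GPU.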
From the docs:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
https://stackoverflow.com/questions/71017766