
Q: How do I run multiple neural network training jobs?

Stack Overflow user

Asked 2022-02-07 11:23:58

2 answers · 161 views · 0 followers · 3 votes

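I have two NVIDIA GPUs in my machine, but I am not using them.

I have three NN training jobs running on my machine. When I try to start a fourth script, it fails with the following error: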

my_user@my_machine:~/my_project/training_my_project$ python3 my_project.py
Traceback (most recent call last):
  File "my_project.py", line 211, in <module>
    load_data(
  File "my_project.py", line 132, in load_data
    tx = tf.convert_to_tensor(data_x, dtype=tf.float32)
  File "/home/my_user/.local/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/my_user/.local/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Failed to allocate scratch buffer for device 0
my_user@my_machine:~/my_project/training_my_project$

How can I solve this problem?

Here is my RAM usage:

my_user@my_machine:~/my_project/training_my_project$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15947        6651        3650          20        5645        8952
Swap:          2047         338        1709
my_user@my_machine:~/my_project/training_my_project$

Here is my CPU usage:

my_user@my_machine:~$ top -i
top - 12:46:12 up 79 days, 21:14,  2 users,  load average: 4,05, 3,82, 3,80
Tasks: 585 total,   2 running, 583 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11,7 us,  1,6 sy,  0,0 ni, 86,6 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :  15947,7 total,   3638,3 free,   6662,7 used,   5646,7 buff/cache
MiB Swap:   2048,0 total,   1709,4 free,    338,6 used.   8941,6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2081821 my_user  20   0   48,9g   2,5g 471076 S 156,1  15,8   1832:54 python3
2082196 my_user  20   0   48,8g   2,6g 467708 S 148,5  16,8   1798:51 python3
2076942 my_user  20   0   47,8g   1,6g 466916 R 147,5  10,3   2797:51 python3
   1594 gdm       20   0 3989336  65816  31120 S   0,7   0,4  38:03.14 gnome-shell
     93 root      rt   0       0      0      0 S   0,3   0,0   0:38.42 migration/13
   1185 root     -51   0       0      0      0 S   0,3   0,0   3925:59 irq/54-nvidia
2075861 root      20   0       0      0      0 I   0,3   0,0   1:30.17 kworker/22:0-events
2076418 root      20   0       0      0      0 I   0,3   0,0   1:38.65 kworker/1:0-events
2085325 root      20   0       0      0      0 I   0,3   0,0   1:17.15 kworker/3:1-events
2093002 root      20   0       0      0      0 I   0,3   0,0   1:00.05 kworker/23:0-events
2100000 root      20   0       0      0      0 I   0,3   0,0   0:45.78 kworker/2:2-events
2104688 root      20   0       0      0      0 I   0,3   0,0   0:33.08 kworker/9:0-events
2106767 root      20   0       0      0      0 I   0,3   0,0   0:25.16 kworker/20:0-events
2115469 root      20   0       0      0      0 I   0,3   0,0   0:01.98 kworker/11:2-events
2115470 root      20   0       0      0      0 I   0,3   0,0   0:01.96 kworker/12:2-events
2115477 root      20   0       0      0      0 I   0,3   0,0   0:01.95 kworker/30:1-events
2116059 my_user  20   0   23560   4508   3420 R   0,3   0,0   0:00.80 top

Here is my TF configuration:

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"] = "99" # Use both gpus for training.


import sys, random
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
import numpy as np
from lxml import etree, objectify


# <editor-fold desc="GPU">
# resolve GPU related issues.
try:
    physical_devices = tf.config.list_physical_devices('GPU') 
    for gpu_instance in physical_devices: 
        tf.config.experimental.set_memory_growth(gpu_instance, True)
except Exception as e:
    pass
# END of try
# </editor-fold>

Please treat the commented-out lines as comments.

Relevant source code:

def load_data(fname: str, class_index: int, feature_start_index: int, **selection):
    i = 0
    file = open(fname)
    if "top_n_lines" in selection:
        lines = [next(file) for _ in range(int(selection["top_n_lines"]))]
    elif "random_n_lines" in selection:
        tmp_lines = file.readlines()
        lines = random.sample(tmp_lines, int(selection["random_n_lines"]))
    else:
        lines = file.readlines()

    data_x, data_y = [], []
    for l in lines:
        row = l.strip().split()
        x = [float(ix) for ix in row[feature_start_index:]]
        y = encode(row[class_index])
        data_x.append(x)
        data_y.append(y)  
    # END for l in lines

    num_rows = len(data_x)
    given_fraction = selection.get("validation_part", 1.0)
    if given_fraction > 0.9999:
        valid_x, valid_y = data_x, data_y
    else:
        n = int(num_rows * given_fraction)
        # Take the validation rows before trimming the training rows,
        # so the two sets do not overlap.
        valid_x, valid_y = data_x[:n], data_y[:n]
        data_x, data_y = data_x[n:], data_y[n:]
    # END of if-else block

    tx = tf.convert_to_tensor(data_x, np.float32)
    ty = tf.convert_to_tensor(data_y, np.float32)
    
    vx = tf.convert_to_tensor(valid_x, np.float32)
    vy = tf.convert_to_tensor(valid_y, np.float32)  

    return tx, ty, vx, vy
# END of the function

Tags: python, tensorflow, gpu

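2 Answers

Stack Overflow user

Answered 2022-02-10 17:53:01

Using multiple GPUs

If you are developing on a system with a single GPU, you can simulate multiple GPUs with virtual devices. This makes it easy to test multi-GPU setups without requiring additional resources.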

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Create 2 virtual GPUs with 1GB memory each
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
         tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

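Note: virtual devices cannot be modified after they have been initialized.

Once multiple logical GPUs are available to the runtime, you can use them with tf.distribute.Strategy or with manual placement.

The best practice for using multiple GPUs is tf.distribute.Strategy. Here is a simple example: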

tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpus)
with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(1,))
  predictions = tf.keras.layers.Dense(1)(inputs)
  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))

This program runs a copy of your model on each GPU and splits the input data between them, which is also known as "data parallelism".
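As a rough sketch of what that splitting means in practice, the example below scales the global batch size by the number of replicas and trains the same one-layer model. The helper names `global_batch_size` and `train_data_parallel` and the default batch size of 32 are assumptions made for illustration; they are not part of the answer or of TensorFlow.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Each replica receives per_replica_batch examples per step."""
    return per_replica_batch * num_replicas


def train_data_parallel(features, labels, per_replica_batch=32, epochs=1):
    """Train a one-layer model on every visible GPU (or on the CPU if none).

    `features` is expected to have shape (N, 1) and `labels` shape (N,).
    """
    import tensorflow as tf  # imported here so the helper above needs no TF

    strategy = tf.distribute.MirroredStrategy()
    batch = global_batch_size(per_replica_batch,
                              strategy.num_replicas_in_sync)
    # Keras splits each global batch evenly across the replicas.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(batch)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(1,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(loss="mse",
                      optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))

    model.fit(dataset, epochs=epochs)
    return model
```

With two GPUs and the default of 32 examples per replica, each training step would consume a global batch of 64 examples.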

For more information about distribution strategies or manual placement, see the corresponding TensorFlow guides.

Votes: 1

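Stack Overflow user

Answered 2022-02-09 19:45:36

The RAM complaint is not about your system RAM (call it CPU RAM). It is about your GPU RAM.

The moment TF loads, it allocates all of the GPU RAM for itself (a small fraction is left over because of page-size issues).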
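Since that allocation happens once per process, one option is to restrict which GPUs each training process can see before TensorFlow is imported. The sketch below does this through the `CUDA_VISIBLE_DEVICES` environment variable; the helper name `pin_process_to_gpus` is hypothetical, and using `-1` to hide every GPU is a common idiom assumed here rather than something stated in the answer.

```python
import os


def pin_process_to_gpus(gpu_indices):
    """Restrict the CUDA devices this process can see.

    Must run before `import tensorflow`; an empty list hides every GPU.
    """
    value = ",".join(str(i) for i in gpu_indices) if gpu_indices else "-1"
    # Number devices by PCI bus ID so the indices match nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Example: give this training job only the second GPU.
pin_process_to_gpus([1])
# import tensorflow as tf  # import only after the environment is set
```

Launching two jobs with `[0]` and two with `[1]` would spread four training scripts over both cards, while `[]` would keep a job on the CPU.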

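Your sample makes TF allocate GPU RAM dynamically, but it can still end up using all of the GPU RAM. Use the code below to give each process a hard limit on GPU RAM. You may want to change 1024 to 8096 or something similar.

FYI, use nvidia-smi to monitor GPU RAM usage.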
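If you would rather check GPU memory from a script than watch the terminal, a minimal sketch follows. It assumes `nvidia-smi` is on `PATH` and that every queried field comes back as a plain number; the function names `parse_gpu_memory` and `query_gpu_memory` are illustrative only.

```python
import subprocess

# CSV query mode of nvidia-smi: one line per GPU, values in MiB.
QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines into a list of dicts (MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"index": index,
                     "used_mib": used,
                     "total_mib": total,
                     "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` before starting another training script shows how much memory each card still has free.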

From the docs, on limiting GPU memory growth:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
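To pick a value for `memory_limit` when several training processes share one card, a simple approach is to divide the card's memory evenly and keep a small reserve. The sketch below assumes you already know the card's total memory in MiB; the helper names and the 512 MiB reserve are hypothetical choices, not values from the TensorFlow docs.

```python
def per_process_limit_mib(total_mib, n_processes, reserve_mib=512):
    """Split a GPU's memory evenly across processes, keeping a reserve."""
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    usable = total_mib - reserve_mib
    if usable <= 0:
        raise ValueError("reserve_mib leaves no usable GPU memory")
    return usable // n_processes


def cap_first_gpu(limit_mib):
    """Cap TensorFlow's allocation on the first GPU.

    Call this before any operation initializes the GPU.
    Returns the number of physical GPUs found.
    """
    import tensorflow as tf  # imported here so the helper above needs no TF

    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=limit_mib)])
    return len(gpus)
```

For example, on an 8192 MiB card shared by four jobs, `cap_first_gpu(per_process_limit_mib(8192, 4))` would limit each process to 1920 MiB.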

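Votes: 0

The original content of this page is provided by Stack Overflow. Translation support is provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: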

https://stackoverflow.com/questions/71017766
