
OOM: memory error while optimizing hyperparameters with Talos on a TensorFlow model

Stack Overflow user
Asked on 2020-11-09 00:33:14
1 answer · 356 views · 0 followers · 0 votes

While searching for the optimal hyperparameters for my AlexNet with the help of Talos, I get a memory error. It always happens at the same point (32/240), even if I change the parameters slightly (which rules out a single unfavorable combination as the cause).

Error message:
ResourceExhaustedError:  OOM when allocating tensor with shape[32,96,26,26] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node max_pooling2d_1/MaxPool (defined at D:\anaconda\envs\tf_ks\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_keras_scratch_graph_246047]

Function call stack:
keras_scratch_graph

Here is my code:

Session configuration:
import tensorflow as tf
from keras import backend as K

# Let the allocator grow instead of grabbing all GPU memory up front
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.99
sess = tf.compat.v1.Session(config = config)
K.set_session(sess)
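(As a side note, not part of the question's code: on TF 2.x the same grow-as-needed behaviour can also be requested through the newer device API. A minimal sketch, assuming the `tf.config.experimental` API available since TF 2.1:)

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory incrementally rather than all at once.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```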

Configuration and fit of the AlexNet:
def alexnet(x_train, y_train, x_val, y_val, params):
    
    K.clear_session()
    
    if params['activation'] == 'leakyrelu':
        activation_layer = LeakyReLU(alpha = params['leaky_alpha'])
    elif params['activation'] == 'relu':
        activation_layer = ReLU()
    
    model = Sequential([
        Conv2D(filters=96, kernel_size=(11,11), strides=(4,4), activation='relu', input_shape=(224,224,Global.num_image_channels)),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        Conv2D(filters=384, kernel_size=(1,1), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        Conv2D(filters=256, kernel_size=(1,1), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Flatten(),
        Dense(4096, activation=activation_layer),
        Dropout(0.5),#todo
        Dense(4096, activation=activation_layer),
        Dropout(0.5),#todo
        Dense(units = 2, activation=activation_layer)
        #Dense(10, activation='softmax')
    ])
        
    model.compile(
        optimizer = params['optimizer'](lr = lr_normalizer(params['lr'], params['optimizer'])), 
        loss = Global.loss_funktion, 
        metrics = get_reduction_metric(Global.reduction_metric)
    )
    train_generator, valid_generator = create_data_pipline(params['batch_size'], params['samples'])
    tg_steps_per_epoch = train_generator.n // train_generator.batch_size
    vg_validation_steps = valid_generator.n // valid_generator.batch_size
    print('Steps per Epoch: {}, Validation Steps: {}'.format(tg_steps_per_epoch, vg_validation_steps))
    
    
    startTime = datetime.now()
    
    out = model.fit(
        x = train_generator,
        epochs = params['epochs'],
        validation_data = valid_generator,
        steps_per_epoch = tg_steps_per_epoch,
        validation_steps = vg_validation_steps,
        #callbacks = [checkpointer]
        workers = 8
    )
    print("Time taken:", datetime.now() - startTime)

    return out, model
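(For illustration: the `steps_per_epoch` computation above is plain floor division. Assuming, hypothetically, a generator exposing all 20000 `'samples'` from the grid below, it works out to:)

```python
# Floor-division step count, mirroring tg_steps_per_epoch in alexnet() above.
# The 20000-sample figure is the 'samples' grid value, assumed here to be
# the generator's full size.
for batch_size in (32, 64):
    steps = 20000 // batch_size
    print(batch_size, steps)  # 32 -> 625, 64 -> 312
```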

The hyperparameter dictionary:
hyper_parameter = {
    'samples': [20000],
    'epochs': [1],
    'batch_size': [32, 64],
    'optimizer': [Adam],
    'lr': [1, 2],
    'first_neuron': [1024, 2048, 4096],
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [0, 1, 2, 3, 4],
    'leaky_alpha': [0.1] # default for LeakyReLU, otherwise PReLU
}
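(Incidentally, this grid contains 240 permutations, the product of the option-list lengths, which matches the 240 in the reported "32/240". A quick check in plain Python, with a placeholder standing in for the Adam class:)

```python
from math import prod

# Same shape as the grid above; 'Adam' is a stand-in for the optimizer class.
hyper_parameter = {
    'samples': [20000],
    'epochs': [1],
    'batch_size': [32, 64],
    'optimizer': ['Adam'],
    'lr': [1, 2],
    'first_neuron': [1024, 2048, 4096],
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [0, 1, 2, 3, 4],
    'leaky_alpha': [0.1],
}

n_permutations = prod(len(v) for v in hyper_parameter.values())
print(n_permutations)  # → 240
```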

Run Talos:

dummy_x = np.empty((1, 2, 3, 224, 224))
dummy_y = np.empty((1, 2))

with tf.device('/device:GPU:0'):
    t = ta.Scan(
        x = dummy_x,
        y = dummy_y,
        model = alexnet,
        params = hyper_parameter,
        experiment_name = '{}'.format(Global.dataset),
        #shuffle=False,
        reduction_metric = Global.reduction_metric,
        disable_progress_bar = False,
        print_params = True,
        clear_session = 'tf',
        save_weights = False
    )

t.data.to_csv(Global.target_dir + Global.results, index = True)

The memory usage is always high, but it does not rise over the epochs and barely changes at all.

nvidia-smi output (screenshot not reproduced):

Can anyone help me?

What I have already tried:

==========================================================================

1) Splitting the Talos run:

This led to the same error.

hyper_parameter = {
    'samples': [20000],
    'epochs': [1],
    'batch_size': [32, 64],
    'optimizer': [Adam],
    'lr': [1, 2, 3, 5],
    'first_neuron': [9999],
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [9999],
    'leaky_alpha': [0.1] # default for LeakyReLU, otherwise PReLU
}

dummy_x = np.empty((1, 2, 3, 224, 224))
dummy_y = np.empty((1, 2))
first = True

for h in [0, 1, 2, 3, 4]:
    hyper_parameter['hidden_layers'] = [h]
    for fn in [1024, 2048, 4096]:
        hyper_parameter['first_neuron'] = [fn]

        with tf.device('/device:GPU:1'):
            t = ta.Scan(
                x = dummy_x,
                y = dummy_y,
                model = alexnet,
                params = hyper_parameter,
                experiment_name = '{}'.format(Global.dataset),
                #shuffle=False,
                reduction_metric = Global.reduction_metric,
                disable_progress_bar = False,
                print_params = True,
                clear_session = 'tf',
                save_weights = False
            )
            if first:
                t.data.to_csv(Global.target_dir + Global.results, index = True, mode='a')
                first = False
            else:
                t.data.to_csv(Global.target_dir + Global.results, index = True, mode='a', header=False)

==========================================================================

2) Running the model in its own thread

While searching for the cause, I found others complaining about the same problem and blaming TensorFlow for not actually executing K.clear_session().

Maybe the idea is silly, but I tried training the model in a separate thread.

from threading import Thread
def gen_model_thread(x_train, y_train, x_val, y_val, params):
    
    thread = Thread(target=alexnet, args=(x_train, y_train, x_val, y_val, params))
    thread.start()
    return_value = thread.join()
    return return_value
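(Incidentally, `Thread.join()` always returns `None`, so `return_value` above can never hold the `(out, model)` tuple Talos expects; that is where the final "cannot unpack non-iterable NoneType object" comes from. One common workaround is to capture the result in a shared container; a sketch with a stand-in worker, not the real `alexnet`:)

```python
from threading import Thread

def run_in_thread(target, *args):
    """Run target(*args) in a thread and hand its return value back."""
    result = {}

    def wrapper():
        result['value'] = target(*args)  # capture the return value

    thread = Thread(target=wrapper)
    thread.start()
    thread.join()  # join() itself always returns None
    return result['value']

# Stand-in worker: the real code would call alexnet(...) here.
def dummy_model(a, b):
    return a + b, 'model'

out, model = run_in_thread(dummy_model, 1, 2)
print(out, model)  # → 3 model
```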
with tf.device('/device:GPU:0'):
    t = ta.Scan(
        x = dummy_x,
        y = dummy_y,
        model = gen_model_thread,
        params = hyper_parameter,
        experiment_name = '{}'.format(Global.dataset),
        #shuffle=False,
        reduction_metric = Global.reduction_metric,
        disable_progress_bar = False,
        print_params = True,
        clear_session = True,
        save_weights = False
    )

This led to a type error:
Traceback (most recent call last):
  File "D:\anaconda\envs\tf_ks\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "D:\anaconda\envs\tf_ks\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "<ipython-input-3-2942ae0a0a56>", line 5, in gen_model
    model = alexnet(params['activation'], params['leaky_alpha'])
  File "<ipython-input-2-2a405202aa5a>", line 27, in alexnet
    Dense(units = 2, activation=activation_layer)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\sequential.py", line 94, in __init__
    self.add(layer)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\sequential.py", line 162, in add
    name=layer.name + '_input')
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\input_layer.py", line 178, in Input
    input_tensor=tensor)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\input_layer.py", line 87, in __init__
    name=self.name)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\backend\tensorflow_backend.py", line 73, in symbolic_fn_wrapper
    if _SYMBOLIC_SCOPE.value:
AttributeError: '_thread._local' object has no attribute 'value'

TypeError: cannot unpack non-iterable NoneType object

I know my last resort is to do the runs manually, but I think I would run into the same problem while training my models that way anyway.

Thank you very much in advance for solving my problem, for reading my question, and for correcting the spelling mistakes in my text.

I look forward to constructive solutions from this amazing community!

==========================================================================

GPU: NVIDIA RTX 2080 and Titan Xp Collectors Edition (I tried both)

TensorFlow: 2.1.0

Keras: 2.3.1

Talos: 1.0


1 Answer

Stack Overflow user

Answered on 2021-02-01 11:46:35

Disabling eager execution solved the problem for me: tf.compat.v1.disable_eager_execution()

https://github.com/autonomio/talos/issues/482
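(A minimal sketch of where that call goes; the assumption here is that it must run once at program startup, before any model, session, or Talos scan is constructed, or it has no effect on already-built graphs:)

```python
import tensorflow as tf

# Switch TF 2.x back to graph mode. This must happen at program startup,
# before any Keras model, session, or Talos Scan is created.
tf.compat.v1.disable_eager_execution()

# ... build the model and run ta.Scan(...) only after this point ...
```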

Votes: 0
Original content provided by Stack Overflow; translated from the original English.
Original link: https://stackoverflow.com/questions/64744483