TensorFlow 1.15, multiple GPUs on one machine: how to set batch_size?

Data Science user

Asked on 2020-06-01 05:23:51

2 answers · 581 views · 0 followers · 1 vote

Input function code:

    def input_fn(params):
        """The actual input function."""
        batch_size = FLAGS.train_batch_size

        name_to_features = {
            "input_ids":
                tf.FixedLenFeature([max_seq_length], tf.int64),
            "input_mask":
                tf.FixedLenFeature([max_seq_length], tf.int64),
            "segment_ids":
                tf.FixedLenFeature([max_seq_length], tf.int64),
            "masked_lm_positions":
                tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
            "masked_lm_ids":
                tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
            "masked_lm_weights":
                tf.FixedLenFeature([max_predictions_per_seq], tf.float32),
            "next_sentence_labels":
                tf.FixedLenFeature([1], tf.int64),
        }

        # For training, we want a lot of parallel reading and shuffling.
        # For eval, we want no shuffling and parallel reading doesn't matter.
        if is_training:
            d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
            d = d.repeat()
            d = d.shuffle(buffer_size=len(input_files))

            # `cycle_length` is the number of parallel files that get read.
            cycle_length = min(num_cpu_threads, len(input_files))

            # `sloppy` mode means that the interleaving is not exact. This adds
            # even more randomness to the training pipeline.
            d = d.apply(
                tf.contrib.data.parallel_interleave(
                    tf.data.TFRecordDataset,
                    sloppy=is_training,
                    cycle_length=cycle_length))
            d = d.shuffle(buffer_size=100)
        else:
            d = tf.data.TFRecordDataset(input_files)
            # Since we evaluate for a fixed number of steps we don't want to encounter
            # out-of-range exceptions.
            d = d.repeat()

        # We must `drop_remainder` on training because the TPU requires fixed
        # size dimensions. For eval, we assume we are evaluating on the CPU or GPU
        # and we *don't* want to drop the remainder, otherwise we won't cover
        # every sample.
        d = d.apply(
            tf.contrib.data.map_and_batch(
                lambda record: _decode_record(record, name_to_features),
                batch_size=batch_size,
                num_parallel_batches=num_cpu_threads,
                drop_remainder=True))
        d = d.prefetch(10)
        return d

MirroredStrategy code:

    distribution = tf.contrib.distribute.MirroredStrategy(
        devices=["device:GPU:%d" % i for i in range(FLAGS.n_gpus)],
        # num_gpus=4,
        cross_tower_ops=tf.distribute.HierarchicalCopyAllReduce())
    run_config = RunConfig(
        train_distribute=distribution,
        # eval_distribute=dist_strategy,
        log_step_count_steps=log_every_n_steps,
        model_dir=FLAGS.output_dir,
        save_checkpoints_steps=FLAGS.save_checkpoints_steps)

    model_fn = model_fn_builder(
        bert_config=bert_config,
        init_checkpoint=FLAGS.init_checkpoint,
        learning_rate=FLAGS.learning_rate,
        num_train_steps=FLAGS.num_train_steps,
        num_warmup_steps=FLAGS.num_warmup_steps,
        use_tpu=FLAGS.use_tpu,
        use_one_hot_embeddings=FLAGS.use_tpu)

    # If TPU is not available, this will fall back to normal Estimator on CPU
    # or GPU.
    estimator = Estimator(
        model_fn=model_fn,
        params={},
        config=run_config)

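My question: I have 4 GPUs, and each GPU can handle a batch size of 8. I set batch_size = 8 rather than 32, because batch_size = 32 runs out of memory (OOM).

Is that correct? Will the data be distributed across the 4 GPUs, with each GPU getting a different batch?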

deep-learning

tensorflow

bert

transformer

2 answers

Data Science user

Accepted answer

Answered on 2021-01-18 02:06:28

According to bert-gpu:

The code is fine.

batch_size is the batch size for one GPU.

Votes: 0

Data Science user

Answered on 2020-06-04 21:03:50

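TensorFlow handles batching under a distribution strategy differently depending on whether you use Keras, Estimator, or a custom training loop.

Since you are using a TF 1.15 Estimator with MirroredStrategy on a single worker (1 machine), each replica (one per GPU) receives a batch of size FLAGS.train_batch_size. So with 4 GPUs, the global batch size is 4 * FLAGS.train_batch_size.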
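As a quick sanity check of that arithmetic, here is a minimal sketch in plain Python. The helper names `global_batch_size` and `per_replica_batch_size` are hypothetical and are not part of TensorFlow; they only encode the multiplication described above.

```python
def global_batch_size(per_replica_batch_size, num_replicas):
    """Examples processed per synchronous step across all replicas."""
    return per_replica_batch_size * num_replicas


def per_replica_batch_size(target_global_batch_size, num_replicas):
    """Value to use as train_batch_size to reach a target global batch size."""
    if target_global_batch_size % num_replicas != 0:
        raise ValueError("global batch size must be divisible by the replica count")
    return target_global_batch_size // num_replicas


# The setup from the question: 4 GPUs, train_batch_size = 8.
print(global_batch_size(8, 4))        # 32 examples per step across all GPUs
print(per_replica_batch_size(32, 4))  # 8 examples per GPU
```

So with train_batch_size = 8 on 4 GPUs, each GPU holds only 8 examples at a time, while each step still processes 32 examples in total.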

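Here is the explanation:

In Estimator, however, the user provides an input_fn and has full control over how they want their data to be distributed across workers and devices. We do not do automatic splitting of batches, nor automatically shard the data across different workers. The provided input_fn is called once per worker, thus giving one dataset per worker. Then one batch from that dataset is fed to one replica on that worker, thereby consuming N batches for N replicas on 1 worker. In other words, the dataset returned by the input_fn should provide batches of size PER_REPLICA_BATCH_SIZE. The global batch size for a step can be obtained as PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync.

Source: TF 1.x Distribution Strategy notebook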
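The consumption pattern in the quoted passage (one dataset per worker, one batch per replica per step) can be illustrated with a toy simulation. This is only a sketch in plain Python, not TensorFlow's actual implementation: `batched_dataset` and `run_step` are hypothetical stand-ins for the dataset returned by input_fn and for one synchronous training step, and the batch size of 2 is chosen just to keep the output short.

```python
from itertools import count, islice


def batched_dataset(per_replica_batch_size):
    """Stand-in for the dataset returned by input_fn: yields batches of example ids."""
    ids = count()
    while True:
        yield list(islice(ids, per_replica_batch_size))


def run_step(dataset_iter, num_replicas):
    """Pull one batch per replica, so N replicas consume N batches in one step."""
    return [next(dataset_iter) for _ in range(num_replicas)]


dataset = batched_dataset(per_replica_batch_size=2)
step = run_step(dataset, num_replicas=4)
for gpu, batch in enumerate(step):
    print("GPU", gpu, "gets examples", batch)
print("examples consumed in this step:", sum(len(b) for b in step))
```

In this sketch each simulated GPU receives a different batch from the same dataset, and the step consumes 4 batches of 2 examples, which mirrors PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync.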

Votes: 1

Original page content provided by Data Science. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://datascience.stackexchange.com/questions/75201
