Question: TRANSIENT_ERROR for TPU in Google Colab

Stack Overflow user

Asked 2020-03-19 04:19:04

1 answer, 609 views, 0 followers, 0 votes

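I am trying to run an LRCN Keras model on a TPU using TensorFlow 2.0. The model and generator work on CPU/GPU, but I have included them here for reference. I have also initialized the TPU, it is visible, and everything looks fine, except when I run .fit():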

def frame_generator(self, batch_size, train_test, data_type):
    """Return a generator that we can use to train on. There are
    a couple different things we can return:
    data_type: 'features', 'images'
    """
    # Get the right dataset for the generator.
    train, test = self.split_train_test()
    data = train if train_test == 'train' else test

    #print("Creating %s generator with %d samples." % (train_test, len(data)))

    while 1:
        X, y = [], []

        # Generate batch_size samples.
        for _ in range(batch_size):
            if random.random() < .5:
                # real
                while True:
                    # Get a random sample.
                    sample = random.choice(data)

                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)

                    if _y==[0,1]:
                        break
            else:
                # fake
                while True:
                    # Get a random sample.
                    sample = random.choice(data)

                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)

                    if _y==[1,0]:
                        break

            if _x is None:
                raise ValueError("Can't find sequence. Did you generate them?", sample)

            X.append(_x)
            y.append(_y)

        #yield [np.array(X), np.array(y)], np.array(y)
        yield np.array(X), np.array(y)

train_generator = data.frame_generator(batch_size, 'train', 'images')
val_generator = data.frame_generator(batch_size, 'test', 'images')

optimizer = Adam(lr=1e-5)

with tpu_strategy.scope():
  model = lrcn()
  model.add(tf.keras.layers.Dense(2, activation='softmax'))

  model.compile(loss='binary_crossentropy',
      optimizer=optimizer,
      metrics=['accuracy', tf.compat.v1.losses.log_loss])
  model.summary() 

train_data = tf.data.Dataset.from_generator(lambda:next(train_generator),
                                        (tf.float32, tf.int64),
                                        ([4, 32,299,299,3], [4,2])     
                                      )

val_data = tf.data.Dataset.from_generator(lambda:next(val_generator),
                                        (tf.float32, tf.int64),
                                      ([4, 32,299,299,3], [4,2]) 
                                      )


model.fit(x=train_data, steps_per_epoch=train_steps, validation_steps=test_steps,
      validation_data=val_data,
        epochs=30,
        callbacks=callbacks,
        verbose=1)

On model.fit, I get:

Train for 6421.0 steps, validate for 1605.0 steps

Epoch 1/30

UnavailableError                          Traceback (most recent call last)
 in ()
     15         epochs=30,
     16         callbacks=callbacks,
-> 17         verbose=1)

11 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

UnavailableError: channel is in state TRANSIENT_FAILURE
Additional GRPC error information: {"created":"@1584561754.347859160","description":"channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} [Op:__inference_distributed_function_24182]

channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} Op:__inference_distributed_function_10577

Is there any way to fix this? It looks like the problem is on Google's end of the network.

Update:

Part of the solution is that you should not install TensorFlow 2.1 with pip in a Colab notebook. Instead, run the following in its own cell, before "import tensorflow":

%tensorflow_version 2.x

This changes the TPU version from 1.15 to >=2.1.

Now, when I run the notebook, I get more details:

Train for 6902.0 steps, validate for 1725.0 steps

Epoch 1/30

1/6902 ................- ETA: 20:04:55

NotFoundError                             Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py in on_epoch(self, epoch, mode)
    766     try:
--> 767       yield epoch_logs
    768     finally:

18 frames
NotFoundError: {{function_node __inference_distributed_function_20824}} No registered 'PyFunc' OpKernel for 'CPU' devices compatible with node {{node PyFunc}}. Registered:

 [[PyFunc]]
 [[MultiDeviceIteratorGetNextFromShard]]
 [[RemoteCall]]
 [[IteratorGetNextAsOptional]]

During handling of the above exception, another exception occurred:

            if not self.model._in_multi_worker_mode(
   1054     ) or multi_worker_util.should_save_checkpoint():
-> 1055       return self.filepath.format(epoch=epoch + 1, **logs)
   1056     else:
   1057       # If this is multi-worker training, and this worker should not

KeyError: 'val_accuracy'
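A note on that second exception: it looks like a side effect of the first failure rather than a separate problem. The traceback line shows the checkpoint path being built with filepath.format(epoch=epoch + 1, **logs), and if the epoch aborts before validation runs, logs has no val_accuracy entry to fill the placeholder. The sketch below reproduces that in plain Python. The filename template is hypothetical, since the question does not show the callbacks list.

```python
# Hypothetical checkpoint filename template; the real one is not shown in the question.
filepath = 'weights.{epoch:03d}-{val_accuracy:.3f}.hdf5'


def get_file_path(epoch, logs):
    # Same call shape as the traceback line:
    #   self.filepath.format(epoch=epoch + 1, **logs)
    return filepath.format(epoch=epoch + 1, **logs)


# Normal epoch: validation ran, so 'val_accuracy' is present in logs.
print(get_file_path(0, {'loss': 0.69, 'val_accuracy': 0.5}))

# Epoch aborted before validation: the key is missing, so format() raises KeyError.
try:
    get_file_path(0, {'loss': 0.69})
except KeyError as err:
    print('KeyError:', err)
```

If this reading is right, the KeyError should go away once the PyFunc error is resolved and validation metrics are actually produced.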

tensorflow

keras

google-colaboratory

tpu

google-cloud-tpu

1 Answer

Stack Overflow user

Answered 2020-03-19 07:43:55

TL;DR

You need to install a newer TensorFlow version that executes Python functions before sending them to the TPU. Load the newer build as follows:

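import os
import requests

# Ask the Colab TPU worker to switch to the nightly build named in the URL.
url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/2.2.0-dev20200311'
resp = requests.post(url)
print(resp)

# Install the same nightly build in the notebook runtime (Colab cell magic).
%pip install tf-nightly==2.2.0-dev20200311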

Source: https://github.com/tensorflow/tensorflow/issues/34346
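To make the URL in the snippet above easier to read, here is a small sketch of how it is assembled. It keeps only the host part of COLAB_TPU_ADDR and targets port 8475 with the version string in the path, exactly as the snippet does. The helper name version_request_url and the example address are made up for illustration, and nothing here contacts the TPU.

```python
import os


def version_request_url(tpu_addr, version, port=8475):
    """Build the URL that the snippet above posts to.

    tpu_addr is a 'host:port' string such as the value of COLAB_TPU_ADDR.
    Only the host part is kept; the request goes to a different port.
    """
    host = tpu_addr.split(':')[0]
    return 'http://{}:{}/requestversion/{}'.format(host, port, version)


# The fallback address is a made-up example; in Colab the real value
# comes from the COLAB_TPU_ADDR environment variable.
example_addr = os.environ.get('COLAB_TPU_ADDR', '10.0.0.2:8470')
print(version_request_url(example_addr, '2.2.0-dev20200311'))
```

Note that the snippet uses the same version string for the request and for the pip install, so that the notebook runtime and the TPU worker end up on the same build.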

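When you use Dataset.from_generator (or pass a generator to Keras, which calls it behind the scenes), the Dataset embeds the generator in a PyFunc op in its graph. Every time that op is invoked, it calls next on the generator and takes the resulting bytes. (It basically treats Python as a black box.)

This is fine when everything runs on the same machine. The problem is that TPUs work differently: a separate machine controls the TPU (imaginatively called the TPU host controller. ^^), and you run programs on the TPU by sending that machine a TensorFlow graph to execute. So the graph containing the PyFunc is sent to the TPU, and the TPU cannot execute it because there is no Python on the TPU host. (Even if there were, it would not be the same interpreter with the same state as your local machine.) So it fails and tells you that it cannot execute the PyFunc op, but unfortunately not in a very clear way.

0 votes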
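As a rough illustration of that explanation, here is a toy model in plain Python. It is not TensorFlow code, and every name in it (Host, py_funcs, the graph layout) is invented for the sketch. It assumes a graph is just serializable data naming ops, and that each host has its own table of kernels. The local host has a PyFunc kernel backed by a live generator; the remote host does not, so the same graph fails there.

```python
import json


class Host:
    """A toy machine with its own table of op kernels (hypothetical model)."""

    def __init__(self, name, kernels):
        self.name = name
        self.kernels = dict(kernels)

    def run(self, serialized_graph):
        # Graphs travel between machines as plain data, not as Python objects.
        graph = json.loads(serialized_graph)
        outputs = []
        for node in graph["nodes"]:
            kernel = self.kernels.get(node["op"])
            if kernel is None:
                raise LookupError(
                    "No registered '%s' kernel on host '%s'" % (node["op"], self.name))
            outputs.append(kernel(node["attr"]))
        return outputs


def batches():
    """Stand-in for a training generator: yields small lists forever."""
    n = 0
    while True:
        yield [n, n + 1]
        n += 2


# The generator and this lookup table exist only in the local Python process.
live_generator = batches()
py_funcs = {"pyfunc_0": lambda: next(live_generator)}

local = Host("local", {"PyFunc": lambda attr: py_funcs[attr["token"]]()})
remote = Host("tpu_host", {})  # no Python there, so no PyFunc kernel

# The graph only carries a token that refers to the generator, not the generator itself.
graph = json.dumps({"nodes": [{"op": "PyFunc", "attr": {"token": "pyfunc_0"}}]})

print(local.run(graph))  # [[0, 1]]
try:
    remote.run(graph)
except LookupError as err:
    print(err)
```

The point of the sketch is that the graph can be shipped anywhere, but the generator it refers to cannot, which matches the NotFoundError in the question.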

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60746939
