Q: Why are the pods on a GKE cluster OOMKilled when trying to run a very simple Kubeflow pipeline with TFX?

Stack Overflow user

Asked 2021-09-14 08:21:43

1 answer · 335 views · 0 followers · 0 votes

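I'm following the TFX tutorial on Cloud AI Platform Pipelines to implement a Kubeflow-orchestrated pipeline on Google Cloud. The main difference is that I'm trying to implement an object detection solution instead of the Taxi application proposed in the tutorial.

For this reason I (locally) created a dataset of images labelled with labelImg and converted it into a .tfrecord file using this script; the file is uploaded to a GS bucket. Then I followed the TFX tutorial, creating the GKE cluster (with the default configuration) and the Jupyter notebook needed to run the code, and imported the same template.

The main difference is in the first component of the pipeline, where I replaced the CSVExampleGen component with an ImportExampleGen component: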

def create_pipeline(
    pipeline_name: Text,
    pipeline_root: Text,
    data_path: Text,
    # TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
    # query: Text,
    preprocessing_fn: Text,
    run_fn: Text,
    train_args: tfx.proto.TrainArgs,
    eval_args: tfx.proto.EvalArgs,
    eval_accuracy_threshold: float,
    serving_model_dir: Text,
    metadata_connection_config: Optional[
        metadata_store_pb2.ConnectionConfig] = None,
    beam_pipeline_args: Optional[List[Text]] = None,
    ai_platform_training_args: Optional[Dict[Text, Text]] = None,
    ai_platform_serving_args: Optional[Dict[Text, Any]] = None,
) -> tfx.dsl.Pipeline:
  """Implements the chicago taxi pipeline with TFX."""

  components = []

  # Brings data into the pipeline or otherwise joins/converts training data.
  example_gen = tfx.components.ImportExampleGen(input_base=data_path)
  # TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
  # example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
  #     query=query)
  components.append(example_gen)

No other components are added to the pipeline, and the data path points to the folder in the bucket that contains the .tfrecord:

DATA_PATH = 'gs://(project bucket)/(dataset folder)'

This is the runner code (basically identical to the one in the TFX tutorial):

def run():
  """Define a kubeflow pipeline."""

  # Metadata config. The defaults work with the installation of
  # KF Pipelines using Kubeflow. If installing KF Pipelines using the
  # lightweight deployment option, you may need to override the defaults.
  # If you use Kubeflow, metadata will be written to MySQL database inside
  # Kubeflow cluster.
  metadata_config = tfx.orchestration.experimental.get_default_kubeflow_metadata_config(
  )

  runner_config = tfx.orchestration.experimental.KubeflowDagRunnerConfig(
      kubeflow_metadata_config=metadata_config,
      tfx_image=configs.PIPELINE_IMAGE)
  pod_labels = {
      'add-pod-env': 'true',
      tfx.orchestration.experimental.LABEL_KFP_SDK_ENV: 'tfx-template'
  }
  tfx.orchestration.experimental.KubeflowDagRunner(
      config=runner_config, pod_labels_to_attach=pod_labels
  ).run(
      pipeline.create_pipeline(
          pipeline_name=configs.PIPELINE_NAME,
          pipeline_root=PIPELINE_ROOT,
          data_path=DATA_PATH,
          # TODO(step 7): (Optional) Uncomment below to use BigQueryExampleGen.
          # query=configs.BIG_QUERY_QUERY,
          preprocessing_fn=configs.PREPROCESSING_FN,
          run_fn=configs.RUN_FN,
          train_args=tfx.proto.TrainArgs(num_steps=configs.TRAIN_NUM_STEPS),
          eval_args=tfx.proto.EvalArgs(num_steps=configs.EVAL_NUM_STEPS),
          eval_accuracy_threshold=configs.EVAL_ACCURACY_THRESHOLD,
          serving_model_dir=SERVING_MODEL_DIR,
          # TODO(step 7): (Optional) Uncomment below to use provide GCP related
          #               config for BigQuery with Beam DirectRunner.
          # beam_pipeline_args=configs
          # .BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS,
          # TODO(step 8): (Optional) Uncomment below to use Dataflow.
          # beam_pipeline_args=configs.DATAFLOW_BEAM_PIPELINE_ARGS,
          # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
          # ai_platform_training_args=configs.GCP_AI_PLATFORM_TRAINING_ARGS,
          # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
          # ai_platform_serving_args=configs.GCP_AI_PLATFORM_SERVING_ARGS,
      ))


if __name__ == '__main__':
  logging.set_verbosity(logging.INFO)
  run()

The pipeline is then created and a run is started with the following commands from the notebook:

!tfx pipeline create  --pipeline-path=kubeflow_runner.py --endpoint={ENDPOINT} --build-image

!tfx run create --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}

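The problem is that, while the example pipeline runs without issues, this pipeline always fails because the pod on the GKE cluster exits with code 137 (OOMKilled).

Here is a snapshot of the cluster workload status, and here is the full log dump of the crashed run.

I have already tried reducing the dataset size (the whole .tfrecord is now about 6 MB) and splitting it locally into two sets (validation and training), since the crash seems to happen when the component should split the dataset, but neither change made a difference.

Do you have any idea why it runs out of memory, and what steps I could take to fix this?

Thank you very much.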

google-cloud-platform

google-kubernetes-engine

kubeflow-pipelines

tfx

google-cloud-ai-platform-pipelines

1 Answer

Stack Overflow user

Answered 2021-09-17 05:26:58

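If an application has a memory leak or tries to use more memory than its configured limit, Kubernetes terminates it with an "OOMKilled: container limit reached" event and exit code 137.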
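As a side note on where 137 comes from: on Linux, a process killed by a signal is conventionally reported with exit code 128 plus the signal number, and the OOM killer sends SIGKILL (signal 9). The sketch below only illustrates that arithmetic; the helper name `describe_exit_code` is made up for this example and is not part of Kubernetes or TFX. It assumes a Unix-like system where `signal.SIGKILL` is defined.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number back to its symbolic name, if known.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


# 137 - 128 = 9, which is SIGKILL on Linux.
print(describe_exit_code(137))
```

An exit code of 137 therefore only says that the process received SIGKILL; the "OOMKilled" reason in the pod status is what ties it to memory.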

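When you see a message like this, you have two choices: increase the pod's limit or start debugging. If, for example, your website is experiencing an increase in load, adjusting the limit makes sense. If, on the other hand, the memory use is sudden or unexpected, it may indicate a memory leak, and you should start debugging immediately.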
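For the first option, here is a rough sketch of how a memory request and limit could be attached to a pipeline step. It assumes the step is a KFP v1 `ContainerOp`, whose `.container` exposes `set_memory_request()` and `set_memory_limit()`. The function name `set_memory_resources`, the `FakeContainer` stand-in, and the values "2G" and "4G" are all illustrative assumptions, not values taken from the question.

```python
from types import SimpleNamespace


def set_memory_resources(container_op, request="2G", limit="4G"):
    """Apply a memory request and limit to one pipeline step.

    Assumes container_op behaves like a KFP v1 ContainerOp, i.e. its
    .container has set_memory_request() and set_memory_limit().
    The default values are placeholders, not recommendations.
    """
    container_op.container.set_memory_request(request)
    container_op.container.set_memory_limit(limit)
    return container_op


class FakeContainer:
    """Hypothetical stand-in for a KFP container so the sketch runs offline."""

    def __init__(self):
        self.resources = {}

    def set_memory_request(self, value):
        self.resources["memory_request"] = value
        return self

    def set_memory_limit(self, value):
        self.resources["memory_limit"] = value
        return self


# Exercise the function against the stand-in instead of a real cluster.
fake_op = SimpleNamespace(container=FakeContainer())
set_memory_resources(fake_op)
print(fake_op.container.resources)
```

With TFX on Kubeflow, a function of this shape can probably be passed through the `pipeline_operator_funcs` argument of `KubeflowDagRunnerConfig`, alongside the default operator functions, but check that against the TFX version you are running. Also keep in mind that a limit larger than the memory available on the node cannot be satisfied, so the node pool's machine type may need to change as well.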

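Remember, Kubernetes killing a pod like this is a good thing: it keeps the other pods on the same node running.

Also refer to the similar issues link1 and link2. Hope it helps. Thanks.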

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/69174333
