
Dataflow failed with return code 1 with Airflow DataflowHook.start_python_dataflow

Stack Overflow user
Asked on 2018-06-06 22:10:26
Answers: 1 · Views: 740 · Followers: 0 · Votes: 0

Below is my code.

When I run the code below, I get the error shown beneath it. I am trying to convert gVCF/VCF files in Google Cloud Storage to BigQuery using the gcp_variant_transforms API.

2018-06-06 16:46:42,589 {models.py:1428} INFO - 2018-06-06 21:46:34.252526
2018-06-06 16:46:42,589 {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run GcsToBigQuery gcsToBigquery_ID 2018-06-06T21:46:34.252526 --job_id 168 --raw -sd DAGS_FOLDER/gcsToBigQuery.py']
2018-06-06 16:46:43,202 {base_task_runner.py:98} INFO - Subtask: 2018-06-06 16:46:43,202 {__init__.py:45} INFO - Using executor SequentialExecutor
2018-06-06 16:46:43,284 {base_task_runner.py:98} INFO - Subtask: 2018-06-06 16:46:43,283 {models.py:189} INFO - Filling up the DagBag from /apps/airflow/dags/GcsToBigQuery.py
2018-06-06 16:46:43,853 {base_task_runner.py:98} INFO - Subtask: 2018-06-06 16:46:43,852 {gcp_dataflow_hook.py:111} INFO - Start waiting for DataFlow process to complete.
2018-06-06 16:46:46,931 {base_task_runner.py:98} INFO - Subtask: 2018-06-06 16:46:46,930 {GcsToBigQuery.py:48} ERROR - Status : FAIL : gcsToBigquery: Not able to run: DataFlow failed with return code 1
2018-06-06 16:46:46,931 {base_task_runner.py:98} INFO - Subtask: 2018-06-06 16:46:46,930 {python_operator.py:90} INFO - Done. Returned value was: None

Please help me resolve this issue. Thank you!

from datetime import datetime, timedelta
from airflow import DAG
from airflow.contrib.hooks.gcp_dataflow_hook import DataFlowHook
from airflow.operators.python_operator import PythonOperator
import logging

default_args = {
    'owner': 'My Name',
    'depends_on_past': False,
    'start_date': datetime(2018, 6, 6),
    'email': ['MY Email'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('GcsToBigQuery', default_args=default_args,
          description='To move GVCF/VCF files from Google Cloud Storage to Big Query',
          schedule_interval='@once',
          start_date=datetime(2018, 6, 6))

dataflow_py_file = 'gcp_variant_transforms.vcf_to_bq'
PY_OPTIONS = ['-m']

DATAFLOW_OPTIONS_PY = {
    "project": "project-Name",
    "input_pattern": "gs://test-gvcf/1000-genomes.vcf",
    "output_table": "trc-mayo-projectsample:1000genomicsID.1000_genomesSamp",
    "staging_location": "gs://test-gvcf/vcftobq/staging",
    "temp_location": "gs://test-gvcf/vcftobq/temp",
    "job_name": "dataflowstarter25",
    # "setup_file": "./setup.py",
    "runner": "DataflowRunner"
}


def gcsToBigquery():
    try:
        dataflowHook = DataFlowHook(gcp_conn_id='google_cloud_platform_id')
        dataflowHook.start_python_dataflow(task_id='dataflowStarter2_ID',
                                           variables=DATAFLOW_OPTIONS_PY,
                                           dataflow=dataflow_py_file,
                                           py_options=PY_OPTIONS)
    except Exception as e:
        # Exception.message does not exist on Python 3; str(e) is portable.
        logging.error("Status : FAIL : gcsToBigquery: Not able to run: " + str(e))

gcsToBigquery_task = PythonOperator(task_id='gcsToBigquery_ID',
                                    python_callable=gcsToBigquery,
                                    dag=dag)
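Incidentally, catching the exception and only logging it means the callable returns None and the PythonOperator task is still marked successful, which is why the log above ends with "Done. Returned value was: None" despite the ERROR line. A minimal sketch (the `start_dataflow` callable is a stand-in for the hook call) of re-raising so the failure actually propagates to Airflow:

```python
import logging

def gcs_to_bigquery(start_dataflow):
    """Log the failure, then re-raise so the surrounding task fails."""
    try:
        start_dataflow()
    except Exception as e:
        logging.error("Status : FAIL : gcsToBigquery: Not able to run: %s", e)
        raise  # without this, the task is reported as successful

def boom():
    raise RuntimeError("DataFlow failed with return code 1")

try:
    gcs_to_bigquery(boom)
except RuntimeError as e:
    print("task failed as expected: %s" % e)
```

With the bare `raise`, the scheduler sees the exception, marks the task failed, and the configured retries and failure emails actually fire.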

1 Answer

Stack Overflow user

Answered on 2018-10-11 13:14:28

This problem can be avoided by using the DataflowPythonOperator and installing the gcp_variant_transforms API on the cloud instance.
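A sketch of what the operator-based DAG might look like, assuming Airflow 1.x with the contrib `DataFlowPythonOperator` and gcp-variant-transforms installed on the workers; the connection id, bucket, table, and project names are the question's own placeholders, and the `py_file` path is an assumption (the operator takes a file path rather than a dotted module name):

```python
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

dag = DAG('GcsToBigQuery', schedule_interval='@once',
          start_date=datetime(2018, 6, 6))

dataflow_task = DataFlowPythonOperator(
    task_id='gcsToBigquery_ID',
    py_file='/path/to/gcp_variant_transforms/vcf_to_bq.py',  # hypothetical path
    py_options=[],
    dataflow_default_options={
        'project': 'project-Name',
        'staging_location': 'gs://test-gvcf/vcftobq/staging',
        'temp_location': 'gs://test-gvcf/vcftobq/temp',
    },
    options={
        'input_pattern': 'gs://test-gvcf/1000-genomes.vcf',
        'output_table': 'trc-mayo-projectsample:1000genomicsID.1000_genomesSamp',
        'runner': 'DataflowRunner',
    },
    gcp_conn_id='google_cloud_platform_id',
    dag=dag)
```

The operator also handles waiting on the Dataflow job and surfacing a non-zero return code as a task failure, so no try/except wrapper is needed.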

The command to install the required components:

sudo pip install git+https://github.com/googlegenomics/gcp-variant-transforms.git
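Once installed, the same module the DAG points at can be exercised directly, which helps separate packaging problems from Airflow problems. A sketch of the invocation, reusing the question's placeholder project, bucket, and table names and assuming `gcloud` credentials are already configured:

```shell
# Launches the vcf_to_bq pipeline on Dataflow; all names below are the
# question's placeholders, not real resources.
python -m gcp_variant_transforms.vcf_to_bq \
  --project project-Name \
  --input_pattern gs://test-gvcf/1000-genomes.vcf \
  --output_table trc-mayo-projectsample:1000genomicsID.1000_genomesSamp \
  --temp_location gs://test-gvcf/vcftobq/temp \
  --job_name dataflowstarter25 \
  --runner DataflowRunner
```

If this command succeeds from a shell on the instance, the remaining work is purely on the Airflow side.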

If anyone else runs into this problem, you can check this thread, which details the steps that were followed to run the code successfully.

Votes: 1
Original content provided by Stack Overflow.
Source: https://stackoverflow.com/questions/50730358