首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >在Scala Spark应用程序中使用Impala JDBC

在Scala Spark应用程序中使用Impala JDBC
EN

Stack Overflow用户
提问于 2020-10-21 02:52:01
回答 1查看 307关注 0票数 2

我正在尝试使用Cloudera的Impala JDBC 2.6.17.1020连接器驱动程序和Spark,以便能够同时访问Kudu和Hive中的表。

当查询很简单时,它工作得很好,我得到了预期的输出。

代码语言:javascript
复制
val simpleQuery = """ SELECT *
                       FROM HIVE_TABLE
                       WHERE ABC = '123'
                   """
val simpleQueryDF : DataFrame = spark.read.jdbc(this.impalaJDBCString,"("+simpleQuery+") as temp",this.impalaConnectionProps)
simpleQueryDF.show()

但是,当查询包含嵌套查询和连接时,我会将标题重复为行,如下所示

代码语言:javascript
复制
+----------+--------+------------------------+--------------+ ...
|as_of_date|dq_index|impacted_data_lake_table|pk_column_name| ...
+----------+--------+------------------------+--------------+ ...
|as_of_date|dq_index|    impacted_data_lak...|pk_column_name| ...
|as_of_date|dq_index|    impacted_data_lak...|pk_column_name| ...
|as_of_date|dq_index|    impacted_data_lak...|pk_column_name| ...
+----------+--------+------------------------+--------------+ ...

我不知道为什么会发生这种情况,有人能建议我怎么做吗?

复杂查询如下所示

代码语言:javascript
复制
SELECT    '2020-10-22' AS AS_OF_DATE,
          DATA_QUERY.DQ_INDEX,
          DATA_QUERY.IMPACTED_DATA_LAKE_TABLE,
          DATA_QUERY.PK_COLUMN_NAME,
          DATA_QUERY.PK_OF_BAD_RECORD,
          DATA_QUERY.DQ_ASSESSMENT_DIMENSION,
          DATA_QUERY.SOURCE_SYSTEM,
          'ABC' AS BUSINESS_SUBJECT_AREA,
          DATA_QUERY.SUB_SUBJECT_AREA_CD,
          NVL(SBJ_AREA_DESC_TABLE.SUB_SUBJECT_AREA, 'Not Recognized') AS SUB_SUBJECT_AREA
FROM (
        SELECT  'DQ_INDEX_1' AS DQ_INDEX,
                'dq_data_governance.table_1 fin_year
                 dq_data_governance.table_2
                 dq_data_governance.table_3
                 dq_data_governance.table_4 ' AS IMPACTED_DATA_LAKE_TABLE,
                NAMES AS PK_COLUMN_NAME,
                CAST(VALS AS STRING) AS PK_OF_BAD_RECORD,
                'KLM' AS DQ_ASSESSMENT_DIMENSION,
                'NOQ' AS SOURCE_SYSTEM,
                'TTV' AS SUB_SUBJECT_AREA_CD
        FROM (
                SELECT 'table_4.SEGMENT4|table_4.SEGMENT8|table_1.FIN_YEAR' AS NAMES,
                        concat(GCC.SEGMENT4,'|',GCC.SEGMENT8,'|',CAST(FIN_YEAR as STRING)) AS VALS
                FROM dq_data_governance.table_1 FIN_YEAR,
                     dq_data_governance.table_2 gjh,
                     dq_data_governance.table_3 gjl,
                     dq_data_governance.table_4 GCC
                WHERE GJH.JE_HEADER_ID = GJL.JE_HEADER_ID
                  AND GJL.CODE_COMBINATION_ID = GCC.CODE_COMBINATION_ID
                  AND ACTUAL_FLAG = '5'
                  AND GCC.SEGMENT9 = '9'
                  AND EFFECTIVE_DATE BETWEEN FIN_YEAR.from_date AND FIN_YEAR.to_date 
                GROUP BY GCC.SEGMENT4,GCC.SEGMENT8,FIN_YEAR
                HAVING sum((nvl((GJL.ACCOUNTED_DR), 0) - nvl((GJL.ACCOUNTED_CR), 0))) <-0.1 
        ) AS T_DQ_INDEX_1
        
        UNION ALL 
        
        SELECT  'DQ_INDEX_2' AS DQ_INDEX,
                'dq_data_governance.table_5' AS IMPACTED_DATA_LAKE_TABLE,
                NAMES AS PK_COLUMN_NAME,
                CAST(VALS AS STRING) AS PK_OF_BAD_RECORD,
                'BKL' AS DQ_ASSESSMENT_DIMENSION,
                'CCG' AS SOURCE_SYSTEM,
                'LMA' AS SUB_SUBJECT_AREA_CD
        FROM (
                SELECT 'table_5.contractid|table_5.name' AS NAMES,
                        concat(CAST(c.contractid as STRING),'|',c.name) AS VALS
                FROM dq_data_governance.table_5 c
                INNER JOIN dq_data_governance.table_6 ps ON ps.processinstanceid=c.processinstanceid
                WHERE c.amountinriyal<1 and ps.approvalstatusid=3
        ) AS T_DQ_INDEX_2
        
        UNION ALL
        
        # ...
        # <MORE QUERIES CAN BE ADDED HERE>
        # ...
        
) AS DATA_QUERY
LEFT JOIN dq_data_governance.dq_sub_subject_area as SBJ_AREA_DESC_TABLE ON DATA_QUERY.SUB_SUBJECT_AREA_CD = SBJ_AREA_DESC_TABLE.SUB_SUBJECT_AREA_CD

相关:#1#2

EN

回答 1

Stack Overflow用户

发布于 2020-10-22 03:11:05

试试来自here的亚马逊版本的Impala驱动程序。

使用此驱动程序时,请重新检查表和数据帧之间的双重数据类型是否相等。

我希望它能为你工作。

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/64451608

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档