文章/答案/技术大牛

发布

社区首页 >问答首页 >使用Hive分区表优化join性能

问使用Hive分区表优化join性能
EN

Stack Overflow用户

提问于 2019-07-09 00:43:57

回答 1查看 672关注 0票数 1

我有一个蜂窝兽人test_dev_db.TransactionUpdateTable表与一些样本数据，其中将持有增量数据，需要更新到主表(test_dev_db.TransactionMainHistoryTable)，这是分区的列国家，Tran_date。

Hive增量加载表模式:它包含19行需要合并的行。

CREATE TABLE IF NOT EXISTS test_dev_db.TransactionUpdateTable 
(
Transaction_date timestamp,
Product       string,
Price         int,
Payment_Type  string,
Name          string, 
City          string,
State         string,
Country       string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;

配置单元主表架构:总行数为77。

CREATE TABLE IF NOT EXISTS test_dev_db.TransactionMainHistoryTable
(
Transaction_date timestamp,
Product       string,
Price         int,
Payment_Type  string,
Name          string,
City          string,
State         string
)
PARTITIONED BY (Country string,Tran_date string) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;

我正在运行下面的查询，以将增量数据与主表合并。

SELECT
  case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)  
  else t.transaction_date   end as transaction_date,
  t.product,
  case when i.price is not null then i.price else t.price end as price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  case when i.transaction_date is not null then substring(current_timestamp(),0,10) 
  else t.tran_date end as tran_date
  from
test_dev_db.TransactionMainHistoryTable t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-03-01
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-05-01

并运行下面的查询，过滤出需要合并的特定分区，以避免重写没有更新的分区。

SELECT
  case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)  
  else t.transaction_date   end as transaction_date,
  t.product,
  case when i.price is not null then i.price else t.price end as price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  case when i.transaction_date is not null then substring(current_timestamp(),0,10) else t.tran_date end as tran_date
  from
(SELECT 
  *
  FROM 
test_dev_db.TransactionMainHistoryTable
where Tran_date in
(select distinct  from_unixtime(to_unix_timestamp (Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd') from test_dev_db.TransactionUpdateTable
))t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;

在这两种情况下，只需要更新Transaction_date、Price和partition列tran_date。这两个查询都运行得很好，尽管横向查询需要更长的执行时间。

分区表的执行计划为：

 Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: transactionmainhistorytable
            filterExpr: tran_date is not null (type: boolean)
            Statistics: Num rows: 77 Data size: 39151 Basic stats: COMPLETE Column stats: COMPLETE
            Map Join Operator
              condition map:
                   Left Semi Join 0 to 1
              keys:
                0 tran_date (type: string)
                1 _col0 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8

我是不是对第二个查询做错了什么？为了更好地修剪，我是否需要同时使用这两个分区列。任何帮助或建议都是非常感谢的。

performance

hive

query-optimization

partitioning

回答 1

Stack Overflow用户

回答已采纳

发布于 2019-07-09 02:39:03

也许这不是一个完整的答案，但我希望这些想法将是有用的。

where tran_date IN (select ... )

实际上与

LEFT SEMI JOIN (SELECT ...)

这一点反映在计划中：

Map Join Operator
              condition map:
                   Left Semi Join 0 to 1
              keys:
                0 tran_date (type: string)
                1 _col0 (type: string)

并将其作为map-join执行。首先选择子查询数据集，然后将其放置在分布式缓存中，加载到内存中以供map-join使用。所有这些步骤: select、load到内存、map-join比读取和覆盖所有的表要慢，因为它太小和过度分区:统计数据显示行数: 77数据大小: 39151 -太小而无法由两列分区，甚至太小而无法分区。尝试使用更大的表，并使用EXPLAIN EXTENDED检查正在扫描的内容。

另外，替换以下内容：

from_unixtime(to_unix_timestamp (Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd')

使用substr(Transaction_date,0,10)或date(Transaction_date)

使用current_date和substring(current_timestamp,0,10)只是为了简化代码。

如果希望在计划中显示分区筛选器，请尝试替换作为分区列表传递的分区筛选器，您可以在单独的会话中选择该列表，并使用shell将分区列表传递到where子句中，请参见此答案：https://stackoverflow.com/a/56963448/2700344

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/56939313

复制

相似问题

问使用Hive分区表优化join性能
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问使用Hive分区表优化join性能EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问使用Hive分区表优化join性能
EN