首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >当使用CTE时,最好使用join或in()进行过滤?

当使用CTE时,最好使用join或in()进行过滤?
EN

Stack Overflow用户
提问于 2020-01-19 23:11:48
回答 1查看 223关注 0票数 1

我正在处理一个需要很长时间才能运行的查询。由于这个查询无论如何都会通过cron作业自动执行,所以这并不重要,而且与仅嵌套查询相比,我喜欢CTE的可读性,在过去,我发现由于使用谓词,CTE的工作速度更快。

因此,我更喜欢在可能的情况下使用CTE而不使用子查询的可读性。

以下是运行相同查询及其explain语句的三种方法。我想更好地理解explain语句,以及了解哪种方法更好。

代码语言:javascript
复制
with 

/*
cte for filtering all events that happened on this date only
*/
filter_sessions as (
select dimension1
from ga_flagship_ecom.sessions
where date = '2020-01-16'
),


ee as (
select 
    dimension1,
    dimension3,
    case when sum(case when metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product -- roll up to event level
from ga_flagship_ecom.ecom
where dimension1 in (select dimension1 from filter_sessions)
group by 1,2
)

select * from ee;

在这里,我使用in()来过滤ee。解释:

代码语言:javascript
复制
CTE Scan on ee  (cost=61757.45..61758.23 rows=39 width=44)
  CTE filter_sessions
    ->  Index Only Scan using sessions_date_idx on sessions  (cost=0.56..2.76 rows=1 width=22)
          Index Cond: (date = '2020-01-16'::date)
  CTE ee
    ->  GroupAggregate  (cost=61753.72..61754.69 rows=39 width=34)
          Group Key: ecom.dimension1, ecom.dimension3
          ->  Sort  (cost=61753.72..61753.81 rows=39 width=34)
                Sort Key: ecom.dimension1, ecom.dimension3
                ->  Nested Loop  (cost=0.58..61752.69 rows=39 width=34)
                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=32)
                            Group Key: (filter_sessions.dimension1)::text
                            ->  CTE Scan on filter_sessions  (cost=0.00..0.02 rows=1 width=32)
                      ->  Index Scan using ecom_pk on ecom  (cost=0.56..61752.26 rows=39 width=34)

另一种方法是使用内部联接进行筛选:

代码语言:javascript
复制
with 

/*
cte for filtering all events that happened on this date only
*/
filter_sessions as (
select dimension1
from ga_flagship_ecom.sessions
where date = '2020-01-16'
),


ee as (
select 
    e.dimension1,
    e.dimension3,
    case when sum(case when e.metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product -- roll up to event level
from ga_flagship_ecom.ecom e
join filter_sessions f on f.dimension1 = e.dimension1
group by 1,2
)

select * from ee;

此版本的解释如下:

代码语言:javascript
复制
CTE Scan on ee  (cost=61757.43..61758.21 rows=39 width=44)
  CTE filter_sessions
    ->  Index Only Scan using sessions_date_idx on sessions  (cost=0.56..2.76 rows=1 width=22)
          Index Cond: (date = '2020-01-16'::date)
  CTE ee
    ->  GroupAggregate  (cost=61753.70..61754.68 rows=39 width=34)
          Group Key: e.dimension1, e.dimension3
          ->  Sort  (cost=61753.70..61753.80 rows=39 width=34)
                Sort Key: e.dimension1, e.dimension3
                ->  Nested Loop  (cost=0.56..61752.67 rows=39 width=34)
                      ->  CTE Scan on filter_sessions f  (cost=0.00..0.02 rows=1 width=32)
                      ->  Index Scan using ecom_pk on ecom e  (cost=0.56..61752.26 rows=39 width=34)
                            Index Cond: ((dimension1)::text = (f.dimension1)::text)

然后还有第三个选项,可以同时使用where和join:

代码语言:javascript
复制
with 

/*
cte for filtering all events that happened on this date only
*/
filter_sessions as (
select dimension1
from ga_flagship_ecom.sessions
where date = '2020-01-16'
),


ee as (
select 
    e.dimension1,
    e.dimension3,
    case when sum(case when e.metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product -- roll up to event level
from ga_flagship_ecom.ecom e
join filter_sessions f on f.dimension1 = e.dimension1
where e.dimension1 in (select dimension1 from filter_sessions)
group by 1,2
)

select * from ee;

这个版本的解释是:

代码语言:javascript
复制
CTE Scan on ee  (cost=61758.32..61759.10 rows=39 width=44)
  CTE filter_sessions
    ->  Index Only Scan using sessions_date_idx on sessions  (cost=0.56..2.76 rows=1 width=22)
          Index Cond: (date = '2020-01-16'::date)
  CTE ee
    ->  GroupAggregate  (cost=61754.59..61755.57 rows=39 width=34)
          Group Key: e.dimension1, e.dimension3
          ->  Sort  (cost=61754.59..61754.69 rows=39 width=34)
                Sort Key: e.dimension1, e.dimension3
                ->  Nested Loop Semi Join  (cost=0.56..61753.56 rows=39 width=34)
                      Join Filter: ((f.dimension1)::text = (filter_sessions.dimension1)::text)
                      ->  Nested Loop  (cost=0.56..61752.67 rows=39 width=66)
                            ->  CTE Scan on filter_sessions f  (cost=0.00..0.02 rows=1 width=32)
                            ->  Index Scan using ecom_pk on ecom e  (cost=0.56..61752.26 rows=39 width=34)

我不太熟悉阅读explain语句,但3个语句顶部的“成本”非常相似: 61757.45、61757.43和61758.32。

哪种方法是最好的?无论是此时此地,还是更一般的地方?使用内部联接进行筛选或使用in()进行筛选

代码语言:javascript
复制
select version();
PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-01-19 23:20:37

EXISTS子句的执行速度始终快于IN子句。你一定要试一下-

代码语言:javascript
复制
select dimension1,
       dimension3,
       case when sum(case when metric1 = 0
                               then 1
                          else 0
                     end) > 0
                 then 1
            else 0
       end as zero_val_product
from ga_flagship_ecom.ecom E
where EXISTS (select 1
              from ga_flagship_ecom.sessions S
              WHERE S.dimension1 = E.dimension1
              AND date = '2020-01-16')
group by 1,2

除此之外,为了更快地执行,你可以尝试在会话上建立索引(date,dimension1)

票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/59811396

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档