我正在处理一个需要很长时间才能运行的查询。由于这个查询无论如何都会通过cron作业自动执行,所以这并不重要,而且与仅嵌套查询相比,我喜欢CTE的可读性,在过去,我发现由于使用谓词,CTE的工作速度更快。
因此,我更喜欢在可能的情况下使用CTE而不使用子查询的可读性。
以下是运行相同查询及其explain语句的三种方法。我想更好地理解explain语句,以及了解哪种方法更好。
with
/*
cte for filtering all events that happened on this date only
*/
filter_sessions as (
select dimension1
from ga_flagship_ecom.sessions
where date = '2020-01-16'
),
ee as (
select
dimension1,
dimension3,
case when sum(case when metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product -- roll up to event level
from ga_flagship_ecom.ecom
where dimension1 in (select dimension1 from filter_sessions)
group by 1,2
)
select * from ee;在这里,我使用in()来过滤ee。解释:
CTE Scan on ee (cost=61757.45..61758.23 rows=39 width=44)
CTE filter_sessions
-> Index Only Scan using sessions_date_idx on sessions (cost=0.56..2.76 rows=1 width=22)
Index Cond: (date = '2020-01-16'::date)
CTE ee
-> GroupAggregate (cost=61753.72..61754.69 rows=39 width=34)
Group Key: ecom.dimension1, ecom.dimension3
-> Sort (cost=61753.72..61753.81 rows=39 width=34)
Sort Key: ecom.dimension1, ecom.dimension3
-> Nested Loop (cost=0.58..61752.69 rows=39 width=34)
-> HashAggregate (cost=0.02..0.03 rows=1 width=32)
Group Key: (filter_sessions.dimension1)::text
-> CTE Scan on filter_sessions (cost=0.00..0.02 rows=1 width=32)
-> Index Scan using ecom_pk on ecom (cost=0.56..61752.26 rows=39 width=34)另一种方法是使用内部联接进行筛选:
with
/*
cte for filtering all events that happened on this date only
*/
filter_sessions as (
select dimension1
from ga_flagship_ecom.sessions
where date = '2020-01-16'
),
ee as (
select
e.dimension1,
e.dimension3,
case when sum(case when e.metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product -- roll up to event level
from ga_flagship_ecom.ecom e
join filter_sessions f on f.dimension1 = e.dimension1
group by 1,2
)
select * from ee;此版本的解释如下:
CTE Scan on ee (cost=61757.43..61758.21 rows=39 width=44)
CTE filter_sessions
-> Index Only Scan using sessions_date_idx on sessions (cost=0.56..2.76 rows=1 width=22)
Index Cond: (date = '2020-01-16'::date)
CTE ee
-> GroupAggregate (cost=61753.70..61754.68 rows=39 width=34)
Group Key: e.dimension1, e.dimension3
-> Sort (cost=61753.70..61753.80 rows=39 width=34)
Sort Key: e.dimension1, e.dimension3
-> Nested Loop (cost=0.56..61752.67 rows=39 width=34)
-> CTE Scan on filter_sessions f (cost=0.00..0.02 rows=1 width=32)
-> Index Scan using ecom_pk on ecom e (cost=0.56..61752.26 rows=39 width=34)
Index Cond: ((dimension1)::text = (f.dimension1)::text)然后还有第三个选项,可以同时使用where和join:
with
/*
cte for filtering all events that happened on this date only
*/
filter_sessions as (
select dimension1
from ga_flagship_ecom.sessions
where date = '2020-01-16'
),
ee as (
select
e.dimension1,
e.dimension3,
case when sum(case when e.metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product -- roll up to event level
from ga_flagship_ecom.ecom e
join filter_sessions f on f.dimension1 = e.dimension1
where e.dimension1 in (select dimension1 from filter_sessions)
group by 1,2
)
select * from ee;这个版本的解释是:
CTE Scan on ee (cost=61758.32..61759.10 rows=39 width=44)
CTE filter_sessions
-> Index Only Scan using sessions_date_idx on sessions (cost=0.56..2.76 rows=1 width=22)
Index Cond: (date = '2020-01-16'::date)
CTE ee
-> GroupAggregate (cost=61754.59..61755.57 rows=39 width=34)
Group Key: e.dimension1, e.dimension3
-> Sort (cost=61754.59..61754.69 rows=39 width=34)
Sort Key: e.dimension1, e.dimension3
-> Nested Loop Semi Join (cost=0.56..61753.56 rows=39 width=34)
Join Filter: ((f.dimension1)::text = (filter_sessions.dimension1)::text)
-> Nested Loop (cost=0.56..61752.67 rows=39 width=66)
-> CTE Scan on filter_sessions f (cost=0.00..0.02 rows=1 width=32)
-> Index Scan using ecom_pk on ecom e (cost=0.56..61752.26 rows=39 width=34)我不太熟悉阅读explain语句,但3个语句顶部的“成本”非常相似: 61757.45、61757.43和61758.32。
哪种方法是最好的?无论是此时此地,还是更一般的地方?使用内部联接进行筛选或使用in()进行筛选
select version();
PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit发布于 2020-01-19 23:20:37
EXISTS子句的执行速度始终快于IN子句。你一定要试一下-
select dimension1,
dimension3,
case when sum(case when metric1 = 0
then 1
else 0
end) > 0
then 1
else 0
end as zero_val_product
from ga_flagship_ecom.ecom E
where EXISTS (select 1
from ga_flagship_ecom.sessions S
WHERE S.dimension1 = E.dimension1
AND date = '2020-01-16')
group by 1,2除此之外,为了更快地执行,你可以尝试在会话上建立索引(date,dimension1)
https://stackoverflow.com/questions/59811396
复制相似问题