我有两张桌子:A(id, x, cid)和B(cid)。我需要从A取一些记录,不包括B中相同的记录。
我在A (cid, id)上有btree索引。
查询-1(非-IN):select id from a where cid not in (select distinct cid from b);
Seq Scan on a (cost=20418.20..340328.83 rows=5128145 width=8) (actual time=367.337..3820.175 rows=169046 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 10087249
SubPlan 1
-> HashAggregate (cost=20283.84..20391.33 rows=10749 width=4) (actual time=351.301..357.395 rows=41493 loops=1)
Group Key: b.cid
-> Seq Scan on b (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.530..148.063 rows=1132787 loops=1)
Planning time: 0.254 ms
Execution time: 3827.324 ms它不会生成index-only-scan,我认为这应该是合理的,因为索引btree on A(cid, id)是覆盖索引。
但是,如果我使用IN操作符,它可以生成index-only-scan,如下所示:
查询-2 (IN):select id from a where cid in (select distinct cid from b);
Nested Loop (cost=20284.27..420607.83 rows=10256290 width=8) (actual time=290.225..2349.182 rows=10087249 loops=1)
-> HashAggregate (cost=20283.84..20391.33 rows=10749 width=4) (actual time=290.162..304.054 rows=41493 loops=1)
Group Key: b.cid
-> Seq Scan on b (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.042..95.151 rows=1132787 loops=1)
-> Index Only Scan using idx_a_cid_id on a (cost=0.43..27.61 rows=961 width=12) (actual time=0.005..0.028 rows=243 loops=41493)
Index Cond: (cid = b.cid)
Heap Fetches: 0
Planning time: 0.197 ms
Execution time: 2672.898 ms如果您可能会考虑,因为A和B的基数太不同了,那就是事实。A包含10256295行,B包含41493 distinct cids。
但是,如果我使用相同的逻辑手动将Query-1重写到以下Query-3,但仅使用IN,postgres仍然可以生成一个index-only-scan,如下所示:
查询-3(IN-sub(NOT)):select id from a where cid in (select cid from a where cid not in (select distinct cid from b));
Nested Loop (cost=325220.51..722952.01 rows=10256290 width=8) (actual time=3741.854..5607.756 rows=169046 loops=1)
-> HashAggregate (cost=325220.07..325326.82 rows=10675 width=4) (actual time=3741.133..3763.512 rows=51758 loops=1)
Group Key: a_1.cid
-> Index Only Scan using idx_a_cid on a a_1 (cost=20418.64..312399.71 rows=5128145 width=4) (actual time=377.384..3691.304 rows=169046 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 10087249
Heap Fetches: 0
SubPlan 1
-> HashAggregate (cost=20283.84..20391.33 rows=10749 width=4) (actual time=359.015..365.864 rows=41493 loops=1)
Group Key: b.cid
-> Seq Scan on b (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.455..144.879 rows=1132787 loops=1)
-> Index Only Scan using idx_a_cid_id on a (cost=0.43..27.64 rows=961 width=12) (actual time=0.033..0.035 rows=3 loops=51758)
Index Cond: (cid = a_1.cid)
Heap Fetches: 0
Planning time: 2.758 ms
Execution time: 5617.930 ms所以我现在很困惑,是不是因为NOT-IN操作符本身太难/太昂贵而不能使用index-only-scan?还是因为PostgreSQL的查询优化器不够聪明而无法生成查询优化器?
顺便说一下,我的实验是在PostgreSQL9.6上进行的。
谢谢!
后续行动(3月28日至2019年):
使用jjane的方法,它有效!
SET cpu_index_tuple_cost TO 0;
SET random_page_cost TO 1;
explain analyze select id from a where cid not in (select distinct cid from b);
Index Only Scan using idx_a_cid_id on a (cost=20418.64..188115.26 rows=5128145 width=8) (actual time=308.552..1865.567 rows=169046 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 10087249
Heap Fetches: 0
SubPlan 1
-> HashAggregate (cost=20283.84..20391.33 rows=10749 width=4) (actual time=289.169..297.363 rows=41493 loops=1)
Group Key: b.cid
-> Seq Scan on b (cost=0.00..17451.87 rows=1132787 width=4) (actual time=0.020..94.442 rows=1132787 loops=1)
Planning time: 0.112 ms
Execution time: 1871.197 ms谢谢大家!
发布于 2019-03-28 12:51:52
它选择了它认为会更快的计划。
如果您使用SET cpu_index_tuple_cost TO 0和SET random_page_cost to 1,您可能会发现它只使用索引扫描。
同时也发现它并不是可靠的更好,或者至少不是很多。
https://dba.stackexchange.com/questions/233329
复制相似问题