我有一张有1张billion+记录的大桌子。以下是重要的栏目:
CREATE TABLE BigTable
(
id BIGSERIAL PRIMARY KEY NOT NULL,
hexid VARCHAR(255), -- usually 24 or 48 characters
type VARCHAR(255),
<and other columns>
);我需要在hexid列和type列的子字符串上找到一些匹配的记录。hexid列上有一个索引。我需要搜索与hexid的最后24个字符匹配的字符串列表(我感兴趣的记录总是有48个字符hexids)。有两个查询我可以运行,我想知道哪一个会执行得更好:
使用WHERE IN查询1:
SELECT * FROM BigTable
WHERE substring(hexid, 25) IN (
'deadbeef4b00018c09bf5db1',
'5f478c50deadbeef97039344',
'5fc0b855a6a8b600deadbeef')
AND type IN ('OneType', 'SomeType', 'AnotherType')
ORDER BY id ASC;附有解释计划:
Sort (cost=105476938.69..105477029.92 rows=36492 width=638)
Sort Key: id
-> Seq Scan on BigTable (cost=0.00..105468257.46 rows=36492 width=638)
Filter: (((type)::text = ANY ('{OneType,SomeType,AnotherType}'::text[])) AND (""substring""((hexid)::text, 25) = ANY ('{deadbeef4b00018c09bf5db1,5f478c50deadbeef97039344,5fc0b855a6a8b600deadbeef}'::text[])))"使用正则表达式的查询2:
SELECT * FROM BigTable
WHERE hexid ~ '\w{24}(deadbeef4b00018c09bf5db1|5f478c50deadbeef97039344|5fc0b855a6a8b600deadbeef)'
AND type ~ '(One|Some|Another)Type'
ORDER BY id ASC;附有解释计划:
Sort (cost=100022576.55..100022577.16 rows=243 width=638)
Sort Key: id
-> Seq Scan on BigTable (cost=0.00..100022566.92 rows=243 width=638)
Filter: (((hexid)::text ~ '\w{24}(deadbeef4b00018c09bf5db1|5f478c50deadbeef97039344|5fc0b855a6a8b600deadbeef)'::text) AND ((type)::text ~ '(One|Some|Another)Type'::text))WHERE IN子句会对性能产生负面影响。WHERE IN子句不会对性能产生负面影响。获取数据是一个一次性的过程。我没有时间单独运行这些查询来确定哪个查询的性能更好。我需要在知情的情况下决定该选哪一个。
rows=36492,第二个rows=243显示?(我有兴趣知道PostgreSQL的答案,但我想其他RDBMS也会有类似的答案。)
发布于 2021-02-16 10:46:53
使用IN的版本会稍微快一些,因为= ANY比~便宜。
估计值则相反,但这只是因为它估计了不同的结果行数。
您可以在更有选择的条件下使用索引使这些查询更快。
发布于 2021-02-16 17:46:07
你可以得到一个相当好的估计:
CREATE TABLE foo AS SELECT * FROM bigtable LIMIT 1000000;并在这个子集上运行您的查询,它应该比在大型表上快得多。因为postgres是慢的,IN()应该会以很大的优势获胜。我将尝试使用一组生成的数据:
create unlogged table foo (s text not null);
insert into foo select generate_series(1,1000000);
explain analyze select * from foo where substring( s,4 ) in ('123','456','789','052','984','412');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..16758.33 rows=30000 width=6) (actual time=10.366..69.614 rows=5400 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on foo (cost=0.00..12758.33 rows=12500 width=6) (actual time=6.496..63.969 rows=1800 loops=3)
Filter: ("substring"(s, 4) = ANY ('{123,456,789,052,984,412}'::text[]))
Rows Removed by Filter: 331533
Planning Time: 0.431 ms
Execution Time: 69.758 ms
(8 lignes)
Temps : 70,641 ms
test=> explain analyze select * from foo where s ~ '\w{4}(123|456|789|052|984|412)';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..10643.33 rows=100 width=6) (actual time=338.414..339.924 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on foo (cost=0.00..9633.33 rows=42 width=6) (actual time=335.171..335.171 rows=0 loops=3)
Filter: (s ~ '\w{4}(123|456|789|052|984|412)'::text)
Rows Removed by Filter: 333333
Planning Time: 1.239 ms
Execution Time: 339.950 ms
(8 lignes)哎呀,哈哈。虽然,如果IN()中有许多项,强制它使用散列可能会更快:
explain analyze select * from foo join (values ('123'),('456'),('789'),('052'),('984'),('412')) v on substring(s,4)=v.column1;
Gather (cost=1000.15..14800.15 rows=30000 width=38) (actual time=10.628..61.902 rows=5400 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Join (cost=0.15..10800.15 rows=12500 width=38) (actual time=6.636..56.072 rows=1800 loops=3)
Hash Cond: ("substring"(foo.s, 4) = "*VALUES*".column1)
-> Parallel Seq Scan on foo (cost=0.00..8591.67 rows=416667 width=6) (actual time=0.010..15.803 rows=333333 loops=3)
-> Hash (cost=0.08..0.08 rows=6 width=32) (actual time=0.011..0.011 rows=6 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Values Scan on "*VALUES*" (cost=0.00..0.08 rows=6 width=32) (actual time=0.003..0.006 rows=6 loops=3)
Planning Time: 0.179 ms
Execution Time: 62.063 ms首先过滤文本字符串的长度()以消除不匹配的行,也可以加快速度。
就为了这该死的,clickhouse:
select * from foo where substr(s,10) in ('78593310911','21480020195','22970730260');
6 rows in set. Elapsed: 1.460 sec.
Processed 200.00 million rows, 5.68 GB (136.97 million rows/s., 3.89 GB/s.)发布于 2021-02-16 23:35:57
若要优化检查结束子字符串,请在计算列上创建索引:
REVERSE(hexid)然后你可以过滤:
REVERSE(hexid) LIKE reversedSearchString + '%'这意味着可以对索引使用范围检查,否则就不可能这样做。
https://dba.stackexchange.com/questions/285479
复制相似问题