文章/答案/技术大牛

发布

社区首页 >问答首页 >字符串选择的性能: vs正则表达式中的子字符串

问字符串选择的性能: vs正则表达式中的子字符串
EN

Database Administration用户

提问于 2021-02-16 10:31:48

回答 3查看 1.7K关注 0票数 0

我有一张有1张billion+记录的大桌子。以下是重要的栏目：

CREATE TABLE BigTable
(
   id BIGSERIAL PRIMARY KEY NOT NULL,
   hexid VARCHAR(255), -- usually 24 or 48 characters
   type VARCHAR(255),
   <and other columns>
);

我需要在hexid列和type列的子字符串上找到一些匹配的记录。hexid列上有一个索引。我需要搜索与hexid的最后24个字符匹配的字符串列表(我感兴趣的记录总是有48个字符hexids)。有两个查询我可以运行，我想知道哪一个会执行得更好：

使用WHERE IN查询1：

SELECT * FROM BigTable
WHERE substring(hexid, 25) IN (
    'deadbeef4b00018c09bf5db1',
    '5f478c50deadbeef97039344',
    '5fc0b855a6a8b600deadbeef')
AND type IN ('OneType', 'SomeType', 'AnotherType')
ORDER BY id ASC;

附有解释计划：

Sort  (cost=105476938.69..105477029.92 rows=36492 width=638)
  Sort Key: id
  ->  Seq Scan on BigTable  (cost=0.00..105468257.46 rows=36492 width=638)
        Filter: (((type)::text = ANY ('{OneType,SomeType,AnotherType}'::text[])) AND (""substring""((hexid)::text, 25) = ANY ('{deadbeef4b00018c09bf5db1,5f478c50deadbeef97039344,5fc0b855a6a8b600deadbeef}'::text[])))"

使用正则表达式的查询2：

SELECT * FROM BigTable
WHERE hexid ~ '\w{24}(deadbeef4b00018c09bf5db1|5f478c50deadbeef97039344|5fc0b855a6a8b600deadbeef)'
AND type ~ '(One|Some|Another)Type'
ORDER BY id ASC;

附有解释计划：

Sort  (cost=100022576.55..100022577.16 rows=243 width=638)
  Sort Key: id
  ->  Seq Scan on BigTable  (cost=0.00..100022566.92 rows=243 width=638)
        Filter: (((hexid)::text ~ '\w{24}(deadbeef4b00018c09bf5db1|5f478c50deadbeef97039344|5fc0b855a6a8b600deadbeef)'::text) AND ((type)::text ~ '(One|Some|Another)Type'::text))

查询1执行一个子字符串，没有模式匹配的快地，但是有WHERE IN子句会对性能产生负面影响。
查询2具有模式匹配(负责处理子字符串) 慢的，但没有WHERE IN子句不会对性能产生负面影响。

获取数据是一个一次性的过程。我没有时间单独运行这些查询来确定哪个查询的性能更好。我需要在知情的情况下决定该选哪一个。

哪个查询将提供最好的性能？
为什么第一个解释计划显示rows=36492，第二个rows=243显示？
也许还有另一个查询会表现得更好吗？请记住，由于这是一个一次性查询，添加额外的索引是没有帮助的。

(我有兴趣知道PostgreSQL的答案，但我想其他RDBMS也会有类似的答案。)

postgresql

query-performance

回答 3

Database Administration用户

回答已采纳

发布于 2021-02-16 10:46:53

使用IN的版本会稍微快一些，因为= ANY比~便宜。

估计值则相反，但这只是因为它估计了不同的结果行数。

您可以在更有选择的条件下使用索引使这些查询更快。

票数 1

Database Administration用户

发布于 2021-02-16 17:46:07

你可以得到一个相当好的估计：

CREATE TABLE foo AS SELECT * FROM bigtable LIMIT 1000000;

并在这个子集上运行您的查询，它应该比在大型表上快得多。因为postgres是慢的，IN()应该会以很大的优势获胜。我将尝试使用一组生成的数据：

create unlogged table foo (s text not null);
insert into foo select generate_series(1,1000000);
explain analyze select * from foo where substring( s,4 ) in ('123','456','789','052','984','412');
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16758.33 rows=30000 width=6) (actual time=10.366..69.614 rows=5400 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on foo  (cost=0.00..12758.33 rows=12500 width=6) (actual time=6.496..63.969 rows=1800 loops=3)
         Filter: ("substring"(s, 4) = ANY ('{123,456,789,052,984,412}'::text[]))
         Rows Removed by Filter: 331533
 Planning Time: 0.431 ms
 Execution Time: 69.758 ms
(8 lignes)

Temps : 70,641 ms
test=> explain analyze select * from foo where s ~ '\w{4}(123|456|789|052|984|412)';
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..10643.33 rows=100 width=6) (actual time=338.414..339.924 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on foo  (cost=0.00..9633.33 rows=42 width=6) (actual time=335.171..335.171 rows=0 loops=3)
         Filter: (s ~ '\w{4}(123|456|789|052|984|412)'::text)
         Rows Removed by Filter: 333333
 Planning Time: 1.239 ms
 Execution Time: 339.950 ms
(8 lignes)

哎呀，哈哈。虽然，如果IN()中有许多项，强制它使用散列可能会更快：

explain analyze select * from foo join (values ('123'),('456'),('789'),('052'),('984'),('412')) v on substring(s,4)=v.column1;
 Gather  (cost=1000.15..14800.15 rows=30000 width=38) (actual time=10.628..61.902 rows=5400 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Hash Join  (cost=0.15..10800.15 rows=12500 width=38) (actual time=6.636..56.072 rows=1800 loops=3)
         Hash Cond: ("substring"(foo.s, 4) = "*VALUES*".column1)
         ->  Parallel Seq Scan on foo  (cost=0.00..8591.67 rows=416667 width=6) (actual time=0.010..15.803 rows=333333 loops=3)
         ->  Hash  (cost=0.08..0.08 rows=6 width=32) (actual time=0.011..0.011 rows=6 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.08 rows=6 width=32) (actual time=0.003..0.006 rows=6 loops=3)
 Planning Time: 0.179 ms
 Execution Time: 62.063 ms

首先过滤文本字符串的长度()以消除不匹配的行，也可以加快速度。

就为了这该死的，clickhouse：

select * from foo where substr(s,10) in ('78593310911','21480020195','22970730260');
6 rows in set. Elapsed: 1.460 sec.
Processed 200.00 million rows, 5.68 GB (136.97 million rows/s., 3.89 GB/s.)

票数 0

Database Administration用户

发布于 2021-02-16 23:35:57

结束子字符串不能使用索引查找，因为没有启动键.

若要优化检查结束子字符串，请在计算列上创建索引：

REVERSE(hexid)

然后你可以过滤：

REVERSE(hexid) LIKE reversedSearchString + '%'

这意味着可以对索引使用范围检查，否则就不可能这样做。

票数 0

页面原文内容由Database Administration提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://dba.stackexchange.com/questions/285479

复制

相似问题

问字符串选择的性能: vs正则表达式中的子字符串
EN

回答 3

Database Administration用户

Database Administration用户

Database Administration用户

结束子字符串不能使用索引查找，因为没有启动键.

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问字符串选择的性能: vs正则表达式中的子字符串EN

回答 3

Database Administration用户

Database Administration用户

Database Administration用户

结束子字符串不能使用索引查找，因为没有启动键.

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问字符串选择的性能: vs正则表达式中的子字符串
EN