文章/答案/技术大牛

发布

社区首页 >问答首页 >Postgres:使用IN vs JOIN vs选择大量行

问Postgres:使用IN vs JOIN vs选择大量行
EN

Stack Overflow用户

提问于 2017-12-15 05:15:17

回答 1查看 359关注 0票数 1

我需要从一个~150M的记录表中检索由唯一数字ID标识的1,000 - 50,000条记录。我们在AWS RDS中托管数据库。该表有几个integer列，一个character varying(500)列和一个用于id列的bigint。每一列都有btree索引。

当前的生产查询是

SELECT *
FROM mytable t
WHERE id IN (N1, N2, .. Nm)

对于m< 1,000，返回时间不到1秒，这是可以接受的。问题是时间随m线性增加。当m= 30,000时，查询需要20+秒。

我们尝试创建索引临时表和使用INNER JOIN，但没有明显的性能改进。(https://stackoverflow.com/a/24647700/226960)

下面是m> 70k的缩写转储。

CREATE TEMPORARY TABLE temp_phrases (phrase_id integer) ON COMMIT DROP
CREATE TABLE temp_phrases: OK, time: 0.01 seconds.
CREATE INDEX temp_phrases_phrase_id_idx ON temp_phrases(phrase_id)
CREATE INDEX '.temp_phrases.'_phrase_id_idx: OK, time: 0 seconds.
INSERT INTO TABLE temp_phrases: 70544, time: 0.3 seconds.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EXPLAIN SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
Nested Loop Left Join  (cost=0.57..665368.61 rows=79815 width=34)
->  Seq Scan on temp_phrases (cost=0.00..1111.15 rows=79815 width=4)
->  Index Scan using thesaurus_pkey on thesaurus  (cost=0.57..8.31 rows=1 width=42)
    Index Cond: (id = temp_phrases.phrase_id)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
TEMP TABLE AND JOIN: 70544 results, 52.2seconds

如果我们重复查询，它只需要不到一秒的时间就能得到结果，这将表明与硬件相关的瓶颈

https://stackoverflow.com/a/24254825/226960

有可能改进原始的id IN(_list_)查询吗？获得额外的RDS IOPS会有帮助吗？

编辑

EXPLAIN (ANALYZE, BUFFERS)输出

INSERT INTO TABLE temp_phrases: 41504, time: 0.17 seconds.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
Nested Loop Left Join  (cost=0.57..396319.90 rows=46920 width=34) (actual time=0.708..23874.200 rows=41504 loops=1)
  Buffers: shared hit=167593 read=39458 dirtied=138, local hit=184
  ->  Seq Scan on temp_phrases  (cost=0.00..653.20 rows=46920 width=4) (actual time=0.012..21.138 rows=41504 loops=1)
        Buffers: local hit=184
  ->  Index Scan using thesaurus_pkey on thesaurus  (cost=0.57..8.42 rows=1 width=42) (actual time=0.569..0.572 rows=1 loops=41504)
        Index Cond: (id = temp_phrases.phrase_id)
        Buffers: shared hit=167593 read=39458 dirtied=138
Planning time: 1.493 ms
Execution time: 23887.493 ms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
TEMP TABLE AND JOIN: 41504 results, 24.2seconds

amazon-rds

sql

postgresql

回答 1

Stack Overflow用户

发布于 2017-12-15 07:26:21

使用btree index，或者根本不使用索引，只会让它尝试匹配每一个单独的行组合。

但是，如果你在比赛结束后让它停止，你将获得两倍的速度。

临时表上的唯一键或主键。因为当您加入它时，它将知道一旦找到一个匹配项，就不会再找到其他匹配项。

根据我的经验，当从A中选择并加入B时，在唯一键上加入会有很大帮助。性能提升通常减少50%的时间(与简单的索引相比)。

如果唯一键在其他情况下是不可能的，哈希索引将做同样的事情，尽管它将需要一段时间来索引。

但正如您在计划中看到的，它执行序列扫描，而不是临时表的索引扫描。使用任何索引都可能有所帮助，因为那里有相当多的元组。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/47822176

复制

相似问题

问Postgres:使用IN vs JOIN vs选择大量行
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Postgres:使用IN vs JOIN vs选择大量行EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Postgres:使用IN vs JOIN vs选择大量行
EN