文章/答案/技术大牛

发布

社区首页 >问答首页 >PostgreSQL不平衡索引查询优化

问PostgreSQL不平衡索引查询优化
EN

Database Administration用户

提问于 2022-12-12 11:51:02

回答 1查看 72关注 0票数 2

我的查询存在性能问题，因为索引严重不平衡。我在桌子上有一个列family_id。这张表有大约1.5亿份记录。此family_id列标识表中哪些记录属于同一个家族。一个家庭平均有5名成员。但是，由于数据不清晰，在我们的数据中存在边缘情况，其中family_id为NULL (~600.000条记录)或family_id为-1 (~320万条记录)。

我正在Logstash中构建一个数据管道，它查询这些数据并将其逐步转发到Elasticsearch索引。对于我的用例，我需要在每个家庭的基础上带来数据。在family_id是-1或NULL的情况下，需要特殊处理，因此我需要在此基础上进行筛选。然而，这样的查询对于一批10.000条记录来说花费了非常长的时间。给出一个时差的概念；正常的情况大约需要10-20秒，而-1或NULL的情况大约需要30分钟。

对于这种情况，是否有适当的解决方案，使我能够加快查询的速度？

谢谢!

查询和DDL根据请求我已经添加了表DDL、Query和EXPLAIN ANALYZE。表DDL：

CREATE TABLE "xml".t_patent_document_values (
    patent_document_value_id serial4 NOT NULL,
    publication_id int4 NOT NULL,
    ucid varchar(32) NULL,
    lang bpchar(2) NULL,
    country bpchar(2) NULL,
    doc_number varchar(32) NULL,
    kind varchar(4) NULL,
    published date NULL,
    produced date NULL,
    withdraw bool NULL DEFAULT false,
    family_id int4 NULL,
    status varchar(16) NULL,
    modified_load_id int4 NOT NULL,
    created_load_id int4 NULL,
    deleted_load_id int4 NULL,
    CONSTRAINT t_patent_document_values_pkey PRIMARY KEY (patent_document_value_id)
);
CREATE INDEX family_id_publication_id_idx ON xml.t_patent_document_values USING btree (family_id, publication_id) WHERE (family_id = '-1'::integer);
CREATE INDEX family_id_publication_id_null_idx ON xml.t_patent_document_values USING btree (family_id, publication_id) WHERE ((family_id IS NULL) AND (publication_id > 0));
CREATE INDEX idx_country_published ON xml.t_patent_document_values USING btree (country, published);
CREATE INDEX idx_created_load_id ON xml.t_patent_document_values USING btree (created_load_id);
CREATE INDEX idx_created_load_id_modified_load_id ON xml.t_patent_document_values USING btree (created_load_id, modified_load_id);
CREATE INDEX idx_created_load_id_modified_load_id_family_id ON xml.t_patent_document_values USING btree (created_load_id, modified_load_id, family_id);
CREATE INDEX idx_ctry_dnum ON xml.t_patent_document_values USING btree (country, doc_number);
CREATE INDEX idx_family_id ON xml.t_patent_document_values USING btree (family_id);
CREATE INDEX idx_modified_load_id ON xml.t_patent_document_values USING btree (modified_load_id);
CREATE INDEX idx_patent_document_publication_id ON xml.t_patent_document_values USING btree (publication_id);
CREATE INDEX idx_patent_document_published ON xml.t_patent_document_values USING btree (published);
CREATE INDEX idx_patent_document_ucid ON xml.t_patent_document_values USING btree (ucid);
CREATE INDEX idx_patent_document_values_country ON xml.t_patent_document_values USING btree (country);
CREATE INDEX idx_published_month ON xml.t_patent_document_values USING btree (date_part('month'::text, published));
CREATE INDEX idx_published_year ON xml.t_patent_document_values USING btree (date_part('year'::text, published));
CREATE INDEX idx_t_patent_document_values_cmp ON xml.t_patent_document_values USING btree (publication_id, modified_load_id);
CREATE INDEX idx_withdrawn ON xml.t_patent_document_values USING btree (withdraw);

快速查询(family_id != -1 or family_id is not null)：

select array_agg(tpdv.publication_id)
from 
    xml.t_patent_document_values tpdv 
where 
    tpdv.family_id > 0
group by tpdv.family_id
order by tpdv.family_id
limit 1000

它的EXPLAIN ANALYZE：

Limit  (cost=0.57..20178.18 rows=1000 width=36) (actual time=0.017..16.191 rows=1000 loops=1)
  ->  GroupAggregate  (cost=0.57..473243036.65 rows=23453864 width=36) (actual time=0.016..16.123 rows=1000 loops=1)
        Group Key: family_id
        ->  Index Scan using idx_family_id on t_patent_document_values tpdv  (cost=0.57..472220861.17 rows=145800436 width=8) (actual time=0.012..15.608 rows=1002 loops=1)
              Index Cond: (family_id > 0)
Planning time: 0.323 ms
Execution time: 16.259 ms

请注意，实际上，这个查询将在family_id上完成更多的聚合，但这是一个简化的版本，因此问题可能更容易解决。

缓慢的查询：

select *
from 
    xml.t_patent_document_values tpdv 
    where (tpdv.family_id = -1 or tpdv.family_id is null) and publication_id > 0
order by publication_id
limit 1000

它的EXPLAIN ANALYZE：

Limit  (cost=0.57..122889.16 rows=1000 width=73) (actual time=52632.236..1731648.507 rows=1000 loops=1)
  ->  Index Scan using idx_patent_document_publication_id on t_patent_document_values tpdv  (cost=0.57..484776575.75 rows=3944846 width=73) (actual time=52632.235..1731648.326 rows=1000 loops=1)
        Index Cond: (publication_id > 0)
        Filter: ((family_id = '-1'::integer) OR (family_id IS NULL))
        Rows Removed by Filter: 27646323
Planning time: 3.102 ms
Execution time: 1731657.620 ms

index-tuning

postgresql

query-performance

postgresql-performance

回答 1

Database Administration用户

发布于 2022-12-12 22:37:07

快速查询是快速的，因为它可以为组BY和ORDER使用索引idx_family_id，因此它可以扫描前1000 family_id的索引，然后由于限制1000而停止。

慢速查询需要使用两个索引，family_id_publication_id_idx和idx_family_id，以满足OR中的这两种条件，因此它需要读取所有1314953行，然后按publication_id对它们进行排序，并且只保留前1000行。

要加快查询速度，需要一个满足WHERE中的OR条件并以publication_id作为第一列的单一索引来满足ORDER。

试着用这个：

CREATE INDEX publication_id_family_id_null_or_neg_idx ON xml.t_patent_document_values USING btree (publication_id, family_id) WHERE ((family_id IS NULL or family_id = -1) AND (publication_id > 0));

使用此索引，Postresql只需扫描索引的前1000行，并且它们已经按照所需的顺序排列，因此可以停止进行限制。

票数 0

页面原文内容由Database Administration提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://dba.stackexchange.com/questions/320847

复制

相似问题

问PostgreSQL不平衡索引查询优化
EN

回答 1

Database Administration用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问PostgreSQL不平衡索引查询优化EN

回答 1

Database Administration用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问PostgreSQL不平衡索引查询优化
EN