我的查询存在性能问题,因为索引严重不平衡。我在桌子上有一个列family_id。这张表有大约1.5亿份记录。此family_id列标识表中哪些记录属于同一个家族。一个家庭平均有5名成员。但是,由于数据不清晰,在我们的数据中存在边缘情况,其中family_id为NULL (~600.000条记录)或family_id为-1 (~320万条记录)。
我正在Logstash中构建一个数据管道,它查询这些数据并将其逐步转发到Elasticsearch索引。对于我的用例,我需要在每个家庭的基础上带来数据。在family_id是-1或NULL的情况下,需要特殊处理,因此我需要在此基础上进行筛选。然而,这样的查询对于一批10.000条记录来说花费了非常长的时间。给出一个时差的概念;正常的情况大约需要10-20秒,而-1或NULL的情况大约需要30分钟。
对于这种情况,是否有适当的解决方案,使我能够加快查询的速度?
谢谢!
查询和DDL根据请求我已经添加了表DDL、Query和EXPLAIN ANALYZE。表DDL:
CREATE TABLE "xml".t_patent_document_values (
patent_document_value_id serial4 NOT NULL,
publication_id int4 NOT NULL,
ucid varchar(32) NULL,
lang bpchar(2) NULL,
country bpchar(2) NULL,
doc_number varchar(32) NULL,
kind varchar(4) NULL,
published date NULL,
produced date NULL,
withdraw bool NULL DEFAULT false,
family_id int4 NULL,
status varchar(16) NULL,
modified_load_id int4 NOT NULL,
created_load_id int4 NULL,
deleted_load_id int4 NULL,
CONSTRAINT t_patent_document_values_pkey PRIMARY KEY (patent_document_value_id)
);
CREATE INDEX family_id_publication_id_idx ON xml.t_patent_document_values USING btree (family_id, publication_id) WHERE (family_id = '-1'::integer);
CREATE INDEX family_id_publication_id_null_idx ON xml.t_patent_document_values USING btree (family_id, publication_id) WHERE ((family_id IS NULL) AND (publication_id > 0));
CREATE INDEX idx_country_published ON xml.t_patent_document_values USING btree (country, published);
CREATE INDEX idx_created_load_id ON xml.t_patent_document_values USING btree (created_load_id);
CREATE INDEX idx_created_load_id_modified_load_id ON xml.t_patent_document_values USING btree (created_load_id, modified_load_id);
CREATE INDEX idx_created_load_id_modified_load_id_family_id ON xml.t_patent_document_values USING btree (created_load_id, modified_load_id, family_id);
CREATE INDEX idx_ctry_dnum ON xml.t_patent_document_values USING btree (country, doc_number);
CREATE INDEX idx_family_id ON xml.t_patent_document_values USING btree (family_id);
CREATE INDEX idx_modified_load_id ON xml.t_patent_document_values USING btree (modified_load_id);
CREATE INDEX idx_patent_document_publication_id ON xml.t_patent_document_values USING btree (publication_id);
CREATE INDEX idx_patent_document_published ON xml.t_patent_document_values USING btree (published);
CREATE INDEX idx_patent_document_ucid ON xml.t_patent_document_values USING btree (ucid);
CREATE INDEX idx_patent_document_values_country ON xml.t_patent_document_values USING btree (country);
CREATE INDEX idx_published_month ON xml.t_patent_document_values USING btree (date_part('month'::text, published));
CREATE INDEX idx_published_year ON xml.t_patent_document_values USING btree (date_part('year'::text, published));
CREATE INDEX idx_t_patent_document_values_cmp ON xml.t_patent_document_values USING btree (publication_id, modified_load_id);
CREATE INDEX idx_withdrawn ON xml.t_patent_document_values USING btree (withdraw);快速查询(family_id != -1 or family_id is not null):
select array_agg(tpdv.publication_id)
from
xml.t_patent_document_values tpdv
where
tpdv.family_id > 0
group by tpdv.family_id
order by tpdv.family_id
limit 1000它的EXPLAIN ANALYZE:
Limit (cost=0.57..20178.18 rows=1000 width=36) (actual time=0.017..16.191 rows=1000 loops=1)
-> GroupAggregate (cost=0.57..473243036.65 rows=23453864 width=36) (actual time=0.016..16.123 rows=1000 loops=1)
Group Key: family_id
-> Index Scan using idx_family_id on t_patent_document_values tpdv (cost=0.57..472220861.17 rows=145800436 width=8) (actual time=0.012..15.608 rows=1002 loops=1)
Index Cond: (family_id > 0)
Planning time: 0.323 ms
Execution time: 16.259 ms请注意,实际上,这个查询将在family_id上完成更多的聚合,但这是一个简化的版本,因此问题可能更容易解决。
缓慢的查询:
select *
from
xml.t_patent_document_values tpdv
where (tpdv.family_id = -1 or tpdv.family_id is null) and publication_id > 0
order by publication_id
limit 1000它的EXPLAIN ANALYZE:
Limit (cost=0.57..122889.16 rows=1000 width=73) (actual time=52632.236..1731648.507 rows=1000 loops=1)
-> Index Scan using idx_patent_document_publication_id on t_patent_document_values tpdv (cost=0.57..484776575.75 rows=3944846 width=73) (actual time=52632.235..1731648.326 rows=1000 loops=1)
Index Cond: (publication_id > 0)
Filter: ((family_id = '-1'::integer) OR (family_id IS NULL))
Rows Removed by Filter: 27646323
Planning time: 3.102 ms
Execution time: 1731657.620 ms发布于 2022-12-12 22:37:07
快速查询是快速的,因为它可以为组BY和ORDER使用索引idx_family_id,因此它可以扫描前1000 family_id的索引,然后由于限制1000而停止。
慢速查询需要使用两个索引,family_id_publication_id_idx和idx_family_id,以满足OR中的这两种条件,因此它需要读取所有1314953行,然后按publication_id对它们进行排序,并且只保留前1000行。
要加快查询速度,需要一个满足WHERE中的OR条件并以publication_id作为第一列的单一索引来满足ORDER。
试着用这个:
CREATE INDEX publication_id_family_id_null_or_neg_idx ON xml.t_patent_document_values USING btree (publication_id, family_id) WHERE ((family_id IS NULL or family_id = -1) AND (publication_id > 0));使用此索引,Postresql只需扫描索引的前1000行,并且它们已经按照所需的顺序排列,因此可以停止进行限制。
https://dba.stackexchange.com/questions/320847
复制相似问题