我试图在以下原始模式(dataset来自https://ipinfo.io/免费试用版)上优化ip地址地理位置查询:
CREATE TABLE raw_geolocations (
start_ip INET NOT NULL,
end_ip INET NOT NULL,
join_key CIDR NOT NULL,
city TEXT NOT NULL,
region TEXT,
country TEXT NOT NULL,
lat NUMERIC NOT NULL,
lng NUMERIC NOT NULL,
postal TEXT,
timezone TEXT NOT NULL
);正在查询:
select *
from unnest(array[
inet '<ip_address_string>'
]) ip_address
left join raw_geolocations g on ip_address between start_ip and end_ip;速度非常慢(在50s和120s之间,用于单个ip地址查询)。
因此,我遵循了一个“指南”这里,并将raw_geolocations迁移到一个名为ip_geolocations的新表,该表在ip_segment字段上创建了一个gist索引:
create type iprange as range (subtype=inet);
create table ip_geolocations(
id bigserial primary key not null,
ip_segment iprange not null,
join_key cidr not null,
city TEXT NOT NULL,
region TEXT,
country TEXT NOT NULL,
lat NUMERIC NOT NULL,
lng NUMERIC NOT NULL,
postal TEXT,
timezone TEXT NOT NULL
);
insert into ip_geolocations(ip_segment, join_key, city, region, country, lat, lng, postal, timezone)
select iprange(start_ip, end_ip, '[]'), join_key, city, region, country, lat, lng, postal, timezone
from raw_geolocations;
create index gist_idx_ip_geolocations_ip_segment on ip_geolocations USING gist (ip_segment);对于单个ip地址查询来说,这是很好的,在20ms和200ms之间带来了执行时间,但是对于大容量查询来说,执行时间相当慢,一次从1.5s到3s需要100个ip地址。
样例查询
select *
from unnest(array[
inet '<ip_address_string>'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address;当我深入到其他优化中时,会遇到一些问题:
然后,我决定查看gin索引,但在运行以下命令时遇到以下postgres错误:
create index test_idx_geolocations on ip_geolocations using gin (ip_segment);
ERROR: operator class "inet_ops" does not exist for access method "gin"
Time: 2.173 ms我很难理解如何使用inet_ops作为我定义的iprange类型甚至inet和cidr类型的运算符类的说明。此外,我试图定义自己的运算符类,但没有结果:
create operator class gin_iprange_ops default for type iprange
using gin as
OPERATOR 7 @> (iprange, inet),
OPERATOR 8 <@ (inet, iprange);
ERROR: operator does not exist: iprange @> inet这个错误对我来说没有多大意义,因为上面的示例查询与@>操作符一起工作。
在被阻塞时,我决定看看仅仅使用postgres配置是否会改善查询时间,因此我做了以下更改:
alter table ip_geolocations set (parallel_workers = 4);
set max_parallel_maintenance_workers to 4;
set maintenance_work_mem to '1 GB';这似乎非线性地改善了索引创建时间,但在查询执行时间方面没有任何明显的效果。
最后,还有其他类似形式的ip地址表:
CREATE TABLE raw_<other_table> (
start_ip INET NOT NULL,
end_ip INET NOT NULL,
join_key CIDR NOT NULL,
... <other_fields>
);我想要加入。对于这些其他表(3 ),我遵循了类似的步骤,并将它们迁移到带有iprange字段和gist索引的ip_<other_table>表中。
不幸的是,在join_key上加入这些表仍然非常缓慢。
任何帮助都是非常感谢的。I对IP地址查询(范围包含查询)、自定义操作符类、索引类型(gist和gin)非常陌生,所以任何可能帮助我的指针或文档都会很好。
编辑
ip_address是否在ip_segment: iprange范围内:explain analyze select *
from unnest(array[
inet '98.237.137.99'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address
left join ip_companies c on c.ip_segment @> ip_address
left join ip_carriers cr on cr.ip_segment @> ip_address
left join ip_privacy_detections pd on pd.ip_segment @> ip_address;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=1930.48..7972624532924773.00 rows=931561551811655 width=412) (actual time=562.324..589.083 rows=1 loops=1)
-> Nested Loop Left Join (cost=23.78..1160195946065.69 rows=116337334725 width=356) (actual time=287.962..314.719 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.84..30592828311.99 rows=1694915933 width=293) (actual time=282.013..308.768 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.42..515233.65 rows=27965 width=179) (actual time=270.532..297.172 rows=1 loops=1)
-> Function Scan on unnest ip_address (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.008 rows=1 loops=1)
-> Index Scan using idx_ip_companies_ip_segment on ip_companies c (cost=0.42..513835.38 rows=139826 width=147) (actual time=270.519..297.157 rows=1 loops=1)
Index Cond: (ip_segment @> ip_address.ip_address)
-> Index Scan using idx_ip_geolocations_ip_segment on ip_geolocations g (cost=0.42..1090919.64 rows=303041 width=114) (actual time=11.474..11.588 rows=1 loops=1)
Index Cond: (ip_segment @> ip_address.ip_address)
-> Bitmap Heap Scan on ip_carriers cr (cost=22.94..663.04 rows=343 width=63) (actual time=5.939..5.939 rows=0 loops=1)
Recheck Cond: (ip_segment @> ip_address.ip_address)
-> Bitmap Index Scan on test_idx_cr (cost=0.00..22.85 rows=343 width=0) (actual time=5.935..5.935 rows=0 loops=1)
Index Cond: (ip_segment @> ip_address.ip_address)
-> Bitmap Heap Scan on ip_privacy_detections pd (cost=1906.70..68119.89 rows=40037 width=56) (actual time=274.352..274.353 rows=0 loops=1)
Recheck Cond: (ip_segment @> ip_address.ip_address)
-> Bitmap Index Scan on idx_ip_privacy_detections_ip_segment (cost=0.00..1896.69 rows=40037 width=0) (actual time=274.349..274.350 rows=0 loops=1)
Index Cond: (ip_segment @> ip_address.ip_address)
Planning Time: 0.739 ms
Execution Time: 589.823 ms
(19 rows)注意,对于被查询的4个表中的每个表,我在它们的gist字段上构建了一个ip_segment索引。
这些left join中的每一个都可能是应用程序层中相互关联的单独查询,但在理想的情况下,我可以让rds为我处理关联(联接)。
只对一个ip地址运行上面的查询将使用26.376 ms。对于大约100个随机ip地址,它大约需要11309.123 ms (11秒)(并在运行几次相同的查询后提高到大约1.8秒)。
cidr join_key字段:explain analyze select *
from unnest(array[
inet '98.237.137.99'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address
left join ip_companies c on c.join_key = g.join_key
left join ip_carriers cr on cr.join_key = g.join_key
left join ip_privacy_detections pd on pd.join_key = g.join_key;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Right Join (cost=13413427.68..762295480.26 rows=49873343109 width=412) (actual time=53725.909..53726.119 rows=121 loops=1)
Merge Cond: ((c.join_key)::inet = (g.join_key)::inet)
-> Sort (cost=10678179.38..10748092.42 rows=27965218 width=147) (actual time=45444.278..46291.865 rows=4939342 loops=1)
Sort Key: c.join_key
Sort Method: external merge Disk: 4415120kB
-> Seq Scan on ip_companies c (cost=0.00..910715.18 rows=27965218 width=147) (actual time=0.014..4597.313 rows=27965218 loops=1)
-> Materialize (cost=2735248.30..3478041.50 rows=41149314 width=265) (actual time=6963.430..6963.598 rows=121 loops=1)
-> Merge Left Join (cost=2735248.30..3375168.21 rows=41149314 width=265) (actual time=6963.426..6963.502 rows=121 loops=1)
Merge Cond: ((g.join_key)::inet = (pd.join_key)::inet)
-> Merge Left Join (cost=1112932.44..1115281.67 rows=124973 width=209) (actual time=80.721..80.725 rows=1 loops=1)
Merge Cond: ((g.join_key)::inet = (cr.join_key)::inet)
-> Sort (cost=1103325.02..1103476.54 rows=60608 width=146) (actual time=28.093..28.094 rows=1 loops=1)
Sort Key: g.join_key
Sort Method: quicksort Memory: 25kB
-> Nested Loop Left Join (cost=0.42..1093950.06 rows=60608 width=146) (actual time=27.717..28.086 rows=1 loops=1)
-> Function Scan on unnest ip_address (cost=0.00..0.01 rows=1 width=32) (actual time=0.016..0.017 rows=1 loops=1)
-> Index Scan using idx_ip_geolocations_ip_segment on ip_geolocations g (cost=0.42..1090919.64 rows=303041 width=114) (actual time=27.695..28.062 rows=1 loops=1)
Index Cond: (ip_segment @> ip_address.ip_address)
-> Materialize (cost=9607.42..9950.61 rows=68639 width=63) (actual time=44.198..50.257 rows=20609 loops=1)
-> Sort (cost=9607.42..9779.01 rows=68639 width=63) (actual time=44.195..47.308 rows=20609 loops=1)
Sort Key: cr.join_key
Sort Method: external merge Disk: 5120kB
-> Seq Scan on ip_carriers cr (cost=0.00..1510.39 rows=68639 width=63) (actual time=0.486..12.759 rows=68639 loops=1)
-> Materialize (cost=1622315.86..1662352.94 rows=8007417 width=56) (actual time=5143.369..6417.708 rows=4015488 loops=1)
-> Sort (cost=1622315.86..1642334.40 rows=8007417 width=56) (actual time=5143.366..5843.105 rows=4015488 loops=1)
Sort Key: pd.join_key
Sort Method: external merge Disk: 439120kB
-> Seq Scan on ip_privacy_detections pd (cost=0.00..156763.17 rows=8007417 width=56) (actual time=0.802..845.180 rows=8007417 loops=1)
Planning Time: 12.618 ms
Execution Time: 54800.482 ms
(30 rows)注意,对于被查询的4个表中的每个表,我在两个字段( gist字段)上构建了一个(ip_segment range_ops, join_key inet_ops)索引。
即使是一个ip地址,这个查询也要花费大约59秒的时间执行,这是非常糟糕的。我想知道这是否与使用range_ops而不是inet_ops作为ip_segment gist索引字段有关。
让我困惑的是,为什么我构建在join_key和ip_segment上的join_key索引甚至没有被用于3个不是ip_geolocations的表。
ipinfo的实际API进行比较,对于同一组ip地址,对于每个完整请求(每批100个ip地址),每个请求的长度都小于650 ms (这意味着有很大的改进空间)。发布于 2022-07-28 23:49:52
实际上,最好的方法是在将ip列转换为start_ip字段之后,在INET字段上创建一个BTREE索引。然后,查询是对<=或BETWEEN的简单检查。
https://stackoverflow.com/questions/72946115
复制相似问题