文章/答案/技术大牛

发布

问如何优化IP地址查询
EN

Stack Overflow用户

提问于 2022-07-12 01:21:58

回答 1查看 278关注 0票数 3

我试图在以下原始模式(dataset来自https://ipinfo.io/免费试用版)上优化ip地址地理位置查询：

CREATE TABLE raw_geolocations (
    start_ip INET NOT NULL,
    end_ip INET NOT NULL,
    join_key CIDR NOT NULL,
    city TEXT NOT NULL,
    region TEXT,
    country TEXT NOT NULL,
    lat NUMERIC NOT NULL,
    lng NUMERIC NOT NULL,
    postal TEXT,
    timezone TEXT NOT NULL
);

正在查询：

select *
from unnest(array[
    inet '<ip_address_string>'
]) ip_address
left join raw_geolocations g on ip_address between start_ip and end_ip;

速度非常慢(在50s和120s之间，用于单个ip地址查询)。

因此，我遵循了一个“指南”这里，并将raw_geolocations迁移到一个名为ip_geolocations的新表，该表在ip_segment字段上创建了一个gist索引：

create type iprange as range (subtype=inet);

create table ip_geolocations(
    id bigserial primary key not null,
    ip_segment iprange not null,
    join_key cidr not null,
    city TEXT NOT NULL,
    region TEXT,
    country TEXT NOT NULL,
    lat NUMERIC NOT NULL,
    lng NUMERIC NOT NULL,
    postal TEXT,
    timezone TEXT NOT NULL
);

insert into ip_geolocations(ip_segment, join_key, city, region, country, lat, lng, postal, timezone)
select iprange(start_ip, end_ip, '[]'), join_key, city, region, country, lat, lng, postal, timezone
from raw_geolocations;

create index gist_idx_ip_geolocations_ip_segment on ip_geolocations USING gist (ip_segment);

对于单个ip地址查询来说，这是很好的，在20ms和200ms之间带来了执行时间，但是对于大容量查询来说，执行时间相当慢，一次从1.5s到3s需要100个ip地址。

样例查询

select *
from unnest(array[
    inet '<ip_address_string>'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address;

当我深入到其他优化中时，会遇到一些问题：

然后，我决定查看gin索引，但在运行以下命令时遇到以下postgres错误：

create index test_idx_geolocations on ip_geolocations using gin (ip_segment);

ERROR:  operator class "inet_ops" does not exist for access method "gin"
Time: 2.173 ms

我很难理解如何使用inet_ops作为我定义的iprange类型甚至inet和cidr类型的运算符类的说明。此外，我试图定义自己的运算符类，但没有结果：

create operator class gin_iprange_ops default for type iprange
using gin as
    OPERATOR        7       @> (iprange, inet),
    OPERATOR        8       <@ (inet, iprange);

ERROR:  operator does not exist: iprange @> inet

这个错误对我来说没有多大意义，因为上面的示例查询与@>操作符一起工作。

在被阻塞时，我决定看看仅仅使用postgres配置是否会改善查询时间，因此我做了以下更改：

alter table ip_geolocations set (parallel_workers = 4);

set max_parallel_maintenance_workers to 4;

set maintenance_work_mem to '1 GB';

这似乎非线性地改善了索引创建时间，但在查询执行时间方面没有任何明显的效果。

最后，还有其他类似形式的ip地址表：

CREATE TABLE raw_<other_table> (
    start_ip INET NOT NULL,
    end_ip INET NOT NULL,
    join_key CIDR NOT NULL,
    ... <other_fields>
);

我想要加入。对于这些其他表(3 )，我遵循了类似的步骤，并将它们迁移到带有iprange字段和gist索引的ip_<other_table>表中。

不幸的是，在join_key上加入这些表仍然非常缓慢。

任何帮助都是非常感谢的。I对IP地址查询(范围包含查询)、自定义操作符类、索引类型(gist和gin)非常陌生，所以任何可能帮助我的指针或文档都会很好。

编辑

查询计划中，我只检查ip_address是否在ip_segment: iprange范围内：

explain analyze select *
from unnest(array[
    inet '98.237.137.99'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address
left join ip_companies c on c.ip_segment @> ip_address
left join ip_carriers cr on cr.ip_segment @> ip_address
left join ip_privacy_detections pd on pd.ip_segment @> ip_address;
                                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=1930.48..7972624532924773.00 rows=931561551811655 width=412) (actual time=562.324..589.083 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=23.78..1160195946065.69 rows=116337334725 width=356) (actual time=287.962..314.719 rows=1 loops=1)
         ->  Nested Loop Left Join  (cost=0.84..30592828311.99 rows=1694915933 width=293) (actual time=282.013..308.768 rows=1 loops=1)
               ->  Nested Loop Left Join  (cost=0.42..515233.65 rows=27965 width=179) (actual time=270.532..297.172 rows=1 loops=1)
                     ->  Function Scan on unnest ip_address  (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.008 rows=1 loops=1)
                     ->  Index Scan using idx_ip_companies_ip_segment on ip_companies c  (cost=0.42..513835.38 rows=139826 width=147) (actual time=270.519..297.157 rows=1 loops=1)
                           Index Cond: (ip_segment @> ip_address.ip_address)
               ->  Index Scan using idx_ip_geolocations_ip_segment on ip_geolocations g  (cost=0.42..1090919.64 rows=303041 width=114) (actual time=11.474..11.588 rows=1 loops=1)
                     Index Cond: (ip_segment @> ip_address.ip_address)
         ->  Bitmap Heap Scan on ip_carriers cr  (cost=22.94..663.04 rows=343 width=63) (actual time=5.939..5.939 rows=0 loops=1)
               Recheck Cond: (ip_segment @> ip_address.ip_address)
               ->  Bitmap Index Scan on test_idx_cr  (cost=0.00..22.85 rows=343 width=0) (actual time=5.935..5.935 rows=0 loops=1)
                     Index Cond: (ip_segment @> ip_address.ip_address)
   ->  Bitmap Heap Scan on ip_privacy_detections pd  (cost=1906.70..68119.89 rows=40037 width=56) (actual time=274.352..274.353 rows=0 loops=1)
         Recheck Cond: (ip_segment @> ip_address.ip_address)
         ->  Bitmap Index Scan on idx_ip_privacy_detections_ip_segment  (cost=0.00..1896.69 rows=40037 width=0) (actual time=274.349..274.350 rows=0 loops=1)
               Index Cond: (ip_segment @> ip_address.ip_address)
 Planning Time: 0.739 ms
 Execution Time: 589.823 ms
(19 rows)

注意，对于被查询的4个表中的每个表，我在它们的gist字段上构建了一个ip_segment索引。

这些left join中的每一个都可能是应用程序层中相互关联的单独查询，但在理想的情况下，我可以让rds为我处理关联(联接)。

只对一个ip地址运行上面的查询将使用26.376 ms。对于大约100个随机ip地址，它大约需要11309.123 ms (11秒)(并在运行几次相同的查询后提高到大约1.8秒)。

在查询计划中，我使用为表联接提供的cidr join_key字段：

explain analyze select *
from unnest(array[
    inet '98.237.137.99'
]) ip_address
left join ip_geolocations g on g.ip_segment @> ip_address
left join ip_companies c on c.join_key = g.join_key
left join ip_carriers cr on cr.join_key = g.join_key
left join ip_privacy_detections pd on pd.join_key = g.join_key;
                                                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=13413427.68..762295480.26 rows=49873343109 width=412) (actual time=53725.909..53726.119 rows=121 loops=1)
   Merge Cond: ((c.join_key)::inet = (g.join_key)::inet)
   ->  Sort  (cost=10678179.38..10748092.42 rows=27965218 width=147) (actual time=45444.278..46291.865 rows=4939342 loops=1)
         Sort Key: c.join_key
         Sort Method: external merge  Disk: 4415120kB
         ->  Seq Scan on ip_companies c  (cost=0.00..910715.18 rows=27965218 width=147) (actual time=0.014..4597.313 rows=27965218 loops=1)
   ->  Materialize  (cost=2735248.30..3478041.50 rows=41149314 width=265) (actual time=6963.430..6963.598 rows=121 loops=1)
         ->  Merge Left Join  (cost=2735248.30..3375168.21 rows=41149314 width=265) (actual time=6963.426..6963.502 rows=121 loops=1)
               Merge Cond: ((g.join_key)::inet = (pd.join_key)::inet)
               ->  Merge Left Join  (cost=1112932.44..1115281.67 rows=124973 width=209) (actual time=80.721..80.725 rows=1 loops=1)
                     Merge Cond: ((g.join_key)::inet = (cr.join_key)::inet)
                     ->  Sort  (cost=1103325.02..1103476.54 rows=60608 width=146) (actual time=28.093..28.094 rows=1 loops=1)
                           Sort Key: g.join_key
                           Sort Method: quicksort  Memory: 25kB
                           ->  Nested Loop Left Join  (cost=0.42..1093950.06 rows=60608 width=146) (actual time=27.717..28.086 rows=1 loops=1)
                                 ->  Function Scan on unnest ip_address  (cost=0.00..0.01 rows=1 width=32) (actual time=0.016..0.017 rows=1 loops=1)
                                 ->  Index Scan using idx_ip_geolocations_ip_segment on ip_geolocations g  (cost=0.42..1090919.64 rows=303041 width=114) (actual time=27.695..28.062 rows=1 loops=1)
                                       Index Cond: (ip_segment @> ip_address.ip_address)
                     ->  Materialize  (cost=9607.42..9950.61 rows=68639 width=63) (actual time=44.198..50.257 rows=20609 loops=1)
                           ->  Sort  (cost=9607.42..9779.01 rows=68639 width=63) (actual time=44.195..47.308 rows=20609 loops=1)
                                 Sort Key: cr.join_key
                                 Sort Method: external merge  Disk: 5120kB
                                 ->  Seq Scan on ip_carriers cr  (cost=0.00..1510.39 rows=68639 width=63) (actual time=0.486..12.759 rows=68639 loops=1)
               ->  Materialize  (cost=1622315.86..1662352.94 rows=8007417 width=56) (actual time=5143.369..6417.708 rows=4015488 loops=1)
                     ->  Sort  (cost=1622315.86..1642334.40 rows=8007417 width=56) (actual time=5143.366..5843.105 rows=4015488 loops=1)
                           Sort Key: pd.join_key
                           Sort Method: external merge  Disk: 439120kB
                           ->  Seq Scan on ip_privacy_detections pd  (cost=0.00..156763.17 rows=8007417 width=56) (actual time=0.802..845.180 rows=8007417 loops=1)
 Planning Time: 12.618 ms
 Execution Time: 54800.482 ms
(30 rows)

注意，对于被查询的4个表中的每个表，我在两个字段( gist字段)上构建了一个(ip_segment range_ops, join_key inet_ops)索引。

即使是一个ip地址，这个查询也要花费大约59秒的时间执行，这是非常糟糕的。我想知道这是否与使用range_ops而不是inet_ops作为ip_segment gist索引字段有关。

让我困惑的是，为什么我构建在join_key和ip_segment上的join_key索引甚至没有被用于3个不是ip_geolocations的表。

需要注意的一点是，我正在将它与ipinfo的实际API进行比较，对于同一组ip地址，对于每个完整请求(每批100个ip地址)，每个请求的长度都小于650 ms (这意味着有很大的改进空间)。

postgresql

indexing

ip-address

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-07-28 23:49:52

实际上，最好的方法是在将ip列转换为start_ip字段之后，在INET字段上创建一个BTREE索引。然后，查询是对<=或BETWEEN的简单检查。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/72946115

复制

相似问题

问如何优化IP地址查询
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何优化IP地址查询EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何优化IP地址查询
EN