首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >利用word_similarity优化邮政分选

利用word_similarity优化邮政分选
EN

Database Administration用户
提问于 2020-11-23 20:17:19
回答 2查看 1.3K关注 0票数 2

在Postgresql 12中,我有一张有200万行酒店名称的表,我正在构建一个提前打字机,这样用户就可以根据名称快速找到酒店。我玩过不同的Postgres选项,包括FTS,trigram和levenshtein距离。

word_similarity in pg_trgm给了我最好的结果,但是每当我需要根据相似性评分进行排序时,事情变得太慢了(不对查询进行排序,以毫秒为单位):

代码语言:javascript
复制
SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name ORDER BY sml DESC LIMIT 5;

                                          name                                          |    sml
----------------------------------------------------------------------------------------+------------
 Courtyard by Marriott New York Downtown Manhattan/World Trade Center Area              | 0.53846157
 Fairfield Inn & Suites by Marriott New York Downtown Manhattan/World Trade Center Area | 0.53846157
 Residence Inn by Marriott New York Downtown Manhattan/World Trade Center Area          | 0.53846157
 AC Hotel by Marriott New York Times Square                                             |    0.53125
 Courtyard by Marriott World Trade Center, Abu Dhabi                                    |  0.5263158
(5 rows)

Time: 9602.969 ms (00:09.603)
代码语言:javascript
复制
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, COSTS) SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name ORDER BY sml DESC LIMIT 5;
                                                                                          QUERY PLAN                                                                                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=65390.53..65391.11 rows=5 width=27) (actual time=9619.113..9625.482 rows=5 loops=1)
   Output: name, (word_similarity('trade center new york mariott'::text, (name)::text))
   Buffers: shared hit=1746167
   ->  Gather Merge  (cost=65390.53..65589.11 rows=1702 width=27) (actual time=9619.109..9625.474 rows=5 loops=1)
         Output: name, (word_similarity('trade center new york mariott'::text, (name)::text))
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=1746167
         ->  Sort  (cost=64390.50..64392.63 rows=851 width=27) (actual time=9612.578..9612.580 rows=4 loops=3)
               Output: name, (word_similarity('trade center new york mariott'::text, (name)::text))
               Sort Key: (word_similarity('trade center new york mariott'::text, (hotels_hotel.name)::text)) DESC
               Sort Method: top-N heapsort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               Buffers: shared hit=1746167
               Worker 0: actual time=9610.098..9610.100 rows=5 loops=1
                 Buffers: shared hit=581861
               Worker 1: actual time=9609.314..9609.317 rows=5 loops=1
                 Buffers: shared hit=579828
               ->  Parallel Index Only Scan using hotels_hotel_name_a9005e17 on public.hotels_hotel  (cost=0.43..64376.37 rows=851 width=27) (actual time=4.040..9606.166 rows=15070 loops=3)
                     Output: name, word_similarity('trade center new york mariott'::text, (name)::text)
                     Filter: ('trade center new york mariott'::text %> (hotels_hotel.name)::text)
                     Rows Removed by Filter: 666002
                     Heap Fetches: 2
                     Buffers: shared hit=1746113
                     Worker 0: actual time=0.281..9603.591 rows=14890 loops=1
                       Buffers: shared hit=581834
                     Worker 1: actual time=8.157..9602.811 rows=14678 loops=1
                       Buffers: shared hit=579801
 Planning Time: 0.396 ms
 Execution Time: 9625.576 ms
(31 rows)

Time: 9626.933 ms (00:09.627)

有什么方法可以优化这个查询吗?很乐意像我想的那样建立一个巨大的索引。

为什么word_similarity最适合我的用例:

  • 允许输入(注意我在查询中拼写错了万豪)
  • 允许单词的不同顺序

编辑1

没有排序的解释分析:

代码语言:javascript
复制
EXPLAIN ANALYZE SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel where 'trade center new york mariott' %> name LIMIT 5;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1264.75 rows=5 width=27) (actual time=0.391..15.981 rows=5 loops=1)
   ->  Seq Scan on hotels_hotel  (cost=0.00..516777.29 rows=2043 width=27) (actual time=0.384..15.969 rows=5 loops=1)
         Filter: ('trade center new york mariott'::text %> (name)::text)
         Rows Removed by Filter: 292
 Planning Time: 0.285 ms
 Execution Time: 16.041 ms
(6 rows)

Time: 17.800 ms

编辑2

表中列出的相关指数:

代码语言:javascript
复制
CREATE INDEX autocomplete_gist ON public.hotels_hotel USING gist (name gist_trgm_ops)
CREATE INDEX autocomplete_name_idx ON public.hotels_hotel USING gin (name gin_trgm_ops)
CREATE INDEX hotels_hotel_name_a9005e17 ON public.hotels_hotel USING btree (name)
CREATE INDEX hotels_hotel_name_a9005e17_like ON public.hotels_hotel USING btree (name varchar_pattern_ops)
CREATE UNIQUE INDEX hotels_hotel_pkey ON public.hotels_hotel USING btree (id)
CREATE INDEX hotels_hotel_popularity_05985c85 ON public.hotels_hotel USING btree (popularity)

编辑3

新的查询:

代码语言:javascript
复制
EXPLAIN (ANALYZE, BUFFERS) SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel ORDER BY name <->> 'trade center new york mariott' LIMIT 5;
                                                                       QUERY PLAN

 Limit  (cost=0.41..5.65 rows=5 width=31) (actual time=3632.397..3633.554 rows=5 loops=1)
   Buffers: shared hit=1 read=32537
   I/O Timings: read=184.432
   ->  Index Scan using autocomplete_gist on hotels_hotel  (cost=0.41..2140836.75 rows=2043215 width=31) (actual time=3632.394..3633.548 rows=5 loops=1)
         Order By: ((name)::text <->> 'trade center new york mariott'::text)
         Buffers: shared hit=1 read=32537
         I/O Timings: read=184.432
 Planning Time: 0.250 ms
 Execution Time: 3679.847 ms
EN

回答 2

Database Administration用户

发布于 2020-11-24 03:40:56

使用GiST索引:

代码语言:javascript
复制
CREATE INDEX ON hotels_hotel USING gist (name gist_trgm_ops);

然后像这样搜索:

代码语言:javascript
复制
SELECT name,
       word_similarity('trade center new york mariott', name) AS sml
FROM hotels_hotel
ORDER BY name <->> 'trade center new york mariott'
LIMIT 5;

这会让你得到最接近的五支火柴。

票数 1
EN

Database Administration用户

发布于 2020-11-25 03:51:51

与GiST相比,使用GIN索引可能会更幸运。

GIN不像GiST那样支持KNN,所以您必须应用具有适当值的pg_trgm.word_similarity_threshold的匹配操作,然后对幸存下来的行进行排序。

代码语言:javascript
复制
SELECT name, word_similarity('trade center new york mariott', name) AS sml from hotels_hotel 
WHERE name %> 'trade center new york mariott' 
ORDER BY name <->> 'trade center new york mariott' 
LIMIT 5;

规划师并没有对GIN和GiST索引的成本做出非常巧妙的区分,因此您可能需要删除GiST索引才能让它使用GIN。

票数 1
EN
页面原文内容由Database Administration提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://dba.stackexchange.com/questions/280196

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档