Get the nearest n matching strings

Stack Overflow user
Asked 2022-01-06 08:01:43
Answers: 1 · Views: 95 · Followers: 0 · Votes: 0

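Hi, I am trying to match a string against the strings in a different data frame and get the nearest n matches based on a score.

Example: for each value in the string_2 column (df_2), I need to match it against string_1 (df_1) and get the 3 closest matches within each ID group.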

ID = c(100, 100,100,100,103,103,103,103,104,104,104,104)
string_1 = c("Jack Daniel","Jac","JackDan","steve","Mark","Dukes","Allan","Duke","Puma Nike","Puma","Nike","Addidas")

df_1 = data.frame(ID,string_1)

ID = c(100, 100, 185, 103,103, 104, 104,104)
string_2 = c("Jack Daniel","Mark","Order","Steve","Mark 2","Nike","Addidas","Reebok")

df_2 = data.frame(ID,string_2)

The output data frame df_out should look like this.

ID = c(100, 100,185,103,103,104,104,104)
string_2 = c("Jack Daniel","Mark","Order","Steve","Mark 2","Nike","Addidas","Reebok")
nearest_str_match_1 = c("Jack Daniel","JackDan","NA","Duke","Mark","Nike","Addidas","Nike")
nearest_str_match_2 =c("JackDan","Jack Daniel","NA","Dukes","Duke","Addidas","Nike","Puma Nike") 
nearest_str_match_3 =c("Jac","Jac","NA","Allan","Allan","Puma","Puma","Addidas") 
   
df_out = data.frame(ID,string_2,nearest_str_match_1,nearest_str_match_2,nearest_str_match_3)

I have tried doing this manually with the "stringdist" package, using the 'jw' method, to get the nearest values.

stringdist::stringdist("Jack Daniel","Jack Daniel","jw") 
stringdist::stringdist("Jack Daniel","Jac","jw")
stringdist::stringdist("Jack Daniel","JackDan","jw")
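The three manual calls above can be collapsed into one vectorised call. The following is a minimal sketch, assuming the stringdist package is installed; the names `target`, `candidates`, `d`, `n`, and `nearest` are illustrative and not from the original post. Note that with stringdist's default prefix weight `p = 0`, the 'jw' method reduces to plain Jaro distance.

```r
library(stringdist)

target <- "Jack Daniel"
candidates <- c("Jack Daniel", "Jac", "JackDan", "steve")

# Distance from the target to every candidate at once (0 = identical strings)
d <- stringdist(target, candidates, method = "jw")

# Order candidates from closest to farthest and keep the first n
n <- 3
nearest <- head(candidates[order(d)], n)
print(nearest)
```

This only handles a single target within a single ID group; doing it per group still needs a join or a split by ID.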

Thanks in advance.


1 Answer

Stack Overflow user

Accepted answer

Posted 2022-01-06 08:37:02

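library(dplyr)   # %>%, group_by, mutate, slice_min
library(tidyr)   # pivot_wider

# Pair each string_2 with every string_1 that shares its ID (inner join)
merge(df_1, df_2, by = 'ID') %>%
  group_by(string_2) %>%
  # Rank candidates by 'jw' distance within each group; 1 = closest
  mutate(dist = stringdist::stringdist(string_2, string_1, method = 'jw') %>%
           rank(ties.method = 'last')) %>%
  slice_min(dist, n = 3) %>%
  # One column per rank: nearest_str_match_1, _2, _3
  pivot_wider(names_from = dist, names_prefix = 'nearest_str_match_',
              values_from = string_1)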

# A tibble: 7 x 5
# Groups:   string_2 [7]
     ID string_2    nearest_str_match_1 nearest_str_match_2 nearest_str_match_3
  <dbl> <chr>       <chr>               <chr>               <chr>              
1   104 Addidas     Addidas             Nike                Puma               
2   100 Jack Daniel Jack Daniel         JackDan             Jac                
3   100 Mark        JackDan             Jack Daniel         Jac                
4   103 Mark 2      Mark                Duke                Allan              
5   104 Nike        Nike                Addidas             Puma               
6   104 Reebok      Nike                Puma Nike           Addidas            
7   103 Steve       Duke                Dukes               Allan   
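The result above has 7 rows, while the df_out requested in the question has 8: the row for ID 185 ("Order") is missing because `merge()` performs an inner join by default and df_1 has no rows with that ID. One possible way to keep such rows, sketched here under the assumption that dplyr, tidyr (1.1.0 or later, for `names_sort`), and stringdist are installed, is to join the wide result back onto df_2. The names `matches` and `rnk` are illustrative, and unlike the pipeline above this sketch also groups by ID and breaks ties with `'first'`. Unmatched rows get a real `NA` rather than the string "NA" used in the question.

```r
library(dplyr)
library(tidyr)

df_1 <- data.frame(
  ID = c(100, 100, 100, 100, 103, 103, 103, 103, 104, 104, 104, 104),
  string_1 = c("Jack Daniel", "Jac", "JackDan", "steve", "Mark", "Dukes",
               "Allan", "Duke", "Puma Nike", "Puma", "Nike", "Addidas")
)
df_2 <- data.frame(
  ID = c(100, 100, 185, 103, 103, 104, 104, 104),
  string_2 = c("Jack Daniel", "Mark", "Order", "Steve", "Mark 2", "Nike",
               "Addidas", "Reebok")
)

# Rank every same-ID candidate by 'jw' distance and keep the top 3 per string_2
matches <- merge(df_1, df_2, by = "ID") %>%
  group_by(ID, string_2) %>%
  mutate(rnk = rank(stringdist::stringdist(string_2, string_1, method = "jw"),
                    ties.method = "first")) %>%
  filter(rnk <= 3) %>%
  ungroup() %>%
  pivot_wider(names_from = rnk, names_prefix = "nearest_str_match_",
              values_from = string_1, names_sort = TRUE)

# Join back onto df_2 so IDs with no candidates (here 185) are kept as NA rows
df_out <- left_join(df_2, matches, by = c("ID", "string_2"))
print(df_out)
```

The left join also preserves the row order of df_2, which the grouped pipeline does not.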
Votes: 2
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link: https://stackoverflow.com/questions/70604078
