I want to use R to match two datasets based on an arbitrary address field

Stack Overflow user
Asked 2021-05-05 05:30:30
2 answers, 43 views, 0 followers, 0 votes

The first dataset, df1:

df1 <- structure(list(ID = 1:8, Address = c("Canal and Broadway", "55 water street room number 73", 
"Mulberry street", "Front street and Fulton", "62nd street ", 
"wythe street", "vanderbilt avenue", "South Beach avenue")), class = "data.frame", row.names = c(NA, 
-8L))

The second dataset, df2:

df2 <- structure(list(ID2 = 1:8, Address = c("Canal & Broadway", "Somewhere around 55 water street", 
"Mulberry street", "Front street and close to Fulton", "south beach avenue", 
"along wythe street on the southwest ", "vanderbilt ave", "62nd street"
)), class = "data.frame", row.names = c(NA, -8L))

df1

ID|Address
1   Canal and Broadway
2   55 water street room number 73
3   Mulberry street
4   Front street and Fulton
5   62nd street 
6   wythe street
7   vanderbilt avenue
8   South Beach avenue

df2

ID2|Address
 1  Canal & Broadway
 2  Somewhere around 55 water street
 3  Mulberry street
 4  Front street and close to Fulton
 5  south beach avenue
 6  along wythe street on the southwest 
 7  vanderbilt ave
 8  62nd street

Is there a way to match them and get a result like the one below? Note that the addresses are similar but not identical.

    ID|                    Address| ID2
1   Canal and Broadway               1
2   55 water street room number 73   2
3   Mulberry street                  3
4   Front street and Fulton          4
5   62nd street                      8
6   wythe street                     6
7   vanderbilt avenue                7
8   South Beach avenue               5

2 Answers

Stack Overflow user

Accepted answer

Posted 2021-05-05 11:21:21

As @r2evans suggested, take a look at the fuzzyjoin package. It will not give you the expected output out of the box, but it will get you started.

fuzzyjoin::stringdist_left_join(df1, df2, by = 'Address', max_dist = 5)

#  ID                      Address.x ID2          Address.y
#1  1             Canal and Broadway   1   Canal & Broadway
#2  2 55 water street room number 73  NA               <NA>
#3  3                Mulberry street   3    Mulberry street
#4  4        Front street and Fulton  NA               <NA>
#5  5                   62nd street    8        62nd street
#6  6                   wythe street   8        62nd street
#7  7              vanderbilt avenue   7     vanderbilt ave
#8  8             South Beach avenue   5 south beach avenue

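You may need to experiment with the max_dist argument and apply additional rules to get the final output. In the result above, rows 2 and 4 find no match within a distance of 5, and row 6 ("wythe street") is wrongly paired with "62nd street".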
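As one illustration of what such "additional rules" might look like, here is a minimal base-R sketch. It is not part of the original answer and does not use fuzzyjoin. It swaps in a different technique: addresses are normalized and split into words, and each df1 row is paired with the df2 row that shares the largest fraction of words (the overlap coefficient). The helper names `tokenize_address` and `overlap` and the two normalization rules ("&" to "and", "ave" to "avenue") are assumptions chosen for this sample data.

```r
# Recreate the two data frames from the question.
df1 <- data.frame(
  ID = 1:8,
  Address = c("Canal and Broadway", "55 water street room number 73",
              "Mulberry street", "Front street and Fulton", "62nd street ",
              "wythe street", "vanderbilt avenue", "South Beach avenue")
)
df2 <- data.frame(
  ID2 = 1:8,
  Address = c("Canal & Broadway", "Somewhere around 55 water street",
              "Mulberry street", "Front street and close to Fulton",
              "south beach avenue", "along wythe street on the southwest ",
              "vanderbilt ave", "62nd street")
)

# Hypothetical helper: lower-case, trim, expand two abbreviations,
# then split each address into a vector of words.
tokenize_address <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("&", "and", x, fixed = TRUE)
  x <- gsub("\\bave\\b", "avenue", x, perl = TRUE)
  strsplit(x, "\\s+")
}

# Hypothetical helper: overlap coefficient, i.e. the number of shared
# words divided by the size of the smaller word set.
overlap <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / min(length(a), length(b))
}

tok1 <- tokenize_address(df1$Address)
tok2 <- tokenize_address(df2$Address)

# Score every df1 row against every df2 row (rows = df1, columns = df2).
scores <- outer(seq_along(tok1), seq_along(tok2),
                Vectorize(function(i, j) overlap(tok1[[i]], tok2[[j]])))

# For each df1 row, keep the df2 row with the highest score.
best <- apply(scores, 1, which.max)
result <- data.frame(ID = df1$ID, Address = df1$Address, ID2 = df2$ID2[best])
print(result)
```

On this sample the sketch reproduces the ID-to-ID2 pairing requested in the question. It always returns some match, however weak, so on real data a minimum score (for example, discarding pairs where `max(scores[i, ])` falls below a chosen threshold) would probably be needed, along with a longer list of abbreviation rules.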

Votes: 3

Stack Overflow user

Posted 2021-05-06 01:01:25

We can use method = 'soundex':

library(fuzzyjoin)
stringdist_left_join(df1, df2, by = 'Address', method = 'soundex')
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/67392470
