First dataset, df1:
structure(list(ID = 1:8, Address = c("Canal and Broadway", "55 water street room number 73",
"Mulberry street", "Front street and Fulton", "62nd street ",
"wythe street", "vanderbilt avenue", "South Beach avenue")), class = "data.frame", row.names = c(NA,
-8L))
Second dataset, df2:
structure(list(ID2 = 1:8, Address = c("Canal & Broadway", "Somewhere around 55 water street",
"Mulberry street", "Front street and close to Fulton", "south beach avenue",
"along wythe street on the southwest ", "vanderbilt ave", "62nd street"
)), class = "data.frame", row.names = c(NA, -8L))
df1
ID | Address
1 | Canal and Broadway
2 | 55 water street room number 73
3 | Mulberry street
4 | Front street and Fulton
5 | 62nd street
6 | wythe street
7 | vanderbilt avenue
8 | South Beach avenue
df2
ID2 | Address
1 | Canal & Broadway
2 | Somewhere around 55 water street
3 | Mulberry street
4 | Front street and close to Fulton
5 | south beach avenue
6 | along wythe street on the southwest
7 | vanderbilt ave
8 | 62nd street
Is there a way to match these and get a result like the one below? Note that the addresses are similar but not exactly identical.
ID | Address | ID2
1 | Canal and Broadway | 1
2 | 55 water street room number 73 | 2
3 | Mulberry street | 3
4 | Front street and Fulton | 4
5 | 62nd street | 8
6 | wythe street | 6
7 | vanderbilt avenue | 7
8 | South Beach avenue | 5
Posted on 2021-05-05 11:21:21
As @r2evans suggested, take a look at the fuzzyjoin package. It will not give you the expected output out of the box, but it will get you started.
fuzzyjoin::stringdist_left_join(df1, df2, by = 'Address', max_dist = 5)
# ID Address.x ID2 Address.y
#1 1 Canal and Broadway 1 Canal & Broadway
#2 2 55 water street room number 73 NA <NA>
#3 3 Mulberry street 3 Mulberry street
#4 4 Front street and Fulton NA <NA>
#5 5 62nd street 8 62nd street
#6 6 wythe street 8 62nd street
#7 7 vanderbilt avenue 7 vanderbilt ave
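#8 8 South Beach avenue 5 south beach avenue
In this output, rows 2 and 4 find no match and row 6 is paired with the wrong address. You may need to experiment with the max_dist argument and apply additional rules to get the final output.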
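
One possible shape for such additional rules is sketched below in base R. It is not part of fuzzyjoin: instead of character-level edit distance it compares the words of each address and scores a pair by the overlap coefficient (shared words divided by the size of the smaller word set). The helper names tokenize_address and overlap are hypothetical, and the spelling substitutions ("&" to "and", "ave" to "avenue") are assumptions taken from the sample data, not general rules.

```r
# Sample data from the question.
df1 <- data.frame(
  ID = 1:8,
  Address = c("Canal and Broadway", "55 water street room number 73",
              "Mulberry street", "Front street and Fulton", "62nd street ",
              "wythe street", "vanderbilt avenue", "South Beach avenue")
)
df2 <- data.frame(
  ID2 = 1:8,
  Address = c("Canal & Broadway", "Somewhere around 55 water street",
              "Mulberry street", "Front street and close to Fulton",
              "south beach avenue", "along wythe street on the southwest ",
              "vanderbilt ave", "62nd street")
)

# Hypothetical helper: lower-case, trim, unify a few spellings (assumed
# from the sample data), then split each address into its unique words.
tokenize_address <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("&", "and", x, fixed = TRUE)
  x <- gsub("\\bave\\b", "avenue", x)
  lapply(strsplit(x, "\\s+"), unique)
}

# Hypothetical helper: overlap coefficient of two word sets.
overlap <- function(a, b) length(intersect(a, b)) / min(length(a), length(b))

tok1 <- tokenize_address(df1$Address)
tok2 <- tokenize_address(df2$Address)

# Score matrix: rows follow df1, columns follow df2.
score <- sapply(tok2, function(b) sapply(tok1, function(a) overlap(a, b)))

# For each df1 row, keep the df2 row with the highest score
# (ties go to the first candidate).
best <- apply(score, 1, which.max)

result <- data.frame(
  ID = df1$ID,
  Address = df1$Address,
  ID2 = df2$ID2[best],
  score = score[cbind(seq_along(best), best)]
)
print(result)
```

On the sample data this reproduces the ID2 column requested in the question (1, 2, 3, 4, 8, 6, 7, 5). The weakest match is row 2, with a score of 0.6. Generic words such as "street" alone already produce a score of 0.5 against two-word addresses, so on real data it would be sensible to require a minimum score, or to drop such words before scoring, and to treat anything below the threshold as unmatched.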
Posted on 2021-05-06 01:01:25
We can use method = 'soundex':
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = 'Address', method = 'soundex')
https://stackoverflow.com/questions/67392470