I have the following data frame with columns X and Y:
   X                              Y
1  SAN DIEGO                      FOND DU LAC
2  THE RIO GRANDE                 RIO GRANDE
3  RIO GRANDE                     RIO GRANDE
4  WEST TENNESSEE                 TENNESSEE
5  EP De SAN JOAQUIN              De SAN JOAQUIN
6  SOUTHERN VIRGINIA              VIRGINIA
7  SOUTHERN VIRGINIA              SOUTHWESTERN VIRGINIA
8  EN COLOMBIA                    COLOMBIA
9  THE EP De NORTHERN CALIFORNIA  De NORTHERN CALIFORNIA
10 FLORIDA                        NEW JERSY
I want to get the rows that do not match, 1 and 10. Rows 2-9 match or nearly match, so they are fine. My expected data frame is:
   X          Y
1  SAN DIEGO  FOND DU LAC
10 FLORIDA    NEW JERSY
Posted on 2017-06-04 13:34:18
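The snippets in this answer operate on a data frame named df1. A minimal sketch for rebuilding it follows. The split between X and Y is inferred from the flattened listing in the question, and stringsAsFactors = FALSE is an assumption added here so that strsplit() receives character vectors rather than factors.

```r
# Rebuild the example data; the X/Y split is inferred from the question.
df1 <- data.frame(
  X = c("SAN DIEGO", "THE RIO GRANDE", "RIO GRANDE", "WEST TENNESSEE",
        "EP De SAN JOAQUIN", "SOUTHERN VIRGINIA", "SOUTHERN VIRGINIA",
        "EN COLOMBIA", "THE EP De NORTHERN CALIFORNIA", "FLORIDA"),
  Y = c("FOND DU LAC", "RIO GRANDE", "RIO GRANDE", "TENNESSEE",
        "De SAN JOAQUIN", "VIRGINIA", "SOUTHWESTERN VIRGINIA",
        "COLOMBIA", "De NORTHERN CALIFORNIA", "NEW JERSY"),
  stringsAsFactors = FALSE  # strsplit() raises an error on factor input
)
str(df1)
```

If the columns were read in as factors, converting them with as.character() before splitting would serve the same purpose.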
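In R, we can split the strings in each column on whitespace with strsplit, check whether the words of X and Y have any intersect, take the lengths of the resulting list, and subset the rows of the data frame where that length is 0: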
df1[!lengths(Map(intersect, strsplit(df1$X, "\\s+"), strsplit(df1$Y, "\\s+"))),]
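#    X          Y
#1   SAN DIEGO  FOND DU LAC
#10  FLORIDA    NEW JERSY
Alternatively, instead of calling strsplit on each column separately, we can loop over the columns with lapply and do the split there: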
df1[!lengths(do.call(Map, c(intersect, unname(lapply(df1, strsplit, split="\\s+"))))),]
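#    X          Y
#1   SAN DIEGO  FOND DU LAC
#10  FLORIDA    NEW JERSY
Another option is the stringdist package: compute the q-gram distance between X and Y, then keep the rows whose distance is among the two largest: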
library(stringdist)
i1 <- with(df1, stringdist(X, Y, method = "qgram"))
df1[i1 %in% tail(sort(i1), 2),]
#    X          Y
#1   SAN DIEGO  FOND DU LAC
#10  FLORIDA    NEW JERSY
https://stackoverflow.com/questions/44351043
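To see what the first one-liner (Map with intersect) does, it may help to unpack it step by step. The sketch below uses only rows 1 and 7 of the question's data; the variable names (x_words, shared, n_shared, no_match) are illustrative and not part of the original answer.

```r
# Two-row toy input (rows 1 and 7 of the question's data).
x <- c("SAN DIEGO", "SOUTHERN VIRGINIA")
y <- c("FOND DU LAC", "SOUTHWESTERN VIRGINIA")

# Step 1: split each string into words on runs of whitespace.
x_words <- strsplit(x, "\\s+")
y_words <- strsplit(y, "\\s+")

# Step 2: pairwise intersection of the word sets, one list element per row.
shared <- Map(intersect, x_words, y_words)

# Step 3: count the shared words per row; 0 means no word in common.
n_shared <- lengths(shared)   # 0 1

# Step 4: negating the counts coerces them to logical, so TRUE marks a non-match.
no_match <- !n_shared         # TRUE FALSE

print(n_shared)
print(no_match)
```

The comparison is on whole words and is case-sensitive: row 7 counts as a match only because VIRGINIA appears on both sides, since SOUTHERN and SOUTHWESTERN are different words.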