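I've run into something I've never done before, and I'd appreciate some help. I'm trying to join two datasets (simple enough), but the strings in two of the columns only partially match. I tried using fuzzy_join but couldn't get it to work for me. Below is what I'm trying to do: I'd like to end up with a data frame called df_final. Any ideas?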
library(tibble)  # provides tribble()
df1 <- tribble(
~student_id, ~course, ~grade,
"001", "social studies grade", "A",
"001", "ela grade", "A",
"001", "math grade", "A",
"002", "social studies grade", "B",
"002", "ela grade", "B",
"002", "math grade", "B",
"003", "social studies grade", "C",
"003", "ela grade", "C",
"003", "math grade", "C",
"004", "social studies grade", "C",
"004", "ela grade", "C",
"004", "math grade", "C",
"005", "social studies grade", "C",
"005", "ela grade", "C",
"005", "math grade", "C",
)
df2 <- tribble(
~student_id, ~course,
"001", "5th Social Studies",
"001", "5th ELA",
"001", "5th Mathematics",
"002", "6th Social Studies",
"002", "6th ELA",
"002", "6th Mathematics",
"003", "8th Social Studies",
"003", "8th ELA",
"003", "8th Mathematics",
)
df_final <- tribble(
~student_id, ~course, ~grade,
"001", "5th Social Studies", "A",
"001", "5th ELA", "A",
"001", "5th Mathematics", "A",
"002", "6th Social Studies", "B",
"002", "6th ELA", "B",
"002", "6th Mathematics", "B",
"003", "8th Social Studies", "C",
"003", "8th ELA", "C",
"003", "8th Mathematics", "C"
)

Posted on 2021-06-29 16:14:06
We can use fuzzyjoin. Extract a substring from the "course" column in each dataset (so that the values are easier to match), then perform a regex_left_join:
library(fuzzyjoin)
library(dplyr)
library(stringr)
df2 %>%
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
  regex_left_join(
    df1 %>%
      mutate(grp = toupper(str_remove(course, "\\s+grade$")), course = NULL),
    by = c("student_id", "grp")
  ) %>%
  select(student_id = student_id.x, course, grade)
-output
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C
The OP's expected output is
df_final
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
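9 003 8th Mathematics C

Posted on 2021-06-29 19:13:50
I'd suggest exploring the commonly used approaches that compare string distances, which I think is what actually constitutes fuzzy matching. Regex-based matching is simply deterministic matching (equally, you could transform the columns first and then perform a left join). The solution below uses the comparator package to derive a matching function that returns the Jaro-Winkler string similarity, and uses it to perform the "fuzzy" matching. For your particular example a regex may be more suitable, but in NLP actual "fuzziness" is usually defined in terms of string distance or more sophisticated NLP processing. At some point you may want to account for common misspellings and so on, and packing that into a regular expression would prove cumbersome.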
jw <- comparator::JaroWinkler()
fuzzyjoin::fuzzy_left_join(
  x = df1, y = df2, by = "course",
  # 0.1 is a very permissive cutoff; raise it to require closer matches
  match_fun = function(x, y) { jw(x, y) > 0.1 }
)

Example
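If the word grade were mistyped as grde, your regex would unfortunately fail, but the Jaro-Winkler score would still show that the two strings are very close:
> jw("grade", "grde")
[1] 0.9466667
This is robust when you are dealing with different datasets and cannot rely on columns following a well-defined pattern.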
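
The remark earlier in this answer, that you could transform the columns first and then do an ordinary left join, can be sketched roughly as follows. This is only an illustration under stated assumptions: normalise_course is a hypothetical helper (not part of any package), the "mathematics" to "math" alias is an assumption made so that the two spellings line up exactly, and the data is a small subset of the question's tables.

```r
library(dplyr)
library(stringr)
library(tibble)

# Small subset of the question's data, so the sketch is self-contained.
df1 <- tribble(
  ~student_id, ~course,                ~grade,
  "001",       "social studies grade", "A",
  "001",       "ela grade",            "A",
  "001",       "math grade",           "A",
  "004",       "math grade",           "C"
)
df2 <- tribble(
  ~student_id, ~course,
  "001",       "5th Social Studies",
  "001",       "5th ELA",
  "001",       "5th Mathematics"
)

# Hypothetical helper: reduce a course label to a comparable key.
normalise_course <- function(x) {
  x %>%
    str_to_lower() %>%
    str_remove("^\\d+th\\s+") %>%          # drop a leading level such as "5th "
    str_remove("\\s+grade$") %>%           # drop a trailing " grade"
    str_replace("^mathematics$", "math")   # assumed alias, not from the original
}

# Exact (deterministic) join on student_id plus the normalised key.
df_joined <- df2 %>%
  mutate(key = normalise_course(course)) %>%
  left_join(
    df1 %>%
      mutate(key = normalise_course(course)) %>%
      select(student_id, key, grade),
    by = c("student_id", "key")
  ) %>%
  select(student_id, course, grade)

df_joined
```

The trade-off is the one described above: this stays fully deterministic and easy to audit, but every spelling variant needs its own rule in the helper, which is where a string-distance match starts to pay off.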
https://stackoverflow.com/questions/68182139