
Fuzzy join with partial string matching in R

Stack Overflow user
Asked 2021-06-29 15:59:31
Answers 2 · Views 826 · Followers 0 · Votes 3

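I've run into something I have never done before, and I'm hoping to get some help. I'm trying to join two datasets (simple enough), but two of the columns only partially match as strings. I tried using fuzzy_join, but I couldn't get it to work. Below is what I'm trying to do: I'd like to end up with a data frame called df_final. Any ideas?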

library(tibble)

df1 <- tribble(
  ~student_id, ~course, ~grade, 
  "001", "social studies grade", "A", 
  "001", "ela grade", "A", 
  "001", "math grade", "A", 
  "002", "social studies grade", "B", 
  "002", "ela grade", "B", 
  "002", "math grade", "B", 
  "003", "social studies grade", "C", 
  "003", "ela grade", "C", 
  "003", "math grade", "C", 
  "004", "social studies grade", "C", 
  "004", "ela grade", "C", 
  "004", "math grade", "C", 
  "005", "social studies grade", "C", 
  "005", "ela grade", "C", 
  "005", "math grade", "C", 
)

df2 <- tribble(
  ~student_id, ~course,
  "001", "5th Social Studies",
  "001", "5th ELA",
  "001", "5th Mathematics",
  "002", "6th Social Studies", 
  "002", "6th ELA",
  "002", "6th Mathematics",
  "003", "8th Social Studies",
  "003", "8th ELA",
  "003", "8th Mathematics",
)

df_final <- tribble(
  ~student_id, ~course, ~grade,
  "001", "5th Social Studies", "A",
  "001", "5th ELA", "A",
  "001", "5th Mathematics", "A",
  "002", "6th Social Studies", "B",
  "002", "6th ELA", "B",
  "002", "6th Mathematics", "B",
  "003", "8th Social Studies", "C",
  "003", "8th ELA", "C",
  "003", "8th Mathematics", "C"
)

2 Answers

Stack Overflow user

Accepted answer

Answered 2021-06-29 16:14:06

We can use fuzzyjoin. Extract the relevant substring from the "course" column in both datasets (so the values line up more closely), then do a regex_left_join.

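library(fuzzyjoin)
library(dplyr)
library(stringr)

df2 %>%
  # Drop the leading grade level (e.g. "5th ") and upper-case the rest.
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
  # The grp values from df1 act as regex patterns, so "MATH" matches "MATHEMATICS".
  regex_left_join(
    df1 %>%
      mutate(grp = toupper(str_remove(course, "\\s+grade$")), course = NULL),
    by = c("student_id", "grp")
  ) %>%
  select(student_id = student_id.x, course, grade)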

Output:

# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C    

The OP's expected output is:

 df_final
# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C    
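
The regex join is needed above because the cleaned keys are not identical: MATHEMATICS only contains MATH. If the keys can be made to agree exactly, an ordinary left_join is enough. The following is a minimal sketch on a small subset of the data, assuming a hypothetical helper normalize_course() and an assumed rule that maps "mathematics" to "math"; neither appears in the original answer.

```r
library(dplyr)
library(stringr)
library(tibble)

# Small subset of the question's data, for illustration only.
df1 <- tribble(
  ~student_id, ~course, ~grade,
  "001", "social studies grade", "A",
  "001", "math grade", "A",
  "002", "ela grade", "B",
  "004", "ela grade", "C"
)

df2 <- tribble(
  ~student_id, ~course,
  "001", "5th Social Studies",
  "001", "5th Mathematics",
  "002", "6th ELA"
)

# Hypothetical helper: reduce both naming schemes to one shared key.
normalize_course <- function(x) {
  x %>%
    str_to_lower() %>%
    str_remove("^\\d+th\\s+") %>%            # drop "5th ", "6th ", ...
    str_remove("\\s+grade$") %>%             # drop trailing " grade"
    str_replace("^mathematics$", "math")     # assumed synonym rule
}

result <- df2 %>%
  mutate(key = normalize_course(course)) %>%
  left_join(
    df1 %>% mutate(key = normalize_course(course)) %>% select(-course),
    by = c("student_id", "key")
  ) %>%
  select(student_id, course, grade)

print(result)
```

The trade-off is that every naming difference has to be spelled out in the helper, whereas the regex join tolerates any key that contains the pattern.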
Votes 1

Stack Overflow user

Answered 2021-06-29 19:13:50

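I'd suggest exploring the commonly used approaches that compare string distances, which I think is what actually constitutes fuzzy matching. Regex-based matching is simple deterministic matching (equally, you could transform the columns first and then do a left join). The solution below uses the comparator package to build a matching function that returns the Jaro-Winkler string similarity and performs a "fuzzy" match. For your particular example regex is probably the better fit, but in NLP actual "fuzziness" is usually defined in terms of string distance or more sophisticated NLP processing. At some point you may want to account for common misspellings and the like, and capturing that in a regular expression would prove cumbersome.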

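# JaroWinkler() returns a function that scores similarity between 0 and 1.
jw <- comparator::JaroWinkler()

fuzzyjoin::fuzzy_left_join(
    x = df1, y = df2, by = "course",
    # Rows match when the similarity exceeds the cutoff; 0.1 is very
    # permissive, so tune it for real data.
    match_fun = function(x, y) { jw(x, y) > 0.1 }
)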
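
The call above compares only the course columns, so rows from different students can pair up. One possible refinement, sketched here on a small subset of the data, is to require an exact match on student_id and a fuzzy match on course by passing one match function per column. The 0.6 cutoff and the lower-casing step are assumptions for illustration, not values from the original answer, and would need tuning against real data.

```r
library(fuzzyjoin)
library(tibble)

# Small subset of the question's data, for illustration only.
df1 <- tribble(
  ~student_id, ~course, ~grade,
  "001", "social studies grade", "A",
  "001", "ela grade", "A",
  "002", "social studies grade", "B",
  "002", "ela grade", "B"
)

df2 <- tribble(
  ~student_id, ~course,
  "001", "5th Social Studies",
  "002", "6th ELA"
)

jw <- comparator::JaroWinkler()

res <- fuzzy_left_join(
  x = df2, y = df1,
  by = c("student_id", "course"),
  match_fun = list(
    `==`,                                        # student_id must be identical
    function(x, y) {
      jw(tolower(x), tolower(y)) > 0.6           # assumed cutoff, tune as needed
    }
  )
)

# Columns shared by both inputs come back with .x / .y suffixes.
print(res)
```

Because this is a left join, a df2 row with no sufficiently similar course still appears once, with NA in the columns taken from df1.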

Example

If the word grade were mistyped as grde, your regex would unfortunately fail, but the Jaro-Winkler similarity would still show that the two strings are very close:

> jw("grade", "grde")
[1] 0.9466667

This approach is robust when you are working with varied datasets and cannot rely on columns following a well-defined pattern.
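
To see where a score like 0.9466667 comes from, the standard Jaro-Winkler formula can be written out in base R. This is a from-scratch sketch for two single strings, not the comparator package's implementation; the function name jaro_winkler() and its defaults (prefix weight 0.1, prefix length capped at 4) are assumptions chosen to follow the usual definition.

```r
# Jaro-Winkler similarity for two single strings, base R only.
jaro_winkler <- function(a, b, p = 0.1, max_prefix = 4) {
  s <- strsplit(a, "")[[1]]
  t <- strsplit(b, "")[[1]]
  n <- length(s)
  m <- length(t)
  if (n == 0 && m == 0) return(1)
  if (n == 0 || m == 0) return(0)

  # Characters count as matching only within this distance of each other.
  window <- max(floor(max(n, m) / 2) - 1, 0)
  s_hit <- logical(n)
  t_hit <- logical(m)

  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!t_hit[j] && s[i] == t[j]) {
        s_hit[i] <- TRUE
        t_hit[j] <- TRUE
        break
      }
    }
  }

  matches <- sum(s_hit)
  if (matches == 0) return(0)

  # Transpositions: matched characters that line up in a different order.
  transpositions <- sum(s[s_hit] != t[t_hit]) / 2

  jaro <- (matches / n + matches / m +
             (matches - transpositions) / matches) / 3

  # Winkler boost for a shared prefix of up to max_prefix characters.
  k <- min(n, m, max_prefix)
  same <- s[seq_len(k)] == t[seq_len(k)]
  prefix <- if (all(same)) k else which(!same)[1] - 1

  jaro + prefix * p * (1 - jaro)
}

jaro_winkler("grade", "grde")   # 0.9466667, the value shown above
jaro_winkler("grade", "grade")  # 1
```

For "grade" and "grde", four characters match with no transpositions, giving a Jaro score of (4/5 + 4/4 + 4/4) / 3 = 0.9333; the shared prefix "gr" then adds 2 × 0.1 × (1 − 0.9333) = 0.0133.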

Votes 1
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/68182139
