
How to merge based on a string within a column?

Stack Overflow user
Asked 2022-07-12 16:33:16
Answers 2, Views 66, Followers 0, Votes -1

I want to do an exact join on the year and state columns, but a fuzzy join on the name and versus columns:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")

df1 <- data.frame(year, state, name)

year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("@ george v. SALLY", "@laura v. dAvid")

df2 <- data.frame(year, state, versus)

My desired output is as follows:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("@ george v. SALLY", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)

df3 <- data.frame(year, state, name, versus)

I have tried variations of the following:

library(fuzzyjoin)

stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")

stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")

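None of them seem to get close to what I want.

I am wondering whether I need to split the versus column (removing all special characters and separating the names), or whether there is a way to do this within fuzzyjoin. Any guidance would be appreciated.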
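
For reference, the splitting idea could look roughly like the sketch below. It assumes every versus value has the form "<name> v. <name>" and that names contain only letters; df2_long and key are illustrative names, not part of the original data.

```r
library(dplyr)

df1 <- data.frame(
  year  = c("2002", "2002", "1999", "1999", "1997", "2002"),
  state = c("TN", "TN", "AL", "AL", "CA", "TN"),
  name  = c("George", "Sally", "David", "Laura", "John", "Kate")
)
df2 <- data.frame(
  year   = c("2002", "1999"),
  state  = c("TN", "AL"),
  versus = c("@ george v. SALLY", "@laura v. dAvid")
)

# Split each versus value on " v. " (assumed separator)
parts <- strsplit(df2$versus, " v. ", fixed = TRUE)

# One row per name: strip non-letters and lower-case to build a join key
df2_long <- data.frame(
  year   = rep(df2$year, lengths(parts)),
  state  = rep(df2$state, lengths(parts)),
  versus = rep(df2$versus, lengths(parts)),
  key    = tolower(gsub("[^A-Za-z]", "", unlist(parts)))
)

# Exact join on year, state, and the normalised name
result <- df1 |>
  mutate(key = tolower(name)) |>
  left_join(df2_long, by = c("year", "state", "key")) |>
  select(-key)
```

With this approach the join itself stays exact; all of the fuzziness is handled by normalising the names beforehand.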


2 Answers

Stack Overflow user

Accepted answer

Posted 2022-07-12 16:51:07

One simple approach, which depends somewhat on the structure of df2$versus, is this:

library(dplyr)

left_join(df1, df2, by = c("year", "state")) %>%
  rowwise() %>%
  # Keep versus only when name occurs in it, ignoring case
  mutate(versus = if_else(grepl(name, versus, ignore.case = TRUE), versus, NA_character_))

Output:

  year  state name   versus           
  <chr> <chr> <chr>  <chr>            
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid  
4 1999  AL    Laura  @laura v. dAvid  
5 1997  CA    John   NA               
6 2002  TN    Kate   NA             

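Update, July 14, 2022:

If name has a more complex pattern than a single word (e.g. Molly Homes, Jane Doe), we need a way to retrieve the whole series of words and check whether any of them appears (case-insensitively) in the versus column. Here is one simple approach: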

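  1. Create a function f(n, v) that takes the strings n and v, extracts the whole words (wrds) from n, and counts how many of those words are found in v. It returns TRUE if this count exceeds 0.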

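f <- function(n, v) {
  # Extract the whole words from n (this pattern also yields empty strings)
  wrds <- stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  # Keep words longer than one character and test each against v, ignoring case
  sum(sapply(wrds[which(nchar(wrds) > 1)], grepl, x = v, ignore.case = TRUE)) > 0
}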

  2. Left join the original frames, and apply f() by row

left_join(df1, df2, by = c("year", "state")) %>%
  rowwise() %>%
  mutate(versus = if_else(f(name, versus), versus, NA_character_))

Output:

1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA              
3 1999  AL    David                 @laura v. dAvid 
4 1999  AL    Laura                 @laura v. dAvid 
5 1997  CA    John                  NA              
6 2002  TN    Kate                  NA              

Input:

df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997", 
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe", 
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA, 
-6L))

df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA, 
-2L))
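
Note that grepl() matches substrings, so with f() a short name such as "Al" would also match "Sally". If that matters for your data, a whole-word variant could look like the sketch below. f_whole is a hypothetical helper, not part of the approach above, and it assumes the words in n contain only word characters.

```r
# Hypothetical variant of f(): TRUE when any word of n occurs as a whole word in v
f_whole <- function(n, v) {
  # Words of n with more than one character, as in f()
  wrds <- regmatches(n, gregexpr("\\w+", n))[[1]]
  wrds <- wrds[nchar(wrds) > 1]
  if (length(wrds) == 0 || is.na(v)) return(FALSE)
  # Wrap the alternation in word boundaries so "Al" does not match "Sally"
  pattern <- paste0("\\b(", paste(wrds, collapse = "|"), ")\\b")
  grepl(pattern, v, ignore.case = TRUE, perl = TRUE)
}

f_whole("Molly Homes, Jane Doe", "Homes (v. Vista)")  # TRUE
f_whole("Al", "Sally v. Kate")                        # FALSE
```

It can be dropped into the same rowwise() pipeline in place of f().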
Votes 1

Stack Overflow user

Posted 2022-07-12 16:51:23

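Update 15/07:

See the comments. In that case, each name in name needs to be checked for a match in versus. This could be done as follows (using the "new" data from @langtang):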

library(dplyr)
library(stringr)
df1 |>
    left_join(df2, by = c("year", "state")) |>
    rowwise() |>
    # Build a pattern such as "molly|homes|jane|doe" from the words in name
    mutate(versus = if_else(
        str_detect(tolower(versus), paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")),
        versus, NA_character_)) |>
    ungroup()

Output:

# A tibble: 6 × 4 
  year  state name                  versus          
  <chr> <chr> <chr>                 <chr>           
1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA              
3 1999  AL    David                 @laura v. dAvid 
4 1999  AL    Laura                 @laura v. dAvid 
5 1997  CA    John                  NA              
6 2002  TN    Kate                  NA              
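
To see what the pipeline does for a single row, here is a small sketch of the pattern-building step on its own. The standalone name, versus, pattern, and matched variables are for illustration only.

```r
library(stringr)

name   <- "Molly Homes, Jane Doe"
versus <- "Homes (v. Vista)"

# Lower-cased words of name, joined into one alternation pattern
pattern <- paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")
pattern
# [1] "molly|homes|jane|doe"

# TRUE, because "homes" occurs in the lower-cased versus
matched <- str_detect(tolower(versus), pattern)
```

Because the words are joined with "|", a single match on any word is enough to keep versus for that row.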

Old answer:

One approach could be:

library(tidyverse)

df1 |>
  left_join(df2, by = c("year", "state")) |>
  group_by(state) |>
  mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
  ungroup()

Output:

# A tibble: 6 × 4
  year  state name   versus           
  <chr> <chr> <chr>  <chr>            
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid  
4 1999  AL    Laura  @laura v. dAvid  
5 1997  CA    John   NA               
6 2002  TN    Kate   NA               
Votes 1

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/72955598
