I want to do an exact join on the year and state columns, but a fuzzy join between the "name" and "versus" columns:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("@ george v. SALLY", "@laura v. dAvid")
df2 <- data.frame(year, state, versus)
My desired output is as follows:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("@ george v. SALLY", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)
df3 <- data.frame(year, state, name, versus)
I have tried several variations of the following:
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")
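stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")
Neither seems to get close to what I want.
I'm wondering whether I need to split out the "versus" column (removing all special characters and separating the names), or whether there is a way to do this within fuzzyjoin. Any guidance would be greatly appreciated.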
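The splitting idea raised above can be sketched in base R. This is only a sketch under two assumptions that hold for the sample data but may not hold in general: the two parties in versus are always separated by the literal " v. ", and each name is a single word. The intermediate frame df2_long and the helper columns key and row are hypothetical names introduced for illustration.

```r
df1 <- data.frame(
  year  = c("2002", "2002", "1999", "1999", "1997", "2002"),
  state = c("TN", "TN", "AL", "AL", "CA", "TN"),
  name  = c("George", "Sally", "David", "Laura", "John", "Kate")
)
df2 <- data.frame(
  year   = c("2002", "1999"),
  state  = c("TN", "AL"),
  versus = c("@ george v. SALLY", "@laura v. dAvid")
)

# Assumption: the parties are always separated by the literal " v. "
parts <- strsplit(df2$versus, " v. ", fixed = TRUE)

# One row per party; key is the party name with non-letters removed, lower-cased
df2_long <- data.frame(
  year   = rep(df2$year, lengths(parts)),
  state  = rep(df2$state, lengths(parts)),
  versus = rep(df2$versus, lengths(parts)),
  key    = tolower(gsub("[^[:alpha:]]", "", unlist(parts)))
)

# Normalise the name the same way and remember the original row order,
# because merge() sorts its result by the join columns
df1$key <- tolower(df1$name)
df1$row <- seq_len(nrow(df1))

# Exact join on year, state and the normalised name
res <- merge(df1, df2_long, by = c("year", "state", "key"), all.x = TRUE)
res <- res[order(res$row), c("year", "state", "name", "versus")]
rownames(res) <- NULL
res
```

Once the names are normalised this way the join is exact, so no fuzzy matching package is needed, but the approach is only as reliable as the assumed structure of versus.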
Answered 2022-07-12 16:51:07
A simple approach, which depends somewhat on the structure of df2$versus, is this:
library(dplyr)
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus = if_else(grepl(name, versus, ignore.case = TRUE), versus, NA_character_))
Output:
  year  state name   versus
  <chr> <chr> <chr>  <chr>
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid
4 1999  AL    Laura  @laura v. dAvid
5 1997  CA    John   NA
6 2002  TN    Kate   NA
Update (July 14, 2022):
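If name has a more complex pattern than a single word (for example, "Molly Homes, Jane Doe"), we need a way to retrieve the full set of words and check whether any of them appears (case-insensitively) in the versus column. Here is a simple approach:
f(n, v) takes strings n and v, extracts the whole words from n (wrds), and counts how many of those words are found in v. Only words longer than one character are considered. If the count exceeds 0, it returns TRUE.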
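f <- function(n, v) {
  # extract whole words from n (the pattern also yields empty strings)
  wrds <- stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  # keep words longer than one character and test each against v, ignoring case
  sum(sapply(wrds[nchar(wrds) > 1], grepl, x = v, ignore.case = TRUE)) > 0
}
Left join the original frames, and apply f() row by row: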
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus = if_else(f(name, versus), versus, NA_character_))
Output:
1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA
3 1999  AL    David                 @laura v. dAvid
4 1999  AL    Laura                 @laura v. dAvid
5 1997  CA    John                  NA
6 2002  TN    Kate                  NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA,
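-2L))
Answered 2022-07-12 16:51:23
Update 15/07:
See the comments. In this case, each of the names within name needs to be checked for a match in versus. This can be done as follows (using @langtang's "new" data):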
library(tidyverse)
df1 |>
left_join(df2, by = c("year", "state")) |>
rowwise() |>
mutate(versus = if_else(str_detect(tolower(versus), paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
  year  state name                  versus
  <chr> <chr> <chr>                 <chr>
1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA
3 1999  AL    David                 @laura v. dAvid
4 1999  AL    Laura                 @laura v. dAvid
5 1997  CA    John                  NA
6 2002  TN    Kate                  NA
Old answer:
One approach could be:
library(tidyverse)
df1 |>
left_join(df2) |>
group_by(state) |>
mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
  year  state name   versus
  <chr> <chr> <chr>  <chr>
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid
4 1999  AL    Laura  @laura v. dAvid
5 1997  CA    John   NA
6 2002  TN    Kate   NA
https://stackoverflow.com/questions/72955598
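On the question of whether this can be done inside fuzzyjoin itself: fuzzy_left_join() accepts a list of match functions, one per column pair in by, so exact and fuzzy conditions can be mixed in a single call. Below is a minimal sketch, assuming the fuzzyjoin and dplyr packages are installed and that a case-insensitive substring test is an acceptable notion of "fuzzy" for this data. name_in_versus is a hypothetical helper written for this example, not part of either package.

```r
library(fuzzyjoin)
library(dplyr)

df1 <- data.frame(
  year  = c("2002", "2002", "1999", "1999", "1997", "2002"),
  state = c("TN", "TN", "AL", "AL", "CA", "TN"),
  name  = c("George", "Sally", "David", "Laura", "John", "Kate")
)
df2 <- data.frame(
  year   = c("2002", "1999"),
  state  = c("TN", "AL"),
  versus = c("@ george v. SALLY", "@laura v. dAvid")
)

# Hypothetical helper: TRUE where the lower-cased name occurs literally in the
# lower-cased versus string; works element-wise on two equal-length vectors
name_in_versus <- function(name, versus) {
  unname(mapply(
    function(n, v) grepl(tolower(n), tolower(v), fixed = TRUE),
    name, versus
  ))
}

# Exact match on year and state, substring match between name and versus
res <- fuzzy_left_join(
  df1, df2,
  by = c("year", "state", "name" = "versus"),
  match_fun = list(`==`, `==`, name_in_versus)
) %>%
  # columns present in both inputs come back with .x / .y suffixes
  select(year = year.x, state = state.x, name, versus)
res
```

Like the grepl() approach in the answers, a substring test also matches partial names (a name "Sal" would match "SALLY"), so a stricter match function may be needed for real data.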