我在R中有两个要连接在一起的数据,但是其中一个列的值是一个或两个值(特别是每个列中的yardline_100列)。下面是我用来加入这两者的代码:
fin_df <- df1 %>%
left_join(df2,
by = c("posteam" = "posteam",
"qtr" = "qtr",
"down" = "down",
"yardline_100" = "yardline_100"))有什么办法可以使他们加入,即使这一列是一个或两个?您会注意到,最后两个值行在该列中有不同的数字。以下是数据格式的示例:
df1 <- structure(list(play_id = c(4596, 4629, 4658, 4682, 4723, 4766,
4790, 4828, 4849, 4878, 4899, 4938), posteam = c("MIN", "MIN",
"MIN", "MIN", "MIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN",
"CIN"), qtr = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), yardline_100 = c(63,
58, 55, 50, 38, 61, 55, 52, 52, 20, 15, 15), down = c(2, 1, 2,
3, 1, 1, 2, 3, 4, 1, 2, 3)), row.names = c(NA, -12L), class = c("nflverse_data",
"tbl_df", "tbl", "data.table", "data.frame"), nflverse_timestamp = structure(1659046255.35538, class = c("POSIXct",
"POSIXt")), nflverse_type = "play by play", nflfastR_version = structure(list(
c(4L, 3L, 0L, 9020L)), class = c("package_version", "numeric_version"
)), .internal.selfref = <pointer: 0x0000021967f81ef0>)
df2 <- structure(list(posteam = c("MIN", "MIN", "MIN", "MIN", "MIN",
"CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN"), qtr = c(5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), yardline_100 = c(63, 58, 55,
50, 38, 61, 55, 53, 52, 20, 16, 16), down = c(2, 1, 2, 3, 1,
1, 2, 3, 4, 1, 2, 3), play_id_SR = c("a9f97fb0-1407-11ec-ae9a-d77d9ecb2022",
"d49d54d0-1407-11ec-ae9a-d77d9ecb2022", "e8f74ad0-1407-11ec-ae9a-d77d9ecb2022",
"0208ae60-1408-11ec-ae9a-d77d9ecb2022", "257fd030-1408-11ec-ae9a-d77d9ecb2022",
"fe058030-1408-11ec-ae9a-d77d9ecb2022", "0da68200-1409-11ec-ae9a-d77d9ecb2022",
"26a5bd20-1409-11ec-ae9a-d77d9ecb2022", "70eacce0-1409-11ec-ae9a-d77d9ecb2022",
"99e5fb10-1409-11ec-ae9a-d77d9ecb2022", "a7646b00-1409-11ec-ae9a-d77d9ecb2022",
"de2683d0-1409-11ec-ae9a-d77d9ecb2022")), row.names = c(NA, -12L
), class = c("tbl_df", "tbl", "data.frame"))发布于 2022-08-24 00:10:46
一个选项是使用fuzzyjoin。
library(fuzzyjoin)
df1 %>%
fuzzy_left_join(
df2,
by = c("posteam", "qtr", "down", "yardline_100"),
match_fun = list(`==`, `==`, `==`, function(x, y) abs(x - y) <= 2)) %>%
select(-matches("(posteam|qtr|down).y")) %>%
rename_with(~str_remove(.x, "(?<=(posteam|qtr|down)).x"))
## A tibble: 12 x 7
# play_id posteam qtr yardline_100.x down yardline_100.y play_id_SR
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 4596 MIN 5 63 2 63 a9f97fb0-1407-11ec-ae9a-d77d9ecb2022
# 2 4629 MIN 5 58 1 58 d49d54d0-1407-11ec-ae9a-d77d9ecb2022
# 3 4658 MIN 5 55 2 55 e8f74ad0-1407-11ec-ae9a-d77d9ecb2022
# 4 4682 MIN 5 50 3 50 0208ae60-1408-11ec-ae9a-d77d9ecb2022
# 5 4723 MIN 5 38 1 38 257fd030-1408-11ec-ae9a-d77d9ecb2022
# 6 4766 CIN 5 61 1 61 fe058030-1408-11ec-ae9a-d77d9ecb2022
# 7 4790 CIN 5 55 2 55 0da68200-1409-11ec-ae9a-d77d9ecb2022
# 8 4828 CIN 5 52 3 53 26a5bd20-1409-11ec-ae9a-d77d9ecb2022
# 9 4849 CIN 5 52 4 52 70eacce0-1409-11ec-ae9a-d77d9ecb2022
#10 4878 CIN 5 20 1 20 99e5fb10-1409-11ec-ae9a-d77d9ecb2022
#11 4899 CIN 5 15 2 16 a7646b00-1409-11ec-ae9a-d77d9ecb2022
#12 4938 CIN 5 15 3 16 de2683d0-1409-11ec-ae9a-d77d9ecb2022请注意列function(x, y) abs(x - y) <= 2的匹配函数"yardline_100"。
最后两行(select(...)和rename_with(...))是删除重复列所必需的:fuzzyjoin似乎创建重复列(即".x“和".y"-suffixed)列,甚至在精确匹配时也是如此;最后两个命令删除这些重复的精确匹配列。
https://stackoverflow.com/questions/73465795
复制相似问题