I have a data frame with several IDs and a variable number of rows per ID, like this:
ID Date_start Date
1 2016-11-02 2020-2-22
1 2016-11-02 2015-1-18
2 2019-12-22 2017-3-2
2 2019-12-22 2019-2-9
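2 2019-12-22 2017-12-1
For each ID, I want to keep only one row: the one whose Date is closest to Date_start. Then, for every row where the absolute difference between Date and Date_start is less than 100 days, I want to set the values to NA.
Is there a simple way to do this?
Many thanks in advance.
Posted on 2020-03-16 19:16:27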
One way with dplyr is to group by ID and, within each group, keep the row where the absolute difference between Date_start and Date is smallest.
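Why the absolute value matters can be seen on ID 1 alone. The sketch below is base R only and uses the two rows for ID 1 from the question; the variable names (date_start, dates, signed) are made up for illustration and are not part of the answer's code.

```r
# The two rows for ID 1 from the question (hypothetical variable names).
date_start <- as.Date("2016-11-02")
dates <- as.Date(c("2020-02-22", "2015-01-18"))

# Signed differences in days: one date lies after Date_start, one before.
signed <- as.numeric(date_start - dates)

# Without abs(), which.min() picks the most negative difference,
# i.e. the date furthest after Date_start.
which.min(signed)

# With abs(), it picks the date closest to Date_start in either direction.
which.min(abs(signed))
dates[which.min(abs(signed))]
```

The dplyr code that follows applies the same which.min(abs(...)) idea once per ID.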
library(dplyr)
df %>%
  mutate_at(-1, lubridate::ymd) %>%             # parse every column except ID as a date
  group_by(ID) %>%
  slice(which.min(abs(Date_start - Date)))      # keep the row closest to Date_start
#      ID Date_start Date
#   <int> <date>     <date>
# 1     1 2016-11-02 2015-01-18
# 2     2 2019-12-22 2019-02-09
If you also want to set the difference to NA when it is below 100 days, you can do this:
df %>%
  mutate_at(-1, lubridate::ymd) %>%
  group_by(ID) %>%
  mutate(diff = as.numeric(abs(Date_start - Date))) %>%   # absolute gap in days
  slice(which.min(diff)) %>%                              # closest row per ID
  mutate(diff = replace(diff, diff < 100, NA))            # blank out gaps under 100 days
Data:
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L), Date_start = structure(c(1L,
1L, 2L, 2L, 2L), .Label = c("2016-11-02", "2019-12-22"), class = "factor"),
Date = structure(c(5L, 1L, 3L, 4L, 2L), .Label = c("2015-1-18",
"2017-12-1", "2017-3-2", "2019-2-9", "2020-2-22"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))
Posted on 2020-03-16 19:19:02
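With dplyr, we can get the result below. I added a new ID to show the case where all values are set to NA. Following Ronak's better idea of using replace instead of ifelse, the column classes are preserved.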
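The point about classes can be checked in isolation. This is a minimal base R sketch, separate from the answer's own code; the names d, too_close, via_ifelse and via_replace are hypothetical and only serve the illustration.

```r
# A small Date vector and a condition marking the second element.
d <- as.Date(c("2016-11-02", "2019-12-22"))
too_close <- c(FALSE, TRUE)

# ifelse() builds its result from the logical test, so the Date class is
# lost and the values come back as plain numbers (days since 1970-01-01).
via_ifelse <- ifelse(too_close, NA, d)
class(via_ifelse)   # "numeric"

# replace() assigns into a copy of d, so the Date class is kept.
via_replace <- replace(d, too_close, NA)
class(via_replace)  # "Date"
```

That is why the dplyr code below can blank out Date columns with replace and still print them as dates.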
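library(dplyr)
df %>%
  # units = "days" pins the unit; the default "auto" can choose another one
  mutate(Date_diff = abs(difftime(Date, Date_start, units = "days"))) %>%
  group_by(ID) %>%
  filter(Date_diff == min(Date_diff)) %>%         # keep the closest row(s) per ID
  mutate_all(~replace(., Date_diff < 100, NA))    # blank out rows closer than 100 days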
#> # A tibble: 3 x 4
#> # Groups:   ID [3]
#>      ID Date_start Date       Date_diff
#>   <dbl> <date>     <date>     <drtn>
#> 1     1 2016-11-02 2015-01-18 654 days
#> 2     2 2019-12-22 2019-02-09 316 days
#> 3     3 NA         NA         NA days
Data
df <- structure(list(
  ID = c(1, 1, 2, 2, 2, 3),
  Date_start = structure(c(17107, 17107, 18252, 18252, 18252, 18250), class = "Date"),
  Date = structure(c(18314, 16453, 17227, 17936, 17501, 18202), class = "Date")),
  class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
Posted on 2020-03-16 19:31:54
Base R solution:
# Convert the factor columns to dates:
cleaned_df <- within(df, {
  Date_start <- as.Date(as.character(Date_start), "%Y-%m-%d")
  Date <- as.Date(as.character(Date), "%Y-%m-%d")
})
# For each ID, keep the Date closest to Date_start (absolute difference):
data.frame(
  do.call("rbind", lapply(split(cleaned_df, cleaned_df$ID), function(x) {
    data.frame(ID = unique(x$ID),
               Date = x$Date[which.min(abs(x$Date_start - x$Date))])
  })),
  row.names = NULL
)
Tidyverse solution:
library(tidyverse)
df %>%
  # funs() is deprecated; a lambda does the same conversion
  mutate_if(str_detect(tolower(names(.)), "date"), ~ as.Date(.x, "%Y-%m-%d")) %>%
  group_by(ID) %>%
  summarise(Date = Date[which.min(abs(Date - Date_start))]) %>%   # closest date per ID
  ungroup()
Data, thanks to @Ronak Shah:
df <-
structure(
list(
ID = c(1L, 1L, 2L, 2L, 2L),
Date_start = structure(
c(1L,
1L, 2L, 2L, 2L),
.Label = c("2016-11-02", "2019-12-22"),
class = "factor"
),
Date = structure(
c(5L, 1L, 3L, 4L, 2L),
.Label = c("2015-1-18",
"2017-12-1", "2017-3-2", "2019-2-9", "2020-2-22"),
class = "factor"
)
),
class = "data.frame",
row.names = c(NA,-5L)
)
https://stackoverflow.com/questions/60704882