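I have subjects a-d observed over 2-4 years, each reporting one value per year. I want to extract each subject's first and last value, ignoring NAs. How can I create the new variables first_value and last_value? The example below includes the desired result: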

df <- data.frame(subject = c("a","b","c","d"),
                 year1 = c(1, 2, NA, NA),
                 year2 = c(3, 4, NA, 5),
                 year3 = c(6, 7, 8, NA),
                 year4 = c(9, 10, NA, 11),
                 first_value = c(1, 2, 8, 5),
                 last_value = c(9, 10, 8, 11))

What would the solution be if the variables year1-year4 were categorical?

Answered 2016-11-10 12:42:35

Using the data.table package:

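library(data.table)
# .SDcols limits .SD to the year columns, so any other columns in df are ignored
setDT(df)[, `:=` (first_value = na.omit(unlist(.SD))[1],
                  last_value = tail(na.omit(unlist(.SD)), 1)),
          by = subject, .SDcols = paste0("year", 1:4)][]

which gives:
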
   subject year1 year2 year3 year4 first_value last_value
1:       a     1     3     6     9           1          9
2:       b     2     4     7    10           2         10
3:       c    NA    NA     8    NA           8          8
4:       d    NA     5    NA    11           5         11

As suggested by @alexis_laz, you can also use max.col to get the required values:

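# Assumes df is a plain data.frame (not one converted by setDT) with the year columns in positions 2:5.
# max.col() on the logical matrix returns the column position of the first/last non-NA value in each row.
f <- max.col(!is.na(df[c("year1", "year2", "year3", "year4")]), 'first')
l <- max.col(!is.na(df[c("year1", "year2", "year3", "year4")]), 'last')
df$first_value <- sapply(seq_along(f), function(i) df[,-1][i,f[i]])
df$last_value <- sapply(seq_along(l), function(i) df[,-1][i,l[i]])

This gives the same result. As @alexis_laz suggested in the comments, this can be improved further to:
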
m <- as.matrix(df[c("year1", "year2", "year3", "year4")])
f <- max.col(!is.na(m), 'first')
l <- max.col(!is.na(m), 'last')
# Index m with a two-column (row, column) matrix to pick one value per row
df$first_value <- m[cbind(seq_len(nrow(df)), f)]
df$last_value <- m[cbind(seq_len(nrow(df)), l)]

And with the dplyr and tidyr packages:

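library(dplyr)
library(tidyr)
df %>%
  gather(year, val, 2:5) %>%            # reshape columns 2:5 (year1..year4) to long format
  filter(!is.na(val)) %>%               # drop missing values
  group_by(subject) %>%
  summarise(first_value = first(val),
            last_value = last(val)) %>%
  left_join(df, ., by = 'subject')      # attach the summaries to the original wide data

An alternative that drops filter and instead uses na.omit(val) (or val[!is.na(val)]) inside summarise:
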
df %>%
  gather(year, val, 2:5) %>%
  group_by(subject) %>%
  summarise(first_value = first(na.omit(val)),
            last_value = last(na.omit(val))) %>%
  left_join(df, ., by = 'subject')

Answered 2016-11-10 12:21:35

Using data.frame and gather

# Packages used
library(tidyr)
library(dplyr)
subject <- c("a","b","c","d")
year1 <- c(1, 2, NA, NA)
year2 <- c(3, 4, NA, 5)
year3 <- c(6, 7, 8, NA)
year4 <- c(9, 10, NA, 11)
dt <- data.frame(subject, year1, year2, year3, year4)

gather() collapses multiple columns into a single column:

dt <- dt %>% gather(year, value, year1:year4)

summarise() computes summary statistics on the selected variables:

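# min()/max() pick the smallest/largest value per subject; they coincide with the
# first/last value here only because each subject's values increase over the years
dt %>% group_by(subject) %>%
  summarise(first_value = min(value, na.rm=TRUE),
            last_value = max(value, na.rm=TRUE))

Output:
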
# A tibble: 4 × 3
  subject first_value last_value
   <fctr>       <dbl>      <dbl>
1       a           1          9
2       b           2         10
3       c           8          8
4       d           5         11

https://stackoverflow.com/questions/40527298
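
The min()/max() summary in the answer above returns the smallest and largest value per subject, which is not the same thing as the first and last value unless the values happen to increase over time. A minimal base R sketch of the difference, using a hypothetical data frame df2 (not from the original post) in which subject "a" decreases from year to year:

```r
# Hypothetical data: same layout as the question, but subject "a" decreases over time
df2 <- data.frame(subject = c("a", "b"),
                  year1 = c(9, NA),
                  year2 = c(6, 4),
                  year3 = c(3, 7),
                  year4 = c(NA, 10))

m <- as.matrix(df2[paste0("year", 1:4)])

# Position-based: first and last non-NA entry of each row
first_value <- apply(m, 1, function(x) x[!is.na(x)][1])
last_value  <- apply(m, 1, function(x) tail(x[!is.na(x)], 1))

# Magnitude-based: smallest and largest entry of each row
min_value <- apply(m, 1, min, na.rm = TRUE)
max_value <- apply(m, 1, max, na.rm = TRUE)

data.frame(subject = df2$subject, first_value, last_value, min_value, max_value)
```

For subject "a" the two pairs come out swapped, so min()/max() is only a safe shortcut when the values are known to be non-decreasing within each subject.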
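
On the question's follow-up, assuming it refers to categorical (character or factor) year columns: the max.col approach only looks at which cells are non-NA, so it should carry over unchanged. The sketch below uses a hypothetical data frame df_chr with made-up labels ("low", "mid", "high"); with factor columns, as.matrix() would convert the values to character first.

```r
# Hypothetical categorical version of the question's data
df_chr <- data.frame(subject = c("a", "b", "c", "d"),
                     year1 = c("low", "mid", NA, NA),
                     year2 = c("mid", "mid", NA, "high"),
                     year3 = c("high", "low", "mid", NA),
                     year4 = c("high", "high", NA, "low"),
                     stringsAsFactors = FALSE)

# Character matrix holding only the year columns
m <- as.matrix(df_chr[paste0("year", 1:4)])

# Column position of the first / last non-NA cell in each row
f <- max.col(!is.na(m), ties.method = "first")
l <- max.col(!is.na(m), ties.method = "last")

# Pick one value per row via (row, column) matrix indexing
df_chr$first_value <- m[cbind(seq_len(nrow(m)), f)]
df_chr$last_value  <- m[cbind(seq_len(nrow(m)), l)]

df_chr
```

The results are character columns; converting them back to factors, if needed, is left to the caller.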