I have a dataset that looks like this:
Study_ID Stage
1 100 Early Stage
2 100 Stable
3 200 Stable
4 300 Early Stage
5 400 Early Stage
6 400 Stable
7 500 Early Stage
8 500 Stable
9 600 Stable
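10 700 Early Stage
I want to remove any duplicated Study_ID, but keep the entry where the patient is "Stable". In other words, for every duplicated Study_ID, I want to drop the row where the patient is "Early Stage".
My desired output should look like this: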
Study_ID Stage
1 100 Stable
2 200 Stable
3 300 Early Stage
4 400 Stable
5 500 Stable
6 600 Stable
7 700 Early Stage
How can I do this?
Reproducible data:
data <- data.frame(Study_ID = c("100", "100", "200", "300", "400", "400", "500", "500", "600", "700"),
                   Stage = c("Early Stage", "Stable", "Stable", "Early Stage", "Early Stage", "Stable", "Early Stage", "Stable", "Stable", "Early Stage"))
Answered 2022-06-30 14:36:42
library(dplyr)
data %>%
group_by(Study_ID) %>%
filter(!(n() > 1 & Stage != "Stable"))
#> # A tibble: 7 × 2
#> # Groups: Study_ID [7]
#> Study_ID Stage
#> <chr> <chr>
#> 1 100 Stable
#> 2 200 Stable
#> 3 300 Early Stage
#> 4 400 Stable
#> 5 500 Stable
#> 6 600 Stable
#> 7 700 Early Stage
Edit 1
To make sure you do not end up with duplicated rows (as pointed out by @jay.sf), you can do the following (a bit messy):
library(dplyr)
dat %>%
group_by(Study_ID) %>%
filter(!(n() > 1 & Stage != "Stable")) %>%
summarise(Stage = first(Stage))
#> # A tibble: 7 × 2
#> Study_ID Stage
#> <int> <chr>
#> 1 100 Stable
#> 2 200 Stable
#> 3 300 Early Stage
#> 4 400 Stable
#> 5 500 Stable
#> 6 600 Stable
#> 7 700 Early Stage
Answered 2022-06-30 14:38:51
Using by. I added a case with two "Stable" rows to the data as a possible special case.
by(dat, dat$Study_ID, \(x) {
if (any(grepl('Stable', x$Stage))) {
unique(x[x$Stage == 'Stable', ])
} else {
unique(x)
}
}) |> do.call(what=rbind)
# Study_ID Stage
# 100 100 Stable
# 200 200 Stable
# 300 300 Early Stage
# 400 400 Stable
# 500 500 Stable
# 600 600 Stable
# 700 700 Early Stage
Or convert Stage with as.factor and use ave to keep, per Study_ID, the first (!duplicated) row at the maximum level.
transform(dat, x=as.numeric(as.factor(Stage))) |>
subset(as.logical(ave(x, Study_ID, FUN=\(x) x == max(x) & !duplicated(x))) , -x)
# Study_ID Stage
# 2 100 Stable
# 3 200 Stable
# 4 300 Early Stage
# 6 400 Stable
# 8 500 Stable
# 9 600 Stable
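# 11 700 Early Stage
Note that this works because "Early Stage" sorts before "Stable" alphabetically; otherwise use factor and define the order in the levels= argument.
Data: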
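The point about level order can be made explicit. The following is a minimal sketch, not part of the original answer, that assumes the same dat as in this answer and a hypothetical helper vector stage_levels listing the stages from lowest to highest priority. With explicit levels, the result no longer depends on how the labels happen to sort alphabetically.

```r
# Same data as in this answer, rebuilt here so the snippet is self-contained.
dat <- data.frame(
  Study_ID = c(100L, 100L, 200L, 300L, 400L, 400L, 500L, 500L, 600L, 600L, 700L),
  Stage = c("Early Stage", "Stable", "Stable", "Early Stage", "Early Stage",
            "Stable", "Early Stage", "Stable", "Stable", "Stable", "Early Stage")
)

# Hypothetical helper (an assumption for illustration): priority order,
# lowest first, so the last level is the one that wins within a Study_ID.
stage_levels <- c("Early Stage", "Stable")

# x is the numeric rank of each Stage; within each Study_ID keep the first
# row that has the maximum rank, then drop the helper column again.
res <- transform(dat, x = as.numeric(factor(Stage, levels = stage_levels))) |>
  subset(as.logical(ave(x, Study_ID, FUN = \(x) x == max(x) & !duplicated(x))), -x)

print(res)
```

If the labels were different, only stage_levels would need to change; the rest of the pipeline stays the same.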
dat <- structure(list(Study_ID = c(100L, 100L, 200L, 300L, 400L, 400L,
500L, 500L, 600L, 600L, 700L), Stage = c("Early Stage", "Stable",
"Stable", "Early Stage", "Early Stage", "Stable", "Early Stage",
"Stable", "Stable", "Stable", "Early Stage")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))
https://stackoverflow.com/questions/72817572