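So, in a dataset I have a column called "Interventions", and each row looks like this:
row1: "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600"
row2: "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marrow"
I want to keep only the intervention types ("Drug", "Biological", "Procedure", and so on) in the column. Even better would be to keep each type only once per row, rather than "Drug" four times as in the first row.
The expected output would look like this:
row1: "Drug"
row2: "Biological, Drug, Procedure"
I have just started using R. I have installed the tidyverse and am getting somewhat used to working with %>%. If anyone could help me with this, thank you very much!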
Answered 2019-10-12 21:13:44
If we only want to extract the prefix part that comes before each ":", a regex lookahead does it:
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df1 %>%
  mutate(Interventions = map_chr(
    str_extract_all(Interventions, "\\w+(?=:)"),
    ~ toString(sort(unique(.x)))
  ))
# Interventions
#1 Drug
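#2 Biological, Drug, Procedure
Or, as another option, split the column into rows on the delimiters with separate_rows, slice out every other row within each group (these are the prefixes), and paste the sorted unique values of "Interventions" back together.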
df1 %>%
  mutate(rn = row_number()) %>%
  separate_rows(Interventions, sep = "[:|]") %>%
  group_by(rn) %>%
  slice(seq(1, n(), by = 2)) %>%
  distinct() %>%
  summarise(Interventions = toString(sort(unique(Interventions)))) %>%
  ungroup() %>%
  select(-rn)
# A tibble: 2 x 1
# Interventions
# <chr>
#1 Drug
#2 Biological, Drug, Procedure
Data
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
Answered 2019-10-13 07:58:30
Not as concise as akrun's answer, and it uses the same logic, but in base R:
# Create df:
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
# Assign a row id vec:
df1$row_num <- 1:nrow(df1)
# Split string on | delim:
split_up <- strsplit(df1$Interventions, split = "[|]")
# Roll down the dataframe - keep uniques:
rolled_out <- unique(data.frame(
  row_num = rep(df1$row_num, sapply(split_up, length)),
  Interventions = gsub("[:].*", "", unlist(split_up))
))
# Stack the dataframe:
df2 <- aggregate(Interventions~row_num, rolled_out, paste0, collapse = ", ")
# Drop id vec:
df2 <- within(df2, rm("row_num"))

https://stackoverflow.com/questions/58358515
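For comparison, the lookahead idea from the first answer can also be written in base R without adding a row id or reshaping. This is only a sketch under stated assumptions: extract_types is a hypothetical helper name chosen for illustration (it does not come from any package), and the data is copied from the answers above.

```r
# Example data copied verbatim from the answers above.
df1 <- structure(list(Interventions = c(
  "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
  "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))

# extract_types() is a hypothetical helper, named here for illustration only.
extract_types <- function(x) {
  # Find every run of word characters that is immediately followed by ":".
  # perl = TRUE is needed because the pattern uses a lookahead.
  matches <- regmatches(x, gregexpr("\\w+(?=:)", x, perl = TRUE))
  # Collapse the sorted unique prefixes of each element into one string.
  vapply(matches, function(m) toString(sort(unique(m))), character(1))
}

df1$Interventions <- extract_types(df1$Interventions)
df1
```

Because the helper works on a plain character vector, it can also be called inside mutate() if you prefer to stay in a dplyr pipeline.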