我有一个药物I (NDC_NBR)及其对应的药物名称(BRAND_NM)的数据框架。
我需要将每个药物的名称压缩/聚合到尽可能少的特异性。
以下是我正在处理的数据和预期结果的示例:

data <- data.frame(NDC_NBR = c("00002773701","00002775201","00002775205","00002822201","00002822259","00002823301","00002823305","00074024302","00074006702","00074433902"),BRAND_NM = c("INSULIN LISPRO","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO PROTAMINE/INSUL","INSULIN LISPRO PROTAMINE/INSUL","HUMIRA","HUMIRA PEDIATRIC CROHNS DISEAS","HUMIRA PEN"), RESULT = c("INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","HUMIRA","HUMIRA","HUMIRA"))我希望将这些产品压缩为最不具体的药物名称,即改变一个新的列,其中包含字符“胰岛素LISPRO”和"HUMIRA“。
“胰岛素LISPRO”对前7行是常见的,而“胰岛素LISPRO KWIKPEN”只对7行中的2行是常见的。类似地,HUMIRA与任何“胰岛素”行都没有相似之处,但对最后3行都是常见的。
我有一个这样的产品的巨大数据框架,这是不可能手动转换每一个。
如果有人能为这样的问题提出解决方案,我将不胜感激。
发布于 2021-10-21 13:18:09
两种解决方案,每种方案对应不同的场景:
以下是针对此场景的两步解决方案:
首先,将短品牌名称定义为交替模式p
p <- paste0(BRAND_NM_short, collapse = "|")然后对gsub使用此模式,以匹配BRAND_NM中您想要保留的部分,并将BRAND_NM替换为匹配的短品牌名称:
library(dplyr)
data %>%
mutate(result = gsub(paste0("(", p, ").*"), "\\1", BRAND_NM))
NDC_NBR BRAND_NM result
1 00002773701 INSULIN LISPRO INSULIN LISPRO
2 00002775201 INSULIN LISPRO JUNIOR KWIKPEN INSULIN LISPRO
3 00002775205 INSULIN LISPRO JUNIOR KWIKPEN INSULIN LISPRO
4 00002822201 INSULIN LISPRO KWIKPEN INSULIN LISPRO
5 00002822259 INSULIN LISPRO KWIKPEN INSULIN LISPRO
6 00002823301 INSULIN LISPRO PROTAMINE/INSUL INSULIN LISPRO
7 00002823305 INSULIN LISPRO PROTAMINE/INSUL INSULIN LISPRO
8 00074024302 HUMIRA HUMIRA
9 00074006702 HUMIRA PEDIATRIC CROHNS DISEAS HUMIRA
10 00074433902 HUMIRA PEN HUMIRA对于这个问题,这里有一个更复杂的解决方案:
步骤1:运行for循环,根据动态模式检测从一行到下一行的任何重复单词:
# initialize vectors/columns:
p1 <- c()
data$repeats <- NA
# for loop to detect repeated words across rows:
library(stringr)
for(i in 2:nrow(data)){
p1[i-1] <- paste0(unlist(str_split(trimws(data$BRAND_NM[i-1]), "\\s+")), collapse = "|")
data$repeats[i] <- str_extract_all(data$BRAND_NM[i], p1[i-1])
}步骤2:mutate重复的单词,以获得理想的品牌名称,假设它们最多包含两个单词:
data %>%
mutate(result = sapply(repeats, function(x) paste(x[1], x[2], collapse = " ")),
result = sub("\\sNA", "", result),
result = ifelse(grepl("NA", result), lead(result), result)) %>%
select(-repeats)数据:
BRAND_NM_short = c("INSULIN LISPRO", "HUMIRA")
data <- data.frame(NDC_NBR = c("00002773701","00002775201","00002775205","00002822201","00002822259","00002823301","00002823305","00074024302","00074006702","00074433902"),
BRAND_NM = c("INSULIN LISPRO","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO PROTAMINE/INSUL","INSULIN LISPRO PROTAMINE/INSUL","HUMIRA","HUMIRA PEDIATRIC CROHNS DISEAS","HUMIRA PEN"))发布于 2021-10-21 15:18:11
这里有一种方法,假设第一个单词总是名称的一部分:
mydata <- data.frame(NDC_NBR = c("00002773701","00002775201","00002775205","00002822201",
"00002822259","00002823301","00002823305","00074024302",
"00074006702","00074433902"),
BRAND_NM = c("INSULIN LISPRO","INSULIN LISPRO JUNIOR KWIKPEN",
"INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO KWIKPEN",
"INSULIN LISPRO KWIKPEN","INSULIN LISPRO PROTAMINE/INSUL",
"INSULIN LISPRO PROTAMINE/INSUL","HUMIRA",
"HUMIRA PEDIATRIC CROHNS DISEAS","HUMIRA PEN"))
# add the first word as a main grouping column
mydata["main_group"] <- mydata %>% apply(1, function(x) strsplit(x["BRAND_NM"], " ")[[1]][1])
mydata["RESULT"] <- mydata %>%
# per group: get the number of words in common between all rows in group
group_by(main_group) %>%
mutate(intersect_cnt = Reduce(intersect, strsplit(BRAND_NM," ")) %>% length()) %>%
# extract the identified words
apply(1, function(x) paste(strsplit(x["BRAND_NM"], " ")[[1]][1:x["intersect_cnt"]], collapse = " "))输出:
mydata
# NDC_NBR BRAND_NM main_group RESULT
# 1 00002773701 INSULIN LISPRO INSULIN INSULIN LISPRO
# 2 00002775201 INSULIN LISPRO JUNIOR KWIKPEN INSULIN INSULIN LISPRO
# 3 00002775205 INSULIN LISPRO JUNIOR KWIKPEN INSULIN INSULIN LISPRO
# 4 00002822201 INSULIN LISPRO KWIKPEN INSULIN INSULIN LISPRO
# 5 00002822259 INSULIN LISPRO KWIKPEN INSULIN INSULIN LISPRO
# 6 00002823301 INSULIN LISPRO PROTAMINE/INSUL INSULIN INSULIN LISPRO
# 7 00002823305 INSULIN LISPRO PROTAMINE/INSUL INSULIN INSULIN LISPRO
# 8 00074024302 HUMIRA HUMIRA HUMIRA
# 9 00074006702 HUMIRA PEDIATRIC CROHNS DISEAS HUMIRA HUMIRA
# 10 00074433902 HUMIRA PEN HUMIRA HUMIRAhttps://stackoverflow.com/questions/69661752
复制相似问题