首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >基于相似性将R中的字符整理到最不具体的细节

基于相似性将R中的字符整理到最不具体的细节
EN

Stack Overflow用户
提问于 2021-10-21 12:10:25
回答 2查看 62关注 0票数 0

我有一个药物I (NDC_NBR)及其对应的药物名称(BRAND_NM)的数据框架。

我需要将每个药物的名称压缩/聚合到尽可能少的特异性。

以下是我正在处理的数据和预期结果的示例:

代码语言:javascript
复制
data <- data.frame(NDC_NBR = c("00002773701","00002775201","00002775205","00002822201","00002822259","00002823301","00002823305","00074024302","00074006702","00074433902"),BRAND_NM = c("INSULIN LISPRO","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO PROTAMINE/INSUL","INSULIN LISPRO PROTAMINE/INSUL","HUMIRA","HUMIRA PEDIATRIC CROHNS DISEAS","HUMIRA PEN"), RESULT = c("INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","HUMIRA","HUMIRA","HUMIRA"))

我希望将这些产品压缩为最不具体的药物名称,即改变一个新的列,其中包含字符“胰岛素LISPRO”和"HUMIRA“。

“胰岛素LISPRO”对前7行是常见的,而“胰岛素LISPRO KWIKPEN”只对7行中的2行是常见的。类似地,HUMIRA与任何“胰岛素”行都没有相似之处,但对最后3行都是常见的。

我有一个这样的产品的巨大数据框架,这是不可能手动转换每一个。

如果有人能为这样的问题提出解决方案,我将不胜感激。

EN

回答 2

Stack Overflow用户

发布于 2021-10-21 13:18:09

两种解决方案,每种方案对应不同的场景:

  • Scenario # 1 -你有一个理想品牌的列表:

以下是针对此场景的两步解决方案:

首先,将短品牌名称定义为交替模式p

代码语言:javascript
复制
p <- paste0(BRAND_NM_short, collapse = "|")

然后对gsub使用此模式,以匹配BRAND_NM中您想要保留的部分,并将BRAND_NM替换为匹配的短品牌名称:

代码语言:javascript
复制
library(dplyr)
data %>%
  mutate(result = gsub(paste0("(", p, ").*"), "\\1", BRAND_NM))
       NDC_NBR                       BRAND_NM         result
1  00002773701                 INSULIN LISPRO INSULIN LISPRO
2  00002775201  INSULIN LISPRO JUNIOR KWIKPEN INSULIN LISPRO
3  00002775205  INSULIN LISPRO JUNIOR KWIKPEN INSULIN LISPRO
4  00002822201         INSULIN LISPRO KWIKPEN INSULIN LISPRO
5  00002822259         INSULIN LISPRO KWIKPEN INSULIN LISPRO
6  00002823301 INSULIN LISPRO PROTAMINE/INSUL INSULIN LISPRO
7  00002823305 INSULIN LISPRO PROTAMINE/INSUL INSULIN LISPRO
8  00074024302                         HUMIRA         HUMIRA
9  00074006702 HUMIRA PEDIATRIC CROHNS DISEAS         HUMIRA
10 00074433902                     HUMIRA PEN         HUMIRA

  • Scenario #2:你没有一个理想品牌的列表。

对于这个问题,这里有一个更复杂的解决方案:

步骤1:运行for循环,根据动态模式检测从一行到下一行的任何重复单词:

代码语言:javascript
复制
# initialize vectors/columns:
p1 <- c()
data$repeats <- NA

# for loop to detect repeated words across rows:
library(stringr)
for(i in 2:nrow(data)){
  p1[i-1] <- paste0(unlist(str_split(trimws(data$BRAND_NM[i-1]), "\\s+")), collapse = "|")
  data$repeats[i] <- str_extract_all(data$BRAND_NM[i], p1[i-1])
}

步骤2:mutate重复的单词,以获得理想的品牌名称,假设它们最多包含两个单词:

代码语言:javascript
复制
data %>%
  mutate(result = sapply(repeats, function(x) paste(x[1], x[2], collapse = " ")),
         result = sub("\\sNA", "", result),
         result = ifelse(grepl("NA", result), lead(result), result)) %>%
  select(-repeats)

数据:

代码语言:javascript
复制
BRAND_NM_short = c("INSULIN LISPRO", "HUMIRA")

data <- data.frame(NDC_NBR = c("00002773701","00002775201","00002775205","00002822201","00002822259","00002823301","00002823305","00074024302","00074006702","00074433902"),
                   BRAND_NM = c("INSULIN LISPRO","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO PROTAMINE/INSUL","INSULIN LISPRO PROTAMINE/INSUL","HUMIRA","HUMIRA PEDIATRIC CROHNS DISEAS","HUMIRA PEN"))
票数 0
EN

Stack Overflow用户

发布于 2021-10-21 15:18:11

这里有一种方法,假设第一个单词总是名称的一部分:

代码语言:javascript
复制
mydata <- data.frame(NDC_NBR = c("00002773701","00002775201","00002775205","00002822201",
                                 "00002822259","00002823301","00002823305","00074024302",
                                 "00074006702","00074433902"),
                     BRAND_NM = c("INSULIN LISPRO","INSULIN LISPRO JUNIOR KWIKPEN",
                                  "INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO KWIKPEN",
                                  "INSULIN LISPRO KWIKPEN","INSULIN LISPRO PROTAMINE/INSUL",
                                  "INSULIN LISPRO PROTAMINE/INSUL","HUMIRA",
                                  "HUMIRA PEDIATRIC CROHNS DISEAS","HUMIRA PEN")) 

# add the first word as a main grouping column
mydata["main_group"] <- mydata %>% apply(1, function(x) strsplit(x["BRAND_NM"], " ")[[1]][1])


mydata["RESULT"] <- mydata %>% 
  # per group: get the number of words in common between all rows in group
  group_by(main_group) %>% 
  mutate(intersect_cnt = Reduce(intersect, strsplit(BRAND_NM," ")) %>% length()) %>% 
  # extract the identified words
  apply(1, function(x) paste(strsplit(x["BRAND_NM"], " ")[[1]][1:x["intersect_cnt"]], collapse = " "))

输出:

代码语言:javascript
复制
mydata
#    NDC_NBR                       BRAND_NM main_group         RESULT
# 1  00002773701                 INSULIN LISPRO    INSULIN INSULIN LISPRO
# 2  00002775201  INSULIN LISPRO JUNIOR KWIKPEN    INSULIN INSULIN LISPRO
# 3  00002775205  INSULIN LISPRO JUNIOR KWIKPEN    INSULIN INSULIN LISPRO
# 4  00002822201         INSULIN LISPRO KWIKPEN    INSULIN INSULIN LISPRO
# 5  00002822259         INSULIN LISPRO KWIKPEN    INSULIN INSULIN LISPRO
# 6  00002823301 INSULIN LISPRO PROTAMINE/INSUL    INSULIN INSULIN LISPRO
# 7  00002823305 INSULIN LISPRO PROTAMINE/INSUL    INSULIN INSULIN LISPRO
# 8  00074024302                         HUMIRA     HUMIRA         HUMIRA
# 9  00074006702 HUMIRA PEDIATRIC CROHNS DISEAS     HUMIRA         HUMIRA
# 10 00074433902                     HUMIRA PEN     HUMIRA         HUMIRA
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/69661752

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档