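I want to remove duplicates within a string. For example, Predictive Modeling is a value that is repeated in the first row. After the duplicates are removed, the string must not contain any extra commas (,).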
mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis", "SQL, Tableau, data analysis, data analysis", "Predictive Modeling, Python, SQL, visualization, Spark, Tableau"))
Expected output:
mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, SQL, Tableau, data analysis", "SQL, Tableau, data analysis", "Predictive Modeling, Python, SQL, visualization, Spark, Tableau"))
Posted on 2022-03-27 08:33:52
Here is a one-liner using toString.
transform(mydf, Keyword=sapply(strsplit(Keyword, ', '), \(x) toString(unique(x))))
# Keyword
# 1 Predictive Modeling, R, Python, SQL, Tableau, data analysis
# 2 SQL, Tableau, data analysis
# 3 Predictive Modeling, Python, SQL, visualization, Spark, Tableau
Posted on 2022-03-27 07:49:38
Here is a regex-based approach. We can replace with an empty string any CSV term that also appears later in the string.
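The `toString` one-liner above splits on the literal separator `', '`, so it relies on every comma being followed by exactly one space. The following is a minimal base R sketch for input where that may not hold. `dedupe_keywords` is a hypothetical helper name introduced here for illustration, and the handling of stray spaces and empty terms is an assumption about what "no extra commas" should mean, not something the original answer specifies.

```r
# Hypothetical helper (not from the original answer): remove repeated
# comma-separated terms from each string, keeping the first occurrence.
dedupe_keywords <- function(x) {
  # as.character() guards against factor columns (the default in R < 4.0)
  parts <- strsplit(as.character(x), "\\s*,\\s*")
  vapply(parts, function(p) {
    p <- trimws(p)        # drop spaces left at either end of a term
    p <- p[nzchar(p)]     # drop empty terms produced by stray commas
    toString(unique(p))   # rejoin the unique terms with ", "
  }, character(1))
}

mydf <- data.frame(Keyword = c(
  "Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis",
  "SQL, Tableau, data analysis, data analysis",
  "Predictive Modeling, Python, SQL, visualization, Spark, Tableau"
))

mydf$Keyword <- dedupe_keywords(mydf$Keyword)
mydf$Keyword
```

Because the terms are rejoined with `toString`, the result cannot contain a leading, trailing, or doubled comma, whatever the spacing of the input was.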
mydf <- data.frame(Keyword = c("Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis", "SQL, Tableau, data analysis, data analysis"))
mydf$Keyword <- gsub("\\s*([^,]+),?(?=.*\\1(?:,|$))", "",
mydf$Keyword, perl=TRUE)
mydf
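1 R, Python, Predictive Modeling, SQL, Tableau, data analyis
2 SQL, Tableau, data analyis
Note that this approach keeps the last instance of each CSV term, but perhaps that is acceptable for your needs. The pattern is also not anchored to whole terms, so it can delete part of a term: the first "s" in "analysis" is removed because another "s" ends the string, which is why the output above reads "data analyis".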
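If keeping the last occurrence is what you want, the same effect can be sketched without a regex by splitting the string and using `duplicated()` with `fromLast = TRUE`. This is an alternative technique, not the method from the answer above. It compares whole terms only, so it does not strip characters from inside a term the way the output above shows ("data analyis"). The variable names are illustrative.

```r
# Split one keyword string into its comma-separated terms
kw <- strsplit(
  "Predictive Modeling, R, Python, Predictive Modeling, SQL, Tableau, data analysis",
  ",\\s*"
)[[1]]

# Keep the first occurrence of each term (same result as unique())
keep_first <- toString(kw[!duplicated(kw)])

# Keep the last occurrence of each term instead
keep_last <- toString(kw[!duplicated(kw, fromLast = TRUE)])

keep_first
# "Predictive Modeling, R, Python, SQL, Tableau, data analysis"
keep_last
# "R, Python, Predictive Modeling, SQL, Tableau, data analysis"
```

Only the position of the surviving "Predictive Modeling" differs between the two results. The set of terms is identical.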
Posted on 2022-03-27 15:35:35
Using tidyverse:
library(dplyr)
library(tidyr)
mydf %>%
mutate(rn = row_number()) %>%
separate_rows(Keyword, sep =",\\s+") %>%
distinct() %>%
group_by(rn) %>%
summarise(Keyword = toString(Keyword), .groups = "drop") %>%
select(-rn)
Output:
# A tibble: 3 × 1
Keyword
<chr>
1 Predictive Modeling, R, Python, SQL, Tableau, data analysis
2 SQL, Tableau, data analysis
3 Predictive Modeling, Python, SQL, visualization, Spark, Tableau
https://stackoverflow.com/questions/71634465