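I hope you are all doing well.
I have a dataset with many columns, and I am trying to remove duplicate rows based on several conditions. Below is an example that demonstrates my problem. The idea is: for each ID, check all columns, and if all columns are identical, keep the latest row. If two rows are otherwise identical but their comments differ, check whether the comment is "Add comment for down/upgrading client". If all rows have the same comment, keep the first row; otherwise keep the latest row that does not contain that comment.
I have been trying the following approaches: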
## dataframe
library(dplyr)

ID <- c("H1", "H1", " H1", " H2", "H2", "H3", "H3", " H3", "H4")
rating <- c("C", "C", "C+", "D", "C", "C", "C+", "C+", "C")
Commnets <- c("Add comment for down/upgrading client", "updated", "Add comment for down/upgrading client",
              "Add comment for down/upgrading client", "Add comment for down/upgrading client",
              "down", "down", "Add comment for down/upgrading client", "Add comment for down/upgrading client")
Date <- c("2018-12-10", "2018-12-10", "2018-11-10",
          "2018-11-10", "2018-11-10",
          "2018-10-10", "2018-10-02", "2018-10-02", "2020-09-03")
df <- data.frame(ID, rating, Commnets, Date, stringsAsFactors = FALSE)
df$Date <- as.Date(df$Date)
df <- df %>%
  group_by(ID, rating, Date) %>%
  arrange(desc(Date)) %>% # in each group, arrange in desc by Date
  filter(row_number() == 1) # this will solve the first problem
df$Date<-as.Date(df$Date)
df<-df%>%
group_by(ID,rating,Date)%>%
arrange(desc(Date)) %>% #I think that I need **do** here but not sure how
ifelse(rowSums("Add comment for down/upgrading client" == $Comments)==length($Comments),
filter(row_number() == 1),rowSums("Add comment for down/upgrading client" == $Comments)[1,])
Posted on 2021-02-02 11:14:45
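You can arrange the data and count the number of unique Commnets values for each combination of ID, rating, and Date. If the comment is the same throughout a group, select the first row; if the comments differ, select the last row. Because Date is constant within each group, the last row is decided by the sort on Commnets, and in this data it is the row whose comment is not "Add comment for down/upgrading client".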
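library(dplyr)

df %>%
  # Strip stray whitespace from ID and make sure Date is a Date
  mutate(ID = trimws(ID),
         Date = as.Date(Date)) %>%
  # Sort so that rows within a group end up ordered by Commnets
  arrange(ID, rating, Commnets, desc(Date)) %>%
  group_by(ID, rating, Date) %>%
  # One distinct comment: keep the first row; otherwise keep the last row
  slice(if (n_distinct(Commnets) == 1) 1L else n())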
#   ID    rating Commnets                              Date
#   <chr> <chr>  <chr>                                 <date>
# 1 H1    C      updated                               2018-12-10
# 2 H1    C+     Add comment for down/upgrading client 2018-11-10
# 3 H2    C      Add comment for down/upgrading client 2018-11-10
# 4 H2    D      Add comment for down/upgrading client 2018-11-10
# 5 H3    C      down                                  2018-10-10
# 6 H3    C+     down                                  2018-10-02
# 7 H4    C      Add comment for down/upgrading client 2020-09-03
https://stackoverflow.com/questions/66002807
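As a variation on the answer above, the rule "prefer a row whose comment is not the default one" can be written out explicitly instead of relying on how the comments happen to sort. This is only a minimal sketch under some assumptions: the names default_comment and result are hypothetical, the data frame is rebuilt from the question so the snippet runs on its own, and when a group has several non-default comments the first one in the original row order is kept (the question does not say which to prefer when the dates are equal).

```r
library(dplyr)

# Hypothetical name for the placeholder comment used in the question
default_comment <- "Add comment for down/upgrading client"

# Rebuild the example data from the question (note the stray spaces in ID)
df <- data.frame(
  ID = c("H1", "H1", " H1", " H2", "H2", "H3", "H3", " H3", "H4"),
  rating = c("C", "C", "C+", "D", "C", "C", "C+", "C+", "C"),
  Commnets = c(default_comment, "updated", default_comment,
               default_comment, default_comment,
               "down", "down", default_comment, default_comment),
  Date = as.Date(c("2018-12-10", "2018-12-10", "2018-11-10",
                   "2018-11-10", "2018-11-10",
                   "2018-10-10", "2018-10-02", "2018-10-02", "2020-09-03")),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(ID = trimws(ID)) %>%
  group_by(ID, rating, Date) %>%
  # Keep every row if the whole group carries the default comment;
  # otherwise keep only the rows with some other comment
  filter(all(Commnets == default_comment) | Commnets != default_comment) %>%
  # Assumption: take the first remaining row of each group
  slice(1) %>%
  ungroup()

print(result)
```

On the example data this should select the same seven rows as the slice() approach, but the choice no longer depends on "Add comment for down/upgrading client" sorting before the other comments.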