My data has the following format:
DF <- read.table(text = "
Date ID
2018-04-01 K-1
2018-04-01 K-1
2018-04-01 K-8
2018-04-02 K-2
2018-04-02 K-2
2018-04-03 K-2
2018-04-03 K-2
2018-04-03 K-2
2018-04-04 K-3
2018-05-01 K-5
2018-05-01 K-5
2018-05-02 K-6
2018-05-02 K-7", header = TRUE, stringsAsFactors = FALSE)

Using the data frame above, I want to compute the following metrics:
Date        Unique_count  Duplicate_Count  Overall_Duplicate
2018-04-01  2             1                0
2018-04-02  1             1                0
2018-04-03  0             0                3
2018-04-04  1             0                0
2018-05-01  1             1                0
2018-05-02  2             0                0

Where:
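Unique_count: the number of distinct IDs created on a given date that do not match any ID from an earlier date.

Duplicate_Count: the number of extra occurrences of the same ID on a given date (if an ID appears twice, Duplicate_Count is 1), where the ID does not match any ID from an earlier date.

Overall_Duplicate: the number of IDs that were generated on an earlier date and appear again on the given date.

I have the code below, but I am not sure how to compute Overall_Duplicate: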
library(dplyr)
DF2 <- DF %>%
  group_by(Date) %>%
  summarise(Unique_Count = n_distinct(ID),
            Duplicate_Count = sum(table(ID) > 1))

Answered 2018-06-12 14:42:34
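If you first group by ID and find the first date on which each ID appears, you can set every occurrence on a later date to NA, and then a few calculations give the desired result.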
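The idea can be seen row by row before any aggregation. The following is a minimal base R sketch, not the answerer's code: it labels each row of the question's data as a first appearance, a same-day repeat, or a reappearance from an earlier date. The names `first_date`, `seen_before`, `same_day_repeat`, and `kind` are illustrative only, and the sketch assumes `Date` is an ISO-formatted (`YYYY-MM-DD`) string, so that string comparison matches chronological order.

```r
# Same data as in the question, rebuilt so the block runs on its own.
DF <- data.frame(
  Date = c("2018-04-01", "2018-04-01", "2018-04-01", "2018-04-02", "2018-04-02",
           "2018-04-03", "2018-04-03", "2018-04-03", "2018-04-04",
           "2018-05-01", "2018-05-01", "2018-05-02", "2018-05-02"),
  ID = c("K-1", "K-1", "K-8", "K-2", "K-2", "K-2", "K-2", "K-2", "K-3",
         "K-5", "K-5", "K-6", "K-7"),
  stringsAsFactors = FALSE
)

# First date on which each ID appears, repeated for every row of that ID.
first_date <- ave(DF$Date, DF$ID, FUN = min)

# TRUE when the ID was first seen on an earlier date.
seen_before <- DF$Date > first_date

# TRUE for second and later occurrences of an ID on its first date.
same_day_repeat <- !seen_before & duplicated(DF[c("Date", "ID")])

# One label per row (illustrative names, assumed for this sketch).
kind <- ifelse(seen_before, "overall_duplicate",
               ifelse(same_day_repeat, "duplicate", "unique"))

# Rows per date and label; each column corresponds to one requested metric.
counts <- table(DF$Date, kind)
print(counts)
```

The dplyr pipeline that follows arrives at the same per-date counts by blanking out reappearing IDs and then summarising by date.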
library(dplyr)

DF %>%
  group_by(ID) %>%
  mutate(first_time = min(Date)) %>%   # first date on which each ID appears
  ungroup() %>%
  mutate(ID = ifelse(Date == first_time, ID, NA)) %>%   # blank out reappearances on later dates
  group_by(Date) %>%
  summarise(Unique_Count = n_distinct(ID, na.rm = TRUE),
            Overall_Duplicate = sum(is.na(ID)),
            Duplicate_Count = n() - Unique_Count - Overall_Duplicate)

Answered 2018-06-12 08:13:55
You can try something like this. Note that it is hard-coded to your reproducible example; on real data it may produce false positives.
library(tidyverse)
DF %>%
  mutate(id = duplicated(ID)) %>%
  group_by(ID) %>%
  mutate(OLdate = n_distinct(Date) > 1) %>%
  group_by(Date) %>%
  summarise(Unique_count = sum(!na_if(id, T), na.rm = T),
            Duplicate_Count = ifelse(Unique_count == 0, 0, sum(id)),
            Overall_Duplicate = ifelse(Unique_count == 0, sum(OLdate), 0))
# A tibble: 6 x 4
  Date       Unique_count Duplicate_Count Overall_Duplicate
  <chr>             <int>           <dbl>             <dbl>
1 2018-04-01            2               1                 0
2 2018-04-02            1               1                 0
3 2018-04-03            0               0                 3
4 2018-04-04            1               0                 0
5 2018-05-01            1               1                 0
6 2018-05-02            2               0                 0

https://stackoverflow.com/questions/50810533
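On the false positives the second answer warns about, here is a minimal base R sketch of one way they could arise. The three-row data frame `ex` is hypothetical and not from the original post. A global `duplicated(ID)` flag does not distinguish a same-day repeat from an ID returning from an earlier date, and the `Unique_count == 0` test only routes flagged rows to `Overall_Duplicate` when the date has no new IDs at all.

```r
# Hypothetical data: K-1 returns on a date that also introduces a new ID (K-3).
ex <- data.frame(
  Date = c("2018-04-01", "2018-04-02", "2018-04-02"),
  ID   = c("K-1", "K-3", "K-1"),
  stringsAsFactors = FALSE
)

# Global duplicate flag, as used in the second answer's pipeline.
flag <- duplicated(ex$ID)

# Unflagged rows per date (what that pipeline reports as Unique_count).
new_per_date <- tapply(!flag, ex$Date, sum)

# Flagged rows per date.
flagged_per_date <- tapply(flag, ex$Date, sum)

print(new_per_date)
print(flagged_per_date)
```

On 2018-04-02 the count of new IDs is not zero, so by the `ifelse` conditions in the second answer's pipeline the flagged K-1 row would be reported under `Duplicate_Count` rather than `Overall_Duplicate`, although by the question's definitions it is a reappearance from 2018-04-01.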