假设我有这样一个data.frame:
user_df = read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = T)对于每个用户,我都有一个唯一的id、industry和year。
我试图计算曾经在政府工作过的人的累计数量,因此,累积统计应该是对该年和前几年唯一用户总数的统计。
我知道我可以像这样做一个普通的累积总和:
user_df %>% group_by(year, industry) %>% summarize(cum_sum = cumsum(n_distinct(id)))
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 2
6 2001 Private 3然而,这不是我想要的,因为今年的2000和2001将包括已经包括在1999中的人。我希望每年都能累计统计在某一年曾在政府工作过的独特用户的总数。我无法在dplyr中找到正确的方法来做到这一点。
因此正确的输出应该如下所示:
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3发布于 2021-03-08 21:22:11
一种选择可能是:
user_df %>%
group_by(industry) %>%
mutate(cum_sum = cumsum(!duplicated(id))) %>%
group_by(year, industry) %>%
summarise(cum_sum = max(cum_sum))
year industry cum_sum
<int> <fct> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3发布于 2021-03-08 21:22:32
1) sqldf,这可以通过在sql中实现复杂的自连接来实现。它将每行连接到具有相同行业和相同年份或之前的行,然后按年份对它们进行分组,并对计算不同id的行业进行分组。
library(sqldf)
sqldf("select a.year, a.industry, count(distinct b.id) cum_sum
from user_df a
left join user_df b on b.industry = a.industry and b.year <= a.year
group by a.year, a.industry")给予:
year industry cum_sum
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 32)基础--一个基础解决方案是将工业上的数据框架与自身进行合并,然后将子集合并到同一年或更早的年份,然后按行业和年份进行聚合。这是低效的,因为与SQL语句不同,SQL语句在加入时过滤,这会在过滤之前创建整个联接;但是,如果数据不太大,这可能就足够了。
m <- merge(user_df, user_df, by = "indstry")
s <- subset(m, year.y <= year.x)
ag <- aggregate(id.y ~ industry + year.x, s, function(x) length(unique(x)))
names(ag) <- sub("\\..*", "", names(ag))
ag给予:
industry year id
1 Government 1999 3
2 NGO 1999 1
3 Private 1999 1
4 Government 2000 4
5 Government 2001 5
6 Private 2001 3https://stackoverflow.com/questions/66537151
复制相似问题