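I want to group a data frame by 4 variables, summarise it with a count, and then compute each row's percentage of the total count, where the total is taken within each group of the first variable. As a final step, I compute a cumulative percentage and assign each row to a category based on certain thresholds.
First, a simple example: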
library(nycflights13)
library(dplyr)

test <- flights %>%
  left_join(airlines, by = c('carrier'), na_matches = "never") %>%
  group_by(origin, name) %>%
  summarise(count_flights = n()) %>%
  arrange(origin, desc(count_flights)) %>%
  # share of each airline within its origin, running total, and ABC class
  mutate(prop = prop.table(count_flights) * 100,
         cumprop = cumsum(prop),
         ABC = cut(cumprop, c(0,80,95,100), labels = c('A','B','C')))

This works well: I get the number of flights for each New York City airport and airline, along with each row's percentage of the airport total.
Now, when grouping by more than two variables, this no longer works:
test2 <- flights %>%
  left_join(airlines, by = c('carrier'), na_matches = "never") %>%
  group_by(origin, name, dest, day) %>%
  summarise(count_flights = n()) %>%
  arrange(origin, desc(count_flights)) %>%
  mutate(prop = prop.table(count_flights) * 100,
         cumprop = cumsum(prop),
         ABC = cut(cumprop, c(0,80,95,100), labels = c('A','B','C')))

What I expect is that the cumulative sum reaches 100 before the airport/origin changes. In other words, each row should be a percentage of the total flights of its airport.
Any ideas?
Posted on 2020-07-23 15:44:59
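The best approach is to group_by again on the variable that defines the new, less specific bucket (origin), and then divide the count by the total count inside mutate. This is needed because summarise() by default drops only the last grouping variable: after grouping by origin, name, dest and day, the result is still grouped by origin, name and dest, so prop.table() and cumsum() are evaluated within those smaller groups rather than per origin.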
flights %>%
  left_join(airlines, by = c('carrier'), na_matches = "never") %>%
  group_by(origin, name, dest, day) %>%
  summarise(count_flights = n()) %>%
  arrange(origin) %>%
  # regroup by origin only, so totals are taken per airport
  group_by(origin) %>%
  # prop is a fraction here (0 to 1); multiply by 100 for a percentage
  mutate(prop = count_flights/sum(count_flights),
         cumprop = cumsum(prop))

https://stackoverflow.com/questions/63049156
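Putting the regrouping together with the percentage scale, the descending sort, and the ABC classes from the question, the whole pipeline might look like the sketch below. It rests on a few assumptions that are not in the original posts: dplyr 1.0.0 or later (for the `.groups` argument of `summarise()`), the object name `abc`, and an upper break of `Inf` instead of 100 so that a cumulative value landing a hair above 100 through floating-point rounding does not become `NA` in `cut()`.

```r
library(nycflights13)
library(dplyr)

abc <- flights %>%
  left_join(airlines, by = "carrier", na_matches = "never") %>%
  group_by(origin, name, dest, day) %>%
  # .groups = "drop" removes all grouping, so nothing is left over by accident
  summarise(count_flights = n(), .groups = "drop") %>%
  # largest counts first within each origin, so cumprop accumulates in ABC order
  arrange(origin, desc(count_flights)) %>%
  # totals, shares and running sums are now computed per origin only
  group_by(origin) %>%
  mutate(prop = count_flights / sum(count_flights) * 100,
         cumprop = cumsum(prop),
         # Inf as the top break is an assumption to guard against rounding
         ABC = cut(cumprop, c(0, 80, 95, Inf), labels = c("A", "B", "C"))) %>%
  ungroup()

# Sanity check: the largest cumulative value per origin should be about 100
abc %>%
  group_by(origin) %>%
  summarise(total = max(cumprop))
```

Dropping the groups explicitly in `summarise()` and then stating the one grouping that matters keeps the intent visible, and it avoids depending on which grouping variable `summarise()` happens to peel off.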