In the dataset below, I want to do two things.
pt_id <- c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4)
Tobacco <- c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never")
Alcohol <- c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never")
PA <- c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
mydata <- data.frame(pt_id, Tobacco, Alcohol, PA)
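First, I want to count the rows with non-NA values for each variable, grouped by pt_id.
I used the code below to get that output, but it only handles one variable at a time.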
library(dplyr)

# Count the non-NA Tobacco rows for each pt_id
mydata_tob <- mydata %>%
  filter(!is.na(Tobacco)) %>%
  group_by(pt_id) %>%
  count()
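# A tibble: 3 x 2
# Groups:   pt_id [3]
  pt_id     n
  <dbl> <int>
1     1     3
2     3     4
3     4     2

But this is very time-consuming for me, because my original dataset has many variables. I would like to get similar output for all variables at once.

Second, I want the percentage of pt_id groups that have more than one such row. I wrote this function: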
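gt1_prop <- function(n) {
  gt1_len <- length(mydata_tob$n[mydata_tob$n > 1])
  len_tot <- length(mydata_tob$n)
  gt1_prop <- (gt1_len / len_tot) * 100
  return(gt1_prop)
}

Likewise, I would like to write this so that it returns the proportion for every variable in the dataset (Tobacco, Alcohol, and PA).
Any suggestions would be helpful. Thanks in advance!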
Posted on 2020-12-06 09:23:50
To count the non-NA values for each pt_id, you can use across:
library(dplyr)

result <- mydata %>%
  group_by(pt_id) %>%
  summarise(across(Tobacco:PA, ~ sum(!is.na(.))))
result
#  pt_id Tobacco Alcohol    PA
#  <dbl>   <int>   <int> <int>
#1     1       3       3     2
#2     2       0       1     0
#3     3       4       3     3
#4     4       2       2     1

For the second step, calculating the percentage, you can do:
result %>%
  summarise(across(Tobacco:PA, ~ mean(. > 1) * 100))
#  Tobacco Alcohol    PA
#    <dbl>   <dbl> <dbl>
#1      75      75    50

Posted on 2020-12-06 18:58:29
In base R, we can use aggregate:
aggregate(. ~ pt_id, mydata, FUN = function(x) sum(!is.na(x)), na.action = NULL)
#  pt_id Tobacco Alcohol PA
#1     1       3       3  2
#2     2       0       1  0
#3     3       4       3  3
#4     4       2       2  1

Or, more compactly, with rowsum from base R:
rowsum(+(!is.na(mydata[-1])), mydata$pt_id)
#  Tobacco Alcohol PA
#1       3       3  2
#2       0       1  0
#3       4       3  3
#4       2       2  1

If we need the proportion (multiply by 100 for a percentage):
colMeans(rowsum(+(!is.na(mydata[-1])), mydata$pt_id) > 1)
#Tobacco Alcohol      PA
#   0.75    0.75    0.50

https://stackoverflow.com/questions/65166418
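
A note on the denominators: the grouped counts in both answers keep pt_id groups whose count is zero, while the filter-then-count code in the question drops those groups before the proportion is taken. The two approaches can therefore give different percentages. The following is a minimal base R sketch of that difference, not part of either answer. The helper name pct_gt1 and its drop_empty argument are made up for illustration.

```r
# Rebuild the example data from the question
pt_id   <- c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4)
Tobacco <- c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never")
Alcohol <- c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never")
PA      <- c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
mydata  <- data.frame(pt_id, Tobacco, Alcohol, PA)

# Non-NA counts per pt_id for every column except pt_id
counts <- rowsum(+(!is.na(mydata[-1])), mydata$pt_id)

# Hypothetical helper: percentage of groups with more than one non-NA value.
# drop_empty = TRUE ignores groups whose count is zero, which mimics
# filtering out the NA rows before counting, as the question's code does.
pct_gt1 <- function(counts, drop_empty = FALSE) {
  apply(counts, 2, function(n) {
    if (drop_empty) n <- n[n > 0]
    mean(n > 1) * 100
  })
}

pct_all      <- pct_gt1(counts)                     # all pt_id groups in the denominator
pct_nonempty <- pct_gt1(counts, drop_empty = TRUE)  # only groups with at least one value
```

Here pct_all matches the percentages from the answers, while pct_nonempty follows the question's original logic. Which one is appropriate depends on whether a pt_id with no recorded values should count toward the total.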