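I have an imbalanced dataset in which people from liberal and conservative backgrounds rated a question on a 1-7 scale. I want to see how polarized the question is.
The sample is heavily skewed toward liberals (70% of the sample). How can I use R to resample repeatedly to create a balanced (50-50) sample and compute the kurtosis?
For example, I have 50 conservatives in total. How do I repeatedly draw a random 50 from the 150 liberals?
Here is an example data frame: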
political_ort rating
liberal 1
liberal 6
conservative 5
conservative 3
liberal 7
liberal 3
liberal 1
Posted on 2021-01-29 10:15:33
What you are describing is called "undersampling". Here is one way to do it using tidyverse functions:
# Load library
library(tidyverse)
# Create some 'test' (fake) data
sample_df <- tibble(id_number = 1:100,
                    political_ort = c(rep("liberal", 70),
                                      rep("conservative", 30)),
                    ratings = sample(1:7, size = 100, replace = TRUE))
# Take the fake data
undersampled_df <- sample_df %>%
  # Group the data by category (liberal / conservative) to treat them separately
  group_by(political_ort) %>%
  # And randomly sample 30 rows from each category (liberal / conservative)
  sample_n(size = 30, replace = FALSE) %>%
  # Because there are only 30 conservatives in total they are all included
  # Finally, ungroup the data so it goes back to a 'vanilla' dataframe/tibble
  ungroup()
# You can see the id_numbers aren't in order anymore indicating the sampling was random
There is also the ROSE package, which has a function ("ovun.sample") that can do this for you: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ovun.sample
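The undersampling above produces a single balanced sample, while the question also asks for repeated draws and a kurtosis value. The following is a minimal base-R sketch of one way that could look, not a definitive method. It assumes the 150/50 split from the question, uses fake ratings, and picks 1000 repetitions arbitrarily. The helper names `kurtosis_moment` and `balanced_kurtosis` are hypothetical, and the kurtosis used here is Pearson's moment kurtosis (not excess kurtosis), written by hand so no extra package is needed.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Fake data shaped like the question: 150 liberals, 50 conservatives
ratings_df <- data.frame(
  political_ort = rep(c("liberal", "conservative"), times = c(150, 50)),
  rating = sample(1:7, size = 200, replace = TRUE)
)

# Hypothetical helper: Pearson's moment kurtosis (a normal distribution gives 3)
kurtosis_moment <- function(x) {
  centered <- x - mean(x)
  mean(centered^4) / mean(centered^2)^2
}

# One balanced draw: keep every conservative, sample the same number of
# liberals without replacement, and return the kurtosis of the pooled ratings
balanced_kurtosis <- function(df) {
  conservative_rows <- which(df$political_ort == "conservative")
  liberal_rows <- which(df$political_ort == "liberal")
  sampled_liberals <- sample(liberal_rows, size = length(conservative_rows))
  kurtosis_moment(df$rating[c(conservative_rows, sampled_liberals)])
}

n_draws <- 1000  # assumed number of repetitions
kurtosis_draws <- replicate(n_draws, balanced_kurtosis(ratings_df))

mean(kurtosis_draws)                       # average kurtosis across balanced samples
quantile(kurtosis_draws, c(0.025, 0.975))  # spread across the repeated draws
```

With this definition the smallest possible value is 1, reached when the ratings split evenly between two values, so lower averages point toward a more bimodal (polarized) distribution. Swapping in a packaged kurtosis function should work too, but check whether it reports raw or excess kurtosis before comparing numbers.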
https://stackoverflow.com/questions/65946805