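I have a huge dataset and need to match samples based on certain criteria. For example, for every movie star in a given location and borough, I want to find two (random) people who are not movie stars. Movie stars are coded 1 and non-movie stars 0.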
location<- c('manhattan', 'manhattan' ,'manhattan', 'manhattan', 'manhattan', 'manhattan')
moviestar<- c(0,1,0,0,0,1)
id<- c(1,2,3,4,5,6)
borough <- c('williamsburg', 'williamsburg', 'williamsburg', 'williamsburg', 'williamsburg','williamsburg')
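df<- data.frame(location,moviestar, borough)
I want to create a subset in which each movie star is grouped with two non-movie stars (picked at random) who live in the same location and borough. Any suggestions? Basically, there are 6 people living in Manhattan, two of whom are stars, and I want to match each star. In this case, 2 and 6 are the stars, so I want the final data to contain the matched groups.
The output I expect looks like this: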
> subset
location   moviestar  borough       id  matchpairid
manhattan  1          williamsburg  2   match1
manhattan  0          williamsburg  1   match1
manhattan  0          williamsburg  5   match1
manhattan  1          williamsburg  6   match2
manhattan  0          williamsburg  3   match2
manhattan  0          williamsburg  5   match2
Posted 2017-05-05 18:04:11
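In data.table, you can do this with the following approach:
library(data.table)
# Flag location-borough groups containing at least one movie star, then, within
# flagged groups, collect row indices: 2 random non-stars per star plus the stars.
setDT(df)[df[, keeper := max(moviestar) == 1, by = .(location, borough)][(keeper),
    if (any(moviestar == 0)) c(sample(.I[moviestar == 0], 2 * sum(moviestar)),
                               .I[moviestar == 1]),
    by = .(location, borough)]$V1
][, keeper := NULL][]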
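    location moviestar      borough
1: manhattan         0 williamsburg
2: manhattan         0 williamsburg
3: manhattan         1 williamsburg
keeper flags the location-borough groups that contain at least one movie star, and is then used to subset the data. In the second j expression, check whether the group has any non-movie stars. If it does, sample two non-movie-star rows (using .I) for each movie star, and include the movie stars as well. $V1 extracts these row indices, which are fed back into the original dataset to get the result.
keeper := NULL removes the intermediate keeper variable, and the [] at the end prints the result.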
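To make the role of .I more concrete, here is a minimal sketch on the question's sample data. It only illustrates that, when grouping with by, .I holds each row's position in the full table and that an unnamed j result comes back in a column called V1. The names dt, star_rows and nonstar_rows are made up for this illustration and do not appear in the answer.

```r
library(data.table)

# Sample data from the question, rebuilt as a data.table (dt is an illustrative name).
dt <- data.table(
  location  = rep("manhattan", 6),
  moviestar = c(0, 1, 0, 0, 0, 1),
  borough   = rep("williamsburg", 6)
)

# Within each location-borough group, .I gives the row positions in dt.
# The unnamed result of j is returned in a column named V1.
star_rows    <- dt[, .I[moviestar == 1], by = .(location, borough)]$V1
nonstar_rows <- dt[, .I[moviestar == 0], by = .(location, borough)]$V1

star_rows      # 2 6
nonstar_rows   # 1 3 4 5

# Row indices can be passed straight back to the table to pull those rows.
dt[star_rows]
```

Sampling from nonstar_rows and combining the draw with star_rows is what produces the index vector that the answer feeds into the outer subset.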
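Posted 2017-05-05 18:00:46
You can do this by counting the number of movie stars and non-movie stars in each group, and then filtering within each group on that basis: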
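library(dplyr)
df %>%
  # Count stars and non-stars within each location-borough group.
  group_by(location, borough) %>%
  mutate(num_movie_stars = sum(moviestar),
         num_non_movie_stars = sum(1 - moviestar)) %>%
  # Number the stars and the non-stars separately within each group.
  group_by(location, borough, moviestar) %>%
  # Keep a star only if two non-stars are available for it, and keep
  # at most two non-stars per star.
  filter(moviestar & row_number() <= num_non_movie_stars / 2 |
         !moviestar & row_number() <= num_movie_stars * 2) %>%
  ungroup()
Posted 2017-05-05 18:08:14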
Here is another simple answer, using base R only:
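starstruck <- function(location, borough, df) {
  # Rows in the requested location and borough.
  subsamp <- df[which(location == df$location & borough == df$borough), ]
  stars <- subsamp[subsamp$moviestar == 1, ]
  nostars <- subsamp[subsamp$moviestar == 0, ]
  # One random star plus two random non-stars, sampled without replacement.
  randomcombo <- rbind(stars[sample(nrow(stars), 1, FALSE), ],
                       nostars[sample(nrow(nostars), 2, FALSE), ])
  # Restore the original row order (assumes default numeric row names).
  randomcombo[order(as.integer(rownames(randomcombo))), ]
}
starstruck("manhattan", "williamsburg", df)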
#    location moviestar      borough
#1 manhattan         0 williamsburg
#2 manhattan         1 williamsburg
#3 manhattan         0 williamsburg
https://stackoverflow.com/questions/43810942
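The starstruck function above returns one star with two non-stars for a single location and borough. As a rough sketch of how the same idea might be extended to every star, with a matchpairid column like the one in the question's expected output, something along these lines could work. The helper match_stars and its n_controls argument are hypothetical names introduced here. The sketch also assumes that a non-star should not be reused within a group, which differs from the question's sample output, where id 5 appears in both matches.

```r
# Sample data from the question, with the id column included (an assumption:
# the question's data.frame() call leaves id out).
df <- data.frame(
  location  = rep("manhattan", 6),
  moviestar = c(0, 1, 0, 0, 0, 1),
  borough   = rep("williamsburg", 6),
  id        = 1:6
)

# Hypothetical helper: for each star, take n_controls random non-stars from the
# same location and borough, without reusing a non-star within that group.
match_stars <- function(df, n_controls = 2) {
  groups <- split(seq_len(nrow(df)), list(df$location, df$borough), drop = TRUE)
  out <- list()
  match_no <- 0
  for (rows in groups) {
    stars <- rows[df$moviestar[rows] == 1]
    pool  <- rows[df$moviestar[rows] == 0]
    # Shuffle by position, so a pool of length one is not misread by sample().
    if (length(pool) > 1) pool <- pool[sample.int(length(pool))]
    for (s in stars) {
      if (length(pool) < n_controls) break  # not enough non-stars left
      picked <- pool[seq_len(n_controls)]
      pool   <- pool[seq_along(pool) > n_controls]
      match_no <- match_no + 1
      set <- df[c(s, picked), ]
      set$matchpairid <- paste0("match", match_no)
      out[[length(out) + 1]] <- set
    }
  }
  # One data frame: each star followed by its matched non-stars.
  do.call(rbind, out)
}

set.seed(1)  # only to make the random draw repeatable
matched <- match_stars(df)
matched
```

On the sample data this gives six rows: stars 2 and 6, each followed by two of the four non-stars, labelled match1 and match2. Stars left over once the pool runs short are dropped rather than given partial matches, which may or may not be the behaviour wanted on real data.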