Q: Finding matching pairs (or records) in a dataset

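Stack Overflow user

Asked on 2017-05-05 17:44:20

Answers: 3 | Views: 107 | Followers: 0 | Votes: 0

I have a huge dataset and need to match samples based on some criteria. For example, for every movie star in a given location and borough, find two people (at random) who are not movie stars. Movie stars are coded 1 and non-movie stars 0.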

 location <- c('manhattan', 'manhattan', 'manhattan', 'manhattan', 'manhattan', 'manhattan')
 moviestar <- c(0, 1, 0, 0, 0, 1)  # 1 = movie star, 0 = non-movie star
 id <- c(1, 2, 3, 4, 5, 6)
 borough <- c('williamsburg', 'williamsburg', 'williamsburg', 'williamsburg', 'williamsburg', 'williamsburg')

 df <- data.frame(location, moviestar, borough)

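I want to create a subset in which each movie star is grouped with two non-movie stars (picked at random) who live in the same location and borough. Any suggestions? Basically, six people live in Manhattan and two of them are movie stars, and I want to match each star. In this case ids 2 and 6 are the stars, so I would like the final data to contain matched sets like the following.

The output I am expecting looks like this: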

  > subset
  location   moviestar  borough       id  matchpairid
  manhattan  1          williamsburg  2   match1
  manhattan  0          williamsburg  1   match1
  manhattan  0          williamsburg  5   match1
  manhattan  1          williamsburg  6   match2
  manhattan  0          williamsburg  3   match2
  manhattan  0          williamsburg  5   match2

dataframe

dplyr

3 Answers

Stack Overflow user

Accepted answer

Posted on 2017-05-05 18:04:11

With data.table, you can do this as follows:

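library(data.table)

# keeper flags each (location, borough) group that has at least one movie star;
# the inner j returns row indices (.I): two random non-stars per star, plus the stars
setDT(df)[df[, keeper := max(moviestar) == 1, by=.(location, borough)][(keeper),
            if(any(moviestar == 0)) c(sample(.I[moviestar == 0], 2 * sum(moviestar)),
                                             .I[moviestar == 1]), by=.(location, borough)]$V1
          ][, keeper := NULL][]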

    location moviestar      borough
1: manhattan         0 williamsburg
2: manhattan         0 williamsburg
3: manhattan         1 williamsburg

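keeper is assigned as a flag for location/borough groups that contain a movie star, and is then used to subset the data. The second j statement checks whether the group has any non-movie stars. If it does, it samples two non-movie-star rows (using .I) for each movie star and also includes the movie stars themselves. $V1 extracts these row indices, which are fed back into the original dataset to pull the result.

keeper := NULL removes the intermediate keeper variable, and the trailing [] prints the result.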
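
None of the code here attaches the matchpairid label that the question asks for. The following is a minimal base R sketch of one way to do it, not the answerer's method. The helper name match_group and its k argument are hypothetical, and adding id to the data frame is an assumption (the question defines id but leaves it out of df). Unlike the expected output in the question, this sketch never reuses a non-star across sets.

```r
# Example data from the question, with id included (assumption)
df <- data.frame(
  id        = 1:6,
  location  = "manhattan",
  moviestar = c(0, 1, 0, 0, 0, 1),
  borough   = "williamsburg",
  stringsAsFactors = FALSE
)

# Hypothetical helper: build matched sets for one (location, borough) group
match_group <- function(g, k = 2) {
  stars    <- which(g$moviestar == 1)
  nonstars <- which(g$moviestar == 0)
  # Skip groups that cannot give every star k distinct non-stars
  if (length(stars) == 0 || length(nonstars) < k * length(stars)) return(NULL)
  # Draw positions with sample.int so a single candidate is handled correctly
  picked <- nonstars[sample.int(length(nonstars), k * length(stars))]
  sets <- lapply(seq_along(stars), function(i) {
    set <- g[c(stars[i], picked[((i - 1) * k + 1):(i * k)]), ]
    set$matchpairid <- paste0("match", i)
    set
  })
  do.call(rbind, sets)
}

set.seed(1)  # only so the draw is reproducible
groups  <- split(df, list(df$location, df$borough), drop = TRUE)
matched <- do.call(rbind, lapply(groups, match_group))
rownames(matched) <- NULL
matched
```

Labels are numbered within each group, so with several location/borough groups the group columns are needed alongside matchpairid to identify a set uniquely.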

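Votes: 0

Stack Overflow user

Posted on 2017-05-05 18:00:46

You can do this by counting the number of movie stars and non-movie stars in each group, then filtering within each group based on those counts: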

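library(dplyr)
df %>%
  # Count stars and non-stars within each location/borough group
  group_by(location, borough) %>%
  mutate(num_movie_stars = sum(moviestar),
         num_non_movie_stars = sum(1 - moviestar)) %>%
  group_by(location, borough, moviestar) %>%
  # Keep one star per two available non-stars, and two non-stars per star
  # (the first rows of each group, not a random draw)
  filter(moviestar == 1 & row_number() <= num_non_movie_stars / 2 |
         moviestar == 0 & row_number() <= num_movie_stars * 2) %>%
  ungroup()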
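
The filter above keeps the first rows of each group, so the selection is not random. A rough base R sketch of the same counting rule applied after shuffling the rows is shown below. It is an illustration under assumptions, not the answerer's code: the id column and the variable names shuffled, rn, and balanced are introduced here.

```r
df <- data.frame(
  id        = 1:6,
  location  = "manhattan",
  moviestar = c(0, 1, 0, 0, 0, 1),
  borough   = "williamsburg",
  stringsAsFactors = FALSE
)

set.seed(42)  # only so the shuffle is reproducible
# Random row order, so "the first n rows" of a group is a random n
shuffled <- df[sample(nrow(df)), ]

grp <- interaction(shuffled$location, shuffled$borough, drop = TRUE)
# Per-group counts, repeated on every row of the group
n_star    <- ave(shuffled$moviestar, grp, FUN = sum)
n_nonstar <- ave(1 - shuffled$moviestar, grp, FUN = sum)
# Row number within each (group, moviestar) cell
rn <- ave(seq_len(nrow(shuffled)), grp, shuffled$moviestar, FUN = seq_along)

keep <- ifelse(shuffled$moviestar == 1,
               rn <= n_nonstar / 2,  # at most one star per two non-stars
               rn <= n_star * 2)     # at most two non-stars per star
balanced <- shuffled[keep, ]
balanced
```

With the example data (two stars, four non-stars) both thresholds are met exactly, so all six rows are kept; the shuffle only matters when a group has more candidates than the ratio allows.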

Votes: 0

Stack Overflow user

Posted on 2017-05-05 18:08:14

And a simple base R answer:

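starstruck <- function(location, borough, df){
  # Rows in the requested location and borough
  subsamp <- df[which(location == df$location & borough == df$borough),]
  stars <- subsamp[subsamp$moviestar == 1,]
  nostars <- subsamp[subsamp$moviestar == 0,]
  # One random star plus two random non-stars, sampled without replacement
  randomcombo <- rbind(stars[sample(nrow(stars), 1, replace = FALSE),], 
                       nostars[sample(nrow(nostars), 2, replace = FALSE),])
  # Sort the result by row name
  randomcombo[order(rownames(randomcombo)),]
}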

starstruck("manhattan", "williamsburg", df)
#   location moviestar      borough
#1 manhattan         0 williamsburg
#2 manhattan         1 williamsburg
#3 manhattan         0 williamsburg
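
starstruck() handles one location/borough pair per call and returns one star with two non-stars. A possible way to run it over every pair in the data is sketched below. The second group (brooklyn/bushwick) is made-up illustration data, and the guard that skips groups without a star or without two non-stars is an addition, since sample() fails when asked for two rows from fewer than two non-stars.

```r
# Question data plus a hypothetical second group that has no movie star
df <- data.frame(
  location  = c(rep("manhattan", 6), rep("brooklyn", 3)),
  moviestar = c(0, 1, 0, 0, 0, 1, 0, 0, 0),
  borough   = c(rep("williamsburg", 6), rep("bushwick", 3)),
  stringsAsFactors = FALSE
)

# starstruck() as in the answer above, repeated so the sketch runs on its own
starstruck <- function(location, borough, df) {
  subsamp <- df[which(location == df$location & borough == df$borough), ]
  stars   <- subsamp[subsamp$moviestar == 1, ]
  nostars <- subsamp[subsamp$moviestar == 0, ]
  randomcombo <- rbind(stars[sample(nrow(stars), 1, replace = FALSE), ],
                       nostars[sample(nrow(nostars), 2, replace = FALSE), ])
  randomcombo[order(rownames(randomcombo)), ]
}

# Every distinct location/borough pair present in the data
combos <- unique(df[, c("location", "borough")])

results <- Map(function(loc, bor) {
  grp <- df[df$location == loc & df$borough == bor, ]
  # Skip groups that cannot supply one star and two non-stars
  if (sum(grp$moviestar == 1) < 1 || sum(grp$moviestar == 0) < 2) return(NULL)
  starstruck(loc, bor, df)
}, combos$location, combos$borough)

all_matches <- do.call(rbind, unname(results))
all_matches
```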

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/43810942
