Q: Intersecting multiple data frames in R

Stack Overflow user

Asked on 2015-03-31 00:18:46

1 answer · 271 views · 0 followers · 0 votes

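Hi guys,

I am trying to "intersect" multiple data frames in R.

My data frames come from a high-throughput sequencing experiment and look like this:

df1: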

         chr  pos orient weight in_nucleosome in_subtelo
1  NC_001133  999      +      1          TRUE       TRUE
2  NC_001133 1505      -     14         FALSE       TRUE
3  NC_001133 1525      -      2          TRUE       TRUE
4  NC_001134  480      +      1          TRUE       TRUE
5  NC_001134  509      +      2         FALSE       TRUE
6  NC_001134  539      +      3         FALSE       TRUE
7  NC_001135 1218      +      1          TRUE       TRUE
8  NC_001135 1228      +      2          TRUE       TRUE
9  NC_001135 1273      +      1          TRUE       TRUE
10 NC_001136  362      +      1          TRUE       TRUE

and

df2:

         chr                feature  start    end orient
1  NC_001133                    ARS    707    776      .
2  NC_001133                    ARS   7997   8547      .
3  NC_001133                    ARS  30946  31183      .
4  NC_001133 ARS_consensus_sequence  31002  31018      +
5  NC_001133 ARS_consensus_sequence  70418  70434      -
6  NC_001133 ARS_consensus_sequence 124463 124479      -
7  NC_001136  blocked_reading_frame 721071 721481      -
8  NC_001137  blocked_reading_frame 375215 377614      -
9  NC_001141  blocked_reading_frame  29032  30048      +
10 NC_001133                    CDS    335    649      +

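What I want to find out is, for a given chromosome (here "chr") and for each df2$feature, whether df2$start < df1$pos < df2$end. Then I want to add a column to df1, named after the df2$feature under consideration, and filled with TRUE or FALSE according to the condition above.

I'm fairly sure this calls for the apply family of functions, probably nested inside one another, but after several hours of trying I couldn't get it to work.

I did manage it with nested for loops, in a very inelegant, verbose, and error-prone way, but I'm sure there is a better, simpler, and probably faster solution.

Thanks for reading,

Antoine.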

dataframe

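1 Answer

Stack Overflow user

Answered on 2015-03-31 01:12:43

While this may be doable with dplyr (I tried, but I'm not proficient enough with it), I was (I think) able to do it using foreach and iterators.

Your data: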

df1 <- structure(list(chr = c("NC_001133", "NC_001133", "NC_001133", "NC_001134", "NC_001134", "NC_001134", "NC_001135", "NC_001135", "NC_001135", "NC_001136"),
                      pos = c(999L, 1505L, 1525L, 480L, 509L, 539L, 1218L, 1228L, 1273L, 362L),
                      orient = c("+", "-", "-", "+", "+", "+", "+", "+", "+", "+"),
                      weight = c(1L, 14L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 1L),
                      in_nucleosome = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
                      in_subtelo = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)),
                 .Names = c("chr", "pos", "orient", "weight", "in_nucleosome", "in_subtelo"),
                 class = "data.frame",
                 row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

df2 <- structure(list(chr = c("NC_001133", "NC_001133", "NC_001133", "NC_001133", "NC_001133", "NC_001133", "NC_001136", "NC_001137", "NC_001141", "NC_001133"),
                      feature = c("ARS", "ARS", "ARS", "ARS_consensus_sequence", "ARS_consensus_sequence", "ARS_consensus_sequence", "blocked_reading_frame", "blocked_reading_frame", "blocked_reading_frame", "CDS"),
                      start = c(707L, 7997L, 30946L, 31002L, 70418L, 124463L, 721071L, 375215L, 29032L, 335L),
                      end = c(776L, 8547L, 31183L, 31018L, 70434L, 124479L, 721481L, 377614L, 30048L, 649L),
                      orient = c(".", ".", ".", "+", "-", "-", "-", "-", "+", "+")),
                 .Names = c("chr", "feature", "start", "end", "orient"),
                 class = "data.frame",
                 row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Since I don't think your sample data contains any matches, I'll inject some:

## to be able to find *something*
df1$pos <- c(999, 1505, 8000, 480, 509, 539, 1218, 1228, 1272, 721072)

Code:

library(foreach)
library(iterators)

## pre-populate df1 with necessary columns
for (col in unique(df2$feature)) df1[,col] <- FALSE

df1a <- foreach (subdf1 = iter(df1, by='row'), .combine=rbind) %do% {
    features <- unique(df2$feature[df2$chr== subdf1$chr])
    for (feature in features) {
        idx <- (df2$chr == subdf1$chr) & (feature == df2$feature)
        if (any(idx)) {  # length(idx) is always nrow(df2); check for matching rows instead
            subdf1[feature] <- any((df2$start[idx] < subdf1$pos) & (subdf1$pos < df2$end[idx]))
        }
    }
    subdf1
}

df1a
##          chr    pos orient weight in_nucleosome in_subtelo   ARS
## 1  NC_001133    999      +      1          TRUE       TRUE FALSE
## 2  NC_001133   1505      -     14         FALSE       TRUE FALSE
## 3  NC_001133   8000      -      2          TRUE       TRUE  TRUE
## 4  NC_001134    480      +      1          TRUE       TRUE FALSE
## 5  NC_001134    509      +      2         FALSE       TRUE FALSE
## 6  NC_001134    539      +      3         FALSE       TRUE FALSE
## 7  NC_001135   1218      +      1          TRUE       TRUE FALSE
## 8  NC_001135   1228      +      2          TRUE       TRUE FALSE
## 9  NC_001135   1272      +      1          TRUE       TRUE FALSE
## 10 NC_001136 721072      +      1          TRUE       TRUE FALSE
##    ARS_consensus_sequence blocked_reading_frame   CDS
## 1                   FALSE                 FALSE FALSE
## 2                   FALSE                 FALSE FALSE
## 3                   FALSE                 FALSE FALSE
## 4                   FALSE                 FALSE FALSE
## 5                   FALSE                 FALSE FALSE
## 6                   FALSE                 FALSE FALSE
## 7                   FALSE                 FALSE FALSE
## 8                   FALSE                 FALSE FALSE
## 9                   FALSE                 FALSE FALSE
## 10                  FALSE                  TRUE FALSE
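
Since the question asked about the apply family, here is a minimal base-R sketch of the same test without foreach. It is not part of the original answer: the helper name hits_feature is hypothetical, and the data is a cut-down copy of df1 and df2 above that keeps only the columns the test needs. It uses the same strict inequalities (start < pos < end) as the loop above.

```r
# Cut-down copies of the data above (assumption: only chr/pos and
# chr/feature/start/end are needed for the overlap test).
df1 <- data.frame(
  chr = c("NC_001133", "NC_001133", "NC_001133", "NC_001136"),
  pos = c(999, 1505, 8000, 721072),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  chr     = c("NC_001133", "NC_001133", "NC_001136", "NC_001133"),
  feature = c("ARS", "ARS", "blocked_reading_frame", "CDS"),
  start   = c(707, 7997, 721071, 335),
  end     = c(776, 8547, 721481, 649),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row of `pts`, TRUE if its position lies
# strictly inside at least one interval of `ivs` on the same chromosome.
hits_feature <- function(pts, ivs) {
  vapply(seq_len(nrow(pts)), function(i) {
    any(ivs$chr == pts$chr[i] &
        ivs$start < pts$pos[i] &
        pts$pos[i] < ivs$end)
  }, logical(1))
}

# One logical column per feature, named after the feature.
for (f in unique(df2$feature)) {
  df1[[f]] <- hits_feature(df1, df2[df2$feature == f, , drop = FALSE])
}

df1
```

The comparison against all intervals of one feature is vectorized, so only the rows of df1 are iterated explicitly. For very large inputs a dedicated interval-overlap tool may scale better, but that is outside what the answer covers.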

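A handy side effect of using foreach and iterators is that, if your data is large and you are using doParallel, you only need to replace %do% with %dopar% and the loop runs in parallel on whatever backend you registered. You could precede all of the above with something like this: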

library(doParallel)
cl <- makeCluster(detectCores() - 1) # leaving one available is "A Good Thing (tm)"
registerDoParallel(cl)

## replace %do% with %dopar%, do all of the above code

## clean up
stopCluster(cl)
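
To make the %do% to %dopar% swap concrete, here is a small end-to-end sketch. It is an illustration only: the data frames pts and ivs, the column name hit, and the choice of two workers are assumptions, and it requires the foreach, iterators, and doParallel packages to be installed.

```r
library(foreach)
library(iterators)
library(doParallel)

# Toy data (hypothetical): two positions and two intervals.
pts <- data.frame(chr = c("NC_001133", "NC_001136"),
                  pos = c(8000, 100),
                  stringsAsFactors = FALSE)
ivs <- data.frame(chr   = c("NC_001133", "NC_001136"),
                  start = c(7997, 721071),
                  end   = c(8547, 721481),
                  stringsAsFactors = FALSE)

cl <- makeCluster(2)      # two workers, chosen arbitrarily for the sketch
registerDoParallel(cl)

# Same row-wise pattern as above, but each row is handled by a worker.
res <- foreach(pt = iter(pts, by = "row"), .combine = rbind) %dopar% {
  pt$hit <- any(ivs$chr == pt$chr & ivs$start < pt$pos & pt$pos < ivs$end)
  pt
}

stopCluster(cl)           # always release the workers when done
res
```

For a handful of rows the cost of starting workers outweighs any gain, so this is only likely to pay off on large inputs.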

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/29351222
