This is a subset of my large data set:
gene feature reads
A anot 2
A 3ss_A 3
A 3ss_B 5
B 5ss_A 1
B anot 4
C 3ss_A 2
C 3ss_B 8
C anot 3
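C 5ss_A 6
Within each gene, I want to divide the reads of every 3ss and 5ss feature by the reads of that gene's "anot" feature. Each gene has multiple features (not all shown here), but exactly one "anot" feature.
The expected output is: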
gene feature reads ratio
A anot 2 1
A 3ss_A 3 1.5
A 3ss_B 5 2.5
B 5ss_A 1 0.25
B anot 4 1
C 3ss_A 2 0.666666667
C 3ss_B 8 2.666666667
C anot 3 1
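C 5ss_A 6 2
How can I do this in R? Thanks.
Posted on 2016-04-16 20:45:07
Here are several alternatives:
1) ave. Use ave like this. The function fun is passed a vector of row numbers for one gene and returns the vector of ratios for that gene. No packages are used.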
fun <- function(ix) with(DF[ix, ], reads / reads[feature == "anot"])
transform(DF, ratio = ave(1:nrow(DF), gene, FUN = fun))
giving:
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
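9 C 5ss_A 6 2.0000000
1a) ave. Here is another way to use ave. It replaces every non-anot reads value with NA, uses na.omit within each gene to pick out the single non-NA value, and divides reads by that value.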
transform(DF, ratio =
  reads / ave(ifelse(feature == "anot", reads, NA), gene, FUN = na.omit))
giving:
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
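9 C 5ss_A 6 2.0000000
1b) ave. Here is another ave variant. It is particularly concise but assumes that the reads value for anot is never negative (as is the case in the question's example). It builds a vector equal to reads on anot rows and zero elsewhere, then takes the maximum within each gene:
transform(DF, ratio = reads / ave((feature == "anot") * reads, gene, FUN = max))
giving: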
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
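9 C 5ss_A 6 2.0000000
2) by. Another alternative that also uses no packages is by. Here the function funby takes a subset of the rows of DF and returns that subset with the ratio column appended.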
funby <- function(x) transform(x, ratio = reads / reads[feature == "anot"])
do.call("rbind", by(DF, DF$gene, funby))
giving:
gene feature reads ratio
A.1 A anot 2 1.0000000
A.2 A 3ss_A 3 1.5000000
A.3 A 3ss_B 5 2.5000000
B.4 B 5ss_A 1 0.2500000
B.5 B anot 4 1.0000000
C.6 C 3ss_A 2 0.6666667
C.7 C 3ss_B 8 2.6666667
C.8 C anot 3 1.0000000
C.9 C 5ss_A 6 2.00000003) rep/table这也不使用包。它假设DF是按基因排序的(问题中的示例就是这种情况)。它重复每个anot读取该基因中的行数,然后将reads除以行数。
transform(DF, ratio = reads / rep(reads[feature == "anot"], table(gene)))给予:
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
9 C 5ss_A 6 2.0000000
4) dplyr. Using the dplyr package:
library(dplyr)
DF %>%
group_by(gene) %>%
mutate(ratio = reads / reads[feature == "anot"]) %>%
  ungroup()
giving:
Source: local data frame [9 x 4]
gene feature reads ratio
(fctr) (fctr) (int) (dbl)
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
9 C 5ss_A 6 2.0000000
5) data.table. Using the data.table package:
library(data.table)
DT <- as.data.table(DF)
DT[, ratio := reads / reads[feature == "anot"], by = "gene"]
giving:
> DT
gene feature reads ratio
1: A anot 2 1.0000000
2: A 3ss_A 3 1.5000000
3: A 3ss_B 5 2.5000000
4: B 5ss_A 1 0.2500000
5: B anot 4 1.0000000
6: C 3ss_A 2 0.6666667
7: C 3ss_B 8 2.6666667
8: C anot 3 1.0000000
9: C 5ss_A 6 2.0000000
Note: The input DF in reproducible form is:
Lines <- "gene feature reads
A anot 2
A 3ss_A 3
A 3ss_B 5
B 5ss_A 1
B anot 4
C 3ss_A 2
C 3ss_B 8
C anot 3
C 5ss_A 6"
DF <- read.table(text = Lines, header = TRUE)
Posted on 2016-04-16 20:41:21
You could try something like this:
anot_reads <- yourdata[yourdata$feature == "anot",]$reads
names(anot_reads) <- yourdata[yourdata$feature == "anot",]$gene
# index by gene name, not by factor code, in case gene is a factor
yourdata$ratio <- yourdata$reads / anot_reads[as.character(yourdata$gene)]
Posted on 2016-04-16 20:51:34
In base R you can use:
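# assumes the rows of each gene are contiguous; works whether gene is a factor or character
df$ratio <- unlist(lapply(unique(as.character(df$gene)),
  function(l) with(subset(df, gene == l), reads / reads[feature == "anot"])))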
gene feature reads ratio
1 A anot 2 1.0000000
2 A 3ss_A 3 1.5000000
3 A 3ss_B 5 2.5000000
4 B 5ss_A 1 0.2500000
5 B anot 4 1.0000000
6 C 3ss_A 2 0.6666667
7 C 3ss_B 8 2.6666667
8 C anot 3 1.0000000
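9 C 5ss_A 6 2.0000000
In words: for each value of gene, subset df and divide reads by the reads value where feature == "anot". Then unlist the result and store it as a new column in the data.frame.
There may well be a shorter option, though.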
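One possibly shorter base R variant, sketched here rather than taken from any of the answers, looks up each gene's anot reads with match(). The data frame df is rebuilt from the question's sample, and the intermediate name anot is purely illustrative. The sketch assumes, as the question states, that every gene has exactly one "anot" row.

```r
# Rebuild the question's sample data (same values as in the post).
df <- data.frame(
  gene    = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  feature = c("anot", "3ss_A", "3ss_B", "5ss_A", "anot",
              "3ss_A", "3ss_B", "anot", "5ss_A"),
  reads   = c(2, 3, 5, 1, 4, 2, 8, 3, 6)
)

# Keep only the "anot" rows: one reference value per gene.
anot <- df[df$feature == "anot", ]

# Assumption from the question: exactly one "anot" row for every gene.
stopifnot(!anyDuplicated(anot$gene), all(df$gene %in% anot$gene))

# match() gives, for every row of df, the position of its gene among the
# "anot" rows, so neither row order nor the type of gene matters.
df$ratio <- df$reads / anot$reads[match(df$gene, anot$gene)]
print(df)
```

Because the lookup is done by value, this does not depend on the rows being sorted by gene, unlike the rep/table approach.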
https://stackoverflow.com/questions/36664174