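I am new to R and would appreciate some help, if possible. I am analysing RNAseq data and, to normalize it, I need to divide the read counts by the sum of the exon lengths (in kilobases) of each gene in my list.
Basically, I have two .csv files:
The second one looks like this: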
#EnsemblGeneID #ExonSize
ENSG00000000003 198
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 83
ENSG00000000003 107
ENSG00000000003 1316
ENSG00000000003 498
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 27
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 46
ENSG00000000003 311
ENSG00000000003 97
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 83
ENSG00000000003 107
ENSG00000000003 243
ENSG00000000003 85
ENSG00000000003 48
ENSG00000000003 218
ENSG00000000005 264
ENSG00000000005 131
ENSG00000000005 140
ENSG00000000005 101
ENSG00000000005 153
ENSG00000000005 166
ENSG00000000005 377
ENSG00000000005 411
ENSG00000000005 101
ENSG00000000005 27
ENSG00000000419 187
ENSG00000000419 99
ENSG00000000419 33
ENSG00000000419 76
ENSG00000000419 25
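ENSG00000000419 95
What I would like is a script that computes the sum of #ExonSize for each #EnsemblGeneID and creates a new .csv file to store the result. As you can see, each gene in my list has a different number of exons, so each gene ID appears on a different number of rows, but the result I want in the end looks like this: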
#EnsemblGeneID #SumExonSize
ENSG00000000003 5121
ENSG00000000005 1871
ENSG00000000419 515
Any help?
Thanks in advance
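Posted on 2018-04-20 11:34:10
If I understand your question correctly, you first need to load the data into a data frame, as shown below
df <- read.csv("your_path/input.csv", header = TRUE, stringsAsFactors = FALSE)
Then compute the grouped sum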
library(dplyr)
df1 <- df %>%
group_by(EnsemblGeneID) %>%
summarise(SumExonSize = sum(ExonSize))
Finally, write it to a file using write.csv.
write.csv(df1, "your_path/output.csv", row.names = FALSE)
When you run the above dplyr code on the following sample data
df <- structure(list(EnsemblGeneID = c("ENSG00000000003", "ENSG00000000003",
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003",
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003",
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003",
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003",
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003",
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003",
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000005",
"ENSG00000000005", "ENSG00000000005", "ENSG00000000005", "ENSG00000000005",
"ENSG00000000005", "ENSG00000000005", "ENSG00000000005", "ENSG00000000005",
"ENSG00000000005", "ENSG00000000419", "ENSG00000000419", "ENSG00000000419",
"ENSG00000000419", "ENSG00000000419", "ENSG00000000419"), ExonSize = c(198L,
188L, 74L, 98L, 134L, 83L, 107L, 1316L, 498L, 188L, 74L, 98L,
134L, 27L, 188L, 74L, 98L, 46L, 311L, 97L, 74L, 98L, 134L, 83L,
107L, 243L, 85L, 48L, 218L, 264L, 131L, 140L, 101L, 153L, 166L,
377L, 411L, 101L, 27L, 187L, 99L, 33L, 76L, 25L, 95L)), .Names = c("EnsemblGeneID",
"ExonSize"), class = "data.frame", row.names = c(NA, -45L))
the output is
df1
# EnsemblGeneID SumExonSize
#1 ENSG00000000003 5121
#2 ENSG00000000005 1871
#3 ENSG00000000419 515
https://stackoverflow.com/questions/49940551
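If you would rather not depend on dplyr, the same grouped sum can probably be done in base R with aggregate(). This is only a minimal sketch: it assumes the columns are named EnsemblGeneID and ExonSize, and the small data frame `exons` is a hypothetical excerpt built from two of the genes in the sample data, not your real file.

```r
# Hypothetical excerpt of the sample data (two genes only); in practice
# this would be the data frame read from your own .csv file.
exons <- data.frame(
  EnsemblGeneID = c(rep("ENSG00000000005", 10), rep("ENSG00000000419", 6)),
  ExonSize = c(264L, 131L, 140L, 101L, 153L, 166L, 377L, 411L, 101L, 27L,
               187L, 99L, 33L, 76L, 25L, 95L),
  stringsAsFactors = FALSE
)

# Sum ExonSize within each EnsemblGeneID using the formula interface
sums <- aggregate(ExonSize ~ EnsemblGeneID, data = exons, FUN = sum)

# Rename the summed column to match the desired output header
names(sums)[names(sums) == "ExonSize"] <- "SumExonSize"

print(sums)
```

The resulting `sums` data frame can be passed to write.csv in the same way as df1.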
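The question also mentions dividing read counts by the summed exon length in kilobases. The counts file is not shown, so the following is only a sketch under assumptions: the data frame `counts` and its ReadCount column are hypothetical, with made-up values, while the SumExonSize values are the per-gene sums from the output shown in the answer.

```r
# Hypothetical read counts per gene (made-up values; the real counts
# file is not shown in the question).
counts <- data.frame(
  EnsemblGeneID = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
  ReadCount = c(1024L, 37L, 515L),
  stringsAsFactors = FALSE
)

# Per-gene summed exon sizes, as in the answer's output
exon_sums <- data.frame(
  EnsemblGeneID = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
  SumExonSize = c(5121L, 1871L, 515L),
  stringsAsFactors = FALSE
)

# Join the two tables on the gene ID
merged <- merge(counts, exon_sums, by = "EnsemblGeneID")

# Convert the summed exon length from bases to kilobases
merged$LengthKb <- merged$SumExonSize / 1000

# Read count divided by the exon length in kilobases
merged$CountsPerKb <- merged$ReadCount / merged$LengthKb

print(merged)
```

merge() keeps only the gene IDs present in both tables, so genes missing from either file would be dropped silently; passing all = TRUE would keep them with NA values instead.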