我正在运行一个自定义函数,它为data.frame的每一行生成并保存一些情节:
zz <- "chr start end name max
chr11 66184332 66197785 NPAS4_CBP20 90
chr11 62666002 62683613 BC047540_CBP20 100
chr1 9824542 9828548 MIR3687_CBP20 500
chr6 33239767 33259282 B3GALT4_Pol2 1000
chr20 244996112 245029580 HNRNPU-AS1_Pol2 450
chr20 62487823 62525914 ABHD16B_Pol2 370
chr12 121146198 121179996 ACADS_Pol2 90"
my.genes <- read.table(text=zz, header = TRUE)该函数有两个步骤,一个是慢的(~40秒),但只需要对每个chr运行一个,然后对该chr的每一行运行。在伪码中,应该是这样的:
for each chr createData
for each nameInChr do somethingWithData.我的问题是,如何优化这一点?嵌套d_ply?按照chr对data.frame进行排序,然后对唯一的应用(mygene$chr)进行某种形式的应用?
如果不包含实际的功能,我很抱歉,但这是一个很长的问题,我认为这是一个更好的实践/理论问题。但是,如果需要,我可以添加适当的代码。
编辑
这里是简化函数(plotGviz)。实际上,我不认为d_ply会工作,因为它会对每一行运行UcscTrack (我是正确的吗?)最终的目标是对每个唯一的染色体运行UcscTrack,然后使用这些信息为每个基因(名称)构建小区。这两个步骤(UcscTrack和绘图)都很费时,但基本原理是,首先运行代码的非唯一部分(适用于chr),可以节省整个表的时间--该表是我的一小部分。
自定义plotGviz中的函数不能仅以我设置它们的方式进行优化。
################
### libraries ##
library(Gviz)
library(GenomicRanges)
library(GenomicFeatures)
library(data.table)
library("RColorBrewer")
#######################
d_ply(my.genes, .(chr, name), plotGviz)
plotGviz <- function(gene) {
chr <- gene$chr
mygene.start <- gene$start
mygene.end <- gene$end
mygene <- gene$name
max.cov <- gene$max
#################################################
## this part runs only once for each unique chr
## UcscTrack is what takes most time
## Gene annotations ##
knownGenes <- UcscTrack(genome=gen, chromosome=chr,
track="knownGene", trackType="GeneRegionTrack",
rstarts="exonStarts", rends="exonEnds", gene="name", symbol="name",
transcript="name", strand="strand", fill="#FF7F00", name="UCSC Genes", geneSymbols = TRUE, showId = TRUE)
refGenes <- UcscTrack(genome=gen, chromosome=chr,
track="refGene", trackType="GeneRegionTrack",
rstarts="exonStarts", rends="exonEnds", gene="name", symbol="name",
transcript="name", strand="strand", fill="#FF7F00", name="RefSeq Genes", geneSymbols = TRUE, showId = TRUE)
## axis scale ##
ideoTrack <- IdeogramTrack(genome = gen, chromosome = chr)
# plotTracks(ideoTrack, from = mygene.start, to = mygene.end)
axisTrack <- GenomeAxisTrack(scale=0.25)
####################################
## now for each name in unique chr
######################
## construct tracks ##
## ChIP-seq coverage from BAM ##
bamPol2 <- "~/bioinfo/srp_chip_seq/data/reads/merged_reads_CBP20line/Pol2.bam"
bamInput <- "~/bioinfo/srp_chip_seq/data/reads/merged_reads_CBP20line/Input.bam"
## histogram
bTrackPol2 <- DataTrack(range = bamPol2, genome = gen, chromosome = chr,
name = "Pol2", type = "histogram", col.histogram="#984EA3",
ylim=c(0,max.cov))
bTrackInput <- DataTrack(range = bamInput, genome = gen, chromosome = chr,
name = "Input", type = "histogram", col.histogram="#999999",
ylim=c(0,max.cov))
#################
## plot tracks ##
pdf(paste("./plots/", mygene,"_cov.pdf", sep=""))
plotTracks(list(ideoTrack, axisTrack, bTrackCBP20, pTrackCBP20, bTrackPol2, pTrackPol2, bTrackInput, refGenes, bTrackRNAcov), from = mygene.start, to = mygene.end, fontfamily="Helvetica", background.title="white", col.title="black", col.axis="black")
plotTracks(list(ideoTrack, axisTrack, bTrackCBP20, pTrackCBP20, bTrackPol2, pTrackPol2, bTrackInput, knownGenes, bTrackRNAcov), from = mygene.start, to = mygene.end, fontfamily="Helvetica", background.title="white", col.title="black", col.axis="black")
dev.off()
# plotTracks(list(axisTrack, refGenes, bTrackCBP20, pTrackCBP20, bTrackPol2, pTrackPol2, bTrackInput, bTrackRNA), from = mygene.start, to = mygene.end, fontfamily="Helvetica", background.title="white", col.title="black", col.axis="black")
# head(displayPars(bTrackPol2))
}edit2
我的临时解决方案使用唯一chr的for循环,在子设置每个chr的所有名称之前创建UcscTrack,并为绘图创建一个d_pply。基本上是在两个地方吐出这个大的功能。
for (chrom in unique(my.genes$chr)) {
print(paste("creating annotation for ", chrom, sep=""))
## Gene annotations ##
knownGenes <- UcscTrack(genome=gen, chromosome=chrom,
track="knownGene", trackType="GeneRegionTrack",
rstarts="exonStarts", rends="exonEnds", gene="name", symbol="name",
transcript="name", strand="strand", fill="#FF7F00", name="UCSC Genes", geneSymbols = TRUE, showId = TRUE)
refGenes <- UcscTrack(genome=gen, chromosome=chrom,
track="refGene", trackType="GeneRegionTrack",
rstarts="exonStarts", rends="exonEnds", gene="name", symbol="name",
transcript="name", strand="strand", fill="#FF7F00", name="RefSeq Genes", geneSymbols = TRUE, showId = TRUE)
## axis scale ##
ideoTrack <- IdeogramTrack(genome = gen, chromosome = chrom)
# plotTracks(ideoTrack, from = mygene.start, to = mygene.end)
axisTrack <- GenomeAxisTrack(scale=0.25)
## select genes in chromosome
df.g <- subset(my.genes, chr == chrom)
d_ply(df.g, .(name), plotGviz)
}发布于 2013-08-08 16:40:21
试试data.table,它会比plyr快。下面是解决问题的一般方法:
library(data.table)
dt = data.table(my.genes)
dt[, .SD[, do_smth(), by = name], by = chr]下面是一个不同问题中的上述方法的示例-- how to integrate properties defined on multiple rows using a data.frame or data.table long format approach
https://stackoverflow.com/questions/18125178
复制相似问题