我有一份ATAC序列窄峰的档案。与PEPATAC流水线类似,这里的算法是:首先,保留最重要的峰值,并删除与该显著峰值直接重叠的任何峰值。然后,此过程迭代到下一个最重要的峰值,依此类推,直到由于与更重要的峰值直接重叠而保留或移除了所有峰值。
以下是我的数据示例
data <- structure(list(V1 = c("chr3", "chrUn_KI270467v1", "chr8", "chrUn_KI270467v1",
"chr21", "chr7"), V2 = c(93470281L, 1668L, 109333171L, 1668L,
14382415L, 12686167L), V3 = c(93470873L, 3946L, 109335050L, 3946L,
14384230L, 12688127L), V4 = c("Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_60893",
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_97388b", "Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_92241d",
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_97388c", "Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_55158c",
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_83188b"), V5 = c(91371L,
24480L, 19002L, 17131L, 17084L, 16639L), V6 = c(".", ".", ".",
".", ".", "."), V7 = c(726.76721, 240.65007, 195.49055, 179.63454,
179.23312, 175.41965), V8 = c(9144.74609, 2454.88721, 1906.9408,
1719.66797, 1714.96497, 1670.38489), V9 = c(9137.11816, 2448.07471,
1900.24915, 1713.15186, 1708.45618, 1663.93958), V10 = c(272L,
666L, 1082L, 1445L, 898L, 525L)), row.names = c(88715L, 141209L,
133771L, 141210L, 80584L, 120831L), class = "data.frame")我根据它们的重要性对峰值进行了排序。到目前为止,我写了这段代码
for ( i in 1:nrow(B1.1)){
sub<-as_granges(B1.1[i , ], seqnames=1, start=2 , end=3)
que<- as_granges(B1.1 , seqnames=V1 , start=V2 , end=V3)
overlaps<-which(overlapsAny( que, sub))
overlaps<-overlaps[overlaps!=i ]
B1.1<-B1.1[ ! seq(from=1 , to=nrow(B1.1), by=1)%in%overlaps, ]
overlaps<-c()
}显然,这段代码效率不是很高,而且需要花费大量时间。
非常感谢您的建议
发布于 2020-12-18 08:39:50
您为间隔定义了组,并对您的GRanges对象进行了重复数据删除,因为我不能用您的数据显示它,下面是一个根据p值排序的示例数据:
library(GenomicRanges)
gr = GRanges(seqnames=1,
IRanges(start=c(1,1,1,1,5,5),end=c(3,3,2,2,8,9)),
p=sort(runif(6,0,0.1)))使用reduce定义组,首先创建一个包含所有重叠的宽间隔:
rgr = reduce(gr)
rgr
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 1-3 *
[2] 1 5-9 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths现在使用findOverlaps将各个范围分配给组
gr$group = subjectHits(findOverlaps(gr,rgr))对此进行重复数据删除:
gr[!duplicated(gr$group)]
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | p group
<Rle> <IRanges> <Rle> | <numeric> <integer>
[1] 1 1-3 * | 0.0276333122281358 1
[2] 1 5-8 * | 0.0465503185754642 2
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths另一种方法是使用while循环:
findNonOvlp = function(g){
idx = seq_along(g)
keep = c()
while(length(idx)>0){
top_g = idx[1]
rmv = idx[countOverlaps(g[idx],g[top_g])>0]
idx = setdiff(idx,rmv)
keep = c(keep,top_g)
}
g[keep]
} 测试:
findNonOvlp(gr)
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | p
<Rle> <IRanges> <Rle> | <numeric>
[1] 1 1-3 * | 0.00038885809481144
[2] 1 5-8 * | 0.058683192031458
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths再举一个例子:
r = GRanges(seqnames=c("chr1", "chr1", "chr1"), IRanges(start=c(100,140,180),end=c(150,200, 200) ) )
findNonOvlp(r)
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 100-150 *
[2] chr1 180-200 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengthshttps://stackoverflow.com/questions/65348911
复制相似问题