我正在尝试从GWAS分析的输出中获得按地区划分的领先SNP。我需要做的是找到附近所有SNP中p值最低的行。下面是我的数据。(还有其他列,但对此选择无关紧要)。
# A tibble: 9 x 4
SNP chr POS Wald.pval
<chr> <int> <int> <dbl>
1 4:31970946:T:C 4 31970946 0.000000620
2 4:32001877:A:C 4 32001877 0.000000707
3 10:4627686:T:C 10 4627686 0.000000296
4 11:109015746:A:T 11 109015746 0.000000634
5 11:109018337:C:T 11 109018337 0.000000487
6 11:109018391:G:C 11 109018391 0.000000487
7 11:109019179:G:C 11 109019179 0.000000824
8 15:52448759:A:G 15 52448759 0.000000471
9 16:73596272:C:T 16 73596272 0.000000493\这是我想要的输出。这些是位置在一定距离内的每个区域的最低pval。这里是一个10,000以内的位置。它们也需要在同一条染色体上。
# A tibble: 5 x 4
SNP chr POS Wald.pval
<chr> <int> <int> <dbl>
1 4:31970946:T:C 4 31970946 0.000000620
2 10:4627686:T:C 10 4627686 0.000000296
3 11:109018337:C:T 11 109018337 0.000000487
4 15:52448759:A:G 15 52448759 0.000000471
5 16:73596272:C:T 16 73596272 0.000000493我更喜欢使用tidyverse方法来做这件事,我不希望在整个数据结构上使用循环,但我欢迎任何解决方案。距离不一定是一个固定的距离,我希望能够评估每条记录到它的邻居的距离,并基于此进行呼叫,而不是批量截止,但我可以使用这两种方法。
发布于 2019-11-28 05:19:10
如果你使用GenomicRanges包,这会容易得多。在那里,你可以通过染色体将它们分开,也可以创建所谓的区域。首先,我们从tibble或dataframe开始:
df <- structure(list(SNP = structure(c(8L, 9L, 1L, 2L, 3L, 4L, 5L,
6L, 7L), .Label = c("10:4627686:T:C", "11:109015746:A:T", "11:109018337:C:T",
"11:109018391:G:C", "11:109019179:G:C", "15:52448759:A:G", "16:73596272:C:T",
"4:31970946:T:C", "4:32001877:A:C"), class = "factor"), chr = c(4L,
4L, 10L, 11L, 11L, 11L, 11L, 15L, 16L), POS = c(31970946L, 32001877L,
4627686L, 109015746L, 109018337L, 109018391L, 109019179L, 52448759L,
73596272L), Wald.pval = c(6.2e-07, 7.07e-07, 2.96e-07, 6.34e-07,
4.87e-07, 4.87e-07, 8.24e-07, 4.71e-07, 4.93e-07)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))我们创建GenomicRanges对象
library(GenomicRanges)
# make a genomic range object
gr = GRanges(seqnames=df$chr,
IRanges(start=df$POS,end=df$POS),Wald.pval=df$Wald.pval)
names(gr) = df$SNP
# you can change this
FLANK = 10000在GenomicRanges调用reduce中有一个很好的函数,我将解释它是如何工作的。首先,我们可以使用以下命令来扩展snp
flank(gr,FLANK,both=TRUE)
GRanges object with 9 ranges and 2 metadata columns:
seqnames ranges strand | Wald.pval region
<Rle> <IRanges> <Rle> | <numeric> <integer>
10:4627686:T:C 10 4617686-4637685 * | 2.96e-07 3
15:52448759:A:G 15 52438759-52458758 * | 4.71e-07 5 您可以看到这将snp扩展了10kb。现在,如果我们在这个“侧翼”区域上进行缩减,它会折叠它们:
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 4 31960946-31980945 *
[2] 4 31991877-32011876 *
[3] 10 4617686-4637685 *
[4] 11 109005746-109029178 *
[5] 15 52438759-52458758 *
[6] 16 73586272-73606271 *例如,11号染色体上的区域现在包含了11号染色体上的所有snps。
#flank your snps by 10kb and we merge all these regions together
REGIONS <- reduce(flank(gr,FLANK,both=TRUE))
# each SNP can only be matched to one merged region
# so we just find overlap between region and snp
# and assign the snp to the region
gr$region = subjectHits(findOverlaps(gr,REGIONS))
# order by pvalue
gr = gr[order(gr$Wald.pval),]
# keep only the top snp in each region
gr[!duplicated(gr$region)]
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | Wald.pval region
<Rle> <IRanges> <Rle> | <numeric> <integer>
10:4627686:T:C 10 4627686 * | 2.96e-07 3
15:52448759:A:G 15 52448759 * | 4.71e-07 5
11:109018337:C:T 11 109018337 * | 4.87e-07 4
16:73596272:C:T 16 73596272 * | 4.93e-07 6
4:31970946:T:C 4 31970946 * | 6.2e-07 1
4:32001877:A:C 4 32001877 * | 7.07e-07 2请注意,chr4上的最后两个snps被保留,因为它们之间的距离大于10kb。
https://stackoverflow.com/questions/59078139
复制相似问题