文章/答案/技术大牛

发布

社区首页 >问答首页 >在GWAS中查找lead SNP-查找按邻近度分组的行的最小值

问在GWAS中查找lead SNP-查找按邻近度分组的行的最小值
EN

Stack Overflow用户

提问于 2019-11-28 04:42:22

回答 1查看 36关注 0票数 0

我正在尝试从GWAS分析的输出中获得按地区划分的领先SNP。我需要做的是找到附近所有SNP中p值最低的行。下面是我的数据。(还有其他列，但对此选择无关紧要)。

# A tibble: 9 x 4
  SNP                chr       POS   Wald.pval
  <chr>            <int>     <int>       <dbl>
1 4:31970946:T:C       4  31970946 0.000000620
2 4:32001877:A:C       4  32001877 0.000000707
3 10:4627686:T:C      10   4627686 0.000000296
4 11:109015746:A:T    11 109015746 0.000000634
5 11:109018337:C:T    11 109018337 0.000000487
6 11:109018391:G:C    11 109018391 0.000000487
7 11:109019179:G:C    11 109019179 0.000000824
8 15:52448759:A:G     15  52448759 0.000000471
9 16:73596272:C:T     16  73596272 0.000000493\

这是我想要的输出。这些是位置在一定距离内的每个区域的最低pval。这里是一个10,000以内的位置。它们也需要在同一条染色体上。

# A tibble: 5 x 4
  SNP                chr       POS   Wald.pval
  <chr>            <int>     <int>       <dbl>
1 4:31970946:T:C       4  31970946 0.000000620
2 10:4627686:T:C      10   4627686 0.000000296
3 11:109018337:C:T    11 109018337 0.000000487
4 15:52448759:A:G     15  52448759 0.000000471
5 16:73596272:C:T     16  73596272 0.000000493

我更喜欢使用tidyverse方法来做这件事，我不希望在整个数据结构上使用循环，但我欢迎任何解决方案。距离不一定是一个固定的距离，我希望能够评估每条记录到它的邻居的距离，并基于此进行呼叫，而不是批量截止，但我可以使用这两种方法。

回答 1

Stack Overflow用户

回答已采纳

发布于 2019-11-28 05:19:10

如果你使用GenomicRanges包，这会容易得多。在那里，你可以通过染色体将它们分开，也可以创建所谓的区域。首先，我们从tibble或dataframe开始：

df <- structure(list(SNP = structure(c(8L, 9L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L), .Label = c("10:4627686:T:C", "11:109015746:A:T", "11:109018337:C:T", 
"11:109018391:G:C", "11:109019179:G:C", "15:52448759:A:G", "16:73596272:C:T", 
"4:31970946:T:C", "4:32001877:A:C"), class = "factor"), chr = c(4L, 
4L, 10L, 11L, 11L, 11L, 11L, 15L, 16L), POS = c(31970946L, 32001877L, 
4627686L, 109015746L, 109018337L, 109018391L, 109019179L, 52448759L, 
73596272L), Wald.pval = c(6.2e-07, 7.07e-07, 2.96e-07, 6.34e-07, 
4.87e-07, 4.87e-07, 8.24e-07, 4.71e-07, 4.93e-07)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9"))

我们创建GenomicRanges对象

library(GenomicRanges)

# make a genomic range object
gr = GRanges(seqnames=df$chr,
IRanges(start=df$POS,end=df$POS),Wald.pval=df$Wald.pval)
names(gr) = df$SNP

# you can change this
FLANK = 10000

在GenomicRanges调用reduce中有一个很好的函数，我将解释它是如何工作的。首先，我们可以使用以下命令来扩展snp

flank(gr,FLANK,both=TRUE)

GRanges object with 9 ranges and 2 metadata columns:
                   seqnames              ranges strand | Wald.pval    region
                      <Rle>           <IRanges>  <Rle> | <numeric> <integer>
    10:4627686:T:C       10     4617686-4637685      * |  2.96e-07         3
   15:52448759:A:G       15   52438759-52458758      * |  4.71e-07         5

您可以看到这将snp扩展了10kb。现在，如果我们在这个“侧翼”区域上进行缩减，它会折叠它们：

GRanges object with 6 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]        4   31960946-31980945      *
  [2]        4   31991877-32011876      *
  [3]       10     4617686-4637685      *
  [4]       11 109005746-109029178      *
  [5]       15   52438759-52458758      *
  [6]       16   73586272-73606271      *

例如，11号染色体上的区域现在包含了11号染色体上的所有snps。

#flank your snps by 10kb and we merge all these regions together
REGIONS <- reduce(flank(gr,FLANK,both=TRUE))
# each SNP can only be matched to one merged region
# so we just find overlap between region and snp
# and assign the snp to the region
gr$region = subjectHits(findOverlaps(gr,REGIONS))
# order by pvalue
gr = gr[order(gr$Wald.pval),]
# keep only the top snp in each region
gr[!duplicated(gr$region)]

GRanges object with 6 ranges and 2 metadata columns:
                   seqnames    ranges strand | Wald.pval    region
                      <Rle> <IRanges>  <Rle> | <numeric> <integer>
    10:4627686:T:C       10   4627686      * |  2.96e-07         3
   15:52448759:A:G       15  52448759      * |  4.71e-07         5
  11:109018337:C:T       11 109018337      * |  4.87e-07         4
   16:73596272:C:T       16  73596272      * |  4.93e-07         6
    4:31970946:T:C        4  31970946      * |   6.2e-07         1
    4:32001877:A:C        4  32001877      * |  7.07e-07         2

请注意，chr4上的最后两个snps被保留，因为它们之间的距离大于10kb。

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/59078139

复制

相似问题

问在GWAS中查找lead SNP-查找按邻近度分组的行的最小值
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在GWAS中查找lead SNP-查找按邻近度分组的行的最小值EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在GWAS中查找lead SNP-查找按邻近度分组的行的最小值
EN