Q: Counting the number of paralogues of mouse genes gives the wrong frequency

Stack Overflow user

Asked on 2018-01-27 03:13:04

Answers: 1, Views: 95, Followers: 0, Votes: 2

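I am trying to use BioMart to count the number of paralogues of the mouse homologues of human protein-coding genes. However, for the gene "PLIN4", for example, it counts 35,000 paralogues instead of 4.

We think this is because some genes have one-to-many homologues, which leads to duplicates. When I run a single gene, it returns the correct number of paralogues. Is there a way to remove these duplicates from the results, or a way around it so that BioMart does not output them?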
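The suspected mechanism can be illustrated without querying BioMart. The sketch below uses made-up gene names and two small hand-written tables (all of them assumptions for illustration, not BioMart output) to show how joining a one-to-many homologue table with a paralogue table produces more rows than there are distinct paralogues.

```r
# Toy data (made up, not BioMart output):
# one human gene with two mouse homologues.
homologues <- data.frame(
  human_gene = c("GENE_A", "GENE_A"),
  mouse_gene = c("Gene_a1", "Gene_a2"),
  stringsAsFactors = FALSE
)

# Both mouse homologues are listed with the same two paralogues.
paralogues <- data.frame(
  mouse_gene = c("Gene_a1", "Gene_a1", "Gene_a2", "Gene_a2"),
  paralogue  = c("Gene_p1", "Gene_p2", "Gene_p1", "Gene_p2"),
  stringsAsFactors = FALSE
)

# Joining on the mouse gene repeats each paralogue once per homologue.
joined <- merge(homologues, paralogues, by = "mouse_gene")

n_rows   <- nrow(joined)                      # rows after the join
n_unique <- length(unique(joined$paralogue))  # distinct paralogue names

print(n_rows)    # 4
print(n_unique)  # 2
```

Counting rows per gene therefore over-counts whenever the same paralogue appears under several homologues; counting distinct paralogue names does not.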

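I have also thought about running one gene at a time and building some kind of loop, so that it automatically counts the paralogues for every gene in the list.

The code I have written so far is: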

# Load the biomaRt package:

library(biomaRt)
ensembl_hsapiens <- useMart("ensembl", 
                          dataset = "hsapiens_gene_ensembl")
ensembl_mouse <- useMart("ensembl", 
                       dataset = "mmusculus_gene_ensembl")

# Get all human protein coding genes:

hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id", 
                                         "external_gene_name"), 
                          filters = "biotype", 
                          values = "protein_coding", 
                          mart = ensembl_hsapiens)


ensembl_gene_ID <- hsapien_PC_genes$ensembl_gene_id

# Get mouse homologues

mouse_homologues <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", 
                                       "mmusculus_homolog_associated_gene_name"), 
                        filters = "ensembl_gene_id", 
                        values = c(ensembl_gene_ID), 
                        mart = ensembl_hsapiens)

# Get mouse external gene name 

mouse_homologues_external_gene_names <- mouse_homologues$mmusculus_homolog_associated_gene_name


mouse_paralogues <- getBM(attributes = c("hsapiens_homolog_associated_gene_name",
                                       "external_gene_name",
                                       "mmusculus_paralog_associated_gene_name"), 
                        filters = "external_gene_name", 
                        values = c(mouse_homologues_external_gene_names) , mart = ensembl_mouse)

# Remove genes with no paralogues 
mouse_paralogs_data <- mouse_paralogues[
  !(is.na(mouse_paralogues$mmusculus_paralog_associated_gene_name) |
      mouse_paralogues$mmusculus_paralog_associated_gene_name == ""), ]

# Count paralogues per gene

library(plyr)
count_mouse_paralogues <- count(mouse_paralogs_data, "external_gene_name")
View(count_mouse_paralogues)

I hope someone can help.

Thanks,

Jack

bioinformatics

biomart

1 Answer

Stack Overflow user

Accepted answer

Answered on 2018-01-29 04:26:45

Further to the comments, here is what I get:

# R version 3.4.2 (2017-09-28)
library(biomaRt) # biomaRt_2.32.1
library(dplyr)   # dplyr_0.7.4

# test how many paralogues for gene PLIN4
nrow(mouse_paralogs_data[
  mouse_paralogs_data$hsapiens_homolog_associated_gene_name == "PLIN4", ])
# [1] 4

# now summarise for all genes
res <- mouse_paralogs_data %>% 
  group_by(hsapiens_homolog_associated_gene_name) %>% 
  summarise(DistinctP = n_distinct(mmusculus_paralog_associated_gene_name))

# test again number of paralogues for gene PLIN4
res[ res$hsapiens_homolog_associated_gene_name == "PLIN4", ]
# # A tibble: 1 x 2
#   hsapiens_homolog_associated_gene_name DistinctP
#   <chr>                                     <int>
# 1 PLIN4                                         4
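For readers who prefer not to depend on dplyr, the same distinct-count idea can be sketched in base R. The example below is only an illustration under stated assumptions: the `toy` data frame and its gene names are invented, and only the column names are taken from the code above.

```r
# Toy stand-in for mouse_paralogs_data (invented rows; real data comes from getBM).
toy <- data.frame(
  hsapiens_homolog_associated_gene_name  = c("GENE_A", "GENE_A", "GENE_A", "GENE_B"),
  external_gene_name                     = c("Gene_a", "Gene_a", "Gene_a", "Gene_b"),
  mmusculus_paralog_associated_gene_name = c("Gene_p1", "Gene_p1", "Gene_p2", "Gene_p3"),
  stringsAsFactors = FALSE
)

# Keep each (human gene, mouse paralogue) pair only once.
unique_pairs <- unique(toy[, c("hsapiens_homolog_associated_gene_name",
                               "mmusculus_paralog_associated_gene_name")])

# Count the remaining rows per human gene.
counts <- aggregate(
  mmusculus_paralog_associated_gene_name ~ hsapiens_homolog_associated_gene_name,
  data = unique_pairs,
  FUN  = length
)
names(counts)[2] <- "DistinctP"

print(counts)
```

Dropping duplicate pairs before counting is equivalent in effect to `n_distinct()` within each group: the repeated "Gene_p1" row for GENE_A is counted once.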

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/48468052
