Q: How can I create a dictionary of gene IDs and their coding regions from a gff3 file using the five_prime_UTR records? I cannot use Biopython for this task.

Stack Overflow user

Asked on 2020-12-13 08:03:54

1 answer, 188 views, 0 followers, 1 vote

My code:

GFF = raw_input("Please enter gff3 file: ")

GFF = open(GFF, "r")

GFF= GFF.read()

new_dict = {}

for i in GFF:
    element = i.split()
    if (element[2] == "five_prime_UTR"):
        if element[7] in new_dict:
            new_dict[element[2]]+= 1
        if element[3] in new_dict:
             new_dict[element[3]] += 1

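I get an "index out of range" error on the line element[2] == "five_prime_UTR".

How can I create a dictionary that maps a gene ID (such as Zm00001d027231) to its five prime UTR region number (such as 50887)? I have been trying to do this by first matching the five_prime_UTR rows and working from there.

Expected output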

new_dict ={Zm00001d027231:50887}

The gff3 file is a gene annotation file. It looks like this:

1       gramene exon    55222   55682   .       -       .       Parent=transcript:Zm00001d027231_T003;Name=Zm00001d027231_T003.exon1;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=Zm00001d027231_T003.exon1;rank=1
1       gramene five_prime_UTR  55549   55682   .       -       .       Parent=transcript:Zm00001d027231_T003
1       gramene mRNA    50887   55668   .       -       .       ID=transcript:Zm00001d027231_T004;Parent=gene:Zm00001d027231;biotype=protein_coding;transcript_id=Zm00001d027231_T004
1       gramene three_prime_UTR 50887   51120   .       -       .       Parent=transcript:Zm00001d027231_T004
1       gramene exon    50887   51239   .       -       .       Parent=transcript:Zm00001d027231_T004;Name=Zm00001d027231_T004.exon9;constitutive=0;ensembl_e

bioinformatics

genetics

python

python-3.x

python-2.7

1 Answer

Stack Overflow user

Answered on 2020-12-15 23:57:51

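The variable GFF holds the entire contents of the gff3 file as a single string.

Right now, you are iterating over that string character by character, so i.split() returns a list with at most one element and element[2] raises the index error: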

>>> for i in GFF:
...     print(i)
1
 
 
 
 
 
 
 
g
r
a
m
e
n
e
 
e
x
o
n
[and so on]

You want to iterate over the file contents line by line instead, using for i in GFF.splitlines():.
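To make the difference concrete, here is a minimal sketch that uses a short made-up string in place of the real file contents (the variable names and the sample text are assumptions for illustration only):

```python
# Hypothetical stand-in for the string returned by GFF.read()
text = "1\tgramene\texon\t55222\n1\tgramene\tmRNA\t50887\n"

# Iterating over a string yields single characters, not lines
chars = [c for c in text]

# splitlines() yields one string per line, without the trailing newline
lines = text.splitlines()

# Each line can then be split into its whitespace-separated columns
types = [line.split()[2] for line in lines]

print(chars[:3])
print(lines)
print(types)
```

Splitting a single character can never produce a third column, which is why the original loop fails at element[2].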

You can also make the code a little clearer by giving names to the fields you are parsing, for example:

new_dict = {}

# https://m.ensembl.org/info/website/upload/gff3.html
gff3_fields = ['seqid', #  name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seq ID must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
               'source', # name of the program that generated this feature, or the data source (database or project name)
               'type', # type of feature. Must be a term or accession from the SOFA sequence ontology
               'start', # Start position of the feature, with sequence numbering starting at 1.
               'end', # End position of the feature, with sequence numbering starting at 1.
               'score', # A floating point value.
               'strand', # defined as + (forward) or - (reverse).
               'phase', # One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
               'attributes' #  A semicolon-separated list of tag-value pairs, providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent  
]

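for line in GFF.splitlines():
    # Skip blank lines and '#' comment/directive lines, which have no columns.
    if not line.strip() or line.startswith('#'):
        continue
    # Split into at most 9 columns so spaces inside attributes are preserved.
    feature = dict(zip(gff3_fields, line.split(None, 8)))
    # 50887 in the sample is the start of the three_prime_UTR row;
    # use 'five_prime_UTR' here to collect 5' UTR starts instead.
    if feature['type'] == 'three_prime_UTR':
        attributes = feature['attributes']
        geneid = attributes.split(':')[-1].split('_')[0]
        new_dict[geneid] = int(feature['start'])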
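The geneid extraction above depends on the transcript naming convention (Zm00001d027231_T004). As a hedged alternative, the sketch below looks up each UTR's parent transcript in the mRNA rows and takes the gene from there. The helper names parse_attributes and utr_starts_by_gene are hypothetical. The sketch assumes that mRNA rows carry ID=transcript:... and Parent=gene:... attributes as in the sample, and it keeps the smallest start per gene, which is an arbitrary choice. The sample rows are abridged from the question's data.

```python
def parse_attributes(field):
    """Turn 'ID=a;Parent=b' into {'ID': 'a', 'Parent': 'b'}."""
    pairs = (item.split("=", 1) for item in field.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def utr_starts_by_gene(gff_text, utr_type="five_prime_UTR"):
    """Map gene ID -> smallest start coordinate among its UTR features."""
    rows = []
    for line in gff_text.splitlines():
        # Ignore blank lines and '#' comment/directive lines
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split(None, 8)
        if len(columns) == 9:
            rows.append(columns)

    # First pass: build transcript ID -> gene ID from the mRNA rows
    transcript_to_gene = {}
    for columns in rows:
        if columns[2] == "mRNA":
            attrs = parse_attributes(columns[8])
            if "ID" in attrs and "Parent" in attrs:
                transcript_to_gene[attrs["ID"]] = attrs["Parent"].split(":")[-1]

    # Second pass: resolve each UTR row to its gene through its parent transcript
    result = {}
    for columns in rows:
        if columns[2] != utr_type:
            continue
        parent = parse_attributes(columns[8]).get("Parent")
        gene = transcript_to_gene.get(parent)
        if gene is None:
            continue  # parent transcript not seen in the file
        start = int(columns[3])
        result[gene] = min(start, result.get(gene, start))
    return result


sample = "\n".join([
    "1\tgramene\tmRNA\t50887\t55668\t.\t-\t.\tID=transcript:Zm00001d027231_T004;Parent=gene:Zm00001d027231;biotype=protein_coding",
    "1\tgramene\tthree_prime_UTR\t50887\t51120\t.\t-\t.\tParent=transcript:Zm00001d027231_T004",
])
print(utr_starts_by_gene(sample, utr_type="three_prime_UTR"))
```

Because the lookup goes through the mRNA rows, a UTR whose parent transcript is missing from the file is skipped rather than guessed from its name.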

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/65273570
