
Parsing a GenBank file

Stack Overflow user
Asked 2014-02-19 18:15:14
Answers: 1 · Views: 847 · Followers: 0 · Votes: 1

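Basically, a GenBank file consists of gene entries (each announced by 'gene'), followed by the corresponding 'CDS' entry (only one per gene), like the two shown below. I would like to get locus_tag vs. product in a tab-delimited two-column file. 'gene' and 'CDS' are always preceded by spaces. If this task can easily be done with an already available tool, please let me know.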

Input file:

 gene            complement(8972..9094)
                 /locus_tag="HAPS_0004"
                 /db_xref="GeneID:7278619"
 CDS             complement(8972..9094)
                 /locus_tag="HAPS_0004"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_002474657.1"
                 /db_xref="GI:219870282"
                 /db_xref="GeneID:7278619"
                 /translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
 gene            9632..11416
                 /gene="frdA"
                 /locus_tag="HAPS_0005"
                 /db_xref="GeneID:7278620"
 CDS             9632..11416
                 /gene="frdA"
                 /locus_tag="HAPS_0005"
                 /note="part of four member fumarate reductase enzyme
                 complex FrdABCD which catalyzes the reduction of fumarate
                 to succinate during anaerobic respiration; FrdAB are the
                 catalytic subcomplex consisting of a flavoprotein subunit
                 and an iron-sulfur subunit, respectively; FrdCD are the
                 membrane components which interact with quinone and are
                 involved in electron transfer; the catalytic subunits are
                 similar to succinate dehydrogenase SdhAB"
                 /codon_start=1
                 /transl_table=11
                 /product="fumarate reductase flavoprotein subunit"
                 /protein_id="YP_002474658.1"
                 /db_xref="GI:219870283"
                 /db_xref="GeneID:7278620"
                 /translation="MQTVNVDVAIVGAGGGGLRAAIAAAEANPNLKIALISKVYPMRS
                 HTVAAEGGAAAVAKEEDSYDKHFHDTVAGGDWLCEQDVVEYFVEHSPVEMTQLERWGC
                 PWSRKADGDVNVRRFGGMKIERTWFAADKTGFHLLHTLFQTSIKYPQIIRFDEHFVVD
                 ILVDDGQVRGCVAMNMMEGTFVQINANAVVIATGGGCRAYRFNTNGGIVTGDGLSMAY
                 RHGVPLRDMEFVQYHPTGLPNTGILMTEGCRGEGGILVNKDGYRYLQDYGLGPETPVG
                 KPENKYMELGPRDKVSQAFWQEWRKGNTLKTAKGVDVVHLDLRHLGEKYLHERLPFIC
                 ELAQAYEGVDPAKAPIPVRPVVHYTMGGIEVDQHAETCIKGLFAVGECASSGLHGANR
                 LGSNSLAELVVFGKVAGEMAAKRAVEATARNQAVIDAQAKDVLERVYALARQEGEESW
                 SQIRNEMGDSMEEGCGIYRTQESMEKTVAKIAELKERYKRIKVKDSSSVFNTDLLYKI
                 ELGYILDVAQSISSSAVERKESRGAHQRLDYVERDDVNYLKHTLAFYNADGTPTIKYS
                 DVKITKSQPAKRVYGAEAEAQEAAAKKE"

Desired output (locus_tag and product in a tab-delimited two-column file):

HAPS_0004 hypothetical protein
HAPS_0005 fumarate reductase flavoprotein subunit

In fact, output like the following would be ideal, with one line per gene (only one gene is shown):

 locus_tag="HAPS_0004" db_xref="GeneID:7278619" complement(8972..9094) codon_start=1 transl_table=11 product="hypothetical protein" protein_id="YP_002474657.1" db_xref="GI:219870282" db_xref="GeneID:7278619" translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
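
A minimal Python sketch of this one-line-per-gene layout, offered as an illustration rather than a definitive solution. It assumes the feature-table layout of the input above: feature keys such as CDS are indented by at most 8 spaces and qualifier lines are indented further. The field order (location first, then the CDS qualifiers in file order, tab-separated) only approximates the example line, and the names `flatten_cds` and `max_key_indent` are hypothetical.

```python
def flatten_cds(lines, max_key_indent=8):
    """Yield one tab-joined line per CDS feature: location, then qualifiers.

    Assumption: feature keys are indented by at most `max_key_indent`
    spaces; qualifier and continuation lines are indented further.
    """
    key, fields, unclosed = None, [], False
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        indent = len(raw) - len(raw.lstrip())
        if unclosed:
            # Continuation of a wrapped quoted value. Sequence chunks are
            # joined directly; free text is joined with a single space.
            glue = "" if fields[-1].startswith("translation=") else " "
            fields[-1] += glue + text
            unclosed = not text.endswith('"')
        elif indent <= max_key_indent:
            # A new feature starts; emit the previous one if it was a CDS.
            if key == "CDS":
                yield "\t".join(fields)
            key, _, location = text.partition(" ")
            fields = [location.strip()]
        elif text.startswith("/") and key is not None:
            value = text.partition("=")[2]
            fields.append(text[1:])  # drop the leading slash
            # A quoted value with no closing quote continues on the next line.
            unclosed = value.startswith('"') and not (
                len(value) > 1 and value.endswith('"')
            )
    if key == "CDS":
        yield "\t".join(fields)


# First gene/CDS pair from the input above.
SAMPLE = """\
 gene            complement(8972..9094)
                 /locus_tag="HAPS_0004"
                 /db_xref="GeneID:7278619"
 CDS             complement(8972..9094)
                 /locus_tag="HAPS_0004"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_002474657.1"
                 /db_xref="GI:219870282"
                 /db_xref="GeneID:7278619"
                 /translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
"""

for row in flatten_cds(SAMPLE.splitlines()):
    print(row)
```

To process a whole file, an open file handle can be passed in place of `SAMPLE.splitlines()`, since the function only iterates over lines.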

1 Answer

Stack Overflow user

Accepted answer

Posted 2014-02-19 18:34:04

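perl -nE'
  # Read one record per "CDS" token; join list elements with a tab
  BEGIN{ ($/, $") = ("CDS", "\t") }
  # Collect locus_tag/product values; print the first two when both exist
  say "@r[0,1]" if @r= m!/(?:locus_tag|product)="(.+?)"!g and @r>1
' file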

Output

HAPS_0004       hypothetical protein
HAPS_0005       fumarate reductase flavoprotein subunit
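
The one-liner splits the input on the literal string CDS and matches each value within a single line, so it could be thrown off by the letters CDS occurring inside a /translation sequence and would miss a /product value that wraps onto a continuation line. As an alternative, here is a minimal Python sketch of the same extraction that tracks wrapped quoted values. It assumes the layout of the question's input (feature keys indented by at most 8 spaces, qualifier lines indented further); the names `locus_products` and `max_key_indent` are hypothetical and not part of the original answer.

```python
import re

# Matches a qualifier line such as /product="hypothetical protein"
QUALIFIER = re.compile(r'/(\w+)=(.*)')

# Abridged from the question's input: two gene/CDS pairs, one wrapped /note.
SAMPLE = """\
 gene            complement(8972..9094)
                 /locus_tag="HAPS_0004"
 CDS             complement(8972..9094)
                 /locus_tag="HAPS_0004"
                 /product="hypothetical protein"
 gene            9632..11416
                 /locus_tag="HAPS_0005"
 CDS             9632..11416
                 /locus_tag="HAPS_0005"
                 /note="part of four member fumarate reductase enzyme
                 complex FrdABCD which catalyzes the reduction of fumarate
                 to succinate during anaerobic respiration"
                 /product="fumarate reductase flavoprotein subunit"
"""


def locus_products(lines, max_key_indent=8):
    """Yield (locus_tag, product) for each CDS feature found in `lines`.

    Assumption: feature keys (gene, CDS) are indented by at most
    `max_key_indent` spaces; qualifier lines are indented further.
    """
    features = []  # list of (feature_key, {qualifier_name: raw_value})
    name, unclosed = None, False
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        indent = len(raw) - len(raw.lstrip())
        match = QUALIFIER.match(text)
        if unclosed:
            # Continuation of a wrapped quoted value.
            features[-1][1][name] += " " + text
            unclosed = not text.endswith('"')
        elif indent <= max_key_indent:
            # A new feature starts with its key, e.g. "gene" or "CDS".
            features.append((text.split()[0], {}))
        elif match and features:
            name, value = match.groups()
            features[-1][1][name] = value
            # A quoted value with no closing quote continues on the next line.
            unclosed = value.startswith('"') and not (
                len(value) > 1 and value.endswith('"')
            )
    for key, quals in features:
        if key == "CDS" and "locus_tag" in quals and "product" in quals:
            yield quals["locus_tag"].strip('"'), quals["product"].strip('"')


for tag, product in locus_products(SAMPLE.splitlines()):
    print(tag, product, sep="\t")
```

To run it on a real file, an open file handle can be passed instead of `SAMPLE.splitlines()`. Only CDS features that carry both qualifiers are reported, which mirrors the `@r>1` check in the one-liner.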
Votes: 3
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/21888945
