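I asked a similar question before, but the earlier solution did not solve my problem. I have been poking around and testing, but nothing has worked properly.
I have a FASTA file with more than 500 sequences and I need to build a table from it, so I am trying to write a script to do the job instead of copy-pasting by hand. I am reading the file with Biopython: seq=SeqIO.parse(handle, "fasta")
For each sequence, I want to know the species the protein belongs to, the name of the protein, and the UniProt ID. When I parse the FASTA file with SeqIO, I noticed that there is not much information to get out of it.
Here is a subset of my FASTA file: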
>gi|194757291|ref|XP_001960898.1| GF11270 [Drosophila ananassae] >gi|190622196|gb|EDV37720.1| GF11270 [Drosophila ananassae] MSAARTSQDCDCTAKCRLRQHGNTITAALTKRSISSQNLAAFVYKTCGNFANILDDLGRSAVHMSASTGRYEILEWLLNH GAYINGQDYESGSSPLHRALYYGSIDCAVLLLRYGASMELLDEDTCCPLQAICRKCDVDDFATDSQNDVLVWGSNKNYNL GVGSEQNTNAPQSVDFFRKSNIWIEQVALGAYHSLFLDKKGHLYAVGHGKGGRLGTGGENTLPAPKRVKVSSKLGSEDSI RCISVSRQHSLVLTHRSLVFACGLNSDCQLGVRDAPEHLAQFKEVVALRDKGASDLVRVIACDQHSIAYGSRCVYVWGAN QGQFGISANIASIVVPTLIKLPARTTIRFVEANNAATVIYSEEKMIYLYYAEKTRAIKTPNYEDLKSISVMGGHIKNSAK GSAAALKLLMLTETNVVYLWYENTQQFYRCNFLPIRLPQIKKILYKCNQVMVLSEDGCVYRGKCNQIALPASELQEKSRP NLDIWQNNDQNRTEISREHVIRIELQRVPNIDRAVDISCDEGFSSFAVLQESQGKYFRKPTLPRKEHSFKKLLHDTSDCD AVHDVVFHVDGEKYPAHKYIIYSRAPGLRELVRMYLDKDIYLNFENLTGKMFELVLKHIYTNYWPTEDDIDCIQQSLGPA NPQNRSRTCQMFLPHLEKFQLTELAKYVKSYVQDHQFPLPSARQRLPRLHRSDYPELYDVKIKCEDGQVLQAHKCMLVAR LEYFEMMFMHSWAERSSVTMEGVPAEYMEPVLDYLYSLEAEAFCKQAYLETFLYNMITICDQYFIESLQNLCELLILDKI SIRKCGEMLEFATMYNCKLLLKGCMDFICQNLARVLCYRSIEQCDGETLKCLNDHYRNMFSRVFDYRQITPFSEAIEDEL LLSFIDGLEVDLEYRMDAESKAKQAAKTKQKDLRKLNARHQYEQRAISSMMRSISISESNPAPEVATSPQESARSETNNW SRVIDKKEQKRKQAETALKVNKTLKQETSPEPEMVPIERTPVNEQTPPPLSPETEPSTPLNKSYNLDFSSLTPQSQKLSQ KQRKRLSSESKSWRGNSSALLESPTTPVPVPNAWGVTTTPSSSFNDSYTSPTTGSSSDPTSFANMMRSQAASSSATSKDQ SQNFSKILADERRQRESYERMRNKSLVHTQIEETAIAELREFYNVDNIDDEKITIARKSRPSDINFSTWIRQ
>gi|198456847|ref|XP_001360463.2| GA20796 [Drosophila pseudoobscura pseudoobscura] >gi|198135774|gb|EAL25038.2| GA20796 [Drosophila pseudoobscura pseudoobscura] MSTAKAQEYDCTAKCTCRQHGNSITAALTKRSIDNQNLGAFIFKTCGNFANIIDDLGRSAVHMSASVARYEILEWLLNHG AYINGLDYESGSSPLHRALYYGSIDCAVLLLRYGASLELLDEDTRCPLQAICRKCDEDFTTESQNDVLVWGSNKNYNLGI GNEQNTNAPQAVDFFRKSNIWIEQVALGAYHSLFCDKKGHLYAVGHGKGGRLGIGVENSLPAPKRVKVSSKLNDDSIMCI SVSRQHSLLLTRRSLVFACGINTDHQLGVRDAPENLTQFREVVALRDKGASDLLRVIACDQHSIAYSTKCVYVWGANQGQ FGISRTTDTIMAPTLIKLPARTSIRFVEANNAATVIYTEEKMITLFYGDKTRYIKTPNYEDLKSIAVIGGHLKSSTKGSA AALKLLMLTETNVVFLWYENTQQFYRCNFSPIRLPEIKKILYKCNQVLILSLDGCVYRGKCNQIALPAGILEEKSKPNMD IWHNNDQNRTEISREHVIRIELQRVPNIDRATDIFCDESFSSFAVLQESHMKYFRKPPLPRREHNFKKLYHDTCESDAVH DVVFHVDGERFAAHKFILYSRAPGLRELTRIYLDKDVYLNFENLTGKMFELILKYIYTSYWPTEDDIDCIQESLGPANPR ERSRACEMFIPHLEMFQLVDLARYLQSYVRDNQFPIPSTRQRFNRLHRSDYPELYDVRIVCEDSKVLEAHKCMLVSRLEY FEMMFTHSWAERTTVNMEGVPAEYMEPVLDYLYSLDTEAFCKQNYTETFLYNMVTFCDQYFIESLQNVCESLILDKISIR KCGEMLDFAAMYNCKLLHKGCMDFICHNLARVLCYRSIEQCDEATLKCLNDHYRKMFSNVFDYRQITPFSEAIEDELLLS FVVDCDIDLDYRMDPETKLKAAAKHKQKDLRRQDARHYYEQQAISSMMRSLSVSESASGPEATTGPQESTRSEGKNWSRV VDKKEQKRKLADTALKVNNTLKLEEPPRPELEVIERALMKEQTPPPTSPAEETSTPLSKSYNLDLSSLTPQSQKLSQKQR KRLSSESKSWRSPLVEQEPTTPVAVPNAWGLPPATPSSSSFTDSPATGSISDPTSFANMMRGQAAAATTPTEKGQSFSRI LADERRQRESFERMRNKSLAHTQIEETAIAELREFYNVDNTDDETITIERKSRPTDINFSTWLKH
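>gi|355695434|gb|AES00009.1| inhibitor of Bruton agammaglobulinemia tyrosine kinase [Mustela putorius furo] KPGNKLKLNQKKCSFLCDVTMKSVDGKEFTCHKCVLCARLEYFHSMLSSSWIEASTCTALEMPIHSDILKVILDYLYTDE AVVIKESQNVDFVCSVLVVADQLLITRLKGMCEVALTEKLTLKNAAMILEFAAMYNAEQLKLSCLQFIGLNM
Is there a way to get the protein name, UniProt ID, and organism from these sequences? For example, I thought about parsing the GenBank ID from seq.description and then searching GenBank with that ID, but I don't think that will work, because not all of the sequences have a GenBank ID. Any suggestions on how to do this? Any help would be greatly appreciated.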
Example of the desired output:
name     organism    uniprot id  family
GF11270  Sophophora  B3MFN0
GA20796  Sophophora  Q291S4
Posted on 2013-02-14 07:43:15
You could also ask on BioStars: http://www.biostars.org/
Extract the ACN from the FASTA header. For example: GF11270
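The extraction step could look roughly like the sketch below. It assumes every header follows the NCBI-style layout seen in the question (gi|number|db|accession| name [organism]) and only looks at the first entry of each header. The names ENTRY_RE and parse_header are made up for this illustration; the input string could just as well be the seq.description value mentioned in the question.

```python
import re

# One NCBI-style entry: gi|<number>|<db>|<accession>| <name> [<organism>]
# (assumed layout, based only on the headers shown in the question)
ENTRY_RE = re.compile(
    r"gi\|(?P<gi>\d+)\|(?P<db>[^|]+)\|(?P<accession>[^|]*)\|\s*"
    r"(?P<name>.*?)\s*\[(?P<organism>[^\]]+)\]"
)


def parse_header(description):
    """Return a dict for the first entry in a FASTA header, or None if nothing matches."""
    match = ENTRY_RE.search(description)
    if match is None:
        return None
    return match.groupdict()


headers = [
    ">gi|194757291|ref|XP_001960898.1| GF11270 [Drosophila ananassae] "
    ">gi|190622196|gb|EDV37720.1| GF11270 [Drosophila ananassae]",
    ">gi|355695434|gb|AES00009.1| inhibitor of Bruton agammaglobulinemia "
    "tyrosine kinase [Mustela putorius furo]",
]

for header in headers:
    fields = parse_header(header)
    # Print the two fields that go straight into the table
    print(fields["name"], "|", fields["organism"])
```

Headers that do not fit the pattern return None, so it is worth logging those instead of assuming all 500+ records match.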
Then use the UniProt REST API to retrieve the records associated with this ACN:
http://www.uniprot.org/uniprot/?query=GF11270&sort=score&format=xml
http://www.uniprot.org/uniprot/?query=GF11270&sort=score&format=txt
http://www.uniprot.org/uniprot/?query=GF11270&sort=score&format=tab
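A minimal sketch of how those queries might be scripted with the standard library is below. The base URL and parameters are copied from the links above and may have changed since this answer was written. The function names build_query_url, parse_tab, and fetch_records are hypothetical, and no particular column names are assumed for the tab output.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL copied from the links above; treat it as an assumption, it may have moved.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(term, fmt="tab"):
    """Build a query URL with the same parameters as the links above."""
    params = {"query": term, "sort": "score", "format": fmt}
    return BASE_URL + "?" + urlencode(params)


def parse_tab(text):
    """Turn a tab-separated response with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def fetch_records(term):
    """Download and parse the tab-separated result for one term (needs network access)."""
    with urlopen(build_query_url(term)) as response:
        return parse_tab(response.read().decode("utf-8"))


# Only the URL is built here; call fetch_records("GF11270") to actually query the service.
print(build_query_url("GF11270"))
```

Since the results are sorted by score, the first row is probably the best candidate, but it is safer to check that its organism matches the one in the FASTA header before filling in the table.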
https://stackoverflow.com/questions/14864729