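I have a list of protein IDs in a text file (Interested proteins.txt). I want to extract the sequences for those IDs from a protein FASTA file (swissprot_canonical-isoforms.fasta) and write that subset to a new file (selected_proteins.fasta).
Part of swissprot_canonical-isoforms.fasta is shown below. The protein ID sits between the two "|" characters on each line that starts with ">". For example, "P04637" is a protein ID.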
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
>sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53
>sp|P04637-3|P53_HUMAN Isoform 3 of Cellular tumor antigen p53 OS=Homo sapiens GN=TP53
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQMLLDLRWCYFLINSS
Here are some of the protein IDs listed in Interested proteins.txt:
Q6ZWH5
Q8NG66
P51955
P51957
P04629
The final output should look like this (only the sequence for Q6ZWH5 is shown as an example):
>sp|Q6ZWH5|NEK10_HUMAN Serine/threonine-protein kinase Nek10 OS=Homo sapiens GN=NEK10 PE=2 SV=3
MPDQDKKVKTTEKSTDKQQEITIRDYSDLKRLRCLLNVQSSKQQLPAINFDSAQNSMTKS
EPAIRAGGHRARGQWHESTEAVELENFSINYKNERNFSKHPQRKLFQEIFTALVKNRLIS
REWVNRAPSIHFLRVLICLRLLMRDPCYQEILHSLGGIENLAQYMEIVANEYLGYGEEQH
TVDKLVNMTYIFQKLAAVKDQREWVTTSGAHKTLVNLLGARDTNVLLGSLLALASLAESQ
ECREKISELNIVENLLMILHEYDLLSKRLTAELLRLLCAEPQVKEQVKLYEGIPVLLSLL
HSDHLKLLWSIVWILVQVCEDPETSVEIRIWGGIKQLLHILQGDRNFVSDHSSIGSLSSA
NAAGRIQQLHLSEDLSPREIQENTFSLQAACCAALTELVLNDTNAHQVVQENGVYTIAKL
ILPNKQKNAAKSNLLQCYAFRALRFLFSMERNRPLFKRLFPTDLFEIFIDIGHYVRDISA
YEELVSKLNLLVEDELKQIAENIESINQNKAPLKYIGNYAILDHLGSGAFGCVYKVRKHS
GQNLLAMKEVNLHNPAFGKDKKDRDSSVRNIVSELTIIKEQLYHPNIVRYYKTFLENDRL
YIVMELIEGAPLGEHFSSLKEKHHHFTEERLWKIFIQLCLALRYLHKEKRIVHRDLTPNN
IMLGDKDKVTVTDFGLAKQKQENSKLTSVVGTILYSCPEVLKSEPYGEKADVWAVGCILY
QMATLSPPFYSTNMLSLATKIVEAVYEPVPEGIYSEKVTDTISRCLTPDAEARPDIVEVS
SMISDVMMKYLDNLSTSQLSLEKKLERERRRTQRYFMEANRNTVTCHHELAVLSHETFEK
ASLSSSSSGAASLKSELSESADLPPEGFQASYGKDEDRACDEILSDDNFNLENAEKDTYS
EVDDELDISDNSSSSSSSPLKESTFNILKRSFSASGGERQSQTRDFTGGTGSRPRPALLP
LDLLLKVPPHMLRAHIKEIEAELVTGWQSHSLPAVILRNLKDHGPQMGTFLWQASAGIAV
SQRKVRQISDPIQQILIQLHKIIYITQLPPALHHNLKRRVIERFKKSLFSQQSNPCNLKS
EIKKLSQGSPEPIEPNFFTADYHLLHRSSGGNSLSPNDPTGLPTSIELEEGITYEQMQTV
IEEVLEESGYYNFTSNRYHSYPWGTKNHPTKR
Is there a way to do this in Python? Any help would be appreciated.
Posted on 2020-06-19 07:50:02
You can do this with pyfasta, a Python interface to FASTA-format files.
from pyfasta import Fasta

f = Fasta('fasta.fa')  # open the file
targets = {"P04637"}  # define your target IDs
selection = {}
for key in f:
    # the protein ID is the second "|"-separated field of the header
    candidateKey = key.split("|")[1]
    if candidateKey in targets:
        selection[key] = f[key][:]  # slice to get the full sequence
        print(key)
        print(selection[key])
Output:
sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAP
TPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAM
AIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKK
KPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
https://stackoverflow.com/questions/62465131
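The pyfasta snippet hard-codes the target IDs and prints the matches instead of writing a file. The question asks for the IDs to be read from Interested proteins.txt and the matches saved to selected_proteins.fasta, so here is a minimal sketch of that full workflow. It uses only the Python standard library and is an alternative to pyfasta, not the answer's own method. The function names `read_ids`, `accession`, and `extract_records` are hypothetical. The sketch assumes one ID per line in the list file and headers of the form `>sp|ID|NAME ...`.

```python
def read_ids(path):
    """Return the set of non-empty, whitespace-stripped lines in the ID list."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def accession(header):
    """Return the field between the first two '|' of a FASTA header,
    e.g. 'P04637' for '>sp|P04637|P53_HUMAN ...', or None if absent."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else None


def extract_records(fasta_path, ids_path, out_path):
    """Copy every record whose ID is in the list to out_path.

    Returns the number of records written. Assumption: IDs are matched
    exactly, so 'P04637' does not select the isoform 'P04637-2'.
    """
    wanted = read_ids(ids_path)
    written = 0
    keep = False  # True while the current record is one we want
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # A new record starts: decide whether to keep it
                keep = accession(line) in wanted
                written += keep
            if keep:
                dst.write(line)
    return written
```

With the file names from the question, the call would be `extract_records("swissprot_canonical-isoforms.fasta", "Interested proteins.txt", "selected_proteins.fasta")`. The sketch streams the input line by line, so a large FASTA file is never loaded into memory. Header and sequence lines are copied unchanged, which keeps the original line wrapping in the output.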