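For an assignment, I need to build on my previous function to write a new function that, given a FASTA file and a minimum and maximum molecular weight, returns a list of the sequence IDs whose molecular weight falls within the given interval.
This is my previous function: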
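from itertools import product
from Bio import SeqIO, SeqUtils
from Bio.Data import IUPACData

def Dict_MW(file_name):
    with open(file_name) as seq_file:
        seq_dict = {}
        for record in SeqIO.parse(seq_file, 'fasta'):
            # Expand every ambiguity code into all concrete sequences it can stand for.
            d = IUPACData.ambiguous_dna_values
            ambiguous_dna = list(map("".join, product(*map(d.get, record))))
            mol_weight = []
            for seq in ambiguous_dna:
                mol_weight.append(SeqUtils.molecular_weight(seq))
            # Ambiguous records get a (min, max) tuple, unambiguous ones a single value.
            if min(mol_weight) != max(mol_weight):
                seq_dict[record.id] = (min(mol_weight), max(mol_weight))
            else:
                seq_dict[record.id] = min(mol_weight)
        print(seq_dict)

This function prints a dictionary with the sequence IDs as keys and the molecular weights as values.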
This is the new function:
def List(file_name, mw_min, mw_max):
    with open(file_name) as seq_file:
        seq_dict = {}
        ID = []
        for record in SeqIO.parse(seq_file, 'fasta'):
            d = IUPACData.ambiguous_dna_values
            ambiguous_dna = list(map("".join, product(*map(d.get, record))))
            mol_weight = []
            for seq in ambiguous_dna:
                mol_weight.append(SeqUtils.molecular_weight(seq))
            if min(mol_weight) != max(mol_weight):
                seq_dict[record.id] = (min(mol_weight), max(mol_weight))
            else:
                seq_dict[record.id] = min(mol_weight)
            for values in mol_weight:
                if mw_min <= values <= mw_max:
                    ID.append(seq_dict.keys())
        print(ID)

It runs, but the output is not correct. It gives all the IDs instead of only the IDs whose molecular weight lies within the given interval.
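One likely cause, offered as a sketch rather than a definitive fix: `ID.append(seq_dict.keys())` appends a view of every key stored in the dictionary, not the ID of the current record, so each match adds all IDs at once. The example below filters a dictionary shaped like the one `Dict_MW` builds. The function name `ids_in_range`, the example weights, and the rule that an ambiguous record qualifies only when its whole (min, max) range fits inside the interval are assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

# A record's weight is one number (unambiguous) or a (min, max) tuple (ambiguous).
Weight = Union[float, Tuple[float, float]]


def ids_in_range(weights: Dict[str, Weight], mw_min: float, mw_max: float) -> List[str]:
    """Return the IDs whose weight, or whole weight range, lies in [mw_min, mw_max]."""
    selected = []
    for seq_id, weight in weights.items():
        low, high = weight if isinstance(weight, tuple) else (weight, weight)
        # Assumption: an ambiguous record counts only if its entire range fits.
        if mw_min <= low and high <= mw_max:
            selected.append(seq_id)  # append this record's ID, not all keys
    return selected


# Hypothetical weights for illustration only (not computed from the FASTA file).
example = {"seq_a": 3400.0, "seq_b": (3600.0, 3700.0), "seq_c": (3900.0, 4200.0)}
print(ids_in_range(example, 3300.0, 4000.0))  # ['seq_a', 'seq_b']
```

If any overlap with the interval should count instead, the condition could be changed to `low <= mw_max and high >= mw_min`.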
The FASTA file I'm using:
>seq_7009 random sequence
DGRGGGWAVCVAACGTTGAT
>seq_418 random sequence
GAGCTGVTATST
>seq_9143_unamb random sequence
ACCGTTAAGCCTTAG
>seq_2888 random sequence
RVCCWDGARATAGBCGC
>seq_1101 random sequence
CSAATGYGATNBTA
>seq_107 random sequence
WGDGHGCDCTYANGTTWCA
>seq_6946 random sequence
TCVMBRAGRSGTCCAWA
>seq_6162 random sequence
YWBGCKTGCCAAGCGCDG
>seq_504 random sequence
ADDTAACCCTCTTKA
>seq_3535 random sequence
KKGTACACCAG
>seq_4077 random sequence
SRWSCRTTRVAGDCC
> seq_1626_unamb random sequence
GGATATTACCTA

Posted on 2021-12-19 23:05:10
This is how I tried to solve the problem. I assume we have the same Python assignment.
from Bio import SeqIO
from Bio import SeqUtils

## function description
def unamb_MW(filename):
    mol_weight_list = []
    mol_weight_dict = dict()
    nucleotides = {'A', 'T', 'C', 'G'}  # define the nucleotides so that, in an ambiguous sequence, a code such as N is recognized as a non-nucleotide.
    with open(filename) as file:  # the file is a FASTA file, so it has to be opened and parsed before SeqUtils can compute the molecular weight.
        for record in SeqIO.parse(file, "fasta"):  # SeqIO.parse prepares the FASTA file for reading, one record at a time.
            for nucl in record:  # walk through every nucleotide of the sequence.
                if nucl in nucleotides:  # separate ambiguous from unambiguous sequences; this avoids an error from SeqUtils, which only works with unambiguous sequences.
                    continue
                else:
                    print(str(record.id) + ": is ambiguous")  # print the ambiguous sequences so you can open the FASTA file and check that the code really leaves them out.
                    break
            else:
                mol_weight = SeqUtils.molecular_weight(record.seq)  # Biopython function that computes the molecular weight.
                print(str(record.id) + ": is unambiguous & molecular weight = " + str(mol_weight))
                mol_weight_list.append(mol_weight)
                mol_weight_dict[str(record.id)] = mol_weight
    #print(mol_weight_dict)
    return mol_weight_dict

def MW_list(filename, min_MW, max_MW):
    mol_weight = unamb_MW(filename)
    for record in mol_weight:
        # print only the records whose weight lies strictly between the two bounds.
        if min_MW < mol_weight[record] < max_MW:
            print('\n', [record])
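The for/else loop in `unamb_MW` checks that every character of a record is A, C, G or T. As a minimal sketch, the same check can be written as a set comparison using only the standard library. The names `is_unambiguous` and `UNAMBIGUOUS` are hypothetical and not part of Biopython; like the loop above, the check is case-sensitive.

```python
UNAMBIGUOUS = frozenset("ACGT")


def is_unambiguous(sequence: str) -> bool:
    """True when every character of the sequence is one of A, C, G, T."""
    return set(sequence) <= UNAMBIGUOUS


print(is_unambiguous("ACCGTTAAGCCTTAG"))  # seq_9143_unamb from the question: True
print(is_unambiguous("KKGTACACCAG"))      # seq_3535 contains the code K: False
```

With a SeqRecord, `str(record.seq)` would be passed as the argument.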
If you're taking the computational biology course, then we have the same assignment.
https://stackoverflow.com/questions/70351467