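I wrote an algorithm that analyzes protein-RNA interactions, and I found that the following function is the bottleneck causing the performance problems: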
import numpy as np
#len(protein_sequence)~500, len(rna_sequence)~1500
def affinity_matrix(protein_sequence, rna_sequence):
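    python_matrix = [[] for _ in range(len(protein_sequence))]
    for i, AA in enumerate(protein_sequence):
        for base in rna_sequence:
            python_matrix[i].append(scales[base][AA])
    #(Where "scales" is a small dict() with the structure:
    #scales[base][AA] = float(), with 4 bases and 20 AA, so 80 values total.)
    return(np.array(python_matrix))

I suspect there are two problems in this code, but I don't know how to solve them:

1. I load this dictionary millions of times, and loading "non-static" data structures is said to be slow. How can I make this small dictionary "static"?
2. I build the matrix with Python's tools first and then convert it to a (faster) NumPy array. Creating it directly with NumPy might be faster, but I'm not sure whether that is possible.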
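I would appreciate any tips on how to improve this code, or pointers to guides that would help in this situation.

Edit:

Here is sample data in case you want to try the algorithm: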
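/*
 * Note: this line of code is too long, so the system commented it out automatically and does not highlight it. One-click copy removes the system comment.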
* rna_sequence = 'ACAGGAGGAGCCGCUCGCUGGCGGCUGAUCCAGCGUCUCCGUGACAGGCACCCUGCUCCGCCGCCACCGCCACCGCCACCGCCACCGUCGCCUUUUCUUCUUCGUCCCGGGCGGUGCGUUCCACUGCUCUGGGGCCGGCGCCGCGCCCAGUCCCGCUUCGGGCCGCAAGCCCCACCGCUCCCCUCCCCGGGCAGGGGCGCCGCGCAGCCCGCUCCCGCCGCCACCUCCUCCCCUGCCGCCCUCCUAGCCGGCAGGAAUUGCGCGACCACAGCGCCGCUCGCGUCGCCCGCAUCAGCUCAGCCCGCUGCCGCUCGGCCCUCGGCACCGCUCCGGGUCCGGCCGCCGCGCGGCCAGGGCUCCCCCUGCCCAGCGCUCCCAGGCCCCGCCACGCGUCGCCGCGCCCAGCUCCAGUCUCCCCUCCCCGGGGUCUCGCCAGCCCCUUCCUGCAGCCGCCGCCUCCGAAGGAGCGGGUCCGCCGCGGGUAACCAUGCCUAGCAAAACCAAGUACAACCUUGUGGACGAUGGGCACGACCUGCGGAUCCCCUUGCACAACGAGGACGCCUUCCAGCACGGCAUCUGCUUUGAGGCCAAGUACGUAGGAAGCCUGGACGUGCCAAGGCCCAACAGCAGGGUGGAGAUCGUGGCUGCCAUGCGCCGGAUACGGUAUGAGUUUAAAGCCAAGAACAUCAAGAAGAAGAAAGUGAGCAUUAUGGUUUCAGUGGAUGGAGUGAAAGUGAUUCUGAAGAAGAAGAAAAAGCUUCUUUUAUUGCAGAAAAAGGAAUGGACGUGGGAUGAGAGCAAGAUGCUGGUGAUGCAGGACCCCAUCUACAGGAUCUUCUAUGUCUCUCAUGAUUCCCAAGACUUGAAGAUCUUCAGCUAUAUCGCUCGAGAUGGUGCCAGCAAUAUCUUCAGGUGUAACGUCUUUAAAUCCAAGAAGAAGAGCCAAGCUAUGAGAAUCGUUCGGACGGUGGGGCAGGCCUUUGAGGUCUGCCACAAGCUGAGCCUGCAGCACACGCAGCAGAAUGCAGAUGGCCAGGAAGAUGGAGAGAGCGAGAGGAACAGCAACAGCUCAGGAGACCCAGGCCGCCAGCUCACUGGAGCCGAGAGGGCCUCCACGGCCACUGCAGAGGAGACUGACAUCGAUGCGGUGGAGGUCCCACUUCCAGGGAAUGAUGUCCUGGAAUUCAGCCGAGGUGUGACUGAUCUAGAUGCUGUAGGGAAGGAAGGAGGCUCUCACACAGGCUCCAAGGUUUCGCACCCCCAGGAGCCCAUGCUGACAGCCUCACCCAGGAUGCUGCUCCCUUCUUCUUCCUCGAAGCCUCCAGGCCUGGGCACAGAGACACCGCUGUCCACUCACCACCAGAUGCAGCUCCUCCAGCAGCUCCUCCAGCAGCAGCAGCAGCAGACACAAGUGGCUGUGGCCCAGGUACACUUGCUGAAGGACCAGUUGGCUGCUGAGGCUGCGGCGCGGCUGGAGGCCCAGGCUCGCGUGCAUCAGCUUUUGCUGCAGAACAAGGACAUGCUCCAGCACAUCUCCCUGCUGGUCAAGCAGGUGCAAGAGCUGGAACUGAAGCUGUCAGGACAGAACGCCAUGGGCUCCCAGGACAGCUUGCUGGAGAUCACCUUCCGCUCCGGAGCCCUGCCCGUGCUCUGUGACCCCACGACCCCUAAGCCAGAGGACCUGCAUUCGCCGCCGCUGGGCGCGGGCUUGGCUGACUUUGCCCACCCUGCGGGCAGCCCCUUAGGUAGGCGCGACUGCUUGGUGAAGCUGGAGUGCUUUCGCUUUCUUCCGCCCGAGGACACCCCGCCCCCAGCGCAGGGCGAGGCGCUCCUGGGCGGUCUGGAGCUCAUCAAGUUCCGAGAGUCAGGCAUCGCCUCGGAGUACGAGUCCAACACGGACGAGAGCGAGGAGCGCGACUCGUGGUCCCAGGAGGAGCUGCCGCGCCUGCUGAAUGUCCUGCAGAGGCAGGAACUGGGCGACG
GCCUGGAUGAUGAGAUCGCCGUGUAGGUGCCGAGGGCGAGGAGAUGGAGGCGGCGGCGUGGCUGGAGGGGCCGUGUCUGGCUGCUGCCCGGGUAGGGGAUGCCCAGUGAAUGUGCACUGCCGAGGAGAAUGCCAGCCAGGGCCCGGGAGAGUGUGAGGUUUCAGGAAAGUAUUGAGAUUCUGCUUUGGAGGGUAAAGUGGGGAAGAAAUCGGAUUCCCAGAGGUGAAUCAGCUCCUCUCCUACUUGUGACUAGAGGGUGGUGGAGGUAAGGCCUUCCAGAGCCCAUGGCUUCAGGAGAGGGUCUCUCUCCAGGACUGCCAGGCUGCUGGAGGACCUGCCCCUACCUGCUGCAUCGUCAGGCUCCCACGCUUUGUCCGUGAUGCCCCCCUACCCCCUCACUCUCCCCGUCUCCAUGGUCCCGACCAGGAAGGGAAGCCAUCGGUACCUUCUCAGGUACUUUGUUUCUGGAUAUCACGAUGCUGCGAGUUGCCUAACCCUCCCCCUACCUUUAUGAGAGGAAUUCCUUCUCCAGGCCCUUGCUGAGAUUGUAGAGAUUGAGUGCUCUGGACCGCAAAAGCCAGGCUAGUCCUUGUAGGGUGAGCAUGGAAUUGGAAUGUGUCACAGUGGAUAAGCUUUUAGAGGAACUGAAUCCAAACAUUUUCUCCAGCCGGACAUUGAAUGUUGCUACAAAGGGAGCCUUGAAGCUUUAACAUGGUUCAGGCCCUUGGUGUGAGAGCCCAGGGGGAGGACAGCUUGUCUGCUGCUCCAAAUCACUUAGAUCUGAUUCCUGUUUUGAAAGUCCUGCCCUGCCUUCCUCCUGCCUGUAGCCCAGCCCAUCUAAAUGGAAGCUGGGAAUUGCCCCUCACCUCCCCUGUGUCCUGUCCAGCUGAAGCUUUUGCAGCACUUUACCUCUCUGAAAGCCCCAGAGGACCAGAGCCCCCAGCCUUACCUCUCAACCUGUCCCCUCCACUGGGCAGUGGUGGUCAGUUUUUACUGCAAAAAAAAAAAAAGAAAAAAGAGAAAGAAAAAAAAGAAUGAAUGCAAGCUGAUAGCUGAGACUGUGAGACUGUUUUUGUCCACUCUUCUGAAUCACUGCCACUUGGGUCAGGGACCACAGCCAUUGCCACCCUUGGCCCAUCUCUCUGCGUGCGUGCCUUGAGCACACAUAUAAAAAGUGCCAUGUGCAAUUGUCUUAUCUUUUAUGAUCUAGGCUUUGCCUAGGGAUCACUACUCCUUAACGGGCUGGCUGGGGCAAUGAGGAAAAGCUCCUUUGCUCCUGUAAGGCCAUAAGUGGCUGUUAACAGAUUUUCAAAUGCCUGAAGAGAUUGCUGAGACCUGCUAGAGUCAUAUGUUCGGGGAAUUAAGUCUUUAUCCUAGACAACAAGGUACAGAUGCAAACUGCAGUGUUAUUGGAGGGUCAAUCGGCAAGGAUAUGAUUAUCCCAAAAUGGAGUUCAUCGACCCUAGCUUUCCUUUAGAUUAUAUAUAAAUAAAAGUGCAGUCCUCUUCUAAUGGCCACAGUUGGUUUUCUUGUAGCCCAGAAAGUCCAAAUUAAAGGAAAUAAAUUCAGUUUUAUGUUAGCCUUCCUUGGUGCAUCAGGGUGUCAGUGGAAAUAGGAUCAGGUGGUGUGUGUGUGUGUGUUUUGUGUGUGUGUGUACACAUGUGUUUAUAUAUACAUGUGUGAGGGAAAGUGUGUACAUAUAUGUAGGAUUGUAACCAGACGGAAAAGAACGAGGAUCUCCAGGGUGUUUGAAUCAGCAACAGAUUUGUGUUUUCUAACAUGCAUUUAGUUGGAGAGGCAUGGUUCUGUUUGUUUUGUUUUGAUCUAAUUUGCCAUUGGAAAUAGGUACAGUUACACAGAGAAGGAAGAACCAGGAAAGUGAGAUCCAUGAAACUAAAUGAGCAGCUGUCAGAAUCCAGUGUGGCUGAGCCUACCUAGCUUAUGAAAUCUAACCCAGGGUUCCCUGAGUCCAAGACCACUUAGAUUAUUAAGAUUUUGAA
CGUCCAGAGGAGUGAAAAGUCUGUUUUCUGACGUAAGCCGGAGCUGAGGAUAAAGCCAGAGGCCAGUGGAUUAGGUGUAUGGAAUGUGGAUGGAGAGGGCUUGUGUGGGAUGUGGCCAGGGAGUGGGUGAGGAAGGCCGCUUCUAAAUGGCCUGUAAAAACUUGAGAUUGGAUAGACGAAAGGAAAUGGAGAAAUUAAAGAAUUGGAGAAACUAGUUAUCUGUGUUGCUGACUUUGGGACCCAUCCAAGACUCCUGCCCUUGGGGUGUUCCAUGGUGGUUUCUUCCUGCCUGGGCGCCACCCUUUCCCCAGUUCAGGCCCUCCCUGGAGGACUAGUUUGUGUAUUGGUAUCCUCCCCAGUGGACCCAAACCAGCGCAUACUUGGUGUGUGGAGAUGGGAGACAAAGGACAGAUCUAGGAGCCUUGAAGGAUCACCAGCCACCGACCCUCCAUCAGGGCCAACUGGGCAGGAAAGGGAACAUUGCAGACCUGAUUUCCCGACGAUGUCACCCUGUCCUCCCUCCUUGCUUCUUGCUCUGCUAACUCAACUCUGCCUUCCUCUUUUUCAUUCUUCUACUCUGCCCUAUAUGGAGGACAAAUGGACACCAGGGGUGCUAACCUUAUUGGUGCCUGCCCCAGCCUACCCCAGGUGCCAGCAGACUCUCGUGCACAGGAGGCUCCCACAGUUAUGGAGCCAGGAAAGAAUUUCUCUGCACUGGAUGGACUGUAUAUUGAGAUUAAAAAUUAUAUUCCUUAUAUUCCUGCUUAUAUCAAUGCUCUCUCUGUAAAACCUCUUCCUAGCCUCAUUUCUCUCAACUGAUCUUGUUUAGGCGUUGUAUUCCUUUUAUUUACUCUUUGCUUGACUGCUUCCUCCUAACCCUCUACCCACUAGCACUCUACUUCCUAAAGCUGUUGUGUCAUUAACUCUGUUGGAUCAACUCUCUGGGAAAAGAUUCUGUUAAUGUAAGUGCACUUACUCCCUGGAUGUUGUCACUAGUCUAGUGGCUUUUGCUAAAUAAACCUUUCUUAUUUCUA'
*/
protein_sequence = 'MPSKTKYNLVDDGHDLRIPLHNEDAFQHGICFEAKYVGSLDVPRPNSRVEIVAAMRRIRYEFKAKNIKKKKVSIMVSVDGVKVILKKKKKLLLLQKKEWTWDESKMLVMQDPIYRIFYVSHDSQDLKIFSYIARDGASNIFRCNVFKSKKKSQAMRIVRTVGQAFEVCHKLSLQHTQQNADGQEDGESERNSNSSGDPGRQLTGAERASTATAEETDIDAVEVPLPGNDVLEFSRGVTDLDAVGKEGGSHTGSKVSHPQEPMLTASPRMLLPSSSSKPPGLGTETPLSTHHQMQLLQQLLQQQQQQTQVAVAQVHLLKDQLAAEAAARLEAQARVHQLLLQNKDMLQHISLLVKQVQELELKLSGQNAMGSQDSLLEITFRSGALPVLCDPTTPKPEDLHSPPLGAGLADFAHPAGSPLGRRDCLVKLECFRFLPPEDTPPPAQGEALLGGLELIKFRESGIASEYESNTDESEERDSWSQEELPRLLNVLQRQELGDGLDDEIAV'
scales = {base:{AA: 1.1111 for AA in "DTSEPGACVMILYFHKRWQN"} for base in "AGCU"}

Posted on 2019-11-27 11:51:45
IIUC, you can use a nested list comprehension, which avoids the repeated calls to append:
def affinity_matrix(protein_sequence, rna_sequence):
    return np.array([[scales[base][AA] for base in rna_sequence] for AA in protein_sequence])

Posted on 2019-11-27 12:10:51
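I load this dictionary millions of times, and loading "non-static" data structures is said to be slow. So how can I make this small dictionary "static"?

How about using a flat dictionary keyed by (base, AA) tuples instead of a nested dictionary? You can then retrieve values with scales[(base, AA)]. I suggest trying it, but I'm not entirely sure it improves performance, because a dict get operation is already O(1) on average.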
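A minimal sketch of the flat-dictionary idea, using the same `scales` structure as the question's sample data. The names `flat_scales` and `affinity_matrix_flat` are made up for this illustration, and whether it is actually faster should be measured rather than assumed:

```python
import numpy as np

# Same nested structure as in the question: scales[base][AA] -> float
scales = {base: {AA: 1.1111 for AA in "DTSEPGACVMILYFHKRWQN"} for base in "AGCU"}

# Flatten once, outside the hot function, so each cell needs a single lookup
flat_scales = {(base, AA): value
               for base, row in scales.items()
               for AA, value in row.items()}

def affinity_matrix_flat(protein_sequence, rna_sequence):
    # Rows follow the protein, columns follow the RNA, as in the original
    return np.array([[flat_scales[(base, AA)] for base in rna_sequence]
                     for AA in protein_sequence])
```

The flattening happens only once, so any gain would come from replacing two chained lookups per cell with one tuple-keyed lookup.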
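I build the matrix with Python's tools first and then convert it to a (faster) NumPy array. Creating it directly with NumPy might be faster, but I'm not sure whether that is possible.

That is possible, and it should run faster. It would look something like this: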
shape = len(protein_sequence), len(rna_sequence)
arr = np.empty(shape)
for i, AA in enumerate(protein_sequence):
    for j, base in enumerate(rna_sequence):
        arr[i, j] = scales[base][AA]

Updated
When I ran some checks, the list comprehension, as in Daniel's answer, was somewhat faster than my for-loop answer above.
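Both variants above still perform one Python-level dictionary lookup per matrix cell. As a sketch of building the array entirely inside NumPy, the 80 values can be stored in a small 20 x 4 table and the two sequences converted to integer index arrays, so the full matrix comes from a single indexing operation. The names `AA_ORDER`, `BASE_ORDER`, `table` and `affinity_matrix_indexed` are assumptions for this example, and it assumes every character of both sequences is a key in `scales`:

```python
import numpy as np

scales = {base: {AA: 1.1111 for AA in "DTSEPGACVMILYFHKRWQN"} for base in "AGCU"}

AA_ORDER = "DTSEPGACVMILYFHKRWQN"   # row order of the lookup table
BASE_ORDER = "AGCU"                 # column order of the lookup table

# 20 x 4 table: table[i, j] == scales[BASE_ORDER[j]][AA_ORDER[i]]
table = np.array([[scales[base][AA] for base in BASE_ORDER] for AA in AA_ORDER])

AA_INDEX = {AA: i for i, AA in enumerate(AA_ORDER)}
BASE_INDEX = {base: j for j, base in enumerate(BASE_ORDER)}

def affinity_matrix_indexed(protein_sequence, rna_sequence):
    # Translate each sequence to integer indices once: one dict lookup
    # per residue or base instead of one per matrix cell
    rows = np.array([AA_INDEX[AA] for AA in protein_sequence])
    cols = np.array([BASE_INDEX[base] for base in rna_sequence])
    # np.ix_ builds an open mesh, so the result has shape
    # (len(protein_sequence), len(rna_sequence))
    return table[np.ix_(rows, cols)]
```

Comparing the result against the list-comprehension version with `np.array_equal` on the sample data is a quick way to check that the two agree before timing them.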
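Using a generator looks much faster, but the comparison is misleading: np.array does not iterate over a generator. It wraps the generator in a 0-dimensional object array, so the call below returns almost immediately without computing any of the values.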
np.array(((scales[base][AA] for base in rna_sequence) for AA in protein_sequence))

https://stackoverflow.com/questions/59069468