How to speed up this bottleneck: loading a dict many times in Python

Stack Overflow user
Asked on 2019-11-27 11:44:25
Answers: 2 | Views: 124 | Followers: 0 | Votes: 0

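I wrote an algorithm that analyzes protein-RNA interactions, and I found that the following function is the bottleneck causing its performance problems: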

import numpy as np
#len(protein_sequence)~500, len(rna_sequence)~1500

def affinity_matrix(protein_sequence, rna_sequence): 
    python_matrix = [[] for _ in range(len(protein_sequence))]

    for i, AA in enumerate(protein_sequence):
        for base in rna_sequence:
            python_matrix[i].append(scales[base][AA])
            #(Where "scales" is a small dict() with the structure: 
            #scales[base][AA] = float(), with 4 bases and 20 AA, so 80 values total.)
    return(np.array(python_matrix))

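I suspect there are two problems in this code, but I don't know how to solve them:

  1. I am retrieving values from this "scales" dict millions of times, and I have been told that looking up values in a "non-static" data structure such as a dict() is slow. So how can I make this small dictionary "static"? (The dictionary is created only once, whereas rna_sequence and protein_sequence differ on every call to the function.)
  2. I first build the matrix with plain Python tools and then convert it into a (faster) NumPy array. Creating it directly with NumPy might be faster, but I am not sure whether that is possible.

I would appreciate any tips on how to improve this code, or pointers to guides that would help in this situation.

Edit:

Here is some sample data in case you want to try the algorithm: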

/*
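* Note: this line of code is too long, so the system commented it out automatically and it is not highlighted. One-click copy removes the system comment.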
* rna_sequence = 'ACAGGAGGAGCCGCUCGCUGGCGGCUGAUCCAGCGUCUCCGUGACAGGCACCCUGCUCCGCCGCCACCGCCACCGCCACCGCCACCGUCGCCUUUUCUUCUUCGUCCCGGGCGGUGCGUUCCACUGCUCUGGGGCCGGCGCCGCGCCCAGUCCCGCUUCGGGCCGCAAGCCCCACCGCUCCCCUCCCCGGGCAGGGGCGCCGCGCAGCCCGCUCCCGCCGCCACCUCCUCCCCUGCCGCCCUCCUAGCCGGCAGGAAUUGCGCGACCACAGCGCCGCUCGCGUCGCCCGCAUCAGCUCAGCCCGCUGCCGCUCGGCCCUCGGCACCGCUCCGGGUCCGGCCGCCGCGCGGCCAGGGCUCCCCCUGCCCAGCGCUCCCAGGCCCCGCCACGCGUCGCCGCGCCCAGCUCCAGUCUCCCCUCCCCGGGGUCUCGCCAGCCCCUUCCUGCAGCCGCCGCCUCCGAAGGAGCGGGUCCGCCGCGGGUAACCAUGCCUAGCAAAACCAAGUACAACCUUGUGGACGAUGGGCACGACCUGCGGAUCCCCUUGCACAACGAGGACGCCUUCCAGCACGGCAUCUGCUUUGAGGCCAAGUACGUAGGAAGCCUGGACGUGCCAAGGCCCAACAGCAGGGUGGAGAUCGUGGCUGCCAUGCGCCGGAUACGGUAUGAGUUUAAAGCCAAGAACAUCAAGAAGAAGAAAGUGAGCAUUAUGGUUUCAGUGGAUGGAGUGAAAGUGAUUCUGAAGAAGAAGAAAAAGCUUCUUUUAUUGCAGAAAAAGGAAUGGACGUGGGAUGAGAGCAAGAUGCUGGUGAUGCAGGACCCCAUCUACAGGAUCUUCUAUGUCUCUCAUGAUUCCCAAGACUUGAAGAUCUUCAGCUAUAUCGCUCGAGAUGGUGCCAGCAAUAUCUUCAGGUGUAACGUCUUUAAAUCCAAGAAGAAGAGCCAAGCUAUGAGAAUCGUUCGGACGGUGGGGCAGGCCUUUGAGGUCUGCCACAAGCUGAGCCUGCAGCACACGCAGCAGAAUGCAGAUGGCCAGGAAGAUGGAGAGAGCGAGAGGAACAGCAACAGCUCAGGAGACCCAGGCCGCCAGCUCACUGGAGCCGAGAGGGCCUCCACGGCCACUGCAGAGGAGACUGACAUCGAUGCGGUGGAGGUCCCACUUCCAGGGAAUGAUGUCCUGGAAUUCAGCCGAGGUGUGACUGAUCUAGAUGCUGUAGGGAAGGAAGGAGGCUCUCACACAGGCUCCAAGGUUUCGCACCCCCAGGAGCCCAUGCUGACAGCCUCACCCAGGAUGCUGCUCCCUUCUUCUUCCUCGAAGCCUCCAGGCCUGGGCACAGAGACACCGCUGUCCACUCACCACCAGAUGCAGCUCCUCCAGCAGCUCCUCCAGCAGCAGCAGCAGCAGACACAAGUGGCUGUGGCCCAGGUACACUUGCUGAAGGACCAGUUGGCUGCUGAGGCUGCGGCGCGGCUGGAGGCCCAGGCUCGCGUGCAUCAGCUUUUGCUGCAGAACAAGGACAUGCUCCAGCACAUCUCCCUGCUGGUCAAGCAGGUGCAAGAGCUGGAACUGAAGCUGUCAGGACAGAACGCCAUGGGCUCCCAGGACAGCUUGCUGGAGAUCACCUUCCGCUCCGGAGCCCUGCCCGUGCUCUGUGACCCCACGACCCCUAAGCCAGAGGACCUGCAUUCGCCGCCGCUGGGCGCGGGCUUGGCUGACUUUGCCCACCCUGCGGGCAGCCCCUUAGGUAGGCGCGACUGCUUGGUGAAGCUGGAGUGCUUUCGCUUUCUUCCGCCCGAGGACACCCCGCCCCCAGCGCAGGGCGAGGCGCUCCUGGGCGGUCUGGAGCUCAUCAAGUUCCGAGAGUCAGGCAUCGCCUCGGAGUACGAGUCCAACACGGACGAGAGCGAGGAGCGCGACUCGUGGUCCCAGGAGGAGCUGCCGCGCCUGCUGAAUGUCCUGCAGAGGCAGGAACUGGGCGACG
GCCUGGAUGAUGAGAUCGCCGUGUAGGUGCCGAGGGCGAGGAGAUGGAGGCGGCGGCGUGGCUGGAGGGGCCGUGUCUGGCUGCUGCCCGGGUAGGGGAUGCCCAGUGAAUGUGCACUGCCGAGGAGAAUGCCAGCCAGGGCCCGGGAGAGUGUGAGGUUUCAGGAAAGUAUUGAGAUUCUGCUUUGGAGGGUAAAGUGGGGAAGAAAUCGGAUUCCCAGAGGUGAAUCAGCUCCUCUCCUACUUGUGACUAGAGGGUGGUGGAGGUAAGGCCUUCCAGAGCCCAUGGCUUCAGGAGAGGGUCUCUCUCCAGGACUGCCAGGCUGCUGGAGGACCUGCCCCUACCUGCUGCAUCGUCAGGCUCCCACGCUUUGUCCGUGAUGCCCCCCUACCCCCUCACUCUCCCCGUCUCCAUGGUCCCGACCAGGAAGGGAAGCCAUCGGUACCUUCUCAGGUACUUUGUUUCUGGAUAUCACGAUGCUGCGAGUUGCCUAACCCUCCCCCUACCUUUAUGAGAGGAAUUCCUUCUCCAGGCCCUUGCUGAGAUUGUAGAGAUUGAGUGCUCUGGACCGCAAAAGCCAGGCUAGUCCUUGUAGGGUGAGCAUGGAAUUGGAAUGUGUCACAGUGGAUAAGCUUUUAGAGGAACUGAAUCCAAACAUUUUCUCCAGCCGGACAUUGAAUGUUGCUACAAAGGGAGCCUUGAAGCUUUAACAUGGUUCAGGCCCUUGGUGUGAGAGCCCAGGGGGAGGACAGCUUGUCUGCUGCUCCAAAUCACUUAGAUCUGAUUCCUGUUUUGAAAGUCCUGCCCUGCCUUCCUCCUGCCUGUAGCCCAGCCCAUCUAAAUGGAAGCUGGGAAUUGCCCCUCACCUCCCCUGUGUCCUGUCCAGCUGAAGCUUUUGCAGCACUUUACCUCUCUGAAAGCCCCAGAGGACCAGAGCCCCCAGCCUUACCUCUCAACCUGUCCCCUCCACUGGGCAGUGGUGGUCAGUUUUUACUGCAAAAAAAAAAAAAGAAAAAAGAGAAAGAAAAAAAAGAAUGAAUGCAAGCUGAUAGCUGAGACUGUGAGACUGUUUUUGUCCACUCUUCUGAAUCACUGCCACUUGGGUCAGGGACCACAGCCAUUGCCACCCUUGGCCCAUCUCUCUGCGUGCGUGCCUUGAGCACACAUAUAAAAAGUGCCAUGUGCAAUUGUCUUAUCUUUUAUGAUCUAGGCUUUGCCUAGGGAUCACUACUCCUUAACGGGCUGGCUGGGGCAAUGAGGAAAAGCUCCUUUGCUCCUGUAAGGCCAUAAGUGGCUGUUAACAGAUUUUCAAAUGCCUGAAGAGAUUGCUGAGACCUGCUAGAGUCAUAUGUUCGGGGAAUUAAGUCUUUAUCCUAGACAACAAGGUACAGAUGCAAACUGCAGUGUUAUUGGAGGGUCAAUCGGCAAGGAUAUGAUUAUCCCAAAAUGGAGUUCAUCGACCCUAGCUUUCCUUUAGAUUAUAUAUAAAUAAAAGUGCAGUCCUCUUCUAAUGGCCACAGUUGGUUUUCUUGUAGCCCAGAAAGUCCAAAUUAAAGGAAAUAAAUUCAGUUUUAUGUUAGCCUUCCUUGGUGCAUCAGGGUGUCAGUGGAAAUAGGAUCAGGUGGUGUGUGUGUGUGUGUUUUGUGUGUGUGUGUACACAUGUGUUUAUAUAUACAUGUGUGAGGGAAAGUGUGUACAUAUAUGUAGGAUUGUAACCAGACGGAAAAGAACGAGGAUCUCCAGGGUGUUUGAAUCAGCAACAGAUUUGUGUUUUCUAACAUGCAUUUAGUUGGAGAGGCAUGGUUCUGUUUGUUUUGUUUUGAUCUAAUUUGCCAUUGGAAAUAGGUACAGUUACACAGAGAAGGAAGAACCAGGAAAGUGAGAUCCAUGAAACUAAAUGAGCAGCUGUCAGAAUCCAGUGUGGCUGAGCCUACCUAGCUUAUGAAAUCUAACCCAGGGUUCCCUGAGUCCAAGACCACUUAGAUUAUUAAGAUUUUGAA
CGUCCAGAGGAGUGAAAAGUCUGUUUUCUGACGUAAGCCGGAGCUGAGGAUAAAGCCAGAGGCCAGUGGAUUAGGUGUAUGGAAUGUGGAUGGAGAGGGCUUGUGUGGGAUGUGGCCAGGGAGUGGGUGAGGAAGGCCGCUUCUAAAUGGCCUGUAAAAACUUGAGAUUGGAUAGACGAAAGGAAAUGGAGAAAUUAAAGAAUUGGAGAAACUAGUUAUCUGUGUUGCUGACUUUGGGACCCAUCCAAGACUCCUGCCCUUGGGGUGUUCCAUGGUGGUUUCUUCCUGCCUGGGCGCCACCCUUUCCCCAGUUCAGGCCCUCCCUGGAGGACUAGUUUGUGUAUUGGUAUCCUCCCCAGUGGACCCAAACCAGCGCAUACUUGGUGUGUGGAGAUGGGAGACAAAGGACAGAUCUAGGAGCCUUGAAGGAUCACCAGCCACCGACCCUCCAUCAGGGCCAACUGGGCAGGAAAGGGAACAUUGCAGACCUGAUUUCCCGACGAUGUCACCCUGUCCUCCCUCCUUGCUUCUUGCUCUGCUAACUCAACUCUGCCUUCCUCUUUUUCAUUCUUCUACUCUGCCCUAUAUGGAGGACAAAUGGACACCAGGGGUGCUAACCUUAUUGGUGCCUGCCCCAGCCUACCCCAGGUGCCAGCAGACUCUCGUGCACAGGAGGCUCCCACAGUUAUGGAGCCAGGAAAGAAUUUCUCUGCACUGGAUGGACUGUAUAUUGAGAUUAAAAAUUAUAUUCCUUAUAUUCCUGCUUAUAUCAAUGCUCUCUCUGUAAAACCUCUUCCUAGCCUCAUUUCUCUCAACUGAUCUUGUUUAGGCGUUGUAUUCCUUUUAUUUACUCUUUGCUUGACUGCUUCCUCCUAACCCUCUACCCACUAGCACUCUACUUCCUAAAGCUGUUGUGUCAUUAACUCUGUUGGAUCAACUCUCUGGGAAAAGAUUCUGUUAAUGUAAGUGCACUUACUCCCUGGAUGUUGUCACUAGUCUAGUGGCUUUUGCUAAAUAAACCUUUCUUAUUUCUA'
*/
protein_sequence = 'MPSKTKYNLVDDGHDLRIPLHNEDAFQHGICFEAKYVGSLDVPRPNSRVEIVAAMRRIRYEFKAKNIKKKKVSIMVSVDGVKVILKKKKKLLLLQKKEWTWDESKMLVMQDPIYRIFYVSHDSQDLKIFSYIARDGASNIFRCNVFKSKKKSQAMRIVRTVGQAFEVCHKLSLQHTQQNADGQEDGESERNSNSSGDPGRQLTGAERASTATAEETDIDAVEVPLPGNDVLEFSRGVTDLDAVGKEGGSHTGSKVSHPQEPMLTASPRMLLPSSSSKPPGLGTETPLSTHHQMQLLQQLLQQQQQQTQVAVAQVHLLKDQLAAEAAARLEAQARVHQLLLQNKDMLQHISLLVKQVQELELKLSGQNAMGSQDSLLEITFRSGALPVLCDPTTPKPEDLHSPPLGAGLADFAHPAGSPLGRRDCLVKLECFRFLPPEDTPPPAQGEALLGGLELIKFRESGIASEYESNTDESEERDSWSQEELPRLLNVLQRQELGDGLDDEIAV'
scales = {base:{AA: 1.1111 for AA in "DTSEPGACVMILYFHKRWQN"} for base in "AGCU"}

2 Answers

Stack Overflow user

Answered on 2019-11-27 11:51:45

IIUC, you can use a nested list comprehension, which avoids the repeated append calls:

def affinity_matrix(protein_sequence, rna_sequence): 
    return np.array([[scales[base][AA] for base in rna_sequence] for AA in protein_sequence])
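Because a protein contains at most 20 distinct amino acids, many rows of the matrix are identical. The following is a minimal sketch, not part of the original answer, that builds each distinct row only once and reuses it. The function name `affinity_matrix_cached_rows` and the numeric scale values are hypothetical and chosen only for illustration; whether this helps in practice should be measured on real data.

```python
import numpy as np

BASES = "AGCU"
AMINO_ACIDS = "DTSEPGACVMILYFHKRWQN"

# Hypothetical scale values, chosen only so that every entry is distinct.
scales = {
    base: {AA: float(100 * j + i) for i, AA in enumerate(AMINO_ACIDS)}
    for j, base in enumerate(BASES)
}


def affinity_matrix_cached_rows(protein_sequence, rna_sequence, scales):
    """Build the matrix, computing the row for each distinct amino acid once."""
    row_cache = {}
    rows = []
    for AA in protein_sequence:
        if AA not in row_cache:
            # First time this amino acid appears: compute its full row.
            row_cache[AA] = [scales[base][AA] for base in rna_sequence]
        # Repeated amino acids reuse the same (unmodified) list object.
        rows.append(row_cache[AA])
    # np.array copies the data, so the result does not share memory with the cache.
    return np.array(rows)


matrix = affinity_matrix_cached_rows("MPSKTKYNLV", "ACAGGAGGAG", scales)
print(matrix.shape)
```

With a protein of about 500 residues, this computes at most 20 rows in Python instead of 500, leaving the remaining work to the final `np.array` call.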
Votes: 2

Stack Overflow user

Answered on 2019-11-27 12:10:51

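I am loading from this dict millions of times, and I have been told that loading from a "non-static" data structure is slow. So how can I make this small dictionary "static"?

How about using a flat dictionary keyed by (base, AA) tuples instead of a nested dictionary? You could then retrieve values with scales[(base, AA)]. I suggest trying it, but I am not sure it will improve performance, since a dict lookup is O(1) on average either way.

I first build the matrix with plain Python tools and then convert it into a (faster) NumPy array. Creating it directly with NumPy might be faster, but I am not sure whether that is possible.

This is possible and should run faster. It would look something like this: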
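As a rough illustration of the flat-dictionary idea, the sketch below flattens a nested dict into one keyed by (base, AA) tuples and times both kinds of lookup with `timeit`. The names `nested` and `flat` are placeholders, and the constant value 1.1111 is taken from the sample data in the question. No outcome is claimed here; the timings depend on the machine and Python version.

```python
import timeit

# Nested dict with the same structure as the question's sample "scales".
nested = {base: {AA: 1.1111 for AA in "DTSEPGACVMILYFHKRWQN"} for base in "AGCU"}

# Flatten it: one dict keyed by (base, AA) tuples.
flat = {
    (base, AA): value
    for base, inner in nested.items()
    for AA, value in inner.items()
}

# Time a single lookup of each kind, repeated many times.
t_nested = timeit.timeit(lambda: nested["A"]["M"], number=100_000)
t_flat = timeit.timeit(lambda: flat[("A", "M")], number=100_000)

print(f"nested lookup: {t_nested:.4f} s")
print(f"flat lookup:   {t_flat:.4f} s")
```

Note that the flat version has to build a tuple and hash it on each lookup, so it is worth measuring before committing to either layout.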

shape = len(protein_sequence), len(rna_sequence)
arr = np.empty(shape)
for i, AA in enumerate(protein_sequence):
    for j, base in enumerate(rna_sequence):
        arr[i, j] = scales[base][AA]
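The loop above still performs one Python-level dict lookup per matrix cell. A different technique, not proposed in this answer, is to convert the 80 scale values into a small 20x4 NumPy lookup table once and then use integer-array indexing to build the whole matrix in a single step. The sketch below assumes the alphabets from the question's sample data; the helper names (`build_table`, `affinity_matrix_vectorized`) and the numeric scale values are hypothetical.

```python
import numpy as np

BASES = "AGCU"
AMINO_ACIDS = "DTSEPGACVMILYFHKRWQN"

# Hypothetical scale values, chosen only so that every entry is distinct.
scales = {
    base: {AA: float(100 * j + i) for i, AA in enumerate(AMINO_ACIDS)}
    for j, base in enumerate(BASES)
}


def build_table(scales):
    """Return a (20, 4) array with table[i, j] = scales[BASES[j]][AMINO_ACIDS[i]]."""
    return np.array([[scales[b][aa] for b in BASES] for aa in AMINO_ACIDS])


def affinity_matrix_vectorized(protein_sequence, rna_sequence, table):
    """Build the affinity matrix by indexing the lookup table with integer arrays."""
    aa_index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    base_index = {b: j for j, b in enumerate(BASES)}
    # Translate each sequence into row / column indices of the table.
    rows = np.array([aa_index[aa] for aa in protein_sequence])
    cols = np.array([base_index[b] for b in rna_sequence])
    # np.ix_ builds an open mesh, so the result has shape (len(rows), len(cols)).
    return table[np.ix_(rows, cols)]


table = build_table(scales)  # built once, reused for every call
matrix = affinity_matrix_vectorized("MPSKTKYNLV", "ACAGGAGGAG", table)
print(matrix.shape)
```

Since the dictionary is created only once, the table can be built once as well, and each call then costs two short Python loops over the sequences plus one NumPy indexing operation.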

Update

When I ran some checks, the list comprehension from Daniel's answer turned out to be somewhat faster than my for-loop version above.

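Using a generator appears to be much faster, but np.array does not iterate over a generator: the call below returns a 0-dimensional object array wrapping the generator without computing any values, so the timing is not comparable.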

np.array(((scales[base][AA] for base in rna_sequence) for AA in protein_sequence))
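Before relying on the generator version, it is worth inspecting what it returns. The sketch below checks the shape and dtype of `np.array` applied to a generator, and shows `np.fromiter` as one way to consume a flat generator into a numeric array. The function name `affinity_matrix_fromiter` and the small test inputs are hypothetical; the constant 1.1111 comes from the question's sample data.

```python
import numpy as np

scales = {base: {AA: 1.1111 for AA in "DTSEPGACVMILYFHKRWQN"} for base in "AGCU"}
protein_sequence = "MPSKTKYNLV"
rna_sequence = "ACAGGAGGAG"

# np.array does not consume a generator; it stores the generator object itself.
wrapped = np.array(
    ((scales[base][AA] for base in rna_sequence) for AA in protein_sequence)
)
print(wrapped.shape, wrapped.dtype)  # () object


def affinity_matrix_fromiter(protein_sequence, rna_sequence, scales):
    """Consume a flat generator with np.fromiter, then reshape to 2-D."""
    n_rows, n_cols = len(protein_sequence), len(rna_sequence)
    flat = np.fromiter(
        (scales[base][AA] for AA in protein_sequence for base in rna_sequence),
        dtype=float,
        count=n_rows * n_cols,  # lets NumPy allocate the output up front
    )
    return flat.reshape(n_rows, n_cols)


matrix = affinity_matrix_fromiter(protein_sequence, rna_sequence, scales)
print(matrix.shape)
```

The first result is a 0-dimensional object array, which explains why that call returns almost instantly: none of the dict lookups have run yet.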
Votes: 1
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/59069468
