I have written a function that returns a Pandas dataframe (samples as rows, descriptors as columns) and takes as input a list of peptides (biological sequences as string data). "my_function(pep_list)" takes pep_list as its argument and returns the dataframe. It iterates over each peptide sequence in pep_list, computes the descriptors, combines all the data into a pandas dataframe, and returns the df:
Example:
pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF", "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]
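For context, here is a minimal sketch of the shape of such a function; the "length" column is a placeholder descriptor I made up for illustration, not the real computation:

import pandas as pd

def my_function(pep_list):
    # Real code would compute biochemical descriptors per peptide;
    # here each row is one peptide with a placeholder descriptor column.
    rows = [{"peptide": pep, "length": len(pep)} for pep in pep_list]
    return pd.DataFrame(rows)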
I want to parallelize this code using the algorithm given below:
1. Get the number of available processors:
n = multiprocessing.cpu_count()
2. Split pep_list into n sub-lists (see the sketch after this list for one way to do this):
sub_list_of_pep_list = pep_list split into n chunks
sub_list_of_pep_list = [["DAAAAEF", "DAAAREF", "DAAANEF"], ["DAAADEF", "DAAACEF", "DAAAEEF"], ["DAAAQEF", "DAAAGEF", "DAAAHEF"], ["DAAAIEF", "DAAALEF", "DAAAKEF"]]
4. run "my_function()" for each core as (example if 4 cores )
df0 = my_function(sub_list_of_pep_list[0])
df1 = my_function(sub_list_of_pep_list[1])
df2 = my_function(sub_list_of_pep_list[2])
df3 = my_function(sub_list_of_pep_list[3])
4. Join all the results: df = concat([df0, df1, df2, df3])
5. Return df with an n× speedup. Please suggest the most suitable library to implement this approach.
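Step 2 above is pseudocode; here is a minimal sketch of one way to do the split, where split_into_chunks is a hypothetical helper introduced only for illustration:

def split_into_chunks(seq, n):
    # Split seq into n contiguous chunks whose lengths differ by at most one.
    k, r = divmod(len(seq), n)
    bounds = [i * k + min(i, r) for i in range(n + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(n)]

With the 12-peptide example, split_into_chunks(pep_list, 4) yields the four sub-lists shown in step 2.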
Thanks and regards.
Updated: After some reading, I was able to write code that works as I expected: (1) without parallelization for 10 peptide sequences, (2) processing 12 peptides with two processes, and (3) processing 12 peptides with four processes, which takes 4 seconds.
from multiprocessing import Process

def func1():
    structure_gen(pep_seq=["DAAAAEF", "DAAAREF", "DAAANEF"])

def func2():
    structure_gen(pep_seq=["DAAAQEF", "DAAAGEF", "DAAAHEF"])

def func3():
    structure_gen(pep_seq=["DAAADEF", "DAAALEF"])

def func4():
    structure_gen(pep_seq=["DAAAIEF", "DAAALEF"])

if __name__ == '__main__':
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p3 = Process(target=func3)
    p3.start()
    p4 = Process(target=func4)
    p4.start()
    p1.join()
    p2.join()
    p3.join()
    p4.join()

But while this code works easily with 10 peptides, it cannot cope with a pep_list containing 1 million peptides.
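The hard-coded Process objects are the scaling bottleneck: one function per chunk does not generalize to a million peptides. Here is a sketch of the same idea with a fixed-size pool, assuming pep_list holds the full input and structure_gen accepts a list of peptide strings as in the code above:

from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    # More chunks than cores lets the pool balance uneven workloads.
    n_chunks = cpu_count() * 4
    chunk_size = -(-len(pep_list) // n_chunks)  # ceiling division
    chunks = [pep_list[i:i + chunk_size]
              for i in range(0, len(pep_list), chunk_size)]
    with Pool() as pool:  # one worker process per core by default
        pool.map(structure_gen, chunks)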
Thanks
Posted on 2015-08-19 18:30:37
multiprocessing.Pool.map is exactly what you are looking for.
Try this:
from multiprocessing import Pool

import numpy as np
from pandas import concat

# I recommend using more partitions than processes;
# this way the work can be balanced.
# Of course this only makes sense if pep_list is bigger than
# the one you provided. If not, change this to 8 or so.
n = 50

# create indices for the partitions
ix = np.linspace(0, len(pep_list), n + 1, endpoint=True, dtype=int)
# create partitions using the indices
sub_lists = [pep_list[i1:i2] for i1, i2 in zip(ix[:-1], ix[1:])]

p = Pool()
try:
    # p.map returns a list of dataframes, which are then concatenated
    df = concat(p.map(my_function, sub_lists))
finally:
    p.close()

The pool will automatically contain as many processes as there are available cores, but you can override that number if you want; see the documentation.
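For example, both the worker count and the batching of tasks can be set explicitly; these are standard multiprocessing arguments, not something specific to this snippet:

p = Pool(processes=4)  # fixed number of worker processes
df = concat(p.map(my_function, sub_lists, chunksize=2))  # dispatch tasks to workers in batches of 2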
https://stackoverflow.com/questions/32090810