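I have a program that stores data to disk and then reprocesses it over a number of iterations, so it needs to store, search for, and load datasets.
Consider the following class, Signal, which defines a multiphase signal: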

    class Signal:
        def __init__(self, amp, fq, phases):
            self.amp = amp
            self.fq = fq
            self.phases = phases

    # List of signal objects:
    signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]

From the list of signals, the file name is computed:

    def file_name(signals):
        amplitudes = tuple([S.amp for S in signals])
        frequencies = tuple([S.fq for S in signals])
        phases = tuple([S.phases for S in signals])
        return "A{}_F{}_P{}.pkl".format(amplitudes, frequencies, phases)

For the example above, it returns:

    "A(0.2, 10, 20)_F(50, 200, 20)_P([20, 30], [20, 30], [20, 90]).pkl"

As you can see, I am pickling the files (using _pickle). Now suppose that hundreds of files have already been stored in a folder named folder. To check whether a particular combination of signals has already been computed, I use:

    import itertools

    def is_computed(files, signals):
        """
        Check if the signals are already computed
        """
        return any(file_name(elt) in files for elt in itertools.permutations(signals))

I use itertools because the order of the signals is irrelevant, i.e.:

    signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]
    # IS THE SAME AS:
    signals = [Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90]), Signal(0.2, 50, [20, 30])]

To pass the list of files to is_computed(), I use files = os.listdir(folder), which becomes quite inefficient as the number of files grows.

    # Folder of 26K files with the size from 1 kB to hundreds of MBs
    In: %timeit os.listdir(folder)
    Out: 3.75 s ± 842 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How can I build a similar but efficient system?
Thanks for your help!

Posted on 2018-06-06 11:03:20
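It would be best to design the system so that each collection of signals has one canonical filename, regardless of the order of the signals in the collection. This is achieved by sorting the signals in the collection:

    def canonical_filename(signals):
        "Return canonical filename for a collection of signals."
        return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))

Since each collection of signals now has only one filename, there is no need to list the directory or generate permutations: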
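
    import os

    def is_computed(signals):
        "Return True if the file for signals exists, False otherwise."
        # canonical_filename() returns a bare name, so this checks the current
        # working directory; join it with the folder path if the files live elsewhere.
        return os.path.isfile(canonical_filename(signals))

I recommend designing the filenames so that they do not contain shell metacharacters such as spaces, parentheses, and brackets. This is a convenience: it means we do not need to quote the filenames when manipulating them from the shell. For example: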

    def file_name(signals):
        "Return filename for a list of signals."
        amplitudes = ','.join(str(s.amp) for s in signals)
        frequencies = ','.join(str(s.fq) for s in signals)
        phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
        return f'A{amplitudes}_F{frequencies}_P{phases}.pkl'

https://codereview.stackexchange.com/questions/195938
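
To see the pieces working together, here is a minimal, self-contained sketch of the canonical-filename approach. It reuses the Signal class from the question and the functions from the answer; the folder parameter on is_computed() and the temporary directory with an empty placeholder file are assumptions made only for this illustration, not part of the original code. Note that the colon in the filename is fine on POSIX systems but is not allowed in Windows filenames, so a different separator may be needed there.

```python
import os
import tempfile
from itertools import permutations


class Signal:
    """Same shape as the Signal class in the question."""

    def __init__(self, amp, fq, phases):
        self.amp = amp
        self.fq = fq
        self.phases = phases


def file_name(signals):
    """Return a shell-friendly filename for signals, in the order given."""
    amplitudes = ','.join(str(s.amp) for s in signals)
    frequencies = ','.join(str(s.fq) for s in signals)
    phases = ':'.join(','.join(map(str, s.phases)) for s in signals)
    return f'A{amplitudes}_F{frequencies}_P{phases}.pkl'


def canonical_filename(signals):
    """Return a filename that is identical for every ordering of signals."""
    return file_name(sorted(signals, key=lambda s: (s.amp, s.fq, s.phases)))


def is_computed(folder, signals):
    """Return True if the canonical file for signals exists in folder.

    The folder parameter is an assumption added for this sketch.
    """
    return os.path.isfile(os.path.join(folder, canonical_filename(signals)))


signals = [Signal(0.2, 50, [20, 30]), Signal(10, 200, [20, 30]), Signal(20, 20, [20, 90])]

# Every ordering of the same signals collapses to a single filename.
names = {canonical_filename(p) for p in permutations(signals)}

with tempfile.TemporaryDirectory() as folder:
    before = is_computed(folder, signals)
    # Create an empty placeholder where the pickle would normally be written.
    open(os.path.join(folder, canonical_filename(signals)), 'w').close()
    # The lookup succeeds even when the signals are given in reverse order.
    after = is_computed(folder, signals[::-1])
```

The lookup is a single os.path.isfile() call per query, so its cost no longer depends on listing all 26K files, and sorting replaces the n! permutations that the original is_computed() had to try.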