I have a list of sentences containing 500,000 sentences, and a list of concepts containing 13,000,000 concepts. For each sentence I want to extract the concepts that appear in it, in the order they occur in the sentence, and write them to the output.
For example, my Python program looks like this:
import re
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
'data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process',
'interdisciplinary subfield', 'information', 'knowledge discovery',
'methods', 'machine learning', 'patterns', 'process']
output = []
counting = 0
re_concepts = [re.escape(t) for t in concepts]
find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall
for sentence in sentences:
    output.append(find_all_concepts(sentence))

print(output)

The output is:

[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process']]
However, the order of the output does not matter to me; that is, the lists inside the output can be shuffled. For example, my output could also look like this:
[['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]
[['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

However, because of the lengths of my sentences and concepts lists, this program is still quite slow.
Is it possible to use multithreading in Python to further improve performance (in terms of time)?
Posted on 2019-01-08 19:57:11
This answer addresses improving performance without using concurrency.

The way you structured your search, you are looking for 13 million unique things in each sentence. You said it takes 3-5 minutes per sentence, and that the concepts range from one to ten words in length.

I think you can improve the search time by making concepts a set (either when it is first constructed, or by converting the list), then splitting each sentence into strings of one to ten (consecutive) words and testing for membership in the set.
An example of a sentence split into 4-word strings:
'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems'
# becomes
[('data', 'mining', 'is', 'the'),
('mining', 'is', 'the', 'process'),
('is', 'the', 'process', 'of'),
('the', 'process', 'of', 'discovering'),
('process', 'of', 'discovering', 'patterns'),
('of', 'discovering', 'patterns', 'in'),
('discovering', 'patterns', 'in', 'large'),
('patterns', 'in', 'large', 'data'),
('in', 'large', 'data', 'sets'),
('large', 'data', 'sets', 'involving'),
('data', 'sets', 'involving', 'methods'),
('sets', 'involving', 'methods', 'at'),
('involving', 'methods', 'at', 'the'),
('methods', 'at', 'the', 'intersection'),
('at', 'the', 'intersection', 'of'),
('the', 'intersection', 'of', 'machine'),
('intersection', 'of', 'machine', 'learning'),
('of', 'machine', 'learning', 'statistics'),
('machine', 'learning', 'statistics', 'and'),
('learning', 'statistics', 'and', 'database'),
 ('statistics', 'and', 'database', 'systems')]

The process:
concepts = set(concepts)
sentence = sentence.split()

# one word
for meme in sentence:
    if meme in concepts:
        ...  # keep it

# two words
for meme in zip(sentence, sentence[1:]):
    if ' '.join(meme) in concepts:
        ...  # keep it

# three words
for meme in zip(sentence, sentence[1:], sentence[2:]):
    if ' '.join(meme) in concepts:
        ...  # keep it

Adapting the itertools pairwise recipe, you can automatically produce the n-word strings from a sentence:
from itertools import tee

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:], 1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

Testing each sentence looks like this:
sentence = sentence.strip().split()
for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    for meme in nwise(sentence, n):
        if ' '.join(meme) in concepts:
            ...  # keep meme

I made a set of 13e6 random strings, each twenty characters long, to approximate concepts:
import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20))
           for _ in range(13000000))

Testing membership of forty-character strings in data takes about 60 nanoseconds. A 100-word sentence has 955 one-to-ten-word strings, so searching that sentence takes about 60 microseconds.
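A rough sketch of how such a membership timing can be reproduced. The set here is scaled down to 10,000 entries so the snippet runs quickly (set lookup is O(1) either way, so the per-test time is representative); the names `data` and `probe` are illustrative, not from the original code.

```python
import random
import string
import timeit

# Scaled-down stand-in for the 13e6-entry concept set.
data = set(''.join(random.choice(string.ascii_lowercase) for _ in range(20))
           for _ in range(10_000))

# A forty-character probe string; it cannot be in data (all entries are 20 chars).
probe = 'x' * 40

# Average time per membership test, in seconds.
per_test = timeit.timeit(lambda: probe in data, number=100_000) / 100_000
print(f'{per_test * 1e9:.0f} ns per membership test')
```

The absolute number depends on the machine, but it should land in the tens-of-nanoseconds range described above.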
The first sentence in your example, 'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', contains 195 possible concepts (one-to-ten-word strings). The timing of the two functions below is about the same: roughly 140 microseconds for f and 150 microseconds for g.
def f(sentence, data=data, nwise=nwise):
    '''iterate over memes in sentence and see if they are in data'''
    sentence = sentence.strip().split()
    found = []
    for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
        for meme in nwise(sentence, n):
            meme = ' '.join(meme)
            if meme in data:
                found.append(meme)
    return found

def g(sentence, data=data, nwise=nwise):
    '''make a set of the memes in sentence then find its intersection with data'''
    sentence = sentence.strip().split()
    test_strings = set(' '.join(meme) for n in range(1, 11) for meme in nwise(sentence, n))
    found = test_strings.intersection(data)
    return found

These are only approximations, since I didn't use your actual data, but it should speed things up a lot.
After testing against the example data, I found that g will not work if a concept appears twice in one sentence.
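A minimal illustration of that failure mode: a set collapses duplicates, so the intersection reports each concept at most once, while a list-based scan keeps every occurrence.

```python
# 'information' appears twice in the sentence.
concepts = {'information'}
sentence = 'extract information and transform the information'

words = sentence.split()
found_list = [w for w in words if w in concepts]  # keeps both occurrences
found_set = set(words) & concepts                 # collapses them to one

print(found_list)  # ['information', 'information']
print(found_set)   # {'information'}
```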
Here all the concepts are listed in the order they appear in each sentence. This new version of f will take longer, but the added time should be relatively small. If possible, would you leave a comment letting me know how much longer it is than the original? (I'm curious.)
from itertools import tee

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']
concepts = set(concepts)

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:], 1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

def f(sentence, concepts=concepts, nwise=nwise):
    '''iterate over memes in sentence and see if they are in concepts'''
    indices = set()
    #print(sentence)
    words = sentence.strip().split()
    for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
        for meme in nwise(words, n):
            meme = ' '.join(meme)
            if meme in concepts:
                start = sentence.find(meme)
                end = len(meme) + start
                while (start, end) in indices:
                    #print(f'{meme} already found at character:{start} - looking for another one...')
                    start = sentence.find(meme, end)
                    end = len(meme) + start
                indices.add((start, end))
    return [sentence[start:end] for (start, end) in sorted(indices)]

###########
results = []
for sentence in sentences:
    results.append(f(sentence))
    #print(f'{sentence}\n\t{results[-1]})')

In [20]: results
Out[20]:
[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'knowledge discovery', 'databases process', 'process']]

Posted on 2019-01-07 01:08:21
Whether multithreading yields an actual performance improvement depends not only on the implementation in Python and the amount of data, but also on the hardware running the program. In some cases, where the hardware offers no advantage, multithreading can end up slowing things down because of the added overhead.

However, assuming you are running a modern standard PC or better, you may see some improvement with multithreading. The problem then is to set up a number of workers, hand the work to them, and collect the results.

Staying close to your example structure, implementation, and naming:
import re
import queue
import threading

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]
find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
        l_out.append(find_all_concepts(sentence))
        q_in.task_done()

# Queue with default maxsize of 0, infinite queue size
sentences_q = queue.Queue()
output = []

# any reasonable number of workers
num_threads = 2
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts, args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()

# put all the input on the queue
for s in sentences:
    sentences_q.put(s)

# wait for the entire queue to be processed
sentences_q.join()
print(output)

User @wwii asked whether multiple threads actually affect performance for a CPU-bound problem. Instead of multiple threads appending to the same output variable, you can also use multiple processes with a shared output queue, like this:
import re
import queue
import multiprocessing

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]
find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

def do_find_all_concepts(q_in, q_out):
    try:
        while True:
            sentence = q_in.get(False)
            q_out.put(find_all_concepts(sentence))
    except queue.Empty:
        pass

if __name__ == '__main__':
    # default maxsize of 0, infinite queue size
    sentences_q = multiprocessing.Queue()
    output_q = multiprocessing.Queue()

    # any reasonable number of workers
    num_processes = 2
    pool = multiprocessing.Pool(num_processes, do_find_all_concepts, (sentences_q, output_q))

    # put all the input on the queue
    for s in sentences:
        sentences_q.put(s)

    # wait for the entire queue to be processed
    pool.close()
    pool.join()
    while not output_q.empty():
        print(output_q.get())

More overhead, but this also uses CPU resources on other cores.
Posted on 2019-01-07 03:52:11
Here are two solutions using concurrent.futures.ProcessPoolExecutor, which distribute tasks to different processes. Your task appears to be CPU-bound rather than I/O-bound, so threads probably won't help.
import re
import concurrent.futures

# using the lists in your example
re_concepts = [re.escape(t) for t in concepts]
all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL)

def f(sequence, regex=all_concepts):
    result = regex.findall(sequence)
    return result

if __name__ == '__main__':
    out1 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(f, s) for s in sentences]
        for future in concurrent.futures.as_completed(futures):
            try:
                result = future.result()
            except Exception as e:
                print(e)
            else:
                #print(result)
                out1.append(result)

    out2 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for result in executor.map(f, sentences):
            #print(result)
            out2.append(result)

Executor.map() has a chunksize parameter: the docs say it can be beneficial to send chunks of the iterable larger than one item. The function would need to be refactored to account for that. I tested this with a function that just returns whatever it is sent, but regardless of the chunksize I specified, the test function only ever received single items. Go figure.
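For reference, a small sketch of how chunksize is passed (the function name `double` is illustrative). As far as I understand, chunksize only controls how many items are pickled and shipped to a worker per batch; the mapped function still receives one item at a time, which would explain the behavior described above.

```python
import concurrent.futures

def double(x):
    # Still called with a single item, even when chunksize > 1;
    # chunking only batches the items for transport to the workers.
    return x * 2

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        results = list(executor.map(double, range(20), chunksize=5))
    print(results[:5])  # [0, 2, 4, 6, 8]
```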
def h(sequence):
    return sequence

One drawback of multiprocessing is that the data must be serialized/pickled to be sent to the processes, which takes time and, for a compiled regular expression this large, may be significant — it could wipe out any gains from using multiple processes.
I made a set of 13e6 random strings, each twenty characters long, to approximate the compiled regular expression:
data = set(''.join(random.choice(string.printable) for _ in range(20))
           for _ in range(13000000))

Pickling it to an io.BytesIO stream took about 7.5 seconds, and unpickling from the stream took 9 seconds. If you use a multiprocessing solution, it may be beneficial to pickle the concepts object (in whatever form) to the hard drive once, then have each process unpickle it from disk, rather than pickling/unpickling on each side of the IPC every time a new process is created — definitely worth testing; YMMV. The pickled set on my hard drive is 380 MB.
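A sketch of that pickle-once-to-disk idea. The file path and the tiny placeholder set are assumptions; the real object would be the full concepts set (or the compiled regex).

```python
import os
import pickle
import tempfile

# Placeholder stand-in for the real 13e6-entry concepts set.
concepts = {'data mining', 'machine learning', 'database systems'}

path = os.path.join(tempfile.gettempdir(), 'concepts.pkl')

# Parent process: write the set to disk once, up front.
with open(path, 'wb') as fh:
    pickle.dump(concepts, fh, protocol=pickle.HIGHEST_PROTOCOL)

# Each worker process would call this at startup instead of
# receiving the set over IPC every time it is created.
def load_concepts(path=path):
    with open(path, 'rb') as fh:
        return pickle.load(fh)

loaded = load_concepts()
print(loaded == concepts)  # True
```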
When I tried experimenting with concurrent.futures.ProcessPoolExecutor, I kept blowing up my computer, because each process needed its own copy of the data and my computer just doesn't have enough RAM.
I will post another answer about methods for finding the concepts in the sentences.
https://stackoverflow.com/questions/54067234