I have a large list of sentences (about 5 million) and a short list of keywords (about 100 words).
I need to know, for each keyword, which sentences contain it. Note that a sentence can contain any number of keywords (including none).
Doing this with conventional pythonic constructs takes far too long; I need a substantial performance improvement. Any suggestions?
My current code is:

# df is a dataframe with all of the sentences
context = list()
for w in keywords:
    xdf = df['sentence'].str.contains(w)
    xdf = df[xdf]
    context.append(xdf.values.tolist())

This is better than a double loop, but it is still too slow.
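Part of the problem is that each str.contains call scans the full 5-million-row column again, so 100 keywords mean 100 complete passes over the data. A common intermediate speedup (a sketch, not taken from the answers below; it assumes the df and keywords from the code above) is to combine the keywords into one alternation regex so pandas traverses the column only once:

import re

# One pattern that matches any of the keywords; re.escape guards
# against keywords containing regex metacharacters.
pattern = '|'.join(re.escape(w) for w in keywords)

# Single pass over the column: each row gets the list of keywords found.
# Note: like the loop above, this matches substrings, not whole words.
df['match'] = df['sentence'].str.findall(pattern)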
Posted on 2022-08-07 23:16:46
Approach

Use the Aho-Corasick algorithm, which matches all of the keywords simultaneously in a single pass over each sentence (roughly O(sentence length + number of matches), regardless of how many keywords there are).
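The code below relies on the pyahocorasick package (pip install pyahocorasick); essential_generators (pip install essential-generators) is used only to generate random test sentences.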
Code
import string
import pandas as pd
import ahocorasick as ahc  # Word search using Aho-Corasick
from essential_generators import DocumentGenerator  # To generate random sentences

# Helper Functions
def make_aho_automaton(keywords):
    '''
    Creates the Aho-Corasick automaton
    '''
    A = ahc.Automaton()              # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key))  # add keys and categories
    A.make_automaton()               # generate automaton
    return A

def find_keywords(line, A):
    '''
    Finds the keywords a line contains using the automaton
    '''
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
        found_keywords.append(keyw)
    return found_keywords

def pre_process_sent(s, make_trans=str.maketrans('', '', string.punctuation)):
    '''
    Make lower case,
    remove punctuation, and
    surround each sentence with whitespace for Aho-Corasick
    '''
    return f' {s.translate(make_trans).lower()} '

Test
# 1. Generate Test Dataframe with random sentences
# Document generator
gen = DocumentGenerator()
# Place sentences in Dataframe without punctuation and surrounded by spaces
df = pd.DataFrame({'sentence':[pre_process_sent(gen.sentence()) for _ in range(100000)]})
# 2. Use the 100 most common English words as keywords (source: https://gist.github.com/deekayen/4148741)
with open('1-1000.txt', 'r') as f:
    # 1,000 most popular English words
    keywords = [line.rstrip() for line in f]
# Use top 100 keywords
keywords = keywords[:100]
# Generate tuple of keyword, categories for Aho-Corasick
# (surround with white space to make boundary sensitive)
keywords_cat = [(f' {w} ', 1) for w in keywords]
# Generate Automaton
A = make_aho_automaton(keywords_cat)
# Check Dataframe Column for match
df['match'] = df.sentence.apply(lambda x: find_keywords(x, A))
print(df)

Output
df
sentence match
0 rare instances foreign affairs the organisati... [ the , of , the ]
1 unangam idiom the emperor of these based on t... [ the , of , these , on , the , from , to ]
2 to dim applied psychology the iaap is conside... [ to , the , is , to , be , a , that ]
3 the females tests take a biopsy or prescribe ... [ the , a , or ]
4 evaporate there linear meters [ there ]
... ... ...
99995 john 1973 discover only []
99996 does it laughter contrary to a series of para... [ it , to , a , of , or ]
99997 contemporary virginia into chiefdoms []
99998 repercussions of site of the acm [ of , of , the ]
99999 history hlabor island followed by a [ by , a ]
100000 rows × 2 columns

Performance

Summary
Aho-Corasick version: 292 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original str.contains loop: 7.28 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
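The df['match'] column is keyed by sentence; to answer the original question (for each keyword, which sentences contain it) that mapping can be inverted. A minimal sketch building on the df produced above:

from collections import defaultdict

# Map each keyword to the indices of the sentences that contain it.
keyword_to_sentences = defaultdict(list)
for idx, matches in df['match'].items():
    for kw in set(matches):  # set() so each sentence index appears once per keyword
        keyword_to_sentences[kw.strip()].append(idx)  # strip the padding spaces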
Posted on 2022-08-07 19:23:07
If doing this in pure Python is what is slowing you down, it may pay (on Linux/Mac) to use grep or ag (The Silver Searcher) instead.
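As a sketch of that approach, assuming one sentence per line in sentences.txt and one keyword per line in keywords.txt (both hypothetical file names):

grep -nwF -f keywords.txt sentences.txt

Here -F treats each keyword as a fixed string rather than a regex, -w matches whole words only, -f reads the patterns from a file, and -n prints the line number of each matching sentence. Note that this reports which sentences match, but not which keyword matched; for per-keyword lists you would run grep once per keyword.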
https://stackoverflow.com/questions/73270348