I have a large list of sentences (about 5 million) and a short list of keywords (about 100 words).
I need to know, for each keyword, which sentences contain it. Note that a sentence can contain any number of keywords (including none).
Doing this with conventional pythonic constructs takes far too long; I need a substantial performance improvement. Any suggestions?
My current code is:

# df is a dataframe with all of the sentences
context = list()
for w in keywords:
    xdf = df['sentence'].str.contains(w)
    xdf = df[xdf]
    context.append(xdf.values.tolist())

This is better than a double loop, but it is still too slow.
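Part of the problem is that each str.contains call scans the full 5-million-row column again, so 100 keywords mean 100 complete passes over the data. A common intermediate speedup (a sketch, not taken from the answers below; it assumes the df and keywords from the code above) is to combine the keywords into one alternation regex so pandas traverses the column only once:

import re

# One pattern that matches any of the keywords; re.escape guards
# against keywords containing regex metacharacters.
pattern = '|'.join(re.escape(w) for w in keywords)

# Single pass over the column: each row gets the list of keywords found.
# Note: like the loop above, this matches substrings, not whole words.
df['match'] = df['sentence'].str.findall(pattern)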
Posted on 2022-08-07 23:16:46
Approach

Use the Aho-Corasick algorithm, which matches all of the keywords simultaneously in a single pass over each sentence (roughly O(sentence length + number of matches), regardless of how many keywords there are).
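The code below relies on the pyahocorasick package (pip install pyahocorasick); essential_generators (pip install essential-generators) is used only to generate random test sentences.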
Code
import string
import pandas as pd
import ahocorasick as ahc  # Word search using Aho-Corasick
from essential_generators import DocumentGenerator  # To generate random sentences

# Helper Functions
def make_aho_automaton(keywords):
    '''
    Creates the Aho-Corasick automaton
    '''
    A = ahc.Automaton()              # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key))  # add keys and categories
    A.make_automaton()               # generate automaton
    return A

def find_keywords(line, A):
    '''
    Finds the keywords a line contains using the automaton
    '''
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
        found_keywords.append(keyw)
    return found_keywords

def pre_process_sent(s, make_trans=str.maketrans('', '', string.punctuation)):
    '''
    Make lower case,
    remove punctuation, and
    surround each sentence with whitespace for Aho-Corasick
    '''
    return f' {s.translate(make_trans).lower()} '

Test
# 1. Generate Test Dataframe with random sentences
# Document generator
gen = DocumentGenerator()
# Place sentences in Dataframe without punctuation and surrounded by spaces
df = pd.DataFrame({'sentence':[pre_process_sent(gen.sentence()) for _ in range(100000)]})
# 2. Use the 100 most common English words as keywords (source: https://gist.github.com/deekayen/4148741)
with open('1-1000.txt', 'r') as f:
    # 1,000 most popular English words
    keywords = [line.rstrip() for line in f]
# Use top 100 keywords
keywords = keywords[:100]
# Generate tuple of keyword, categories for Aho-Corasick
# (surround with white space to make boundary sensitive)
keywords_cat = [(f' {w} ', 1) for w in keywords]
# Generate Automaton
A = make_aho_automaton(keywords_cat)
# Check Dataframe Column for match
df['match'] = df.sentence.apply(lambda x: find_keywords(x, A))
print(df)

Output
df
sentence match
0 rare instances foreign affairs the organisati... [ the , of , the ]
1 unangam idiom the emperor of these based on t... [ the , of , these , on , the , from , to ]
2 to dim applied psychology the iaap is conside... [ to , the , is , to , be , a , that ]
3 the females tests take a biopsy or prescribe ... [ the , a , or ]
4 evaporate there linear meters [ there ]
... ... ...
99995 john 1973 discover only []
99996 does it laughter contrary to a series of para... [ it , to , a , of , or ]
99997 contemporary virginia into chiefdoms []
99998 repercussions of site of the acm [ of , of , the ]
99999 history hlabor island followed by a [ by , a ]
100000 rows × 2 columns

Performance

Summary
Aho-Corasick version: 292 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original str.contains loop: 7.28 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
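The df['match'] column is keyed by sentence; to answer the original question (for each keyword, which sentences contain it) that mapping can be inverted. A minimal sketch building on the df produced above:

from collections import defaultdict

# Map each keyword to the indices of the sentences that contain it.
keyword_to_sentences = defaultdict(list)
for idx, matches in df['match'].items():
    for kw in set(matches):  # set() so each sentence index appears once per keyword
        keyword_to_sentences[kw.strip()].append(idx)  # strip the padding spaces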
Posted on 2022-08-07 19:23:07
If doing this in pure Python is what is slowing you down, it may pay (on Linux/Mac) to use grep or ag (The Silver Searcher) instead.
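As a sketch of that approach, assuming one sentence per line in sentences.txt and one keyword per line in keywords.txt (both hypothetical file names):

grep -nwF -f keywords.txt sentences.txt

Here -F treats each keyword as a fixed string rather than a regex, -w matches whole words only, -f reads the patterns from a file, and -n prints the line number of each matching sentence. Note that this reports which sentences match, but not which keyword matched; for per-keyword lists you would run grep once per keyword.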
https://stackoverflow.com/questions/73270348