
Speeding up string comparisons

Stack Overflow user
Asked on 2022-08-07 19:14:56
2 answers · 40 views · 0 followers · Score 0

I have a large list of sentences (about 5 million) and a small list of keywords (about 100 words).

For each keyword, I need to know which sentences contain it. Note that a sentence can contain any number of keywords (including none).

Doing this with conventional Python idioms takes far too long; I need a substantial speedup. Any suggestions?

My current code is:

# df is a dataframe with all of the sentences
context = list()
for w in keywords:
    xdf = df['sentence'].str.contains(w)
    xdf = df[xdf]
    context.append(xdf.values.tolist())

This is better than a nested loop, but it is still too slow.
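Before reaching for a new library, one intermediate speedup (my suggestion, not part of the question) is to combine all keywords into a single alternation pattern, so each sentence is scanned once rather than once per keyword. A minimal sketch with an illustrative keyword list standing in for the real data:

```python
import re

# Illustrative data; the real code would use the keywords list and df['sentence']
keywords = ["cat", "dog", "bird"]
sentences = ["the cat sat", "a dog chased a bird", "nothing relevant here"]

# One compiled alternation instead of one str.contains() pass per keyword;
# longer keywords come first so they are not shadowed by shorter alternatives
pattern = re.compile("|".join(map(re.escape, sorted(keywords, key=len, reverse=True))))

# Invert the matches: keyword -> sentences containing it
context = {w: [] for w in keywords}
for s in sentences:
    for w in set(pattern.findall(s)):
        context[w].append(s)

print(context["cat"])   # -> ['the cat sat']
```

This still scans every sentence, but replaces 100 passes with one, which is often a large constant-factor win before moving to a dedicated multi-pattern algorithm.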


2 Answers

Stack Overflow user

Accepted answer

Posted on 2022-08-07 23:16:46

Approach

Use the Aho-Corasick algorithm, which matches all keywords simultaneously in a single pass over each sentence.
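For readers without the `ahocorasick` package, the algorithm itself can be sketched in pure Python: a trie over the keywords plus failure links computed by BFS, so the text is scanned once regardless of how many keywords there are. This is an illustrative, unoptimized sketch, not the library's implementation:

```python
from collections import deque

def build_automaton(keywords):
    """Trie (goto), failure links (fail) and matched-keyword lists (out)."""
    goto, fail, out = [{}], [0], [[]]
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(kw)
    queue = deque(goto[0].values())       # breadth-first over the trie
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            if fail[t] == t:              # depth-1 nodes fall back to the root
                fail[t] = 0
            out[t] += out[fail[t]]        # inherit matches ending at the fallback
    return goto, fail, out

def search(text, goto, fail, out):
    """Single pass over text; returns every keyword occurrence."""
    state, found = 0, []
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.extend(out[state])
    return found

tables = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", *tables))          # -> ['she', 'he', 'hers']
```

The key property is that overlapping matches (here "she", "he", and "hers" inside "ushers") are all reported in one scan, which is why runtime grows with text length rather than with text length times keyword count.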

import string
import pandas as pd

import ahocorasick as ahc                          # Keyword search using Aho-Corasick
from essential_generators import DocumentGenerator # To generate random sentences

# Helper Functions
def make_aho_automaton(keywords):
    '''
        Creates the Aho-Corasick automaton
    '''
    A = ahc.Automaton()  # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key)) # add keys and categories
    A.make_automaton() # generate automaton
    return A

def find_keywords(line, A):
    '''
        Finds all keywords in a line using the automaton
    '''
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
        found_keywords.append(keyw)
    return found_keywords

def pre_process_sent(s, make_trans = str.maketrans('', '', string.punctuation)):
    '''
        make lower case,
        remove punctuation,
        surround each sentence with whitespace for Aho-Corasick boundary matching
    '''
    return f' {s.translate(make_trans).lower()} '
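A quick illustration (with a made-up sentence) of why `pre_process_sent` pads each sentence with spaces: a space-wrapped key like `' the '` can then only match whole words, never a prefix of a longer word such as "theater":

```python
import string

def pre_process_sent(s, make_trans=str.maketrans('', '', string.punctuation)):
    # same helper as above: lower-case, strip punctuation, pad with spaces
    return f' {s.translate(make_trans).lower()} '

padded = pre_process_sent("The theater is open.")
print(repr(padded))             # ' the theater is open '
print(padded.count('the'))      # 2 -- a bare substring also hits "theater"
print(padded.count(' the '))    # 1 -- the space-wrapped key matches the word only
```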

Test

# 1. Generate Test DataFrame with random sentences
#    Document generator
gen = DocumentGenerator()
#    Place sentences in DataFrame without punctuation and surrounded by spaces
df = pd.DataFrame({'sentence':[pre_process_sent(gen.sentence()) for _ in range(100000)]})

# 2. Use the 100 most common English words as keywords (source: https://gist.github.com/deekayen/4148741)
with open('1-1000.txt', 'r') as f:
    # Most popular 1K English words
    keywords = [line.rstrip() for line in f]

    # Use top 100 keywords
    keywords = keywords[:100]

    # Generate tuples of (keyword, category) for Aho-Corasick
    # (surround with whitespace to make matching boundary sensitive)
    keywords_cat = [(f' {w} ', 1) for w in keywords]


# Generate Automaton
A = make_aho_automaton(keywords_cat)

# Check Dataframe Column for match
df['match'] = df.sentence.apply(lambda x: find_keywords(x, A))
print(df)

Output

  • sentence: one sentence per row
  • match: the list of keywords found in each sentence

df

sentence    match
0   rare instances foreign affairs the organisati...    [ the , of , the ]
1   unangam idiom the emperor of these based on t...    [ the , of , these , on , the , from , to ]
2   to dim applied psychology the iaap is conside...    [ to , the , is , to , be , a , that ]
3   the females tests take a biopsy or prescribe ...    [ the , a , or ]
4   evaporate there linear meters   [ there ]
... ... ...
99995   john 1973 discover only []
99996   does it laughter contrary to a series of para...    [ it , to , a , of , or ]
99997   contemporary virginia into chiefdoms    []
99998   repercussions of site of the acm    [ of , of , the ]
99999   history hlabor island followed by a [ by , a ]
100000 rows × 2 columns
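The question asked for the sentences per keyword, while `df['match']` holds keywords per sentence; the mapping can be inverted with a small dictionary pass. A sketch over hypothetical rows standing in for the DataFrame contents:

```python
from collections import defaultdict

# Hypothetical (sentence, match-list) rows standing in for df[['sentence', 'match']]
rows = [
    (' the cat sat ',     [' the ', ' cat ']),
    (' a cat and a dog ', [' a ', ' cat ', ' a ', ' dog ']),
]

sentences_by_keyword = defaultdict(list)
for sentence, matches in rows:
    for kw in set(matches):                 # dedupe repeats within one sentence
        sentences_by_keyword[kw.strip()].append(sentence)

print(sentences_by_keyword['cat'])          # -> [' the cat sat ', ' a cat and a dog ']
```

On the real DataFrame the same loop would iterate over `zip(df['sentence'], df['match'])`.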

Performance

Summary

  • ~25x speedup on 100,000 sentences with 100 keywords

  1. Using Aho-Corasick:

%timeit df.sentence.apply(lambda x: find_keywords(x, A))

Result: 292 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  2. Using the posted code:

%%timeit
context = list()
for w in keywords:
    xdf = df['sentence'].str.contains(w)
    xdf = df[xdf]
    context.append(xdf.values.tolist())

Result: 7.28 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Score: 1

Stack Overflow user

Posted on 2022-08-07 19:23:07

If doing this in pure Python is what is slowing you down, you could (on Linux/Mac) use grep or ag (the Silver Searcher) instead.

Score: -2
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's IT-domain translation engine.
Original link:

https://stackoverflow.com/questions/73270348
