Q: NLP in Python 3 - counting occurrences of specific terms in a large string

Stack Overflow user

Asked 2019-08-23 22:08:07

Answers: 2 | Views: 62 | Followers: 0 | Votes: 0

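I have many files, each containing a few pages of text. While looping over the files, I want to extract the counts of the terms I am specifically interested in.

For example, I have text like the following (a simplified example; the real texts are 2-5 pages long):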

to_process = 'soccer football soccer asdlkj assdasda asdsasad  football soccer'
print(to_process)

I want to count how many times "soccer" and "football" appear in the text:

dict_of_counts = {'soccer':0,'football':0}
print(dict_of_counts)

The expected output is:

expected_output = {'soccer':3,'football':2}

Can anyone give me some pointers on how to solve this in the most efficient way? I have thousands of documents and hundreds of terms to look for.

python-3.x

pandas

numpy

nlp

2 Answers

Stack Overflow user

Accepted answer

Posted 2019-08-26 13:03:23

To make the code handle capitalization and punctuation, I suggest using the flashtext package:

to_process = 'Soccer, football soccer, asdlkj assdasda asdsasad  football; soccer.'
from flashtext import KeywordProcessor
kp = KeywordProcessor()
words_to_look_for = ['soccer', 'football']
for a in words_to_look_for:
    kp.add_keyword(a)
foundList = kp.extract_keywords(to_process)
dict_of_counts = {}
for a in foundList:
    dict_of_counts[a] = dict_of_counts.get(a, 0) +1
print(dict_of_counts)
#{'soccer': 3, 'football': 2}
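
If installing flashtext is not an option, roughly the same case-insensitive, punctuation-tolerant counting can be sketched with the standard library alone. This is not how flashtext works internally; it swaps in a single compiled regular expression with word boundaries. The helper names `build_pattern` and `count_with_pattern` are hypothetical, and the sketch assumes every term starts and ends with a word character and is listed in lowercase.

```python
import re
from collections import Counter


def build_pattern(terms):
    """Compile one case-insensitive pattern that matches any of the terms."""
    # Longest terms first, so a longer term is tried before a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # \b keeps 'soccer' from matching inside a longer word such as 'soccerball'.
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def count_with_pattern(pattern, text, terms):
    """Return {term: count} for one text, including terms that never occur."""
    found = Counter(m.group(0).lower() for m in pattern.finditer(text))
    return {t: found[t.lower()] for t in terms}


words_to_look_for = ['soccer', 'football']
pattern = build_pattern(words_to_look_for)  # build once, reuse for every file

to_process = 'Soccer, football soccer, asdlkj assdasda asdsasad  football; soccer.'
print(count_with_pattern(pattern, to_process, words_to_look_for))
# {'soccer': 3, 'football': 2}
```

Compiling the pattern once and reusing it for every file avoids rebuilding it inside the loop. Multi-word terms only match when the spacing in the text is exactly the spacing in the term.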

Votes: 1

Stack Overflow user

Posted 2019-08-23 22:26:14

You can use a dict comprehension (together with collections.Counter and re.sub):

import re
from collections import Counter

to_process = '>>SocceR... !football! soccer *asdlkj assdasda? asdsasad ; FOOtball;  soCCer'

words = ['soccer', 'football']

all_counts = Counter(re.sub(r'\W+', ' ', to_process).lower().split())

dict_of_counts = {w : all_counts[w] for w in words}

print(dict_of_counts)

Output:

{'soccer': 3, 'football': 2}
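
Because the question mentions thousands of documents and hundreds of terms, the same Counter idea can be wrapped in a function and applied per document, then summed. The following is only a minimal sketch: `count_terms`, `documents`, and the document names are hypothetical, the texts are assumed to be in memory already as strings, and the terms are assumed to be single lowercase words.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")  # runs of word characters; punctuation is skipped


def count_terms(text, terms):
    """Count single-word terms in one document, ignoring case and punctuation."""
    all_counts = Counter(TOKEN_RE.findall(text.lower()))
    return {t: all_counts[t] for t in terms}


# Hypothetical in-memory documents; in practice each would be read from a file.
documents = {
    "doc1": ">>SocceR... !football! soccer",
    "doc2": "FOOtball;  soCCer asdlkj",
}
terms = ["soccer", "football"]

# One {term: count} dict per document.
per_document = {name: count_terms(text, terms) for name, text in documents.items()}

# Sum the per-document counts into corpus-wide totals.
totals = Counter()
for counts in per_document.values():
    totals.update(counts)

print(per_document)
# {'doc1': {'soccer': 2, 'football': 1}, 'doc2': {'soccer': 1, 'football': 1}}
print(dict(totals))
# {'soccer': 3, 'football': 2}
```

Each document is tokenized once and every term lookup is then a dictionary access, so the cost per document does not grow much with the number of terms. Terms made of several words would need a different approach, since the text is split into single tokens.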

Votes: 2

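Original page content provided by Stack Overflow. Translation support for the Chinese version was provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link: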

https://stackoverflow.com/questions/57633505
