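I have a lot of documents, each containing a few pages of text. While looping over the documents, I want to extract counts for the terms I am specifically interested in.
For example, I have something like the following (a simplified example; the real texts are 2-5 pages long):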
to_process = 'soccer football soccer asdlkj assdasda asdsasad football soccer'
print(to_process)
I want to count how many times "soccer" and "football" appear in the text:
dict_of_counts = {'soccer':0,'football':0}
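print(dict_of_counts)
The expected output is:
expected_output = {'soccer':3,'football':2}
Can anyone give me some pointers on how to solve this in the most efficient way? (I have thousands of documents and hundreds of terms to look for.)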
Posted on 2019-08-26 13:03:23
To make the code handle capitalization and punctuation, I suggest using the flashtext package:
to_process = 'Soccer, football soccer, asdlkj assdasda asdsasad football; soccer.'
from flashtext import KeywordProcessor
kp = KeywordProcessor()
words_to_look_for = ['soccer', 'football']
for a in words_to_look_for:
    kp.add_keyword(a)
foundList = kp.extract_keywords(to_process)
dict_of_counts = {}
for a in foundList:
    dict_of_counts[a] = dict_of_counts.get(a, 0) + 1
print(dict_of_counts)
# {'soccer': 3, 'football': 2}
Posted on 2019-08-23 22:26:14
You can use a dict comprehension (together with collections.Counter and re.sub):
import re
from collections import Counter
to_process = '>>SocceR... !football! soccer *asdlkj assdasda? asdsasad ; FOOtball; soCCer'
words = ['soccer', 'football']
all_counts = Counter(re.sub(r'\W+', ' ', to_process).lower().split())
dict_of_counts = {w : all_counts[w] for w in words}
print(dict_of_counts)
Output:
{'soccer': 3, 'football': 2}
https://stackoverflow.com/questions/57633505
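Since the question mentions thousands of documents, the Counter approach above could be wrapped in a function and applied once per document. The following is only a rough sketch: the helper name `count_terms` and the in-memory `documents` dict are made up for illustration (real code would read each file from disk instead), and it only handles single-word terms.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Hypothetical helper: count single-word terms, ignoring case and punctuation."""
    # Same normalisation as the answer above: replace non-word characters
    # with spaces, lower-case, then split on whitespace.
    tokens = re.sub(r'\W+', ' ', text).lower().split()
    all_counts = Counter(tokens)
    # Counter returns 0 for missing keys, so absent terms are reported as 0.
    return {term: all_counts[term] for term in terms}

# Illustrative stand-ins for the real documents (assumed, not from the question).
documents = {
    'doc_a': 'Soccer, football soccer, asdlkj football; soccer.',
    'doc_b': 'FOOtball!! nothing else here',
}
terms = ['soccer', 'football']

# One pass over each document, however many terms are looked up.
counts_per_document = {
    name: count_terms(text, terms) for name, text in documents.items()
}
print(counts_per_document)
```

Each document is tokenised once, so adding more terms only adds dictionary lookups. Terms made of several words would not be matched by this sketch; something like the flashtext approach from the other answer may be a better fit for those.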