Q: Python - count the frequency of terms from a list in strings, where the number of words per term varies

Stack Overflow user

Asked 2019-08-16 23:58:58

1 answer · 231 views · 0 followers · 0 votes

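I'm trying to write a program that runs through a list of mental health terms, looks in research abstracts, and counts how many times each word or phrase appears. I can get this to work with single words, but I'm struggling to do it with multi-word terms. I also tried NLTK n-grams, but since the number of words per term in the mental health list varies (i.e., not every term is a bigram or trigram), I couldn't get that to work either.

I want to emphasize that I know splitting on each word only lets me count single words; I'm just stuck on how to handle terms with varying numbers of words from my list when counting in the abstracts.

Thanks!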

from collections import Counter

abstracts = [
    'This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
    'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.'
]

for x2 in abstracts:


    mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder', 
    'ptsd', 'schizophrenia', 'mental health']

    c = Counter(s.lower().replace('.', '') for s in x2.split())
    for term in mh_terms:
        term = term.replace(',','')
        term = term.replace('.','')
        xx = (term, c.get(term, 0))

    mh_total_occur = sum(c.get(v, 0) for v in mh_terms)
    print(mh_total_occur)

In my example, both abstracts give a count of 1, but I want 2.

Tags: string, text, count, python-collections, python

1 Answer

Stack Overflow user

Accepted answer

Answered 2019-08-17 00:18:03

The problem is that you will never match "mental health", because you only count occurrences of single words split on the " " character.
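To make that concrete, here is a minimal sketch of what the question's Counter holds for the first abstract. The variable names are mine, chosen only for illustration:

```python
from collections import Counter

abstract = ('This is a mental health abstract about anxiety and bipolar '
            'disorder as well as other things.')

# Same tokenization as in the question: split on whitespace, drop periods.
tokens = Counter(s.lower().replace('.', '') for s in abstract.split())

# Every key is a single word, so a two-word term can never be a key.
single_word_hit = tokens.get('anxiety', 0)
multi_word_hit = tokens.get('mental health', 0)
print(single_word_hit, multi_word_hit)
```

The single-word lookup succeeds, while the lookup for 'mental health' falls back to the default of 0 because 'mental' and 'health' are stored as separate keys.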

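I'm not sure a Counter is the right solution here. If you really need a highly scalable and indexable solution, then n-grams are probably the way to go, but for small to medium-sized problems, regex pattern matching should be fast enough.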
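For reference, the n-gram route could look roughly like the sketch below. It is plain Python rather than NLTK, `ngram_counts` is a hypothetical helper of mine, and the punctuation handling is an assumption that only covers periods and commas. The idea is to count every n-gram from length 1 up to the longest term, so terms of different lengths can all be looked up in one Counter:

```python
from collections import Counter

def ngram_counts(text, max_n):
    """Count every 1..max_n word n-gram in text (hypothetical helper)."""
    words = text.lower().replace('.', '').replace(',', '').split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts

mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
            'ptsd', 'schizophrenia', 'mental health']
abstract = ('This is a mental health abstract about anxiety and bipolar '
            'disorder as well as other things.')

# The longest term decides how large the n-grams need to be.
max_n = max(len(term.split()) for term in mh_terms)
counts = ngram_counts(abstract, max_n)
total = sum(counts[term] for term in mh_terms)
print(total)
```

This builds far more n-grams than are ever looked up, which is why it only pays off when the same counts are reused for many term lists. The regex version that follows is the approach this answer actually recommends.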

import re

abstracts = [
    'This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
    'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.'
]

mh_terms = [
    'bipolar disorder', 'anxiety', 'substance abuse disorder',
    'ptsd', 'schizophrenia', 'mental health'
]

def _regex_word(text):
    """ wrap text with special regex expression for start/end of words """
    return '\\b{}\\b'.format(text)

def _normalize(text):
    """ Remove any non alpha/numeric/space character """
    return re.sub('[^a-z0-9 ]', '', text.lower())


normed_terms = [_normalize(term) for term in mh_terms]


for raw_abstract in abstracts:
    print('--------')
    normed_abstract = _normalize(raw_abstract)

    # Search for all occurrences of chosen terms
    found = {}
    for norm_term in normed_terms:
        pattern = _regex_word(norm_term)
        found[norm_term] = len(re.findall(pattern, normed_abstract))
    print('found = {!r}'.format(found))
    mh_total_occur = sum(found.values())
    print('mh_total_occur = {!r}'.format(mh_total_occur))

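I tried to add helper functions and comments to make it clear what I'm doing.

Using the \b regex control character is important in the general use case, because it prevents a search term like "miss" from matching inside a word like "dismiss".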
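A small sketch of that difference, using a made-up sentence of my own rather than the abstracts:

```python
import re

text = 'do not dismiss the risk of a near miss'

# Without word boundaries, "miss" also matches inside "dismiss".
loose = re.findall('miss', text)

# With \b on both sides, only the standalone word matches.
strict = re.findall(r'\bmiss\b', text)

print(len(loose), len(strict))
```

The unanchored pattern finds two matches and the anchored one finds a single match. If search terms could contain regex metacharacters, passing them through `re.escape` before wrapping them in `\b` would be a sensible precaution. In the code above that is not needed, since `_normalize` already strips everything except letters, digits, and spaces.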

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/57527726
