
I need to compute n-gram frequencies

Stack Overflow user
Asked 2021-01-03 04:40:00
1 answer, 75 views, 0 followers, 1 vote

I have a list of tweets and need the two-word n-grams (bigrams) in it. First I convert the list to a str, like this:

text_ = str(list_)

The text then looks like this:

'Based today data dshs website almost vaccine received unused At current vaccination rate take entire st qtr get group done And gets harder What say ye COVID vaccine' 'That thing About teachers incl staff At least cohorts students groups attend classes together day quarantined home given time past months It catch vaccine'

Then I import the libraries:

from collections import Counter
from nltk import ngrams 

and apply my code:

n_gram = 2

terms = Counter(ngrams(text_.split(), n_gram))

I do get a result from this, but the final result I want should look like this:

[(('based', 'today'), 2),
 (('vaccine ', 'recived'), 2),
 (('attend ', 'happening'), 1),
 (('that', 'the'), 1)]

Any help is much appreciated. Best regards.

1 Answer

Stack Overflow user

Answered 2021-01-03 21:21:41

Assuming you have a sequence of text strings (tweets), here a tuple of two:

texts = ('Based today data dshs website almost vaccine received unused '
        'At current vaccination rate take entire st qtr get group done ' 
        'And gets harder What say ye COVID vaccine',
        'That thing About teachers incl staff At least cohorts students '
        ' groups attend classes together day quarantined home given '
        'time past months It catch vaccine')
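
The question builds its input with `str(list_)`. If `list_` is a Python list, that call returns the list's repr, so brackets, quotes and commas end up glued to the tokens. A minimal sketch of the difference, using a made-up two-tweet `list_` (the sample contents are an assumption for illustration only):

```python
# Hypothetical stand-in for the asker's `list_`; the contents are illustrative.
list_ = ["Based today data", "That thing About"]

# str() of a list is its repr, so punctuation from the repr leaks into tokens.
tokens_from_str = str(list_).split()
print(tokens_from_str)

# Splitting each tweet on its own keeps tokens clean and also prevents
# n-grams from spanning the boundary between two tweets.
tokens_per_tweet = [tweet.split() for tweet in list_]
print(tokens_per_tweet)
```

Keeping the tweets as separate strings, as the `texts` tuple above does, avoids both problems.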

With the texts in that form, you can count unigrams, bigrams, and trigrams for each text like this:

for text in texts:
    # The count dicts are rebuilt for every text, so each covers one tweet only.
    unigrams = text.split()
    unigram_counts = {}
    for unigram in unigrams:
        unigram_counts[unigram] = unigram_counts.get(unigram, 0) + 1

    # Pair each token with its successor; the comma-joined string is the key.
    bigrams = [",".join(bigram) for bigram in zip(unigrams[:-1], unigrams[1:])]
    bigram_counts = {}
    for bigram in bigrams:
        bigram_counts[bigram] = bigram_counts.get(bigram, 0) + 1

    # Same idea with three shifted views of the token list.
    trigrams = [",".join(trigram) for trigram in zip(unigrams[:-2], unigrams[1:-1], unigrams[2:])]
    trigram_counts = {}
    for trigram in trigrams:
        trigram_counts[trigram] = trigram_counts.get(trigram, 0) + 1

    print(bigram_counts)

{'That,thing': 1, 'thing,About': 1, 'About,teachers': 1, ...
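
The loop above counts each text separately. The question seems to ask for frequencies over the whole set of tweets, with lowercase tokens, so here is a rough sketch of one way to accumulate counts across all texts using only the standard library. The helper name `ngram_counts`, its `lowercase` flag and the two short sample strings are my own assumptions, not part of the original code:

```python
from collections import Counter

def ngram_counts(texts, n=2, lowercase=True):
    """Count n-grams over all texts without crossing text boundaries."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split() if lowercase else text.split()
        # Zipping n shifted views of the token list yields consecutive n-tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Illustrative input, not the tweets from the question.
sample_texts = ("COVID vaccine received today", "covid vaccine rate")
bigram_totals = ngram_counts(sample_texts, n=2)
print(bigram_totals.most_common(1))
```

Because the keys are tuples, the result already has the `(('word1', 'word2'), count)` shape the question asks for, and `Counter.most_common()` returns it sorted by descending count.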

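If you want a list of tuples sorted by frequency, you can do the following (note that after the loop, bigram_counts holds the counts for the last text only):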

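# Turn each "word1,word2" key back into a tuple, then sort by descending count.
list_of_tuples = [(tuple(key.split(',')), value) for key, value in bigram_counts.items()]
sorted_bigrams = sorted(list_of_tuples, key=lambda x: -x[1])
print(sorted_bigrams)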
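
One caveat with joining on "," and splitting again: it assumes no token contains a comma. A small sketch, under that caveat, that keeps tuples as keys from the start and breaks ties alphabetically so the order is deterministic (the sample sentence is made up for illustration):

```python
from collections import Counter

# Illustrative tokens, not taken from the tweets in the question.
tokens = "it catch vaccine it catch cold".split()

# Tuple keys need no join/split round trip, so commas in tokens are harmless.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Sort by descending count, then alphabetically by bigram for stable ties.
ranked = sorted(bigram_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked[0])
```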

Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/65543862
