Q: Merging generator objects to count frequencies in NLTK

Stack Overflow user

Asked 2017-09-27 13:54:19

Answers: 2, Views: 3.4K, Followers: 0, Votes: 3

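I am trying to count the frequency of various n-grams using the ngrams and FreqDist functions in nltk. Because the ngrams function returns a generator object, I would like to merge the output of each ngrams call before calculating the frequencies. However, I am having trouble merging the generator objects.

I tried itertools.chain, which created an itertools object rather than merging the generators. I finally settled on permutations, but having to parse the objects afterwards seems redundant.

The working code so far is: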

import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
import re
corpus = 'testing sentences to see if if if this works'
token = word_tokenize(corpus)
unigrams = ngrams(token,1)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)


perms = list(permutations([unigrams,bigrams,trigrams]))
fdist = nltk.FreqDist(perms)
for x,y in fdist.items():
    for k in x:
        for v in k:
            words = '_'.join(v)
            print words, y

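As you can see in the results, the frequencies are not computed from the words inside the individual generator objects: each generator simply gets a frequency of 1. Is there a more Pythonic way to do this properly?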

python-2.7

nltk

generator

word-frequency

2 Answers

Stack Overflow user

Accepted answer

Posted 2017-09-27 15:28:05

Use everygrams; it returns all n-grams for a given range of n.

>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'
>>> everygrams(corpus.split(), 1, 3)
<generator object everygrams at 0x7f4e272e9730>
>>> list(everygrams(corpus.split(), 1, 3))
[('testing',), ('sentences',), ('to',), ('see',), ('if',), ('if',), ('if',), ('this',), ('works',), ('testing', 'sentences'), ('sentences', 'to'), ('to', 'see'), ('see', 'if'), ('if', 'if'), ('if', 'if'), ('if', 'this'), ('this', 'works'), ('testing', 'sentences', 'to'), ('sentences', 'to', 'see'), ('to', 'see', 'if'), ('see', 'if', 'if'), ('if', 'if', 'if'), ('if', 'if', 'this'), ('if', 'this', 'works')]

To count n-grams of different orders together:

>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'.split()
>>> fd = FreqDist(everygrams(corpus, 1, 3))
>>> fd
FreqDist({('if',): 3, ('if', 'if'): 2, ('to', 'see'): 1, ('sentences', 'to', 'see'): 1, ('if', 'this'): 1, ('to', 'see', 'if'): 1, ('works',): 1, ('testing', 'sentences', 'to'): 1, ('sentences', 'to'): 1, ('sentences',): 1, ...})
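For readers who want to see what is being counted without installing NLTK, here is a rough pure-Python sketch of the same idea. The helpers ngrams_of and everygrams_of are hypothetical names made up for this illustration, not NLTK functions, and collections.Counter stands in for FreqDist.

```python
from collections import Counter

def ngrams_of(tokens, n):
    """Yield every run of n consecutive tokens as a tuple (hypothetical helper)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def everygrams_of(tokens, min_len, max_len):
    """Yield all n-grams with min_len <= n <= max_len (hypothetical helper)."""
    for n in range(min_len, max_len + 1):
        for gram in ngrams_of(tokens, n):
            yield gram

tokens = 'testing sentences to see if if if this works'.split()

# Counter consumes the generator in one pass, much like FreqDist above.
counts = Counter(everygrams_of(tokens, 1, 3))
print(counts[('if',)], counts[('if', 'if')], counts[('if', 'if', 'if')])  # 3 2 1
```

The counts agree with the FreqDist output shown above for the same sentence.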

Alternatively, since FreqDist is essentially a collections.Counter sub-class, you can combine counters like this:

>>> from collections import Counter
>>> x = Counter([1,2,3,4,4,5,5,5])
>>> y = Counter([1,1,1,2,2])
>>> x + y
Counter({1: 4, 2: 3, 5: 3, 4: 2, 3: 1})

>>> from nltk import FreqDist
>>> FreqDist(['a', 'a', 'b'])
FreqDist({'a': 2, 'b': 1})
>>> a = FreqDist(['a', 'a', 'b'])
>>> b = FreqDist(['b', 'b', 'c', 'd', 'e'])
>>> a + b
FreqDist({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'd': 1})
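Applied to the sentence from the question, the same addition could look like the sketch below. It uses plain collections.Counter instead of FreqDist so it runs without NLTK, and it builds the n-grams with zip over shifted slices, which is an assumption of this example rather than what nltk.util.ngrams does internally.

```python
from collections import Counter

tokens = 'testing sentences to see if if if this works'.split()

# One Counter per n-gram order; zip over shifted slices yields n-gram tuples.
unigram_counts = Counter(zip(tokens))
bigram_counts = Counter(zip(tokens, tokens[1:]))
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Counter addition sums the counts key by key.
combined = unigram_counts + bigram_counts + trigram_counts
print(combined.most_common(2))  # [(('if',), 3), (('if', 'if'), 2)]
```

Note that + on Counters keeps only keys whose resulting count is positive, which makes no difference here because every count is at least 1.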

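7 votes

Stack Overflow user

Posted 2017-09-27 16:44:39

alvas is right, nltk.everygrams is the perfect tool for this job. But merging several iterators is neither hard nor uncommon, so you should know how to do it. The key is that any iterator can be converted to a list, but it is best to do that only once: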

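Make a list out of several iterators

Just use lists (simple but inefficient):
allgrams = list(unigrams) + list(bigrams) + list(trigrams)

Or build one list properly:
allgrams = list(unigrams)
allgrams.extend(bigrams)
allgrams.extend(trigrams)

Or use itertools.chain(), then make a list:
allgrams = list(itertools.chain(unigrams, bigrams, trigrams))

The above produce identical results (as long as you don't try to reuse the iterators unigrams etc.; you need to redefine them between examples).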
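To see why the iterators have to be redefined between examples, here is a small sketch of what happens when a one-shot iterator is consumed twice. make_bigrams is a hypothetical stand-in for ngrams(token, 2), built on zip so that it runs without NLTK (Python 3, where zip returns an iterator).

```python
tokens = 'testing sentences to see if if if this works'.split()

def make_bigrams():
    """Return a fresh one-shot iterator of bigrams (hypothetical stand-in)."""
    return zip(tokens, tokens[1:])

bigrams = make_bigrams()
first_pass = list(bigrams)   # consumes the iterator
second_pass = list(bigrams)  # nothing left to yield
print(len(first_pass), len(second_pass))  # 8 0

# Calling the factory again gives a new iterator with all items available.
fresh = list(make_bigrams())
print(len(fresh))  # 8
```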

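Use the iterators themselves

Don't fight iterators, learn to work with them. Many Python functions accept them in place of lists, saving you a lot of space and time.

You can form a single iterator and pass it to nltk.FreqDist():
fdist = nltk.FreqDist(itertools.chain(unigrams, bigrams, trigrams))

Or you can work with multiple iterators. FreqDist, like Counter, has an update() method that you can use to count things incrementally:
fdist = nltk.FreqDist(unigrams)
fdist.update(bigrams)
fdist.update(trigrams)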
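Both options can be checked against each other with a short sketch. It uses collections.Counter in place of FreqDist, and grams is a hypothetical helper standing in for ngrams(token, n), so treat it as an illustration rather than NLTK's own behavior.

```python
import itertools
from collections import Counter

tokens = 'testing sentences to see if if if this works'.split()

def grams(n):
    """Return a fresh lazy iterator of n-gram tuples (hypothetical stand-in)."""
    return zip(*(tokens[i:] for i in range(n)))

# Option 1: chain the iterators and count everything in a single pass.
chained = Counter(itertools.chain(grams(1), grams(2), grams(3)))

# Option 2: start from one iterator and update incrementally.
updated = Counter(grams(1))
updated.update(grams(2))
updated.update(grams(3))

print(chained == updated, chained[('if',)], chained[('if', 'if')])  # True 3 2
```

The "itertools object" mentioned in the question is itself a lazy iterator over all the n-grams, so it can be handed straight to a counter without converting it to a list first.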

2 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/46449738
