
Q: How do I get PMI scores for trigram collocations with NLTK? (Python)

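Stack Overflow user

Asked 2014-01-15 03:38:21

Answers: 3 · Views: 9.1K · Followers: 0 · Votes: 3

I know how to get bigram and trigram collocations with NLTK and how to apply them to my own corpus. The code is below.

My only problem is how to print the trigrams together with their PMI values. I have searched the NLTK documentation several times. Either I am missing something, or it is not there.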

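import nltk
from nltk.collocations import TrigramCollocationFinder

# Read the corpus and split it into whitespace-delimited tokens
with open("large.txt", "r") as f:
    myList = f.read().split()
myCorpus = nltk.Text(myList)
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(myCorpus)

# Keep only trigrams that occur at least 3 times
finder.apply_freq_filter(3)
# nbest returns the trigrams only, without their scores
print(finder.nbest(trigram_measures.pmi, 500000))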

Tags: collocation, python, nlp, nltk

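3 Answers

Stack Overflow user

Accepted answer

Posted 2014-01-15 06:52:11

If you look at the source code of nltk.collocations.TrigramCollocationFinder (see modules/nltk/collocations.html), you will see that nbest() simply returns the n-grams produced by TrigramCollocationFinder().score_ngrams and drops their scores: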

def nbest(self, score_fn, n):
    """Returns the top n ngrams when scored by the given function."""
    return [p for p,s in self.score_ngrams(score_fn)[:n]]

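So you can call score_ngrams() directly instead of nbest(); it returns a list of (ngram, score) pairs that is already sorted from highest to lowest score.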

def score_ngrams(self, score_fn):
    """Returns a sequence of (ngram, score) pairs ordered from highest to
    lowest score, as determined by the scoring function provided.
    """
    return sorted(self._score_ngrams(score_fn),
                  key=_itemgetter(1), reverse=True)

Try:

import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(text))

# Each item is a (trigram, PMI score) pair, highest score first
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)

Output:

(('this', 'is', 'a'), 9.047123912114026)
(('is', 'a', 'foo'), 7.46216141139287)
(('black', 'sheep', 'shep'), 5.46216141139287)
(('black', 'sheep', 'foo'), 4.877198910671714)
(('a', 'foo', 'bar'), 4.462161411392869)
(('sheep', 'shep', 'bar'), 4.462161411392869)
(('bar', 'black', 'sheep'), 4.047123912114026)
(('bar', 'black', 'sentence'), 4.047123912114026)
(('sheep', 'foo', 'bar'), 3.877198910671714)
(('bar', 'bar', 'black'), 3.047123912114026)
(('foo', 'bar', 'bar'), 3.047123912114026)
(('shep', 'bar', 'bar'), 3.047123912114026)
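
To make the scores above less opaque, here is a minimal sketch that recomputes them without NLTK. It assumes the count-based definition PMI = log2(count(w1, w2, w3) * N^2 / (count(w1) * count(w2) * count(w3))), where N is the number of tokens, and it uses str.split() instead of word_tokenize (the two give the same tokens for this sample). The helper name trigram_pmi is made up for illustration and is not part of NLTK.

```python
import math
from collections import Counter

text = ("this is a foo bar bar black sheep  foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()

unigram_counts = Counter(tokens)
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)  # N, the number of tokens in the sample


def trigram_pmi(trigram):
    """Hypothetical helper: count-based PMI of one trigram (log base 2)."""
    w1, w2, w3 = trigram
    joint = trigram_counts[trigram]
    marginals = unigram_counts[w1] * unigram_counts[w2] * unigram_counts[w3]
    return math.log2(joint * total ** 2) - math.log2(marginals)


# Score every observed trigram and sort from highest to lowest PMI
scored = sorted(((tg, trigram_pmi(tg)) for tg in trigram_counts),
                key=lambda pair: pair[1], reverse=True)
for tg, score in scored:
    print(tg, score)
```

For example, ('this', 'is', 'a') occurs once and each of its words occurs once among 23 tokens, so its score is log2(23^2), roughly 9.047, which agrees with the first line of the output above.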

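Votes: 6

Stack Overflow user

Posted 2014-01-15 05:44:26

I think you are looking for score_ngram. Either way, you don't need a special print function; you can build the output yourself.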

trigrams = finder.nbest(trigram_measures.pmi, 500000)
# Pair each trigram with its PMI score by scoring it individually
print([(x, finder.score_ngram(trigram_measures.pmi, x[0], x[1], x[2])) for x in trigrams])
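
For comparison, here is a sketch that gets the same kind of (trigram, score) pairs in one pass by slicing the result of score_ngrams, with the frequency filter from the question applied first. The sample sentence is borrowed from the accepted answer and tokenized with str.split(); the cutoff top_n is an arbitrary value chosen for illustration.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

tokens = ("this is a foo bar bar black sheep foo bar bar black sheep "
          "foo bar bar black sheep shep bar bar black sentence").split()

finder = TrigramCollocationFinder.from_words(tokens)
# Drop trigrams seen fewer than 3 times, as in the question
finder.apply_freq_filter(3)

top_n = 5  # illustrative cutoff, not taken from the original post
# score_ngrams is already sorted by score, so slicing keeps the best ones
top = finder.score_ngrams(TrigramAssocMeasures.pmi)[:top_n]
for trigram, score in top:
    print(" ".join(trigram), score)
```

This avoids scoring each trigram a second time, which may matter when the cutoff is as large as the 500000 used in the question.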

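Votes: 1

Stack Overflow user

Posted 2022-08-20 14:05:40

NLTK has a dedicated documentation page that shows how to use the different collocation finders: https://www.nltk.org/howto/collocations.html

Below is an example of using BigramCollocationFinder together with BigramAssocMeasures, scored with the pointwise mutual information (PMI) measure.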

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.tokenize import word_tokenize


text = "Collocations are expressions of multiple words which commonly co-occur. For example, the top ten bigram collocations in Genesis are listed below, as measured using Pointwise Mutual Information."
words = word_tokenize(text)

finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.pmi
# it combines bigram words with `_` to a single str
bigram_collocations = {"_".join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}
print(f"bigram collocations: {bigram_collocations}")

Output:

{'For_example': 4.954196310386875, 'Mutual_Information': 4.954196310386875, 'Pointwise_Mutual': 4.954196310386875, 'as_measured': 4.954196310386875, 'bigram_collocations': 4.954196310386875, 'collocations_in': 4.954196310386875, 'commonly_co-occur': 4.954196310386875, 'expressions_of': 4.954196310386875, 'in_Genesis': 4.954196310386875, 'listed_below': 4.954196310386875, 'measured_using': 4.954196310386875, 'multiple_words': 4.954196310386875, 'of_multiple': 4.954196310386875, 'ten_bigram': 4.954196310386875, 'the_top': 4.954196310386875, 'top_ten': 4.954196310386875, 'using_Pointwise': 4.954196310386875, 'which_commonly': 4.954196310386875, 'words_which': 4.954196310386875, ',_as': 3.954196310386875, ',_the': 3.954196310386875, '._For': 3.954196310386875, 'Collocations_are': 3.954196310386875, 'Genesis_are': 3.954196310386875, 'Information_.': 3.954196310386875, 'are_expressions': 3.954196310386875, 'are_listed': 3.954196310386875, 'below_,': 3.954196310386875, 'co-occur_.': 3.954196310386875, 'example_,': 3.954196310386875}

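NLTK also provides TrigramCollocationFinder and QuadgramCollocationFinder under nltk.collocations.

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link: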

https://stackoverflow.com/questions/21128689
