I have a list of lists, where each inner list is a sentence tokenized into words:
sentences = [['farmer', 'plants', 'grain'],
             ['fisher', 'catches', 'tuna'],
             ['police', 'officer', 'fights', 'crime']]

Currently I try to compute the nGrams like so:
from nltk import ngrams

numSentences = len(sentences)
nGrams = []
for i in range(0, numSentences):
    nGrams.append(list(ngrams(sentences, 2)))

This finds the bigrams of the whole list of sentences, rather than of the individual words of each inner list (and it repeats for the number of sentences, which is somewhat predictable):
[[(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])],
 [(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])],
 [(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])]]

How do I compute the nGrams of each sentence (by word)? In other words, how do I make sure the nGrams don't span multiple list items? Here is my desired output:
farmer plants
plants grain
fisher catches
catches tuna
police officer
officer fights
fights crime

Posted on 2017-05-13 01:37:31
Take the ngrams of each sentence and sum up the results together. You probably want to count them, rather than keep them in one huge collection. Starting with sentences as a list of lists of words:
import collections
import nltk

counts = collections.Counter()   # or nltk.FreqDist()
for sent in sentences:
    counts.update(nltk.ngrams(sent, 2))

Or, if you prefer single strings instead of tuples for your keys:
for sent in sentences:
    counts.update(" ".join(n) for n in nltk.ngrams(sent, 2))

That's really all there is to it. Then you can look at the most common ones, and so on.
print(counts.most_common(10))

PS. If you really do want to pile up the bigrams in one list, here's how. (Your code forms "bigrams" of sentences, not words, because you neglected to write sentences[i].) But skip this step and just count them directly.
all_ngrams = []
for sent in sentences:
    all_ngrams.extend(nltk.ngrams(sent, 2))

Posted on 2017-05-13 04:57:16
You can also consider using scikit-learn's CountVectorizer as an alternative.
from sklearn.feature_extraction.text import CountVectorizer
sents = list(map(lambda x: ' '.join(x), sentences))  # input is a list of token lists, so join each back into a sentence string first
count_vect = CountVectorizer(ngram_range=(2,2)) # bigram
count_vect.fit(sents)
count_vect.vocabulary_

This will give you:
{'catches tuna': 0,
'farmer plants': 1,
'fights crime': 2,
'fisher catches': 3,
'officer fights': 4,
'plants grain': 5,
 'police officer': 6}

Posted on 2017-05-12 23:14:01
Use a list comprehension and chain to flatten the list:
>>> from itertools import chain
>>> from collections import Counter
>>> from nltk import ngrams
>>> x = [['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime']]
>>> Counter(chain(*[ngrams(sent,2) for sent in x]))
Counter({('plants', 'grain'): 1, ('police', 'officer'): 1, ('farmer', 'plants'): 1, ('officer', 'fights'): 1, ('fisher', 'catches'): 1, ('fights', 'crime'): 1, ('catches', 'tuna'): 1})
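As a side note, chain(*[...]) builds the whole list of per-sentence generators before unpacking it; itertools.chain.from_iterable is the lazy equivalent. A minimal sketch of the same flattening-and-counting, using zip(sent, sent[1:]) as a stand-in for nltk.ngrams(sent, 2) so it runs without NLTK:

```python
from collections import Counter
from itertools import chain

x = [['farmer', 'plants', 'grain'],
     ['fisher', 'catches', 'tuna'],
     ['police', 'officer', 'fights', 'crime']]

# zip(sent, sent[1:]) yields the same bigram tuples as nltk.ngrams(sent, 2)
c = Counter(chain.from_iterable(zip(sent, sent[1:]) for sent in x))

print(c[('catches', 'tuna')])  # 1
print(len(c))                  # 7 distinct bigrams
```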
>>> c = Counter(chain(*[ngrams(sent,2) for sent in x]))

Get the keys of the Counter dictionary:
>>> c.keys()
[('plants', 'grain'), ('police', 'officer'), ('farmer', 'plants'), ('officer', 'fights'), ('fisher', 'catches'), ('fights', 'crime'), ('catches', 'tuna')]

>>> [' '.join(b) for b in c.keys()]
['plants grain', 'police officer', 'farmer plants', 'officer fights', 'fisher catches', 'fights crime', 'catches tuna']

https://stackoverflow.com/questions/43939344
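Putting it together: every answer above relies only on a sliding window over each inner list, so the desired output from the question can be produced with a short self-contained sketch (again using zip in place of nltk.ngrams, so NLTK is not required):

```python
sentences = [['farmer', 'plants', 'grain'],
             ['fisher', 'catches', 'tuna'],
             ['police', 'officer', 'fights', 'crime']]

for sent in sentences:
    # the window runs over one inner list at a time,
    # so no bigram ever spans two sentences
    for a, b in zip(sent, sent[1:]):
        print(a, b)
```

This prints the seven bigrams in exactly the order shown in the question, one per line, from "farmer plants" through "fights crime".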