I have a list of tweets and need the two-word n-grams (bigrams) it contains. First I convert the list to a string:
text_ = str(list_)
The text then looks like this:
'Based today data dshs website almost vaccine received unused At current vaccination rate take entire st qtr get group done And gets harder What say ye COVID vaccine' 'That thing About teachers incl staff At least cohorts students groups attend classes together day quarantined home given time past months It catch vaccine'
Then I import the libraries:
from collections import Counter
from nltk import ngrams
Then I apply my code:
n_gram = 2
terms = Counter(ngrams(text_.split(), n_gram))
and this is what I get:

The final result I want should look like this:
[(('based', 'today'), 2),
(('vaccine ', 'recived'), 2),
(('attend ', 'happening'), 1),
(('that', 'the'), 1)]
Any help is much appreciated. Best regards.
Posted on 2021-01-03 21:21:41
Assuming you have a sequence of text strings (tweets):
texts = ('Based today data dshs website almost vaccine received unused '
         'At current vaccination rate take entire st qtr get group done '
         'And gets harder What say ye COVID vaccine',
         'That thing About teachers incl staff At least cohorts students '
         'groups attend classes together day quarantined home given '
         'time past months It catch vaccine')
Then you can build unigrams, bigrams, and trigrams like this:
for text in texts:
    # The counts are reset for every text, so after the loop they
    # describe the last text only.
    unigrams = text.split()
    unigram_counts = {}
    for unigram in unigrams:
        unigram_counts[unigram] = unigram_counts.get(unigram, 0) + 1
    # Pair each token with its successor; "," joins the pair into one key.
    bigrams = [",".join(bigram) for bigram in zip(unigrams[:-1], unigrams[1:])]
    bigram_counts = {}
    for bigram in bigrams:
        bigram_counts[bigram] = bigram_counts.get(bigram, 0) + 1
    trigrams = [",".join(trigram) for trigram in zip(unigrams[:-2], unigrams[1:-1], unigrams[2:])]
    trigram_counts = {}
    for trigram in trigrams:
        trigram_counts[trigram] = trigram_counts.get(trigram, 0) + 1
print(bigram_counts)
{'That,thing': 1, 'thing,About': 1, 'About,teachers': 1, ...
If you want a list of tuples sorted by frequency, you can do the following:
list_of_tuples = [(tuple(key.split(',')), value) for key, value in bigram_counts.items()]
sorted(list_of_tuples, key=lambda x: -x[1])
https://stackoverflow.com/questions/65543862
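If the goal is one frequency table over all tweets, in the tuple format the question asks for, a single Counter can be updated tweet by tweet. The following is only a minimal sketch under some assumptions: the helper names `bigrams` and `count_bigrams` and the shortened `sample_tweets` are hypothetical, and tokens are lowercased because the desired output in the question is lowercase. Splitting each tweet separately, rather than calling str() on the whole list, keeps list brackets and quotes out of the tokens and avoids pairing the last word of one tweet with the first word of the next.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs as tuples (hypothetical helper)."""
    return zip(tokens, tokens[1:])


def count_bigrams(tweets):
    """Count lowercase bigrams over all tweets without crossing tweet boundaries."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        counts.update(bigrams(tokens))
    return counts


# Hypothetical sample data, shortened for illustration.
sample_tweets = [
    "Based today data vaccine received unused",
    "That thing About teachers vaccine received",
]

# most_common(n) returns (bigram, count) pairs, highest count first.
top_bigrams = count_bigrams(sample_tweets).most_common(3)
print(top_bigrams)  # the first entry is (('vaccine', 'received'), 2)
```

The `bigrams` helper could probably be swapped for `nltk.ngrams(tokens, 2)`, which is what the question imports; the zip version is used here only to keep the sketch free of third-party dependencies.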