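I am trying to use the code below to find word frequencies in a document. However, instead of word frequencies it returns character frequencies. Can someone explain why? I got this code from an article I am following, but the article does not show the output, so I cannot verify it.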
sentence1 = [token for token in "hello how are you".split()]
sentence2 = [token for token in "i am fine thank you".split()]
print(sentence1)
from collections import Counter
import itertools
def map_word_frequency(document):
    print(document)
    return Counter(itertools.chain(*document))
word_counts = map_word_frequency(sentence1 + sentence2)
Answered 2020-04-17 06:38:24
Remove the call to itertools.chain:
from collections import Counter
sentence1 = [token for token in "hello how are you".split()]
sentence2 = [token for token in "i am fine thank you".split()]
def map_word_frequency(document):
    # document is already a flat list of words, so count it directly
    return Counter(document)
word_counts = map_word_frequency(sentence1 + sentence2)
print(word_counts)
Output:
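Counter({'you': 2, 'hello': 1, 'how': 1, 'are': 1, 'i': 1, 'am': 1, 'fine': 1, 'thank': 1})
The documentation gives the following example:
chain('ABC', 'DEF') --> A B C D E F
So when
chain(*document)
is executed, it unpacks the list and passes each element of the list as a separate argument. Since those elements are strings, chain iterates over their characters. A more concrete example: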
document = ['bad', 'bat', 'baby']
chain(*document)
is equivalent to:
chain('bad', 'bat', 'baby')
If you want to use chain, remove the concatenation sentence1 + sentence2 and pass a list of lists, [sentence1, sentence2], instead:
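from itertools import chain
def map_word_frequency(document):
    # document is a list of word lists, so unpacking yields one list per argument
    return Counter(chain(*document))
word_counts = map_word_frequency([sentence1, sentence2])
print(word_counts)
Also note that for the example above it is better to use chain.from_iterable, as follows:
Counter(chain.from_iterable(document))
Answered 2020-04-17 06:40:19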
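If you want the use of chain to make sense, you have to use it like this:
Counter(itertools.chain(sentence1, sentence2))
or
document = itertools.chain(sentence1, sentence2)
Counter(document)
The plain list concatenation you used, lst1 + lst2, makes chain redundant. You then applied chain to the unpacked list of strings, which produces an iterator over characters. By the way,
[token for token in s.split()]
is the same as
s.split()
https://stackoverflow.com/questions/61265456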
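To check the explanations above, the variants can be compared side by side. This is only a minimal sketch that reuses the two sentences from the question; the names flat, char_counts, from_flat, from_chain and from_nested are made up for illustration and do not appear in the original posts.

```python
from collections import Counter
from itertools import chain

# s.split() already returns a list, so no list comprehension is needed.
sentence1 = "hello how are you".split()
sentence2 = "i am fine thank you".split()

# The situation from the question: a flat list of words is unpacked into
# chain, so every word is treated as an iterable of characters.
flat = sentence1 + sentence2
char_counts = Counter(chain(*flat))

# Three ways to count words instead (hypothetical variable names).
from_flat = Counter(flat)                                            # no chain at all
from_chain = Counter(chain(sentence1, sentence2))                    # chain over the two lists
from_nested = Counter(chain.from_iterable([sentence1, sentence2]))   # list of lists

print("hello" in char_counts)                   # False: keys are single characters
print(from_flat == from_chain == from_nested)   # True
print(from_flat["you"])                         # 2
```

All three word-counting variants should produce the same Counter, so the choice is mostly about whether the input arrives as one flat list or as a list of lists.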