I have this code:
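# Remove words that appear less than X (e.g. 2) time(s)
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 2] for text in texts]

Now, does this filter out all tokens whose term frequency (total number of occurrences across all texts) is below 2, or those whose document frequency (the number of texts in which the token occurs one or more times) is below 2?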
Edit:
# Get term frequencies (how many times a term occurs no matter what)
from collections import Counter
termfrequency = Counter()
for text in texts:
    for token in text:
        termfrequency[token] += 1
texts = [[token for token in text if termfrequency[token] > 2] for text in texts]
# Get document frequencies (in how many documents a term exists > 0 times)
from collections import Counter
documentfrequency = Counter()
for text in texts:
    documentfrequency.update(set(text))
texts = [[token for token in text if documentfrequency[token] > 2] for text in texts]

Posted on 2017-09-03 08:36:04
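The goal is to count the number of documents in the whole collection in which a word occurs, regardless of how many times it occurs in any particular document.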
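To make the distinction concrete, here is a minimal sketch on a hypothetical toy corpus (the three documents below are made up for illustration and are not from the question). It computes both counts side by side. The plain nested loop from the question corresponds to the first count, term frequency, because it increments once per occurrence.

```python
from collections import Counter

# Hypothetical toy corpus: three tokenized documents (illustration only).
texts = [
    ["apple", "apple", "apple", "banana"],
    ["banana", "cherry"],
    ["banana", "cherry", "cherry"],
]

# Term frequency: every occurrence counts, as in the loop from the question.
term_frequency = Counter(token for text in texts for token in text)

# Document frequency: each document contributes at most once per token,
# because set(text) collapses repeated tokens within a document.
document_frequency = Counter()
for text in texts:
    document_frequency.update(set(text))

print(term_frequency["apple"], document_frequency["apple"])    # 3 1
print(term_frequency["cherry"], document_frequency["cherry"])  # 3 2
print(term_frequency["banana"], document_frequency["banana"])  # 3 3
```

With a threshold of `> 2`, a filter on term frequency would keep all three tokens here, while a filter on document frequency would keep only "banana".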
Here is one way to do it:
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in set(text):
    #            ^^^ set() only keeps one occurrence of each word
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 2] for text in texts]

There is nothing wrong with using defaultdict here. However, it is worth noting that the collections module has a class better suited to the task at hand. It is called Counter:
from collections import Counter
frequency = Counter()
for text in texts:
    frequency.update(set(text))
texts = [[token for token in text if frequency[token] > 2] for text in texts]

https://stackoverflow.com/questions/46021530