Q: Remove tokens by document frequency

Stack Overflow user

Asked 2017-09-03 08:19:11

Answers 1 · Views 865 · Followers 0 · Votes 0

I have this code:

# Remove words that appear less than X (e.g. 2) time(s)
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 2] for text in texts]

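Now, does this filter out all tokens whose term frequency (total number of occurrences across all texts) is below 2, or those whose document frequency (the number of texts in which the token occurs one or more times) is below 2?

Edit: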

# Get term frequencies (how many times a term occurs no matter what)

from collections import Counter
termfrequency = Counter()
for text in texts:
    for token in text:
        termfrequency[token] += 1

texts = [[token for token in text if termfrequency[token] > 2] for text in texts]

# Get document frequencies (in how many documents a term exists > 0 times)

from collections import Counter
documentfrequency = Counter()
for text in texts:
    documentfrequency.update(set(text))

texts = [[token for token in text if documentfrequency[token] > 2] for text in texts]

collections

python

1 Answer

Stack Overflow user

Answered 2017-09-03 08:36:04

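I take it the goal is to count the number of documents in the whole collection in which a word occurs, regardless of how many times it occurs in any particular document.

Here is one way to do it: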

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in set(text):
               # ^^^ set() only keeps one occurrence of each word
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 2] for text in texts]
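To make the difference concrete, here is a minimal sketch that computes both counts side by side with defaultdict. The toy corpus and the names term_frequency, document_frequency, by_term and by_document are made up for illustration; they do not come from the question.

```python
from collections import defaultdict

# Hypothetical toy corpus: each inner list is one tokenized document.
texts = [
    ["apple", "apple", "apple", "pear"],
    ["pear", "plum"],
    ["pear", "plum"],
]

term_frequency = defaultdict(int)      # total occurrences across all texts
document_frequency = defaultdict(int)  # number of texts containing the token

for text in texts:
    for token in text:        # every occurrence counts
        term_frequency[token] += 1
    for token in set(text):   # each text counts a token at most once
        document_frequency[token] += 1

# Same threshold as in the question: keep tokens whose count is greater than 2.
by_term = [[t for t in text if term_frequency[t] > 2] for text in texts]
by_document = [[t for t in text if document_frequency[t] > 2] for text in texts]

print(by_term)      # "apple" survives: it occurs 3 times in total
print(by_document)  # "apple" is dropped: it occurs in only 1 text
```

Looping over text directly, as the code in the question does, gives the first count (term frequency); wrapping it in set() gives the second (document frequency).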

There is nothing wrong with using defaultdict here. It is worth noting, though, that the collections module has a class better suited to this task. It is called Counter:

from collections import Counter
frequency = Counter()
for text in texts:
    frequency.update(set(text))
texts = [[token for token in text if frequency[token] > 2] for text in texts]
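The Counter version could also be wrapped in a small helper with an explicit threshold. This is only a sketch: the function name filter_by_document_frequency and the parameter min_df are hypothetical, and the comparison is written as >= min_df on purpose. Note that the snippets above use > 2, which keeps only tokens found in at least three texts, so tokens found in exactly two texts are removed as well.

```python
from collections import Counter


def filter_by_document_frequency(texts, min_df=2):
    """Keep only tokens that occur in at least `min_df` documents.

    `texts` is a list of token lists; a new list of lists is returned
    and the input is left unchanged.
    """
    document_frequency = Counter()
    for text in texts:
        # set() makes each document contribute at most 1 per token.
        document_frequency.update(set(text))
    return [
        [token for token in text if document_frequency[token] >= min_df]
        for text in texts
    ]


# Hypothetical example corpus.
example_texts = [["a", "a", "b"], ["b", "c"], ["b", "c", "d"]]
filtered = filter_by_document_frequency(example_texts, min_df=2)
print(filtered)
```

Here "a" is dropped even though it occurs twice, because both occurrences are in the same document, while "c" is kept because it appears in two different documents.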

Votes 0

Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/46021530
