Q: DTM sparsity differs depending on tf/tfidf, same corpus

Stack Overflow user

Asked 2016-11-29 12:35:11

Answers: 1, Views: 331, Followers: 0, Votes: 2

Can someone explain this?

My understanding is:

tf >= 0 (absolute frequency value)

tfidf >= 0 (for negative idf, tf=0)



sparse entry = 0

nonsparse entry > 0

Therefore, the exact sparse/non-sparse proportion should be the same in the two DTMs created by the code below.

library(tm)
data(crude)

dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf))
dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
dtm
dtm2

But:

> dtm
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2255/23065**
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency (tf)
> dtm2
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2215/23105**
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

text-processing

tf-idf

1 Answer

Stack Overflow user

Accepted answer

Posted 2016-11-29 13:23:00

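The sparsity can differ. The tf-idf value is 0 if tf is zero or if idf is zero, and idf is zero if a term occurs in every document. Consider the following example: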

txts <- c("super World", "Hello World", "Hello super top world")
library(tm)
tf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTf))
tfidf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTfIdf))

inspect(tf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 8/4
# Sparsity           : 33%
# Maximal term length: 5
# Weighting          : term frequency (tf)
# 
#        Docs
# Terms   1 2 3
#   hello 0 1 1
#   super 1 0 1
#   top   0 0 1
#   world 1 1 1

inspect(tfidf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 5/7
# Sparsity           : 58%
# Maximal term length: 5
# Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
# 
#        Docs
# Terms           1         2         3
#   hello 0.0000000 0.2924813 0.1462406
#   super 0.2924813 0.0000000 0.1462406
#   top   0.0000000 0.0000000 0.3962406
#   world 0.0000000 0.0000000 0.0000000

The term "super" occurs 1 time in document 1, which has 2 terms, and it occurs in 2 of the 3 documents:

1/2 * log2(3/2)
# [1] 0.2924813

The term "world" occurs 1 time in document 3, which has 4 terms, and it occurs in all 3 documents:

1/4 * log2(3/3) # 1/4 * 0
# [1] 0
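
The two hand calculations above can be extended to the whole toy matrix without tm. The following is a minimal sketch in base R, assuming the normalized weighting works as the arithmetic above suggests: tf is the count divided by the number of terms in the document, and idf is log2(number of documents / number of documents containing the term). The variable names (counts, doc_freq, zeroed_terms and so on) are illustrative and not part of tm.

```r
# Raw term counts from the toy example (rows = terms, columns = documents).
counts <- matrix(
  c(0, 1, 1,   # hello
    1, 0, 1,   # super
    0, 0, 1,   # top
    1, 1, 1),  # world
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("hello", "super", "top", "world"),
                  Docs  = c("1", "2", "3"))
)

# Normalized tf: each count divided by the number of terms in its document.
tf <- sweep(counts, 2, colSums(counts), "/")

# idf: log2(number of documents / number of documents containing the term).
doc_freq <- rowSums(counts > 0)
idf <- log2(ncol(counts) / doc_freq)

# tf-idf: each term's row of tf is scaled by that term's idf
# (the vector is recycled down the columns, so row i is multiplied by idf[i]).
tfidf <- tf * idf

# Count the non-zero entries under each weighting.
nonzero_tf    <- sum(counts != 0)
nonzero_tfidf <- sum(tfidf != 0)

# Terms whose idf is exactly zero, i.e. terms present in every document.
zeroed_terms <- names(idf)[idf == 0]

print(nonzero_tf)     # 8
print(nonzero_tfidf)  # 5
print(zeroed_terms)   # "world"
```

Every term that occurs in all documents contributes one non-zero tf entry per document, and all of those entries become zero under tf-idf. In the question's output the non-zero count drops from 2255 to 2215, a difference of 40 across 20 documents, which would be consistent with two terms occurring in every document of crude.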

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/40866079
