Can anyone explain this?
My understanding is:
tf >= 0 (absolute frequency value)
tfidf >= 0 (for negative idf, tf=0)
sparse entry = 0
nonsparse entry > 0
So the exact sparse/non-sparse proportion should be the same in the two DTMs created by the code below.
library(tm)
data(crude)
dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf))
dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
dtm
dtm2
But:
> dtm
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2255/23065**
Sparsity : 91%
Maximal term length: 17
Weighting : term frequency (tf)
> dtm2
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2215/23105**
Sparsity : 91%
Maximal term length: 17
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Posted on 2016-11-29 13:23:00
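The sparsity can differ. The TF-IDF value is zero if TF is zero or if IDF is zero, and IDF is zero if a term occurs in every document. Consider the following example: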
txts <- c("super World", "Hello World", "Hello super top world")
library(tm)
tf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTf))
tfidf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTfIdf))
inspect(tf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 8/4
# Sparsity : 33%
# Maximal term length: 5
# Weighting : term frequency (tf)
#
#        Docs
# Terms   1 2 3
#   hello 0 1 1
#   super 1 0 1
#   top   0 0 1
#   world 1 1 1
inspect(tfidf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 5/7
# Sparsity : 58%
# Maximal term length: 5
# Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
#
#        Docs
# Terms           1         2         3
#   hello 0.0000000 0.2924813 0.1462406
#   super 0.2924813 0.0000000 0.1462406
#   top   0.0000000 0.0000000 0.3962406
#   world 0.0000000 0.0000000 0.0000000
The term "super" occurs once in document 1, which has 2 terms, and it appears in 2 of the 3 documents:
1/2 * log2(3/2)
# [1] 0.2924813
The term "world" occurs once in document 3, which has 4 terms, and it appears in all 3 documents:
1/4 * log2(3/3) # 1/4 * 0
# [1] 0
https://stackoverflow.com/questions/40866079
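To see where the non-sparse entries go, the same arithmetic can be repeated for every cell in base R. This is only a sketch: it applies the formula from the two calculations above (term count divided by document length, times log2 of documents over document frequency) to hand-tokenised text. It assumes tm's tokenisation yields the same four lowercase terms and does not call tm itself. The names `docs`, `tf_norm` and `df` are made up for the illustration.

```r
# Toy corpus from the example, lowercased and split by hand
# (assumption: tm would produce the same four terms).
docs <- list(
  c("super", "world"),
  c("hello", "world"),
  c("hello", "super", "top", "world")
)
terms <- sort(unique(unlist(docs)))

# Raw counts: one row per term, one column per document.
tf <- vapply(
  docs,
  function(d) tabulate(match(d, terms), nbins = length(terms)),
  integer(length(terms))
)
dimnames(tf) <- list(terms, seq_along(docs))

# Normalised tf: each count divided by the length of its document.
tf_norm <- sweep(tf, 2, colSums(tf), "/")

# Document frequency and idf; idf is 0 for a term found in every document.
df  <- rowSums(tf > 0)
idf <- log2(ncol(tf) / df)

# idf has one value per row, so it recycles down each column.
tfidf <- tf_norm * idf

sum(tf > 0)                 # non-sparse entries under tf
sum(tfidf > 0)              # non-sparse entries under tf-idf
names(df)[df == ncol(tf)]   # terms whose whole row becomes zero
```

The last line lists the terms responsible for the difference. Every term that occurs in all documents turns one non-zero entry per document into a zero, which is the kind of gap seen between the two `crude` matrices in the question.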