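I am new to NLP; please clarify how the TFIDF values are transformed by fit_transform.
The formula below for computing the IDF works fine: log((total number of documents + 1) / (number of documents containing the term + 1)) + 1
E.g.: the IDF value of the term "This" in document 1 ("this is a string") is 1.91629073
After applying fit_transform, the values of all terms change. What formula or logic is used for the transformation?
TFIDF = TF * IDF
E.g.: the TFIDF value of the term "This" in document 1 ("this is a string") is 0.61366674
How is this value, 0.61366674, arrived at?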
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
d = pd.Series(['This is a string','This is another string',
'TFIDF Computation Calculation','TFIDF is the product of TF and IDF'])
df = pd.DataFrame(d)
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df[0])
print (tfidf_vectorizer.idf_)
#output
#[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.22314355 1.91629073
#1.91629073 1.51082562 1.91629073 1.51082562 1.91629073 1.51082562]
##-------------------------------------------------
##how the above values are getting transformed here
##-------------------------------------------------
print (tfidf.toarray())
#[[0. 0. 0. 0. 0. 0.49681612 0.
#0. 0.61366674 0. 0. 0. 0.61366674]
# [0. 0.61422608 0. 0. 0. 0.39205255
# 0. 0. 0.4842629 0. 0. 0. 0.4842629 ]
# [0. 0. 0.61761437 0.61761437 0. 0.
# 0. 0. 0. 0. 0.48693426 0. 0. ]
# [0.37718389 0. 0. 0. 0.37718389 0.24075159
# 0.37718389 0.37718389 0. 0.37718389 0.29737611 0.37718389 0. ]]
Posted on 2019-03-04 21:19:08
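These are normalized TF-IDF vectors, because norm='l2' is the default according to the documentation. So in the output of tfidf.toarray(), each element on level 0 (row) of the array represents a document and each element on level 1 (column) represents a unique word. The sum of squares of the vector elements of each document equals 1, which you can check by printing print([sum([word ** 2 for word in doc]) for doc in tfidf.toarray()]).
norm : 'l1', 'l2' or None, optional (default='l2')
Each output row will have unit norm, either:
* 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
* 'l1': Sum of absolute values of vector elements is 1. See preprocessing.normalize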
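The remark about cosine similarity in the quoted documentation can be checked by hand. The sketch below is only an illustration: it uses the rounded values printed in the question for documents 0 and 1, restricted to the four terms that occur in either document, and the helper names `dot` and `cosine_similarity` are made up for this example.

```python
import math

# Components for the terms (another, is, string, this); every other
# vocabulary entry is 0 in both documents and does not affect the result.
# Raw values are TF * IDF with TF = 1, using the idf_ output from the question.
doc0_raw = [0.0, 1.22314355, 1.51082562, 1.51082562]
doc1_raw = [1.91629073, 1.22314355, 1.51082562, 1.51082562]

# L2-normalized values copied from the tfidf.toarray() output in the question.
doc0_l2 = [0.0, 0.49681612, 0.61366674, 0.61366674]
doc1_l2 = [0.61422608, 0.39205255, 0.4842629, 0.4842629]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


cos_raw = cosine_similarity(doc0_raw, doc1_raw)  # cosine on unnormalized vectors
dot_l2 = dot(doc0_l2, doc1_l2)                   # plain dot product on normalized vectors
print(cos_raw, dot_l2)
```

Both numbers should agree up to the rounding of the printed values, which suggests the documentation sentence describes a property of L2-normalized rows rather than a separate processing step.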
print(tfidf) #the same values you find in tfidf.toarray() but more readable
output: ([index of document on array lvl 0 / row], [index of unique word on array lvl 1 / column]) normed TF-IDF value
(0, 12) 0.6136667440107333 #1st word in 1st sentence: 'This'
(0, 5) 0.4968161174826459 #'is'
(0, 8) 0.6136667440107333 #'string', see that word 'a' is missing
(1, 12) 0.48426290003607125 #'This'
(1, 5) 0.3920525532545391 #'is'
(1, 8) 0.48426290003607125 #'string'
(1, 1) 0.6142260844216119 #'another'
(2, 10) 0.48693426407352264 #'TFIDF'
(2, 3) 0.6176143709756019 #'Computation'
(2, 2) 0.6176143709756019 #'Calculation'
(3, 5) 0.2407515909314943 #'is'
(3, 10) 0.2973761110467491 #'TFIDF'
(3, 11) 0.37718388973255157 #'the'
(3, 7) 0.37718388973255157 #'product'
(3, 6) 0.37718388973255157 #'of'
(3, 9) 0.37718388973255157 #'TF'
(3, 0) 0.37718388973255157 #'and'
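(3, 4) 0.37718388973255157 #'IDF'
Because these are L2-normalized TF-IDF values, the sum of squares of the vector elements of each document equals 1. For example, for the first document (index 0): sum([0.6136667440107333 ** 2, 0.4968161174826459 ** 2, 0.6136667440107333 ** 2])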
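To see where 0.61366674 comes from, the L2 normalization of the first document can be redone by hand. This is a minimal sketch, assuming TF = 1 for every term and taking the IDF values from the idf_ output in the question; the variable names are illustrative only.

```python
import math

# Unnormalized TF-IDF values of document 0 ("This is a string"):
# TF = 1 for each term, IDF values from the idf_ output in the question.
# 'a' is absent because the default tokenizer drops single-character tokens.
raw = {"this": 1.51082562, "is": 1.22314355, "string": 1.51082562}

# Euclidean (L2) length of the document vector.
l2_norm = math.sqrt(sum(value ** 2 for value in raw.values()))

# Dividing every component by that length gives a unit-length vector.
normalized = {term: value / l2_norm for term, value in raw.items()}

print(normalized)
print(sum(value ** 2 for value in normalized.values()))
```

Up to rounding, the value for 'this' matches the 0.61366674 shown for document 0, and the squared values sum to 1.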
You can turn this normalization off by setting norm=None.
print(TfidfVectorizer(norm=None).fit_transform(df[0])) #the same values you find in TfidfVectorizer(norm=None).fit_transform(df[0]).toarray(), but more readable
output: ([index of document on array lvl 0 / row], [index of unique word on array lvl 1 / column]) TF-IDF value
(0, 12) 1.5108256237659907 #1st word in 1st sentence: 'This'
(0, 5) 1.2231435513142097 #'is'
(0, 8) 1.5108256237659907 #'string', see that word 'a' is missing
(1, 12) 1.5108256237659907 #'This'
(1, 5) 1.2231435513142097 #'is'
(1, 8) 1.5108256237659907 #'string'
(1, 1) 1.916290731874155 #'another'
(2, 10) 1.5108256237659907 #'TFIDF'
(2, 3) 1.916290731874155 #'Computation'
(2, 2) 1.916290731874155 #'Calculation'
(3, 5) 1.2231435513142097 #'is'
(3, 10) 1.5108256237659907 #'TFIDF'
(3, 11) 1.916290731874155 #'the'
(3, 7) 1.916290731874155 #'product'
(3, 6) 1.916290731874155 #'of'
(3, 9) 1.916290731874155 #'TF'
(3, 0) 1.916290731874155 #'and'
(3, 4) 1.916290731874155 #'IDF'
Because every word appears only once in each document, the TF-IDF values are the IDF values of each word times 1:
tfidf_vectorizer = TfidfVectorizer(norm=None)
tfidf = tfidf_vectorizer.fit_transform(df[0])
print(tfidf_vectorizer.idf_)
output: Smoothed IDF-values
[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.22314355
1.91629073 1.91629073 1.51082562 1.91629073 1.51082562 1.91629073
1.51082562]
I hope the above is helpful.
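The smoothed IDF values can also be recomputed without scikit-learn, using the formula from the question. The sketch below is an approximation of the default behaviour, not the library's implementation: it assumes lowercasing and a tokenizer that keeps only tokens of two or more word characters, and the names `tokenized`, `vocabulary` and `doc_freq` are illustrative.

```python
import math
import re

docs = [
    "This is a string",
    "This is another string",
    "TFIDF Computation Calculation",
    "TFIDF is the product of TF and IDF",
]

# Assumption: lowercase and keep tokens of two or more word characters,
# mirroring TfidfVectorizer's defaults (this is why 'a' is dropped).
tokenized = [set(re.findall(r"\b\w\w+\b", doc.lower())) for doc in docs]

n_docs = len(docs)
vocabulary = sorted(set().union(*tokenized))  # alphabetical, like the column order

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df is the number of
# documents that contain the term.
idf = {}
for term in vocabulary:
    doc_freq = sum(term in tokens for tokens in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

for index, term in enumerate(vocabulary):
    print(index, term, round(idf[term], 8))
```

Under these assumptions the column indices line up with the output above: 'is' is at index 5, 'string' at 8, 'tfidf' at 10 and 'this' at 12.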
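Unfortunately, I cannot reproduce the transformation, because the remark "The cosine similarity between two vectors is their dot product when l2 norm has been applied." seems to describe an additional step. Because the TF-IDF values are biased by the number of words in each document when you use the default setting norm='l2', I would simply turn this setting off by using norm=None. I found that you cannot simply do the transformation by using: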
tfidf_norm_calculated = [
[(word/sum(doc))**0.5 for word in doc]
for doc in TfidfVectorizer(norm=None).fit_transform(df[0]).toarray()]
print(tfidf_norm_calculated)
print('Sum of squares of vector elements is 1: ', [sum([word**2 for word in doc]) for doc in tfidf_norm_calculated])
print('Compare to:', TfidfVectorizer().fit_transform(df[0]).toarray())
https://stackoverflow.com/questions/54969339