Hi, I'm trying to understand how scikit-learn computes the TF-IDF scores in the matrix: document 1, feature 6, 'wine':
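from sklearn.feature_extraction.text import TfidfVectorizer

test_doc = ['The wine was lovely', 'The red was delightful',
            'Terrible choice of wine', 'We had a bottle of red']
# Create vectorizer
vec = TfidfVectorizer(stop_words='english')
# Feature vector
tfidf = vec.fit_transform(test_doc)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vec.get_feature_names_out().tolist()
feature_matrix = tfidf.todense()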
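['bottle', 'choice', 'delightful', 'lovely', 'red', 'terrible', 'wine']
[[ 0.          0.          0.          0.78528828  0.          0.          0.6191303 ]
 [ 0.          0.          0.78528828  0.          0.6191303   0.          0.        ]
 [ 0.          0.61761437  0.          0.          0.          0.61761437  0.48693426]
 [ 0.78528828  0.          0.          0.          0.6191303   0.          0.        ]]

I tried to compute this myself using the answer to a very similar question, "How are TF-IDF calculated by the scikit-learn TfidfVectorizer", but in that question the TfidfVectorizer is created with norm=None.
Since I am using the default setting, norm='l2', how does that differ from norm=None, and how can I compute the values myself?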
Posted on 2022-10-16 15:32:22
After some calculation:
The TF-IDF values of a document (the rows of tfidf.todense()) can be computed from this formula:
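TFIDF = tf(t,d) * idf(t,D)

- tf(t,d) = the number of times term t occurs in document d (the raw count; do not divide by the total number of words in the document)
- idf(t,D) = ln((1 + D) / (1 + df)) + 1 (the trailing + 1 is part of scikit-learn's formula when smooth_idf=True, which is the default)
- D = the number of documents
- df = the number of documents that contain term t: go through all documents and add 1 to df for every document_i in which term t occurs (how many times it occurs inside one document does not matter)

The norm parameter

With norm=None the TFIDF values above are returned unchanged. With the default norm='l2', each document's row is additionally divided by its Euclidean length, that is, the square root of the sum of the squared TFIDF values in that row, so every row ends up with length 1.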
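To check this by hand, here is a minimal sketch in plain Python that does not use scikit-learn at all. It assumes the default smooth_idf=True, so the idf is ln((1 + D) / (1 + df)) + 1, and it assumes the tokens left over after English stop-word removal are the ones hard-coded below (they match the feature names in the question). The helper name tfidf_row is made up for this illustration.

```python
import math

# Tokens per document after stop-word removal (hard-coded assumption,
# consistent with the 7 feature names shown in the question).
docs = [
    ["wine", "lovely"],
    ["red", "delightful"],
    ["terrible", "choice", "wine"],
    ["bottle", "red"],
]

vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# df: number of documents that contain the term at least once.
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed idf, assuming scikit-learn's default smooth_idf=True.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(doc, norm="l2"):
    """Hypothetical helper: TF-IDF row for one tokenised document."""
    raw = [doc.count(t) * idf[t] for t in vocab]  # raw count * idf
    if norm is None:
        return raw
    length = math.sqrt(sum(x * x for x in raw))  # Euclidean length of the row
    return [x / length for x in raw]


row_raw = tfidf_row(docs[0], norm=None)  # what norm=None would give
row_l2 = tfidf_row(docs[0])              # what the default norm='l2' gives

print(vocab)
print([round(x, 8) for x in row_raw])
print([round(x, 8) for x in row_l2])
```

For document 1 the unnormalised values of 'lovely' and 'wine' are about 1.9163 and 1.5108. Dividing both by the row length, sqrt(1.9163^2 + 1.5108^2), gives 0.78528828 and 0.6191303, the first row of the matrix in the question.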
https://stackoverflow.com/questions/49824788