Hi, I have lemmatized text in the format shown in lemma below. I want to get the TF-IDF score of each word. This is the function I wrote:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
lemma=["'Ah", 'yes', u'say', 'softly', 'Harry',
'Potter', 'Our', 'new', 'celebrity', 'You',
'learn', 'subtle', 'science', 'exact', 'art',
'potion-making', u'begin', 'He', u'speak', 'barely',
'whisper', 'caught', 'every', 'word', 'like',
'Professor', 'McGonagall', 'Snape', 'gift',
u'keep', 'class', 'silent', 'without', 'effort',
'As', 'little', 'foolish', 'wand-waving', 'many',
'hardly', 'believe', 'magic', 'I', 'dont', 'expect', 'really',
'understand', 'beauty']
def Tfidf_Vectorize(lemmas_name):
    vect = TfidfVectorizer(stop_words='english',ngram_range=(1,2))
    vect_transform = vect.fit_transform(lemmas_name)
    # First approach of creating a dataframe of weight & feature names
    vect_score = np.asarray(vect_transform.mean(axis=0)).ravel().tolist()
    vect_array = pd.DataFrame({'term': vect.get_feature_names(), 'weight': vect_score})
    vect_array.sort_values(by='weight',ascending=False,inplace=True)
    # Second approach of getting the feature names
    vect_fn = np.array(vect.get_feature_names())
    sorted_tfidf_index = vect_transform.max(0).toarray()[0].argsort()
    print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
    return vect_array
tf_dataframe=Tfidf_Vectorize(lemma)
print(tf_dataframe.iloc[:5,:])
The output I am getting from
print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
is
Largest Tfidf:
[u'yes' u'fools' u'fury' u'gale' u'ghosts' u'gift' u'glory' u'glow' u'good'
 u'granger']
The result of tf_dataframe:
           term    weight
261       snape  0.027875
238         say  0.022648
211      potter  0.013937
181        mind  0.010453
123       harry  0.010453
 60        dark  0.006969
 75  dumbledore  0.006969
311       voice  0.005226
125        head  0.005226
231         ron  0.005226
Shouldn't the two approaches give the same result for the top features? I just want to compute the tfidf scores and get the top 5 features/weights. What am I doing wrong?
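Posted on 2017-12-21 20:43:12
I'm not sure what I'm looking at, but I have the feeling that you are using TfidfVectorizer incorrectly. Please correct me if I have misunderstood what you are trying to do.
So... what you need is a list of documents that you pass to fit_transform(). From that you can construct a matrix in which, for example, each column represents a document and each row represents a word. A cell in that matrix is then the tf-idf score of word i in document j.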
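To make "one cell is the score of word i in document j" concrete, here is a minimal pure-Python sketch of how such a cell could be computed. It assumes the weighting that TfidfVectorizer uses with its default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The helper name `tfidf_rows` and the toy corpus are made up for illustration, and tokenization and stop-word removal are skipped.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: score} dict per document (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Raw count times smoothed idf (assumed to match the vectorizer defaults)
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        # L2-normalize the document's vector; an empty document stays empty
        norm = math.sqrt(sum(value * value for value in raw.values()))
        rows.append({term: value / norm for term, value in raw.items()} if norm else {})
    return rows


# Toy corpus, already tokenized and with stop words removed by hand
toy_docs = [["document"], ["document", "text"], ["car"]]
toy_rows = tfidf_rows(toy_docs)
for j, row in enumerate(toy_rows):
    for term, score in sorted(row.items()):
        print("word={!r} document={} score={:.6f}".format(term, j, score))
```

Note that a document containing a single term always ends up with a score of 1.0 for that term after normalization, which is what happens when every lemma is passed in as its own document.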
Here is an example:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
    "This is a document.",
    "This is another document with slightly more text.",
    "Whereas this is yet another document with even more text than the other ones.",
    "This document is awesome and also rather long.",
    "The car he drove was red."
]
document_names = ['Doc {:d}'.format(i) for i in range(len(documents))]
def get_tfidf(docs, ngram_range=(1,1), index=None):
    vect = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    # Fit on the docs argument rather than on the global documents list
    tfidf = vect.fit_transform(docs).toarray()
    # get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
    return pd.DataFrame(tfidf, columns=vect.get_feature_names_out(), index=index).T
print(get_tfidf(documents, ngram_range=(1,2), index=document_names))
This gives you:
                  Doc 0     Doc 1     Doc 2     Doc 3     Doc 4
awesome             0.0  0.000000  0.000000  0.481270  0.000000
awesome long        0.0  0.000000  0.000000  0.481270  0.000000
car                 0.0  0.000000  0.000000  0.000000  0.447214
car drove           0.0  0.000000  0.000000  0.000000  0.447214
document            1.0  0.282814  0.282814  0.271139  0.000000
document awesome    0.0  0.000000  0.000000  0.481270  0.000000
document slightly   0.0  0.501992  0.000000  0.000000  0.000000
document text       0.0  0.000000  0.501992  0.000000  0.000000
drove               0.0  0.000000  0.000000  0.000000  0.447214
drove red           0.0  0.000000  0.000000  0.000000  0.447214
long                0.0  0.000000  0.000000  0.481270  0.000000
ones                0.0  0.000000  0.501992  0.000000  0.000000
red                 0.0  0.000000  0.000000  0.000000  0.447214
slightly            0.0  0.501992  0.000000  0.000000  0.000000
slightly text       0.0  0.501992  0.000000  0.000000  0.000000
text                0.0  0.405004  0.405004  0.000000  0.000000
text ones           0.0  0.000000  0.501992  0.000000  0.000000
The two approaches you showed for getting the words and their scores do different things: the first computes the mean score of each word over all documents, while the second takes the maximum score of each word.
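As a rough illustration of why the two aggregations disagree on input like the one in the question, the sketch below treats every list entry as its own one-word document. It assumes that each such document gets a single score of 1.0 for its word (which follows from L2 normalization when the word survives tokenization and stop-word filtering). The short list `lemma_docs` is a hypothetical stand-in, not the original data.

```python
from collections import Counter

# Hypothetical stand-in for the question's list: every entry is one "document"
lemma_docs = ["snape", "say", "snape", "potter", "say", "snape"]

n_docs = len(lemma_docs)
occurrences = Counter(lemma_docs)

# Assumption: each one-word document contributes a single score of 1.0 for its word,
# so the mean over documents is simply the word's relative frequency ...
mean_score = {term: count * 1.0 / n_docs for term, count in occurrences.items()}
# ... while the maximum over documents is 1.0 for every word that occurs at all
max_score = {term: 1.0 for term in occurrences}

top_by_mean = sorted(mean_score, key=mean_score.get, reverse=True)[:5]
print("mean:", [(term, round(mean_score[term], 3)) for term in top_by_mean])
print("max: ", sorted(max_score.items()))
```

Under these assumptions the mean ranks words by how often they occur, whereas the maximum is tied across all words, so sorting by it cannot produce a meaningful top list.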
So let's do exactly that and compare the two approaches:
df = get_tfidf(documents, ngram_range=(1,2), index=document_names)
print(pd.DataFrame([df.mean(1), df.max(1)], index=['score_mean', 'score_max']).T)
We can see that the scores are, of course, different:
                   score_mean  score_max
awesome              0.096254   0.481270
awesome long         0.096254   0.481270
car                  0.089443   0.447214
car drove            0.089443   0.447214
document             0.367353   1.000000
document awesome     0.096254   0.481270
document slightly    0.100398   0.501992
document text        0.100398   0.501992
drove                0.089443   0.447214
drove red            0.089443   0.447214
long                 0.096254   0.481270
ones                 0.100398   0.501992
red                  0.089443   0.447214
slightly             0.100398   0.501992
slightly text        0.100398   0.501992
text                 0.162002   0.405004
text ones            0.100398   0.501992
Note:
You can convince yourself that this gives the same result as calling max/mean directly on the matrix returned by TfidfVectorizer.fit_transform():
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
tfidf = vect.fit_transform(documents)
print(tfidf.max(0))
print(tfidf.mean(0))
https://stackoverflow.com/questions/47932178