Question: TfIDf Vectorizer weights

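Stack Overflow user

Asked 2017-12-21 20:18:21

Answers: 1, Views: 3.5K, Followers: 0, Votes: 1

Hi, I have lemmatized text in the format shown in lemma below. I want to get the TF-IDF score for each word. This is the function I wrote: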

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

lemma=["'Ah", 'yes', u'say', 'softly', 'Harry', 
       'Potter', 'Our', 'new', 'celebrity', 'You', 
       'learn', 'subtle', 'science', 'exact', 'art', 
       'potion-making', u'begin', 'He', u'speak', 'barely', 
       'whisper', 'caught', 'every', 'word', 'like', 
       'Professor', 'McGonagall', 'Snape', 'gift', 
       u'keep', 'class', 'silent', 'without', 'effort', 
       'As', 'little', 'foolish', 'wand-waving', 'many', 
       'hardly', 'believe', 'magic', 'I', 'dont', 'expect', 'really', 
       'understand', 'beauty']

def Tfidf_Vectorize(lemmas_name):

    vect = TfidfVectorizer(stop_words='english',ngram_range=(1,2))
    vect_transform = vect.fit_transform(lemmas_name)    

    # First approach of creating a dataframe of weight & feature names

    vect_score = np.asarray(vect_transform.mean(axis=0)).ravel().tolist()
    vect_array = pd.DataFrame({'term': vect.get_feature_names(), 'weight': vect_score})
    vect_array.sort_values(by='weight',ascending=False,inplace=True)

    # Second approach of getting the feature names

    vect_fn = np.array(vect.get_feature_names())    
    sorted_tfidf_index = vect_transform.max(0).toarray()[0].argsort()

    print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))

    return vect_array

tf_dataframe=Tfidf_Vectorize(lemma)
print(tf_dataframe.iloc[:5,:])

The output I get from:

print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))

is

Largest Tfidf: 
[u'yes' u'fools' u'fury' u'gale' u'ghosts' u'gift' u'glory' u'glow' u'good'
 u'granger']

and the result of tf_dataframe is:

     term        weight
261  snape       0.027875
238  say         0.022648
211  potter      0.013937
181  mind        0.010453
123  harry       0.010453
60   dark        0.006969
75   dumbledore  0.006969
311  voice       0.005226
125  head        0.005226
231  ron         0.005226

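Shouldn't both approaches give the same top features? I just want to compute the TF-IDF scores and get the top 5 features and their weights. What am I doing wrong?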

Tags: nlp, nltk, data-analysis, tf-idf, python

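1 Answer

Stack Overflow user

Answered 2017-12-21 20:43:12

I'm not sure what I'm looking at, but I have the feeling that you are using TfidfVectorizer incorrectly. Please correct me if I have misunderstood what you are trying to do.

So... what you need is a list of documents to pass to fit_transform(). From that you can build a matrix in which, for example, each column represents a document and each row represents a word. A cell in that matrix is then the tf-idf score of word i in document j.

Here is an example: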

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a document.",
    "This is another document with slightly more text.",
    "Whereas this is yet another document with even more text than the other ones.",
    "This document is awesome and also rather long.",
    "The car he drove was red."
]

document_names = ['Doc {:d}'.format(i) for i in range(len(documents))]

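def get_tfidf(docs, ngram_range=(1,1), index=None):
    # Build a terms-by-documents DataFrame of TF-IDF scores.
    vect = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
    # Fit on the `docs` argument rather than the global `documents`.
    tfidf = vect.fit_transform(docs).toarray()
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names().
    return pd.DataFrame(tfidf, columns=vect.get_feature_names_out(), index=index).T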

print(get_tfidf(documents, ngram_range=(1,2), index=document_names))

This gives you:

                    Doc 0     Doc 1     Doc 2     Doc 3     Doc 4
awesome               0.0  0.000000  0.000000  0.481270  0.000000
awesome long          0.0  0.000000  0.000000  0.481270  0.000000
car                   0.0  0.000000  0.000000  0.000000  0.447214
car drove             0.0  0.000000  0.000000  0.000000  0.447214
document              1.0  0.282814  0.282814  0.271139  0.000000
document awesome      0.0  0.000000  0.000000  0.481270  0.000000
document slightly     0.0  0.501992  0.000000  0.000000  0.000000
document text         0.0  0.000000  0.501992  0.000000  0.000000
drove                 0.0  0.000000  0.000000  0.000000  0.447214
drove red             0.0  0.000000  0.000000  0.000000  0.447214
long                  0.0  0.000000  0.000000  0.481270  0.000000
ones                  0.0  0.000000  0.501992  0.000000  0.000000
red                   0.0  0.000000  0.000000  0.000000  0.447214
slightly              0.0  0.501992  0.000000  0.000000  0.000000
slightly text         0.0  0.501992  0.000000  0.000000  0.000000
text                  0.0  0.405004  0.405004  0.000000  0.000000
text ones             0.0  0.000000  0.501992  0.000000  0.000000
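
By contrast, passing a flat list of lemmas, as in the question, makes every lemma its own document. Here is a minimal sketch of that effect. The shortened `lemmas` list is only an illustrative stand-in for the question's data, and the default vectorizer settings are used to keep it small. In this sketch every row holds a single term, so after normalization each term's maximum score is 1.0, which may be why a max-based ranking looks arbitrary: it is mostly sorting ties.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Shortened, illustrative stand-in for the question's lemma list.
lemmas = ['Harry', 'Potter', 'new', 'celebrity', 'Snape', 'gift', 'Harry']

vect = TfidfVectorizer()

# A flat list of lemmas: each string is treated as a separate document.
per_lemma = vect.fit_transform(lemmas)
print(per_lemma.shape)  # one row per lemma

# Each one-word document has a single non-zero entry, normalized to 1.0,
# so the per-term maximum is the same for every term.
col_max = per_lemma.max(axis=0).toarray().ravel()
print(col_max)

# Joining the lemmas first yields one document containing all the words.
joined = vect.fit_transform([' '.join(lemmas)])
print(joined.shape)  # a single row
```

If the lemmas come from several texts, it is probably more useful to join them per text, so that each text becomes one document.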

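The two approaches you show for getting the words and their scores compute different things: the first takes the mean score of each word over all documents, and the second takes the maximum score of each word.

So let's do that and compare the two methods: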

df = get_tfidf(documents, ngram_range=(1,2), index=document_names)

print(pd.DataFrame([df.mean(1), df.max(1)], index=['score_mean', 'score_max']).T)

We can see that the scores are, of course, different.

                   score_mean  score_max
awesome              0.096254   0.481270
awesome long         0.096254   0.481270
car                  0.089443   0.447214
car drove            0.089443   0.447214
document             0.367353   1.000000
document awesome     0.096254   0.481270
document slightly    0.100398   0.501992
document text        0.100398   0.501992
drove                0.089443   0.447214
drove red            0.089443   0.447214
long                 0.096254   0.481270
ones                 0.100398   0.501992
red                  0.089443   0.447214
slightly             0.100398   0.501992
slightly text        0.100398   0.501992
text                 0.162002   0.405004
text ones            0.100398   0.501992

Note:

You can convince yourself that this is the same as calling max/mean directly on the matrix returned by TfidfVectorizer:

vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
tfidf = vect.fit_transform(documents)
print(tfidf.max(0))
print(tfidf.mean(0))
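
Rather than comparing the printed values by eye, the check can also be done numerically. This is a small sketch under the same setup; `same_max` and `same_mean` are names I introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a document.",
    "This is another document with slightly more text.",
    "Whereas this is yet another document with even more text than the other ones.",
    "This document is awesome and also rather long.",
    "The car he drove was red."
]

vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf = vect.fit_transform(documents)  # sparse, documents x terms

# Transposed dense copy: terms x documents, as in the answer's DataFrame.
df = pd.DataFrame(tfidf.toarray()).T

# Per-term aggregates taken directly from the sparse matrix.
sparse_max = tfidf.max(axis=0).toarray().ravel()
sparse_mean = np.asarray(tfidf.mean(axis=0)).ravel()

# Compare against the DataFrame-based aggregates.
same_max = np.allclose(df.max(axis=1).to_numpy(), sparse_max)
same_mean = np.allclose(df.mean(axis=1).to_numpy(), sparse_mean)
print(same_max, same_mean)
```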

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/47932178
