
Q: Implementing a TF-IDF weighting scheme

Stack Overflow user

Asked 2014-06-20 22:19:01

Answers: 1 | Views: 496 | Followers: 0 | Votes: 0

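My goal is to compare a text, txt, against every item in a corpus using the TF-IDF weighting scheme.

corpus = ['the school boy is reading', 'who is reading a comic?', 'the little boy is reading']

txt = 'James the school boy is always busy reading'

Here is my implementation:

TF-IDF = term frequency-inverse document frequency = tf * log(n / df), where n is the number of documents in the corpus (3 in this case) and df is the number of documents that contain the term.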

import collections
from collections import Counter
from math import log

txt2=Counter(txt.split())
corpus2=[Counter(x.split()) for x in corpus]
def tfidf(doc,_corpus):
    dic=collections.defaultdict(int)
    for x in _corpus:
       for y in x:
          dic[y] +=1
    for x in doc:
       if x not in dic:dic[x]=1.
    return {x : doc[x] * log(3.0/dic[x])for x in doc}

txt_tfidf=tfidf(txt2, corpus2)
corpus_tfidf=[tfidf(x, corpus2) for x in corpus2]

Results:

print txt_tfidf
    {'boy': 0.4054651081081644, 'school': 1.0986122886681098, 'busy': 1.0986122886681098, 'James': 1.0986122886681098,
     'is': 0.0, 'always': 1.0986122886681098, 'the': 0.4054651081081644, 'reading': 0.0}
for x in corpus_tfidf:
    print x
{'boy': 0.4054651081081644, 'the': 0.4054651081081644, 'reading': 0.0, 'school': 1.0986122886681098, 'is': 0.0}
{'a': 1.0986122886681098, 'is': 0.0, 'who': 1.0986122886681098, 'comic?': 1.0986122886681098, 'reading': 0.0}
{'boy': 0.4054651081081644, 'the': 0.4054651081081644, 'reading': 0.0, 'little': 1.0986122886681098, 'is': 0.0}

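I'm not sure this is right, because rare terms such as James and comic should get a higher TF-IDF weight than a common term such as school.

Any suggestions would be appreciated.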

python

text

tf-idf

1 Answer

Stack Overflow user

Accepted answer

Answered 2014-06-21 07:40:30

First, as @confuser said in the comments, put txt into the corpus and get rid of the following code:

for x in doc:
   if x not in dic:dic[x]=1.
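
To see why that fallback is a problem, here is a minimal sketch (Python 3, using the three-document corpus from the question; the helper name doc_freq is made up for illustration). A word that appears in no corpus document is given a document frequency of 1, which is exactly the document frequency of a word that appears in one document, so James and school end up with the same weight.

```python
from collections import Counter
from math import log

# The three-document corpus from the question.
corpus = ['the school boy is reading',
          'who is reading a comic?',
          'the little boy is reading']


def doc_freq(docs):
    """Count, for each word, how many documents contain it (hypothetical helper)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df


df = doc_freq(corpus)
n = len(corpus)

# 'school' really does occur in exactly one document.
school_idf = log(n / df['school'])

# 'James' occurs in none; the fallback in the question pretends df = 1.
james_idf = log(n / 1.0)

print(school_idf == james_idf)  # the two weights cannot be told apart
```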

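After that, I'd add a dot (.) to your code, because a dot in code is like salt in cooking. ;) It turns the counts into floats, so the division inside log is not truncated to an integer under Python 2.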

    for y in x:
        dic[y] += 1.
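
The dot only matters under Python 2, which the print statements in the question suggest is in use: there, / between two ints discards the fractional part. The sketch below is illustrative only and written for Python 3, where the // operator reproduces that behaviour; the values of n and df are just an example.

```python
from math import log

n, df = 4, 3  # e.g. 4 documents, and a word found in 3 of them

truncated = n // df    # what Python 2 computes for int / int
exact = n / float(df)  # what you get once one operand is a float

print(truncated, log(truncated))  # 1 0.0
print(exact, log(exact))
```

With the truncated ratio, every word found in more than half of the documents would collapse to a weight of 0.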

Oh, and I also spotted a magic number in your code. Sorry, they make me nervous, so we have:

return {x: doc[x] * log(len(_corpus) / dic[x]) for x in doc}
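
A small sketch of why the hard-coded 3.0 is risky (Python 3; it assumes txt has been added to the corpus as suggested above, giving four documents). A word that occurs in all four documents would then get a negative weight, while deriving n from the corpus gives the expected 0.

```python
from math import log

# Corpus after adding txt as a fourth document.
corpus = ['the school boy is reading',
          'who is reading a comic?',
          'the little boy is reading',
          'James the school boy is always busy reading']

# Number of documents that contain the word 'is'.
df_is = sum('is' in doc.split() for doc in corpus)

hard_coded = log(3.0 / df_is)       # stale constant from the 3-document version
derived = log(len(corpus) / df_is)  # follows the actual corpus size

print(hard_coded)  # negative, because 3.0 / 4 < 1
print(derived)     # 0.0
```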

With all these small changes, here is the full code, followed by its result:

import collections
from collections import Counter
from math import log

corpus = ['the school boy is reading', 'who is reading a comic?', 'the little boy is reading',
          'James the school boy is always busy reading']

txt = corpus[-1]

txt2 = Counter(txt.split())
corpus2 = [Counter(x.split()) for x in corpus]


def tfidf(doc, _corpus):
    dic = collections.defaultdict(int)
    for x in _corpus:
        for y in x:
            dic[y] += 1.
    return {x: doc[x] * log(len(_corpus) / dic[x]) for x in doc}


txt_tfidf = tfidf(txt2, corpus2)
corpus_tfidf = [tfidf(x, corpus2) for x in corpus2]

print txt_tfidf

The tf_idf of 'boy' is now much lower than that of 'busy', which looks right to me. Do you agree?

{'boy': 0.28768207245178085, 'school': 0.6931471805599453, 'busy': 1.3862943611198906, 'James': 1.3862943611198906, 'is': 0.0, 'always': 1.3862943611198906, 'the': 0.28768207245178085, 'reading': 0.0}
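
The code above is Python 2. Below is a sketch of a Python 3 equivalent, where / is already true division, so the trailing dot is no longer needed. As one possible way to do the comparison the question asks for, it also computes the cosine similarity between the weighted vectors. That measure is an assumption: neither the question nor the answer names one, and the cosine function is added here for illustration only.

```python
from collections import Counter
from math import log, sqrt

corpus = [
    'the school boy is reading',
    'who is reading a comic?',
    'the little boy is reading',
    'James the school boy is always busy reading',
]
txt = corpus[-1]

docs = [Counter(d.split()) for d in corpus]


def tfidf(doc, all_docs):
    """Weight each word in doc by tf * log(n / df).

    doc must be one of all_docs, otherwise df could be 0 for some word.
    """
    df = Counter()
    for d in all_docs:
        df.update(d.keys())  # each document counts once per distinct word
    n = len(all_docs)
    return {w: tf * log(n / df[w]) for w, tf in doc.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


weights = [tfidf(d, docs) for d in docs]
txt_weights = weights[-1]
print(txt_weights)

# Compare txt against each of the other corpus items.
for text, w in zip(corpus[:-1], weights[:-1]):
    print(round(cosine(txt_weights, w), 3), text)
```

Note that 'who is reading a comic?' shares only 'is' and 'reading' with txt, and both have a weight of 0, so its similarity to txt is 0.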

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/24336470
