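My goal is to compare the text txt with each item in the corpus below, using the TFIDF weighting scheme.

    corpus = ['the school boy is reading', 'who is reading a comic?', 'the little boy is reading']
    txt = 'James the school boy is always busy reading'

Here is my implementation:

TFIDF = term frequency-inverse document frequency = tf * log(n/df), where n is the number of documents in the corpus (3 in this case) and df is the number of documents that contain the term.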

    import collections
    from collections import Counter
    from math import log

    # Term counts for the query text and for each corpus document
    txt2 = Counter(txt.split())
    corpus2 = [Counter(x.split()) for x in corpus]

    def tfidf(doc, _corpus):
        # dic maps each term to the number of documents it occurs in
        dic = collections.defaultdict(int)
        for x in _corpus:
            for y in x:
                dic[y] += 1
        # Terms that appear only in doc get a document frequency of 1
        for x in doc:
            if x not in dic: dic[x] = 1.
        return {x: doc[x] * log(3.0 / dic[x]) for x in doc}

    txt_tfidf = tfidf(txt2, corpus2)
    corpus_tfidf = [tfidf(x, corpus2) for x in corpus2]

Results:

    print txt_tfidf

    {'boy': 0.4054651081081644, 'school': 1.0986122886681098, 'busy': 1.0986122886681098, 'James': 1.0986122886681098,
    'is': 0.0, 'always': 1.0986122886681098, 'the': 0.4054651081081644, 'reading': 0.0}

    for x in corpus_tfidf:
        print x

    {'boy': 0.4054651081081644, 'the': 0.4054651081081644, 'reading': 0.0, 'school': 1.0986122886681098, 'is': 0.0}
    {'a': 1.0986122886681098, 'is': 0.0, 'who': 1.0986122886681098, 'comic?': 1.0986122886681098, 'reading': 0.0}
    {'boy': 0.4054651081081644, 'the': 0.4054651081081644, 'reading': 0.0, 'little': 1.0986122886681098, 'is': 0.0}

I am not too sure that I am right, since rare terms such as James and comic should have a higher TFIDF weight than a common term like school.

Any suggestions would be appreciated.
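
As a quick sanity check, the weights printed above can be reproduced by hand from the formula. The sketch below is written for Python 3 (the code in the question is Python 2), and the helper name idf is made up for illustration. The document frequencies are read off the three-sentence corpus.

```python
from math import log

n = 3  # number of documents in the corpus

def idf(df, n=n):
    # Hypothetical helper: inverse document frequency, log(n / df).
    return log(n / df)

# 'the' and 'boy' occur in 2 of the 3 documents.
print(idf(2))  # about 0.405

# 'school' occurs in 1 document. 'James' occurs in none, but the
# implementation above forces its df to 1, so both get the same weight.
print(idf(1))  # about 1.099

# 'is' and 'reading' occur in every document.
print(idf(3))  # 0.0
```

Every term occurs once in its sentence, so tf is 1 and each weight equals the idf. This is why James and school tie at 1.0986: both end up with a document frequency of 1.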

Posted on 2014-06-21 07:40:30

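First, as @confuser said in the comments, let txt be part of the corpus and get rid of the following code:

        for x in doc:
            if x not in dic: dic[x] = 1.

After that, I would add a . to your code, because a point in coding is like salt in cooking. ;) The dot makes the counts floats, so the division in the return statement is true division rather than Python 2 integer division.

            for y in x:
                dic[y] += 1.

Oh, and I see some magic numbers in your code too. Sorry, they make me nervous, so we have:

        return {x: doc[x] * log(len(_corpus) / dic[x]) for x in doc}

With all these tiny modifications, we can see the result of the following code: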

    import collections
    from collections import Counter
    from math import log

    corpus = ['the school boy is reading', 'who is reading a comic?', 'the little boy is reading',
              'James the school boy is always busy reading']
    txt = corpus[-1]  # txt is now the last document of the corpus

    txt2 = Counter(txt.split())
    corpus2 = [Counter(x.split()) for x in corpus]

    def tfidf(doc, _corpus):
        # dic maps each term to the number of documents it occurs in (as a float)
        dic = collections.defaultdict(int)
        for x in _corpus:
            for y in x:
                dic[y] += 1.
        return {x: doc[x] * log(len(_corpus) / dic[x]) for x in doc}

    txt_tfidf = tfidf(txt2, corpus2)
    corpus_tfidf = [tfidf(x, corpus2) for x in corpus2]

    print txt_tfidf

It seems normal to me that the tf_idf of 'boy' is much lower than that of 'busy'. Do you agree?

    {'boy': 0.28768207245178085, 'school': 0.6931471805599453, 'busy': 1.3862943611198906, 'James': 1.3862943611198906, 'is': 0.0, 'always': 1.3862943611198906, 'the': 0.28768207245178085, 'reading': 0.0}

https://stackoverflow.com/questions/24336470
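
For readers on Python 3, the same computation might be sketched as follows. This is only a sketch under a few assumptions. The helpers document_frequency and cosine are hypothetical names, not part of the answer. Cosine similarity is one common way to do the comparison the question asks for, but the answer itself does not name a similarity measure. In Python 3 the / operator is always true division, so the trailing dot is not needed.

```python
from collections import Counter
from math import log, sqrt

corpus = ['the school boy is reading',
          'who is reading a comic?',
          'the little boy is reading',
          'James the school boy is always busy reading']
txt = corpus[-1]  # the query text is part of the corpus, as in the answer

def document_frequency(docs):
    # Hypothetical helper: number of documents each term occurs in.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    return df

def tfidf(doc, docs):
    # tf * log(n / df), with n taken from the corpus instead of hard-coded.
    # Assumes doc is one of docs, so every term has df >= 1.
    df = document_frequency(docs)
    return {term: tf * log(len(docs) / df[term]) for term, tf in doc.items()}

def cosine(u, v):
    # Assumed similarity measure: cosine of two sparse weight vectors.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus2 = [Counter(x.split()) for x in corpus]
txt_tfidf = tfidf(Counter(txt.split()), corpus2)
corpus_tfidf = [tfidf(x, corpus2) for x in corpus2]

# Compare txt with each of the three original sentences.
similarities = [cosine(txt_tfidf, w) for w in corpus_tfidf[:-1]]

print(txt_tfidf)
for sentence, score in zip(corpus[:-1], similarities):
    print(sentence, round(score, 3))
```

With these weights, the first sentence comes out as the closest match to txt, and the second shares no weighted terms with it, so its similarity is 0.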