Python LSI using gensim not working
Stack Overflow user
Asked 2014-02-01 21:28:15
Answers: 3, Views: 1.9K, Followers: 0, Votes: 1

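I am trying to classify emails based on their subject lines, and to train the classifier I need an LSI representation. I compute tf-idf and then try to build the LSI model from it. However, the LSI step does not process anything or write to any file. My code is below: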

from itertools import islice
from gensim import corpora, models

#reading the list of subjects for features
f = open('subject1000.csv','rb')
f500 = open('subject500.csv','wb')

with open('subject1000.csv') as myfile:
    head=list(islice(myfile,500))#only 500 subjects for training

for h in head:
    f500.write(h)
    #print h

f500.close()    
texts = (line.lower().split() for line in head) #creating texts of subjects

dictionary = corpora.Dictionary(texts) #all the words used to create dictionary
dictionary.compactify()
print dictionary #checkpoint - 2215 unique tokens -- 2215 unique words to 1418 for 500 topics

#corpus streaming 
class MyCorpus(object):
    def __iter__(self):
        for line in open('subject500.csv','rb'): #supposed to be one document per line -- open('subject1000.csv','rb')
            yield dictionary.doc2bow(line.lower().split())  #every line - converted to bag-of-words format = list of (token_id, token_count) 2-tuples          
print 'corpus created'
corpus = MyCorpus() # object created

for vector in corpus:
    print vector

tfidf = models.TfidfModel(corpus)
corpus_tfidf= tfidf[corpus]  #re-initialize the corpus according to the model to get the normalized frequencies.
corpora.MmCorpus.serialize('subject500-tfidf', corpus_tfidf)  #store to disk for later use

print 'TFIDF complete!' #check - till here its ok

lsi300 = models.LsiModel(corpus_tfidf, num_topics=300, id2word=dictionary) #using the trained corpus to use LSI indexing
corpus_lsi300 = lsi300[corpus_tfidf]
print corpus_lsi300 #checkpoint
lsi300.print_topics(10,5) #checks
corpora.BleiCorpus.serialize('subjects500-lsi-300', corpus_lsi300)

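I get output up to 'TFIDF complete!', but the program returns nothing for LSI. I am running the above on 500 subject lines. Any ideas about what might be going wrong would be much appreciated. Thanks.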

The logged output is as follows:

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens)
INFO:gensim.corpora.dictionary:built Dictionary(1418 unique tokens) from 500 documents (total 3109 corpus positions)
DEBUG:gensim.corpora.dictionary:rebuilding dictionary, shrinking gaps
INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:calculating IDF weights for 500 documents and 1418 features (3081 matrix non-zeros)
INFO:gensim.corpora.mmcorpus:storing corpus in Matrix Market format to subject500-tfidf
INFO:gensim.matutils:saving sparse matrix to subject500-tfidf
INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:saved 500x1418 matrix, density=0.435% (3081/709000)
DEBUG:gensim.matutils:closing subject500-tfidf
DEBUG:gensim.matutils:closing subject500-tfidf
INFO:gensim.corpora.indexedcorpus:saving MmCorpus index to subject500-tfidf.index
INFO:gensim.models.lsimodel:using serial LSI version on this node
INFO:gensim.models.lsimodel:updating model with new documents
INFO:gensim.models.lsimodel:preparing a new chunk of documents
DEBUG:gensim.models.lsimodel:converting corpus to csc format
INFO:gensim.models.lsimodel:using 100 extra samples and 2 power iterations
INFO:gensim.models.lsimodel:1st phase: constructing (1418, 400) action matrix
INFO:gensim.models.lsimodel:orthonormalizing (1418, 400) action matrix
DEBUG:gensim.matutils:computing QR of (1418, 400) dense matrix
DEBUG:gensim.models.lsimodel:running 2 power iterations
DEBUG:gensim.matutils:computing QR of (1418, 400) dense matrix
DEBUG:gensim.matutils:computing QR of (1418, 400) dense matrix
INFO:gensim.models.lsimodel:2nd phase: running dense svd on (400, 500) matrix

3 Answers

Stack Overflow user

Answered 2014-02-02 01:43:09

Add logging with:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

and paste the log or a gist link here.

Votes: 1

Stack Overflow user

Answered 2014-03-17 23:50:34

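I ran into the same problem while working through the Gensim tutorial. Using a sample corpus of 2000 documents, I tried to transform it to LSI. At the "running dense SVD" step, Python crashed with the Windows error message "Python has stopped working". It worked fine on smaller corpora. The problem appears to have been a faulty scipy installation from the then-current win32 binaries. After installing Anaconda (a Python distribution that bundles numpy and scipy), the problem went away.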
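
One way to separate a broken scipy install from a gensim problem is to run a dense SVD on its own, outside gensim. The sketch below is only an illustration: `check_dense_svd` is a hypothetical helper, the default shape is taken from the last line of the log in the question ("running dense svd on (400, 500) matrix"), and it assumes that gensim's dense SVD step ends up in SciPy's LAPACK bindings, which I have not verified for the version used here.

```python
def check_dense_svd(rows=400, cols=500, seed=0):
    """Run a dense SVD on a random rows x cols matrix and return a status string."""
    try:
        import numpy as np
        import scipy.linalg
    except ImportError as exc:
        # numpy or scipy is missing or cannot be loaded at all
        return "import failed: %s" % exc

    rng = np.random.RandomState(seed)
    b = rng.rand(rows, cols)

    # Economy-size SVD: u is (rows, k), s is (k,), vt is (k, cols), k = min(rows, cols)
    u, s, vt = scipy.linalg.svd(b, full_matrices=False)

    # Rebuild the matrix from its factors and compare with the input
    rebuilt = (u * s).dot(vt)
    if np.allclose(b, rebuilt):
        return "ok"
    return "svd returned an inaccurate result"


print(check_dense_svd())
```

If this script alone brings down the interpreter, the problem most likely lies in the numpy/scipy installation rather than in the gensim code. A hard crash of the kind described above cannot be caught by `try`/`except`, so the absence of any printed status is itself the signal.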

Votes: 0

Stack Overflow user

Answered 2014-06-03 03:47:03

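I ran into a similar problem earlier this week: my model loaded correctly, but printing the topics did nothing. I found that this may be a quirk of how print_topics() behaves: if you run it from the command line, it produces no output, whereas if you run it in IPython, or explicitly loop over the topics and print them, you should see your results.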
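
A minimal sketch of the explicit loop, assuming the model exposes gensim's `show_topics(num_topics, num_words)` method, which returns the topics instead of only reporting them through logging. `print_lsi_topics` is a hypothetical helper name, not part of gensim, and the exact shape of each returned topic (a string or a tuple) may differ between gensim versions, so each one is simply converted with `str()`.

```python
def print_lsi_topics(model, num_topics=10, num_words=5):
    """Print each topic explicitly instead of relying on print_topics() output."""
    # Arguments are passed positionally, mirroring print_topics(10, 5) in the question
    topics = model.show_topics(num_topics, num_words)
    lines = []
    for topic in topics:
        line = str(topic)
        print(line)
        lines.append(line)
    return lines
```

With the model from the question this would be called as `print_lsi_topics(lsi300, 10, 5)`. As far as I can tell, `print_topics()` reports through the `logging` module, so configuring logging at INFO level as suggested in the first answer may also make its output visible from a script.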

Votes: 0
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/21498633
