
Q: Problem with gensim.models.Phrases

Stack Overflow user

Asked 2017-08-23 18:52:07

Answers: 1, Views: 800, Followers: 0, Votes: 1

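import os
import re

from nltk import tokenize
from nltk.corpus import stopwords as nltk_stopwords
from nltk.tokenize import word_tokenize
from gensim.parsing import PorterStemmer
from gensim.models import Word2Vec, Phrases

# Not shown in the original post; these two definitions are assumptions.
Stemming = PorterStemmer()
stopwords = set(nltk_stopwords.words('english'))

class SentenceClass(object):
    """Restartable iterable: every __iter__ call re-reads the directory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), 'r') as myfile:
                doc = myfile.read().replace('\n', ' ')
                for sent in tokenize.sent_tokenize(doc.lower()):
                    # Keep letters only, drop stopwords, stem what remains.
                    yield [Stemming.stem(word)
                           for word in word_tokenize(re.sub("[^A-Za-z]", " ", sent))
                           if word not in stopwords]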

Now there are two approaches:

model = Word2Vec(SentenceClass(data_dir_path), size=100, window=5, min_count=1, workers=4)

The one above runs fine, with no warnings.

bigram_transformer = Phrases(SentenceClass(data_dir_path), min_count=1)
model = Word2Vec(bigram_transformer[SentenceClass(data_dir_path)], size=100, window=5, min_count=1, workers=4)

This one produces warnings:

WARNING:gensim.models.word2vec:train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable).
WARNING:gensim.models.word2vec:supplied example count (0) did not equal expected count (30)

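I do understand the difference between a generator and a restartable iterable, and I believe I am passing an iterable. I checked this by running each of the following prints more than once: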

print(list(SentenceClass(data_dir_path)))
print(list(SentenceClass(data_dir_path)))
print(list(bigram_transformer[SentenceClass(data_dir_path)]))
print(list(bigram_transformer[SentenceClass(data_dir_path)]))

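Every one of them prints the sentences correctly, yet I still don't know why the second approach triggers the "empty iterator" warning. Am I missing something here?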
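A minimal sketch of why a check like the one above can be misleading, assuming the transformed corpus behaves like a one-shot generator. `one_shot` and the toy `data` are hypothetical stand-ins, not gensim code. Each `print(list(...))` call in the question builds a brand-new object, so every check starts from a fresh stream; the interesting case is consuming the same object twice.

```python
def one_shot(sentences):
    """Return a generator, which can be consumed only once."""
    return (s for s in sentences)

data = [["new", "york"], ["machine", "learning"]]

# Building a fresh generator for every check always looks fine...
fresh_first = list(one_shot(data))
fresh_second = list(one_shot(data))

# ...but reusing one object shows that the second pass is empty.
shared = one_shot(data)
shared_first = list(shared)
shared_second = list(shared)

print(len(fresh_first), len(fresh_second))    # 2 2
print(len(shared_first), len(shared_second))  # 2 0
```

A more telling test is therefore to store the object in a variable first and call `list()` on that same variable twice.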

python

nlp

gensim

1 Answer

Stack Overflow user

Accepted answer

Posted 2017-08-23 20:00:37

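I realized that both Phrases and Phraser, when applied to a corpus, just return a generator, which is exhausted after one pass. Word2Vec iterates over its corpus more than once (a vocabulary scan followed by the training passes), so the transformed corpus needs to be wrapped in a class like the one below: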

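from gensim.models import Word2Vec, Phrases, phrases

class PhraseIterator(object):
    """Restartable view of a corpus passed through a phrase model."""

    def __init__(self, my_phraser, data):
        self.my_phraser, self.data = my_phraser, data

    def __iter__(self):
        # Re-apply the phraser on every call so each pass starts from the
        # beginning; iter() ensures an actual iterator is returned.
        return iter(self.my_phraser[self.data])

my_sentences = SentenceClass(data_dir_path)
my_phrases = Phrases(my_sentences, min_count=1)
bigram = phrases.Phraser(my_phrases)
my_corpus = PhraseIterator(bigram, my_sentences)

# Note: gensim 4.0 renamed the `size` argument to `vector_size`.
model = Word2Vec(my_corpus, size=100, window=5, min_count=1, workers=4)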
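The wrapper idea in this answer is not specific to gensim. As a rough, library-free sketch: the names `Restartable` and `join_tokens` are hypothetical, and `join_tokens` is only a stand-in for a one-shot transformation such as `bigram[my_sentences]`. The wrapper stores a factory and calls it again at the start of every pass.

```python
class Restartable:
    """Iterable that calls `factory` again for every new pass."""

    def __init__(self, factory):
        self.factory = factory

    def __iter__(self):
        # A fresh stream is built each time iteration starts.
        return iter(self.factory())


def join_tokens(sentences):
    """Hypothetical one-shot transformation: merge each sentence into one token."""
    for sentence in sentences:
        yield ["_".join(sentence)]


data = [["new", "york"], ["machine", "learning"]]
corpus = Restartable(lambda: join_tokens(data))

# Three passes over the same object all see the full corpus.
passes = [list(corpus) for _ in range(3)]
print(passes[0])
```

With gensim, the factory would presumably be something like `lambda: bigram[my_sentences]`, which has the same effect as the class in the answer.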

Votes: 1

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/45847370
