Q: 'utf-8' codec error when using Doc2Vec

Stack Overflow user

Asked 2017-06-07 07:00:26

Answers: 1 · Views: 325 · Followers: 0 · Votes: 1

I cannot run my program because of a decoding error. I am using gensim and trying out its Doc2Vec model, and I get this error when I do. Code:

def to_array(self):
    self.sentences = []
    for source, prefix in self.sources.items():
        with utils.smart_open(source) as fin:
            for item_no, line in enumerate(fin):
                self.sentences.append(LabeledSentence(
                    utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
    return self.sentences

sentences = LabeledLineSentence(sources)
model = Doc2Vec(min_count=1, window=10, size=100, dm_mean=0, sample=1e-5,
                negative=5, workers=12)
model.build_vocab(sentences.to_array())

Error:

File "<ipython-input-88-eab20df20acc>", line 75, in <module>
    model.build_vocab(sentences.to_array())

File "<ipython-input-88-eab20df20acc>", line 58, in to_array
    utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))

File "C:\Users\summert\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 21: invalid continuation byte

python-3.x

python

1 Answer

Stack Overflow user

Accepted answer

Posted 2017-06-07 09:32:53

It looks like gensim, as installed under Anaconda here, is being handed bytes that it cannot decode as UTF-8, so model.build_vocab(sentences.to_array()) never gets the type it expects.
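To make the error message concrete, here is a minimal sketch of how this kind of failure can arise. The sample bytes and the helper name `describe_decode_failure` are made up for illustration and are not taken from the question's data. In UTF-8, 0xef opens a three-byte sequence, so the decoder complains when the next byte is not a continuation byte, which is typically a sign that the file was saved in some other encoding.

```python
def describe_decode_failure(raw):
    """Return (position, reason) of the first UTF-8 decode error, or None.

    Hypothetical helper for illustration only.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


# Made-up sample: 0xef is followed by a space, which is not a
# valid UTF-8 continuation byte.
sample = b"caf\xef bar"
print(describe_decode_failure(sample))

# The same bytes decode cleanly if the source encoding is assumed
# to be Latin-1, where 0xef is the character U+00EF.
print(sample.decode("latin-1"))
```

Running the helper over each line of the input file should show which line and byte offset is the offending one.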

Where did you find to_unicode, and where is utils imported from? I don't think it is part of standard Python 3. Please take a look at this.

Since you are using Python 3, you probably don't need any conversion at all.

Just replace

(LabeledSentence(utils.to_unicode(line).split()...

with

(LabeledSentence(line.split()...

If that doesn't work, try:

 (LabeledSentence(line.encode('utf-8').split()...
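One more option, as a rough sketch rather than a definitive fix: if the file is opened in binary mode (which the call to utils.to_unicode on each line suggests), every line is a bytes object, so the relevant operation is decode rather than encode, and the encoding and error policy can be chosen explicitly. The function name `tokenize_byte_lines`, the sample bytes, and the guess that the data is Latin-1 are all assumptions for illustration; the question does not say how the files were encoded.

```python
def tokenize_byte_lines(lines, encoding="utf-8", errors="strict"):
    """Decode each bytes line explicitly, then split it into str tokens.

    Hypothetical helper: `lines` is any iterable of bytes objects,
    such as a file opened in binary mode.
    """
    return [line.decode(encoding, errors=errors).split() for line in lines]


# Made-up sample lines; the first one is not valid UTF-8.
sample = [b"caf\xef bar\n", b"hello world\n"]

# Option 1: decode with the encoding the file was actually saved in
# (Latin-1 is only a guess here).
print(tokenize_byte_lines(sample, encoding="latin-1"))

# Option 2: keep UTF-8 but substitute U+FFFD for undecodable bytes
# instead of raising.
print(tokenize_byte_lines(sample, errors="replace"))
```

The token lists returned this way could be passed to LabeledSentence in place of utils.to_unicode(line).split(). Option 2 silently loses the original character, so it only makes sense when a few damaged tokens are acceptable.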

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/44405685
