
NLTK Python word_tokenize

Stack Overflow user
Asked on 2018-03-25 12:29:17
1 answer · 385 views · 0 followers · 0 votes

I have loaded a txt file containing 6000 lines of sentences. I tried split("\n") and word_tokenize on the sentences, but I get the following error:

Traceback (most recent call last):
  File "final.py", line 52, in <module>
    short_pos_words = word_tokenize(short_pos)
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 95, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 313, in _pair_iter
    for el in it:
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 312, in _pair_iter
    prev = next(it)
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 581, in _annotate_first_pass
    for aug_tok in tokens:
  File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 546, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
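For context, a minimal Python 2.7 sketch of the kind of code that triggers this error (the txt path is illustrative; the variable names are taken from the traceback):

from nltk.tokenize import word_tokenize

# Reading without an encoding in Python 2 yields a byte string.
short_pos = open('sentences.txt').read()
sentences = short_pos.split('\n')              # 6000 sentences, one per line
short_pos_words = word_tokenize(short_pos)     # punkt implicitly decodes the bytes as ASCII and fails on 0xc3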

1 Answer

Stack Overflow user

Accepted answer

Answered on 2018-03-25 12:33:55

The problem is related to the encoding of the file's contents. Assuming you want to decode the str to UTF-8 unicode:

Option 1 (not recommended in Python 3):

import sys
reload(sys)                      # reload sys to expose setdefaultencoding (Python 2 only)
sys.setdefaultencoding('utf8')   # make UTF-8 the interpreter-wide default encoding
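This changes the interpreter-wide default encoding, exists only in Python 2 (sys.setdefaultencoding is gone in Python 3), and can mask other encoding bugs. A less intrusive sketch of the same idea is to decode the text explicitly before tokenizing (the path is illustrative; the variable name is assumed to match the traceback):

# Decode the bytes explicitly instead of changing the global default (illustrative sketch).
with open('/path/to/txt/file', 'rb') as f:
    short_pos = f.read().decode('utf-8')   # unicode from here on, so word_tokenize no longer hits ASCII decoding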

Option 2:

Pass the encoding argument to the open function when opening the text file:

f = open('/path/to/txt/file', 'r+', encoding="utf-8")   # decode the file as UTF-8 while reading
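Note that the traceback above comes from Python 2.7, where the built-in open does not accept an encoding argument; io.open behaves like Python 3's open and works in both versions. A minimal end-to-end sketch under that assumption (the path is illustrative):

import io
from nltk.tokenize import word_tokenize

# io.open decodes the file to unicode as it is read (works in Python 2.7 and Python 3).
with io.open('/path/to/txt/file', 'r', encoding='utf-8') as f:
    short_pos = f.read()

short_pos_words = word_tokenize(short_pos)   # tokenizes unicode text, no UnicodeDecodeError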
0 votes
The original page content is provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/49475847