
Introduction to the Stanford Segmenter

Asked by a Stack Overflow user on 2017-08-13 17:38:21
1 answer · 4.6K views · 0 followers · 1 vote

Recently I have been trying to use the Stanford Segmenter to process Chinese data in Python, but I ran into problems when running it. Here is the code I entered in Python:

segmenter = StanfordSegmenter(path_to_jar = '/Applications/Python3.6/stanford-segmenter/stanford-segmenter.jar',
                              path_to_slf4j = '/Applications/Python3.6/stanford-segmenter/slf4j-api-1.7.25.jar',
                              path_to_sihan_corpora_dict = '/Applications/Python 3.6/stanford-segmenter/data',
                              path_to_model = '/Applications/Python 3.6/stanford-segmenter/data/pku.gz',
                              path_to_dict = '/Applications/Python 3.6/stanford-segmenter/data/dict-chris6.ser.gz'
                             )

The construction seemed to go fine, since I did not get any warnings. However, when I tried to segment the Chinese words in a sentence, the segmenter did not work:

sentence = u'这是斯坦福中文分词器测试'
segmenter.segment(sentence)

Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/stanford/nlp/ie/crf/CRFClassifier : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    segmenter.segment(sentence)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 96, in segment
    return self.segment_sents([tokens])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 123, in segment_sents
    stdout = self._execute(cmd)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 143, in _execute
    cmd,classpath=self._stanford_jar, stdout=PIPE, stderr=PIPE)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/internals.py", line 134, in java
    raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed : ['/usr/bin/java', '-mx2g', '-cp', '/Applications/Python 3.6/stanford-segmenter/stanford-segmenter.jar:/Applications/Python 3.6/stanford-segmenter/slf4j-api-1.7.25.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-sighanCorporaDict', '/Applications/Python 3.6/stanford-segmenter/data', '-textFile', '/var/folders/j3/52_wq50j75jfk5ybg6krlw_w0000gn/T/tmpz6dqv1yf', '-sighanPostProcessing', 'true', '-keepAllWhitespaces', 'false', '-loadClassifier', '/Applications/Python 3.6/stanford-segmenter/data/pku.gz', '-serDictionary', '/Applications/Python 3.6/stanford-segmenter/data/dict-chris6.ser.gz', '-inputEncoding', 'UTF-8']

I am using Python 3.6.2 on a Mac. I wonder whether I missed an important step. Could anyone share their experience of solving this problem? Many thanks.
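As background for the stack trace above: the `Unsupported major.minor version 52.0` line is the actual failure. Class-file version 52 corresponds to Java 8, so the segmenter jar was compiled for Java 8 but is being run on an older JVM. The mapping is simple arithmetic (the helper `java_release` below is illustrative, not part of NLTK or CoreNLP):

```python
# For Java 5 and later, class-file major version = Java release + 44.
def java_release(major: int) -> int:
    """Return the Java release that emits the given class-file major version."""
    return major - 44

# "Unsupported major.minor version 52.0" therefore means the jar targets
# Java 8 (52 - 44 = 8), but the running JVM is older than that.
print(java_release(52))  # 8
```

Running `java -version` and, if needed, pointing `JAVAHOME` at a Java 8+ install is the usual fix for this particular error.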


1 Answer

Answered by a Stack Overflow user on 2017-08-14 06:47:39

TL;DR

Hold on for a while and wait for NLTK v3.2.5, which will have a very simple interface to the Stanford tokenizers, standardized across different languages.

The StanfordSegmenter and StanfordTokenizer classes will be deprecated in v3.2.5 (see the NLTK issue tracker for details).

First, upgrade your nltk version:

pip install -U nltk

Then download and start the Stanford CoreNLP server:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001  -port 9001 -timeout 15000
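Once the server is up, you can sanity-check it without NLTK at all: the CoreNLP server accepts raw HTTP POSTs, with the annotator settings passed as URL-encoded JSON in a `properties` query parameter. A sketch (assumes the server started above is listening on port 9001; the actual request is left as a comment since it needs the server running):

```python
import json
from urllib.parse import urlencode

# The CoreNLP HTTP API takes its annotator configuration as a JSON object
# in the `properties` URL parameter.
props = {'annotators': 'tokenize,ssplit', 'outputFormat': 'json'}
query = urlencode({'properties': json.dumps(props)})
url = 'http://localhost:9001/?' + query

# With the server running, POST the raw UTF-8 sentence to `url`, e.g.:
#   requests.post(url, data=u'这是斯坦福中文分词器测试'.encode('utf-8')).json()
print(url)
```

If the POST returns JSON with a `sentences` list, the server and the Chinese models are working.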

Then, in NLTK v3.2.5:

>>> from nltk.tokenize.stanford import CoreNLPTokenizer
>>> sttok = CoreNLPTokenizer('http://localhost:9001')
>>> sttok.tokenize(u'我家没有电脑。')
['我家', '没有', '电脑', '。']

In the meantime, if your NLTK version is v3.2.4, you can try the following:

from nltk.parse.corenlp import CoreNLPParser
corenlp_parser = CoreNLPParser('http://localhost:9001', encoding='utf8')
text = u'我家没有电脑。'
result = corenlp_parser.api_call(text, {'annotators': 'tokenize,ssplit'})
tokens = [token['originalText'] or token['word'] for sentence in result['sentences'] for token in sentence['tokens']]
tokens

Output:

['我家', '没有', '电脑', '。']
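To see what the list comprehension above is doing, here is the same extraction applied offline to the relevant shape of the server's JSON response (the `sentences`/`tokens`/`originalText`/`word` fields are the CoreNLP JSON output format; the sample values are illustrative):

```python
# Illustrative fragment of a CoreNLP JSON response for u'我家没有电脑。'
result = {
    'sentences': [
        {'tokens': [
            {'word': '我家', 'originalText': '我家'},
            {'word': '没有', 'originalText': '没有'},
            {'word': '电脑', 'originalText': '电脑'},
            {'word': '。', 'originalText': '。'},
        ]}
    ]
}

# Same extraction as in the answer: prefer `originalText`, fall back to
# `word` when `originalText` is empty, flattening across all sentences.
tokens = [token['originalText'] or token['word']
          for sentence in result['sentences']
          for token in sentence['tokens']]
print(tokens)  # ['我家', '没有', '电脑', '。']
```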
1 vote
Original content provided by Stack Overflow; translation supported by Tencent Cloud's engine.
Original link: https://stackoverflow.com/questions/45663121