文章/答案/技术大牛

发布

社区首页 >问答首页 >如果没有ssplit，我如何使用pycorenlp (斯坦福CoreNLP的Python包装器)来执行文本的字标记化？

问如果没有ssplit，我如何使用pycorenlp (斯坦福CoreNLP的Python包装器)来执行文本的字标记化？
EN

Stack Overflow用户

提问于 2016-08-12 22:45:49

回答 1查看 669关注 0票数 1

我正在尝试运行吡咯烷酮，这是一个用于斯坦福大学CoreNLP的Python包装器，它使用tokenize注解器执行文本的字标记化。

我第一次推出斯坦福大学的CoreNLP：

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000

然后跑：

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

text_input = 'this is a test.'
print('text_input: {0}'.format(text_input))
text_output = nlp.annotate(text_input, properties={
                    'annotators': 'tokenize',
                    'outputFormat': 'json'
                })
print('text_output: {0}'.format(text_output))

令人惊讶的是，这没有提供任何输出：

text_input: this is a test.
text_output: {}

为什么？

如果我添加ssplit，那么text_output不再是空的：

text_input = 'this is a test.'
print('text_input: {0}'.format(text_input))
text_output = nlp.annotate(text_input, properties={
                    'annotators': 'tokenize,ssplit',
                    'outputFormat': 'json'
                })
print('text_output: {0}'.format(text_output))

产出：

text_input: this is a test.
text_output: {u'sentences': [{u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 0, u'tokens': [{u'index': 1, u'word': u'this', u'after': u' ', u'characterOffsetEnd': 4, u'characterOffsetBegin': 0, u'originalText': u'this', u'before': u''}, {u'index': 2, u'word': u'is', u'after': u' ', u'characterOffsetEnd': 7, u'characterOffsetBegin': 5, u'originalText': u'is', u'before': u' '}, {u'index': 3, u'word': u'a', u'after': u' ', u'characterOffsetEnd': 9, u'characterOffsetBegin': 8, u'originalText': u'a', u'before': u' '}, {u'index': 4, u'word': u'test', u'after': u'', u'characterOffsetEnd': 14, u'characterOffsetBegin': 10, u'originalText': u'test', u'before': u' '}, {u'index': 5, u'word': u'.', u'after': u'', u'characterOffsetEnd': 15, u'characterOffsetBegin': 14, u'originalText': u'.', u'before': u''}]}]}

我不能不用使用tokenize注释器就使用ssplit注解器吗？

注释器依赖项概述似乎说我应该能够单独使用tokenize注解器：

stanford-nlp

tokenize

python

nlp

回答 1

Stack Overflow用户

发布于 2016-08-13 09:43:17

您是对的，如果提供的唯一注解器是“tokenize”，则API似乎没有响应。它应该默认为PTBTokenizer，就像文档中提到的那样。这里还有另一个相关的问题：斯坦福CoreNLP给NullPointerException。但是，如果您只想标记而不做其他操作，则可以：

nwani@ip-172-31-43-96:~/stanford-corenlp-full-2015-12-09$ ~/jre1.8.0_101/bin/java -mx4g -cp "*" edu.stanford.nlp.process.PTBTokenizer <<< "this is a test"
this
is
a
test
PTBTokenizer tokenized 4 tokens at 19.11 tokens per second.

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/38927415

复制

相似问题

问如果没有ssplit，我如何使用pycorenlp (斯坦福CoreNLP的Python包装器)来执行文本的字标记化？
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如果没有ssplit，我如何使用pycorenlp (斯坦福CoreNLP的Python包装器)来执行文本的字标记化？EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如果没有ssplit，我如何使用pycorenlp (斯坦福CoreNLP的Python包装器)来执行文本的字标记化？
EN