首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >NLTK PunktSentenceTokenizer椭圆分裂

NLTK PunktSentenceTokenizer椭圆分裂
EN

Stack Overflow用户
提问于 2015-04-30 14:47:04
回答 1查看 1.5K关注 0票数 3

我正在使用NLTK PunktSentenceTokenizer,我面临这样一种情况,即包含多个句子的文本由省略字符(.)分隔。下面是我正在研究的例子:

代码语言:javascript
复制
>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']

正如你所看到的,句子是不分开的。是否有办法使它像我所期望的那样工作(也就是说,返回一个包含四个项的列表)?

附加信息:我尝试使用debug_decisions函数来理解为什么做出这样的决定。我得到了以下结果:

代码语言:javascript
复制
>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")

>>> [x for x in g]
[{'break_decision': None,
  'collocation': False,
  'period_index': 27,
  'reason': 'default decision',
  'text': 'service... Cashier',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'cashier',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 47,
  'reason': 'default decision',
  'text': 'rude... Drive',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'drive',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 72,
  'reason': 'default decision',
  'text': 'hours... The',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'the',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'}]

不幸的是,我无法理解这些词的含义,尽管似乎标记器确实检测到省略号,但出于某种原因,我决定不使用这些符号来拆分句子。有什么想法吗?

谢谢!

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2015-04-30 14:59:04

你为什么不直接用分裂函数? str.split('...')

编辑:我用路透社的语料库训练这个功能,我想你可以用你的:

代码语言:javascript
复制
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters
pst = PunktSentenceTokenizer()
pst.train(reuters.raw())
text = "Batts did not take questions or give details of the report's findings... He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."
print(pst.sentences_from_text(text))

结果是:

代码语言:javascript
复制
>>> ["Batts did not take questions or give details of the report's findings...", "He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office.", 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/29970846

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档