Q: NLTK PunktSentenceTokenizer ellipsis splitting

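Stack Overflow user

Asked 2015-04-30 14:47:04

Answers 1 · Views 1.5K · Followers 0 · Votes 3

I'm using NLTK's PunktSentenceTokenizer and have run into a case where a text contains multiple sentences separated by the ellipsis character (...). Here is the example I'm working on: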

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']

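As you can see, the sentences are not separated. Is there a way to make it work as I expect (that is, return a list with four items)?

Additional information: I tried using the debug_decisions function to understand why this decision is made. I got the following results: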

>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")

>>> [x for x in g]
[{'break_decision': None,
  'collocation': False,
  'period_index': 27,
  'reason': 'default decision',
  'text': 'service... Cashier',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'cashier',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 47,
  'reason': 'default decision',
  'text': 'rude... Drive',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'drive',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 72,
  'reason': 'default decision',
  'text': 'hours... The',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'the',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'}]

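Unfortunately, I can't understand what these keys mean. It seems the tokenizer does detect the ellipses, but for some reason it decides not to split the sentences there. Any ideas?

Thanks!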

python

python-2.7

nltk

tokenize

1 Answer

Stack Overflow user

Accepted answer

Posted 2015-04-30 14:59:04

Why don't you just use the split function? str.split('...')
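A plain str.split('...') drops the ellipses and leaves leading spaces on each piece, so here is a minimal sketch of the same idea using a regular expression instead. The helper name split_on_ellipsis is hypothetical (not part of NLTK), and it assumes every "..." followed by whitespace ends a sentence, which may not hold for your data:

```python
import re

def split_on_ellipsis(text):
    """Split text at whitespace that directly follows an ellipsis (three dots)."""
    # The lookbehind keeps each "..." attached to the sentence it ends;
    # only the whitespace after it is consumed by the split.
    parts = re.split(r"(?<=\.\.\.)\s+", text)
    # Drop empty strings that can appear if the text ends with trailing whitespace.
    return [p for p in parts if p]

text = ("Horrible customer service... Cashier was rude... "
        "Drive thru took hours... The tables were not clean...")
sentences = split_on_ellipsis(text)
print(sentences)
```

For the text from the question this gives four items. Note that it would also split on an ellipsis used as a pause in the middle of a sentence, so it is only a fallback for text like these short reviews.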

EDIT: I trained the tokenizer on the Reuters corpus; I think you can train it on your own data:

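from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters  # needs the corpus installed: nltk.download('reuters')
# Train Punkt on raw Reuters text so it learns abbreviations and sentence starters
pst = PunktSentenceTokenizer()
pst.train(reuters.raw())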
text = "Batts did not take questions or give details of the report's findings... He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."
print(pst.sentences_from_text(text))

The result is:

>>> ["Batts did not take questions or give details of the report's findings...", "He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office.", 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']

Votes 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/29970846
