I'm using NLTK's PunktSentenceTokenizer, and I've run into a case where a text contains several sentences separated by ellipses (...). Here is the example I'm working with:
>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']
As you can see, the sentences are not split. Is there a way to make this work as I expect (that is, return a list with four items)?
Additional info: I tried using the debug_decisions function to understand why this decision was made. I got the following result:
>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
>>> [x for x in g]
[{'break_decision': None,
'collocation': False,
'period_index': 27,
'reason': 'default decision',
'text': 'service... Cashier',
'type1': '...',
'type1_in_abbrs': False,
'type1_is_initial': False,
'type2': 'cashier',
'type2_is_sent_starter': False,
'type2_ortho_contexts': set(),
'type2_ortho_heuristic': 'unknown'},
{'break_decision': None,
'collocation': False,
'period_index': 47,
'reason': 'default decision',
'text': 'rude... Drive',
'type1': '...',
'type1_in_abbrs': False,
'type1_is_initial': False,
'type2': 'drive',
'type2_is_sent_starter': False,
'type2_ortho_contexts': set(),
'type2_ortho_heuristic': 'unknown'},
{'break_decision': None,
'collocation': False,
'period_index': 72,
'reason': 'default decision',
'text': 'hours... The',
'type1': '...',
'type1_in_abbrs': False,
'type1_is_initial': False,
'type2': 'the',
'type2_is_sent_starter': False,
'type2_ortho_contexts': set(),
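'type2_ortho_heuristic': 'unknown'}]
Unfortunately, I can't make sense of these fields. It seems the tokenizer does detect the ellipses, but for some reason it decides not to split the sentences at them. Any ideas?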
Thanks!
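As a reading aid for the debug output above, here is a minimal sketch that reduces each decision record to the fields that seem to matter most. It assumes that a falsy `break_decision` means no sentence break was inserted, and that an `'unknown'` value for `type2_ortho_heuristic` means the tokenizer has no evidence about whether the next word tends to start sentences; both are my reading of the output, not something the output states. The names `decisions` and `summarize` are hypothetical, and the records are trimmed copies of the ones printed above.

```python
# Trimmed copies of the three records printed by debug_decisions above.
# Only the fields used below are kept; `decisions` is a hypothetical name.
decisions = [
    {"text": "service... Cashier", "break_decision": None,
     "reason": "default decision", "type2_ortho_heuristic": "unknown"},
    {"text": "rude... Drive", "break_decision": None,
     "reason": "default decision", "type2_ortho_heuristic": "unknown"},
    {"text": "hours... The", "break_decision": None,
     "reason": "default decision", "type2_ortho_heuristic": "unknown"},
]


def summarize(records):
    """Return one human-readable line per decision record."""
    rows = []
    for d in records:
        # Assumption: a falsy break_decision (None or False) means "no break".
        verdict = "break" if d["break_decision"] else "no break"
        rows.append(
            f'{d["text"]!r}: {verdict} '
            f'({d["reason"]}, next word: {d["type2_ortho_heuristic"]})'
        )
    return rows


for row in summarize(decisions):
    print(row)
```

Read this way, all three ellipses fall through to the default decision with nothing known about the following word, which would fit a tokenizer that has not been trained on any text.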
Answered 2015-04-30 14:59:04
Why not just use the split function? str.split('...')
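One caveat with a plain `str.split('...')` is that it drops the ellipses and leaves leading spaces on each piece. A minimal sketch of a variant that keeps the ellipsis attached to each sentence is below. It uses `re.split` from the standard library; the function name `split_on_ellipsis` is hypothetical, and the rule that a new sentence starts only when the ellipsis is followed by whitespace and an ASCII capital letter is an assumption that suits the example text, not a general solution.

```python
import re

# Split at whitespace that follows "..." and precedes a capital letter.
# The lookbehind and lookahead keep the ellipsis and the capital in the output.
_ELLIPSIS_BOUNDARY = re.compile(r"(?<=\.\.\.)\s+(?=[A-Z])")


def split_on_ellipsis(text):
    """Split text into sentences at ellipses followed by a capitalized word."""
    parts = _ELLIPSIS_BOUNDARY.split(text.strip())
    # Drop empty pieces, which can appear if the input is empty.
    return [p for p in parts if p]


text = ("Horrible customer service... Cashier was rude... "
        "Drive thru took hours... The tables were not clean...")
for sentence in split_on_ellipsis(text):
    print(sentence)
```

An ellipsis followed by a lowercase word (for example "Well... maybe later.") is left alone by this pattern, which may or may not be what you want.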
Edit: I trained the tokenizer on the Reuters corpus; I think you could train it on your own corpus instead:
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters
pst = PunktSentenceTokenizer()
pst.train(reuters.raw())
text = "Batts did not take questions or give details of the report's findings... He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."
print(pst.sentences_from_text(text))
The result is:
["Batts did not take questions or give details of the report's findings...", "He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office.", 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']
https://stackoverflow.com/questions/29970846