I am using NLTK's PunktSentenceTokenizer to split a paragraph into sentences. I have the following paragraph:

paragraph = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work."

Output: '1.', 'Candidate is very poor in mathematics.', '2.', 'Interpersonal skills are good.', '3.', 'Very enthusiastic about social work.'
I tried adding the numbered markers as sentence starters with the code below, but even that did not work.
from nltk.tokenize.punkt import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()
tokenizer._params.sent_starters.add('1.')

I would really appreciate it if someone could point me in the right direction. (Thanks in advance :)
Posted on 2019-09-01 14:10:09
This kind of problem can be solved with a regular expression, as in the code below:
import re

paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"

reSentenceEnd = re.compile(r"\.|$")
reAtLeastTwoLetters = re.compile("[a-zA-Z]{2}")

previousMatch = 0
sentenceStart = 0
end = len(paragraphs)
while True:
    candidateSentenceEnd = reSentenceEnd.search(paragraphs, previousMatch)
    # A sentence must contain at least two consecutive letters:
    if reAtLeastTwoLetters.search(paragraphs[sentenceStart:candidateSentenceEnd.end()]):
        print(paragraphs[sentenceStart:candidateSentenceEnd.end()])
        sentenceStart = candidateSentenceEnd.end()
    if candidateSentenceEnd.end() == end:
        break
    previousMatch = candidateSentenceEnd.start() + 1

The output is as follows (later sentences keep the leading space from the previous split point):

1. Candidate is very poor in mathematics.
 2. Interpersonal skills are good.
 3. Very enthusiastic about social work
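The loop above can also be wrapped into a reusable function that returns the sentences as a list instead of printing them. This is a sketch of the same approach; the function name and the stripping of leading whitespace are my additions:

```python
import re

def split_sentences(text):
    """Split text on '.' (or end of string), keeping only chunks
    that contain at least two consecutive letters."""
    sentence_end = re.compile(r"\.|$")
    two_letters = re.compile(r"[a-zA-Z]{2}")
    sentences = []
    start = 0        # start of the current candidate sentence
    search_from = 0  # where to look for the next '.'
    while True:
        match = sentence_end.search(text, search_from)
        candidate = text[start:match.end()]
        if two_letters.search(candidate):
            sentences.append(candidate.strip())
            start = match.end()
        if match.end() == len(text):
            return sentences
        search_from = match.start() + 1

paragraphs = ("1. Candidate is very poor in mathematics. "
              "2. Interpersonal skills are good. "
              "3. Very enthusiastic about social work")
print(split_sentences(paragraphs))
```

Because a chunk like "1." never contains two consecutive letters, the numbered markers are absorbed into the sentence that follows them rather than emitted on their own.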
Many tokenizers, including NLTK's and spaCy's, can work with regular expressions. However, adapting this code to their frameworks is probably not trivial.
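If the input always uses numbered markers like "1.", "2.", …, a single re.split with a lookahead is a lighter alternative. This is my own sketch, not part of the answer above:

```python
import re

paragraphs = ("1. Candidate is very poor in mathematics. "
              "2. Interpersonal skills are good. "
              "3. Very enthusiastic about social work")

# Split on whitespace that is immediately followed by a numbered
# marker such as "2." — the lookahead keeps the marker attached
# to the sentence it introduces.
sentences = re.split(r"\s+(?=\d+\.)", paragraphs)
print(sentences)
```

Unlike the loop-based version, this relies entirely on the markers being present, so it will not split sentences that lack a leading number.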
https://stackoverflow.com/questions/57741007