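I'm working on a side project that applies natural language processing to clinical data, and I'm using Java's BreakIterator to split text into sentences for further analysis. I've run into a problem: BreakIterator does not recognize a sentence that begins with a number.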
Example:
String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
Expected output:
1) No acute osseous abnormality.
2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.
Actual output:
1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.
Code:
import java.text.BreakIterator;
import java.util.Locale;

public class Test {
    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
        Locale locale = Locale.US;
        BreakIterator splitIntoSentences = BreakIterator.getSentenceInstance(locale);
        splitIntoSentences.setText(text);
        // Start offset of the sentence currently being extracted.
        int index = 0;
        // Each call to next() advances to the following sentence boundary.
        while (splitIntoSentences.next() != BreakIterator.DONE) {
            String sentence = text.substring(index, splitIntoSentences.current());
            System.out.println(sentence);
            index = splitIntoSentences.current();
        }
    }
}
Any help would be greatly appreciated. I tried looking for an answer online, but to no avail.
Posted 2020-11-24 12:20:09
I'm no longer using BreakIterator; I switched to Apache OpenNLP instead, and it works great!
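If adding a dependency is not an option, one possible workaround (not what I ended up using) is to keep BreakIterator and post-process each chunk it returns, splitting again wherever sentence-ending punctuation is followed by a list marker such as "2) ". The sketch below is only an illustration under that assumption: the class name NumberedSentenceSplitter, the split method, and the regular expression are hypothetical, and the pattern only covers the "digits plus closing parenthesis" style shown in the question. It would need adjusting for other numbering styles and can mis-split text where an abbreviation is followed by such a marker.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class NumberedSentenceSplitter {
    // Hypothetical pattern: whitespace that follows '.', '!' or '?' and
    // precedes a list marker like "2) ". This is an assumption about the
    // input format, not a general sentence-boundary rule.
    private static final Pattern BEFORE_LIST_MARKER =
            Pattern.compile("(?<=[.!?])\\s+(?=\\d+\\)\\s)");

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Chunk as segmented by BreakIterator; it may still contain
            // several numbered items, so split it once more.
            String chunk = text.substring(start, end);
            for (String piece : BEFORE_LIST_MARKER.split(chunk)) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. "
                + "2) Mild to moderate disc space narrowing at the L4-5 level. "
                + "This is another sentence.";
        split(text).forEach(System.out::println);
    }
}
```

Because the second pass only ever splits further, it leaves the result unchanged for chunks that BreakIterator already separated correctly.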
https://stackoverflow.com/questions/64708428