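I am using Java's BreakIterator class to break a block of text up into sentences in a variety of languages. It works pretty well, but for some reason it is adding commas to the text that were not there before.
It seems to add:
, ,
wherever there was a paragraph break in the original text. For some reason it is also adding commas next to other commas.
Here is an example of the kind of result I am getting:
First of all though, I've got to get up,, my train leaves at five
.
", , And he looked over at the alarm clock, ticking on the chest of, drawers.
"God in Heaven!"
The text should look more like this: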
First of all though, I've got to get up, my train leaves at five.
And he looked over at the alarm clock, ticking on the chest of drawers.
"God in Heaven!" he thought.
Here is the original text:
First of all though, I've got to get up,
my train leaves at five."
And he looked over at the alarm clock, ticking on the chest of
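drawers. "God in Heaven!" he thought.
I have most of what I need working, but after I break the text up into sentences I still have to go back and manually edit out all the extra commas.
As you might imagine, searching for "java BreakIterator extra commas" has not turned up many useful results.
Here is the function I am using to do the sentence detection: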
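import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Locale;

public ArrayList<String> tokenize(String text, Locale locale)
{
    ArrayList<String> sentences = new ArrayList<String>();
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(locale);
    sentenceIterator.setText(text);
    // first() returns the first boundary, which is the start of the text.
    int boundary = sentenceIterator.first();
    int lastBoundary = 0;
    while (boundary != BreakIterator.DONE)
    {
        // Advance to the next sentence boundary and cut out the sentence before it.
        boundary = sentenceIterator.next();
        if (boundary != BreakIterator.DONE)
        {
            sentences.add(text.substring(lastBoundary, boundary));
        }
        lastBoundary = boundary;
    }
    return sentences;
}
Here is the part of the code that reads the files into memory and feeds them to my sentence tokenizer: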
FileHelper fileHelper = new FileHelper();
TextTokenizer textTokenizer = new TextTokenizer();
Constants constants = new Constants();
ArrayList<String> enMetamorph = fileHelper.readFileToMemory(
        constants.books("metamorphosis_en.txt"));
ArrayList<String> enTokenMetamorph = textTokenizer.tokenize(
        enMetamorph.toString(), Locale.US);
fileHelper.writeFile(enTokenMetamorph, constants.tokenized(
        "metamorphosis_en.txt"));
I am using Franz Kafka's "Metamorphosis". A free UTF-8 text version is available on Project Gutenberg. The Constants object is only used to create file paths. Its books function uses a function called makeFilePath, which finds the books directory no matter which computer the program is running on. That function is as follows:
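public static String makeFilePath(String addition)
{
    // An empty File resolves to the current working directory.
    String filePath = new File("").getAbsolutePath();
    filePath = filePath + addition;
    return filePath;
}
Does anyone know why I am getting all these extra commas in my text?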
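Posted on 2014-05-23 01:01:26
The problem is not with Java's BreakIterator class; the problem is with how Java converts a list of strings into a single string. ArrayList's toString() wraps the elements in square brackets and separates them with a comma and a space, so every line break in the file turns into a comma, and every blank line (a paragraph break) turns into ", ,".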
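A minimal sketch of where the commas come from, using three lines of the text from the question, the middle one empty to stand in for a paragraph break. The class and method names (ListToStringDemo, defaultForm) are made up for this illustration and are not part of the original program:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListToStringDemo {
    // Returns the default string form of a list: elements wrapped in
    // square brackets and separated by ", ".
    static String defaultForm(List<String> lines) {
        return lines.toString();
    }

    public static void main(String[] args) {
        // One list element per line of the file; the empty string is a blank line.
        List<String> lines = new ArrayList<>(Arrays.asList(
                "my train leaves at five.\"",
                "",
                "And he looked over at the alarm clock, ticking on the chest of"));
        // The separators between the three elements show up as ", , " in the text.
        System.out.println(defaultForm(lines));
    }
}
```

The printed string contains five.", , And, which is the same pattern that shows up in the tokenized output.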
Here is the line that caused the problem:
ArrayList<String> enTokenMetamorph = textTokenizer.tokenize(enMetamorph.toString(), Locale.US);
In the end I wrote my own toString function, which I am now using. It is posted below:
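public String toString(List<String> strings)
{
    StringBuilder sb = new StringBuilder();
    for (String s : strings)
    {
        // Separate lines with a space instead of ", " (this also adds a leading space).
        sb.append(" " + s);
    }
    return sb.toString();
}
The line of code now looks like this: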
ArrayList<String> enTokenMetamorph = textTokenizer.tokenize(textTokenizer.toString(enMetamorph), Locale.US);
This solved the problem. The output now looks like this:
First of all though, I've got to get up, my train leaves at five.
And he looked over at the alarm clock, ticking on the chest of drawers.
"God in Heaven!"
he thought.
Instead of this:
First of all though, I've got to get up,, my train leaves at five
.
", , And he looked over at the alarm clock, ticking on the chest of, drawers.
"God in Heaven!"
https://stackoverflow.com/questions/23810674
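As a possible alternative to the hand-written toString helper in the answer, and assuming Java 8 or later is available, String.join can do the same joining with an explicit separator. This is only a sketch; JoinLinesDemo and joinLines are hypothetical names, not part of the original program:

```java
import java.util.Arrays;
import java.util.List;

class JoinLinesDemo {
    // Joins the lines with a single space between them.
    // String.join(CharSequence, Iterable) was added in Java 8.
    static String joinLines(List<String> lines) {
        return String.join(" ", lines);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "First of all though, I've got to get up,",
                "my train leaves at five.\"");
        // No brackets and no extra commas are added between the lines.
        System.out.println(joinLines(lines));
    }
}
```

Unlike the loop in the answer, this does not put a space before the first line. A blank line in the input still produces two consecutive spaces in the joined text.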