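I want to change the end-of-sentence delimiters used by my OpenNLP SentenceDetectorME. I am using opennlp 1.5.3. Since the stock version only detects sentences separated by '.', my goal is to add other sentence delimiters such as ';', '!' and '?' by passing a character array eos[] to SentenceDetectorFactory. I have read that you have to use the .train method of SentenceDetectorME, but I do not understand how to use it, since it is static and needs a training model. Any suggestions?
My code: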
import java.io.*;
import opennlp.tools.sentdetect.*;

public class SenTest {
    public static void main(String[] args) throws IOException {
        String paragraph = "12oz bottle poured into a tulip. Pleasing aromas of citrus rind, lemongrass, peaches, and toasted caramel are picked up from the start. After it settles a bit, more of a fresh baked bread crust and tangerine comes through, and even later, the bread crust turns more towards a blackened pizza crust. It pours a slightly hazy copper-orange color with a creamy white head that retains well; it leaves a thick puffy ring with a creamy island and a decent, messy lace along the glass. Great balance between medium high levels of sweet and bitter. The texture is creamy on the palate with a body towards the higher end of medium. The carbonation is a touch effervescent or fizzy, but overall, soft. There’s a very pronounced grapefruit tartness up front, but it mellows quickly after the first few sips. It finishes with a zesty combination of lemongrass, caramel, and stonefruit. The aftertaste is primarily sweet, overripe tangerines and it’s peel with a tart grapefruit bitter lingering in the mouth. Overall very refreshing, straddles the line between IPA and APA.";
        // the delimiters I would like to use; declared here but not yet passed to the detector
        char eos[] = {';', '.', '!', '?' };
        int counter = 0;
        // always start with a model, a model is learned from training data
        InputStream is = new FileInputStream( System.getProperty( "user.dir" ) + "/lib/en-sent.bin" );
        SentenceModel model = new SentenceModel( is );
        SentenceDetectorME sdetector = new SentenceDetectorME( model );
        String sentences[] = sdetector.sentDetect( paragraph );
        // print each detected sentence with a running number
        for ( String s : sentences ) {
            counter++;
            System.out.println( "Frase numero " + counter + ": " + s );
        }
        is.close();
    }
}
Posted on 2015-07-10 22:27:44
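I think you have misunderstood how training works.
You will need to provide a large selection of sentences/paragraphs containing the characters you want detected (!, ; and so on). This is because opennlp looks at features in the sentence to determine whether a character marks a real sentence split or is just punctuation inserted for some other reason.
Take the following example: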
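Helen is thirty ;;years;; old; she is actually very young!
In this line, ;;years;; is just some markup and should not be detected as a sentence split. (How often ;; occurs in the training data will determine whether it is treated as a sentence split.)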
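To make the idea of context concrete, here is a small hand-written sketch in plain Java. It is not how opennlp works internally: a trained model learns cues like these from the training data, whereas this code hard-codes one assumed rule (a delimiter counts only if it is not doubled and is followed by whitespace or the end of the text). The class name ContextAwareSplit and the EOS constant are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ContextAwareSplit {
    // Candidate end-of-sentence characters (assumption: same set as eos[] in the question).
    static final String EOS = ";.!?";

    // Hand-written rule, NOT opennlp's algorithm: the character at index i ends a
    // sentence only if it is an EOS character, has no EOS character directly next
    // to it (so doubled markup like ";;" is skipped), and is followed by
    // whitespace or the end of the text.
    static boolean isBoundary(String text, int i) {
        if (EOS.indexOf(text.charAt(i)) < 0) return false;
        boolean prevIsEos = i > 0 && EOS.indexOf(text.charAt(i - 1)) >= 0;
        boolean nextIsEos = i + 1 < text.length() && EOS.indexOf(text.charAt(i + 1)) >= 0;
        boolean atEndOrSpace = i + 1 == text.length()
                || Character.isWhitespace(text.charAt(i + 1));
        return !prevIsEos && !nextIsEos && atEndOrSpace;
    }

    // Cuts the text after every character that isBoundary accepts.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isBoundary(text, i)) {
                sentences.add(text.substring(start, i + 1).trim());
                start = i + 1;
            }
        }
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) sentences.add(rest);
        return sentences;
    }

    public static void main(String[] args) {
        String line = "Helen is thirty ;;years;; old; she is actually very young!";
        for (String s : split(line)) System.out.println(s);
    }
}
```

On the example line this gives two sentences, with ;;years;; left inside the first one. A fixed rule like this stops working as soon as the markup looks different, which is why a trained model needs plenty of examples of each delimiter in context.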
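In your example, you could also just use string.split() and split on the eos characters you entered, but that means you would also split the sentence above at the ;; pattern.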
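A rough sketch of that context-free approach, assuming plain Java and no opennlp at all. It is written as a loop instead of a literal string.split() call so that each delimiter stays attached to its piece; the class name NaiveEosSplit is made up, and the eos array is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveEosSplit {
    // Cuts the text after every end-of-sentence character, with no regard for context.
    static List<String> split(String text, char[] eos) {
        String eosChars = new String(eos);
        List<String> pieces = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.indexOf(text.charAt(i)) >= 0) {
                pieces.add(text.substring(start, i + 1).trim());
                start = i + 1;
            }
        }
        // Keep any trailing text that has no closing delimiter.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) pieces.add(rest);
        return pieces;
    }

    public static void main(String[] args) {
        char[] eos = {';', '.', '!', '?'};
        String line = "Helen is thirty ;;years;; old; she is actually very young!";
        for (String piece : split(line, eos)) System.out.println("[" + piece + "]");
    }
}
```

For the Helen line this produces six pieces instead of two, because every single ; inside the markup is treated as the end of a sentence.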
https://stackoverflow.com/questions/24700948