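I am looking for an English lemmatisation implementation in Java. I have already found a few, but I need something that does not need too much memory to run (1 GB at most). Thanks. I do not need a stemmer.
Posted on 2011-01-13 13:07:47
The Stanford CoreNLP Java library contains a lemmatizer. It is a little resource intensive, but I have run it on my laptop with less than 512 MB of RAM.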
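If you want to check the memory figure on your own machine, one rough approach is to cap the heap with the standard `-Xmx` JVM option (for example `java -Xmx512m ...`) and print what the JVM reports. The sketch below assumes nothing about CoreNLP; `HeapCheck` is a hypothetical helper name, and the numbers it prints depend on your JVM and flags.

```java
// Hypothetical helper (not part of CoreNLP): reports the JVM heap limits.
final class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Heap currently in use (allocated minus free), in megabytes. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with e.g. "java -Xmx512m HeapCheck" to see the cap take effect.
        System.out.println("Max heap (MB): " + maxHeapMb());
        System.out.println("Used heap (MB): " + usedHeapMb());
    }
}
```

Calling `usedHeapMb()` right after the pipeline has been constructed gives a rough idea of how much the loaded models cost.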
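To use it:
1. Download the jar files.
2. Create a new project in your editor of choice, or write an ant script, that includes all of the jar files contained in the archive you just downloaded.
3. Create a new Java class as shown below (based on the snippet from Stanford's site):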
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // Run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }
}
Posted on 2013-11-11 22:58:24
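Chris's answer about the Stanford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for them.
But one line of his code had a syntax error (he somehow swapped the closing parentheses and the semicolon in the line that begins with "lemmas.add..."), and he forgot to include the imports.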
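As for the NoSuchMethodError, it is often put down to the method not being declared public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h), that is not the problem here. I suspect the problem is somewhere in the build path, for example a jar of a different version than the one the code was compiled against (I'm using Eclipse Kepler, so configuring the 33 jar files I use in my project was no problem).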
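If you want to test the build-path theory, one option is to ask the JVM where it actually loaded a class from and whether that copy exposes the method being called. The following is a minimal sketch using only the standard reflection API; `ClasspathCheck` is a hypothetical helper name, and the default class name in `main` is just a placeholder (pass `edu.stanford.nlp.util.Generics` as the first argument to inspect the class linked above).

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.security.ProtectionDomain;

// Hypothetical helper (not part of CoreNLP) for diagnosing classpath mix-ups.
final class ClasspathCheck {

    /** Returns the jar or directory a class was loaded from, if the JVM exposes it. */
    static String locationOf(Class<?> cls) {
        ProtectionDomain domain = cls.getProtectionDomain();
        CodeSource source = (domain == null) ? null : domain.getCodeSource();
        if (source == null || source.getLocation() == null) {
            // JDK classes loaded by the bootstrap loader usually have no code source.
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    /** Returns true if the class has a public method with the given name. */
    static boolean hasPublicMethod(Class<?> cls, String name) {
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Placeholder default; pass the class you are investigating as args[0].
        String className = args.length > 0 ? args[0] : "java.util.ArrayList";
        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(cls));
        if (args.length > 1) {
            System.out.println("has public method " + args[1] + ": "
                    + hasPublicMethod(cls, args[1]));
        }
    }
}
```

If the reported location is a jar you did not expect, or the method is missing from the loaded copy, a duplicate or outdated jar on the build path is a likely culprit.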
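Below are the small modifications I made to Chris's code, along with an example (my apologies to the original artists, because the example butchers their perfect lyrics):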
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        /*
         * This is a pipeline that takes in a string and returns various analyzed linguistic forms.
         * The String is tokenized via a tokenizer (such as PTBTokenizerAnnotator),
         * and then other sequence model style annotation can be used to add things like lemmas,
         * POS tags, and named entities. These are returned as a list of CoreLabels.
         * Other analysis components build and store parse trees, dependency graphs, etc.
         *
         * This class is designed to apply multiple Annotators to an Annotation.
         * The idea is that you first build up the pipeline by adding Annotators,
         * and then you take the objects you wish to annotate and pass them in and
         * get in return a fully annotated object.
         *
         * StanfordCoreNLP loads a lot of models, so you probably
         * only want to do this once per execution
         */
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // Run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }

    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "How could you be seeing into my eyes like open doors? \n"+
                "You led me down into my core where I've became so numb \n"+
                "Without a soul my spirit's sleeping somewhere cold \n"+
                "Until you find it there and led it back home \n"+
                "You woke me up inside \n"+
                "Called my name and saved me from the dark \n"+
                "You have bidden my blood and it ran \n"+
                "Before I would become undone \n"+
                "You saved me from the nothing I've almost become \n"+
                "You were bringing me to life \n"+
                "Now that I knew what I'm without \n"+
                "You can've just left me \n"+
                "You breathed into me and made me real \n"+
                "Frozen inside without your touch \n"+
                "Without your love, darling \n"+
                "Only you are the life among the dead \n"+
                "I've been living a lie, there's nothing inside \n"+
                "You were bringing me to life.";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }
}
Here are my results (I was very impressed; it caught "'s" as "is" (sometimes), and did almost everything else perfectly):
Starting Stanford Lemmatizer
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma
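[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, I, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, 's, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, I, up, inside, call, my, name, and, save, I, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, I, from, the, nothing, I, have, almost, become, you, be, bring, I, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, I, you, breathe, into, I, and, make, I, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, I, to, life, .]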
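Note that the result still contains punctuation tokens such as ? and , alongside the words. If you only want the words, a post-processing step along these lines might do. This is a sketch under my own assumptions: `LemmaFilter` is a hypothetical helper, not part of CoreNLP, and "contains at least one letter or digit" is just one possible definition of a word (it keeps clitics like 's).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): drops punctuation-only lemmas.
final class LemmaFilter {

    /** Keeps only lemmas that contain at least one letter or digit. */
    static List<String> wordsOnly(List<String> lemmas) {
        List<String> words = new ArrayList<>();
        for (String lemma : lemmas) {
            if (lemma != null && lemma.chars().anyMatch(Character::isLetterOrDigit)) {
                words.add(lemma);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lemmas = Arrays.asList(
                "open", "door", "?", "spirit", "'s", ",", "darling", ".");
        System.out.println(wordsOnly(lemmas)); // prints [open, door, spirit, 's, darling]
    }
}
```

You would call it on the return value of `lemmatize`, for example `LemmaFilter.wordsOnly(slem.lemmatize(text))`.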
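Posted on 2015-04-17 23:07:31
You can try a free Lemmatizer here: http://twinword.com/lemmatizer.php
Scroll down to find the Lemmatizer endpoint.
This will let you turn "dogs" into "dog" and "abilities" into "ability".
If you pass in a POST or GET parameter called "text" containing a string like "walked plants":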
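// These code snippets use an open-source library. http://unirest.io/java
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;

// asJson() throws UnirestException, so call this from code that handles it.
HttpResponse<JsonNode> response = Unirest.post("[ENDPOINT URL]")
    .header("X-Mashape-Key", "[API KEY]")
    .header("Content-Type", "application/x-www-form-urlencoded")
    .header("Accept", "application/json")
    .field("text", "walked plants")
    .asJson();
You will get a response like this: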
{
  "lemma": {
    "plant": 1,
    "walk": 1
  },
  "result_code": "200",
  "result_msg": "Success"
}
https://stackoverflow.com/questions/1578062