Question: Lemmatization java

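Stack Overflow user

Asked 2009-10-16 13:33:23

5 answers · 38.1K views · 0 followers · 24 votes

I am looking for a lemmatisation implementation for English in Java. I have found a few already, but I need something that does not need much memory to run (1 GB at most). Thanks. I do not need a stemmer.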

nlp

java

5 answers

Stack Overflow user

Answered 2011-01-13 13:07:47

The Stanford CoreNLP Java library contains a lemmatizer. It is a little resource intensive, but I have run it on my laptop with less than 512 MB of RAM.
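One way to check the memory figure on your own machine is to cap the heap with the standard `-Xmx` JVM option (for example `java -Xmx512m ...`) and print what the JVM reports before and after the lemmatizer is built. The following is only a minimal sketch: `HeapCheck` and its method names are hypothetical helpers, not part of Stanford CoreNLP, and the numbers depend on the JVM and the models loaded.

```java
// Hypothetical helper (not part of CoreNLP): reports JVM heap figures in megabytes.
class HeapCheck {
    private static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM may grow to, as limited by -Xmx. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently allocated and in use. */
    static long usedHeapMegabytes() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    public static void main(String[] args) {
        System.out.println("Max heap:  " + maxHeapMegabytes() + " MB");
        // Assumption: this is where you would construct the lemmatizer
        // and process some text before taking the second reading.
        System.out.println("Used heap: " + usedHeapMegabytes() + " MB");
    }
}
```

If the pipeline fails with an `OutOfMemoryError` under a given `-Xmx` value, that value is too small for the models being loaded.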

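To use it:

1. Download the jar files;
2. Create a new project in your editor of choice, or make an ant script that includes all of the jar files contained in the archive you just downloaded;
3. Create a new Java class as shown below (based on the snippet from Stanford's site);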

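import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;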

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }

        return lemmas;
    }
}
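The comment in the constructor notes that the pipeline loads a lot of models, so it should be built only once per execution. One possible way to enforce that is to wrap the expensive object in a small lazy holder that constructs it on first use and hands back the same instance afterwards. This is a sketch under assumptions: `Lazy` is a hypothetical helper written for this illustration, not a CoreNLP class.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical helper: builds an expensive object on first use, then reuses it.
final class Lazy<T> {
    private final Supplier<T> factory;
    private T value; // stays null until the first call to get()

    Lazy(Supplier<T> factory) {
        this.factory = Objects.requireNonNull(factory);
    }

    /** Returns the shared instance, creating it on the first call only. */
    synchronized T get() {
        if (value == null) {
            value = Objects.requireNonNull(factory.get(), "factory returned null");
        }
        return value;
    }

    public static void main(String[] args) {
        // Stand-in for an expensive constructor such as new StanfordLemmatizer().
        Lazy<String> shared = new Lazy<>(() -> {
            System.out.println("building (expensive)...");
            return "pipeline";
        });
        shared.get(); // triggers the build
        shared.get(); // reuses the instance, no second build
    }
}
```

With the class above on the classpath, a field such as `static final Lazy<StanfordLemmatizer> LEMMATIZER = new Lazy<>(StanfordLemmatizer::new);` would let every caller share one pipeline through `LEMMATIZER.get()`.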

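37 votes

Stack Overflow user

Answered 2013-11-11 22:58:24

Chris's answer about the Stanford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for them.

But one of his lines of code had a syntax error (he somehow swapped the closing parenthesis and the semicolon in the line that begins with "lemmas.add..."), and he forgot to include the imports.

As for the NoSuchMethodError, it is usually caused by the method not being declared public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h), that is not the problem. I suspect the problem is somewhere in the build path (I'm using Eclipse Kepler, so configuring the 33 jar files that I use in my project was no problem).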
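If the build path is the suspect, it can help to see which jar a class is actually loaded from at runtime. A NoSuchMethodError often means the version of a class found on the classpath differs from the one the code was compiled against. The sketch below uses only standard JDK calls; `ClassOrigin` is a hypothetical helper name, and passing `edu.stanford.nlp.util.Generics` as an argument assumes the CoreNLP jars are on the classpath.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class was loaded from.
class ClassOrigin {

    /** Describes the jar or directory that supplied the class, if known. */
    static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes from the JDK itself typically have no code source.
            return type.getName() + " <- (no code source, likely a JDK class)";
        }
        return type.getName() + " <- " + source.getLocation();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass fully qualified class names, e.g. edu.stanford.nlp.util.Generics
        for (String name : args) {
            System.out.println(describe(Class.forName(name)));
        }
    }
}
```

If the printed location is not the jar you expected, an older or duplicate copy of the library is probably shadowing it.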

Below is my slightly modified version of Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        /*
         * This is a pipeline that takes in a string and returns various analyzed linguistic forms. 
         * The String is tokenized via a tokenizer (such as PTBTokenizerAnnotator), 
         * and then other sequence model style annotation can be used to add things like lemmas, 
         * POS tags, and named entities. These are returned as a list of CoreLabels. 
         * Other analysis components build and store parse trees, dependency graphs, etc. 
         * 
         * This class is designed to apply multiple Annotators to an Annotation. 
         * The idea is that you first build up the pipeline by adding Annotators, 
         * and then you take the objects you wish to annotate and pass them in and 
         * get in return a fully annotated object.
         * 
         *  StanfordCoreNLP loads a lot of models, so you probably
         *  only want to do this once per execution
         */
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();
        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);
        // run all Annotators on this text
        this.pipeline.annotate(document);
        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }


    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "How could you be seeing into my eyes like open doors? \n"+
                "You led me down into my core where I've became so numb \n"+
                "Without a soul my spirit's sleeping somewhere cold \n"+
                "Until you find it there and led it back home \n"+
                "You woke me up inside \n"+
                "Called my name and saved me from the dark \n"+
                "You have bidden my blood and it ran \n"+
                "Before I would become undone \n"+
                "You saved me from the nothing I've almost become \n"+
                "You were bringing me to life \n"+
                "Now that I knew what I'm without \n"+
                "You can've just left me \n"+
                "You breathed into me and made me real \n"+
                "Frozen inside without your touch \n"+
                "Without your love, darling \n"+
                "Only you are the life among the dead \n"+
                "I've been living a lie, there's nothing inside \n"+
                "You were bringing me to life.";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }

}

Here are my results (I was very impressed; it caught "'s" as "is" (sometimes), and did almost everything else perfectly):

Starting Stanford Lemmatizer
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma

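[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, me, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, 's, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, me, up, inside, call, my, name, and, save, me, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, me, from, the, nothing, I, have, almost, become, you, be, bring, me, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, me, you, breathe, into, me, and, make, me, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, me, to, life, .]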

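17 votes

Stack Overflow user

Answered 2015-04-17 23:07:31

You can try a free Lemmatizer API here: http://twinword.com/lemmatizer.php

Scroll down to find the Lemmatizer endpoint.

This will let you turn "dogs" into "dog" and "abilities" into "ability".

If you pass in a POST or GET parameter called "text" with a string such as "walked plants":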

// These code snippets use an open-source library. http://unirest.io/java
HttpResponse<JsonNode> response = Unirest.post("[ENDPOINT URL]")
        .header("X-Mashape-Key", "[API KEY]")
        .header("Content-Type", "application/x-www-form-urlencoded")
        .header("Accept", "application/json")
        .field("text", "walked plants")
        .asJson();

You will get a response like this:

{
  "lemma": {
    "plant": 1,
    "walk": 1
  },
  "result_code": "200",
  "result_msg": "Success"
}
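The "lemma" object in this response appears to map each lemma to the number of times it occurred in the input. For comparison, the same shape can be produced locally from any list of lemmas, such as the list returned by the lemmatize method in the earlier answers. This is a rough sketch: `LemmaCounts` is a hypothetical helper, and it only mirrors the shape of the response, not the service's actual logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: tallies lemmas into a lemma -> count map.
class LemmaCounts {

    /** Counts occurrences of each lemma; TreeMap keeps keys in alphabetical order. */
    static Map<String, Integer> count(List<String> lemmas) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String lemma : lemmas) {
            counts.merge(lemma, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Lemmas for "walked plants", taken from the response above.
        System.out.println(count(Arrays.asList("walk", "plant")));
        // prints {plant=1, walk=1}
    }
}
```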

0 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link: https://stackoverflow.com/questions/1578062
