Q: StanfordCoreNLP does not work in my way

Stack Overflow user

Asked 2014-04-15 14:39:54

1 answer · 1K views · 0 followers · 0 votes

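I am using the code below, but the result is not what I expected. The output is [machine, Learning], and I want [machine, learn]. How can I get that? Also, when my input is "biggest bigger", I would like a result like [big, big], but I only get [biggest bigger].

(PS: I only added these four jars in Eclipse: joda-time.jar, stanford-corenlp-3.3.1-models.jar, stanford-corenlp-3.3.1.jar, xom.jar. Do I need to add any more?)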

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");


        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();
        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);
        // run all Annotators on this text
        this.pipeline.annotate(document);
        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }


    // Test
    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "Machine Learning\n";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }

}

Tags: java, nlp, stanford-nlp, stemming, lemmatization

1 Answer

Stack Overflow user

Accepted answer

Answered 2014-04-16 04:12:14

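Lemmatization should ideally return the canonical form (known as the "lemma" or "headword") of a group of words. This canonical form, however, is not always what we intuitively expect. For example, you expect "learning" to yield the lemma "learn". But the noun "learning" has the lemma "learning", while only the present continuous verb "learning" has the lemma "learn". In case of ambiguity, the lemmatizer should rely on information from the part-of-speech tag.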
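To make the role of the part-of-speech tag concrete, here is a minimal toy sketch in plain Java. It is not CoreNLP code: the class name `PosAwareLemmaDemo` and its two-entry table are hypothetical and exist only to show that the same surface word can map to different lemmas depending on its Penn Treebank tag (NN for a noun, VBG for a present participle).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: a hand-written table, not CoreNLP's lemmatizer.
final class PosAwareLemmaDemo {

    // Hypothetical lookup keyed by "word/TAG" (Penn Treebank tags).
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("learning/NN", "learning"); // noun: the lemma is the word itself
        TABLE.put("learning/VBG", "learn");   // present participle: the lemma is the verb
    }

    // Returns the lemma for (word, tag); falls back to the lowercased word when unknown.
    static String lemma(String word, String tag) {
        String lower = word.toLowerCase(Locale.ROOT);
        return TABLE.getOrDefault(lower + "/" + tag, lower);
    }

    public static void main(String[] args) {
        System.out.println(lemma("learning", "NN"));
        System.out.println(lemma("learning", "VBG"));
    }
}
```

With a real pipeline, the tag comes from the `pos` annotator, so whether "learning" is reduced to "learn" depends on how the tagger labels it in context.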

OK, that explains "machine learning", but what about big, bigger and biggest?

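Lemmatization depends on morphological analysis. Stanford's Morphology class computes the base form of English words by removing only inflections (not derivational morphology). That is, it handles only noun plurals, pronoun case, and verb endings, and not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex. I could not find the original version, but a Java version appears to be available.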

That is why "biggest" will not yield "big".
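The behaviour described above can be pictured with a small sketch. This is emphatically not the Stanford finite-state transducer: `InflectionOnlyDemo` is a hypothetical class with two naive suffix rules, written only to show what "inflection-only" coverage means in practice. Plural nouns and verb endings are reduced, while comparative (JJR) and superlative (JJS) adjectives pass through unchanged.

```java
import java.util.Locale;

// Toy sketch: a few hand-written suffix rules, NOT Stanford's Morphology class.
final class InflectionOnlyDemo {

    // Strips a small set of inflectional endings, chosen by Penn Treebank tag.
    static String baseForm(String word, String tag) {
        String w = word.toLowerCase(Locale.ROOT);
        if (tag.equals("NNS") && w.endsWith("s") && w.length() > 2) {
            return w.substring(0, w.length() - 1); // plural noun: "cats" -> "cat"
        }
        if (tag.equals("VBG") && w.endsWith("ing") && w.length() > 4) {
            return w.substring(0, w.length() - 3); // verb ending: "learning" -> "learn"
        }
        // Deliberately no rule for JJR/JJS (comparative and superlative adjectives),
        // so "bigger" and "biggest" are returned as they are.
        return w;
    }

    public static void main(String[] args) {
        System.out.println(baseForm("cats", "NNS"));
        System.out.println(baseForm("biggest", "JJS"));
    }
}
```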

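The WordNet lexical database does resolve to the correct lemma, though. I usually use WordNet for lemmatization tasks and have not found any major problems so far. Two other well-known tools that handle your example correctly are

CST Lemmatizer
MorphAdorner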
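As a rough illustration of why a dictionary-backed approach gets "bigger" and "biggest" right, the sketch below is loosely modelled on the way WordNet-style lookup is usually described: consult an exception list for irregular forms first, then try suffix rules and accept a candidate only if it is a known word. Everything here is an assumption made for illustration. `MorphyStyleAdjectiveDemo`, its three-word lexicon, and its two exception entries are hand-made samples, not the real WordNet data or API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy sketch of a dictionary-backed adjective lemmatizer (not the WordNet API).
final class MorphyStyleAdjectiveDemo {

    // Tiny hand-made lexicon standing in for the dictionary's adjective entries.
    private static final Set<String> LEXICON =
            new HashSet<>(Arrays.asList("big", "small", "nice"));

    // Hand-made exception list: forms that suffix rules alone cannot reduce.
    private static final Map<String, String> EXCEPTIONS = new HashMap<>();
    static {
        EXCEPTIONS.put("bigger", "big");
        EXCEPTIONS.put("biggest", "big");
    }

    // Suffix rewrite rules as {ending, replacement} pairs.
    private static final String[][] RULES = {
            {"er", ""}, {"est", ""}, {"er", "e"}, {"est", "e"}
    };

    static String lemma(String adjective) {
        String w = adjective.toLowerCase(Locale.ROOT);
        String irregular = EXCEPTIONS.get(w);
        if (irregular != null) {
            return irregular;               // exception list wins
        }
        if (LEXICON.contains(w)) {
            return w;                       // already a base form
        }
        for (String[] rule : RULES) {
            if (w.endsWith(rule[0])) {
                String candidate = w.substring(0, w.length() - rule[0].length()) + rule[1];
                if (LEXICON.contains(candidate)) {
                    return candidate;       // accept only known words
                }
            }
        }
        return w;                           // nothing matched: leave unchanged
    }

    public static void main(String[] args) {
        System.out.println(lemma("biggest"));
        System.out.println(lemma("smaller"));
    }
}
```

Validating each candidate against the lexicon is what keeps the suffix rules from producing non-words such as "bigg" for "bigger".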

Votes: 4

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/23086961
