How to access the Lucene POS attribute (Japanese Kuromoji analyzer)

Stack Overflow user

Asked 2017-03-16 01:30:58

Answers: 1, Views: 328, Followers: 0, Votes: 0

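I am trying to tokenize Japanese text and extract the part-of-speech (POS) attribute, as explained on the Kuromoji website.

Kuromoji / Lucene ships with a PartOfSpeechAttributeImpl attribute implementation that should provide the POS data, but I cannot extract it: I get a NullPointerException on the pos.getPartOfSpeech() line. The CharTermAttribute prints fine. What am I missing or doing wrong?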

    String content = "こんばんは 今日寒かったですね 今日、頂いたお菓子があまりにも美味しくて 上り羊羹 御利益ありそうな、ネーミング ぷるんぷるんの、上品な水羊羹です！ そして、スイーツもう一品！ 先日アップしたお友達の干し芋。";

    Analyzer analyzer = new JapaneseAnalyzer();
    TokenStream stream = analyzer.tokenStream("TEXT", content);

    Iterator<AttributeImpl> it = stream.getAttributeImplsIterator();
    while (it.hasNext()) {
        AttributeImpl attr = it.next();
        System.out.println(attr.getClass());
    }
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    PartOfSpeechAttributeImpl pos = stream.getAttribute(PartOfSpeechAttributeImpl.class);

    stream.reset();
    while (stream.incrementToken()) {
        System.out.println("[" + term.toString() + "]: ");
        System.out.println(pos.getPartOfSpeech());
    }

The first while loop does show that a PartOfSpeechAttribute has been added to the token stream. This is the output of the print statement:

   class org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl
   class org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttributeImpl
   class org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttributeImpl
   class org.apache.lucene.analysis.ja.tokenattributes.ReadingAttributeImpl
   class org.apache.lucene.analysis.ja.tokenattributes.InflectionAttributeImpl
   class org.apache.lucene.analysis.tokenattributes.KeywordAttributeImpl

I also followed advice from other Stack Overflow posts to call addAttribute() instead of getAttribute() with PartOfSpeechAttributeImpl. But that gives me an IllegalArgumentException (even though this AttributeImpl implements a Lucene Attribute):

   java.lang.IllegalArgumentException: addAttribute() only accepts an interface that extends Attribute, but 
   org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttributeImpl does not fulfil this contract.
      at   org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:210)
      at ...

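FYI: we currently use Lucene 6.0.0. Indexing and searching Japanese text work fine, because the Kuromoji package is included in the Lucene distribution by default (you only need to select JapaneseAnalyzer). This tokenization happens outside of indexing and searching, so it is not bound to a specific field; it serves a different purpose.

Thanks!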

lucene

1 Answer

Stack Overflow user

Answered 2017-05-23 07:10:17

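PartOfSpeechAttributeImpl is the wrong class to request; it should be PartOfSpeechAttribute. The token stream looks attributes up by their interface, not by the implementation class, which is why getAttribute() returned null for the Impl class and addAttribute() rejected it.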
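To make the symptoms in the question easier to follow, here is a minimal sketch of a registry that keys attributes by interface. It is a simplified, hypothetical model and not Lucene's AttributeSource: the names AttributeRegistryDemo, Registry, PosAttribute and PosAttributeImpl are invented for illustration, and the model's add() does not create instances the way Lucene's addAttribute() does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of an interface-keyed attribute registry.
// This is NOT Lucene code; every name here is made up for illustration.
final class AttributeRegistryDemo {

    interface Attribute {}

    interface PosAttribute extends Attribute {
        String getPartOfSpeech();
    }

    static final class PosAttributeImpl implements PosAttribute {
        @Override
        public String getPartOfSpeech() {
            return "名詞-一般";
        }
    }

    static final class Registry {
        // Keys are attribute interfaces; values are the implementing instances.
        private final Map<Class<? extends Attribute>, Attribute> byInterface = new LinkedHashMap<>();

        <T extends Attribute> void register(Class<T> iface, T impl) {
            byInterface.put(iface, impl);
        }

        // Returns null when nothing is registered under exactly this key.
        <T extends Attribute> T get(Class<T> key) {
            return key.cast(byInterface.get(key));
        }

        // Rejects implementation classes; otherwise returns the registered instance (or null).
        <T extends Attribute> T add(Class<T> key) {
            if (!key.isInterface()) {
                throw new IllegalArgumentException(key.getName() + " is not an attribute interface");
            }
            return get(key);
        }
    }

    static Registry sampleRegistry() {
        Registry registry = new Registry();
        registry.register(PosAttribute.class, new PosAttributeImpl());
        return registry;
    }

    public static void main(String[] args) {
        Registry registry = sampleRegistry();
        // The implementation instance is stored, but not under its own class.
        System.out.println(registry.get(PosAttributeImpl.class));
        // Looking up the interface finds it.
        System.out.println(registry.get(PosAttribute.class).getPartOfSpeech());
        try {
            registry.add(PosAttributeImpl.class);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model the implementation object is present in the registry, much as it appears in the question's iterator output, yet a lookup by the implementation class finds nothing and an add by that class is rejected. The snippet that follows applies the same idea with the real Lucene interface.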

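    // Request attributes by their interfaces, not by the *Impl classes.
    CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
    PartOfSpeechAttribute pattr = stream.addAttribute(PartOfSpeechAttribute.class);
    try {
        stream.reset();
        while (stream.incrementToken()) {
            // The POS levels are joined with "-", e.g. "名詞-一般".
            String[] pos = pattr.getPartOfSpeech().split("-");
            // Token and result are not defined in this snippet; they are
            // presumably the answerer's own value class and result list.
            Token token = new Token(cattr.toString(), pos);
            result.add(token);
        }
        stream.end();
        stream.close();
    } catch (IOException e) {
        return result;
    } finally {
        analyzer.close();
    }
    return result;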
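The snippet above splits the POS string on "-". As a rough sketch of that step in isolation, the helper below breaks a hyphen-joined tag into its levels using only the JDK. PosLevels, levels and topLevel are hypothetical names, the sample tags assume the hyphen-joined IPADIC-style form that Kuromoji reports, and the null check is a defensive assumption rather than something the original answer does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits a hyphen-joined POS tag.
final class PosLevels {

    // "名詞-固有名詞-人名-姓" becomes ["名詞", "固有名詞", "人名", "姓"].
    static List<String> levels(String partOfSpeech) {
        if (partOfSpeech == null || partOfSpeech.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(partOfSpeech.split("-"));
    }

    // Returns the coarsest category (the first level), or "" if there is none.
    static String topLevel(String partOfSpeech) {
        List<String> parts = levels(partOfSpeech);
        return parts.isEmpty() ? "" : parts.get(0);
    }

    public static void main(String[] args) {
        System.out.println(levels("名詞-固有名詞-人名-姓"));
        System.out.println(topLevel("助詞-格助詞-一般"));
    }
}
```

Keeping only the first level is one way to group tokens into coarse categories such as nouns or particles; whether that granularity is enough depends on the use case.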

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/42823757
