Q: Term vector frequency in Lucene 4.0

Stack Overflow user

Asked on 2012-08-24 02:41:29

Answers: 4, Views: 13.4K, Followers: 0, Votes: 9

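I'm migrating from Lucene 3.6 to Lucene 4.0-beta. In Lucene 3.x, IndexReader has a method IndexReader.getTermFreqVectors(), which I can use to extract the frequency of each term in a given document and field.

This has now been replaced by IndexReader.getTermVectors(), which returns Fields, and IndexReader.getTermVector(), which returns the Terms of a single field. How can I use these (or other methods) to extract the term frequencies for a document and a field?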

lucene

4 Answers

Stack Overflow user

Answered on 2013-04-24 23:35:09

Maybe this will help you:

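import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

// assumes reader is an open IndexReader and docID is a valid document ID
// get the term vector for one document and one field
Terms terms = reader.getTermVector(docID, "fieldName");

// terms is null when no term vector was stored for this document and field
if (terms != null && terms.size() > 0) {
    // access the terms for this field
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term = null;

    // explore the terms for this field
    while ((term = termsEnum.next()) != null) {
        // enumerate through documents, in this case only one
        DocsEnum docsEnum = termsEnum.docs(null, null);
        int docIdEnum;
        while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            // print the term, the document ID and the term frequency in that document
            System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
        }
    }
}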

Votes: 14

Stack Overflow user

Answered on 2013-01-22 08:21:53

See this related question in particular:

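// assumes reader is an open IndexReader, docId is a valid document ID,
// and CONTENT is the name of a field indexed with term vectors
Terms vector = reader.getTermVector(docId, CONTENT);
Map<String, Integer> frequencies = new HashMap<>();
// terms was not declared in the original snippet; assumed here to collect the distinct terms
Set<String> terms = new HashSet<>();
// vector is null when the document has no term vector for this field
if (vector != null) {
    TermsEnum termsEnum = vector.iterator(null);
    BytesRef text = null;
    while ((text = termsEnum.next()) != null) {
        String term = text.utf8ToString();
        // a term vector covers a single document, so totalTermFreq()
        // is the term's frequency in that document
        int freq = (int) termsEnum.totalTermFreq();
        frequencies.put(term, freq);
        terms.add(term);
    }
}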

Votes: 3

Stack Overflow user

Answered on 2012-08-29 12:46:35

There is various documentation on how to use the flexible indexing APIs:

http://lucene.apache.org/core/4_0_0-BETA/MIGRATE.html
https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/index/package-summary.html#postings

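The Fields/Terms API for accessing a document's term vectors is exactly the same API used to access the postings lists, because a term vector is really just a miniature inverted index for a single document.

So it is perfectly fine to use all of those examples as-is, though you can take some shortcuts because you know there is only one document in this "miniature inverted index". For example, if you just want the frequency of a term, you can seek to it and use an aggregate statistic such as totalTermFreq (see https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/index/package-summary.html#stats) rather than actually opening a DocsEnum that will only ever enumerate a single document.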
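To make the shortcut concrete, here is a rough plain-Java sketch of why the aggregate statistic and the per-document frequency coincide when the index holds exactly one document. It deliberately does not use Lucene: the class TermVectorSketch and the methods indexSingleDoc, totalTermFreq and freqInDoc are hypothetical names made up for this illustration, and the whitespace tokenizer is a stand-in for a real analyzer.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model (not Lucene's API): a term vector viewed as an inverted index over one document.
final class TermVectorSketch {

    // Build a tiny inverted index (term -> docId -> frequency) from a single document.
    // Terms are kept sorted, loosely mirroring how a terms enumeration walks them in order.
    static Map<String, Map<Integer, Integer>> indexSingleDoc(int docId, String text) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            index.computeIfAbsent(token, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
        return index;
    }

    // Aggregate statistic: occurrences of the term summed over every document in the index.
    static long totalTermFreq(Map<String, Map<Integer, Integer>> index, String term) {
        Map<Integer, Integer> postings = index.get(term);
        if (postings == null) {
            return 0L;
        }
        long total = 0L;
        for (int freq : postings.values()) {
            total += freq;
        }
        return total;
    }

    // Per-document frequency, found by walking the postings (the "open an enumerator" route).
    static int freqInDoc(Map<String, Map<Integer, Integer>> index, String term, int docId) {
        Map<Integer, Integer> postings = index.get(term);
        if (postings == null) {
            return 0;
        }
        for (Map.Entry<Integer, Integer> posting : postings.entrySet()) {
            if (posting.getKey() == docId) {
                return posting.getValue();
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> vector = indexSingleDoc(7, "to be or not to be");
        // With a single document, both routes report the same number for every term.
        for (String term : vector.keySet()) {
            System.out.println(term + " " + totalTermFreq(vector, term) + " " + freqInDoc(vector, term, 7));
        }
    }
}
```

In the model, totalTermFreq never needs the document ID, which is the point of the shortcut: with only one document in the index, the sum over all postings has a single summand.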

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/12098083
