
Mixed search with XML fields in Lucene

Stack Overflow user
Asked 2011-10-12 03:47:30
Answers 1, views 261, followers 0, votes 2

I have a corpus of documents like the following:

<doc>
text sample text <x>text</x> words lipsum words words <x>text</x> some other text
</doc>

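I want to be able to search for phrases that occur within a certain number of tokens of an annotation (the text in <x>). How can I index and search like this?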


1 Answer

Stack Overflow user

Answered 2012-01-13 06:22:28

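You can use a custom analyzer to parse your XML stream. I hacked one together that splits on whitespace, '>' and '/', so that an opening tag is indexed as a token with a leading '<' (for example, <x> becomes '<x') and can be matched like any other term.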
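To make the splitting rule concrete before the full Lucene test, here is a minimal sketch in plain Java. It does not use Lucene at all; the class name XmlSplitSketch and the method tokenize are made up for illustration, and only the isTokenChar rule mirrors the answer.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; no Lucene classes are involved.
class XmlSplitSketch {
    // Same rule as the answer's tokenizer: whitespace, '/' and '>' end a token.
    static boolean isTokenChar(char c) {
        return !(Character.isWhitespace(c) || c == '/' || c == '>');
    }

    // Collects maximal runs of token characters, dropping the separators.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [sample, text, <x, test<, x, words]
        System.out.println(tokenize("sample text <x>test</x> words"));
    }
}
```

Note the side effect on closing tags: </x> leaves a trailing '<' on the preceding word (test<) and a bare x token, so only the opening tag yields a clean '<x' term to search for.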

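// Imports are not part of the original answer; they assume Lucene 3.0.x, Guava and JUnit 4.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class SpanQueryTests {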
    private IndexSearcher searcher;
    private IndexReader reader;
    private Analyzer analyzer;

    static class XMLTokenizer extends CharTokenizer {
        public XMLTokenizer(Reader input) {
            super(input);
        }

        final static Set<Character> chars = ImmutableSet.of('/', '>');

        @Override
        protected boolean isTokenChar(char c) {
            return !(Character.isWhitespace(c) || chars.contains(c));
        }
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new XMLTokenizer(reader);
            }

            @Override
            public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
                Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
                if (tokenizer == null) {
                    tokenizer = new XMLTokenizer(reader);
                    setPreviousTokenStream(tokenizer);
                } else
                    tokenizer.reset(reader);
                return tokenizer;
            }
        };
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> docs = ImmutableList.of("<doc>text sample text <x>test</x> words lipsum words words " +
                                                              "<x>text</x> some other text </doc>",
                                                             "<foobar>test</foobar> some more text flop");
        int id = 0;
        for (String content : docs) {
            Document doc = new Document();
            // id is incremented once per document, here, so ids are 0, 1, ...
            doc.add(new Field("id", String.valueOf(id++), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        searcher = new IndexSearcher(dir);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testTermNearQuery() throws Exception {
        SpanTermQuery tq1 = new SpanTermQuery(new Term("content", "lipsum"));
        dumpSpans(tq1);
        SpanTermQuery tq2 = new SpanTermQuery(new Term("content", "other"));
        dumpSpans(tq2);
        SpanTermQuery tq3 = new SpanTermQuery(new Term("content", "<x"));
        dumpSpans(tq3);
        SpanNearQuery snq1 = new SpanNearQuery(new SpanQuery[] { tq1, tq3 }, 2, false);
        dumpSpans(snq1);
        SpanNearQuery snq2 = new SpanNearQuery(new SpanQuery[] { tq2, tq3 }, 2, false);
        dumpSpans(snq2);
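    }

    // dumpSpans is called above but not shown in the original answer. This is an
    // assumed reconstruction, modeled on the printed output, for Lucene 3.0.x.
    private void dumpSpans(SpanQuery query) throws IOException {
        System.out.println("query " + query);
        // Collect per-document scores so they can be printed next to each span.
        Map<Integer, Float> scores = new HashMap<Integer, Float>();
        for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
            scores.put(sd.doc, sd.score);
        }
        Spans spans = query.getSpans(reader);
        int count = 0;
        while (spans.next()) {
            count++;
            // Re-tokenize the stored content and bracket the matched span with < and >.
            String content = reader.document(spans.doc()).get("content");
            TokenStream stream = analyzer.tokenStream("content", new StringReader(content));
            TermAttribute term = stream.addAttribute(TermAttribute.class);
            StringBuilder line = new StringBuilder("   ");
            for (int pos = 0; stream.incrementToken(); pos++) {
                if (pos == spans.start()) line.append('<');
                line.append(term.term());
                if (pos + 1 == spans.end()) line.append('>');
                line.append(' ');
            }
            System.out.println(line.append('(').append(scores.get(spans.doc())).append(')'));
        }
        if (count == 0) System.out.println("    NO spans");
        System.out.println();
    }
}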

The output is:

query content:lipsum
   <doc text sample text <x test< x words <lipsum> words words <x text< x some other text < doc (0.15467961)

query content:other
   <doc text sample text <x test< x words lipsum words words <x text< x some <other> text < doc (0.15467961)

query content:<x
   <doc text sample text <<x> test< x words lipsum words words <x text< x some other text < doc (0.21875)
   <doc text sample text <x test< x words lipsum words words <<x> text< x some other text < doc (0.21875)

query spanNear([content:lipsum, content:<x], 2, false)
   <doc text sample text <x test< x words <lipsum words words <x> text< x some other text < doc (0.19565594)

query spanNear([content:other, content:<x], 2, false)
    NO spans
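The last two results can be checked by hand. The sketch below is plain Java, not Lucene code: it hardcodes the token sequence of the first document as it appears in the listing above and, assuming that the slop of an unordered near match between two single terms is the number of positions in the covering window not occupied by those two terms, computes the smallest such slop. The names SlopSketch and minSlop are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; the slop formula is an assumption stated in the lead-in.
class SlopSketch {
    // Tokens of the first document, copied from the result listing (positions 0..18).
    static final List<String> TOKENS = Arrays.asList(
        "<doc", "text", "sample", "text", "<x", "test<", "x", "words", "lipsum",
        "words", "words", "<x", "text<", "x", "some", "other", "text", "<", "doc");

    // Smallest slop over all pairs of occurrences of two distinct terms a and b,
    // or -1 if either term is absent.
    static int minSlop(List<String> tokens, String a, String b) {
        int best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (i == j || !tokens.get(j).equals(b)) continue;
                int width = Math.abs(i - j) + 1; // positions covered by the window
                int slop = width - 2;            // minus the two matched terms
                if (best < 0 || slop < best) best = slop;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(minSlop(TOKENS, "lipsum", "<x")); // prints 2
        System.out.println(minSlop(TOKENS, "other", "<x"));  // prints 3
    }
}
```

Under that assumption, lipsum (position 8) and the second <x (position 11) need a slop of exactly 2, so the query with slop 2 matches, while other (position 15) is 3 away from the nearest <x, which is consistent with the "NO spans" result.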
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/7731650
