Lucene query (with shingles?)

Stack Overflow user
Asked on 2012-01-06 08:13:50
Answers: 2, Views: 2.7K, Followers: 0, Votes: 1

I have a Lucene index that contains documents like this:

_id     |           Name            |        Alternate Names      |    Population

123       Bosc de Planavilla               (some names here in          5000
345       Planavilla                       other languages)             20000
456       Bosc de la Planassa                                           1000
567       Bosc de Plana en Blanca                                       100000

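What is the best Lucene query type to use, and how should I construct it, given that I need the following:

  1. If a user queries for "italian restaurants near Bosc de Planavilla", I want the document with id 123 to be returned, because it contains an exact match of the document's name.
  2. If a user queries for "italian restaurants near Planavilla", I want the document with id 345, because the query contains an exact match and it has the highest population.
  3. If a user queries for "italian restaurants near Bosc", I want 567, because the query contains "Bosc" and, of the 3 "Bosc" documents, it has the highest population.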

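There are probably many other use cases, but you get the feeling of what I need.

What kind of query will do this for me? Should I generate word n-grams (shingles), create an ORed Boolean query from those tokens, and then apply custom scoring? Or will a regular phrase query do? I also saw DisjunctionMaxQuery, but I don't know whether it is what I am looking for.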
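
To make the three requirements concrete, here is a minimal plain-Java sketch of the "shingles plus custom ranking" idea. It is not Lucene code: `PlaceResolver`, `Place`, `shingles`, `resolve`, and `sample` are hypothetical names, the in-memory map only stands in for an index, lowercasing is an added assumption, and the ranking rule (longest matching shingle first, then highest population) is one possible reading of the requirements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the index; it does not call Lucene.
class PlaceResolver {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("de", "la", "en"));
    private static final int MAX_SHINGLE = 3;

    static final class Place {
        final int id;
        final String name;
        final int population;

        Place(int id, String name, int population) {
            this.id = id;
            this.name = name;
            this.population = population;
        }
    }

    // Maps each shingle to the places whose name contains it.
    private final Map<String, List<Place>> index = new HashMap<>();

    // Lowercases (an assumption), drops stop words, then emits all word n-grams of size 1..MAX_SHINGLE.
    static List<String> shingles(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            for (int len = 1; len <= MAX_SHINGLE && start + len <= words.size(); len++) {
                out.add(String.join(" ", words.subList(start, start + len)));
            }
        }
        return out;
    }

    void add(Place place) {
        for (String s : shingles(place.name)) {
            index.computeIfAbsent(s, k -> new ArrayList<>()).add(place);
        }
    }

    // Assumed ranking: the longest matching shingle wins; ties go to the larger population.
    // Returns null when no shingle of the query matches any place.
    Place resolve(String userQuery) {
        Place best = null;
        int bestLen = 0;
        for (String s : shingles(userQuery)) {
            int len = s.split(" ").length;
            for (Place p : index.getOrDefault(s, Collections.<Place>emptyList())) {
                if (best == null || len > bestLen
                        || (len == bestLen && p.population > best.population)) {
                    best = p;
                    bestLen = len;
                }
            }
        }
        return best;
    }

    // The four documents from the table in the question.
    static PlaceResolver sample() {
        PlaceResolver r = new PlaceResolver();
        r.add(new Place(123, "Bosc de Planavilla", 5000));
        r.add(new Place(345, "Planavilla", 20000));
        r.add(new Place(456, "Bosc de la Planassa", 1000));
        r.add(new Place(567, "Bosc de Plana en Blanca", 100000));
        return r;
    }
}
```

Calling `PlaceResolver.sample().resolve("italian restaurants near Bosc")` walks every shingle of the query, so unrelated words such as "italian" simply find no match. A real index would replace the map lookup with a query per shingle.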

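The idea, as you may have understood by now, is to find the exact location that the user implied in their query. From that I can start my geo search and add some further querying around it.

What is the best approach?

Thanks in advance.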

2 Answers

Stack Overflow user

Accepted answer

Posted on 2012-01-13 00:06:46

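Here is the code with sorting included as well. That said, I think it would make more sense to add a custom score that takes the size of the city into account, rather than brute-force sorting on population. Also note that this uses the FieldCache, which may not be the best solution with regard to memory usage.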

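import java.io.IOException;
import java.io.Reader;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.FieldComparatorSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ShingleFilterTests {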
    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;
    private QueryParser qp;
    private Sort sort;

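    // Whitespace tokens, minus the stop words "de", "la", "en"; shingles of up to
    // `shingles` words are added when shingles > 0.
    public static Analyzer createAnalyzer(final int shingles) {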
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream tokenizer = new WhitespaceTokenizer(reader);
                tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en"));
                if (shingles > 0) {
                    tokenizer = new ShingleFilter(tokenizer, shingles);
                }
                return tokenizer;
            }
        };
    }

    public class PopulationComparatorSource extends FieldComparatorSource {
        @Override
        public FieldComparator newComparator(String fieldname, int numHits, int sortPos, boolean reversed) throws IOException {
            return new PopulationComparator(fieldname, numHits);
        }

        private class PopulationComparator extends FieldComparator {
            private final String fieldName;
            private Integer[] values;
            private int[] populations;
            private int bottom;

            public PopulationComparator(String fieldname, int numHits) {
                values = new Integer[numHits];
                this.fieldName = fieldname;
            }

            // Descending order: the slot with the larger population sorts first.
            @Override
            public int compare(int slot1, int slot2) {
                if (values[slot1] > values[slot2]) return -1;
                if (values[slot1] < values[slot2]) return 1;
                return 0;
            }

            @Override
            public void setBottom(int slot) {
                bottom = values[slot];
            }

            @Override
            public int compareBottom(int doc) throws IOException {
                int value = populations[doc];
                if (bottom > value) return -1;
                if (bottom < value) return 1;
                return 0;
            }

            @Override
            public void copy(int slot, int doc) throws IOException {
                values[slot] = populations[doc];
            }

            @Override
            public void setNextReader(IndexReader reader, int docBase) throws IOException {
                // Loads every population value for this reader into the FieldCache,
                // which can be memory-hungry on a large index.
                populations = FieldCache.DEFAULT.getInts(reader, fieldName);
            }

            @Override
            public Comparable value(int slot) {
                return values[slot];
            }
        }
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer(3);

        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla", "Bosc de la Planassa",
                                                               "Bosc de Plana en Blanca");
        ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000);

        for (int id = 0; id < cities.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("population", String.valueOf(populations.get(id)),
                                     Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        // Queries are parsed without shingles; stop words are still removed.
        qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0));
        sort = new Sort(new SortField("population", new PopulationComparatorSource()));
        searcher = new IndexSearcher(dir);
        searcher.setDefaultFieldSortScoring(true, true);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testShingleFilter() throws Exception {
        System.out.println("shingle filter");

        printSearch("city:\"Bosc de Planavilla\"");
        printSearch("city:Planavilla");
        printSearch("city:Bosc");
    }

    private void printSearch(String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        System.out.println("query " + q);
        TopDocs hits = searcher.search(q, null, 4, sort);
        System.out.println("results " + hits.totalHits);
        int i = 1;
        for (ScoreDoc dc : hits.scoreDocs) {
            Document doc = reader.document(dc.doc);
            System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population"));
        }
        System.out.println();
    }
}

This produces the following output:

query city:"Bosc Planavilla"
results 1
1. doc=0 score=1.143841[5000] "Bosc de Planavilla" population: 5000

query city:Planavilla
results 2
1. doc=1 score=1.287682[20000] "Planavilla" population: 20000
2. doc=0 score=0.643841[5000] "Bosc de Planavilla" population: 5000

query city:Bosc
results 3
1. doc=3 score=0.375[100000] "Bosc de Plana en Blanca" population: 100000
2. doc=0 score=0.5[5000] "Bosc de Planavilla" population: 5000
3. doc=2 score=0.5[1000] "Bosc de la Planassa" population: 1000
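
The custom-scoring alternative mentioned at the top of this answer could look roughly like the sketch below. It is only an illustration under stated assumptions: `PopulationBoost` and `combine` are hypothetical names, the logarithmic formula is one arbitrary choice among many, and how such a function would be wired into Lucene's scoring is left open.

```java
// Hypothetical helper, not part of Lucene: blends a relevance score with population.
class PopulationBoost {

    // Assumed formula: relevance scaled by log10(10 + population).
    // A population of 0 leaves the relevance score unchanged (log10(10) == 1),
    // and large cities gain a boost that grows slowly rather than linearly.
    static double combine(double relevance, int population) {
        return relevance * Math.log10(10.0 + population);
    }

    public static void main(String[] args) {
        // Relevance scores and populations taken from the city:Bosc results above.
        System.out.println("Bosc de Plana en Blanca: " + combine(0.375, 100000));
        System.out.println("Bosc de Planavilla:      " + combine(0.5, 5000));
        System.out.println("Bosc de la Planassa:     " + combine(0.5, 1000));
    }
}
```

Unlike a pure sort on population, a blended score keeps text relevance in play: a small town with an exact name match can still outrank a large city that matches only loosely, depending on how the formula is tuned.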
Votes: 1

Stack Overflow user

Posted on 2012-01-12 22:36:00

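How are you tokenizing the fields? Are you storing them as full strings? Also, how do you parse the query?

OK, so I have been playing around with this. I use a StopFilter to remove la, en, and de. I then use a shingle filter to get multiple word combinations so that "exact matches" are possible. For example, Bosc de Planavilla is tokenized into tokens such as Bosc and Bosc Planavilla, and Bosc de Plana en Blanca into tokens such as Bosc, Bosc Plana, Plana Blanca, and Bosc Plana Blanca. This lets you get "exact matches" on portions of the query.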
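
The tokenization described above can be approximated in plain Java, without Lucene, to see which tokens end up in the index. This is a rough sketch: `ShingleSketch` and `tokens` are hypothetical names, and the code only mimics stop-word removal followed by shingling with single words kept; it is not the actual `ShingleFilter` implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java approximation of the analyzer chain; it does not call Lucene.
class ShingleSketch {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("de", "la", "en"));

    // Splits on whitespace, drops stop words, then emits every run of
    // 1..maxShingleSize consecutive words, joined with a single space.
    static List<String> tokens(String text, int maxShingleSize) {
        List<String> words = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            for (int size = 1; size <= maxShingleSize && start + size <= words.size(); size++) {
                out.add(String.join(" ", words.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Bosc de Planavilla", 3));
        System.out.println(tokens("Bosc de Plana en Blanca", 3));
    }
}
```

Because the stop words are dropped before shingling, "Bosc" and "Planavilla" become adjacent, which is why a phrase query for "Bosc Planavilla" can match the document named "Bosc de Planavilla".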

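I then simply query for the exact string passed by the user, though there is room for some tweaking there too. I went with the simple case so that the results better match what you were looking for.

Here is the code I used (Lucene 3.0.3):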

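import java.io.IOException;
import java.io.Reader;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ShingleFilterTests {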
    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;

    public static Analyzer createAnalyzer(final int shingles) {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream tokenizer = new WhitespaceTokenizer(reader);
                tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en"));
                if (shingles > 0) {
                    tokenizer = new ShingleFilter(tokenizer, shingles);
                }
                return tokenizer;
            }
        };
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer(3);

        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla", "Bosc de la Planassa",
                                                               "Bosc de Plana en Blanca");
        ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000);

        for (int id = 0; id < cities.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("population", String.valueOf(populations.get(id)),
                                     Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        searcher = new IndexSearcher(dir);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testShingleFilter() throws Exception {
        System.out.println("shingle filter");

        QueryParser qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0));

        printSearch(qp, "city:\"Bosc de Planavilla\"");
        printSearch(qp, "city:Planavilla");
        printSearch(qp, "city:Bosc");
    }

    private void printSearch(QueryParser qp, String query) throws ParseException, IOException {
        Query q = qp.parse(query);

        System.out.println("query " + q);
        TopDocs hits = searcher.search(q, 4);
        System.out.println("results " + hits.totalHits);
        int i = 1;
        for (ScoreDoc dc : hits.scoreDocs) {
            Document doc = reader.document(dc.doc);
            System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population"));
        }
        System.out.println();
    }
}

I am now looking into sorting by population.

This prints:

query city:"Bosc Planavilla"
results 1
1. doc=0 score=1.143841 "Bosc de Planavilla" population: 5000

query city:Planavilla
results 2
1. doc=1 score=1.287682 "Planavilla" population: 20000
2. doc=0 score=0.643841 "Bosc de Planavilla" population: 5000

query city:Bosc
results 3
1. doc=0 score=0.5 "Bosc de Planavilla" population: 5000
2. doc=2 score=0.5 "Bosc de la Planassa" population: 1000
3. doc=3 score=0.375 "Bosc de Plana en Blanca" population: 100000
Votes: 1

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/8755087
