I have a Lucene index containing documents like this:
_id | Name                    | Alternate Names                      | Population
123 | Bosc de Planavilla      | (some names here in other languages) | 5000
345 | Planavilla              | (some names here in other languages) | 20000
456 | Bosc de la Planassa     |                                      | 1000
567 | Bosc de Plana en Blanca |                                      | 100000

What is the best type of Lucene query to use here, and how should I structure it, given that I need the following:
[…] is the most popular.
There are probably lots of other use cases… but you get a feel for what I need…
What kind of query would get me these results? Should I generate word n-grams (shingles) and build an ORed BooleanQuery out of those tokens, then apply custom scoring? Or would a plain PhraseQuery do? I also saw DisjunctionMaxQuery, but I have no idea whether it is what I'm looking for.
You probably see the idea by now: find the exact location the user implied in their query. From there I can start my geo search and add some further queries around that point.
What is the best way to approach this?
Thanks in advance.
Posted on 2012-01-13 00:06:46
Here is the code for the sorting as well. I think, though, that it would make more sense to add a custom score that takes city size into account, rather than brute-forcing the sort on population. Also note that this uses FieldCache, which may not be the best solution with regard to memory usage.
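To make that custom-scoring alternative concrete, here is a toy sketch of what blending city size into the score could look like. The PopulationBoost class and the log formula are made up for illustration; they are not part of Lucene.

```java
public class PopulationBoost {
    // Hypothetical blend: a log-damped population factor, so that a huge city
    // cannot completely drown out a much better textual match.
    static double blendedScore(double textScore, int population) {
        return textScore * (1.0 + Math.log10(1 + population));
    }

    public static void main(String[] args) {
        // Text scores taken from the "city:Planavilla" results: the exact
        // match (pop 20000) still outranks the partial match (pop 5000).
        System.out.println(blendedScore(1.287682, 20000));
        System.out.println(blendedScore(0.643841, 5000));
    }
}
```

In real Lucene 3.x a blend like this would typically be hooked in via a CustomScoreQuery; the point here is only the shape of the formula.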
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ShingleFilterTests {
    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;
    private QueryParser qp;
    private Sort sort;

    public static Analyzer createAnalyzer(final int shingles) {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream tokenizer = new WhitespaceTokenizer(reader);
                tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en"));
                if (shingles > 0) {
                    tokenizer = new ShingleFilter(tokenizer, shingles);
                }
                return tokenizer;
            }
        };
    }

    public class PopulationComparatorSource extends FieldComparatorSource {
        @Override
        public FieldComparator newComparator(String fieldname, int numHits, int sortPos, boolean reversed) throws IOException {
            return new PopulationComparator(fieldname, numHits);
        }

        private class PopulationComparator extends FieldComparator {
            private final String fieldName;
            private Integer[] values;
            private int[] populations;
            private int bottom;

            public PopulationComparator(String fieldname, int numHits) {
                values = new Integer[numHits];
                this.fieldName = fieldname;
            }

            @Override
            public int compare(int slot1, int slot2) {
                // Descending: larger population sorts first.
                if (values[slot1] > values[slot2]) return -1;
                if (values[slot1] < values[slot2]) return 1;
                return 0;
            }

            @Override
            public void setBottom(int slot) {
                bottom = values[slot];
            }

            @Override
            public int compareBottom(int doc) throws IOException {
                int value = populations[doc];
                if (bottom > value) return -1;
                if (bottom < value) return 1;
                return 0;
            }

            @Override
            public void copy(int slot, int doc) throws IOException {
                values[slot] = populations[doc];
            }

            @Override
            public void setNextReader(IndexReader reader, int docBase) throws IOException {
                /* XXX uses field cache */
                populations = FieldCache.DEFAULT.getInts(reader, "population");
            }

            @Override
            public Comparable value(int slot) {
                return values[slot];
            }
        }
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer(3);
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla", "Bosc de la Planassa",
                "Bosc de Plana en Blanca");
        ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000);
        for (int id = 0; id < cities.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("population", String.valueOf(populations.get(id)),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
        qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0));
        sort = new Sort(new SortField("population", new PopulationComparatorSource()));
        searcher = new IndexSearcher(dir);
        searcher.setDefaultFieldSortScoring(true, true); // keep scores while sorting by field
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testShingleFilter() throws Exception {
        System.out.println("shingle filter");
        printSearch("city:\"Bosc de Planavilla\"");
        printSearch("city:Planavilla");
        printSearch("city:Bosc");
    }

    private void printSearch(String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        System.out.println("query " + q);
        TopDocs hits = searcher.search(q, null, 4, sort);
        System.out.println("results " + hits.totalHits);
        int i = 1;
        for (ScoreDoc dc : hits.scoreDocs) {
            Document doc = reader.document(dc.doc);
            System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population"));
        }
        System.out.println();
    }
}

This produces the following output:
query city:"Bosc Planavilla"
results 1
1. doc=0 score=1.143841[5000] "Bosc de Planavilla" population: 5000
query city:Planavilla
results 2
1. doc=1 score=1.287682[20000] "Planavilla" population: 20000
2. doc=0 score=0.643841[5000] "Bosc de Planavilla" population: 5000
query city:Bosc
results 3
1. doc=3 score=0.375[100000] "Bosc de Plana en Blanca" population: 100000
2. doc=0 score=0.5[5000] "Bosc de Planavilla" population: 5000
3. doc=2 score=0.5[1000] "Bosc de la Planassa" population: 1000

Posted on 2012-01-12 22:36:00
How are you tokenizing these fields? Are you storing them as full strings? Also, how are you parsing the query?
OK, so I've been playing with this. I used a StopFilter to remove la, en, de. Then I used a ShingleFilter to get the various combinations for "exact matching". For example, Bosc de Planavilla is tokenized to Bosc Planavilla, and Bosc de Plana en Blanca to Bosc Plana Blanca. This lets you do an "exact match" on parts of the query.
I then query the exact string the user passed, though some tweaking could be done there as well. I went for the simple case of making the results better fit what you were looking for.
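To make that token output concrete, here is a tiny stand-alone sketch (plain Java, not the actual StopFilter/ShingleFilter classes) of what the stop-word + shingle chain produces for a maximum shingle size of 3. Note the real ShingleFilter interleaves unigrams and shingles by position; this sketch simply lists unigrams first.

```java
import java.util.*;

public class ShingleSketch {
    // Illustrative stand-in for the StopFilter + ShingleFilter chain: drop the
    // stop words, then emit every word n-gram up to maxShingleSize.
    static List<String> shingles(String text, Set<String> stopWords, int maxShingleSize) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!stopWords.contains(t.toLowerCase())) {
                tokens.add(t);
            }
        }
        List<String> out = new ArrayList<>(tokens); // unigrams (ShingleFilter outputs them by default)
        for (int size = 2; size <= maxShingleSize; size++) {
            for (int i = 0; i + size <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("de", "la", "en"));
        System.out.println(shingles("Bosc de Planavilla", stop, 3));
        // [Bosc, Planavilla, Bosc Planavilla]
        System.out.println(shingles("Bosc de Plana en Blanca", stop, 3));
        // [Bosc, Plana, Blanca, Bosc Plana, Plana Blanca, Bosc Plana Blanca]
    }
}
```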
Here is the code I used (Lucene 3.0.3):
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ShingleFilterTests {
    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;

    public static Analyzer createAnalyzer(final int shingles) {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream tokenizer = new WhitespaceTokenizer(reader);
                tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en"));
                if (shingles > 0) {
                    tokenizer = new ShingleFilter(tokenizer, shingles);
                }
                return tokenizer;
            }
        };
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer(3);
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla", "Bosc de la Planassa",
                "Bosc de Plana en Blanca");
        ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000);
        for (int id = 0; id < cities.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("population", String.valueOf(populations.get(id)),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
        searcher = new IndexSearcher(dir);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testShingleFilter() throws Exception {
        System.out.println("shingle filter");
        QueryParser qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0));
        printSearch(qp, "city:\"Bosc de Planavilla\"");
        printSearch(qp, "city:Planavilla");
        printSearch(qp, "city:Bosc");
    }

    private void printSearch(QueryParser qp, String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        System.out.println("query " + q);
        TopDocs hits = searcher.search(q, 4);
        System.out.println("results " + hits.totalHits);
        int i = 1;
        for (ScoreDoc dc : hits.scoreDocs) {
            Document doc = reader.document(dc.doc);
            System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population"));
        }
        System.out.println();
    }
}

I'm now looking into sorting by population.
This prints:
query city:"Bosc Planavilla"
results 1
1. doc=0 score=1.143841 "Bosc de Planavilla" population: 5000
query city:Planavilla
results 2
1. doc=1 score=1.287682 "Planavilla" population: 20000
2. doc=0 score=0.643841 "Bosc de Planavilla" population: 5000
query city:Bosc
results 3
1. doc=0 score=0.5 "Bosc de Planavilla" population: 5000
2. doc=2 score=0.5 "Bosc de la Planassa" population: 1000
3. doc=3 score=0.375 "Bosc de Plana en Blanca" population: 100000

https://stackoverflow.com/questions/8755087