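I recently upgraded from Lucene 3 to Lucene 6, and in v6 I noticed that the wildcard ? no longer matches a digit that follows a dot. Here is an example:
String to match: a.1a
Query: a.?a
In this example, the query matches the string in Lucene 3 but not in Lucene 6. On the other hand, the query a* matches in both Lucene 3 and 6. Further testing showed that this difference in behavior only occurs when the dot is followed by a digit. By the way, I am using StandardAnalyzer in both Lucene 3 and 6.
Does anyone know what is going on here? How can I restore the Lucene 3 behavior, or adjust my Lucene 6 query so that it is equivalent to the Lucene 3 query?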
Update
Lucene 6.6 code snippet, as requested:
public List<ResultDocument> search(String queryString)
        throws SearchException, CheckedOutOfMemoryError {
    stopped = false;
    QueryWrapper queryWrapper = createQuery(queryString);
    Query query = queryWrapper.query;
    boolean isPhraseQuery = queryWrapper.isPhraseQuery;
    readLock.lock();
    try {
        checkIndexesExist();
        DelegatingCollector collector = new DelegatingCollector() {
            @Override
            public void collect(int doc) throws IOException {
                leafDelegate.collect(doc);
                if (stopped) {
                    throw new StoppedSearcherException();
                }
            }
        };
        collector.setDelegate(TopScoreDocCollector.create(MAX_RESULTS, null));
        try {
            luceneSearcher.search(query, collector);
        }
        catch (StoppedSearcherException e) {}
        ScoreDoc[] scoreDocs = ((TopScoreDocCollector) collector.getDelegate()).topDocs().scoreDocs;
        ResultDocument[] results = new ResultDocument[scoreDocs.length];
        for (int i = 0; i < scoreDocs.length; i++) {
            Document doc = luceneSearcher.doc(scoreDocs[i].doc);
            float score = scoreDocs[i].score;
            LuceneIndex index = indexes.get(((DecoratedMultiReader) luceneSearcher.getIndexReader()).decoratedReaderIndex(i));
            IndexingConfig config = index.getConfig();
            results[i] = new ResultDocument(
                    doc, score, query, isPhraseQuery, config, fileFactory,
                    outlookMailFactory);
        }
        return Arrays.asList(results);
    }
    catch (IllegalArgumentException e) {
        throw wrapEmptyIndexException(e);
    }
    catch (IOException e) {
        throw new SearchException(e.getMessage());
    }
    catch (OutOfMemoryError e) {
        throw new CheckedOutOfMemoryError(e);
    }
    finally {
        readLock.unlock();
    }
}
More code:
private static QueryWrapper createQuery(String queryString)
        throws SearchException {
    PhraseDetectingQueryParser queryParser = new PhraseDetectingQueryParser(
            Fields.CONTENT.key(), IndexRegistry.getAnalyzer());
    queryParser.setAllowLeadingWildcard(true);
    RewriteMethod rewriteMethod = MultiTermQuery.SCORING_BOOLEAN_REWRITE;
    queryParser.setMultiTermRewriteMethod(rewriteMethod);
    try {
        Query query = queryParser.parse(queryString);
        boolean isPhraseQuery = queryParser.isPhraseQuery();
        return new QueryWrapper(query, isPhraseQuery);
    }
    catch (IllegalArgumentException e) {
        throw new SearchException(e.getMessage());
    }
    catch (ParseException e) {
        throw new SearchException(e.getMessage());
    }
}

private static final class QueryWrapper {
    public final Query query;
    public final boolean isPhraseQuery;

    private QueryWrapper(Query query, boolean isPhraseQuery) {
        this.query = query;
        this.isPhraseQuery = isPhraseQuery;
    }
}
More code:
public final class PhraseDetectingQueryParser extends QueryParser {
    /*
     * This class is used for determining whether the parsed query is supported
     * by the fast-vector highlighter. The latter only supports queries that are
     * a combination of TermQuery, PhraseQuery and/or BooleanQuery.
     */
    private boolean isPhraseQuery = true;

    public PhraseDetectingQueryParser(String defaultField,
                                      Analyzer analyzer) {
        super(defaultField, analyzer);
    }

    public boolean isPhraseQuery() {
        return isPhraseQuery;
    }

    protected Query newFuzzyQuery(Term term,
                                  float minimumSimilarity,
                                  int prefixLength) {
        isPhraseQuery = false;
        return super.newFuzzyQuery(term, minimumSimilarity, prefixLength);
    }

    protected Query newMatchAllDocsQuery() {
        isPhraseQuery = false;
        return super.newMatchAllDocsQuery();
    }

    protected Query newPrefixQuery(Term prefix) {
        isPhraseQuery = false;
        return super.newPrefixQuery(prefix);
    }

    protected Query newWildcardQuery(org.apache.lucene.index.Term t) {
        isPhraseQuery = false;
        return super.newWildcardQuery(t);
    }
}
Posted on 2019-01-22 20:58:53
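StandardAnalyzer splits the input into terms at the period (unless the period has a letter on both sides, or a digit on both sides). So it splits a.1a into two terms: a and 1a.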
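To make the period rule concrete, here is a minimal sketch in plain Java. It is a simplified stand-in written for illustration, not Lucene's StandardTokenizer: the class name PeriodSplitSketch and its tokenize method are hypothetical, and the only rule it models is the one described above (a period is a term boundary unless it sits between two letters or between two digits).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the period rule; NOT Lucene's StandardTokenizer.
final class PeriodSplitSketch {

    // Splits text into lowercase terms. A period is kept inside a term only
    // when it has a letter on both sides or a digit on both sides.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.') {
                boolean hasNeighbours = i > 0 && i + 1 < text.length();
                char prev = hasNeighbours ? text.charAt(i - 1) : ' ';
                char next = hasNeighbours ? text.charAt(i + 1) : ' ';
                boolean betweenLetters = Character.isLetter(prev) && Character.isLetter(next);
                boolean betweenDigits = Character.isDigit(prev) && Character.isDigit(next);
                if (betweenLetters || betweenDigits) {
                    current.append(c);      // period stays inside the term
                } else {
                    flush(current, terms);  // period acts as a term boundary
                }
            } else if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else {
                flush(current, terms);      // any other character ends the term
            }
        }
        flush(current, terms);
        return terms;
    }

    // Emits the term collected so far, if any, and resets the buffer.
    private static void flush(StringBuilder current, List<String> terms) {
        if (current.length() > 0) {
            terms.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a.1a")); // [a, 1a]
        System.out.println(tokenize("a.ba")); // [a.ba]
        System.out.println(tokenize("1.5"));  // [1.5]
    }
}
```

Under this model, only the letter-period-digit case gets split, which is consistent with the observation in the question that the difference shows up only when the dot is followed by a digit.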
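Because you are using a wildcard query, no analysis is performed on the query side, so the query text is not tokenized, and the index contains no term that matches your query. If you searched for "1a", without wildcards or anything else, you should find the document.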
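The effect can be sketched without Lucene. The example below assumes the index holds the two terms a and 1a, as described above, and checks a wildcard pattern against each whole term. WildcardTermSketch and its methods are hypothetical names, and the regex translation is only an approximation of wildcard semantics (? for one character, * for any run of characters), not Lucene's WildcardQuery implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of matching an unanalyzed wildcard against indexed terms.
final class WildcardTermSketch {

    // Translates wildcard syntax into a regex: ? = exactly one char, * = any run of chars.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal, e.g. the dot
            }
        }
        return Pattern.compile(regex.toString());
    }

    // The pattern has to match one whole term; it is never split into pieces.
    static boolean matchesAnyTerm(String wildcard, List<String> terms) {
        Pattern pattern = toPattern(wildcard);
        return terms.stream().anyMatch(term -> pattern.matcher(term).matches());
    }

    public static void main(String[] args) {
        List<String> splitTerms = Arrays.asList("a", "1a"); // terms per the answer above
        List<String> singleTerm = Arrays.asList("a.1a");    // assumption: one unsplit term

        System.out.println(matchesAnyTerm("a.?a", splitTerms)); // false
        System.out.println(matchesAnyTerm("a*", splitTerms));   // true
        System.out.println(matchesAnyTerm("1a", splitTerms));   // true
        System.out.println(matchesAnyTerm("a.?a", singleTerm)); // true
    }
}
```

The last call assumes that the older index stored a.1a as a single term, which would be consistent with the Lucene 3 result reported in the question. The a* case succeeds in both setups because the term a on its own already satisfies the pattern.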
https://stackoverflow.com/questions/53205997