
Lucene: exact matches are not shown first

Stack Overflow user
Asked 2013-10-07 13:27:40
1 answer, 745 views, 0 followers, 1 vote

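I am using the demo IndexFiles and SearchFiles classes from the org.apache.lucene.demo package to index and search my content.

My problem is that when I use a query that contains more than one word, the results that match it exactly are not shown first. For example: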

Enter query:
"natural language"
Searching for: "natural language"
298 total matching documents
1. download\researchers.uq.edu.au\fields-of-research\natural-language-processing
.txt
2. download\researchers.uq.edu.au\research-project\16267.txt
3. download\researchers.uq.edu.au\research-project\16279.txt
4. download\researchers.uq.edu.au\research-project\18361.txt
5. download\www.uq.edu.au\news\%3Farticle%3D2187.txt
6. download\researchers.uq.edu.au\researcher\2115.txt
7. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-project
s-dr-alan-cody%3Fpage%3D1.txt
8. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-project
s-dr-alan-cody%3Fpage%3D2.txt
9. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-project
s-dr-alan-cody.txt
10. download\www.ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-pr
ojects-dr-alan-cody.txt
Press (n)ext page, (q)uit or enter number to jump to a page.

gives different results from:

Enter query:
natural language
Searching for: natural language
54307 total matching documents
1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D190.txt

2. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D576.txt

3. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D46.txt
4. download\espace.library.uq.edu.au\view\UQ%3A166163.txt
5. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D108.txt

6. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D70.txt
7. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D708.txt

8. download\researchers.uq.edu.au\fields-of-research\natural-language-processing
.txt
9. download\researchers.uq.edu.au\research-project\16267.txt
10. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D117.tx
t
Press (n)ext page, (q)uit or enter number to jump to a page.

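For example, the first matching document does not even contain the keyword "language".

If I use the explain() method of the IndexSearcher class, I get this for the first result: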

1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D190.txt
0.70643383 = (MATCH) sum of:
  0.5590494 = (MATCH) weight(contents:natural in 62541) [DefaultSimilarity], result of:
    0.5590494 = score(doc=62541,freq=4.0 = termFreq=4.0
), product of:
      0.8091749 = queryWeight, product of:
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.18300149 = queryNorm
      0.6908882 = fieldWeight in 62541, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.078125 = fieldNorm(doc=62541)
  0.1473844 = (MATCH) weight(contents:language in 62541) [DefaultSimilarity], result of:
    0.1473844 = score(doc=62541,freq=1.0 = termFreq=1.0
), product of:
      0.5875679 = queryWeight, product of:
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.18300149 = queryNorm
      0.25083807 = fieldWeight in 62541, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.078125 = fieldNorm(doc=62541)

If I press "next" and look further down, I find a result like this:

19. download\www.uq.edu.au\news\%3Farticle%3D2187.txt
0.47449595 = (MATCH) sum of:
  0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of:
    0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0
), product of:
      0.8091749 = queryWeight, product of:
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.18300149 = queryNorm
      0.3454441 = fieldWeight in 35173, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        4.4216847 = idf(docFreq=13111, maxDocs=401502)
        0.0390625 = fieldNorm(doc=35173)
  0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of:
    0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0
), product of:
      0.5875679 = queryWeight, product of:
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.18300149 = queryNorm
      0.33182758 = fieldWeight in 35173, product of:
        2.6457512 = tf(freq=7.0), with freq of:
          7.0 = termFreq=7.0
        3.2107275 = idf(docFreq=44012, maxDocs=401502)
        0.0390625 = fieldNorm(doc=35173)

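That page itself contains the exact phrase "natural language". So my questions are:

1) Why doesn't Lucene show the exact matches first?

2) Why does Lucene show results that don't even contain the keyword?

3) Where/how can I change this so that exact matches are shown first, followed by the more relevant ones?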


1 Answer

Stack Overflow user

Answered 2013-10-08 00:54:56

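1 - It isn't designed to. See the documentation on Lucene query syntax. The query natural language is a query made up of two separate terms. Lucene does not, by itself, give any preference to terms that appear close together. If you want to find exact matches, a phrase query is the right approach, such as "natural language".

2 - Both of the results you included explanations for contain matches for both terms, see: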

0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of:
  0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0
...
0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of:
  0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0

According to Lucene, it found "natural" 4 times and "language" 7 times in the contents field of that document (I assume this is your default field).
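The arithmetic behind those explain() lines can be checked by hand. The sketch below is plain Java, not Lucene code: it assumes the classic DefaultSimilarity formula that the output itself spells out (tf = sqrt(freq), score = queryWeight * fieldWeight), and the class and method names are made up for illustration. It also suggests why doc 62541 outranks doc 35173 even though it contains "language" only once: its fieldNorm is twice as large (0.078125 versus 0.0390625), which with DefaultSimilarity usually means a shorter field.

```java
// Plain-Java sketch with no Lucene dependency. Names are illustrative only.
final class ExplainScore {

    // Score of one term clause, following the breakdown printed by explain():
    //   queryWeight = idf * queryNorm
    //   fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    static double termScore(double freq, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = Math.sqrt(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        // Constants copied from the explain() output in the question.
        double queryNorm = 0.18300149;
        double idfNatural = 4.4216847;
        double idfLanguage = 3.2107275;

        // Each document score is the sum of its two term clauses.
        double doc62541 = termScore(4, idfNatural, queryNorm, 0.078125)
                        + termScore(1, idfLanguage, queryNorm, 0.078125);
        double doc35173 = termScore(4, idfNatural, queryNorm, 0.0390625)
                        + termScore(7, idfLanguage, queryNorm, 0.0390625);

        System.out.printf("doc 62541: %.6f (explain reports 0.70643383)%n", doc62541);
        System.out.printf("doc 35173: %.6f (explain reports 0.47449595)%n", doc35173);
    }
}
```

Nothing in this formula looks at term positions, which is consistent with point 1: adjacency only matters once a phrase clause is part of the query.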

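3 - Look through the query parser syntax and see what makes the most sense for you. It sounds like you might find Proximity Searches useful.

If you simply want the phrase matches to come first, followed by the other matches, you could use something like: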

"natural language" natural language
Votes: 0
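If the combined query suggested in the answer is assembled in code from user input, a small helper could produce that form. This is only a sketch: `PhraseFirstQuery` and `build` are hypothetical names, the input is assumed to contain plain words with no query-syntax special characters, and the optional `^boost` on the phrase clause (query parser boost syntax) is an addition that the answer's query does not use.

```java
// Plain-Java sketch that only builds a query string; it does not call Lucene.
final class PhraseFirstQuery {

    // Returns: "<words>" <words>, optionally with ^boost on the quoted phrase.
    // Assumes userInput holds plain words without special characters.
    static String build(String userInput, int phraseBoost) {
        String words = userInput.trim().replaceAll("\\s+", " ");
        if (words.isEmpty() || !words.contains(" ")) {
            return words; // a single term has no phrase form to add
        }
        String phrase = "\"" + words + "\"";
        if (phraseBoost > 1) {
            phrase += "^" + phraseBoost; // hypothetical extra weight for phrase hits
        }
        return phrase + " " + words;
    }

    public static void main(String[] args) {
        System.out.println(build("natural language", 1)); // "natural language" natural language
        System.out.println(build("natural language", 4)); // "natural language"^4 natural language
    }
}
```

The resulting string would then be handed to the same query parser that SearchFiles already uses. Documents matching the phrase clause receive its score on top of the two term clauses, so they tend to move up without the other matches being dropped.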
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/19217634
