Edge NGram with phrase matching

Stack Overflow user
Asked 2016-08-09 10:20:34
2 answers, 11.7K views, 0 followers, 11 votes

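I need to autocomplete phrases. For example, when I search for "dementia in alz", I want to get "dementia in Alzheimer's".

For this, I configured an Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body. Nevertheless, I can't get any results when I try to match a phrase.

What am I doing wrong?

My query: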

{
  "query":{
    "multi_match":{
      "query":"dementia in alz",
      "type":"phrase",
      "analyzer":"edge_ngram_analyzer",
      "fields":["_all"]
    }
  }
}

My mappings:

...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
  ...

My documents:

{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfHfBE5CzEm8aJ3Xp", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:48.665Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1400541", 
    "Diagnosis": "F00.0 -  Dementia in Alzheimer's disease with early onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}, 
{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfICrE5CzEm8aJ4Dc", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:51.240Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1424351", 
    "Diagnosis": "F00.1 -  Dementia in Alzheimer's disease with late onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}

The analysis of the "dementia in alzheimer" phrase:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "de", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 4, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 0, 
      "position": 2
    }, 
    {
      "end_offset": 5, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 0, 
      "position": 3
    }, 
    {
      "end_offset": 6, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 0, 
      "position": 4
    }, 
    {
      "end_offset": 7, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 0, 
      "position": 5
    }, 
    {
      "end_offset": 8, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 0, 
      "position": 6
    }, 
    {
      "end_offset": 9, 
      "token": "dementia ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 7
    }, 
    {
      "end_offset": 10, 
      "token": "dementia i", 
      "type": "word", 
      "start_offset": 0, 
      "position": 8
    }, 
    {
      "end_offset": 11, 
      "token": "dementia in", 
      "type": "word", 
      "start_offset": 0, 
      "position": 9
    }, 
    {
      "end_offset": 12, 
      "token": "dementia in ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 10
    }, 
    {
      "end_offset": 13, 
      "token": "dementia in a", 
      "type": "word", 
      "start_offset": 0, 
      "position": 11
    }, 
    {
      "end_offset": 14, 
      "token": "dementia in al", 
      "type": "word", 
      "start_offset": 0, 
      "position": 12
    }, 
    {
      "end_offset": 15, 
      "token": "dementia in alz", 
      "type": "word", 
      "start_offset": 0, 
      "position": 13
    }, 
    {
      "end_offset": 16, 
      "token": "dementia in alzh", 
      "type": "word", 
      "start_offset": 0, 
      "position": 14
    }, 
    {
      "end_offset": 17, 
      "token": "dementia in alzhe", 
      "type": "word", 
      "start_offset": 0, 
      "position": 15
    }, 
    {
      "end_offset": 18, 
      "token": "dementia in alzhei", 
      "type": "word", 
      "start_offset": 0, 
      "position": 16
    }, 
    {
      "end_offset": 19, 
      "token": "dementia in alzheim", 
      "type": "word", 
      "start_offset": 0, 
      "position": 17
    }, 
    {
      "end_offset": 20, 
      "token": "dementia in alzheime", 
      "type": "word", 
      "start_offset": 0, 
      "position": 18
    }, 
    {
      "end_offset": 21, 
      "token": "dementia in alzheimer", 
      "type": "word", 
      "start_offset": 0, 
      "position": 19
    }
  ]
}

2 Answers

Stack Overflow user

Accepted answer

Posted 2016-08-11 10:34:13

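Many thanks to rendel, who helped me find the right solution!

The solution of Andrei Stefan is not optimal.

Why? First, the search analyzer has no lowercase filter, which makes searching inconvenient: the case must match exactly. A custom analyzer with a lowercase filter is needed instead of "analyzer": "keyword".

Second, the analysis part is wrong! At index time the string "F00.0 -  Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With this analyzer, we get the following array of dictionaries as the analyzed string: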

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "f0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "f00", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 6, 
      "token": "0 ", 
      "type": "word", 
      "start_offset": 4, 
      "position": 2
    }, 
    {
      "end_offset": 9, 
      "token": "  ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 3
    }, 
    {
      "end_offset": 10, 
      "token": "  d", 
      "type": "word", 
      "start_offset": 7, 
      "position": 4
    }, 
    {
      "end_offset": 11, 
      "token": "  de", 
      "type": "word", 
      "start_offset": 7, 
      "position": 5
    }, 
    {
      "end_offset": 12, 
      "token": "  dem", 
      "type": "word", 
      "start_offset": 7, 
      "position": 6
    }, 
    {
      "end_offset": 13, 
      "token": "  deme", 
      "type": "word", 
      "start_offset": 7, 
      "position": 7
    }, 
    {
      "end_offset": 14, 
      "token": "  demen", 
      "type": "word", 
      "start_offset": 7, 
      "position": 8
    }, 
    {
      "end_offset": 15, 
      "token": "  dement", 
      "type": "word", 
      "start_offset": 7, 
      "position": 9
    }, 
    {
      "end_offset": 16, 
      "token": "  dementi", 
      "type": "word", 
      "start_offset": 7, 
      "position": 10
    }, 
    {
      "end_offset": 17, 
      "token": "  dementia", 
      "type": "word", 
      "start_offset": 7, 
      "position": 11
    }, 
    {
      "end_offset": 18, 
      "token": "  dementia ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 12
    }, 
    {
      "end_offset": 19, 
      "token": "  dementia i", 
      "type": "word", 
      "start_offset": 7, 
      "position": 13
    }, 
    {
      "end_offset": 20, 
      "token": "  dementia in", 
      "type": "word", 
      "start_offset": 7, 
      "position": 14
    }, 
    {
      "end_offset": 21, 
      "token": "  dementia in ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 15
    }, 
    {
      "end_offset": 22, 
      "token": "  dementia in a", 
      "type": "word", 
      "start_offset": 7, 
      "position": 16
    }, 
    {
      "end_offset": 23, 
      "token": "  dementia in al", 
      "type": "word", 
      "start_offset": 7, 
      "position": 17
    }, 
    {
      "end_offset": 24, 
      "token": "  dementia in alz", 
      "type": "word", 
      "start_offset": 7, 
      "position": 18
    }, 
    {
      "end_offset": 25, 
      "token": "  dementia in alzh", 
      "type": "word", 
      "start_offset": 7, 
      "position": 19
    }, 
    {
      "end_offset": 26, 
      "token": "  dementia in alzhe", 
      "type": "word", 
      "start_offset": 7, 
      "position": 20
    }, 
    {
      "end_offset": 27, 
      "token": "  dementia in alzhei", 
      "type": "word", 
      "start_offset": 7, 
      "position": 21
    }, 
    {
      "end_offset": 28, 
      "token": "  dementia in alzheim", 
      "type": "word", 
      "start_offset": 7, 
      "position": 22
    }, 
    {
      "end_offset": 29, 
      "token": "  dementia in alzheime", 
      "type": "word", 
      "start_offset": 7, 
      "position": 23
    }, 
    {
      "end_offset": 30, 
      "token": "  dementia in alzheimer", 
      "type": "word", 
      "start_offset": 7, 
      "position": 24
    }, 
    {
      "end_offset": 33, 
      "token": "s ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 25
    }, 
    {
      "end_offset": 34, 
      "token": "s d", 
      "type": "word", 
      "start_offset": 31, 
      "position": 26
    }, 
    {
      "end_offset": 35, 
      "token": "s di", 
      "type": "word", 
      "start_offset": 31, 
      "position": 27
    }, 
    {
      "end_offset": 36, 
      "token": "s dis", 
      "type": "word", 
      "start_offset": 31, 
      "position": 28
    }, 
    {
      "end_offset": 37, 
      "token": "s dise", 
      "type": "word", 
      "start_offset": 31, 
      "position": 29
    }, 
    {
      "end_offset": 38, 
      "token": "s disea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 30
    }, 
    {
      "end_offset": 39, 
      "token": "s diseas", 
      "type": "word", 
      "start_offset": 31, 
      "position": 31
    }, 
    {
      "end_offset": 40, 
      "token": "s disease", 
      "type": "word", 
      "start_offset": 31, 
      "position": 32
    }, 
    {
      "end_offset": 41, 
      "token": "s disease ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 33
    }, 
    {
      "end_offset": 42, 
      "token": "s disease w", 
      "type": "word", 
      "start_offset": 31, 
      "position": 34
    }, 
    {
      "end_offset": 43, 
      "token": "s disease wi", 
      "type": "word", 
      "start_offset": 31, 
      "position": 35
    }, 
    {
      "end_offset": 44, 
      "token": "s disease wit", 
      "type": "word", 
      "start_offset": 31, 
      "position": 36
    }, 
    {
      "end_offset": 45, 
      "token": "s disease with", 
      "type": "word", 
      "start_offset": 31, 
      "position": 37
    }, 
    {
      "end_offset": 46, 
      "token": "s disease with ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 38
    }, 
    {
      "end_offset": 47, 
      "token": "s disease with e", 
      "type": "word", 
      "start_offset": 31, 
      "position": 39
    }, 
    {
      "end_offset": 48, 
      "token": "s disease with ea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 40
    }, 
    {
      "end_offset": 49, 
      "token": "s disease with ear", 
      "type": "word", 
      "start_offset": 31, 
      "position": 41
    }, 
    {
      "end_offset": 50, 
      "token": "s disease with earl", 
      "type": "word", 
      "start_offset": 31, 
      "position": 42
    }, 
    {
      "end_offset": 51, 
      "token": "s disease with early", 
      "type": "word", 
      "start_offset": 31, 
      "position": 43
    }, 
    {
      "end_offset": 52, 
      "token": "s disease with early ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 44
    }, 
    {
      "end_offset": 53, 
      "token": "s disease with early o", 
      "type": "word", 
      "start_offset": 31, 
      "position": 45
    }, 
    {
      "end_offset": 54, 
      "token": "s disease with early on", 
      "type": "word", 
      "start_offset": 31, 
      "position": 46
    }, 
    {
      "end_offset": 55, 
      "token": "s disease with early ons", 
      "type": "word", 
      "start_offset": 31, 
      "position": 47
    }, 
    {
      "end_offset": 56, 
      "token": "s disease with early onse", 
      "type": "word", 
      "start_offset": 31, 
      "position": 48
    }
  ]
}

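As you can see, the whole string is tokenized with token sizes from 2 to 25 characters. The string is tokenized linearly, spaces included, and the position is incremented by one for every new token.

There are several problems with this:

  1. The edge_ngram_analyzer produces useless tokens that will never be searched for, for example: "0 ", "  ", "  d", "s d", "s disease w" and so on.
  2. It also fails to produce many useful tokens, for example: "disease", "early onset" and so on. If you search for any of these words, you get 0 results.
  3. Notice that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25", some text is "lost" in every field. You can no longer search for that text because there are no tokens for it.
  4. The trim filter only obscures the problem by filtering out extra spaces, when that could be done by the tokenizer.
  5. The edge_ngram_analyzer increments the position of each token, which is a problem for positional queries such as phrase queries. An edge_ngram filter should be used instead, so that the position of each token is preserved when the ngrams are generated.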
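
The problems listed above can be reproduced outside Elasticsearch with a minimal Python sketch. This is only an approximation under stated assumptions, not Elasticsearch code: the function name `edge_ngram_tokenizer` is made up for illustration, and it assumes the tokenizer cuts the text at every character that is not a letter, digit, or whitespace, emits prefixes of each run between min_gram and max_gram, and then lowercases them.

```python
def edge_ngram_tokenizer(text, min_gram=2, max_gram=25):
    """Rough simulation of an edge n-gram tokenizer whose token_chars are
    letter, digit and whitespace, followed by a lowercase filter."""
    tokens = []
    run = ""
    # A sentinel non-token character flushes the final run.
    for ch in text + "\0":
        if ch.isalpha() or ch.isdigit() or ch.isspace():
            run += ch
            continue
        # Emit prefixes of the finished run; text past max_gram is dropped.
        for size in range(min_gram, min(max_gram, len(run)) + 1):
            # Every new token gets its own position (hypothetical numbering
            # that mirrors the listing above).
            tokens.append((len(tokens), run[:size].lower()))
        run = ""
    return tokens


text = "F00.0 -  Dementia in Alzheimer's disease with early onset"
tokens = edge_ngram_tokenizer(text)

for position, token in tokens[:5]:
    print(position, repr(token))
print(tokens[-1])
```

Under these assumptions the sketch yields the same 49 tokens as the listing: the last one is "s disease with early onse" at position 48, and no token equals "disease" on its own.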

The optimal solution

The mapping settings to use:

...
"mappings": {
    "Type": {
       "_all":{
          "analyzer": "edge_ngram_analyzer", 
          "search_analyzer": "keyword_analyzer"
        }, 
        "properties": {
          "Field": {
            "search_analyzer": "keyword_analyzer",
             "type": "string",
             "analyzer": "edge_ngram_analyzer"
          },
...
...
"settings": {
   "analysis": {
      "filter": {
         "english_poss_stemmer": {
            "type": "stemmer",
            "name": "possessive_english"
         },
         "edge_ngram": {
           "type": "edgeNGram",
           "min_gram": "2",
           "max_gram": "25",
           "token_chars": ["letter", "digit"]
         }
      },
      "analyzer": {
         "edge_ngram_analyzer": {
           "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
           "tokenizer": "standard"
         },
         "keyword_analyzer": {
           "filter": ["lowercase", "english_poss_stemmer"],
           "tokenizer": "standard"
         }
      }
   }
}
...

Take a look at the analysis:

{
  "tokens": [
    {
      "end_offset": 5, 
      "token": "f0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00.", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00.0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 17, 
      "token": "de", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 20, 
      "token": "in", 
      "type": "word", 
      "start_offset": 18, 
      "position": 3
    }, 
    {
      "end_offset": 32, 
      "token": "al", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alz", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzh", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzhe", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzhei", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheim", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheime", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheimer", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 40, 
      "token": "di", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "dis", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "dise", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "disea", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "diseas", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "disease", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 45, 
      "token": "wi", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 45, 
      "token": "wit", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 45, 
      "token": "with", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 51, 
      "token": "ea", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "ear", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "earl", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "early", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 57, 
      "token": "on", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "ons", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "onse", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "onset", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }
  ]
}

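At index time, the text is tokenized by the standard tokenizer, and then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only for words. At search time, the text is tokenized by the standard tokenizer, and then the separate words are filtered by lowercase and possessive_english. The searched words are matched against the tokens that were created at index time.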
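
The index-time and search-time pipelines described above can be sketched in Python to see why phrase matching works once the ngrams share a position. This is a simplified model, not Elasticsearch itself: the helper names `analyze` and `phrase_match` are hypothetical, the regular expression is only a crude stand-in for the standard tokenizer, and positions are numbered consecutively (the real analysis output shows a gap after the first token).

```python
import re


def analyze(text, ngrams):
    """Return (position, token) pairs; ngrams=True mimics the index analyzer,
    ngrams=False mimics the search analyzer."""
    # Crude stand-in for the standard tokenizer: keeps "F00.0" and
    # "Alzheimer's" together and drops the lone "-".
    words = re.findall(r"[A-Za-z0-9]+(?:[.'][A-Za-z0-9]+)*", text)
    out = []
    for position, word in enumerate(words):
        word = word.lower()                      # lowercase filter
        if word.endswith("'s"):                  # possessive_english stemmer
            word = word[:-2]
        if ngrams:                               # edge_ngram filter, 2..25
            sizes = range(2, min(25, len(word)) + 1)
            # All ngrams of one word keep that word's position.
            out.extend((position, word[:n]) for n in sizes)
        else:
            out.append((position, word))
    return out


def phrase_match(doc, query):
    """True if the query terms occur at consecutive positions in the doc."""
    indexed = set(analyze(doc, ngrams=True))
    terms = [tok for _, tok in analyze(query, ngrams=False)]
    starts = {pos for pos, _ in indexed}
    return any(
        all((start + i, term) in indexed for i, term in enumerate(terms))
        for start in starts
    )


doc = "F00.0 -  Dementia in Alzheimer's disease with early onset"
print(phrase_match(doc, "dem in alzh"))
print(phrase_match(doc, "in dem"))
```

In this model "dem in alzh" matches because "dem", "in" and "alzh" sit at three consecutive positions, while "in dem" does not because the word order is reversed.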

Thus we make incremental search possible!

Now, because we generate ngrams for separate words, we can even execute queries like this one:

{
  "query": {
    "multi_match": {
      "query": "dem in alzh",
      "type": "phrase",
      "fields": ["_all"]
    }
  }
}

and get correct results.

No text is "lost", everything is searchable, and there is no longer any need to deal with spaces using the trim filter.

28 votes

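Stack Overflow user

Posted 2016-08-09 14:16:11

I believe your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible. Try this query instead: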

{
  "query": {
    "multi_match": {
      "query": "  dementia in alz",
      "analyzer": "keyword",
      "fields": [
        "_all"
      ]
    }
  }
}

Notice the two whitespaces before dementia. They come from the text and are kept by your analyzer. To get rid of them you need the trim token filter:

   "edge_ngram_analyzer": {
      "filter": [
        "lowercase","trim"
      ],
      "tokenizer": "edge_ngram_tokenizer"
    }

And then this query will work (no whitespaces before dementia):

{
  "query": {
    "multi_match": {
      "query": "dementia in alz",
      "analyzer": "keyword",
      "fields": [
        "_all"
      ]
    }
  }
}
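
As a rough illustration of why the trim filter makes the second query match, here is a small Python sketch. It is a simulation under assumptions rather than real Elasticsearch behaviour: `index_terms` and `keyword_query_matches` are hypothetical helpers, the tokenizer is assumed to split on anything that is not a letter, digit, or whitespace, and the keyword analyzer is modelled as passing the query through unchanged as a single term.

```python
def index_terms(text, trim, min_gram=2, max_gram=25):
    """Terms produced by an edge n-gram tokenizer (letter, digit, whitespace)
    plus lowercase, optionally followed by a trim filter."""
    terms, run = set(), ""
    for ch in text + "\0":               # sentinel flushes the last run
        if ch.isalpha() or ch.isdigit() or ch.isspace():
            run += ch
            continue
        for size in range(min_gram, min(max_gram, len(run)) + 1):
            term = run[:size].lower()
            terms.add(term.strip() if trim else term)
        run = ""
    return terms


def keyword_query_matches(query, terms):
    # The keyword analyzer emits the query unchanged as one term,
    # so the match is an exact, case-sensitive comparison.
    return query in terms


text = "F00.0 -  Dementia in Alzheimer's disease with early onset"
untrimmed = index_terms(text, trim=False)
trimmed = index_terms(text, trim=True)

print(keyword_query_matches("  dementia in alz", untrimmed))
print(keyword_query_matches("dementia in alz", untrimmed))
print(keyword_query_matches("dementia in alz", trimmed))
print(keyword_query_matches("Dementia in alz", trimmed))
```

In this model the query without leading spaces only matches once the indexed terms are trimmed, and a capitalized query still does not match because nothing lowercases it at search time.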
9 votes
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/38848121
