Q: Elasticsearch: custom tokenizer

Stack Overflow user

Asked 2021-06-24 20:12:19

Answers: 2 · Views: 58 · Followers: 0 · Votes: 4

I am trying to create a tokenizer that works like this:

POST dev_threats/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "some.test.domain.com"
}

and produces tokens like these:

[some, some.test, some.test.domain, some.test.domain.com, test, test.domain, test.domain.com, domain, domain.com]

I tried the ngram tokenizer:

    "ngram_domain_tokenizer": {
      "type": "ngram",
      "min_gram": 1,
      "max_gram": 63,
      "token_chars": [
        "letter",
        "digit",
        "punctuation"
      ]
    },

But for long values it generates far too many tokens...

Any idea how to get a result like the one above?
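To make the "too many tokens" problem concrete, here is a rough Python sketch that counts the substrings a character n-gram scheme with `min_gram` 1 and `max_gram` 63 would produce. The function `char_ngrams` is a hypothetical helper written for illustration; it only mimics the idea of character n-grams and is not Elasticsearch's implementation.

```python
def char_ngrams(text, min_gram=1, max_gram=63):
    """Return every substring of text with length between min_gram and max_gram.

    Illustrative stand-in for a character n-gram tokenizer, assuming the whole
    input is one run of allowed characters (letters, digits, punctuation).
    """
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(text):
                break  # no longer gram fits from this start position
            grams.append(text[start:start + size])
    return grams


domain = "some.test.domain.com"
print(len(char_ngrams(domain)))     # 210 grams for a 20-character value
print(len(char_ngrams("a" * 63)))   # 2016 grams for a 63-character value
```

The count grows quadratically with the length of the value, whereas the desired output only needs combinations of whole dot-separated labels (9 tokens for this example).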

elasticsearch

tokenize

2 Answers

Stack Overflow user

Accepted answer

Posted 2021-06-24 20:40:45

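You don't need two different analyzers for this. There is another solution that uses shingles, and it goes like this:

First, create an index with an appropriate analyzer, which I have called domain_shingler. It splits the value on punctuation, builds shingles of up to four labels, and then replaces the whitespace that joins the shingled labels with a dot: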

PUT dev_threats
{
  "settings": {
    "analysis": {
      "analyzer": {
        "domain_shingler": {
          "type": "custom",
          "tokenizer": "dot_tokenizer",
          "filter": [
            "shingles",
            "joiner"
          ]
        }
      },
      "tokenizer": {
        "dot_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "punctuation"
          ]
        }
      },
      "filter": {
        "shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 4,
          "output_unigrams": true
        },
        "joiner": {
          "type": "pattern_replace",
          "pattern": """\s""",
          "replacement": "."
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "domain_shingler",
        "search_analyzer": "standard"
      }
    }
  }
}

If you analyze some.test.domain.com with that analyzer, you get the following tokens:

POST dev_threats/_analyze
{
  "analyzer": "domain_shingler",
  "text": "some.test.domain.com"
}

Result:

{
  "tokens" : [
    {
      "token" : "some",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "some.test",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "some.test.domain",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "some.test.domain.com",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 4
    },
    {
      "token" : "test",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "test.domain",
      "start_offset" : 5,
      "end_offset" : 16,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "test.domain.com",
      "start_offset" : 5,
      "end_offset" : 20,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 3
    },
    {
      "token" : "domain",
      "start_offset" : 10,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "domain.com",
      "start_offset" : 10,
      "end_offset" : 20,
      "type" : "shingle",
      "position" : 2,
      "positionLength" : 2
    },
    {
      "token" : "com",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "word",
      "position" : 3
    }
  ]
}
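The token list above can be reproduced outside Elasticsearch with a small Python sketch, which may help when reasoning about how the shingle sizes behave for other inputs. `domain_shingles` is a hypothetical helper, and it assumes the only separator is a dot, whereas the `char_group` tokenizer in the answer splits on any punctuation character.

```python
def domain_shingles(text, min_size=2, max_size=4, output_unigrams=True):
    """Emulate the dot_tokenizer -> shingles -> joiner chain for dot-separated input.

    Assumption: labels are separated by "." only. Parameter names mirror the
    shingle filter settings, but this is an illustration, not the real filter.
    """
    labels = [label for label in text.split(".") if label]
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams:
        sizes = [1] + sizes  # single labels come first, as in the output above
    tokens = []
    for start in range(len(labels)):
        for size in sizes:
            if start + size <= len(labels):
                # the joiner filter turns the shingle separator into "."
                tokens.append(".".join(labels[start:start + size]))
    return tokens


print(domain_shingles("some.test.domain.com"))
```

One thing the sketch makes visible: with `max_shingle_size` set to 4, a name with five labels such as `a.b.c.d.e` gets no token for the full name, so the maximum would need to be raised to cover the longest names you expect.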

Votes: 3

Stack Overflow user

Posted 2021-06-24 20:30:58

You can use the path hierarchy tokenizer:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "."
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": ".",
          "reverse": true
        }
      }
    }
  }
}

POST my-index/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "some.test.domain.com"
}

POST my-index/_analyze
{
  "analyzer": "custom_path_tree_reversed",
  "text": "some.test.domain.com"
}

**Result**

{
  "tokens" : [
    {
      "token" : "some",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "some.test",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "some.test.domain",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "some.test.domain.com",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    }
  ]
}


{
  "tokens" : [
    {
      "token" : "some.test.domain.com",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "test.domain.com",
      "start_offset" : 5,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "domain.com",
      "start_offset" : 10,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "com",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    }
  ]
}

It creates path-like tokens by splitting on the given delimiter. Using the normal and the reversed variants, you can get tokens in both directions.
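As a rough illustration of what the two analyzers emit, the following Python sketch mimics the forward and reversed behaviour shown in the results above. `path_hierarchy` is a hypothetical helper written for this example, not part of any Elasticsearch client.

```python
def path_hierarchy(text, delimiter=".", reverse=False):
    """Emulate the path_hierarchy tokenizer output for a delimited string.

    Forward mode yields growing prefixes; reverse mode yields shrinking suffixes.
    Illustrative only, based on the token lists shown in this answer.
    """
    parts = text.split(delimiter)
    if reverse:
        return [delimiter.join(parts[i:]) for i in range(len(parts))]
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


forward = path_hierarchy("some.test.domain.com")
backward = path_hierarchy("some.test.domain.com", reverse=True)
print(forward)
print(backward)
```

Note that the two directions together cover prefixes and suffixes only. Inner tokens from the question's wish list, such as `test`, `test.domain`, and `domain`, are not produced by either variant.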

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/68115549
