I am trying to create a tokenizer that works like this:
POST dev_threats/_analyze
{
"tokenizer": "my_tokenizer",
"text": "some.test.domain.com"
}
and returns tokens like these:
[some, some.test, some.test.domain, some.test.domain.com, test, test.domain, test.domain.com, domain, domain.com]
I tried the ngram tokenizer:
"ngram_domain_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 63,
"token_chars": [
"letter",
"digit",
"punctuation"
]
},
but for long values it generates far too many tokens.
Do you know how I can get the result shown above?
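To make the size of the problem concrete, here is a minimal Python sketch (not Elasticsearch code) that compares how many character n-grams a setting like the one above would yield for a single value with how many label-level tokens are actually wanted. The helper names are hypothetical, and the n-gram count assumes the whole value is one unbroken run of allowed characters.

```python
def char_ngram_count(text: str, min_gram: int = 1, max_gram: int = 63) -> int:
    """Count character n-grams of every length in [min_gram, max_gram]."""
    n = len(text)
    return sum(n - size + 1 for size in range(min_gram, min(max_gram, n) + 1))


def label_spans(domain: str) -> list[str]:
    """Every contiguous run of dot-separated labels, joined back with dots."""
    labels = domain.split(".")
    return [
        ".".join(labels[start:end])
        for start in range(len(labels))
        for end in range(start + 1, len(labels) + 1)
    ]


domain = "some.test.domain.com"
print(char_ngram_count(domain))  # number of character n-grams
print(len(label_spans(domain)))  # number of label-level tokens
```

For this 20-character value the first count is 210 and the second is 10 (the nine tokens listed above plus the bare "com"), which is why a label-aware approach scales much better than character n-grams.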
Posted on 2021-06-24 20:40:45
You don't need two different analyzers for this. There is another solution that uses shingles.
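As a rough illustration of the idea, the following Python sketch emulates what shingling does to the labels of a dotted name, using the same sizes as the shingle filter in this answer (2 to 4, with unigrams kept). It is only an approximation under stated assumptions: the function name is hypothetical, and the input is split on dots only, whereas a tokenizer that splits on all punctuation would also break on hyphens.

```python
def shingle_tokens(domain, min_size=2, max_size=4, output_unigrams=True):
    """Emulate unigrams plus shingles over dot-separated labels."""
    labels = [label for label in domain.split(".") if label]
    tokens = []
    for start in range(len(labels)):
        if output_unigrams:
            tokens.append(labels[start])
        for size in range(min_size, max_size + 1):
            if start + size > len(labels):
                break  # not enough labels left for a shingle of this size
            tokens.append(".".join(labels[start:start + size]))
    return tokens


print(shingle_tokens("some.test.domain.com"))
# A name with five labels is never emitted whole when max_size is 4.
print("a.b.c.d.e" in shingle_tokens("a.b.c.d.e"))
```

The second call shows the main limitation: names with more labels than the maximum shingle size are not emitted in full, so the maximum has to be chosen with the longest expected name in mind. In Elasticsearch the allowed gap between the minimum and maximum shingle size is additionally capped by the `index.max_shingle_diff` index setting.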
First, create an index with a suitable analyzer, which I call domain_shingler:
PUT dev_threats
{
"settings": {
"analysis": {
"analyzer": {
"domain_shingler": {
"type": "custom",
"tokenizer": "dot_tokenizer",
"filter": [
"shingles",
"joiner"
]
}
},
"tokenizer": {
"dot_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"punctuation"
]
}
},
"filter": {
"shingles": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 4,
"output_unigrams": true
},
"joiner": {
"type": "pattern_replace",
"pattern": """\s""",
"replacement": "."
}
}
}
},
"mappings": {
"properties": {
"domain": {
"type": "text",
"analyzer": "domain_shingler",
"search_analyzer": "standard"
}
}
}
}
If you analyze some.test.domain.com with that analyzer, you get the following tokens:
POST dev_threats/_analyze
{
"analyzer": "domain_shingler",
"text": "some.test.domain.com"
}
Result:
{
"tokens" : [
{
"token" : "some",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "some.test",
"start_offset" : 0,
"end_offset" : 9,
"type" : "shingle",
"position" : 0,
"positionLength" : 2
},
{
"token" : "some.test.domain",
"start_offset" : 0,
"end_offset" : 16,
"type" : "shingle",
"position" : 0,
"positionLength" : 3
},
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "shingle",
"position" : 0,
"positionLength" : 4
},
{
"token" : "test",
"start_offset" : 5,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "test.domain",
"start_offset" : 5,
"end_offset" : 16,
"type" : "shingle",
"position" : 1,
"positionLength" : 2
},
{
"token" : "test.domain.com",
"start_offset" : 5,
"end_offset" : 20,
"type" : "shingle",
"position" : 1,
"positionLength" : 3
},
{
"token" : "domain",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 2
},
{
"token" : "domain.com",
"start_offset" : 10,
"end_offset" : 20,
"type" : "shingle",
"position" : 2,
"positionLength" : 2
},
{
"token" : "com",
"start_offset" : 17,
"end_offset" : 20,
"type" : "word",
"position" : 3
}
]
}
Posted on 2021-06-24 20:30:58
PUT my-index
{
"settings": {
"analysis": {
"analyzer": {
"custom_path_tree": {
"tokenizer": "custom_hierarchy"
},
"custom_path_tree_reversed": {
"tokenizer": "custom_hierarchy_reversed"
}
},
"tokenizer": {
"custom_hierarchy": {
"type": "path_hierarchy",
"delimiter": "."
},
"custom_hierarchy_reversed": {
"type": "path_hierarchy",
"delimiter": ".",
"reverse": "true"
}
}
}
}
}
POST my-index/_analyze
{
"analyzer": "custom_path_tree",
"text": "some.test.domain.com"
}
POST my-index/_analyze
{
"analyzer": "custom_path_tree_reversed",
"text": "some.test.domain.com"
}
**Result**
{
"tokens" : [
{
"token" : "some",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "some.test",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "some.test.domain",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 0
},
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 0
}
]
}
{
"tokens" : [
{
"token" : "some.test.domain.com",
"start_offset" : 0,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "test.domain.com",
"start_offset" : 5,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "domain.com",
"start_offset" : 10,
"end_offset" : 20,
"type" : "word",
"position" : 0
},
{
"token" : "com",
"start_offset" : 17,
"end_offset" : 20,
"type" : "word",
"position" : 0
}
]
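}
The path_hierarchy tokenizer splits the input on the given delimiter and emits path-like tokens. Using both the normal and the reverse option gives you tokens in both directions.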
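To see how the two directions combine, here is a small Python sketch that mimics the forward and reversed output shown above and checks it against the token list from the question. It is an emulation for illustration only, not the tokenizer's implementation, and the function name is hypothetical.

```python
def path_hierarchy(text, delimiter=".", reverse=False):
    """Emulate the prefix (forward) or suffix (reverse) tokens shown above."""
    parts = text.split(delimiter)
    if reverse:
        return [delimiter.join(parts[i:]) for i in range(len(parts))]
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


domain = "some.test.domain.com"
wanted = [
    "some", "some.test", "some.test.domain", "some.test.domain.com",
    "test", "test.domain", "test.domain.com",
    "domain", "domain.com",
]

forward = path_hierarchy(domain)
backward = path_hierarchy(domain, reverse=True)
missing = [token for token in wanted if token not in forward + backward]
print(missing)
```

The leftover list contains the tokens that start and end in the middle of the name (`test`, `test.domain`, `domain`). Prefixes and suffixes alone do not produce those, so this approach fits best when matching on leading or trailing parts of the domain is sufficient.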
https://stackoverflow.com/questions/68115549