In an index I'm building, I want to run a query and then (using facets) return the shingles for that query. Here is the analyzer I'm using on the text:
{
"settings": {
"analysis": {
"analyzer": {
"shingleAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"custom_stop",
"custom_shingle",
"custom_stemmer"
]
}
},
"filter": {
"custom_stemmer" : {
"type": "stemmer",
"name": "english"
},
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "3"
}
}
}
}
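}
The main problem is that, as of Lucene 4.4, the stop filter no longer supports the enable_position_increments parameter, which used to eliminate shingles containing stop words. Instead, I get results like the following, where "_" is the filler token left in place of each removed stop word:
"red yellow"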
"terms": [
{
"term": "red",
"count": 43
},
{
"term": "red _",
"count": 43
},
{
"term": "red _ yellow",
"count": 43
},
{
"term": "_ yellow",
"count": 42
},
{
"term": "yellow",
"count": 42
}
]
Naturally, this greatly skews the number of shingles returned. Is there a way, post Lucene 4.4, to manage this without post-processing the results?
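Posted on 2015-06-05 05:34:34
Probably not the most optimal solution, but the most direct one is to add another filter to your analyzer that kills the "_" filler tokens. In the example below I called it "kill_fillers":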
"shingleAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"custom_stop",
"custom_shingle",
"custom_stemmer",
"kill_fillers"
],
...
Then add the "kill_fillers" filter to your list of filters:
"filter": {
...
"kill_fillers": {
"type": "pattern_replace",
"pattern": ".*_.*",
"replacement": ""
},
...
}
Posted on 2015-11-02 09:26:16
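I'm not sure if this helps, but in the Elasticsearch definition of the shingle filter you can use the filler_token parameter, which is _ by default. Set it to, for example, an empty string:
$indexParams['body']['settings']['analysis']['filter']['shingle-filter']['filler_token'] = "";
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-shingle-tokenfilter.html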
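Applied to the settings from the question, that parameter would go on the custom_shingle filter. The fragment below is only a sketch, assuming an Elasticsearch version whose shingle filter accepts filler_token (the 1.7 reference linked above documents it). Only the filler_token line is new; everything else is copied from the question.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "3",
          "filler_token": ""
        }
      }
    }
  }
}
```

As far as I can tell, this changes the text of the filler rather than dropping the shingles built around a removed stop word, so entries with stray spaces may still show up in the facet counts. It is worth checking the analyzer output on a sample phrase before relying on it.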
https://stackoverflow.com/questions/27410253