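Q: Using Shingles and Stop words with Elasticsearch and Lucene 4.4

Stack Overflow user

Asked 2014-12-10 20:31:01

2 answers · 1.6K views · 0 followers · 6 votes

In the index I'm building, I want to run a query and then (using facets) return the shingles for that query. Here is the analyzer I'm using on the text: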

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer"
          ]
        }
      },
      "filter": {
        "custom_stemmer" : {
            "type": "stemmer",
            "name": "english"
        },
        "custom_stop": {
            "type": "stop",
            "stopwords": "_english_"
        },
        "custom_shingle": {
            "type": "shingle",
            "min_shingle_size": "2",
            "max_shingle_size": "3"
        }
      }
    }
  }
}

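The main problem is that, as of Lucene 4.4, the stop filter no longer supports the enable_position_increments parameter, which used to eliminate shingles that contain stop words. Instead, I get results like this for:

"red and yellow"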

"terms": [
    {
        "term": "red",
        "count": 43
    },
    {
        "term": "red _",
        "count": 43
    },
    {
        "term": "red _ yellow",
        "count": 43
    },
    {
        "term": "_ yellow",
        "count": 42
    },
    {
        "term": "yellow",
        "count": 42
    }
]

Naturally, this greatly skews the number of shingles returned. Is there a way, post Lucene 4.4, to handle this without post-processing the results?
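To make the facet output above easier to reason about, here is a rough Python model of where the "_" terms come from. It is only a sketch under stated assumptions: a removed stop word is represented as `None` (a position gap), and the hypothetical `shingles` helper fills each gap with a filler string. It is not Lucene's ShingleFilter code and ignores the stemmer.

```python
from typing import List, Optional


def shingles(slots: List[Optional[str]],
             min_size: int = 2,
             max_size: int = 3,
             filler: str = "_") -> List[str]:
    """Toy model: emit unigrams plus word n-grams, filling gaps with `filler`.

    `slots` holds one entry per original token position; None marks a
    position where the stop filter removed a word.
    """
    out = []
    for start, token in enumerate(slots):
        if token is not None:
            out.append(token)  # the unigram itself
        for size in range(min_size, max_size + 1):
            window = slots[start:start + size]
            if len(window) < size:
                break  # not enough positions left for this shingle size
            if all(t is None for t in window):
                continue  # skip windows made only of gaps (a simplification)
            out.append(" ".join(filler if t is None else t for t in window))
    return out


# "red and yellow" after stop-word removal: the middle position is a gap.
print(shingles(["red", None, "yellow"]))
```

With the gap in the middle position, the model produces the same five terms as the facet output in the question.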

elasticsearch

lucene

stop-words

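2 Answers

Stack Overflow user

Accepted answer

Posted 2015-06-05 05:34:34

This is probably not the most elegant solution, but the most direct one is to add another filter to your analyzer that kills the "_" filler tokens. In the example below I called it "kill_fillers":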

   "shingleAnalyzer": {
      "tokenizer": "standard",
      "filter": [
        "standard",
        "lowercase",
        "custom_stop",
        "custom_shingle",
        "custom_stemmer",
        "kill_fillers"
       ],
       ...

Then add the "kill_fillers" filter to your filter definitions:

"filter": {
...
  "kill_fillers": {
    "type": "pattern_replace",
    "pattern": ".*_.*",
    "replacement": ""
  },
...
}
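To see what the ".*_.*" pattern does to the terms from the question, here is a small Python sketch. It only models the text rewrite with Python's `re` module (the real filter uses Java regular expressions), and the `kill_fillers` function name is just borrowed from the answer for illustration.

```python
import re

# Same pattern as in the filter definition: any token containing "_".
FILLER_PATTERN = re.compile(r".*_.*")


def kill_fillers(tokens):
    """Replace every token that contains an underscore with an empty string."""
    return [FILLER_PATTERN.sub("", token) for token in tokens]


terms = ["red", "red _", "red _ yellow", "_ yellow", "yellow"]
print(kill_fillers(terms))
```

In this model the matching tokens become empty strings rather than disappearing, and any token that legitimately contains an underscore (for example "snake_case") is blanked as well. If either matters for your data, it is worth checking the analyzer output before relying on the facet counts.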

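7 votes

Stack Overflow user

Posted 2015-11-02 09:26:16

I'm not sure if this helps, but in the Elasticsearch shingle filter definition you can use the filler_token parameter, which is _ by default. Set it to, for example, an empty string: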

$indexParams['body']['settings']['analysis']['filter']['shingle-filter']['filler_token'] = "";

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-shingle-tokenfilter.html
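For anyone not using the PHP client, the same setting can be expressed as a plain settings body. This is a minimal Python sketch that assumes the `custom_shingle` filter name from the question (the PHP line above uses `shingle-filter`). It only builds and prints the JSON; it does not talk to a cluster or check which shingles come out with an empty filler.

```python
import json

# Shingle filter from the question, plus filler_token set to an empty string.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "custom_shingle": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "filler_token": "",  # default is "_"
                }
            }
        }
    }
}

# Print the body so it can be pasted into an index-creation request.
print(json.dumps(settings, indent=2))
```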

4 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/27410253
