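I know about the charset_table setting that allows U+00E9->e, which maps 'é' to 'e'. However, if instead of U+00E9 you use U+0065 U+0301 (the "decomposed" form of "é": an "e" followed by a combining acute accent), Sphinx treats U+0301 as whitespace and splits the word.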
Example (the first query uses the precomposed form, the second the decomposed form):
mysql> CALL KEYWORDS('Crème brûlée', 'recipes_rt', 1);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | creme | creme | 3 | 3 |
| 2 | brulee | brulee | 2 | 2 |
+------+-----------+------------+------+------+
2 rows in set (0.00 sec)
mysql> CALL KEYWORDS('Crème brûlée', 'recipes_rt', 1);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | creme | creme | 3 | 3 |
| 2 | brule | brule | 0 | 0 |
| 3 | e | e | 3 | 3 |
+------+-----------+------------+------+------+
3 rows in set (0.15 sec)
NFKC Unicode normalization is needed here, but I can't find any mention of it in the documentation.
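To make the difference between the two queries concrete, here is a minimal Python sketch using the standard `unicodedata` module. It is not part of Sphinx; it assumes the text can be normalized by the application before it is indexed or sent as a query, and the variable names are purely illustrative.

```python
import unicodedata

# Precomposed form: "û" is U+00FB and "é" is U+00E9 (one code point each).
composed = "br\u00fbl\u00e9e"
# Decomposed form: "u" + U+0302 and "e" + U+0301 (base letter + combining mark).
decomposed = "bru\u0302le\u0301e"

# The two strings render identically but differ in their code points.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 6 8

# NFKC normalization recomposes base letter + combining mark pairs.
normalized = unicodedata.normalize("NFKC", decomposed)
print(normalized == composed)           # True
```

After normalization the decomposed input becomes the precomposed one, so an existing charset_table mapping such as U+00E9->e would apply to it.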
Posted on 2015-11-20 19:48:37
Not sure how to handle it "scalably" (i.e. for all the forms), but individual cases could be handled with regexp_filter:
http://sphinxsearch.com/docs/current/conf-regexp-filter.html
regexp_filter = \%u0065\%u0301 => e
Then again, maybe just add U+0301 (and the other "combining" characters) to ignore_chars?
http://sphinxsearch.com/docs/current/conf-ignore-chars.html
They would then simply disappear, leaving the "unaccented" character (e).
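The effect of ignoring combining characters can be emulated outside Sphinx. The following Python sketch only illustrates what happens to the text; the helper name `strip_combining_marks` is hypothetical, and the range U+0300..U+036F (the Combining Diacritical Marks block) is an assumption about which characters one would choose to ignore.

```python
def strip_combining_marks(text: str) -> str:
    """Drop every character in U+0300..U+036F (assumed ignore range)."""
    return "".join(ch for ch in text if not 0x0300 <= ord(ch) <= 0x036F)

# Decomposed input: each accent is a separate combining code point.
decomposed = "Cre\u0300me bru\u0302le\u0301e"
print(strip_combining_marks(decomposed))  # Creme brulee
```

Precomposed characters such as U+00E9 fall outside that range, so they would still rely on the charset_table mapping.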
https://stackoverflow.com/questions/33811296