Elasticsearch Study Notes | Tokenization, the IK Analyzer, and Custom Dictionaries

比眉伴天荒 2023-01-16 09:47

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.

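For example, the whitespace tokenizer splits text whenever it encounters whitespace. It would split the text "Quick brown fox!" into \[Quick, brown, fox!\]; punctuation attached to a word stays in the token. The tokenizer is also responsible for recording the order, or position, of each term (used for phrase and word proximity queries) and the start and end character offsets of the original word that the term represents (used for highlighting matches in search results).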
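
To make the bookkeeping concrete, here is a minimal Python sketch of a whitespace tokenizer that records a position and character offsets for each token. It only imitates the behaviour described above and is not Elasticsearch's implementation; the function name `whitespace_tokenize` is made up for this illustration, and the field names simply mirror those in the \_analyze output.

```python
import re

def whitespace_tokenize(text):
    """Split text on whitespace, recording position and character offsets."""
    tokens = []
    # \S+ matches each maximal run of non-whitespace characters
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),  # exclusive end, like the _analyze output
            "position": position,
        })
    return tokens

tokens = whitespace_tokenize("Quick brown fox!")
for t in tokens:
    print(t)
```

Slicing the original text with `start_offset` and `end_offset` gives back exactly the characters a token came from, which is what highlighting relies on.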

Elasticsearch ships with many built-in tokenizers, which can be used to build custom analyzers.
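
As a rough illustration of what "building a custom analyzer" looks like, the sketch below assembles index settings that combine the built-in whitespace tokenizer with the built-in lowercase token filter. The analyzer name `my_whitespace_lowercase` is a hypothetical name chosen for this example; the resulting JSON would be sent as the body of an index-creation request.

```python
import json

# Index settings defining a custom analyzer.
# "my_whitespace_lowercase" is a hypothetical name chosen for this sketch.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_whitespace_lowercase": {
                    "type": "custom",            # analyzer assembled from parts
                    "tokenizer": "whitespace",   # built-in tokenizer
                    "filter": ["lowercase"],     # built-in token filter
                }
            }
        }
    }
}

body = json.dumps(settings, indent=2)
print(body)
```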

I. Previewing Analysis Results  
POST \_analyze  
\{  
  "analyzer": "standard",  
  "text": "Note however that storage is optimized based on the actual values that are stored"  
\}  
Response:

\{  
  "tokens" : \[  
    \{  
      "token" : "note",  
      "start\_offset" : 0,  
      "end\_offset" : 4,  
      "type" : "<ALPHANUM>",  
      "position" : 0  
    \},  
    \{  
      "token" : "however",  
      "start\_offset" : 5,  
      "end\_offset" : 12,  
      "type" : "<ALPHANUM>",  
      "position" : 1  
    \},  
    \{  
      "token" : "that",  
      "start\_offset" : 13,  
      "end\_offset" : 17,  
      "type" : "<ALPHANUM>",  
      "position" : 2  
    \},  
    \{  
      "token" : "storage",  
      "start\_offset" : 18,  
      "end\_offset" : 25,  
      "type" : "<ALPHANUM>",  
      "position" : 3  
    \},  
...  
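
One way to read this response is to map each token back onto the input text through its offsets. The Python sketch below does that for the four tokens shown above (the remainder is elided in the response and is left out here as well); the helper name `source_slices` is made up for this illustration.

```python
text = ("Note however that storage is optimized based on "
        "the actual values that are stored")

# The first four tokens copied from the response above.
response = {
    "tokens": [
        {"token": "note", "start_offset": 0, "end_offset": 4,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "however", "start_offset": 5, "end_offset": 12,
         "type": "<ALPHANUM>", "position": 1},
        {"token": "that", "start_offset": 13, "end_offset": 17,
         "type": "<ALPHANUM>", "position": 2},
        {"token": "storage", "start_offset": 18, "end_offset": 25,
         "type": "<ALPHANUM>", "position": 3},
    ]
}

def source_slices(text, response):
    """Pair each token with the substring of the input it was produced from."""
    return [(t["token"], text[t["start_offset"]:t["end_offset"]])
            for t in response["tokens"]]

pairs = source_slices(text, response)
for token, original in pairs:
    print(token, "<-", original)
```

The first pair shows that the standard analyzer lowercased "Note" to "note", while the offsets still point at the original, unmodified text.
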
With Chinese text, however, the standard analyzer runs into problems:

POST \_analyze  
\{  
  "analyzer": "standard",  
  "text": "更改其他索引的字段的映射"  
\}  
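The response is shown below. Every Chinese character becomes its own token, which clearly does not reflect how the text actually divides into words.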

\{  
  "tokens" : \[  
    \{  
      "token" : "更",  
      "start\_offset" : 0,  
      "end\_offset" : 1,  
      "type" : "<IDEOGRAPHIC>",  
      "position" : 0  
    \},  
    \{  
      "token" : "改",  
      "start\_offset" : 1,  
      "end\_offset" : 2,  
      "type" : "<IDEOGRAPHIC>",  
      "position" : 1  
    \},  
    \{  
      "token" : "其",  
      "start\_offset" : 2,  
      "end\_offset" : 3,  
      "type" : "<IDEOGRAPHIC>",  
      "position" : 2  
    \},  
...  
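II. Installing the IK Analyzer  
In the plugins directory of the Elasticsearch installation, create a new folder named ik, then extract the contents of the IK plugin release archive into it, so that the files end up in es/plugins/ik. Use the IK release that matches the Elasticsearch version.

Elasticsearch must then be restarted so that it loads the plugin.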

Test again:

POST \_analyze  
\{  
  "analyzer": "ik\_smart",  
  "text": "你是列文虎克吗"  
\}  
Result:

\{  
  "tokens" : \[  
    \{  
      "token" : "你",  
      "start\_offset" : 0,  
      "end\_offset" : 1,  
      "type" : "CN\_CHAR",  
      "position" : 0  
    \},  
    \{  
      "token" : "是",  
      "start\_offset" : 1,  
      "end\_offset" : 2,  
      "type" : "CN\_CHAR",  
      "position" : 1  
    \},  
    \{  
      "token" : "列",  
      "start\_offset" : 2,  
      "end\_offset" : 3,  
      "type" : "CN\_CHAR",  
      "position" : 2  
    \},  
    \{  
      "token" : "文虎",  
      "start\_offset" : 3,  
      "end\_offset" : 5,  
      "type" : "CN\_WORD",  
      "position" : 3  
    \},  
    \{  
      "token" : "克",  
      "start\_offset" : 5,  
      "end\_offset" : 6,  
      "type" : "CN\_CHAR",  
      "position" : 4  
    \},  
    \{  
      "token" : "吗",  
      "start\_offset" : 6,  
      "end\_offset" : 7,  
      "type" : "CN\_CHAR",  
      "position" : 5  
    \}  
  \]  
\}  
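III. Custom Dictionaries  
In the example above, 列文虎克 (Leeuwenhoek) was split apart because the word is not in the dictionary that the IK analyzer ships with. We can add such words through a custom dictionary.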
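
To see why a missing dictionary entry causes the split, consider the toy Python sketch below. It uses greedy forward maximum matching, a deliberately simplified stand-in and not IK's actual algorithm; the function name `forward_max_match` and the one-word base dictionary are assumptions made purely for illustration.

```python
def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation; unmatched characters become single tokens."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No multi-character dictionary word starts here: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

base = {"文虎"}  # toy dictionary that lacks the full name
before = forward_max_match("你是列文虎克吗", base)
after = forward_max_match("你是列文虎克吗", base | {"列文虎克"})
print(before)
print(after)
```

With the toy dictionary the name falls apart into 列, 文虎, and 克, mirroring the result above; once 列文虎克 is added to the dictionary, it comes out as a single token.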

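In elasticsearch-5.6.8\\plugins\\ik\\config:

Create a new file named z\_SelfAdd.dic, add the new words to it (one word per line), and save it as UTF-8.

Then, in the IKAnalyzer.cfg.xml configuration file in the same directory, add <entry key="ext\_dict">z\_SelfAdd.dic</entry> to register the newly created file as an extension dictionary.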

The change takes effect after Elasticsearch is restarted.  
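
Since the encoding of the dictionary file matters, one option is to generate the file from a script rather than a text editor. The Python sketch below writes the words one per line as UTF-8; the helper name `write_ext_dict` is hypothetical, and a temporary directory stands in for the real plugins/ik/config directory so the example can run anywhere.

```python
from pathlib import Path
import tempfile

def write_ext_dict(config_dir, filename, words):
    """Write one word per line, encoded as UTF-8, and return the file path."""
    path = Path(config_dir) / filename
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return path

# A temporary directory stands in for the real plugins/ik/config directory.
with tempfile.TemporaryDirectory() as config_dir:
    dict_path = write_ext_dict(config_dir, "z_SelfAdd.dic", ["列文虎克"])
    written = dict_path.read_bytes()
    print(written)

# The entry to place in IKAnalyzer.cfg.xml, as described in the steps above
entry = '<entry key="ext_dict">z_SelfAdd.dic</entry>'
print(entry)
```

In a real setup, `config_dir` would point at the IK config directory, and the restart is still required afterwards.
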
————————————————  
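Copyright notice: This is an original article by CSDN blogger 「北鹤M」, licensed under CC 4.0 BY-SA. Reposts must include a link to the original article and this notice.  
Original link: https://blog.csdn.net/qq\_20051535/article/details/113251848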