Notes on how to use a user dictionary with Elasticsearch.
■Dictionary (config/dic/user_dic.txt)
リバーシブル,リバーシブル,リバーシブル,カスタム名詞
ヨガマット,ヨガ マット,ヨガ マット,カスタム名詞
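Each dictionary line has four comma-separated fields: the surface text, its space-separated segmentation, the space-separated readings, and a part-of-speech tag. As a rough aid for catching malformed lines before loading the file, the Python sketch below checks one entry at a time. The name `check_entry` is hypothetical, and the two consistency rules it applies (the segments concatenate back to the surface text, and there is one reading per segment) are assumptions about what the dictionary loader expects, not something this memo verifies.

```python
def check_entry(line):
    """Return a list of problems found in one user dictionary line (empty if none)."""
    fields = line.strip().split(",")
    if len(fields) != 4:
        return [f"expected 4 comma-separated fields, got {len(fields)}"]
    surface, segmentation, readings, _pos = fields
    segments = segmentation.split()
    problems = []
    # Assumed rule 1: the segments, joined together, must equal the surface text.
    if "".join(segments) != surface:
        problems.append("segments do not concatenate to the surface form")
    # Assumed rule 2: each segment needs exactly one reading.
    if len(segments) != len(readings.split()):
        problems.append("number of readings differs from number of segments")
    return problems


entries = [
    "リバーシブル,リバーシブル,リバーシブル,カスタム名詞",
    "ヨガマット,ヨガ マット,ヨガ マット,カスタム名詞",
]
for entry in entries:
    print(entry, check_entry(entry) or "OK")
```

Both entries from this memo pass the check; an entry such as `ヨガマット,ヨガ マット,ヨガ,カスタム名詞` would be flagged because it has two segments but only one reading.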
■settings (settings_dic.json)
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji": {
          "type": "kuromoji_tokenizer",
          "mode": "search",
          "user_dictionary": "dic/user_dic.txt"
        }
      },
      "analyzer": {
        "ja_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji"
        }
      }
    }
  }
}
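If the settings need to be generated for several dictionaries, they can also be built programmatically. The sketch below is one possible way to do that in Python; `build_settings` is a hypothetical helper, and it places `mode` on the tokenizer, which is where `kuromoji_tokenizer` takes that option. The `user_dictionary` path is relative to the Elasticsearch config directory, matching config/dic/user_dic.txt above.

```python
import json


def build_settings(user_dictionary="dic/user_dic.txt"):
    """Build index settings that wire a user dictionary into kuromoji_tokenizer."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "kuromoji": {
                        "type": "kuromoji_tokenizer",
                        "mode": "search",
                        "user_dictionary": user_dictionary,
                    }
                },
                "analyzer": {
                    "ja_analyzer": {"type": "custom", "tokenizer": "kuromoji"}
                },
            }
        }
    }


# Print the JSON; redirect it to settings_dic.json to use it with curl.
print(json.dumps(build_settings(), ensure_ascii=False, indent=2))
```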
■Apply settings_dic.json
curl 'http://localhost:9200/settings_dic?pretty' \
  -XPUT \
  -H "Content-Type: application/json" \
  -T settings_dic.json
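For scripting the same step without curl, the request can be assembled with Python's standard library. This is only a sketch: `build_create_index_request` is a hypothetical helper, the host and index name are the ones used in this memo, and the actual send is left commented out because it needs a running node.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, settings):
    """Build (but do not send) the PUT request that creates the index."""
    body = json.dumps(settings, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}?pretty",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_create_index_request(
    "http://localhost:9200", "settings_dic", {"settings": {}}
)
print(req.get_method(), req.full_url)

# To actually send it (requires Elasticsearch listening on localhost:9200):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```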
■Run the analysis
# Sample text: "I bought a reversible yoga mat"
TEXT='リバーシブルヨガマットを買った'
curl -XGET \
  'http://localhost:9200/settings_dic/_analyze?pretty' \
  -H "Content-Type: application/json" \
  -d "
{
  \"analyzer\": \"ja_analyzer\",
  \"text\": \"${TEXT}\"
}"
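The shell command interpolates the text into a hand-escaped JSON string, so a double quote or backslash in the text would produce invalid JSON. A possible alternative, sketched below, lets `json.dumps` do the escaping. `build_analyze_request` is a hypothetical helper, and it uses POST rather than GET, which the `_analyze` API also accepts.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) an _analyze request with a safely escaped body."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze?pretty",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# The embedded double quotes are there only to show that escaping is handled.
analyze_req = build_analyze_request(
    "http://localhost:9200",
    "settings_dic",
    "ja_analyzer",
    'リバーシブル"ヨガマット"を買った',
)
print(analyze_req.data.decode("utf-8"))
```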
■Analysis result
{
  "tokens" : [
    {
      "token" : "リバーシブル",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ヨガ",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "マット",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "を",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "買っ",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "た",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 5
    }
  ]
}
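The `start_offset` and `end_offset` values index into the input text, so each token can be checked against the slice it claims to cover. The sketch below does that for the result above; `offsets_match` is a hypothetical helper, and it assumes the text contains no characters outside the Basic Multilingual Plane, so that Python string indices line up with the reported offsets.

```python
def offsets_match(text, tokens):
    """True if every token equals the slice of text given by its offsets."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"] for t in tokens
    )


text = "リバーシブルヨガマットを買った"
tokens = [
    {"token": "リバーシブル", "start_offset": 0, "end_offset": 6},
    {"token": "ヨガ", "start_offset": 6, "end_offset": 8},
    {"token": "マット", "start_offset": 8, "end_offset": 11},
    {"token": "を", "start_offset": 11, "end_offset": 12},
    {"token": "買っ", "start_offset": 12, "end_offset": 14},
    {"token": "た", "start_offset": 14, "end_offset": 15},
]
print(offsets_match(text, tokens))
```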
■Analysis result (without the dictionary)
{
  "tokens" : [
    {
      "token" : "リバー",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "リバーシブルヨガマット",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "シブルヨガマット",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "を",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "買っ",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "た",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    }
  ]
}
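To make the effect of the dictionary easier to see, the two results can be reduced to their token strings and compared. The sketch below hard-codes the tokens from both results in this memo; `surfaces` is a hypothetical helper. Without the dictionary, the compound リバーシブルヨガマット is kept as one token (with `positionLength` 2) alongside the unhelpful split リバー / シブルヨガマット, which appears to be the search-mode decompounding at work.

```python
def surfaces(response):
    """Extract the token strings from an _analyze response."""
    return [t["token"] for t in response["tokens"]]


with_dic = {"tokens": [
    {"token": "リバーシブル"}, {"token": "ヨガ"}, {"token": "マット"},
    {"token": "を"}, {"token": "買っ"}, {"token": "た"},
]}
without_dic = {"tokens": [
    {"token": "リバー"}, {"token": "リバーシブルヨガマット"},
    {"token": "シブルヨガマット"},
    {"token": "を"}, {"token": "買っ"}, {"token": "た"},
]}

# Tokens that appear in only one of the two results.
only_with = [s for s in surfaces(with_dic) if s not in surfaces(without_dic)]
only_without = [s for s in surfaces(without_dic) if s not in surfaces(with_dic)]
print("only with dictionary:   ", only_with)
print("only without dictionary:", only_without)
```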