dak blog

Notes on programming in Python, Ruby, and other languages, MySQL, server configuration, and so on. Also LEGO photos.

Synonym dictionaries with kuromoji in Elasticsearch

2022-10-23 13:16:46 | elasticsearch
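Notes on how to apply a synonym dictionary to kuromoji in Elasticsearch.
■Synonym dictionary
When the synonym dictionary file is specified with a relative path, the path is resolved relative to the config directory under the directory where Elasticsearch is installed.
In this example, elasticsearch/config/synonym/synonym_dic.txt contains the following entries.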
にっぽん, ニッポン, 日本
書籍, 本
アメリカ => 米国
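
The dictionary above mixes two rule formats: a comma-separated line lists terms that are treated as equivalent, and a line with => replaces the left-hand term with the right-hand term in one direction only. The following is a minimal Python sketch of that distinction. The function names parse_synonyms and expand are hypothetical, and this is a simplified model for illustration, not how Elasticsearch implements the filter (Elasticsearch also passes each term through the analyzer).

```python
def parse_synonyms(lines):
    """Parse synonym rules into a {term: [terms it is rewritten to]} table."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if "=>" in line:
            # Explicit mapping: each left-hand term is replaced by the right-hand terms.
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for term in left.split(","):
                term = term.strip()
                if term:
                    table[term] = targets
        else:
            # Equivalent terms: each term expands to the whole group.
            group = [t.strip() for t in line.split(",") if t.strip()]
            for term in group:
                table[term] = group
    return table


def expand(table, term):
    """Return the terms that `term` is rewritten to (itself if no rule applies)."""
    return table.get(term, [term])


rules = parse_synonyms([
    "にっぽん, ニッポン, 日本",
    "書籍, 本",
    "アメリカ => 米国",
])
print(expand(rules, "書籍"))      # ['書籍', '本']
print(expand(rules, "アメリカ"))  # ['米国']
print(expand(rules, "米国"))      # ['米国']
```

Note that 米国 has no rule of its own, so it is left unchanged, which is the one-directional behavior of the => format.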

■Index definition (test_synonym.json)
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_text_analyzer": {
          "tokenizer": "kuromoji_tokenizer",
          "type": "custom",
          "char_filter": [
            "icu_normalizer",
            "kuromoji_iteration_mark"
          ],
          "filter": [
            "kuromoji_readingform",
            "kuromoji_stemmer",
            "kuromoji_part_of_speech",
            "ja_stop",
            "kuromoji_stemmer",
            "synonym_dic"
          ]
        }
      },

      "filter": {
        "synonym_dic": {
          "type": "synonym_graph",
          "synonyms_path": "synonym/synonym_dic.txt"
        }
      }

    }
  },

  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": {
        "type": "keyword",
        "store": "true"
      },
      "title": {
        "type": "text",
        "store": "true",
        "analyzer": "ja_text_analyzer"
      },
      "body": {
        "type": "text",
        "store": "true",
        "analyzer": "ja_text_analyzer"
      }
    }
  }
}

■Index creation
Create the test_synonym index from the index definition above.
curl "http://localhost:9200/test_synonym?pretty" \
     -X PUT \
     -H 'Content-Type: application/json' \
     -T test_synonym.json
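
For reference, the same request can be built from Python with only the standard library. This is a rough sketch under assumptions: the helper names build_create_index_request and send are hypothetical, the node is assumed to be at http://localhost:9200 as in the curl command, and the definition dict is an abbreviated stand-in for the contents of test_synonym.json. Keeping the index name in a single variable makes it harder for the name used at creation time to drift from the name used later for indexing and searching.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node, as in the curl command
INDEX = "test_synonym"            # single source of truth for the index name


def build_create_index_request(base_url, index, definition):
    """Build (but do not send) the PUT request that creates the index."""
    body = json.dumps(definition, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}?pretty",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Abbreviated stand-in; in practice, load test_synonym.json with json.load().
definition = {"settings": {}, "mappings": {"dynamic": "strict"}}

request = build_create_index_request(ES_URL, INDEX, definition)
print(request.get_method(), request.full_url)
# To actually create the index, call: print(send(request))
```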

■Data to index (bulk_test_synonym.jsonl)
{"index": {"_id": "id_01"}}
{"id": "id_01", "title": "日本", "body": "日本の家"}

{"index": {"_id": "id_02"}}
{"id": "id_02", "title": "ヨーロッパ", "body": "ヨーロッパの本"}

{"index": {"_id": "id_03"}}
{"id": "id_03", "title": "アメリカ", "body": "アメリカのテレビ"}

{"index": {"_id": "id_04"}}
{"id": "id_04", "title": "米国", "body": "米国のラジオ"}
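
A file in this format can also be generated rather than written by hand. The sketch below is one possible way to do it. The function name build_bulk_body is hypothetical. Each document becomes an action line followed by a source line, and the body ends with a newline, which the bulk API requires.

```python
import json


def build_bulk_body(docs):
    """Return a bulk body: one action line and one source line per document."""
    lines = []
    for doc in docs:
        # Action line: index the document under its own id.
        lines.append(json.dumps({"index": {"_id": doc["id"]}}, ensure_ascii=False))
        # Source line: the document itself.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


docs = [
    {"id": "id_01", "title": "日本", "body": "日本の家"},
    {"id": "id_02", "title": "ヨーロッパ", "body": "ヨーロッパの本"},
    {"id": "id_03", "title": "アメリカ", "body": "アメリカのテレビ"},
    {"id": "id_04", "title": "米国", "body": "米国のラジオ"},
]

body = build_bulk_body(docs)
print(body, end="")
```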

■Indexing the data
curl "http://localhost:9200/test_synonym/_bulk?pretty" \
     -X POST \
     -H 'Content-Type: application/x-ndjson' \
     -T bulk_test_synonym.jsonl

■Search (書籍)
Searching for 「書籍」 also hits 「本」, because of the dictionary entry 「書籍, 本」.
curl "http://localhost:9200/test_synonym/_search?pretty" \
     --silent \
     -X GET \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"body": "書籍"}}}'

The result is as follows.
    "hits" : [
      {
        ...
        "_source" : {
          "id" : "id_02",
          "title" : "ヨーロッパ",
          "body" : "ヨーロッパの本"
        }
      }
    ]

■Search (アメリカ)
Searching for 「アメリカ」 also hits 「米国」, because of the dictionary entry 「アメリカ => 米国」.
curl "http://localhost:9200/test_synonym/_search?pretty" \
     --silent \
     -X GET \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"body": "アメリカ"}}}'

The result is as follows.
    "hits" : [
      {
        ...
        "_source" : {
          "id" : "id_03",
          "title" : "アメリカ",
          "body" : "アメリカのテレビ"
        }
      },
      {
        ...
        "_source" : {
          "id" : "id_04",
          "title" : "米国",
          "body" : "米国のラジオ"
        }
      }
    ]

■Search (米国)
The dictionary entry 「アメリカ => 米国」 is one-directional, so searching for 「米国」 does not treat 「アメリカ」 as a synonym.
curl "http://localhost:9200/test_synonym/_search?pretty" \
     --silent \
     -X GET \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"body": "米国"}}}'

The result is as follows.
    "hits" : [
      {
        ...
        "_source" : {
          "id" : "id_04",
          "title" : "米国",
          "body" : "米国のラジオ"
        }
      }
    ]


Morphological analysis with Elasticsearch

2022-10-16 23:37:55 | elasticsearch
Elasticsearch can run morphological analysis and return information such as the part of speech of each token.
■Index definition (create_test1.json)
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_text_analyzer1": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "icu_normalizer",
            "kuromoji_baseform",
            "to_katakana"
          ]
        }
      },
      "filter": {
        "to_katakana": {
          "type": "icu_transform",
          "id": "Hiragana-Katakana"
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "text": {"type": "text", "store": "true", "analyzer": "ja_text_analyzer1"}
    }
  }
}

■Index creation
curl "http://localhost:9200/test1?pretty" \
     -X PUT \
     -H 'Content-Type: application/json' \
     -T create_test1.json

■Analysis (without detailed information)
curl "http://localhost:9200/test1/_analyze?pretty" \
     -XGET \
     -H 'Content-Type: application/json' \
     -v \
     --data '
{
     "analyzer": "ja_text_analyzer1",
     "text": "私は日本人です"
}'

■Analysis result
{
  "tokens" : [
    {
      "token" : "私",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ハ",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "日本人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "デス",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    }
  ]
}
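
When the response is processed in a script, usually only the token strings and their offsets are needed. The following small sketch pulls them out of a response shaped like the one above. The function name token_terms is hypothetical, and the response is embedded as an abbreviated dict instead of being fetched from a running node.

```python
def token_terms(response):
    """Return (token, start_offset, end_offset) for each token in an _analyze response."""
    return [
        (t["token"], t["start_offset"], t["end_offset"])
        for t in response["tokens"]
    ]


# Abbreviated copy of the response shown above.
response = {
    "tokens": [
        {"token": "私", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
        {"token": "ハ", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
        {"token": "日本人", "start_offset": 2, "end_offset": 5, "type": "word", "position": 2},
        {"token": "デス", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3},
    ]
}

for term, start, end in token_terms(response):
    print(f"{term}\t{start}-{end}")
```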

■Analysis (with detailed information)
Specifying "explain": "true" returns detailed information for each token.
curl "http://localhost:9200/test1/_analyze?pretty" \
     -XGET \
     -H 'Content-Type: application/json' \
     -v \
     --data '
{
     "analyzer": "ja_text_analyzer1",
     "explain": "true",
     "text": "私は日本人です"
}'

■Analysis result
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "kuromoji_tokenizer",
      "tokens" : [
        {
          "token" : "私",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0,
          "baseForm" : null,
          "bytes" : "[e7 a7 81]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "名詞-代名詞-一般",
          "partOfSpeech (en)" : "noun-pronoun-misc",
          "positionLength" : 1,
          "pronunciation" : "ワタシ",
          "pronunciation (en)" : "watashi",
          "reading" : "ワタシ",
          "reading (en)" : "watashi",
          "termFrequency" : 1
        },
        {
          "token" : "は",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "word",
          "position" : 1,
          "baseForm" : null,
          "bytes" : "[e3 81 af]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "助詞-係助詞",
          "partOfSpeech (en)" : "particle-dependency",
          "positionLength" : 1,
          "pronunciation" : "ワ",
          "pronunciation (en)" : "wa",
          "reading" : "ハ",
          "reading (en)" : "ha",
          "termFrequency" : 1
        },
        {
          "token" : "日本人",
          "start_offset" : 2,
          "end_offset" : 5,
          "type" : "word",
          "position" : 2,
          "baseForm" : null,
          "bytes" : "[e6 97 a5 e6 9c ac e4 ba ba]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "名詞-一般",
          "partOfSpeech (en)" : "noun-common",
          "positionLength" : 1,
          "pronunciation" : "ニッポンジン",
          "pronunciation (en)" : "nipponjin",
          "reading" : "ニッポンジン",
          "reading (en)" : "nipponjin",
          "termFrequency" : 1
        },
        {
          "token" : "です",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 3,
          "baseForm" : null,
          "bytes" : "[e3 81 a7 e3 81 99]",
          "inflectionForm" : "基本形",
          "inflectionForm (en)" : "base",
          "inflectionType" : "特殊・デス",
          "inflectionType (en)" : "special-desu",
          "partOfSpeech" : "助動詞",
          "partOfSpeech (en)" : "auxiliary-verb",
          "positionLength" : 1,
          "pronunciation" : "デス",
          "pronunciation (en)" : "desu",
          "reading" : "デス",
          "reading (en)" : "desu",
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [
      {
        "name" : "icu_normalizer",
        "tokens" : [
          {
            "token" : "私",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e7 a7 81]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "名詞-代名詞-一般",
            "partOfSpeech (en)" : "noun-pronoun-misc",
            "positionLength" : 1,
            "pronunciation" : "ワタシ",
            "pronunciation (en)" : "watashi",
            "reading" : "ワタシ",
            "reading (en)" : "watashi",
            "termFrequency" : 1
          },
          {
            "token" : "は",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "word",
            "position" : 1,
            "baseForm" : null,
            "bytes" : "[e3 81 af]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "助詞-係助詞",
            "partOfSpeech (en)" : "particle-dependency",
            "positionLength" : 1,
            "pronunciation" : "ワ",
            "pronunciation (en)" : "wa",
            "reading" : "ハ",
            "reading (en)" : "ha",
            "termFrequency" : 1
          },
          {
            "token" : "日本人",
            "start_offset" : 2,
            "end_offset" : 5,
            "type" : "word",
            "position" : 2,
            "baseForm" : null,
            "bytes" : "[e6 97 a5 e6 9c ac e4 ba ba]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "名詞-一般",
            "partOfSpeech (en)" : "noun-common",
            "positionLength" : 1,
            "pronunciation" : "ニッポンジン",
            "pronunciation (en)" : "nipponjin",
            "reading" : "ニッポンジン",
            "reading (en)" : "nipponjin",
            "termFrequency" : 1
          },
          {
            "token" : "です",
            "start_offset" : 5,
            "end_offset" : 7,
            "type" : "word",
            "position" : 3,
            "baseForm" : null,
            "bytes" : "[e3 81 a7 e3 81 99]",
            "inflectionForm" : "基本形",
            "inflectionForm (en)" : "base",
            "inflectionType" : "特殊・デス",
            "inflectionType (en)" : "special-desu",
            "partOfSpeech" : "助動詞",
            "partOfSpeech (en)" : "auxiliary-verb",
            "positionLength" : 1,
            "pronunciation" : "デス",
            "pronunciation (en)" : "desu",
            "reading" : "デス",
            "reading (en)" : "desu",
            "termFrequency" : 1
          }
        ]
      },
      {
        "name" : "kuromoji_baseform",
        "tokens" : [
          {
            "token" : "私",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e7 a7 81]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "名詞-代名詞-一般",
            "partOfSpeech (en)" : "noun-pronoun-misc",
            "positionLength" : 1,
            "pronunciation" : "ワタシ",
            "pronunciation (en)" : "watashi",
            "reading" : "ワタシ",
            "reading (en)" : "watashi",
            "termFrequency" : 1
          },
          {
            "token" : "は",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "word",
            "position" : 1,
            "baseForm" : null,
            "bytes" : "[e3 81 af]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "助詞-係助詞",
            "partOfSpeech (en)" : "particle-dependency",
            "positionLength" : 1,
            "pronunciation" : "ワ",
            "pronunciation (en)" : "wa",
            "reading" : "ハ",
            "reading (en)" : "ha",
            "termFrequency" : 1
          },
          {
            "token" : "日本人",
            "start_offset" : 2,
            "end_offset" : 5,
            "type" : "word",
            "position" : 2,
            "baseForm" : null,
            "bytes" : "[e6 97 a5 e6 9c ac e4 ba ba]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "名詞-一般",
            "partOfSpeech (en)" : "noun-common",
            "positionLength" : 1,
            "pronunciation" : "ニッポンジン",
            "pronunciation (en)" : "nipponjin",
            "reading" : "ニッポンジン",
            "reading (en)" : "nipponjin",
            "termFrequency" : 1
          },
          {
            "token" : "です",
            "start_offset" : 5,
            "end_offset" : 7,
            "type" : "word",
            "position" : 3,
            "baseForm" : null,
            "bytes" : "[e3 81 a7 e3 81 99]",
            "inflectionForm" : "基本形",
            "inflectionForm (en)" : "base",
            "inflectionType" : "特殊・デス",
            "inflectionType (en)" : "special-desu",
            "keyword" : false,
            "partOfSpeech" : "助動詞",
            "partOfSpeech (en)" : "auxiliary-verb",
            "positionLength" : 1,
            "pronunciation" : "デス",
            "pronunciation (en)" : "desu",
            "reading" : "デス",
            "reading (en)" : "desu",
            "termFrequency" : 1
          }
        ]
      },
      {
        "name" : "to_katakana",
        "tokens" : [
          {
            "token" : "私",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e7 a7 81]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "名詞-代名詞-一般",
            "partOfSpeech (en)" : "noun-pronoun-misc",
            "positionLength" : 1,
            "pronunciation" : "ワタシ",
            "pronunciation (en)" : "watashi",
            "reading" : "ワタシ",
            "reading (en)" : "watashi",
            "termFrequency" : 1
          },
          {
            "token" : "ハ",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "word",
            "position" : 1,
            "baseForm" : null,
            "bytes" : "[e3 83 8f]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "助詞-係助詞",
            "partOfSpeech (en)" : "particle-dependency",
            "positionLength" : 1,
            "pronunciation" : "ワ",
            "pronunciation (en)" : "wa",
            "reading" : "ハ",
            "reading (en)" : "ha",
            "termFrequency" : 1
          },
          {
            "token" : "日本人",
            "start_offset" : 2,
            "end_offset" : 5,
            "type" : "word",
            "position" : 2,
            "baseForm" : null,
            "bytes" : "[e6 97 a5 e6 9c ac e4 ba ba]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "名詞-一般",
            "partOfSpeech (en)" : "noun-common",
            "positionLength" : 1,
            "pronunciation" : "ニッポンジン",
            "pronunciation (en)" : "nipponjin",
            "reading" : "ニッポンジン",
            "reading (en)" : "nipponjin",
            "termFrequency" : 1
          },
          {
            "token" : "デス",
            "start_offset" : 5,
            "end_offset" : 7,
            "type" : "word",
            "position" : 3,
            "baseForm" : null,
            "bytes" : "[e3 83 87 e3 82 b9]",
            "inflectionForm" : "基本形",
            "inflectionForm (en)" : "base",
            "inflectionType" : "特殊・デス",
            "inflectionType (en)" : "special-desu",
            "keyword" : false,
            "partOfSpeech" : "助動詞",
            "partOfSpeech (en)" : "auxiliary-verb",
            "positionLength" : 1,
            "pronunciation" : "デス",
            "pronunciation (en)" : "desu",
            "reading" : "デス",
            "reading (en)" : "desu",
            "termFrequency" : 1
          }
        ]
      }
    ]
  }
}
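
The explain output is long, so in practice it helps to extract only the fields of interest. The sketch below, with the hypothetical helpers tokenizer_pos and final_terms, reads the part of speech and reading from the tokenizer stage and the final token strings from the last token filter. The response dict is a heavily trimmed copy of the output above that keeps only the keys the helpers use.

```python
def tokenizer_pos(response):
    """Return (token, partOfSpeech, reading) as produced by the tokenizer stage."""
    tokens = response["detail"]["tokenizer"]["tokens"]
    return [(t["token"], t["partOfSpeech"], t["reading"]) for t in tokens]


def final_terms(response):
    """Return the token strings after the last token filter has been applied."""
    last_filter = response["detail"]["tokenfilters"][-1]
    return [t["token"] for t in last_filter["tokens"]]


# Trimmed copy of the explain output shown above.
response = {
    "detail": {
        "tokenizer": {
            "name": "kuromoji_tokenizer",
            "tokens": [
                {"token": "私", "partOfSpeech": "名詞-代名詞-一般", "reading": "ワタシ"},
                {"token": "は", "partOfSpeech": "助詞-係助詞", "reading": "ハ"},
                {"token": "日本人", "partOfSpeech": "名詞-一般", "reading": "ニッポンジン"},
                {"token": "です", "partOfSpeech": "助動詞", "reading": "デス"},
            ],
        },
        "tokenfilters": [
            {
                "name": "to_katakana",
                "tokens": [
                    {"token": "私"},
                    {"token": "ハ"},
                    {"token": "日本人"},
                    {"token": "デス"},
                ],
            },
        ],
    }
}

for token, pos, reading in tokenizer_pos(response):
    print(f"{token}\t{pos}\t{reading}")
print(final_terms(response))
```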


Class of the values returned when selecting text with xpath in lxml

2022-10-12 23:07:41 | python
When text is selected with xpath in lxml, the return value is a list of lxml.etree._ElementUnicodeResult objects.
To get a plain str, convert each value with str(...).
import lxml.html

htmlstr = """
<html>
<body>
  <div class="d1">
    <div id="d2-1">
      <div id="d3-1">
        <p id="p4-1">p4-1 text</p>
        <p id="p4-2">p4-2 text</p>
      </div>
      <div id="d3-2"></div>
    </div>
    <div id="d2-2">
      <p id="p3-3">p3-3 text</p>
      <p id="p3-4">p3-4 text</p>
    </div>
  </div>
</body>
</html>
"""

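# Parse the HTML string into an element tree.
dom = lxml.html.fromstring(htmlstr)

# text() returns a list of lxml.etree._ElementUnicodeResult objects.
texts = dom.xpath("//div/p/text()")
for text in texts:
    print(f"text: {type(text)}: [{text}]")
    # str(...) converts the result into a plain str.
    print(f"str:  {type(str(text))}: [{str(text)}]")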
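
As a supplementary sketch, the result class is a subclass of str, so it can be used wherever a string is expected, and it additionally remembers the element it came from through getparent(). The example below assumes lxml is installed and uses a shortened HTML fragment. Converting with str(...) yields an exact str and drops the link to the parent element.

```python
import lxml.html

# Shortened fragment; the ids follow the example above.
dom = lxml.html.fromstring('<div id="d3-1"><p id="p4-1">p4-1 text</p></div>')
text = dom.xpath(".//p/text()")[0]

# The result is a str subclass, so normal string operations work on it.
print(isinstance(text, str))        # True
print(text.upper())                 # P4-1 TEXT

# It also keeps a reference to the element that contains the text.
print(text.getparent().get("id"))   # p4-1

# str(...) produces an exact str without the extra lxml behavior.
plain = str(text)
print(type(plain) is str)           # True
```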