Blog posts tagged "elasticsearch" - dak blog

Accessing Elastic Cloud with curl

2024-03-08 10:20:04 | elasticsearch

A note on how to access Elastic Cloud with curl.

#!/bin/sh

NODE_URL='{node URL (https://...)}'
INDEX='{index name}'
ENC_API_KEY='{encoded API key}'

curl \
    -X GET \
    -H "Content-Type: application/json" \
    -H "Authorization: ApiKey ${ENC_API_KEY}" \
    "${NODE_URL}/${INDEX}/_search?pretty" \
    -d '
{
  "query": {
    "match_all": {}
  },
  "size": 10
}'
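
As far as the Elasticsearch documentation describes it, the encoded API key used in the header above is the Base64 encoding of "id:api_key". The following is a minimal Python sketch of building the same two headers; the function names and the sample id and key are made up for illustration.

```python
import base64


def encode_api_key(key_id: str, api_key: str) -> str:
    """Base64-encode "id:api_key", the form expected after "ApiKey "."""
    return base64.b64encode(f"{key_id}:{api_key}".encode("utf-8")).decode("ascii")


def build_headers(encoded_api_key: str) -> dict:
    """Headers equivalent to the two -H options of the curl command."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"ApiKey {encoded_api_key}",
    }


# Hypothetical credentials, for illustration only.
headers = build_headers(encode_api_key("my_id", "my_secret"))
print(headers["Authorization"])
```

If the key was already issued in encoded form, only build_headers is needed.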

Connecting to Elastic Cloud with TypeScript

2024-03-08 10:18:26 | elasticsearch

A note on how to connect to Elastic Cloud with TypeScript.

import { Client } from '@elastic/elasticsearch';

(async () => {
  const client = new Client({
    node: '{node URL (https://...)}',
    cloud: {
      id: '{Cloud ID}'
    },
    auth: {
      username: '{user name}',
      password: '{password}',
    }
  });

  const cond = {
    index: '...',
    query: {
      match_all: {}
    },
    size: 10
  };

  const res = await client.search(cond);
  console.log(res);
})();
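
The settings above pass both node and cloud.id. To my understanding, a Cloud ID is a label followed by a Base64 string that already encodes the host and the Elasticsearch cluster id, so an endpoint can be derived from it. The Python sketch below decodes that assumed layout; the function name and the sample Cloud ID are hypothetical, and the layout itself should be checked against the Elastic Cloud documentation.

```python
import base64


def node_url_from_cloud_id(cloud_id: str) -> str:
    """Derive the Elasticsearch endpoint from a Cloud ID.

    Assumed layout: "<label>:<base64 of 'host$es_id$kibana_id'>".
    """
    _, encoded = cloud_id.split(":", 1)
    host, es_id = base64.b64decode(encoded).decode("utf-8").split("$")[:2]
    return f"https://{es_id}.{host}"


# Build a made-up Cloud ID so the example runs without a real deployment.
payload = base64.b64encode(b"example.test$es123$kb456").decode("ascii")
print(node_url_from_cloud_id(f"my-deployment:{payload}"))
```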

Searching nested fields in Elasticsearch

2024-02-24 21:21:58 | elasticsearch

A note on searching against nested fields in Elasticsearch.

■Index

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": {"type": "keyword", "store": "true"},
      "tags": {
        "type": "nested",
        "properties": {
          "tag": {"type": "keyword", "store": "true"}
        }
      },
      "status": {"type": "integer", "store": "true"}
    }
  }
}

■Registering data with bulk

{"index": {"_id": "doc_1"}}
{"id": "doc_1", "tags": [{"tag": "abc"}, {"tag": "def"}], "status": 1}

{"index": {"_id": "doc_2"}}
{"id": "doc_1", "tags": [{"tag": "def"}, {"tag": "ghi"}], "status": 1}

{"index": {"_id": "doc_3"}}
{"id": "doc_1", "tags": [{"tag": "ghi"}, {"tag": "jkl"}], "status": 1}

■Search query

{
  "query": {
    "bool": {
      "must": [
        {"term": {"status": 1}},
        {"nested": {
          "path": "tags",
          "query": {"term": {"tags.tag": "def"}}
        }}
      ]
    }
  }
}

■Search results

{
  ...,
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.0296195,
    "hits" : [
      {
        "_index" : "test_nested_01",
        "_id" : "doc_1",
        "_score" : 2.0296195,
        "_source" : {
          "id" : "doc_1",
          "tags" : [
            {
              "tag" : "abc"
            },
            {
              "tag" : "def"
            }
          ],
          "status" : 1
        }
      },
      {
        "_index" : "test_nested_01",
        "_id" : "doc_2",
        "_score" : 2.0296195,
        "_source" : {
          "id" : "doc_1",
          "tags" : [
            {
              "tag" : "def"
            },
            {
              "tag" : "ghi"
            }
          ],
          "status" : 1
        }
      }
    ]
  }
}
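
The matching condition of the query above can be mimicked in plain Python: a document is a hit when status is 1 and at least one object under tags has tag equal to "def". This is only an in-memory illustration of the condition using the bulk data above, not how Elasticsearch evaluates nested queries internally; the helper name matches is made up.

```python
# The three documents registered with bulk above (only the fields the query uses).
docs = {
    "doc_1": {"tags": [{"tag": "abc"}, {"tag": "def"}], "status": 1},
    "doc_2": {"tags": [{"tag": "def"}, {"tag": "ghi"}], "status": 1},
    "doc_3": {"tags": [{"tag": "ghi"}, {"tag": "jkl"}], "status": 1},
}


def matches(doc: dict, status: int, tag: str) -> bool:
    """True if status matches and at least one nested tags object has the tag."""
    return doc["status"] == status and any(t["tag"] == tag for t in doc["tags"])


hits = [doc_id for doc_id, doc in docs.items() if matches(doc, 1, "def")]
print(hits)  # ['doc_1', 'doc_2']
```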

Grouping and counting by multiple fields in Elasticsearch

2024-01-16 00:10:25 | elasticsearch

A note on how to count records grouped by multiple fields in Elasticsearch.

■Query

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "count_name_value": {
      "multi_terms": {
        "terms": [{ "field": "name" }, { "field": "value" }],
        "size": 2
      }
    }
  },
  "_source": ["name", "value"],
  "size": 0
}

■Execution result

{
  "took" : 23,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "count_name_value" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 1,
      "buckets" : [
        {
          "key" : [
            "3",
            "3"
          ],
          "key_as_string" : "3|3",
          "doc_count" : 3
        },
        {
          "key" : [
            "2",
            "2"
          ],
          "key_as_string" : "2|2",
          "doc_count" : 2
        }
      ]
    }
  }
}
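
What multi_terms computes here is a count per (name, value) pair, truncated to the top "size" buckets. A small Python sketch of the same grouping follows. The records are hypothetical (the indexed data is not shown in this post) and were chosen only so that the counts line up with the result above.

```python
from collections import Counter

# Hypothetical records, picked to reproduce the bucket counts shown above.
records = (
    [{"name": "3", "value": "3"}] * 3
    + [{"name": "2", "value": "2"}] * 2
    + [{"name": "1", "value": "1"}]
)

# Count documents per (name, value) pair, then keep the top 2 ("size": 2).
counts = Counter((r["name"], r["value"]) for r in records)
top = counts.most_common(2)

# Documents that fell outside the returned buckets (sum_other_doc_count).
sum_other = sum(counts.values()) - sum(count for _, count in top)

for key, count in top:
    print("|".join(key), count)
print("sum_other_doc_count:", sum_other)
```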

Scoring by the value of a numeric field in Elasticsearch

2022-12-11 13:46:35 | elasticsearch

A note on scoring by the value of a numeric field in Elasticsearch.
We change the weight applied to the value field and check how the score changes.

■Index definition

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": {"type": "keyword", "store": "true"},
      "title": {"type": "text", "store": "true"},
      "body": {"type": "text", "store": "true"},
      "value": {"type": "float", "store": "true"}
    }
  }
}

■Query pattern

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            {"match": {"title": {"query": "abc", "boost": 【title_boost】}}},
            {"match": {"body": {"query": "abc", "boost": 【body_boost】}}}
          ]
        }
      },
      "functions": [
        {"field_value_factor": {
          "field": "value",
          "modifier": "none",
          "factor": 【value_factor】
        }}
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  },
  "_source": ["id", "title", "score"]
}

■With title_boost: 1.0, body_boost: 1.0, value_factor: 0.0

    "hits" : [
      {
        "_index" : "idx2",
        "_id" : "doc_01",
        "_score" : 2.059239,
        "_source" : {
          "id" : "doc_01",
          "title" : "商品1 abc社"
        }
      },
      {
        "_index" : "idx2",
        "_id" : "doc_02",
        "_score" : 2.059239,
        "_source" : {
          "id" : "doc_02",
          "title" : "商品2 abc社"
        }
      }
    ]

■With title_boost: 1.0, body_boost: 1.0, value_factor: 1.0
The value of doc_01 (1.0) and the value of doc_02 (2.0) are added to the respective scores.

    "hits" : [
      {
        "_index" : "idx2",
        "_id" : "doc_02",
        "_score" : 4.059239,
        "_source" : {
          "id" : "doc_02",
          "title" : "商品2 abc社"
        }
      },
      {
        "_index" : "idx2",
        "_id" : "doc_01",
        "_score" : 3.059239,
        "_source" : {
          "id" : "doc_01",
          "title" : "商品1 abc社"
        }
      }
    ]

■With title_boost: 1.0, body_boost: 1.0, value_factor: 10.0
Ten times the value of doc_01 (1.0) and ten times the value of doc_02 (2.0) are added to the respective scores.

    "hits" : [
      {
        "_index" : "idx2",
        "_id" : "doc_02",
        "_score" : 22.059238,
        "_source" : {
          "id" : "doc_02",
          "title" : "商品2 abc社"
        }
      },
      {
        "_index" : "idx2",
        "_id" : "doc_01",
        "_score" : 12.059238,
        "_source" : {
          "id" : "doc_01",
          "title" : "商品1 abc社"
        }
      }
    ]
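
The three results above are consistent with a simple formula: with "modifier": "none" and "boost_mode": "sum", the final score is the query score plus value_factor times the value field. The Python sketch below checks that arithmetic against the numbers in this post; the function name is made up, and the last digit can differ slightly from Elasticsearch's output because of floating-point precision.

```python
def final_score(query_score: float, value: float, factor: float) -> float:
    """boost_mode "sum": query score + (factor * value) from field_value_factor."""
    return query_score + factor * value


QUERY_SCORE = 2.059239  # score of both documents when value_factor is 0.0
VALUES = {"doc_01": 1.0, "doc_02": 2.0}

for factor in (0.0, 1.0, 10.0):
    scores = {doc: round(final_score(QUERY_SCORE, v, factor), 6) for doc, v in VALUES.items()}
    print(factor, scores)
```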

Scoring with per-field weights in Elasticsearch

2022-12-10 23:43:00 | elasticsearch

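A note on how to search with a weight applied to each field in Elasticsearch.

■Search query pattern
The scores for the title and body fields are multiplied by title_boost and body_boost respectively, and the sum of the two products is the final score.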

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            {"match": {"title": {"query": "abc", "boost": 【title_boost】}}},
            {"match": {"body": {"query": "abc", "boost": 【body_boost】}}}
          ]
        }
      },
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}

■Search results with title_boost: 0.0, body_boost: 0.0
The score is 0.0.

    "hits" : [
      {
        "_index" : "idx1",
        "_id" : "doc_01",
        "_score" : 0.0,
        "_source" : {
          "id" : "doc_01",
          "title" : "商品 abc"
        }
      }
    ]

■Search results with title_boost: 1.0, body_boost: 0.0
The score is 1.540445, which is the score for a match on title.

    "hits" : [
      {
        "_index" : "idx1",
        "_id" : "doc_01",
        "_score" : 1.540445,
        "_source" : {
          "id" : "doc_01",
          "title" : "商品 abc"
        }
      }
    ]

■Search results with title_boost: 0.0, body_boost: 1.0
The score is 1.4916282, which is the score for a match on body.

    "hits" : [
      {
        "_index" : "idx1",
        "_id" : "doc_01",
        "_score" : 1.4916282,
        "_source" : {
          "id" : "doc_01",
          "title" : "商品 abc"
        }
      }
    ]

■Search results with title_boost: 10.0, body_boost: 0.0
The score is 15.404451, ten times the score with title_boost: 1.0 (1.540445).

    "hits" : [
      {
        "_index" : "idx1",
        "_id" : "doc_01",
        "_score" : 15.404451,
        "_source" : {
          "id" : "doc_01",
          "title" : "商品 abc"
        }
      }
    ]

■Search results with title_boost: 10.0, body_boost: 1.0
The score is 16.89608, the sum of the score with title_boost: 10.0, body_boost: 0.0 (15.404451) and the score with title_boost: 0.0, body_boost: 1.0 (1.4916282).

    "hits" : [
      {
        "_index" : "idx1",
        "_id" : "doc_01",
        "_score" : 16.89608,
        "_source" : {
          "id" : "doc_01",
          "title" : "商品 abc"
        }
      }
    ]
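
The five results above fit a weighted sum of the per-field scores. A short Python check of that arithmetic, using the two base scores measured in this post (the constant and function names are made up):

```python
TITLE_SCORE = 1.540445   # score with title_boost: 1.0, body_boost: 0.0
BODY_SCORE = 1.4916282   # score with title_boost: 0.0, body_boost: 1.0


def combined_score(title_boost: float, body_boost: float) -> float:
    """Sum of the boosted per-field scores of the two should clauses."""
    return title_boost * TITLE_SCORE + body_boost * BODY_SCORE


for boosts in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]:
    print(boosts, round(combined_score(*boosts), 5))
```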

Facet search using aggs in Elasticsearch

2022-12-04 11:54:28 | elasticsearch

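A note on how to run a facet search with aggs in Elasticsearch without specifying particular attributes.
Each entry in features holds a key and a value per attribute, but the search uses the attribute name displayed in the facet (name) and the displayed choice (option).

■Index definition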

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": {
        "type": "keyword",
        "store": "true"
      },
      "features": {
        "type": "nested",
        "properties": {
          "key": {"type": "keyword"},
          "name": {"type": "keyword"},
          "value": {"type": "keyword"},
          "option": {"type": "keyword"}
        }
      }
    }
  }
}

■Example data

{ 
  "id": "doc_1",
  "features": [
    {"key": "maker",
     "value": "aaa",
     "name": "メーカー",
     "option": "AAA社"},
    {"key": "material",
     "value": "gold",
     "name": "材質",
     "option": "ゴールド"},
    {"key": "price",
     "value": "2500",
     "name": "価格",
     "option": "2,001円 ～ 3,000円"}
  ]
}

■Search

{
  "size": 0,
  "aggs": {
    "keys": {
      "nested": {
        "path": "features"
      },
      "aggs": {
        "by_name": {
          "terms": {
            "field": "features.name",
            "size": 5
          },
          "aggs" : {
            "by_option": {
              "terms": {
                "field": "features.option",
                "size": 5
              }
            }
          }
        }
      }
    }
  }
}

■Search results

{
  ...,
  "aggregations" : {
    "keys" : {
      "doc_count" : 18,
      "by_name" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "メーカー",
            "doc_count" : 6,
            "by_option" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "AAA社",
                  "doc_count" : 2
                },
                {
                  "key" : "BBB社",
                  "doc_count" : 2
                },
                {
                  "key" : "CCC社",
                  "doc_count" : 2
                }
              ]
            }
          },
          {
            "key" : "価格",
            "doc_count" : 6,
            "by_option" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "1,001円 ～ 2,000円",
                  "doc_count" : 3
                },
                {
                  "key" : "2,001円 ～ 3,000円",
                  "doc_count" : 2
                },
                {
                  "key" : "～ 1,000円",
                  "doc_count" : 1
                }
              ]
            }
          },
          {
            "key" : "材質",
            "doc_count" : 6,
            "by_option" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "ゴールド",
                  "doc_count" : 2
                },
                {
                  "key" : "シルバー",
                  "doc_count" : 2
                },
                {
                  "key" : "スチール",
                  "doc_count" : 1
                },
                {
                  "key" : "ブロンズ",
                  "doc_count" : 1
                }
              ]
            }
          }
        ]
      }
    }
  }
}
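
To render facets, the nested buckets above usually need to be flattened into "attribute name, then list of choices with counts". A minimal Python sketch of that step follows; the function name is made up and the sample is a trimmed copy of the response above.

```python
def facets_from_response(response: dict) -> dict:
    """Map each attribute name to its (option, doc_count) pairs."""
    buckets = response["aggregations"]["keys"]["by_name"]["buckets"]
    return {
        b["key"]: [(o["key"], o["doc_count"]) for o in b["by_option"]["buckets"]]
        for b in buckets
    }


# Trimmed copy of the search result shown above.
sample = {
    "aggregations": {"keys": {"doc_count": 18, "by_name": {"buckets": [
        {"key": "メーカー", "doc_count": 6, "by_option": {"buckets": [
            {"key": "AAA社", "doc_count": 2},
            {"key": "BBB社", "doc_count": 2},
        ]}},
        {"key": "材質", "doc_count": 6, "by_option": {"buckets": [
            {"key": "ゴールド", "doc_count": 2},
        ]}},
    ]}}}
}

for name, options in facets_from_response(sample).items():
    print(name, options)
```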

Synonym dictionary with kuromoji in Elasticsearch

2022-10-23 13:16:46 | elasticsearch

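A note on how to apply a synonym dictionary with kuromoji in Elasticsearch.

■Synonym dictionary
When the synonym dictionary file is specified by a relative path, the path is resolved from the config directory under the Elasticsearch installation directory.
Here the following content is written to elasticsearch/config/synonym/synonym_dic.txt.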

にっぽん, ニッポン, 日本
書籍, 本
アメリカ => 米国

■Index definition (test_synonym.json)

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_text_analyzer": {
          "tokenizer": "kuromoji_tokenizer",
          "type": "custom",
          "char_filter": [
            "icu_normalizer",
            "kuromoji_iteration_mark"
          ],
          "filter": [
            "kuromoji_readingform",
            "kuromoji_stemmer",
            "kuromoji_part_of_speech",
            "ja_stop",
            "kuromoji_stemmer",
            "synonym_dic"
          ]
        }
      },

      "filter": {
        "synonym_dic": {
          "type": "synonym_graph",
          "synonyms_path": "synonym/synonym_dic.txt"
        }
      }

    }
  },

  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": {
        "type": "keyword",
        "store": "true"
      },
      "title": {
        "type": "text",
        "store": "true",
        "analyzer": "ja_text_analyzer"
      },
      "body": {
        "type": "text",
        "store": "true",
        "analyzer": "ja_text_analyzer"
      }
    }
  }
}

■Index creation
Create the test_synonym index with the index definition above.

curl "http://localhost:9200/test_synonym?pretty" \
     -X PUT \
     -H 'Content-Type: application/json' \
     -T test_synonym.json

■Data to register (bulk_test_synonym.jsonl)

{"index": {"_id": "id_01"}}
{"id": "id_01", "title": "日本", "body": "日本の家"}

{"index": {"_id": "id_02"}}
{"id": "id_02", "title": "ヨーロッパ", "body": "ヨーロッパの本"}

{"index": {"_id": "id_03"}}
{"id": "id_03", "title": "アメリカ", "body": "アメリカのテレビ"}

{"index": {"_id": "id_04"}}
{"id": "id_04", "title": "米国", "body": "米国のラジオ"}

■Registering the data

curl "http://localhost:9200/test_synonym/_bulk?pretty" \
     -X POST \
     -H 'Content-Type: application/x-ndjson' \
     -T bulk_test_synonym.jsonl

■Search (書籍)
Searching for "書籍" also hits "本" because of the dictionary entry "書籍, 本".

curl "http://localhost:9200/test_synonym/_search?pretty" \
     --silent \
     -X GET \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"body": "書籍"}}}'

The result is as follows.

    "hits" : [
      {
        ...
        "_source" : {
          "id" : "id_02",
          "title" : "ヨーロッパ",
          "body" : "ヨーロッパの本"
        }
      }
    ]

■Search (アメリカ)
Searching for "アメリカ" also hits "米国" because of the dictionary entry "アメリカ => 米国".

curl "http://localhost:9200/test_synonym/_search?pretty" \
     --silent \
     -X GET \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"body": "アメリカ"}}}'

The result is as follows.

    "hits" : [
      {
        ...
        "_source" : {
          "id" : "id_03",
          "title" : "アメリカ",
          "body" : "アメリカのテレビ"
        }
      },
      {
        ...
        "_source" : {
          "id" : "id_04",
          "title" : "米国",
          "body" : "米国のラジオ"
        }
      }
    ]

■Search (米国)
When searching for "米国", the dictionary entry is "アメリカ => 米国", so "米国" does not refer to "アメリカ" as a synonym.

curl "http://localhost:9200/test_synonym/_search?pretty" \
     --silent \
     -X GET \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"body": "米国"}}}'

The result is as follows.

    "hits" : [
      {
        ...
        "_source" : {
          "id" : "id_04",
          "title" : "米国",
          "body" : "米国のラジオ"
        }
      }
    ]
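
The two kinds of dictionary entries used above behave differently: a comma-separated line makes every term equivalent to the others, while "a => b" is a one-way rewrite. The Python sketch below is a simplified model of that rule format working on raw strings. It is not the synonym_graph filter itself, which operates on analyzed tokens, and the function names are made up.

```python
def parse_rules(lines):
    """Return {term: set of terms it is rewritten to} for Solr-style rules."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: each left-hand term is replaced by the right-hand terms.
            left, right = line.split("=>", 1)
            targets = {t.strip() for t in right.split(",")}
            for term in left.split(","):
                table.setdefault(term.strip(), set()).update(targets)
        else:
            # Equivalent terms: each term expands to the whole group.
            group = {t.strip() for t in line.split(",")}
            for term in group:
                table.setdefault(term, set()).update(group)
    return table


def expand(table, term):
    """Terms that a given term is rewritten to; unknown terms stay as they are."""
    return table.get(term, {term})


rules = parse_rules(["にっぽん, ニッポン, 日本", "書籍, 本", "アメリカ => 米国"])
for word in ["書籍", "アメリカ", "米国"]:
    print(word, "->", sorted(expand(rules, word)))
```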

Morphological analysis in Elasticsearch

2022-10-16 23:37:55 | elasticsearch

Elasticsearch can run morphological analysis and return information such as the part of speech of each token.

■Index definition (create_test1.json)

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_text_analyzer1": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "icu_normalizer",
            "kuromoji_baseform",
            "to_katakana"
          ]
        }
      },
      "filter": {
        "to_katakana": {
          "type": "icu_transform",
          "id": "Hiragana-Katakana"
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "text": {"type": "text", "store": "true", "analyzer": "ja_text_analyzer1"}
    }
  }
}

■Index creation

curl "http://localhost:9200/test1?pretty" \
     -X PUT \
     -H 'Content-Type: application/json' \
     -T create_test1.json

■Analysis (without detailed information)

curl "http://localhost:9200/test1/_analyze?pretty" \
     -XGET \
     -H 'Content-Type: application/json' \
     -v \
     --data '
{
     "analyzer": "ja_text_analyzer1",
     "text": "私は日本人です"
}'

■Analysis result

{
  "tokens" : [
    {
      "token" : "私",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ハ",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "日本人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "デス",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    }
  ]
}

■Analysis (with detailed information)
Specifying "explain": "true" returns detailed information for each token.

curl "http://localhost:9200/test1/_analyze?pretty" \
     -XGET \
     -H 'Content-Type: application/json' \
     -v \
     --data '
{
     "analyzer": "ja_text_analyzer1",
     "explain": "true",
     "text": "私は日本人です"
}'

■Analysis result

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "kuromoji_tokenizer",
      "tokens" : [
        {
          "token" : "私",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0,
          "baseForm" : null,
          "bytes" : "[e7 a7 81]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "名詞-代名詞-一般",
          "partOfSpeech (en)" : "noun-pronoun-misc",
          "positionLength" : 1,
          "pronunciation" : "ワタシ",
          "pronunciation (en)" : "watashi",
          "reading" : "ワタシ",
          "reading (en)" : "watashi",
          "termFrequency" : 1
        },
        {
          "token" : "は",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "word",
          "position" : 1,
          "baseForm" : null,
          "bytes" : "[e3 81 af]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "助詞-係助詞",
          "partOfSpeech (en)" : "particle-dependency",
          "positionLength" : 1,
          "pronunciation" : "ワ",
          "pronunciation (en)" : "wa",
          "reading" : "ハ",
          "reading (en)" : "ha",
          "termFrequency" : 1
        },
        {
          "token" : "日本人",
          "start_offset" : 2,
          "end_offset" : 5,
          "type" : "word",
          "position" : 2,
          "baseForm" : null,
          "bytes" : "[e6 97 a5 e6 9c ac e4 ba ba]",
          "inflectionForm" : null,
          "inflectionForm (en)" : null,
          "inflectionType" : null,
          "inflectionType (en)" : null,
          "partOfSpeech" : "名詞-一般",
          "partOfSpeech (en)" : "noun-common",
          "positionLength" : 1,
          "pronunciation" : "ニッポンジン",
          "pronunciation (en)" : "nipponjin",
          "reading" : "ニッポンジン",
          "reading (en)" : "nipponjin",
          "termFrequency" : 1
        },
        {
          "token" : "です",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 3,
          "baseForm" : null,
          "bytes" : "[e3 81 a7 e3 81 99]",
          "inflectionForm" : "基本形",
          "inflectionForm (en)" : "base",
          "inflectionType" : "特殊・デス",
          "inflectionType (en)" : "special-desu",
          "partOfSpeech" : "助動詞",
          "partOfSpeech (en)" : "auxiliary-verb",
          "positionLength" : 1,
          "pronunciation" : "デス",
          "pronunciation (en)" : "desu",
          "reading" : "デス",
          "reading (en)" : "desu",
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [
      {
        "name" : "icu_normalizer",
        "tokens" : [
          {
            "token" : "私",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e7 a7 81]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "名詞-代名詞-一般",
            "partOfSpeech (en)" : "noun-pronoun-misc",
            "positionLength" : 1,
            "pronunciation" : "ワタシ",
            "pronunciation (en)" : "watashi",
            "reading" : "ワタシ",
            "reading (en)" : "watashi",
            "termFrequency" : 1
          },
          {
            "token" : "は",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "word",
            "position" : 1,
            "baseForm" : null,
            "bytes" : "[e3 81 af]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "助詞-係助詞",
            "partOfSpeech (en)" : "particle-dependency",
            "positionLength" : 1,
            "pronunciation" : "ワ",
            "pronunciation (en)" : "wa",
            "reading" : "ハ",
            "reading (en)" : "ha",
            "termFrequency" : 1
          },
          {
            "token" : "日本人",
            "start_offset" : 2,
            "end_offset" : 5,
            "type" : "word",
            "position" : 2,
            "baseForm" : null,
            "bytes" : "[e6 97 a5 e6 9c ac e4 ba ba]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "partOfSpeech" : "名詞-一般",
            "partOfSpeech (en)" : "noun-common",
            "positionLength" : 1,
            "pronunciation" : "ニッポンジン",
            "pronunciation (en)" : "nipponjin",
            "reading" : "ニッポンジン",
            "reading (en)" : "nipponjin",
            "termFrequency" : 1
          },
          {
            "token" : "です",
            "start_offset" : 5,
            "end_offset" : 7,
            "type" : "word",
            "position" : 3,
            "baseForm" : null,
            "bytes" : "[e3 81 a7 e3 81 99]",
            "inflectionForm" : "基本形",
            "inflectionForm (en)" : "base",
            "inflectionType" : "特殊・デス",
            "inflectionType (en)" : "special-desu",
            "partOfSpeech" : "助動詞",
            "partOfSpeech (en)" : "auxiliary-verb",
            "positionLength" : 1,
            "pronunciation" : "デス",
            "pronunciation (en)" : "desu",
            "reading" : "デス",
            "reading (en)" : "desu",
            "termFrequency" : 1
          }
        ]
      },
      {
        "name" : "kuromoji_baseform",
        "tokens" : [
          {
            "token" : "私",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e7 a7 81]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "名詞-代名詞-一般",
            "partOfSpeech (en)" : "noun-pronoun-misc",
            "positionLength" : 1,
            "pronunciation" : "ワタシ",
            "pronunciation (en)" : "watashi",
            "reading" : "ワタシ",
            "reading (en)" : "watashi",
            "termFrequency" : 1
          },
          {
            "token" : "は",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "word",
            "position" : 1,
            "baseForm" : null,
            "bytes" : "[e3 81 af]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "助詞-係助詞",
            "partOfSpeech (en)" : "particle-dependency",
            "positionLength" : 1,
            "pronunciation" : "ワ",
            "pronunciation (en)" : "wa",
            "reading" : "ハ",
            "reading (en)" : "ha",
            "termFrequency" : 1
          },
          {
            "token" : "日本人",
            "start_offset" : 2,
            "end_offset" : 5,
            "type" : "word",
            "position" : 2,
            "baseForm" : null,
            "bytes" : "[e6 97 a5 e6 9c ac e4 ba ba]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "名詞-一般",
            "partOfSpeech (en)" : "noun-common",
            "positionLength" : 1,
            "pronunciation" : "ニッポンジン",
            "pronunciation (en)" : "nipponjin",
            "reading" : "ニッポンジン",
            "reading (en)" : "nipponjin",
            "termFrequency" : 1
          },
          {
            "token" : "です",
            "start_offset" : 5,
            "end_offset" : 7,
            "type" : "word",
            "position" : 3,
            "baseForm" : null,
            "bytes" : "[e3 81 a7 e3 81 99]",
            "inflectionForm" : "基本形",
            "inflectionForm (en)" : "base",
            "inflectionType" : "特殊・デス",
            "inflectionType (en)" : "special-desu",
            "keyword" : false,
            "partOfSpeech" : "助動詞",
            "partOfSpeech (en)" : "auxiliary-verb",
            "positionLength" : 1,
            "pronunciation" : "デス",
            "pronunciation (en)" : "desu",
            "reading" : "デス",
            "reading (en)" : "desu",
            "termFrequency" : 1
          }
        ]
      },
      {
        "name" : "to_katakana",
        "tokens" : [
          {
            "token" : "私",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "word",
            "position" : 0,
            "baseForm" : null,
            "bytes" : "[e7 a7 81]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "名詞-代名詞-一般",
            "partOfSpeech (en)" : "noun-pronoun-misc",
            "positionLength" : 1,
            "pronunciation" : "ワタシ",
            "pronunciation (en)" : "watashi",
            "reading" : "ワタシ",
            "reading (en)" : "watashi",
            "termFrequency" : 1
          },
          {
            "token" : "ハ",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "word",
            "position" : 1,
            "baseForm" : null,
            "bytes" : "[e3 83 8f]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "助詞-係助詞",
            "partOfSpeech (en)" : "particle-dependency",
            "positionLength" : 1,
            "pronunciation" : "ワ",
            "pronunciation (en)" : "wa",
            "reading" : "ハ",
            "reading (en)" : "ha",
            "termFrequency" : 1
          },
          {
            "token" : "日本人",
            "start_offset" : 2,
            "end_offset" : 5,
            "type" : "word",
            "position" : 2,
            "baseForm" : null,
            "bytes" : "[e6 97 a5 e6 9c ac e4 ba ba]",
            "inflectionForm" : null,
            "inflectionForm (en)" : null,
            "inflectionType" : null,
            "inflectionType (en)" : null,
            "keyword" : false,
            "partOfSpeech" : "名詞-一般",
            "partOfSpeech (en)" : "noun-common",
            "positionLength" : 1,
            "pronunciation" : "ニッポンジン",
            "pronunciation (en)" : "nipponjin",
            "reading" : "ニッポンジン",
            "reading (en)" : "nipponjin",
            "termFrequency" : 1
          },
          {
            "token" : "デス",
            "start_offset" : 5,
            "end_offset" : 7,
            "type" : "word",
            "position" : 3,
            "baseForm" : null,
            "bytes" : "[e3 83 87 e3 82 b9]",
            "inflectionForm" : "基本形",
            "inflectionForm (en)" : "base",
            "inflectionType" : "特殊・デス",
            "inflectionType (en)" : "special-desu",
            "keyword" : false,
            "partOfSpeech" : "助動詞",
            "partOfSpeech (en)" : "auxiliary-verb",
            "positionLength" : 1,
            "pronunciation" : "デス",
            "pronunciation (en)" : "desu",
            "reading" : "デス",
            "reading (en)" : "desu",
            "termFrequency" : 1
          }
        ]
      }
    ]
  }
}
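
The explain output is verbose, so in practice only a few attributes are usually picked out. A minimal Python sketch that extracts each token and its part of speech from the tokenizer stage follows; the function name is made up and the sample is a trimmed copy of the response above.

```python
def tokens_with_pos(explain_response: dict) -> list:
    """Return (token, partOfSpeech) pairs from the tokenizer stage."""
    tokens = explain_response["detail"]["tokenizer"]["tokens"]
    return [(t["token"], t["partOfSpeech"]) for t in tokens]


# Trimmed copy of the explain result shown above.
sample = {"detail": {"tokenizer": {"name": "kuromoji_tokenizer", "tokens": [
    {"token": "私", "partOfSpeech": "名詞-代名詞-一般"},
    {"token": "は", "partOfSpeech": "助詞-係助詞"},
    {"token": "日本人", "partOfSpeech": "名詞-一般"},
    {"token": "です", "partOfSpeech": "助動詞"},
]}}}

for token, pos in tokens_with_pos(sample):
    print(token, pos)
```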

Configuring a user dictionary in Elasticsearch

2022-08-09 23:46:50 | elasticsearch

A note on how to use a user dictionary in Elasticsearch.

■Dictionary (config/dic/user_dic.txt)

リバーシブル,リバーシブル,リバーシブル,カスタム名詞
ヨガマット,ヨガ マット,ヨガ マット,カスタム名詞
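
Each dictionary line has four comma-separated columns: the surface text, its segmentation (space-separated), the reading of each segment (space-separated), and a part-of-speech tag. The Python sketch below parses a line into those columns and checks that segments and readings line up. The class and function names are made up, and the equal-count check reflects my understanding of the format rather than the plugin's exact validation.

```python
from typing import List, NamedTuple


class UserDictEntry(NamedTuple):
    surface: str         # text as it appears in the input
    segments: List[str]  # how the surface is split into tokens
    readings: List[str]  # one reading per segment
    pos: str             # part-of-speech tag


def parse_entry(line: str) -> UserDictEntry:
    """Parse one "surface,segments,readings,pos" line."""
    surface, segments, readings, pos = [c.strip() for c in line.split(",")]
    entry = UserDictEntry(surface, segments.split(), readings.split(), pos)
    if len(entry.segments) != len(entry.readings):
        raise ValueError(f"segment/reading count mismatch: {line!r}")
    return entry


for line in ["リバーシブル,リバーシブル,リバーシブル,カスタム名詞",
             "ヨガマット,ヨガ マット,ヨガ マット,カスタム名詞"]:
    print(parse_entry(line))
```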

■settings (settings_dic.json)

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji": {
          "type": "kuromoji_tokenizer",
          "mode": "search",
          "user_dictionary": "dic/user_dic.txt"
        }
      },
      "analyzer": {
        "ja_analyzer": {
          "tokenizer": "kuromoji",
          "type": "custom"
        }
      }
    }
  }
}

■Applying settings_dic.json

curl "http://localhost:9200/settings_dic/?pretty" \
     -XPUT \
     -H "Content-Type: application/json" \
     -T settings_dic.json

■Running the analysis

TEXT='リバーシブルヨガマットを買った'
curl -XGET \
     "http://localhost:9200/settings_dic/_analyze?pretty" \
     -H "Content-Type: application/json" \
     -d "
{
  \"analyzer\": \"ja_analyzer\",
  \"text\": \"${TEXT}\"
}"

■Analysis result

{
  "tokens" : [
    {
      "token" : "リバーシブル",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ヨガ",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "マット",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "を",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "買っ",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "た",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 5
    }
  ]
}

■Analysis result (without the dictionary)

{
  "tokens" : [
    {
      "token" : "リバー",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "リバーシブルヨガマット",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "シブルヨガマット",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "を",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "買っ",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "た",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    }
  ]
}

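In Elasticsearch 8.X, you can run similar-vector search with "knn".
This example uses "cosine" as the similarity measure, but "l2_norm" can be specified instead.
The two measures are based on the following.
・cosine: the cosine of the angle between the two vectors
・l2_norm: the squared distance between the two vectors
To use both cosine and l2_norm, you can register the same vector in two fields, "vector_cos" and "vector_l2", and switch the search target field depending on the use case.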
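As a rough illustration of how the two measures become a _score, the following sketch assumes the formulas given in the Elasticsearch 8.x dense_vector documentation: (1 + cosine) / 2 for "cosine" and 1 / (1 + squared distance) for "l2_norm". The function names are hypothetical, and this is plain Python rather than anything Elasticsearch runs.

```python
import math


def cosine_score(q, v):
    """Assumed _score for similarity "cosine": (1 + cos(q, v)) / 2."""
    dot = sum(a * b for a, b in zip(q, v))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_v = math.sqrt(sum(b * b for b in v))
    return (1.0 + dot / (norm_q * norm_v)) / 2.0


def l2_norm_score(q, v):
    """Assumed _score for similarity "l2_norm": 1 / (1 + squared distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(q, v))
    return 1.0 / (1.0 + sq_dist)


if __name__ == '__main__':
    query = [1.0, 1.0]
    doc = [1.0, 0.9]
    print(cosine_score(query, doc))
    print(l2_norm_score(query, doc))
```

For the query vector [1.0, 1.0] and the document vector [1.0, 0.9], the cosine score comes out at about 0.9993, which is in line with the score reported for id_05 in the search results of this article.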
■Schema (create_knn_test1.json)

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": {"type": "keyword", "store": "true"},
      "vector": {"type": "dense_vector", "dims": 2, "index": true, "similarity": "cosine"}
    }
  }
}
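The two-field idea ("vector_cos" and "vector_l2") could be expressed with a mapping like the one this sketch generates. The field names come from the text; the helper name build_dual_similarity_mapping and the choice to print the mapping as JSON are assumptions made for illustration.

```python
import json


def build_dual_similarity_mapping(dims=2):
    """Build a mapping with one dense_vector field per similarity measure."""
    def vector_field(similarity):
        return {
            'type': 'dense_vector',
            'dims': dims,
            'index': True,
            'similarity': similarity,
        }

    return {
        'mappings': {
            'dynamic': 'strict',
            'properties': {
                'id': {'type': 'keyword', 'store': 'true'},
                'vector_cos': vector_field('cosine'),
                'vector_l2': vector_field('l2_norm'),
            }
        }
    }


if __name__ == '__main__':
    print(json.dumps(build_dual_similarity_mapping(), indent=2))
```

Each document would then carry the same vector in both fields, and the 'field' value of the knn query selects which measure is used.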

Create the index

$ curl 'http://localhost:9200/knn_test1/?pretty' \
    -X PUT \
    -H 'Content-Type: application/json' \
    -T create_knn_test1.json

■Sample data (bulk_knn_test1.jsonl)

{"index": {"_index": "knn_test1", "_id": "id_01"}}
{"id": "id_01", "vector": [1.0, 0.50]}
{"index": {"_index": "knn_test1", "_id": "id_02"}}
{"id": "id_02", "vector": [1.0, 0.60]}
{"index": {"_index": "knn_test1", "_id": "id_03"}}
{"id": "id_03", "vector": [1.0, 0.70]}
{"index": {"_index": "knn_test1", "_id": "id_04"}}
{"id": "id_04", "vector": [1.0, 0.80]}
{"index": {"_index": "knn_test1", "_id": "id_05"}}
{"id": "id_05", "vector": [1.0, 0.90]}
{"index": {"_index": "knn_test1", "_id": "id_06"}}
{"id": "id_06", "vector": [1.0, 1.00]}
{"index": {"_index": "knn_test1", "_id": "id_07"}}
{"id": "id_07", "vector": [1.0, 1.15]}
{"index": {"_index": "knn_test1", "_id": "id_08"}}
{"id": "id_08", "vector": [1.0, 1.25]}
{"index": {"_index": "knn_test1", "_id": "id_09"}}
{"id": "id_09", "vector": [1.0, 1.35]}
{"index": {"_index": "knn_test1", "_id": "id_10"}}
{"id": "id_10", "vector": [1.0, 1.45]}

Register the data

$ curl "http://localhost:9200/knn_test1/_bulk?pretty" \
     -X POST \
     -H 'Content-Type: application/x-ndjson' \
     -T bulk_knn_test1.jsonl

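■Search program (k=3, num_candidates=5)

import json
from elasticsearch import Elasticsearch

url = 'http://localhost:9200'
es = Elasticsearch(url)

# kNN query: return the k nearest vectors, chosen from num_candidates candidates
q = {
    'knn': {
        'field': 'vector',
        'query_vector': [1.0, 1.0],
        'k': 3,
        'num_candidates': 5,
    },
    'fields': ['id']
}

res = es.knn_search(index='knn_test1', body=q)
# Keep only the id and score of each hit
items = [
    {'id': item['_id'], 'score': item['_score']}
    for item in res['hits']['hits']
]
print(json.dumps(items, indent=2))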

■Result (k=3, num_candidates=5)
With k=3, num_candidates=5, only 3 records are returned.

[
  {
    "id": "id_06",
    "score": 1.0
  },
  {
    "id": "id_05",
    "score": 0.99930894
  },
  {
    "id": "id_07",
    "score": 0.9987876
  }
]

■Changing the parameters (k=5, num_candidates=5)
Change the program to use k=5, num_candidates=5.

q = {
    'knn': {
        'field': 'vector',
        'query_vector': [1.0, 1.0],
        'k': 5,
        'num_candidates': 5,
    },
    'fields': ['id']
}

res = es.knn_search(index='knn_test1', body=q)
items = [
    {'id': item['_id'], 'score': item['_score']}
    for item in res['hits']['hits']
]
print(json.dumps(items, indent=2))

■Result (k=5, num_candidates=5)
After changing to k=5, num_candidates=5, the search returns 5 results.

[
  {
    "id": "id_06",
    "score": 1.0
  },
  {
    "id": "id_05",
    "score": 0.99930894
  },
  {
    "id": "id_07",
    "score": 0.9987876
  },
  {
    "id": "id_08",
    "score": 0.99694186
  },
  {
    "id": "id_04",
    "score": 0.9969418
  }
]
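The ranking above can be cross-checked without Elasticsearch. The following sketch ranks the ten sample vectors by an exact (brute-force) cosine score, assuming the (1 + cosine) / 2 scoring formula from the Elasticsearch documentation. The names DOCS, cosine_score, and exact_top_k are hypothetical. Since knn search in Elasticsearch is approximate, the two may differ on larger data sets.

```python
import math

# Sample vectors from bulk_knn_test1.jsonl
DOCS = {
    'id_01': [1.0, 0.50], 'id_02': [1.0, 0.60],
    'id_03': [1.0, 0.70], 'id_04': [1.0, 0.80],
    'id_05': [1.0, 0.90], 'id_06': [1.0, 1.00],
    'id_07': [1.0, 1.15], 'id_08': [1.0, 1.25],
    'id_09': [1.0, 1.35], 'id_10': [1.0, 1.45],
}


def cosine_score(q, v):
    """Assumed Elasticsearch scoring for "cosine": (1 + cos(q, v)) / 2."""
    dot = sum(a * b for a, b in zip(q, v))
    return (1.0 + dot / (math.hypot(*q) * math.hypot(*v))) / 2.0


def exact_top_k(query, docs, k):
    """Score every document and return the k best (doc_id, score) pairs."""
    scored = [(doc_id, cosine_score(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == '__main__':
    for doc_id, score in exact_top_k([1.0, 1.0], DOCS, 5):
        print(doc_id, round(score, 6))
```

id_04 ([1.0, 0.80]) and id_08 ([1.0, 1.25]) are mirror images of each other across the direction of the query vector, so their cosine scores are mathematically equal and their relative order depends only on floating-point rounding.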

Starting Elasticsearch 8.X over http

2022-06-05 22:29:38 | elasticsearch

A memo on how to start Elasticsearch 8.X with the security settings disabled.

Add the following to config/elasticsearch.yml, then start elasticsearch.

discovery.type: single-node
xpack.security.enabled: false
xpack.security.audit.enabled: false
xpack.security.transport.ssl.enabled: false

Creating roles and users in Elasticsearch

2021-11-03 23:56:29 | elasticsearch

A memo on how to create roles and users in Elasticsearch.

To enable roles and users, set the following in config/elasticsearch.yml before starting elasticsearch.

xpack.security.enabled: true

Create read-role, which has the read privilege on all indices.

curl "http://localhost:9200/_security/role/read-role" \
     -u elastic:espasswd \
     -X PUT \
     -H 'Content-Type: application/json' \
     -d "
{
  \"indices\": [
  {
    \"names\": [\"*\"],
    \"privileges\": [\"read\"]
  }
  ]
}"

If you specify index names in names above, read access is granted only to those indices.
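As a small sketch of that restriction, the request body for such a role could be generated as follows. The helper name build_read_role and the index name 'test1' are assumptions used only for illustration; the body has the same shape as the one sent with curl above.

```python
import json


def build_read_role(index_names):
    """Build a role body granting the read privilege on the given index names."""
    return {
        'indices': [
            {
                'names': list(index_names),
                'privileges': ['read'],
            }
        ]
    }


if __name__ == '__main__':
    # 'test1' is only an example index name
    print(json.dumps(build_read_role(['test1']), indent=2))
```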

Create read-user, which has read-role.

curl 'http://localhost:9200/_security/user/read-user' \
     -u elastic:espasswd \
     -X PUT \
     -H 'Content-Type: application/json' \
     --data '
{
        "enabled": true,
        "password": "es-read-user-passwd",
        "roles": ["read-role"]
}'

Searching without distinguishing katakana and hiragana in Elasticsearch

2021-10-21 22:58:22 | elasticsearch

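A memo on how to make Elasticsearch search without distinguishing katakana from hiragana.

In Japanese, some words are written in both katakana and hiragana, such as 「リンゴ」 and 「りんご」.
So that a hiragana query also finds every document written in katakana,
we configure a filter that converts hiragana to katakana.

First, configure the Japanese analyzer.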

curl "http://localhost:9200/test1?pretty" \
     -XPUT \
     -H 'Content-Type: application/json' \
     -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_text_analyzer1": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "icu_normalizer",
            "kuromoji_baseform",
            "to_katakana"
          ]
        }
      },
      "filter": {
        "to_katakana": {
          "type": "icu_transform",
          "id": "Hiragana-Katakana"
        }
      }
    }
  }
}
'

The Hiragana-Katakana filter converts hiragana to katakana.
Analyzing an actual Japanese string gives a result in which hiragana has been converted to katakana, as shown below.

curl "http://localhost:9200/test1/_analyze?pretty" \
     -XGET \
     -H 'Content-Type: application/json' \
     -v \
     --data '
{
     "analyzer": "ja_text_analyzer1",
     "text": "私は日本人です"
}'

In the analysis result, the hiragana has been converted to katakana.

{
  "tokens" : [
    {
      "token" : "私",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ハ",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "日本人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "デス",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    }
  ]
}
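The effect seen in the result above can be imitated in plain Python. This is only a sketch of the character mapping, not how the icu_transform filter is implemented: it assumes that each hiragana character in U+3041..U+3096 has its katakana counterpart 0x60 code points later, and the function name is hypothetical.

```python
# Offset between a hiragana character and its katakana counterpart in Unicode
KANA_OFFSET = 0x60


def hiragana_to_katakana(text):
    """Convert hiragana (U+3041..U+3096) to katakana, leaving other characters as-is."""
    return ''.join(
        chr(ord(ch) + KANA_OFFSET) if 'ぁ' <= ch <= 'ゖ' else ch
        for ch in text
    )


if __name__ == '__main__':
    print(hiragana_to_katakana('りんご'))
    print(hiragana_to_katakana('私は日本人です'))
```

Applying the same conversion to both indexed text and queries is what lets 「りんご」 and 「リンゴ」 match each other.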

Exposing kibana through a proxy

2021-10-19 21:31:53 | elasticsearch

A memo on how to expose kibana through a proxy.

Use apache's reverse proxy so that kibana can be accessed at the following URL.
http://xxx/elastic/kibana/

With apache and kibana configured as follows, kibana becomes accessible at the URL above.
■httpd.conf settings

<Location "/elastic/kibana">
ProxyPass http://127.0.0.1:5601/elastic/kibana
ProxyPassReverse http://127.0.0.1:5601/elastic/kibana
</Location>

■Settings in kibana's config/kibana.yml

server.rewriteBasePath: true
server.basePath: "/elastic/kibana"
