2022年8月のブログ記事一覧-dak ブログ

MySQL で主キーではないカラムでユニーク化

2022-08-30 23:27:00 | MySQL

MySQL で主キーではないカラムでユニーク化する方法のメモ。
以下のテーブルで主キーではない id でユニーク化します。

create table test1 (
  id    varchar(16) not null,
  key1      varchar(16) not null,
  key2      varchar(16) not null,
  key3      varchar(16) not null,
  data      varchar(16) not null default '',

  primary key(key1, key2, key3),
  index(id)
);

insert into test1 set id = 'prd_01', key1 = 'a', key2 = 'a', key3 = '01', data = 'a';
insert into test1 set id = 'prd_01', key1 = 'a', key2 = 'a', key3 = '02', data = 'a';

insert into test1 set id = 'prd_02', key1 = 'b', key2 = 'b', key3 = '01', data = 'b';
insert into test1 set id = 'prd_02', key1 = 'b', key2 = 'b', key3 = '02', data = 'b';
insert into test1 set id = 'prd_02', key1 = 'b', key2 = 'b', key3 = '03', data = 'b';

insert into test1 set id = 'prd_03', key1 = 'c', key2 = 'c', key3 = '01', data = 'c';
insert into test1 set id = 'prd_03', key1 = 'c', key2 = 'c', key3 = '02', data = 'c';

以下では主キーの key1、key2、key3 を \t で連結し、文字列的に最小の値となるレコードでユニーク化します。

select
  t1.id as id
  , t1.data as data
from
  test1 t1
left join
(
  select
    id
    , min(concat(key1, '\t', key2, '\t', key3)) as k
  from
    test1
  group by
    id
) tk
on
  tk.k = concat(t1.key1, '_', t1.key2, '_', t1.key3)
where
  tk.k is not null
;

■実行結果

id      data
prd_01  a
prd_02  b
prd_03  c

MySQL の起動方法

2022-08-30 23:23:57 | MySQL

MySQL の起動方法のメモ。

sudo systemctl start mysqld

jsdom で属性を取得

2022-08-26 23:06:42 | Node.js

jsdom で属性を取得する方法のメモ。
getAttributeNames() で属性名のリストを取得できます。
そして、getAttribute(属性名) で属性値を取得することができます。

const { JSDOM } = require('jsdom');

const html = `
<html>
<head>
<title>test</title>
<meta keyword="kw1,kw2,kw3">
</head>
<body>
abc
<img src="image1.png" alt="image1"/>
<div>div1-1</div>
xyz
</body>
</html>
`;

const dom = new JSDOM(html);
const doc = dom.window.document;

function get_texts(node, texts=[]) {
  console.log("--");
  console.log("nodeType: " + node.nodeType);
  console.log("nodeName: " + node.nodeName);
  console.log("nodeValue: " + node.nodeValue);

  if (node.nodeType === node.TEXT_NODE) {
    texts.push(node.nodeValue);
  }
  else if (node.nodeType === node.ELEMENT_NODE) {
    console.log("attrs: " + node.getAttributeNames());
    console.log("alt: " + node.getAttribute('alt'));

    for (const child of node.childNodes) {
      get_texts(child, texts);
    }
  }

  return texts;
}

const texts = get_texts(doc.body);
console.log(texts);

■実行結果

--
nodeType: 1
nodeName: BODY
nodeValue: null
attrs:
alt: null
--
nodeType: 3
nodeName: #text
nodeValue:
abc

--
nodeType: 1
nodeName: IMG
nodeValue: null
attrs: src,alt
alt: image1
--
nodeType: 3
nodeName: #text
nodeValue:

--
nodeType: 1
nodeName: DIV
nodeValue: null
attrs:
alt: null
--
nodeType: 3
nodeName: #text
nodeValue: div1-1
--
nodeType: 3
nodeName: #text
nodeValue:
xyz

jsonl のデータをオブジェクトとして MySQL に格納

2022-08-20 18:07:52 | MySQL

jsonl 形式のデータをオブジェクトとして mysql に格納する方法のメモ。
以下の jsonl のデータを MySQL に格納します。
■登録データ

{"id": "01", "x": 30, "y": 170}
{"id": "02", "x": 20, "y": 180}
{"id": "03", "x": 25, "y": 160}
{"id": "04", "x": 40, "y": 175}
{"id": "05", "x": 45, "y": 180}

■MySQL のテーブル定義

create table test1 (
  obj  json,
  id   varchar(16) generated always as (json_unquote(obj->"$.id")) stored,
  x    integer generated always as (json_unquote(obj->"$.x")) stored,
  y    integer generated always as (json_unquote(obj->"$.y")) stored,

  primary key (id),
  index (x, y),
  index (y, x)
);

MySQL に登録する際に id を primary key とし、x、y にはインデックスを張ります。

■MySQL へのデータ登録
上記の jsonl 形式のデータを /var/lib/mysql-files/test1.jsonl に保存し、MySQL に load します。
このとき、fields terminated by '\t' を指定することで jsonl のデータが分割されないようにします。

load data infile '/var/lib/mysql-files/test1.jsonl' ignore into table test1 fields terminated by '\t' (obj);

■登録内容の確認

select * from test1;

obj     id      x       y
{"x": 30, "y": 170, "id": "01"} 01      30      170
{"x": 20, "y": 180, "id": "02"} 02      20      180
{"x": 25, "y": 160, "id": "03"} 03      25      160
{"x": 40, "y": 175, "id": "04"} 04      40      175
{"x": 45, "y": 180, "id": "05"} 05      45      180

■primary key の確認
以下の通り、id を指定した検索では PRIMARY KEY が利用されていることがわかります。

explain select id, x, y from test1 where id = '03'\G;

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test1
   partitions: NULL
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 66
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL

MySQL での json 型のデータへのインデックス

2022-08-20 17:32:12 | MySQL

MySQL で json 型のデータにインデックスを張る方法のメモ。
テーブル定義でインデックスを張りたいデータ項目(x) に "generated always as (json_unquote(obj->"$.x")) stored" を指定します。
■テーブル定義

create table test1 (
  id   varchar(16) not null,
  obj  json,
  x    integer generated always as (json_unquote(obj->"$.x")) stored,
  y    integer generated always as (json_unquote(obj->"$.y")) stored,

  primary key (id),
  index (x, y),
  index (y, x)
);

■データ登録

insert into test1 set id = '01', obj = '{"id": "01", "x": 30, "y": 170}';
insert into test1 set id = '02', obj = '{"id": "02", "x": 20, "y": 180}';
insert into test1 set id = '03', obj = '{"id": "03", "x": 25, "y": 160}';
insert into test1 set id = '04', obj = '{"id": "04", "x": 40, "y": 175}';
insert into test1 set id = '05', obj = '{"id": "05", "x": 45, "y": 180}';

■検索

select id, x from test1 where x >= 30\G;

*************************** 1. row ***************************
id: 01
 x: 30
*************************** 2. row ***************************
id: 04
 x: 40
*************************** 3. row ***************************
id: 05
 x: 45

■実行計画
上記のクエリでは key に x が使われていることがわかります。

explain select id, x from test1 where x >= 30\G;

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test1
   partitions: NULL
         type: range
possible_keys: x,y
          key: x
      key_len: 5
          ref: NULL
         rows: 3
     filtered: 100.00
        Extra: Using where; Using index

Node.js での jsdom を利用した HTML からのテキスト抽出

2022-08-20 16:16:19 | Node.js

Node.js で jsdom を利用して HTML からテキストを抽出する方法のメモ。
dom = new JSDOM(...) で DOM を生成すると、dom.window.document で HTML の DOM ツリーを参照できます。
nodeType が TEXT_NODE の場合に、nodeValue に格納されているテキストを抽出します。

■プログラム

const jsdom = require('jsdom');

// 以下が対象の HTML
const html = `
<html>
<head>
<title>test</title>
<meta keyword="kw1, kw2, kw3">
</head>
<body>
abc
<div>div1-1</div>
<div>div2-1<div>div2_2-1</div>div2-2</div>
<div>div3-1</div>
xyz
</body>
</html>
`;

// DOM 生成
const dom = new jsdom.JSDOM(html);

// 再帰的に DOM ツリーをたどってテキストを抽出
function get_texts(node, texts=[]) {
  if (node.nodeType === node.TEXT_NODE) {
    texts.push(node.nodeValue);
  }
  else if (node.nodeType === node.ELEMENT_NODE) {
    for (const child of node.childNodes) {
      get_texts(child, texts);
    }
  }

  return texts;
}

const doc = dom.window.document;
const texts = get_texts(doc.body);
console.log(texts);

■実行結果

[
  '\nabc\n',
  'div1-1',
  '\n',
  'div2-1',
  'div2_2-1',
  'div2-2',
  '\n',
  'div3-1',
  '\nxyz\n\n\n'
]

Elasticsearch のユーザ辞書の設定

2022-08-09 23:46:50 | elasticsearch

Elasticsearch でユーザ辞書を利用する方法のメモ。
■辞書（config/dic/user_dic.txt）

リバーシブル,リバーシブル,リバーシブル,カスタム名詞
ヨガマット,ヨガ マット,ヨガ マット,カスタム名詞

■settings（settings_dic.json）

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji": {
          "type": "kuromoji_tokenizer",
          "user_dictionary": "dic/user_dic.txt"
        }
      },
      "analyzer": {
        "ja_analyzer": {
          "tokenizer": "kuromoji",
          "type": "custom",
          "mode":"search"
        }
      }
    }
  }
}

■settings_dic.json を反映

curl http://localhost:9200/settings_dic/?pretty \
     -XPUT \
     -H "Content-Type: application/json" \
     -T settings_dic.json

■解析実行

TEXT='リバーシブルヨガマットを買った'
curl -XGET \
     http://localhost:9200/settings_dic/_analyze?pretty \
     -H "Content-Type: application/json" \
     -d "
{
  \"analyzer\": \"ja_analyzer\",
  \"text\": \"${TEXT}\"
}"

■解析結果

{
  "tokens" : [
    {
      "token" : "リバーシブル",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ヨガ",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "マット",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "を",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "買っ",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "た",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 5
    }
  ]
}

■解析結果（辞書なしの場合）

{
  "tokens" : [
    {
      "token" : "リバー",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "リバーシブルヨガマット",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "シブルヨガマット",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "を",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "買っ",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "た",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    }
  ]
}

Elasticsearch 8.X で類似ベクトル検索

2022-08-07 18:29:09 | elasticsearch

Elasticsearch 8.X では "knn" で類似ベクトル検索を行うことができます。
類似度の尺度としてここでは "cosine" を使いますが、"l2_norm" を指定することもできます。
それぞれ以下に基づく検索となります。
・cosine: ２ベクトルのコサイン
・l2_norm: ２ベクトル間の距離の二乗
cosine、l2_norm の両方を使いたい場合には、"vector_cos" と "vector_l2" に同じベクトルを登録しておき、用途に応じて検索対象を切り替えて使うということもできます。
■スキーマ（create_knn_test1.json）

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": {"type": "keyword", "store": "true"},
      "vector": {"type": "dense_vector", "dims": 2, "index": true, "similarity": "cosine"}
    }
  }
}

インデックス作成

$ curl 'http://localhost:9200/knn_test1/?pretty' \
    -X PUT \
    -H 'Content-Type: application/json' \
    -T create_knn_test1.json

■データ登録例（bulk_knn_test1.jsonl）

{"index": {"_index": "knn_test1", "_id": "id_01"}}
{"id": "id_01", "vector": [1.0, 0.50]}
{"index": {"_index": "knn_test1", "_id": "id_02"}}
{"id": "id_02", "vector": [1.0, 0.60]}
{"index": {"_index": "knn_test1", "_id": "id_03"}}
{"id": "id_03", "vector": [1.0, 0.70]}
{"index": {"_index": "knn_test1", "_id": "id_04"}}
{"id": "id_04", "vector": [1.0, 0.80]}
{"index": {"_index": "knn_test1", "_id": "id_05"}}
{"id": "id_05", "vector": [1.0, 0.90]}
{"index": {"_index": "knn_test1", "_id": "id_06"}}
{"id": "id_06", "vector": [1.0, 1.00]}
{"index": {"_index": "knn_test1", "_id": "id_07"}}
{"id": "id_07", "vector": [1.0, 1.15]}
{"index": {"_index": "knn_test1", "_id": "id_08"}}
{"id": "id_08", "vector": [1.0, 1.25]}
{"index": {"_index": "knn_test1", "_id": "id_09"}}
{"id": "id_09", "vector": [1.0, 1.35]}
{"index": {"_index": "knn_test1", "_id": "id_10"}}
{"id": "id_10", "vector": [1.0, 1.45]}

登録

$ curl "http://localhost:9200/knn_test1/_bulk?pretty" \
     -X POST \
     -H 'Content-Type: application/x-ndjson' \
     -T bulk_knn_test1.jsonl

■検索プログラム（k=3, num_candidates=5）

import sys
import json
from elasticsearch import Elasticsearch

url = 'http://localhost:9200'
es = Elasticsearch(url)

q = {
    'knn': {
        'field': 'vector',
        'query_vector': [1.0, 1.0],
        'k': 3,
        'num_candidates': 5,
    },
    'fields': ['id']
}

res = es.knn_search(index='knn_test1', body=q)
print(json.dumps(res['hits']['hits'], indent=2))

■実行結果（k=3, num_candidates=5）
k=3, num_candidates=5 の場合、3レコードのみが返却されます。

[
  {
    "id": "id_06",
    "score": 1.0
  },
  {
    "id": "id_05",
    "score": 0.99930894
  },
  {
    "id": "id_07",
    "score": 0.9987876
  }
]

■パラメータ変更（k=5, num_candidates=5）
k=5, num_candidates=5 にプログラムを変更します。

q = {
    'knn': {
        'field': 'vector',
        'query_vector': [1.0, 1.0],
        'k': 5,
        'num_candidates': 5,
    },
    'fields': ['id']
}

res = es.knn_search(index='knn_test1', body=q)
items = [
    {'id': item['_id'], 'score': item['_score']}
    for item in res['hits']['hits']
]
print(json.dumps(items, indent=2))

■実行結果（k=5, num_candidates=5）
k=5, num_candidates=5 に変更すると、検索結果は 5件になります。

[
  {
    "id": "id_06",
    "score": 1.0
  },
  {
    "id": "id_05",
    "score": 0.99930894
  },
  {
    "id": "id_07",
    "score": 0.9987876
  },
  {
    "id": "id_08",
    "score": 0.99694186
  },
  {
    "id": "id_04",
    "score": 0.9969418
  }
]

python で xlsx ファイルから tsv ファイルを生成

2022-08-07 17:13:21 | python

python で xlsx ファイルから tsv ファイルを生成する方法のメモ。
pandas の to_csv() で sep='\t' でセパレータにタブを指定することで、xlsx ファイルを tsv に変換することができます。

import sys
import pandas as pd

xlsx_file = 'test1.xlsx'
df = pd.read_excel(xlsx_file, sheet_name=0)
df.to_csv(sys.stdout, header=True, index=False, encoding='utf-8', sep='\t')

python で実行中のメソッド名を取得する方法

2022-08-01 23:45:53 | python

python で実行中のメソッド名を取得する方法のメモ。
実行中のメソッド名は inspect.currentframe().f_code.co_name で取得できます。

■プログラム

import inspect

def method1():
    print(inspect.currentframe().f_code.co_name)
    return

def method2():
    print(inspect.currentframe().f_code.co_name)
    return

def main():
    method1()
    method2()
    return 0

if __name__ == '__main__':
    res = main()
    exit(res)

■実行結果

method1
method2

記事一覧 | 画像一覧 | フォロワー一覧 | フォトチャンネル一覧

検索

バックナンバー

2025年03月

2025年02月

2025年01月

2024年12月

2024年11月

2024年10月

2024年09月

2024年08月

2024年07月

2024年06月

2024年05月

2024年04月

2024年03月

2024年02月

2024年01月

2023年12月

2023年11月

2023年10月

2023年09月

2023年08月

2023年07月

2023年05月

2023年04月

2023年03月

2023年02月

2023年01月

2022年12月

2022年11月

2022年10月

2022年09月

2022年08月

2022年07月

2022年06月

2022年05月

2022年04月

2022年03月

2022年02月

2022年01月

2021年12月

2021年11月

2021年10月

2021年09月

2021年07月

2021年06月

2021年04月

2021年03月

2021年02月

2021年01月

2020年11月

2020年09月

2020年08月

2020年07月

2020年06月

2020年05月

2020年04月

2020年03月

2020年02月

2019年12月

2019年11月

2019年10月

2019年09月

2019年08月

2019年07月

2019年06月

2019年04月

2019年02月

2019年01月

2018年12月

2018年11月

2018年10月

2018年09月

2018年07月

2018年06月

2013年09月

2013年06月

2012年07月

2012年06月

2012年05月

2012年01月

2011年11月

2011年09月

2011年08月

2011年07月

2011年06月

2011年05月

2011年04月

2011年03月

2011年02月

2011年01月

2010年12月

2010年11月

2007年05月

2007年03月

2007年02月

2007年01月

2006年12月

2006年11月

2006年10月

2006年09月

2006年08月

2006年07月

2006年06月

2006年05月

2006年04月

2006年03月

カレンダー

2022年8月
日	月	火	水	木	金	土
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

前月

次月

goo blog おすすめ

	おすすめブログ
	【コメント募集中】自宅で「蛍光灯」を使っていますか？

@goo_blog

お客さまのご利用端末からの情報の外部送信について

goo blog お知らせ

	【PR】プロ直伝・dポイントをザクザクためる術
	【PR】安い＆大量の「訳あり商品」がヤバい！
	【コメント募集中】自宅で「蛍光灯」を使っていますか？
	訪問者数に応じてdポイント最大1,000pt当たる！
	dポイントが当たる！無料『毎日くじ』

dak ブログ

python、rubyなどのプログラミング、MySQL、サーバーの設定などの備忘録。レゴの写真も。

MySQL で主キーではないカラムでユニーク化

MySQL の起動方法

jsdom で属性を取得

jsonl のデータをオブジェクトとして MySQL に格納

MySQL での json 型のデータへのインデックス

Node.js での jsdom を利用した HTML からのテキスト抽出

Elasticsearch のユーザ辞書の設定

Elasticsearch 8.X で類似ベクトル検索

python で xlsx ファイルから tsv ファイルを生成

python で実行中のメソッド名を取得する方法

検索

最新記事

カテゴリー

バックナンバー

カレンダー

goo blog おすすめ

goo blog お知らせ

python、rubyなどのプログラミング、MySQL、サーバーの設定などの備忘録。レゴの写真も。

検索

最新記事

カテゴリー

バックナンバー

カレンダー

ログイン

goo blog おすすめ

goo blog お知らせ