Blog posts for September 18, 2019 - ウィリアムのいたずらの、まちあるき、たべあるき

Sample: running LDA (Latent Dirichlet Allocation) in Python with gensim

2019-09-18 09:02:24 | Weblog

It goes something like this:

####################################
# Read the file
####################################
file_name = 'nihongo.csv'

# Read every line, skipping blank ones.
# Assumes the file is UTF-8; change the encoding if yours differs.
fdata = []
with open(file_name, encoding='utf-8') as file:
    for row in file:
        row = row.rstrip()
        if not row:
            continue
        fdata.append(row)


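####################################
# Morphological analysis and token extraction
####################################
from janome.tokenizer import Tokenizer
t = Tokenizer()

DATA = []

for rec in fdata:
    output = []

    for w in t.tokenize(rec):
        # part_of_speech is comma-separated; the first field is the main class.
        # Comparing only that field avoids substring matches such as
        # 助動詞 (auxiliary verb) being picked up as 動詞 (verb).
        pos = w.part_of_speech.split(',')[0]
        if pos == "名詞":      # noun: keep the surface form
            output.append(w.surface)
        elif pos == "動詞":    # verb: keep the base (dictionary) form
            output.append(w.base_form)

    DATA.append(output)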


####################################
# Check the data
####################################
for rec in DATA:
    print(rec)

####################################
# Split the data into train and test
####################################
half = len(DATA) // 2   # first half for training, second half for testing
train = DATA[:half]
test = DATA[half:]

####################################
# Build the LDA model
####################################
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(train)
print(dictionary)

# Filtering: drop tokens that appear in fewer than 2 documents
# or in more than 50% of the documents
dictionary.filter_extremes(no_below=2, no_above=0.5)

# Build the ID -> token mapping
token_2_id = dictionary.token2id
id_2_token = {token_id: token for token, token_id in token_2_id.items()}
print(len(id_2_token))
print(id_2_token)

# Convert the training data to bag-of-words (ID, count) form
input_data = [dictionary.doc2bow(text) for text in train]
print(len(input_data))
print(input_data)

# Run LDA (split into topics)
n = 5   # use 5 topics
lda = LdaModel(input_data, num_topics=n, id2word=id_2_token)
print(lda.bound(input_data))            # variational bound on the training data
print(lda.log_perplexity(input_data))   # per-word likelihood bound

# Breakdown of each topic
for i in range(n):
    print(i, ' ', lda.print_topic(i))
    print()

####################################
# Distributed (topic) representation of the target sentences from the LDA model
####################################
test_data = [dictionary.doc2bow(text) for text in test]
res_inference = lda.inference(test_data)   # returns (gamma, sstats)
for i in range(len(res_inference[0])):
    print(i, test[i], ":", res_inference[0][i])
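
The rows printed from `lda.inference` are the variational Dirichlet parameters (gamma), not probabilities, so they do not sum to 1. If a per-document topic distribution is wanted, each row can be divided by its sum. A minimal sketch of that step follows; the function name `normalize_gamma` and the example numbers are made up for illustration and are not part of the script above.

```python
def normalize_gamma(gamma):
    """Turn each row of gamma into a probability distribution over topics."""
    result = []
    for row in gamma:
        total = float(sum(row))
        result.append([float(v) / total for v in row])
    return result


# Hypothetical gamma values for two documents and three topics
example_gamma = [[0.2, 3.2, 0.6], [1.5, 0.25, 0.25]]
topic_dist = normalize_gamma(example_gamma)
for i, dist in enumerate(topic_dist):
    print(i, [round(p, 3) for p in dist])
```

With the script above, this would presumably be called as `normalize_gamma(res_inference[0])`.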

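Because the data is split in half and the topics are built from one half,
the topics shift every time. The train side could probably just be
a fixed set used only for building the topics.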
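
One possible way to keep the training set fixed between runs, even if the split is shuffled, is to seed the shuffle. The sketch below uses only the standard library; `split_fixed`, its parameters, and the sample documents are assumptions made for illustration. Separately, gensim's `LdaModel` accepts a `random_state` argument, and passing a fixed integer there should make the model's own random initialization repeatable as well.

```python
import random


def split_fixed(docs, train_ratio=0.5, seed=0):
    """Shuffle docs with a fixed seed and split them into (train, test)."""
    shuffled = list(docs)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order every run
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical tokenized documents standing in for DATA
docs = [["猫", "歩く"], ["犬", "走る"], ["鳥", "飛ぶ"], ["魚", "泳ぐ"]]
train_docs, test_docs = split_fixed(docs)
print(train_docs)
print(test_docs)
```

The same `seed` always yields the same two halves, so the dictionary and the topics are built from the same documents on every run.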
