Blog posts from October 2015 - 裏 RjpWiki

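NLP 100 Exercises (言語処理100本ノック), Chapter 2: UNIX Command Basics

October 28, 2015 | Blogramming

I solve the 2015 edition of 言語処理100本ノック (NLP 100 Exercises), published by the Inui-Okazaki Laboratory at Tohoku University (http://www.cl.ecei.tohoku.ac.jp/nlp100/), in R.

A page with the same aim: https://rpubs.com/yamano357/85313

There are many ways to write each solution, depending on the contents of the file, so what follows is just one example (and not necessarily the best one).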

10. Counting lines

Count the number of lines. Use the wc command to check.

command line:
wc -l hightemp.txt

R:
length(readLines("hightemp.txt"))
nrow(read.table("hightemp.txt", header=FALSE))

AWK one liner:
gawk 'END{print NR}' hightemp.txt

11. Replace tabs with spaces

Replace each tab character with one space. Use the sed, tr, or expand command to check.

command line:
tr '\t' ' ' < hightemp.txt

R:
gsub("\t", " ", readLines("hightemp.txt"))

AWK one liner:
gawk '{gsub("\t", " ", $0);print}' hightemp.txt

12. Save column 1 to col1.txt and column 2 to col2.txt

Extract only the first column of each line and save it as col1.txt, and only the second column as col2.txt. Use the cut command to check.

command line:
cut -f 1 hightemp.txt > col1.txt
cut -f 2 hightemp.txt > col2.txt

R:
d = read.table("hightemp.txt", header=FALSE, as.is=TRUE)
write(d[,1], "col1.txt")
write(d[,2], "col2.txt")

AWK one liner:
gawk '{print $1 > "col1.txt"; print $2 > "col2.txt"}' hightemp.txt

13. Merge col1.txt and col2.txt

Combine col1.txt and col2.txt from task 12 to create a text file in which the first and second columns of the original file are placed side by side, separated by a tab. Use the paste command to check.

command line:
paste col1.txt col2.txt > merge.txt

R:
write(paste(readLines("col1.txt"), readLines("col2.txt"), sep="\t"), "merge.txt")

AWK one liner:
gawk '{getline a < "col2.txt"; print $0 "\t" a}' col1.txt > merge.txt

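14. Output the first N lines

Receive a natural number N by some means such as a command-line argument, and display only the first N lines of the input. Use the head command to check.

command line:
head -5 hightemp.txt

R:
readLines("hightemp.txt", 5)

AWK one liner:
gawk -v N=5 'FNR <= N' hightemp.txt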

15. Output the last N lines

Receive a natural number N by some means such as a command-line argument, and display only the last N lines of the input. Use the tail command to check.

command line:
tail -5 hightemp.txt

R:
tail(readLines("hightemp.txt"), 5)

AWK one liner:
gawk -v N=5 '{a[NR]=$0} END {for (i = NR-N+1; i <= NR; i++) print a[i]}' hightemp.txt
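
Tasks 14 and 15 ask for N to be received as an argument, whereas the R snippets above hard-code 5. The following is a minimal sketch of one way to parameterize them. The function names first_n and last_n and the script name headtail.R are hypothetical and do not come from the original post.

```r
# Hypothetical helpers (names are illustrative only).
first_n = function(path, n) readLines(path, n = n)    # like head -n
last_n  = function(path, n) tail(readLines(path), n)  # like tail -n

# When run as `Rscript headtail.R 5 hightemp.txt` (script name is an assumption),
# the arguments after the script name are available via commandArgs().
args = commandArgs(trailingOnly = TRUE)
if (length(args) == 2) {
    n = as.integer(args[1])
    print(first_n(args[2], n))
    print(last_n(args[2], n))
}
```

Inside an interactive session the helpers can be called directly, e.g. first_n("hightemp.txt", 5).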

16. Split a file into pieces of N lines

Receive a natural number N by some means such as a command-line argument, and split the input file into files of N lines each. Achieve the same thing with the split command.

command line: the files are named part-a, part-b, ... in order
split -a 1 -l 12 hightemp.txt part-

R: the files are named part-1, part-2, ... in order
d = readLines("hightemp.txt")
N = 12
no = 0
for (i in seq_along(d)) {
    if ((i-1)%%N == 0) { # start a new file every N lines
        no = no+1
        fn = sprintf("part-%i", no)
        APPEND = FALSE
    }
    write(d[i], file=fn, sep="", append=APPEND)
    APPEND = TRUE
}

AWK one liner: the files are named part-1, part-2, ... in order
gawk -v N=12 '{if ((NR-1) % N == 0) fn = "part-" (++no); print $0 > fn}' hightemp.txt
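
The loop in the R solution can also be written without an explicit line counter. The following is only a sketch of that alternative using base R's split(). The function name split_lines and its prefix argument are hypothetical.

```r
# Sketch: split lines into chunks of N and write each chunk to its own file.
# split_lines and the prefix argument are hypothetical names.
split_lines = function(path, N, prefix = "part-") {
    d = readLines(path)
    # chunk k holds lines (k-1)*N+1 .. k*N
    chunks = split(d, ceiling(seq_along(d) / N))
    files = paste0(prefix, seq_along(chunks))
    for (k in seq_along(chunks)) writeLines(chunks[[k]], files[k])
    invisible(files)   # return the file names without printing them
}
```

Calling split_lines("hightemp.txt", 12) would be expected to produce part-1, part-2, ... in the working directory, as the loop version does.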

17. Distinct strings in the first column

Find the distinct strings in the first column (the set of different strings). Use the sort and uniq commands to check.

command line:
cut -f 1 hightemp.txt | sort | uniq
cut -f 1 hightemp.txt | sort | uniq | wc -l

R:
d = read.table("hightemp.txt", header=FALSE, as.is=TRUE)
sort(unique(d[,1]))
length(sort(unique(d[,1])))

AWK one liner:
gawk '{a[$1]} END {for (i in a) print i}' hightemp.txt
gawk '{a[$1]} END {for (i in a) sum++; print sum}' hightemp.txt

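18. Sort lines in descending order of the numeric value in column 3

Sort the lines in reverse order of the numeric value in the third column (note: reorder the lines without changing their contents). Use the sort command to check (for this task the result need not match the output of the command).

command line:
sort -r -n -k 3,3 hightemp.txt

R:
d = readLines("hightemp.txt")
d[order(as.numeric(sapply(strsplit(d, "\t"), "[", 3)), decreasing=TRUE)]

AWK one liner:
Not well suited to this task.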

19. Frequency of each string in column 1, in descending order of frequency

Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands to check.

command line:
cut -f 1 hightemp.txt | sort | uniq -c | sort -rn

R:
d = read.table("hightemp.txt", header=FALSE, as.is=TRUE)
sort(table(d[,1]), decreasing=TRUE)

AWK one liner:
gawk '{a[$1]++} END {for (i in a) print i, a[i] | "sort -rn -k 2"}' hightemp.txt


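NLP 100 Exercises (言語処理100本ノック), Chapter 1: Warm-up

October 28, 2015 | Blogramming

I solve the 2015 edition of 言語処理100本ノック (NLP 100 Exercises), published by the Inui-Okazaki Laboratory at Tohoku University (http://www.cl.ecei.tohoku.ac.jp/nlp100/), in R.

A page with the same aim, https://rpubs.com/yamano357/84965, uses

library(dplyr)
library(stringr)
library(stringi)

and the like, but to my eye that only makes things more cumbersome (the author presumably finds it neat).

So here I write everything with base functions only, without any special packages (I happen to think nested function calls are neat).

00. Reversing a string

Obtain the string in which the characters of “stressed” are arranged in reverse order (from the end to the beginning).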

paste(rev(unlist(strsplit("stressed", ""))), collapse="")

01. "パタトクカシーー"

Extract the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and obtain the concatenated string.

paste(unlist(strsplit("パタトクカシーー", ""))[1:4*2-1], collapse="")

02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by concatenating the characters of "パトカー" and "タクシー" alternately from the beginning.

paste(t(matrix(unlist(strsplit("パトカータクシー", "")), 4)), collapse="")

03. Pi

Split the sentence “Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.” into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

s = gsub("[,.]", "", "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.")
nchar(unlist(strsplit(s, " ")))

04. Element symbols

Split the sentence “Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.” into words, take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary or map type) from each extracted string to the position of the word (its index from the beginning).

s = gsub("[.]", "", "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.")
s = substr(unlist(strsplit(s, " ")), 1, 2)
i = c(1, 5, 6, 7, 8, 9, 15, 16, 19)
s[i] = substr(s[i], 1, 1)
setNames(seq_along(s), s) # named vector: extracted string -> word position

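05. n-gram

Create a function that builds n-grams from a given sequence (a string, a list, etc.). Use this function to obtain the word bi-grams and character bi-grams from the sentence “I am an NLPer”.

func = function(s) {
    if (is.list(s)) s = paste(unlist(s), collapse=" ") # join list elements into one string
    s = unlist(strsplit(s, " ")) # words
    t = unlist(strsplit(s, "")) # characters of the words (spaces are dropped)
    list(word.bi.gram = cbind(s[-length(s)], s[-1]),
         char.bi.gram = cbind(t[-length(t)], t[-1]))
}
func("I am an NLPer") # the argument is a string
func(list("I am", "an", "NLPer")) # the argument is a list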
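
The task statement asks for n-grams in general, while func above is specific to bi-grams. The following is a minimal sketch of a version with n as a parameter. The name ngram is hypothetical, and unlike func it keeps spaces when given the characters of the whole sentence.

```r
# ngram is a hypothetical name; x is any vector (words or characters),
# n is the size of each gram. Returns a list of length-n vectors.
ngram = function(x, n = 2) {
    m = length(x) - n + 1              # number of n-grams
    if (m < 1) return(list())          # sequence shorter than n
    lapply(seq_len(m), function(i) x[i:(i + n - 1)])
}

words = unlist(strsplit("I am an NLPer", " "))
chars = unlist(strsplit("I am an NLPer", ""))
ngram(words, 2)   # word bi-grams
ngram(chars, 2)   # character bi-grams (spaces included)
ngram(words, 3)   # word tri-grams
```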

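06. Sets

Find the sets of character bi-grams contained in “paraparaparadise” and “paragraph” as X and Y respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and in Y.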

func = function(s) {
s = unlist(strsplit(s, ""))
unname(mapply(function(x, y) paste(x, y, sep=""), s[-length(s)], s[-1]))
}
(X = func("paraparaparadise"))
(Y = func("paragraph"))
union(X, Y)
intersect(X, Y)
setdiff(X, Y)
is.element("se", X)
is.element("se", Y)

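07. Sentence generation from a template

Implement a function that takes arguments x, y, z and returns the string "x時のyはz" (meaning "y at x o'clock is z"). Then set x=12, y=“気温”, z=22.4 and check the result.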

func1 = function(x, y, z) sprintf("%s時の%sは%s", x, y, z)
func1(12, "気温", 22.4)

func2 = function(x, y, z) paste(x, "時の", y, "は", z, sep="", collapse="")
func2(12, "気温", 22.4)

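08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification.
- If it is a lowercase English letter, replace it with the character whose code is (219 - character code)
- Output any other character as is
Use this function to encrypt and decrypt an English message.

s = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
cipher = function(s) paste(sapply(unlist(strsplit(s, "")), function(c) ifelse(is.element(c, letters), intToUtf8(219-utf8ToInt(c)), c)), collapse="")
(t = cipher(s))
cipher(t) # the function maps a to z onto z to a, so passing the encrypted text to the same function restores the original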
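
Because 219 - code maps a to z, b to y, and so on, the same conversion can be expressed as a character translation table. The following is a sketch of that alternative using base R's chartr(). The name cipher2 is hypothetical.

```r
# cipher2 is a hypothetical name: translate a..z to z..a, leave everything else.
cipher2 = function(s) {
    from = paste(letters, collapse = "")       # "abc...xyz"
    to   = paste(rev(letters), collapse = "")  # "zyx...cba"
    chartr(from, to, s)
}

msg = "Now I need a drink, alcoholic of course."
(enc = cipher2(msg))
cipher2(enc) == msg   # applying it twice restores the message
```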

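09. Typoglycemia

For a sequence of words separated by spaces, write a program that keeps the first and last characters of each word and randomly rearranges the order of the other characters. However, words of length 4 or less are not rearranged. Give it a suitable English sentence (for example “I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .”) and check the result.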

s = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind."
paste(sapply(unlist(strsplit(s, " ")), function(t) {
if ((n = nchar(t)) > 4) {
    t = unlist(strsplit(t, ""))
    t = paste(t[1], paste(sample(t[2:(n-1)]), collapse=""), t[n], sep="")
}
t}), collapse=" ")


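BASP: "P values banned, CIs banned too, and not keen on Bayesian estimation either"

October 28, 2015 | Statistics

> We encourage the use of a larger number of samples than is typical in psychology experiments, because the larger the number of samples, the more stable the descriptive statistics become and the smaller the problem of sampling error becomes, relatively speaking.

This is probably a translation issue, but the mere use of the term "number of samples" (サンプル数) tells you the level of the translator. It should be rendered as "sample size" (サンプルサイズ, or 標本の大きさ).

Whether "the problem of sampling error has become relatively smaller" is exactly what the P value evaluates "absolutely". The test statistic is the number that expresses how large or small a difference is relative to the sampling error, and the P value is computed from that test statistic. Blind faith in the "absolute criterion" P < 0.05 may well be a problem, but "the sample size is about this large, so it is probably fine" is hardly scientific.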
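
The relationship described above (a test statistic as the difference relative to the sampling error, and a P value computed from it) can be sketched for the equal-variance two-sample t test. The data below are made-up numbers for illustration only, and the variable names are hypothetical.

```r
# Made-up data for illustration only (not from any study).
x = c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5)
y = c(4.2, 4.8, 5.0, 4.4, 5.1, 4.6)
nx = length(x)
ny = length(y)

# pooled variance and standard error of the difference in means
sp2 = ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
se = sqrt(sp2 * (1 / nx + 1 / ny))

t.stat = (mean(x) - mean(y)) / se                 # difference relative to sampling error
p.value = 2 * pt(-abs(t.stat), df = nx + ny - 2)  # two-sided P value from the statistic
c(t = t.stat, p = p.value)

t.test(x, y, var.equal = TRUE)                    # built-in test for comparison
```

The hand-computed values should agree with the statistic and p-value reported by t.test().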

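According to another article, ten-odd percent of the papers in journals of psychology-related societies in the US (?) contain dubious P value reporting.
There is also a blog called 「日本心理学会を嗤うブログ」 ("A blog that sneers at the Japanese Psychological Association"). It reads like an anonymous screed and I have no idea whether it is true, but it tears the society to shreds.

Well, the people involved (on either side) must be having a hard time, but I wonder whether people outside the field just think "do as you please".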

http://www.editage.jp/insights/a-taylor-francis-journal-announces-ban-on-p-values

http://link.springer.com/article/10.3758/s13428-015-0664-2

http://jpa2013.seesaa.net/


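My Number check digit (part 2)

October 13, 2015 | Blogramming

News reports and the like show 123456789012 as an example of a My Number, but no such My Number exists (func is the function defined in the previous post, shown below).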

> func(123456789012)
[1] "NG"

> func(12345678901)
[1] 8
> func(123456789018)
[1] "OK"

If the first 11 digits are 12345678901, the last digit has to be 8 (lol).


My Number check digit

October 13, 2015 | Blogramming

Given a 12-digit My Number, the following function reports whether it contains an input error. Given 11 digits, it returns the check digit.

func = function(n) {
   s = as.character(n)
   n = nchar(s)
   if (n < 11 || n > 12) return("Error")
   s = as.numeric(unlist(strsplit(s, "")))
   d = sum(s[1:11] * c(6:2, 7:2)) %% 11
   d = ifelse(d < 2, 0, 11 - d)
   ifelse(n == 11, d, ifelse(s[12] == d, "OK", "NG"))
}

> func(32111343233)
[1] 1
> func(321113432331)
[1] "OK"
> func(421113332331)
[1] "OK"

As the last two examples make clear, when someone "deliberately" gives a My Number that is valid but not their own, the check digit cannot catch it.
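
Since func takes a number, any leading zeros are lost before the digits are counted. The following is a sketch of the same weighted-sum calculation taking the first 11 digits as a string. The name check.digit is hypothetical, and the weights c(6:2, 7:2) are the ones used in func above.

```r
# check.digit is a hypothetical helper: s11 is the first 11 digits as a string.
check.digit = function(s11) {
    p = as.integer(unlist(strsplit(s11, "")))
    stopifnot(length(p) == 11)
    r = sum(p * c(6:2, 7:2)) %% 11   # weighted sum modulo 11
    if (r <= 1) 0 else 11 - r        # remainders 0 and 1 give check digit 0
}

check.digit("12345678901")   # 8
check.digit("32111343233")   # 1
check.digit("42111333233")   # 1: a different number with the same check digit
```

The last two calls illustrate the point of the post: two different 11-digit prefixes can share a check digit, so a well-formed but wrong number passes the check.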


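Shift and add

October 8, 2015 | Blogramming

There is a device.
The device has a display, which initially shows 1.
Below the display are two buttons, [+1] and [×2].
They add 1 to the number shown on the display and double it, respectively.
Find the minimum number of button presses needed to make the display show a given number.
For example,
displaying 10 takes 4 presses: ×2, ×2, +1, ×2
displaying 40 takes 6 presses: ×2, ×2, +1, ×2, ×2, ×2
displaying 60 takes 8 presses: ×2, +1, ×2, +1, ×2, +1, ×2, ×2
displaying 65 takes 7 presses: ×2, ×2, ×2, ×2, ×2, ×2, +1

In other words, every positive integer, written in binary, can be built from the initial value 1 by left shifts and additions of 1.
Reading the binary digits from the left, each digit from the second onward costs one left shift if it is 0, and one left shift plus one addition of 1 if it is 1.

func = function(n) {
    a = NULL
    repeat { # binary digits of n, most significant first
        a = c(n %% 2, a)
        if ((n = n %/% 2) == 0) break
    }
    a = a[-1] # drop the leading 1 (the initial display)
    sum(a == 0) + sum(a == 1)*2
}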

> func(10)
[1] 4

> func(40)
[1] 6

> func(60)
[1] 8

> func(65)
[1] 7

> func(10000000)
[1] 30
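
func returns only the count. As a sketch of the same idea, the sequence of presses can be recovered by working backwards from n: an odd number must have been reached by +1, an even number by doubling. The name presses and the labels "+1" and "*2" are hypothetical.

```r
# presses is a hypothetical helper that lists the button presses for n >= 1.
presses = function(n) {
    ops = character(0)
    while (n > 1) {
        if (n %% 2 == 1) {        # odd: the last press was +1
            ops = c("+1", ops)
            n = n - 1
        } else {                  # even: the last press was a doubling
            ops = c("*2", ops)
            n = n / 2
        }
    }
    ops
}

presses(10)           # "*2" "*2" "+1" "*2"
length(presses(60))   # 8
```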


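Approximating a real number by a fraction

October 8, 2015 | Blogramming

Express a real number x, 0.1 ≦ x ≦ 10, as the fraction that gives the smallest approximation error.
The numerator and the denominator must both be integers of at most 6 digits.
For example, for x = 1.618033963166706... the answer is 6765 / 4181.

The long variable names make it look complicated, but it is really simple. No for loop is used; everything is done with vector operations.

func2 = function(x) {
    denominator = 1:999999
    numerator = as.integer(x*denominator) # truncated numerator for each denominator
    denominator = rep(denominator, 2)
    numerator = c(numerator, numerator +1) # candidates just below and just above x
    is.ok = 999999 >= numerator # keep numerators of at most 6 digits
    numerator = numerator[is.ok]
    denominator = denominator[is.ok]
    subscript = which.min(abs(x-numerator/denominator))
    cat(numerator[subscript], "/", denominator[subscript])
}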

> func2(1.618033963166706)
6765 / 4181
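
To try the same search on a smaller scale, the following sketch makes the digit limit a parameter and returns the result instead of printing it. The name best.fraction and the limit argument are hypothetical. The candidate set (the truncated numerator and that value plus 1 for each denominator) is the same as in func2.

```r
# best.fraction is a hypothetical variant of func2 with an adjustable limit.
best.fraction = function(x, limit = 999999) {
    denominator = seq_len(limit)
    numerator = floor(x * denominator)          # candidate just below x
    denominator = rep(denominator, 2)
    numerator = c(numerator, numerator + 1)     # plus the candidate just above x
    ok = numerator <= limit                     # numerator must respect the limit too
    numerator = numerator[ok]
    denominator = denominator[ok]
    i = which.min(abs(x - numerator / denominator))
    c(numerator = numerator[i], denominator = denominator[i])
}

best.fraction(pi, 999)
```

With numerator and denominator limited to 999, the call above should return the well-known approximation 355 / 113.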


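Palindromic binary numbers

October 7, 2015 | Blogramming

How many integers greater than m and less than n satisfy the following condition?
Condition: when the number is written in binary, reversed left to right, and converted back to decimal, it equals the original number.

Really, there is no need to reverse it and convert it back to decimal.
It is enough to check whether the binary representation is left-right symmetric.
Example: (17)_10 = (10001)_2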

func = function(m, n) {
    s = 0
    for (i in (m + 1):(n - 1)) {
        k = NULL
        repeat {
            k = c(k, i %% 2)
            if ((i = i %/% 2) == 0) break
        }
        s = s + all(k == rev(k))
    }
    s
}
func(0, 10000) # 204
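
The digit extraction inside the loop can also be done with base R's intToBits(). The following is a sketch of that alternative for a single integer. The name is.bin.palindrome is hypothetical, and it assumes 1 <= i < 2^31.

```r
# is.bin.palindrome is a hypothetical helper for 1 <= i < 2^31.
is.bin.palindrome = function(i) {
    b = as.integer(intToBits(i))         # 32 bits, least significant first
    b = b[seq_len(max(which(b == 1)))]   # drop the unused high-order zeros
    all(b == rev(b))
}

is.bin.palindrome(17)                    # TRUE: 10001
sum(sapply(1:9999, is.bin.palindrome))   # 204, same as func(0, 10000)
```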


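Modular arithmetic

October 2, 2015 | Blogramming

For a natural number n in decimal (1 ≦ n ≦ 10¹⁰), define F(n) as the number obtained by writing the hexadecimal digit A n times.
Output the remainder when F(n), expressed in decimal, is divided by 10⁶.

For example, F(10) in decimal is 733007751850, and the remainder of this number divided by 10⁶ is 751850.

fun = function(n) {
    ans = 10 # F(1)
    if (n >= 2) {
        n = (n-2) %% 3125 # the remainders repeat with period 3125 from F(2) onward
        for (i in 0:n) {
            ans = (ans*16 + 10) %% 1e6
        }
    }
    ans
}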

> fun(10)
[1] 751850
> fun(9999999999)
[1] 462890

Hint: the remainders modulo 1e6 that actually occur are limited. There are nowhere near 1e6 different values.
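
The hint can be checked numerically. The sketch below builds F(n) mod 1e6 with the same recurrence as fun and looks at how the values repeat. The variable names and the cutoff of 7000 terms are arbitrary choices for illustration.

```r
# x[n] holds F(n) mod 1e6, built with the same recurrence as fun().
len = 7000
x = numeric(len)
x[1] = 10
for (n in 2:len) x[n] = (x[n - 1] * 16 + 10) %% 1e6

x[10]                                # 751850, as in the example
all(x[2:3000] == x[2:3000 + 3125])   # TRUE: period 3125 from n = 2 onward
x[1] == x[1 + 3125]                  # FALSE: F(1) itself lies outside the cycle
length(unique(x))                    # 3126 distinct remainders in total
```

This is why n can be reduced modulo 3125 before looping, and why n = 1 needs to be treated separately from the periodic part.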
