2012年8月のブログ記事一覧-裏 RjpWiki

Rstudio の使い方（Mac の場合？）

2012年08月31日 | ブログラミング

少なくとも，私の環境 Mac OS X 10.8.1 では，Rstudio 0.96.330 は，日本語入力においてユーザにストレスを感じさせる（入力中，全く入力状況が表示されない）。とても，まともに使うことはできない。

偶然（？）発見した少しはましなやり方は，

1. 外部エディタ（私は Jedit 2.33）を使って，ソース（*.Rmd）を作る・編集する

2. Rstudio で，そのファイルを使用するようにしておくと，外部エディタの編集結果を保存すると，Rstudio のソース・ウインドウに *.Rmd の編集結果が反映される

3. Rstudio で "Preview" をクリックすると，その時点での *. Rmd ファイルから *.html ファイルが作成され，その結果が Rstudoio:Preview HTML ウインドウに表示される

これを繰り返して，最終的なファイルを得ることができるということ。

まあ，R コンソールで knit2html 関数を実行し，ブラウザで結果を確認するという繰り返しでもよいのだけど，少しだけ手間を省ける。

頑張ってます。クリックお願いします。

R Markdown まとめ

2012年08月30日 | ブログラミング

不十分ではあるが，http://rpubs.com/r-de-r/1373　にまとめた

頑張ってます。クリックお願いします。

Mountain Lion で Rcpp が使えなくなっていた

2012年08月29日 | ブログラミング

原因は，Xcode 4.4 になったが，デフォルトのインストールでは Command line tools がインストールされていなかった。

Command line tools は，

Downloading this package will install copies of the core command line tools and system headers into system folders, including the LLVM compiler, linker, and build tools.

ということ。これがないと，コンパイルできなかったのだ。

Xcode の「環境設定」--> Downloads タグ --> Components で，Comand Line Tools を選択してインストール

頑張ってます。クリックお願いします。

ダメ出し：よく分からないけど，ダメ　その2

2012年08月28日 | ブログラミング

「バク＝スネッペンのゲーム」

こんな感じになるのかな?

n <- 100
limit <- 2/3
x <- runif(n+2, 0, 1)
repeat {
   xmin <- which.min(x[2:(n+1)])
   if (x[xmin+1] >= limit) break
   x[xmin:(xmin +2)] <- runif(3)
}
z <- numeric(n)
for (i in 1:n) {
   y <- 0
   xmin <- which.min(x[2:(n+1)])
   x[xmin:(xmin +2)] <- runif(3)
   repeat {
       xmin <- which.min(x[2:(n+1)])
       if (x[xmin+1] >= limit) break
       x[xmin:(xmin +2)] <- runif(3)
       y <- y+1
   }
   z[i] <- y
}
plot(sort(log(z)))

fit は，データに 0 があるとまずいのではないでしょうかね...

> z <- z[z > 0]
> summary(power.law.fit(z))
Maximum likelihood estimation

Call:
mle(minuslogl = mlogl, start = list(alpha = start))

Coefficients:
Estimate Std. Error
alpha 0.4585079 0.05960587

頑張ってます。クリックお願いします。

ダメ出し：よく分からないけど，ダメ

2012年08月27日 | ブログラミング

「バク＝スネッペンのゲーム」は，ちゃんとプログラムできていない。
プログラムの仕様（ゲームのルール）がよく分からないので，書き直しできないなあ。

# 種の数
n <- 100
x <- runif(n, 0, 1)
z <- numeric(0)
# すべてを2/3以上にする操作
repeat {
    x[0] <- 1                    # R の添え字は 1 から始まる
    x[n + 1] <- 1
    xmin <- which.min(x)         # xmin が 1 や n のときの処理はこれではだめ
    if (x[xmin] < 2/3) {         # x[0] には代入できないし，x[n+1] が最小になったら要素が n+1 に加わる
        xr <- runif(3, 0, 1)
        x[xmin - 1] <- xr[1]
        x[xmin] <- xr[2]
        x[xmin + 1] <- xr[3]
    } else break
}

# sampleを100個
for (i in 1:100) {
    y <- 0
    # 最小のものにある値を代入
    x[0] <- 1
    x[n + 1] <- 1
    xmin <- which.min(x)
    xr <- runif(3, 0, 1)
    x[xmin - 1] <- xr[1]
    x[xmin] <- xr[2]
    x[xmin + 1] <- xr[3]
    # すべてを再び2/3以上に
    repeat {
        x[0] <- 1
        x[n + 1] <- 1
        xmin <- which.min(x)
        if (x[xmin] < 2/3) {
            xr <- runif(3, 0, 1)
            x[xmin - 1] <- xr[1]
            x[xmin] <- xr[2]
            x[xmin + 1] <- xr[3]
            y <- y + 1
        } else break
    }
    z <- rbind(z, y)      # y は 1 個の要素，z はベクトルなので，せめて z <- c(z, y)
}                         # しかし，それは間違った方法

頑張ってます。クリックお願いします。

ダメ出し：match は遅い

2012年08月25日 | ブログラミング

「R で特定の値の行や列を見つける方法」で，match 関数が挙げられている。後半に which.max, which.min が挙げられているのになぜ，which に言及されていないのか？

まず，ベンチマークテスト

> a <- 1:1e6
> library(rbenchmark)
> benchmark(which(a == 1e4), match(1e4, a), columns=c("test", "elapsed", "relative", "user.self", "sys.self", "replications"))

               test elapsed relative user.self sys.self replications
2   match(10000, a)   7.227 7.796117     6.921    0.364          100
1 which(a == 10000)   0.927 1.000000     0.729    0.209          100

match と which の違いを示す

> x <- c(9, 5, 10, 7, 4, 2, 6, 8, 3, 4)
> match(4, x)
[1] 5
> which(x == 4)
[1] 5 10
> match(c(5, 10), x)
[1] 2 3

頑張ってます。クリックお願いします。

ダメ出し：提示する統計グラフに注意　その2

2012年08月24日 | 統計学

前の記事（元記事も）での religion = unknown の記号の色が gray で，バックグラウンドの色と近くて，目に入りにくいということもあるし，プログラムがごちゃごちゃしているので religion 別の散布図を描いた方がよさそう。

ということで，描いてみた。シンプルに plot を使って（余分な情報はいらない）。

df.split <- split(df.merged, df.merged$religion)
layout(matrix(1:4, 2, byrow=TRUE))
par(mgp=c(1.6, 0.6, 0), mar=c(3, 3, 0.5, 0.8))
lapply(df.split, function(d) {
   plot(babies~income, data=d, xlim=c(0, 80000), ylim=c(0, 7),]
         pch=19, col="#00330060", cex=0.7)
   text(35000, 6, d[1, "religion"])
   })
layout(1)

Eastern religions は国の数も少ないが，babies が 3 ぐらいまでで，他とはちょっと違う。
income < 15000 では babies > 3 の国がかなりある。income > 15000 では babies はほぼ一定。

頑張ってます。クリックお願いします。

コメント (1)

ダメ出し：提示する統計グラフに注意

2012年08月24日 | 統計学

"Getting data from GapMinder.org" のグラフについて

当初は，このグラフ（図1）を見て，「記号の大きさが人口に比例していないなあ。困ったものだ」と思っていた。おまけに，記号の大きさは人口そのままではなく，log(人口+1e7) になっている。

図1

ggplot は記号の大きさを，最大値と最小値から決めるようで，筆者が 1e7 を採用したのは，記号の大きさのバランスがちょうど好みにあったのだろう。population をそのまま使った図2と比較すればよい。

図2

そして，このふたつの図を比べると，受ける印象が全く違うことにびっくりする。どちらかの図が不適切なわけだ。だって，どちらも適切なわけがない。
筆者がなぜ babies とincome, population, religion を取り上げたのか真意は不明だが，このデータを分析してみる。
まず，religion 別に population, babies, income 相互間の相関を見る。
人口が極端に大きいふたつの国(インドと中国）の影響を除くために，スピアマンの順位相関係数を計算した。

> df.split <- split(df.merged, df.merged$religion)
> lapply(df.split, function(d) round(cor(d[,3:5], use="pair", method="spearman"), 3))
$Christian
           population babies income
population      1.000 -0.029 -0.028
babies         -0.029 1.000 -0.721
income         -0.028 -0.721 1.000

$`Eastern religions`
           population babies income
population      1.000 0.104 -0.275
babies          0.104 1.000 -0.871
income         -0.275 -0.871 1.000

$Muslim
           population babies income
population      1.000 0.023 -0.098
babies          0.023 1.000 -0.757
income         -0.098 -0.757 1.000

$unknown
           population babies income
population      1.000 0.043 0.028
babies          0.043 1.000 -0.689
income          0.028 -0.689 1.000

この結果を見ると，religion に係わらず，3変数は同じような相関関係にあることがわかる。Eastern religions で，population と babies, income の間の相関が他と比べると若干高いが。babies も income も population で調整されているので，こういうことをしなくても population の要因は除外してよいだろうということがわかる。とすれば，当初描かれた図は，特に人口サイズが強調された図は不適切ということであろう。ということで，図3を得る。

図3

（つづく）

頑張ってます。クリックお願いします。

ダメ出し：そんなことやっちゃダメだってば!

2012年08月24日 | ブログラミング

「haploid から diploidへの座標変換」の，「座標で表す」の項だけど，

むちゃくちゃしますね。びっくりします。ほんとに。

p1 <- c(0.3,0.7)
p2 <- c(0.4,0.6)
f <- 0.6 # r^2の値
# 連鎖平衡の下でのハプロタイプ頻度
h.LE <- p1 %*% t(p2)
# 連鎖不平衡の下でのハプロタイプ頻度
delta <- f^(1/2) * prod(h.LE)^(1/4)
h.LD <- h.LE + delta * matrix(c(1,-1,-1,1),2,2)
# r^2は以下で求まる統計量
chisq.test(h.LE,correct=FALSE)
chisq.test(h.LD,correct=FALSE)

> h.LE
[,1] [,2]
[1,] 0.12 0.18
[2,] 0.28 0.42

> h.LD
[,1] [,2]
[1,] 0.2938965 0.006103479
[2,] 0.1061035 0.593896521

こんなものを chisq.test に渡しちゃダメダメ!!
引数は，集計表!
こんなこと，初心者だってやらないよ。

頑張ってます。クリックお願いします。

ダメ出し：ifelse は遅い

2012年08月23日 | ブログラミング

Rでフィボナッチ数！ http://d.hatena.ne.jp/mickey24/20080914/1221334119
syou6162先生が一行でやってくれました http://d.hatena.ne.jp/mickey24/20080914/1221334177
Rでフィボナッチ数！(動的計画法版) http://d.hatena.ne.jp/mickey24/20080914/1221336203

再帰関数として書いたもの　これは，おそい
fib1 <- function(n) {
if (n == 0) { return(0) }
if (n == 1) { return(1) }
fib1(n-1) + fib1(n-2)
}

一行野郎で書いたもの　これは，作者は心外だろうが，大変大変遅い
なぜおそいかというと，それは，ifelse を使っているから
ifelse は，条件によらず二通りの計算をして，条件によりどちらかを選ぶ　だから，とってもとっても遅い
fib2 <- function(n) {
   ifelse (n > 1, Recall(n-1) + Recall(n-2), ifelse(n == 1, 1, 0))
}

同じ一行にするのでも，if else を使えば func1 と基本的に同じなのだから，ifelse ほど遅くはない
fib3 <- function(n) {
   if (n > 1) Recall(n-1) + Recall(n-2) else if (n == 1) 1 else 0
}

動的計画法版のプログラムだが，ベクトルを使う必要は余り感じられない
fib4 <- function(n) {
if (n == 0) { return(0) }
x <- numeric(max(n,2))
x[1] <- x[2] <- 1
m <- 3
while (m <= n) {
    x[m] <- x[m-1] + x[m-2]
    m <- m + 1
}
x[n]
}

普通の変数を使って書けば，最速!!
fib5 <- function(n) {
   if (n == 0) return(0)
   if (n == 1) return(1)
   a <- 0
   b <- 1
   for (i in 2:n) {
       c <- a + b
       if (i == n) return(c)
       a <- b
       b <- c
   }
}

> n <- 15
> benchmark(fib1(n), fib2(n), fib3(n), fib4(n), fib5(n),
+           columns=c("test", "elapsed", "relative", "replications"),
+           replications=200)
     test elapsed    relative replications
1 fib1(n)   0.751 125.166667          200
2 fib2(n)   6.229 1038.166667          200
3 fib3(n)   0.803 133.833333          200
4 fib4(n)   0.014    2.333333          200
5 fib5(n)   0.006    1.000000          200

頑張ってます。クリックお願いします。

ダメ出し：重回帰分析の標準誤差の推定値(Residual standard error)

2012年08月23日 | ブログラミング

C.P. ロバート， G. カセーラ著
石田基広，石田和枝
R によるモンテカルロ法入門

23ページに，「lm() 関数は（中略）意外なことに標準誤差の推定値がリスト要素として含まれておらず（中略）このすうちそのものは次のようにして求める必要があります」

これは，summary.lm 関数が返す $sigma である。わざわざ計算しなくても良い。
回帰の分散分析の F 統計量と自由度も lm 関数の戻り値にはないが，summary.lm 関数の戻り値には含まれる。しかし，その P 値は自分で計算しないといけない。

> x <- 1:10
> y <- c(9, 7, 1, 4, 5, 10, 6, 2, 8, 3)
> a <- lm(y ~ x)
> (b <- summary(a))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q Median      3Q     Max
-4.9697 -1.7500 0.0939 2.2015 4.5939

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.5333     2.1547   3.032   0.0163 *
x            -0.1879     0.3473 -0.541   0.6032
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.154 on 8 degrees of freedom
Multiple R-squared: 0.0353,   Adjusted R-squared: -0.08529
F-statistic: 0.2927 on 1 and 8 DF, p-value: 0.6032

> (u <- sqrt(sum(a$res^2) / a$df))
[1] 3.154122
> (v <- b$sigma) # これですよ!
[1] 3.154122
> all(u == v)
[1] TRUE

> b$fstatistic
    value     numdf     dendf
0.2927201 1.0000000 8.0000000

> (p <- pf(b$fstatistic[1], b$fstatistic[2], b$fstatistic[3], lower.tail=FALSE))
    value
0.6032176

頑張ってます。クリックお願いします。

barplot 関数の戻り値

2012年08月23日 | ブログラミング

barplot 関数は，描いた棒の中心 x 座標を返す。

座標軸，目盛り等を描くときにはそれを用いればよい。

> x <- dbinom(0:6, 6, 0.1)
> a <- barplot(x)
> str(a)
num [1:7, 1] 0.7 1.9 3.1 4.3 5.5 6.7 7.9
> axis(1, a, 0:6)

頑張ってます。クリックお願いします。

コメント (1)

ダメ出し：小細工せずに素直に（あるいは，効果を確かめてから）　その2

2012年08月23日 | ブログラミング

驚くべきことに（??）

w <- matrix(a, 2)[2,]

は

v <- a[1:(n%/%2)*2L] # 最速！！

と同じくらいに速い。

頑張ってます。クリックお願いします。

ダメ出し：小細工せずに素直に（あるいは，効果を確かめてから）

2012年08月22日 | ブログラミング

「kingqwertの覚書」の
http://d.hatena.ne.jp/kingqwert/
2012-08-22 の項
http://d.hatena.ne.jp/kingqwert/20120822

「リサイクル規則をうまく使う方法」を挙げてあるが，そんなに良い方法ではない。内部では結局の所リサイクル処理をしているので，1:(n%/%2)*2 のように選択する添え字ベクトルを指定してやるのと同等の速度である。むしろ，一番最後の例のように，添え字が整数ベクトルになるようにすれば，一番速い。

> n <- 10000000L
> a <- rnorm(n)
>
> invisible(gc()); invisible(gc()); system.time({
+    x <- a[seq_along(a) %% 2 == 0]
+ })
   ユーザ   システム       経過
     0.363      0.010      0.371
> invisible(gc()); invisible(gc()); system.time({
+    y <- a[!(seq_along(a) %% 2)]
+ })
   ユーザ   システム       経過
     0.345      0.000      0.344
> invisible(gc()); invisible(gc()); system.time({
+    z <- a[c(FALSE,TRUE)]
+ })
   ユーザ   システム       経過
     0.076      0.001      0.077
> invisible(gc()); invisible(gc()); system.time({
+    u <- a[1:(n%/%2)*2]
+ })
   ユーザ   システム       経過
     0.063      0.001      0.063
> invisible(gc()); invisible(gc()); system.time({
+    v <- a[1:(n%/%2)*2L]                        # 最速！！
+ })
   ユーザ   システム       経過
     0.048      0.000      0.048

> all(x == y)
[1] TRUE
> all(x == z)
[1] TRUE
> all(x == u)
[1] TRUE
> all(x == v)
[1] TRUE

頑張ってます。クリックお願いします。

ダメ出し：もう何回も書いたけど...

2012年08月22日 | ブログラミング

「高校数学でわかる統計学」の「第1章　サイは投げられた！（期待値と分散）」において，

乱数は纏めて発生させる。そうすれば for ループを使う必要もない。
両方併せると，実行時間に大きな違いが出る（もっとも，100000 × 2 個発生させて 1 秒なんだから，時間が問題なのではない（

> system.time({
+ n <- 100000
+ x <- numeric(n)
+ for (i in 1:n) {
+ x[i] <- sum(sample(x=1:6, size=2, replace=TRUE, prob=rep(1, 6)))
+ }
+ barplot(table(x) / n, ylim=c(0, 0.2))
+ })
   ユーザ   システム       経過
     1.302      0.017      1.318

> system.time({
+ n <- 100000
+ d <- matrix(sample(1:6, 2*n, replace=TRUE), 2)
+ x <- colSums(d)
+ barplot(table(x)/n, ylim=c(0, 0.2))
+ })
   ユーザ   システム       経過
     0.128      0.003      0.136

頑張ってます。クリックお願いします。

PVアクセスランキングにほんブログ村

プロフィール

フォロー中フォローするフォローする

バックナンバー

2025年04月

2025年03月

2025年02月

2025年01月

2024年12月

2024年11月

2024年10月

2024年09月

2024年08月

2024年07月

2024年06月

2024年05月

2024年04月

2024年03月

2024年02月

2024年01月

2023年12月

2023年11月

2023年10月

2023年09月

2023年08月

2023年07月

2023年06月

2023年05月

2023年04月

2023年03月

2023年02月

2023年01月

2022年12月

2022年11月

2022年10月

2022年09月

2022年08月

2022年07月

2022年06月

2022年05月

2022年04月

2022年03月

2022年02月

2022年01月

2021年12月

2021年11月

2021年10月

2021年09月

2021年08月

2021年07月

2021年06月

2021年05月

2021年04月

2021年03月

2021年02月

2021年01月

2020年12月

2020年11月

2020年10月

2020年09月

2020年08月

2020年07月

2020年06月

2020年05月

2020年04月

2020年03月

2020年02月

2020年01月

2019年12月

2019年11月

2019年09月

2019年08月

2019年07月

2019年06月

2019年04月

2018年12月

2018年09月

2018年08月

2018年07月

2018年06月

2018年05月

2018年03月

2018年02月

2018年01月

2017年12月

2017年11月

2017年10月

2017年09月

2017年08月

2017年07月

2017年06月

2017年05月

2017年04月

2017年03月

2017年02月

2017年01月

2016年12月

2016年11月

2016年10月

2016年09月

2016年08月

2016年07月

2016年06月

2016年05月

2016年04月

2016年03月

2016年02月

2016年01月

2015年12月

2015年11月

2015年10月

2015年09月

2015年08月

2015年07月

2015年06月

2015年05月

2015年04月

2015年03月

2015年02月

2015年01月

2014年12月

2014年11月

2014年10月

2014年09月

2014年08月

2014年07月

2014年06月

2014年05月

2014年04月

2014年03月

2014年02月

2014年01月

2013年12月

2013年11月

2013年10月

2013年09月

2013年08月

2013年07月

2013年06月

2013年05月

2013年04月

2013年03月

2013年02月

2013年01月

2012年12月

2012年11月

2012年10月

2012年09月

2012年08月

2012年07月

2012年06月

2012年05月

2012年04月

2012年03月

2012年02月

2012年01月

2011年12月

2011年11月

2011年10月

2011年09月

2011年07月

2011年06月

2011年05月

2011年04月

2011年03月

2011年02月

2011年01月

2010年09月

2010年08月

2010年07月

カレンダー

2012年8月
日	月	火	水	木	金	土
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

前月

次月

goo blog お知らせ

	【PR】プロ直伝・dポイントをザクザクためる術
	【PR】安い＆大量の「訳あり商品」がヤバい!
	【コメント募集中】「食パン」に何をつけて食べますか？
	訪問者数に応じてdポイント最大1,000pt当たる！
	dポイントが当たる！無料『毎日くじ』