Blog posts for April 2019 - 裏 RjpWiki

Jitter (3)

April 30, 2019 | Blogramming

When the raw data are overlaid on a ggplot boxplot, jitter gets used, and the result is a mess.
http://rpubs.com/msonnabaum/ebs_piops
"EBS provisioned IOPS" is one such example.

library(ggplot2)
library(reshape)

df.ebs <- read.csv("http://dl.dropbox.com/u/361076/raw_ebs_stats.csv")
df.ebs.melted <- melt(df.ebs, id = c("bricks", "server_type", "instance_id"),
    variable_name = "op", na.rm = TRUE)
df.ebs.melted$test <- "ebs"

df.piops <- read.csv("http://dl.dropbox.com/u/361076/raw_piops_stats.csv")
df.piops.melted <- melt(df.piops, id = c("bricks", "server_type", "instance_id"),
    variable_name = "op", na.rm = TRUE)
df.piops.melted$test <- "piops"

df <- rbind(df.ebs.melted, df.piops.melted)

ggplot(df, aes(x = factor(bricks), y = value, colour = test)) + geom_boxplot() +
    geom_jitter(alpha = 0.3) + facet_grid(~op)

It can only be called a disaster: a complete jumble.

So how do we turn this into a decent graph?

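df2 <- split(df, df$op)    # one data frame per operation, one panel each
x <- c(1,2,3,3,5,4,7,5)    # lookup table indexed by bricks value: x position for the tick marks
col <- c("#88000020", "#00888820")    # translucent colors for ebs and piops
layout(matrix(1:2, 1))    # two panels side by side
par(mar=c(2, 2.5, 1, 1), mgp=c(1.6, 0.6, 0))
aa <- lapply(df2, function(d2) {
   df3 <- split(d2, d2$bricks)
   # ebs boxes, shifted to the left of each axis position
   boxplot(value ~ bricks, data=d2, ylim=c(0, 550), boxwex = 0.25, cex=0.5,
            subset=test=="ebs", at=1:5-0.2, col=col[1], main=d2$op[1])
   legend("top", c("ebs", "piops"), pch=19, col=col, bty="n")
   # every ebs observation as a short horizontal tick
   sapply(df3, function(d3) {
       v <- d3$value[d3$test=="ebs"]
       segments(x[d3$bricks]-0.05, v, x[d3$bricks]+0.02, v, col=col[1])
       })
   # piops boxes, shifted to the right and added to the same panel
   boxplot(value ~ bricks, data=d2, ylim=c(0, 550), boxwex = 0.25, cex=0.5,
            subset=test=="piops", at=1:5+0.2, col=col[2], main=d2$op[1],
            xaxt="n", add=TRUE)
   # every piops observation as a short horizontal tick
   sapply(df3, function(d3) {
       v <- d3$value[d3$test=="piops"]
       segments(x[d3$bricks]+0.35, v, x[d3$bricks]+0.42, v, col=col[2])
       })
})
layout(1)    # restore the single-panel layout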
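The code above depends on the two remote CSV files. As a minimal sketch of the same idea (a box per group, with every observation drawn as a short translucent tick beside it instead of a jittered point), the following uses simulated data. The data frame `sim`, the group names "A" and "B", and all the numbers are hypothetical stand-ins, not the EBS/PIOPS measurements.

```r
# Simulated data; a hypothetical stand-in, not the EBS/PIOPS measurements
set.seed(42)
sim <- data.frame(
    group = rep(c("A", "B"), each = 30),
    value = c(rnorm(30, mean = 200, sd = 40), rnorm(30, mean = 300, sd = 60))
)
col <- c("#88000040", "#00888840")    # last two hex digits are the alpha channel
groups <- c("A", "B")

# Boxes shifted slightly to the left to leave room for the tick marks
boxplot(value ~ group, data = sim, boxwex = 0.25, at = 1:2 - 0.1,
        xlim = c(0.5, 2.5), col = col)

# One short horizontal tick per observation, to the right of its box
for (i in seq_along(groups)) {
    v <- sim$value[sim$group == groups[i]]
    segments(i + 0.1, v, i + 0.2, v, col = col[i])
}
```

Because the ticks are translucent, places where many observations share nearly the same value show up darker, and no random displacement is involved.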

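This produces a graph like the following.

When people say "you can draw pretty graphs easily", "pretty" is the subjective view of whoever wrote the program (package). The same goes for "easily".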


Jitter (2)

April 30, 2019 | Blogramming

http://rpubs.com/wch/1270

It is labeled "Test document", so it is presumably experimental, but the content is not good.

dat
   x       y  g
1  a 0.37620  1
2  b 0.46821  1
3  a 0.73924  2
4  b 0.43202  2
5  a 0.34295  3
6  b 0.24804  3
7  a 0.20613  4
8  b 0.28183  4
9  a 0.46331  5
10 b 0.02378  5
11 a 0.09151  6
12 b 0.61472  6
13 a 0.22766  7
14 b 0.37150  7
15 a 0.34978  8
16 b 0.91942  8
17 a 0.87007  9
18 b 0.09859  9
19 a 0.76906 10
20 b 0.25969 10

The document draws data like this without jitter (Figure 1) and with two different kinds of jitter (Figures 2 and 3).

library(ggplot2)
ggplot(dat, aes(x = x, y = y, group = g)) + geom_point() + geom_line()
ggplot(dat, aes(x = x, y = y, group = g)) +
    geom_point(position = position_jitter(width = 0.1)) +
    geom_line(position = position_jitter(width = 0.1))
dat2 <- dat
dat2$x <- as.numeric(dat2$x)
dat2$x <- jitter(dat2$x)
ggplot(dat2, aes(x = x, y = y, group = g)) + geom_point() + geom_line() +
    scale_x_continuous(breaks = c(1, 2), labels = c("a", "b"))

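Jitter is better avoided. In a figure like this in particular, the x-direction jitter differs between the two endpoints of each line, so the slope of the line connecting them contains error (it becomes arbitrary). Figure 3 is especially bad.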
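The effect on the slope can be checked with a few lines of arithmetic. This is only a sketch: it takes the first pair (g = 1) from the data above and assumes a uniform horizontal displacement of up to ±0.1 per endpoint, which is meant to mimic `width = 0.1`. The seed and the variable names are arbitrary choices made for this illustration.

```r
# First pair (g = 1) from the data above: a -> 0.37620, b -> 0.46821
x <- c(1, 2)                      # a and b as positions 1 and 2
y <- c(0.37620, 0.46821)
slope_true <- diff(y) / diff(x)   # slope of the segment without jitter

set.seed(1)                       # arbitrary seed, for reproducibility only
xj <- x + runif(2, -0.1, 0.1)     # independent horizontal jitter per endpoint
slope_jittered <- diff(y) / diff(xj)

c(true = slope_true, jittered = slope_jittered)
```

The y values are identical in both cases, so any difference between the two slopes comes purely from the random displacement in x.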

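Even without jitter, once there are more data the line segments overlap and the figure becomes unreadable. It also becomes hard to tell from the figure whether more values increased or decreased between before and after. In other words, even though a figure has been drawn, it is hard to extract information from it.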

Figure 1

Figure 2

Figure 3

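So what should be done instead? Draw a figure like the one below, with the before measurement on the horizontal axis and the after measurement on the vertical axis.
A negative correlation coefficient (a negative slope of the regression line) means that larger before values tend to go with smaller after values. Whether a value rose or fell is read from the position of its point relative to the line y = x: points below that line decreased. In this data the mean of after is smaller than the mean of before.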

d
    before   after    diff.
1  0.37620 0.46821  0.09201
2  0.73924 0.43202 -0.30722
3  0.34295 0.24804 -0.09491
4  0.20613 0.28183  0.07570
5  0.46331 0.02378 -0.43953
6  0.09151 0.61472  0.52321
7  0.22766 0.37150  0.14384
8  0.34978 0.91942  0.56964
9  0.87007 0.09859 -0.77148
10 0.76906 0.25969 -0.50937

> colMeans(d)
   before     after     diff.
 0.443591  0.371780 -0.071811

d <- data.frame(t(matrix(dat$y, 2)))
d[,3] <- d[,2]-d[,1]
colnames(d) <- c("before", "after", "diff.")
plot(d[,1:2], pch=19, asp=1, xlim=c(0, 1), ylim=c(0, 1))
abline(lm(d[,2]~d[,1]), col="red")
text(0.8, 0.8, paste("r =", round(cor(d[,2], d[,1]), 3)))
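To read increases and decreases directly from such a plot, the diagonal y = x is a useful reference: points above it increased, points below it decreased. The sketch below is an addition to the original code, not part of it. It retypes the before and after values from the table above so that it runs on its own, and the names `n_up`, `n_down` and `mean_change` are introduced here for illustration.

```r
# before/after values copied from the table above
before <- c(0.37620, 0.73924, 0.34295, 0.20613, 0.46331,
            0.09151, 0.22766, 0.34978, 0.87007, 0.76906)
after  <- c(0.46821, 0.43202, 0.24804, 0.28183, 0.02378,
            0.61472, 0.37150, 0.91942, 0.09859, 0.25969)

plot(before, after, pch = 19, asp = 1, xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1, lty = 2)    # y = x: above the line = increased, below = decreased

n_up   <- sum(after > before)       # number of pairs that increased
n_down <- sum(after < before)       # number of pairs that decreased
mean_change <- mean(after - before) # mean of after minus before
c(up = n_up, down = n_down, mean_change = mean_change)
```

Counting the points on each side of the diagonal answers the "more increases or more decreases?" question that the line plot could not, and it keeps working as the number of pairs grows.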


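Jitter

April 30, 2019 | Blogramming

The original post was deleted because it attracted too many spam comments.

In a scatter plot of variables that take only a limited set of values, such as integers, overlapping points cannot be distinguished. Jitter is often presented as the remedy, but depending on how it is used it can mislead. A better approach is to draw the points in a translucent color, so that the depth of the color shows how many points overlap. The last two digits of the 8-digit hexadecimal color are the alpha channel.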
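The code from the original post is not reproduced here, so the following is only a minimal sketch of the approach under stated assumptions: the integer-valued data are simulated, and the color `#00008820`, the point size and the seed are arbitrary choices made for this illustration.

```r
set.seed(123)                            # arbitrary seed
x <- sample(1:5, 300, replace = TRUE)    # simulated integer-valued data
y <- sample(1:5, 300, replace = TRUE)

col <- "#00008820"    # RRGGBBAA: dark blue with alpha = 0x20 (32 out of 255)

# Every point sits exactly on its grid position; darker spots mean more overlap
plot(x, y, pch = 19, cex = 3, col = col)

# Extract the alpha component to confirm how the 8-digit color is read
alpha_value <- col2rgb(col, alpha = TRUE)["alpha", 1]
alpha_value
```

No point is displaced, so nothing about the data is distorted; the number of coinciding observations is encoded only in how dark each spot appears.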

