Speed comparison of R, Python, and Julia (2021/12)
Test environment
macOS Monterey version 12.1
Mac mini (M1, 2020)
Chip: Apple M1
Memory: 8 GB
R (native Apple M1 build)
version 4.2.0
platform aarch64-apple-darwin21.1.0
arch aarch64
os darwin21.1.0
system aarch64, darwin21.1.0
Python (native Apple M1 build)
Python 3.10.0
Julia (native Apple M1 build)
Version 1.7.0 (2021-11-30)
Data used
> n = 100000
> m = 1500
> x = data.frame(round(matrix(rnorm(n*m, 50, 10), n, m), 2))
> system.time({write.csv(x, file="test.csv", row.names=FALSE)})
user system elapsed
59.049 1.063 60.135
Reading the file as a data frame
R
> system.time({df = read.csv("test.csv", colClasses="numeric")})
user system elapsed
12.105 0.906 13.627
Incidentally, fread from the data.table package is even faster.
> library(data.table)
> system.time({df = fread("test.csv")})
user system elapsed
1.723 0.258 3.058
Note, however, that fread returns a data.table, which behaves slightly differently from a data.frame.
Everything below uses the data.frame read with read.csv.
Python
>>> from time import time
>>> import pandas as pd
>>> start = time(); df = pd.read_csv('test.csv'); print(time() - start)
6.904027938842773
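The Python timings in this post are single wall-clock measurements taken with time.time(). As a minimal sketch of a slightly more robust alternative, the helper below (time_call is a hypothetical name, not part of any library) uses time.perf_counter and keeps the best of a few repeats. The workload is a cheap stand-in, not the benchmark itself.

```python
from time import perf_counter


def time_call(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times.

    Returns (best_elapsed_seconds, result_of_last_call).
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = perf_counter()
        result = func(*args, **kwargs)
        best = min(best, perf_counter() - start)
    return best, result


# Stand-in workload; replace with e.g. pd.read_csv, "test.csv" for the real case.
elapsed, total = time_call(sum, range(1_000_000))
print(f"best of 3: {elapsed:.6f} s")
```

For a 6-second read like the one above, a single run is usually representative; repeats matter more for the millisecond-scale measurements later on.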
Julia
using CSV, DataFrames
@time df = CSV.read("test.csv", DataFrame); # 4.313486 seconds
@time df = CSV.read("test.csv", DataFrame, type=Float64); # 3.268237 seconds
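The R and Julia runs above declare the column type up front (colClasses="numeric", type=Float64), while the pandas run lets read_csv infer it. A sketch of the pandas counterpart, using the dtype argument on a tiny in-memory stand-in for test.csv (the values are made up; only the X1, X2, ... header style follows the real file). Whether this changes the read time on the full file was not measured here.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for test.csv (hypothetical values).
csv_text = "X1,X2,X3\n50.12,48.3,61.0\n39.9,55.55,47.21\n"

# dtype=np.float64 declares every column as float64 instead of
# leaving the column types to be inferred.
df_small = pd.read_csv(io.StringIO(csv_text), dtype=np.float64)
print(df_small.dtypes)
```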
Univariate functions
Sum
> system.time({s = colSums(df)})
user system elapsed
0.368 0.318 0.743
>>> start = time(); s = df.sum(); print(time() - start)
2.547268867492676
@time s = sum.(eachcol(df)); # 0.044548 seconds
Arithmetic mean
> system.time({m = colMeans(df)})
user system elapsed
0.353 0.084 0.436
>>> start = time(); m = df.mean(); print(time() - start)
0.2523488998413086
using Statistics
@time m = mean.(eachcol(df)); # 0.041935 seconds
Unbiased variance
> system.time({v = sapply(df, var)})
user system elapsed
0.487 0.002 0.491
>>> start = time(); v = df.var(ddof=1); print(time() - start)
0.6684608459472656
@time v = var.(eachcol(df)); # 0.080120 seconds
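All three calls above compute the unbiased variance, which divides by n - 1. The ddof=1 in the pandas call is already the pandas default; it matters if the same computation is moved to NumPy, whose default is ddof=0. A small worked example with made-up numbers:

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0])  # mean 2.5, squared deviations sum to 5.0

var_unbiased = x.var(ddof=1)      # 5.0 / (4 - 1)
var_population = x.var()          # NumPy default ddof=0: 5.0 / 4
var_pandas = pd.Series(x).var()   # pandas default is ddof=1

print(var_unbiased, var_population, var_pandas)
```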
Standard deviation
> system.time({sd = sapply(df, sd)})
user system elapsed
0.496 0.003 0.497
>>> start = time(); sd = df.std(ddof=1); print(time() - start)
0.7022347450256348
@time sd = std.(eachcol(df)); # 0.076600 seconds
Median
median() has to order the data (typically by sorting), so it takes longer than the functions above.
> system.time({sapply(df, median)})
user system elapsed
2.086 0.364 2.691
>>> start = time(); s = df.median(); print(time() - start)
1.9600980281829834
@time med = median.(eachcol(df)); # 1.667501 seconds
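To make the sorting cost concrete, here is a minimal sketch of a sort-based median, checked against np.median. The function name median_by_sort is hypothetical, and this is not a claim about how R, pandas, or Julia implement the median internally.

```python
import numpy as np


def median_by_sort(values):
    """Median via a full sort: middle element, or mean of the two middle ones."""
    s = np.sort(np.asarray(values, dtype=float))
    n = s.size
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2.0


rng = np.random.default_rng(0)
col_odd = np.round(rng.normal(50, 10, size=1001), 2)   # same shape of data as above
col_even = col_odd[:1000]

print(median_by_sort(col_odd), np.median(col_odd))
```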
Bivariate functions
Pearson product-moment correlation coefficient
> system.time({cor(df[,1], df[,2])})
user system elapsed
0.005 0.001 0.005
>>> import numpy as np
>>> start = time(); r = np.corrcoef(df['X1'], df['X2'])[0,1]; print(time() - start)
0.0030248165130615234
>>> from scipy.stats import pearsonr
>>> start = time(); r = pearsonr(df['X1'], df['X2'])[0]; print(time() - start)
0.0011849403381347656
@time cor(df.X1, df.X2) # 0.000154 seconds
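For reference, the quantity being timed is the sum of cross products of the centered vectors divided by the square root of the product of their sums of squares. A sketch (pearson_r is a hypothetical helper name) checked against np.corrcoef on random stand-in data:

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation from centered cross products."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy))


rng = np.random.default_rng(0)
a = rng.normal(50, 10, size=1000)
b = 0.3 * a + rng.normal(50, 10, size=1000)

print(pearson_r(a, b), np.corrcoef(a, b)[0, 1])
```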
Multivariate functions
Pearson product-moment correlation matrix
> system.time({r = cor(df)})
user system elapsed
106.154 0.377 106.509
>>> start = time(); r = df.corr(); print(time() - start)
377.98001408576965
@time r = cor(Matrix(df)); # 3.985498 seconds
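When the data contain no missing values, the whole Pearson matrix reduces to a single matrix product of the standardized columns. The sketch below (corr_matrix is a hypothetical name) checks this against np.corrcoef on small random data. Whether this accounts for any of the timing gaps above was not verified.

```python
import numpy as np


def corr_matrix(X):
    """Pearson correlation matrix of the columns of X (no missing values assumed)."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)          # center each column (new array)
    Z /= Z.std(axis=0, ddof=1)      # scale each column to unit variance
    return (Z.T @ Z) / (X.shape[0] - 1)


rng = np.random.default_rng(0)
X_small = np.round(rng.normal(50, 10, size=(1000, 6)), 2)

R_product = corr_matrix(X_small)
R_numpy = np.corrcoef(X_small, rowvar=False)
print(np.abs(R_product - R_numpy).max())
```

With a pandas DataFrame, the same NumPy route would be np.corrcoef(df.to_numpy(), rowvar=False); it was not timed in this comparison.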
Spearman rank correlation matrix
> system.time({rs = cor(df[1:200], method="spearman")})
user system elapsed
4.698 0.152 4.858
>>> start = time(); rs = df.iloc[:, 0:200].corr(method='spearman'); print(time() - start)
23.894737243652344
using StatsBase
@time rs = corspearman(Matrix(df[:, 1:200])); # 217.821656 seconds
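Spearman's coefficient is the Pearson coefficient of the ranks, with tied values given their average rank. A sketch in pandas on small random stand-in data (rounded to two decimals so that ties occur, as in the real file):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
small = pd.DataFrame(
    np.round(rng.normal(50, 10, size=(500, 4)), 2),
    columns=["X1", "X2", "X3", "X4"],
)

# Rank each column (pandas default: average rank for ties), then Pearson.
rs_via_ranks = small.rank().corr()
rs_builtin = small.corr(method="spearman")

print((rs_via_ranks - rs_builtin).abs().to_numpy().max())
```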
Kendall rank correlation matrix
> system.time({rk = cor(df[1:5], method="kendall")})
user system elapsed
327.977 0.465 328.402
A library I found while searching:
https://rdrr.io/cran/pcaPP/man/cor.fk.html
> library(pcaPP)
> system.time({cor.fk(df[1:5])})
user system elapsed
0.106 0.002 0.109
> system.time({cor.fk(df[1:50])})
user system elapsed
10.330 0.262 10.560
>>> start = time(); rk = df.iloc[:, 0:5].corr(method='kendall'); print(time() - start)
0.2503321170806885
>>> start = time(); rk = df.iloc[:, 0:50].corr(method='kendall'); print(time() - start)
26.7141010761261
@time rk = corkendall(Matrix(df[:, 1:5])) # 0.134137 seconds
@time rk = corkendall(Matrix(df[:, 1:50])) # 11.786005 seconds
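The gap between base R's cor(method="kendall") and cor.fk is consistent with the difference between comparing every pair of observations and the O(n log n) algorithm that the cor.fk documentation describes: with n = 100000 there are n(n-1)/2, roughly 5 x 10^9, pairs for each pair of columns. The sketch below shows only the naive all-pairs definition (kendall_tau_naive is a hypothetical name), assumes no ties, and checks it against scipy.stats.kendalltau on small random data.

```python
import numpy as np
from scipy.stats import kendalltau


def kendall_tau_naive(x, y):
    """Kendall's tau by comparing all pairs.

    O(n^2) time and memory; assumes there are no ties in x or y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    # The full n x n table counts every unordered pair twice.
    concordant_minus_discordant = (sx * sy).sum() / 2.0
    return concordant_minus_discordant / (n * (n - 1) / 2.0)


rng = np.random.default_rng(2)
u = rng.normal(size=300)                 # continuous data, so no ties
w = 0.5 * u + rng.normal(size=300)

tau_naive = kendall_tau_naive(u, w)
tau_scipy = kendalltau(u, w)[0]
print(tau_naive, tau_scipy)
```

The benchmark data are rounded to two decimals and therefore contain ties, which this sketch does not handle.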
Variance-covariance matrix
> system.time({v = var(df)})
user system elapsed
106.169 0.688 106.949
>>> start = time(); v = df.cov(); print(time() - start)
3.9224328994750977
@time v = cov(Matrix(df)); # 6.487346 seconds
Multiple regression analysis
> system.time({ans = lm(X1 ~ X2+X3+X4+X5+X6+X7+X8+X9+X10, data=df)
+ summary.ans = summary(ans)
+ coeff = ans$coeff
+ R2 = summary(ans)$r.squared})
user system elapsed
0.046 0.028 0.136
>>> from sklearn.linear_model import LinearRegression
>>> start = time();
>>> clf = LinearRegression()
>>> y = df['X1']
>>> x = df[['X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10']]
>>> ans = clf.fit(x, y)
>>> coef = ans.coef_
>>> intercept = ans.intercept_
>>> print(time() - start)
0.031842947006225586
using GLM
@time begin
result = lm(@formula(X1 ~ X2+X3+X4+X5+X6+X7+X8+X9+X10), df)
coefs = coef(result);
R2 = r2(result)
end # 0.025926 seconds
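Note that the R and Julia runs above also compute R², while the scikit-learn run with nine predictors stops at the coefficients and intercept. As a sketch of what all three are computing, the code below fits ordinary least squares with an intercept using np.linalg.lstsq (a swapped-in technique, not the scikit-learn call timed above) and derives R² from the residuals. The function name ols_with_r2 and the synthetic data are made up for illustration.

```python
import numpy as np


def ols_with_r2(X, y):
    """Least-squares fit with intercept. Returns (intercept, coefficients, R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(X.shape[0]), X])   # intercept column first
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return beta[0], beta[1:], 1.0 - ss_res / ss_tot


# Synthetic data with known coefficients 1..9 (hypothetical, for checking only).
rng = np.random.default_rng(3)
X_demo = rng.normal(50, 10, size=(1000, 9))
y_demo = 2.0 + X_demo @ np.arange(1.0, 10.0) + rng.normal(size=1000)

intercept, coef, r2 = ols_with_r2(X_demo, y_demo)
print(np.round(coef, 2), round(r2, 5))
```

With scikit-learn, the matching value is clf.score(x, y), as used in the many-predictor run.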
When the number of independent variables is very large
> system.time({
+ ans = lm(X1 ~ ., data=df)
+ summary.ans = summary(ans)
+ coeff = ans$coeff
+ R2 = summary(ans)$r.squared})
user system elapsed
168.687 2.262 172.028
>>> start = time();
>>> y = df['X1']
>>> x = df.drop('X1', axis=1)
>>> ans = clf.fit(x, y)
>>> coef = ans.coef_
>>> intercept = ans.intercept_
>>> R2 = ans.score(x, y)
>>> print(time() - start)
11.533896923065186
using GLM
@time fit(LinearModel, hcat(ones(size(df, 1)), Matrix(select(df, Not(:X1)))), df.X1); # 5.237764 seconds