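Each column holds binary 0/1 data. The task: count the number of 1s in every column, separately for each level of a grouping variable.
Generating and saving the data
Sample size n = 20000, number of columns m = 20; the grouping variable (gender) is binary.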
using RCall
R"""
set.seed(123)
n = 20000
m = 20
# as.data.frame() names the columns V1..V20 by default
df = as.data.frame(matrix(sample(0:1, n*m, replace=TRUE), n, m))
df$ID = 1:n
df$gender = factor(sample(c("male", "female"), n, replace=TRUE),
levels=c("male", "female"))
write.csv(df, "multianswer.csv", row.names=FALSE)
"""
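The tidyverse supposedly detests notation like df[*, *], so in the style below,
tallying 20 variables means writing 20 lines (with 1000 variables, you would write a program to write the program).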
R"""
library(tidyverse)  # attaches dplyr and tidyr, so they need not be loaded separately
df = read_csv("multianswer.csv", show_col_types = FALSE)
system.time({
cross <- df %>%
group_by(gender) %>%
summarise(
V1 = sum(V1, na.rm = TRUE),
V2 = sum(V2, na.rm = TRUE),
V3 = sum(V3, na.rm = TRUE),
V4 = sum(V4, na.rm = TRUE),
V5 = sum(V5, na.rm = TRUE),
V6 = sum(V6, na.rm = TRUE),
V7 = sum(V7, na.rm = TRUE),
V8 = sum(V8, na.rm = TRUE),
V9 = sum(V9, na.rm = TRUE),
V10 = sum(V10, na.rm = TRUE),
V11 = sum(V11, na.rm = TRUE),
V12 = sum(V12, na.rm = TRUE),
V13 = sum(V13, na.rm = TRUE),
V14 = sum(V14, na.rm = TRUE),
V15 = sum(V15, na.rm = TRUE),
V16 = sum(V16, na.rm = TRUE),
V17 = sum(V17, na.rm = TRUE),
V18 = sum(V18, na.rm = TRUE),
V19 = sum(V19, na.rm = TRUE),
V20 = sum(V20, na.rm = TRUE))
})
# cross
"""
RObject{VecSxp}
# A tibble: 2 x 21
gender V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 female 4888 4954 4894 4940 4931 4964 4915 4974 4952 4978 5012 5017
2 male 5054 5042 5079 5065 4992 5028 5105 5021 5095 5011 5084 5109
# ... with 8 more variables: V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>,
#   V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>
RObject{RealSxp}
user system elapsed
0.004 0.001 0.005
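As an aside, the repetition in the block above is not strictly forced by dplyr. Assuming dplyr 1.0.0 or later, `across()` can apply the same summary to a whole set of columns. The sketch below uses a small made-up data frame (n = 100, m = 5, seed 1) rather than multianswer.csv; those values are assumptions for illustration only, and no timing is claimed.

```r
library(dplyr)

# Small stand-in data with the same layout as the article's file
# (columns V1..Vm plus gender). Sizes and seed are hypothetical.
set.seed(1)
n <- 100
m <- 5
df <- as.data.frame(matrix(sample(0:1, n * m, replace = TRUE), n, m))
df$gender <- sample(c("male", "female"), n, replace = TRUE)

# One across() call stands in for the m hand-written sum() lines
cross <- df %>%
  group_by(gender) %>%
  summarise(across(all_of(paste0("V", 1:m)), ~ sum(.x, na.rm = TRUE)))

print(cross)
```

Changing `m` is enough to cover more columns, since the column names are built with `paste0("V", 1:m)`.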
Written in base R, it is extremely simple. Even with tens of thousands of variables, it is no trouble at all.
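R"""
df = read.csv("multianswer.csv")
m = 20
system.time({
  i = m + 2   # column index of gender (V1..V20, ID, gender)
  j = 1:m     # column indices of the 0/1 variables
  # Each 2x2 table (gender x value) is flattened to length 4;
  # rows 3:4 of the result are the counts of 1s for female and male.
  ans <- sapply(j, function(k) table(df[, i], df[, k]))[3:4, ]
})
# ans
"""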
RObject{IntSxp}
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 4888 4954 4894 4940 4931 4964 4915 4974 4952 4978 5012 5017 4901 5000
[2,] 5054 5042 5079 5065 4992 5028 5105 5021 5095 5011 5084 5109 5008 4927
[,15] [,16] [,17] [,18] [,19] [,20]
[1,] 4856 4971 4936 4920 4987 4933
[2,] 5086 4991 5068 4986 4982 5095
RObject{RealSxp}
user system elapsed
0.027 0.001 0.028
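The `[3:4, ]` in the base-R code relies on how a 2x2 table is flattened: `sapply` stores each table column by column, so the four entries come out as (female, 0), (male, 0), (female, 1), (male, 1). A minimal sketch on a tiny made-up vector (not the article's data) shows the ordering, together with `rowsum()`, which gives the per-group sums directly.

```r
# Tiny hypothetical data, for illustration only
g <- c("female", "male", "female", "male", "female")
x <- c(1, 0, 1, 1, 0)

tab <- table(g, x)        # rows = group, columns = value 0 / 1
flat <- as.vector(tab)    # column-major flattening of the 2x2 table
names(flat) <- c("female_0", "male_0", "female_1", "male_1")
print(flat)

# Elements 3:4 are the counts of 1s per group
print(flat[3:4])

# rowsum() sums x within each level of g without building a table
print(rowsum(x, g))
```

Because the data are 0/1, summing within a group and counting the 1s are the same thing, which is why `rowsum()` agrees with the table-based counts here.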
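Well, Julia is about 25 times faster than the base-R version (roughly 0.001 s versus 0.028 s elapsed). That said, everything finishes in well under a second, so there is little point in making it a race.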
using CSV, DataFrames
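function multianswer(df, m)
    gdf = groupby(df, :gender)
    # With this data, group 1 is "male" and group 2 is "female";
    # female is stacked first so the rows match the R output.
    vcat(sum(Matrix(gdf[2][!, 1:m]), dims=1), sum(Matrix(gdf[1][!, 1:m]), dims=1))
end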
df = CSV.read("multianswer.csv", DataFrame);
@time multianswer(df, 20)
0.001102 seconds (413 allocations: 3.527 MiB)
2×20 Matrix{Int64}:
4888 4954 4894 4940 4931 4964 … 4856 4971 4936 4920 4987 4933
5054 5042 5079 5065 4992 5028 5086 4991 5068 4986 4982 5095