optimization - Calculating correlations efficiently in R?
This is an optimization problem that I'm hoping some creative users may have an answer to.
I have a large matrix (5 million x 2) with 2 columns: time and type. In essence, each "type" is its own time series. The data below represents 3 different time series (one for a, one for b, and one for c). There are 2000 different "types".
    mat
         time type
    [1,]   50    a
    [2,]   50    a
    [3,]   12    b
    [4,]   24    b
    [5,]   80    b
    [6,]   92    b
    [7,]   43    c
    [8,]   69    c
What is an efficient way for me to find the correlation between these 2000 time series? I am currently producing a matrix with a separate bin for each time at which an event could have occurred, and I populate it with how many events of each "type" occurred in each time slot. After populating that matrix, I loop over each pair of "type"s and find the correlations. This is extremely inefficient (~5 hours).
My whole problem would be solved if there were a way to implement a by='type' feature in the cor function of R.
Thanks for any insight.
You can try this:
    set.seed(1)
    df <- data.frame(time = rnorm(15), type = rep(c("a", "b", "c"), each = 5))
    cor(do.call(cbind, split(df$time, df$type)))
             a        b        c
    a  1.00000  0.27890 -0.61497
    b  0.27890  1.00000 -0.78641
    c -0.61497 -0.78641  1.00000
This approach assumes that the number of observations per type is balanced, so that every column produced by split has the same length.
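If the types are not balanced, one option is to keep the binning idea from the question but drop the pairwise loop: build the bins-by-types count matrix in one step and pass it to cor once. The following is only a minimal sketch under stated assumptions. The sample data, the variable names, and the choice of 100 equal-width bins (`n_bins`) are illustrative and not taken from the question.

```r
# Sketch: event counts per time bin and type, then all pairwise correlations.
# Assumption: 100 equal-width bins; n_bins is a hypothetical tuning choice.
set.seed(1)
df <- data.frame(
  time = runif(1000, min = 0, max = 100),
  type = sample(c("a", "b", "c"), size = 1000, replace = TRUE)  # unequal counts per type
)

n_bins <- 100
breaks <- seq(min(df$time), max(df$time), length.out = n_bins + 1)
bins <- cut(df$time, breaks = breaks, include.lowest = TRUE)

# Rows are time bins, columns are types, entries are event counts.
counts <- table(bins, df$type)

# Drop the "table" class to get a plain integer matrix,
# then let cor() correlate every pair of columns in one call.
count_mat <- unclass(counts)
cor_mat <- cor(count_mat)
```

Because every type gets a count for every bin, the columns always have equal length, regardless of how many events each type has.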
Now, we can run a real test with 5 million rows and 2000 different types:
    set.seed(1)
    df <- data.frame(time = rnorm(5e6), type = sample(rep(1:2000, each = 2500)))
    system.time(cor(do.call(cbind, split(df$time, df$type))))
    ##    user  system elapsed
    ##   6.387   0.000   6.391
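The speed comes from the fact that cor, when given a matrix, returns the correlation of every pair of columns, so no explicit loop over pairs of types is needed. A small check on the toy data, with variable names chosen here purely for illustration, might look like this:

```r
# Sketch: confirm that one entry of the correlation matrix matches
# the correlation of the same two columns computed on their own.
set.seed(1)
df <- data.frame(time = rnorm(15), type = rep(c("a", "b", "c"), each = 5))

# One column per type (5 rows x 3 columns).
series <- do.call(cbind, split(df$time, df$type))
cor_mat <- cor(series)

# Pairwise call for types "a" and "b" only.
pair_ab <- cor(series[, "a"], series[, "b"])

# TRUE if both routes give the same value.
same <- isTRUE(all.equal(cor_mat["a", "b"], pair_ab))
```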