Optimization: calculating correlations efficiently in R?


This is an optimization problem that I'm hoping some creative users may have an answer to.

I have a large matrix (5 million x 2) with two values: time and type. In essence, each "type" is its own time series; the data below represent 3 different time series (one for a, one for b, and one for c). There are 2000 different "types".

    mat
         time type
    [1,]   50    a
    [2,]   50    a
    [3,]   12    b
    [4,]   24    b
    [5,]   80    b
    [6,]   92    b
    [7,]   43    c
    [8,]   69    c

What would be an efficient way for me to find the correlations between these 2000 time series? I am currently producing a matrix with a separate bin for each time slot in which an event could have occurred, and populating that matrix with how many events of each "type" occurred in each time slot. After populating the matrix, I loop over each pair of "type"s and find the correlations. This is extremely inefficient (~5 hours).
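To make the binning idea above concrete, here is a minimal sketch on the toy data. The bin width of 25 time units and the names `breaks`, `bins`, `counts`, and `cors` are assumptions for illustration, not part of the original setup. It relies on the fact that `cor()` applied to a matrix returns every pairwise column correlation in one call, so an explicit loop over pairs of types may not be needed.

```r
# Toy data mirroring the example above (types a, b, c).
mat <- data.frame(
  time = c(50, 50, 12, 24, 80, 92, 43, 69),
  type = c("a", "a", "b", "b", "b", "b", "c", "c")
)

# Assumed bin width of 25 time units; adjust to the resolution you need.
breaks <- seq(0, 100, by = 25)
bins <- cut(mat$time, breaks = breaks, include.lowest = TRUE)

# Rows are time bins, columns are types, cells are event counts.
counts <- table(bins, mat$type)

# cor() on a matrix computes all pairwise column correlations at once.
cors <- cor(unclass(counts))
```

With 2000 types the count matrix would have 2000 columns and `cors` would be 2000 x 2000; whether that fits comfortably in memory depends on how many bins are used.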

My whole problem would be solved if there were a way to implement a by='type' feature in the cor function of R.

Thanks for any insight.

You can try this:

    set.seed(1)
    df <- data.frame(time = rnorm(15), type = rep(c("a", "b", "c"), each = 5))

    cor(do.call(cbind, split(df$time, df$type)))
    ##          a        b        c
    ## a  1.00000  0.27890 -0.61497
    ## b  0.27890  1.00000 -0.78641
    ## c -0.61497 -0.78641  1.00000

This approach assumes that the number of observations per type is balanced, i.e. every type has the same number of rows.

Now we can run a real test with 5 million rows and 2000 different types:
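Because of that balance assumption, it may be worth checking the group sizes before calling `cbind()`. The sketch below uses made-up unbalanced data, and the names `series`, `n_per_type`, and `balanced` are illustrative only. It guards against the case where `cbind()` would recycle the shorter vectors and produce misleading correlations.

```r
set.seed(1)
# Hypothetical unbalanced data: 5 observations for "a", 3 for "b", 4 for "c".
df <- data.frame(
  time = rnorm(12),
  type = rep(c("a", "b", "c"), times = c(5, 3, 4))
)

series <- split(df$time, df$type)
n_per_type <- lengths(series)

# TRUE only when every type has the same number of observations.
balanced <- length(unique(n_per_type)) == 1

if (balanced) {
  cors <- cor(do.call(cbind, series))
} else {
  # cbind() would recycle the shorter vectors here, so report the sizes instead.
  message("Unbalanced types: ",
          paste(names(n_per_type), n_per_type, sep = "=", collapse = ", "))
}
```

If the types turn out to be unbalanced, binning the events into a common set of time slots first gives every type a column of the same length.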

    set.seed(1)
    df <- data.frame(time = rnorm(5e6), type = sample(rep(1:2000, each = 2500)))
    system.time(cor(do.call(cbind, split(df$time, df$type))))
    ##    user  system elapsed
    ##   6.387   0.000   6.391
