variables - multicollinearity test in R
I'm working with a large data set and suspect I have multicollinearity issues, because the variance-covariance matrix has a negative eigenvalue (small compared with the rest) and the ratio max eigenvalue / min eigenvalue is > 3000.
My question is: is there a test or routine in R to identify which variables are redundant? (I don't work with regression models.) I could fit linear regressions on pairs of variables, draw graphs, or use the pairs(data) command, but I would appreciate numerical tests, because I have 200 variables and graphs aren't a decision support in this matter.
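The eigenvalue check described in the question can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the data frame `X` is made up (with `x3` deliberately built as a near-copy of `x1 + x2`), the 0.3 loading cutoff is arbitrary, and reading the eigenvector of the smallest eigenvalue to see which variables are involved is a heuristic, not a formal test.

```r
# Hypothetical data: x3 is almost exactly x1 + x2, x4 is unrelated noise
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
X  <- data.frame(x1 = x1,
                 x2 = x2,
                 x3 = x1 + x2 + rnorm(n, sd = 0.01),
                 x4 = rnorm(n))

# Eigen-decomposition of the covariance matrix
# (eigen() returns eigenvalues in decreasing order)
ev <- eigen(cov(X), symmetric = TRUE)

# Ratio of largest to smallest eigenvalue, as in the question
ratio <- max(ev$values) / min(ev$values)

# The eigenvector of the smallest eigenvalue describes the near-linear
# dependency; variables with large absolute loadings take part in it
v <- ev$vectors[, ncol(X)]
names(v) <- colnames(X)
involved <- names(v)[abs(v) > 0.3]   # 0.3 is an arbitrary cutoff

print(ratio)
print(round(v, 3))
print(involved)
```

With 200 variables the same idea applies: look at the eigenvectors that belong to the few smallest eigenvalues rather than at pairwise plots, since a dependency such as `x3 ≈ x1 + x2` may not show up as a single large pairwise correlation.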
If I understood correctly what you are looking for:
If you have in mind a correlation threshold that you want to use to exclude variables, you can try the following.
In the example here I'm generating a random matrix:
> set.seed(3)
> data <- data.frame(v1=rnorm(20),v2=rnorm(20),v3=rnorm(20),v4=rnorm(20),v5=rnorm(20))
> cor.mat <- cor(data)
> diag(cor.mat)=0
This is the correlation matrix of the variables v1, v2, v3, v4, v5 (with the diagonal set to 0 so that a variable is never matched with itself):
> cor.mat
            v1          v2         v3         v4         v5
v1  0.00000000 -0.14464568 0.09047839 -0.1200863 -0.1110384
v2 -0.14464568  0.00000000 0.04340839  0.1929009 -0.4354569
v3  0.09047839  0.04340839 0.00000000  0.1185795  0.1760463
v4 -0.12008631  0.19290090 0.11857953  0.0000000 -0.2080077
v5 -0.11103839 -0.43545694 0.17604633 -0.2080077  0.0000000
Now substitute, in the if statement of the following loop, the threshold value you want to use to select redundant variables (here I use .4, not because it indicates redundancy, but because it is the highest absolute value that came out of the random matrix).
> high_cor = vector()
> for (i in 1:nrow(cor.mat)){
+   for (j in 1:ncol(cor.mat)){
+     if (abs(cor.mat[i,j]) >= 0.4) {high_cor[i]=paste(rownames(cor.mat)[i], "-",
+                                                      colnames(cor.mat)[j])}
+   }
+ }
> high_cor <- high_cor[!is.na(high_cor)]
In this case the variables whose correlation exceeds .4 in absolute value are v2 and v5:
> high_cor
[1] "v2 - v5" "v5 - v2"
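The loop reports each pair twice (once per direction) and keeps only one partner per row. With 200 variables, a vectorized variant that lists every pair once may be easier to work with. The following is a sketch, not the only way to do it: the names `threshold`, `idx` and `high_pairs` are introduced here for illustration, and the data are regenerated so the block runs on its own.

```r
# Same example data as above
set.seed(3)
data <- data.frame(v1 = rnorm(20), v2 = rnorm(20), v3 = rnorm(20),
                   v4 = rnorm(20), v5 = rnorm(20))
cor.mat <- cor(data)

threshold <- 0.4   # replace with the cutoff you want to use

# Look only above the diagonal, so each pair is reported once
# and the diagonal does not need to be zeroed
idx <- which(abs(cor.mat) >= threshold & upper.tri(cor.mat), arr.ind = TRUE)

# One row per flagged pair, with the correlation that triggered it
high_pairs <- data.frame(
  var1 = rownames(cor.mat)[idx[, "row"]],
  var2 = colnames(cor.mat)[idx[, "col"]],
  r    = cor.mat[idx]
)

print(high_pairs)
```

Because the result is a data frame, it can be sorted by `abs(r)` or filtered further, which helps when many pairs exceed the threshold.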
Hope this helps.