R - Save storage space for small integers or factors with few levels
R seems to require 4 bytes of storage per integer, even for small ones:
> object.size(rep(1L, 10000))
40040 bytes
and, even more so, for factors:
> object.size(factor(rep(1L, 10000)))
40456 bytes
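I think this could be handled much better, especially in the latter case. Is there a solution that would help me reduce the storage requirements in this case to 8 or even 2 bits per row? Perhaps a solution that uses the raw type internally for storage but otherwise behaves like a normal factor. The bit package offers this for bits, but I haven't found anything similar for factors.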
My data frame with just a few million rows is consuming gigabytes, and that's a huge waste of memory and run time (!). Compression would reduce the required disk space, but again at the expense of run time.
Since you mention raw (and assuming you have fewer than 256 factor levels), you could do the prerequisite conversion operations if memory is your bottleneck and CPU time isn't. For example:
    f = factor(rep(1L, 1e5))
    object.size(f)
    # 400456 bytes

    f.raw = as.raw(f)
    object.size(f.raw)
    # 100040 bytes

    # to go back:
    identical(as.factor(as.integer(f.raw)), f)
    # [1] TRUE
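The "fewer than 256 levels" condition matters because as.raw() turns out-of-range values and NA into 00 with only a warning. A minimal sketch of a guarded conversion follows; factor_to_raw is a hypothetical helper name, not something from the answer or from any package:

```r
# Hypothetical helper (an assumption, not part of the original answer):
# convert a factor to one-byte codes, refusing inputs that would not
# survive the round trip.
factor_to_raw <- function(f) {
  stopifnot(is.factor(f))
  if (nlevels(f) > 255L) {
    stop("more than 255 levels: codes do not fit in a raw byte")
  }
  if (anyNA(f)) {
    stop("NA values cannot be represented in a raw vector")
  }
  as.raw(as.integer(f))
}

f <- factor(sample(c("x", "y", "z"), 1e5, replace = TRUE))
r <- factor_to_raw(f)

# Approximate storage cost per element (including the vector header)
as.numeric(object.size(f)) / length(f)
as.numeric(object.size(r)) / length(r)
```

If NA values are needed, one option would be to reserve code 0 for them, which still leaves room for 255 levels.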
You can also save the factor levels separately and recover them if that's something you're interested in doing, but as far as grouping and all that goes, you can just do it all with raw and never go back to factors (except for presentation).
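As a rough illustration of keeping the levels separately, the sketch below stores the labels in one vector and the codes in a raw vector, counts group sizes from the codes, and rebuilds the factor only at the end. The variable names (lev, codes, counts, restored) are made up for this example:

```r
f <- factor(c("b", "a", "c", "a", "b", "a"), levels = c("a", "b", "c"))

lev   <- levels(f)               # labels, stored once
codes <- as.raw(as.integer(f))   # one byte per element

# Group sizes computed from the codes. tabulate() needs integers, so
# this makes a temporary integer copy while it runs.
counts <- tabulate(as.integer(codes), nbins = length(lev))
names(counts) <- lev
counts

# Rebuild the factor only when it is needed, e.g. for presentation
restored <- factor(lev[as.integer(codes)], levels = lev)
identical(restored, f)
```

Note that the saving applies to what is kept in memory long term; operations that convert back to integer still allocate the larger vector temporarily.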
If you have specific use cases where you have trouble with this method, please post them; otherwise I think this should work fine.
Here's a starting point for your byte.factor class:
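    # store the factor codes as raw (1 byte each) and carry the levels along
    byte.factor = function(f) {
      res = as.raw(f)
      attr(res, "levels") <- levels(f)
      attr(res, "class") <- "byte.factor"
      res
    }

    # rebuild an ordinary factor from the raw codes and the stored levels
    as.factor.byte.factor = function(b) {
      factor(attributes(b)$levels[as.integer(b)], attributes(b)$levels)
    }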
So you can do things like:
    f = factor(c('a','b'), letters)
    f
    #[1] a b
    #Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

    b = byte.factor(f)
    b
    #[1] 01 02
    #attr(,"levels")
    # [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
    #[20] "t" "u" "v" "w" "x" "y" "z"
    #attr(,"class")
    #[1] "byte.factor"

    as.factor.byte.factor(b)
    #[1] a b
    #Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
Check out how data.table overrides rbind.data.frame if you want to make as.factor generic, and add whatever functions you want to add. It should all be quite straightforward.
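Without touching as.factor at all, a few ordinary S3 methods already make the class more pleasant to use. The following is only a sketch of what "whatever functions you want" might look like; the choice of methods (subsetting, as.character, print) is an assumption for illustration, and the constructor is repeated so the block runs on its own:

```r
# Constructor, equivalent to the one in the answer
byte.factor <- function(f) {
  res <- as.raw(as.integer(f))
  attr(res, "levels") <- levels(f)
  class(res) <- "byte.factor"
  res
}

# Subsetting: plain `[` on a raw vector drops attributes, so restore
# the levels and the class on the result
`[.byte.factor` <- function(x, i) {
  res <- unclass(x)[i]
  attr(res, "levels") <- attr(x, "levels")
  class(res) <- "byte.factor"
  res
}

# Map the one-byte codes back to their labels
as.character.byte.factor <- function(x, ...) {
  attr(x, "levels")[as.integer(unclass(x))]
}

# Print in a factor-like layout instead of showing raw bytes
print.byte.factor <- function(x, ...) {
  print(as.character(x), quote = FALSE)
  cat("Levels:", attr(x, "levels"), "\n")
  invisible(x)
}

b <- byte.factor(factor(c("a", "b", "c"), levels = letters))
s <- b[2:3]
s
```

Other operations (c(), comparison, use inside a data frame) would each need their own method, so this approach trades memory for some extra code.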