Why is DT1[DT2][, value1-value] faster than DT1[DT2, value1-value] on a data.table with fewer columns?
This is related to this question (Can one access repeated column names in `j` of a data.table join?), which I asked because I assumed the opposite was true.
data.table with two columns:

Suppose you wish to join two data.tables and then perform a simple operation on two joined columns. This can be done in either one or two calls to `[.data.table`:
```r
library(data.table)

n = 1000000
dt1 = data.table(name = 1:n, value = rnorm(n))
dt2 = data.table(name = 1:n, value1 = rnorm(n))
setkey(dt1, name)

system.time({x = dt1[dt2, value1 - value]})   # one step
system.time({x = dt1[dt2][, value1 - value]}) # two steps
```
It turns out that making two calls (doing the join first, then doing the subtraction) is noticeably quicker than doing it all in one go:
```
> system.time({x = dt1[dt2, value1 - value]})
   user  system elapsed
   0.67    0.00    0.67
> system.time({x = dt1[dt2][, value1 - value]})
   user  system elapsed
   0.14    0.01    0.16
```
Why is this?
data.table with many columns:
If you put a lot of columns into the data.tables, you find that the one-step approach becomes quicker, presumably because data.table only uses the columns referenced in `j`:
```r
n = 1000000
dt1 = data.table(name = 1:n, value = rnorm(n))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
dt2 = data.table(name = 1:n, value1 = rnorm(n))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
setkey(dt1, name)
system.time({x = dt1[dt2, value1 - value]})
system.time({x = dt1[dt2][, value1 - value]})
```

```
> system.time({x = dt1[dt2, value1 - value]})
   user  system elapsed
   0.89    0.02    0.90
> system.time({x = dt1[dt2][, value1 - value]})
   user  system elapsed
   1.64    0.16    1.81
```
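One way to sanity-check that carrying the extra columns is what slows the two-step form is to pre-select just the columns the expression needs before joining. This is a sketch of mine, not part of the original benchmark; if the extra columns are the culprit, the two-step form should be fast again:

```r
# Keep only the columns the expression needs, so the join no longer
# materializes the extra pi columns. (Sketch, not the original test.)
dt1slim <- dt1[, .(name, value)]
dt2slim <- dt2[, .(name, value1)]
setkey(dt1slim, name)
system.time({x = dt1slim[dt2slim][, value1 - value]})
```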
Answer:

I think this is due to the repeated subsetting that `dt1[dt2, value1-value]` makes for every `name` in `dt2`. That is, you have to perform one `j` operation for each `i` here, as opposed to a single `j` operation after the join. This becomes quite costly with 1e6 unique entries: the overhead of `[.data.table` becomes significant and noticeable.
```r
dt1[dt2][, value1-value] # similar to rowSums
dt1[dt2, value1-value]
```
In the first case, `dt1[dt2]`, we perform the join first, and that is really fast. Of course, with more columns, as you show, you'll see a difference; but the point is that the join is performed once. In the second case, you're grouping `dt1` by `dt2`'s `name`, and for every one of them you're computing the difference. That is, you're subsetting `dt1` for each value of `dt2`: one `j` operation per subset! You can see this better by running the profiler:
```r
Rprof()
t1 <- dt1[dt2, value1-value]
Rprof(NULL)
summaryRprof()
# $by.self
#                self.time self.pct total.time total.pct
# "[.data.table"      0.96    97.96       0.98    100.00
# "-"                 0.02     2.04       0.02      2.04

Rprof()
t2 <- dt1[dt2][, value1-value]
Rprof(NULL)
summaryRprof()
# $by.self
#                self.time self.pct total.time total.pct
# "[.data.table"      0.22    84.62       0.26    100.00
# "-"                 0.02     7.69       0.02      7.69
# "is.unsorted"       0.02     7.69       0.02      7.69
```
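Another way to observe the per-`i` evaluation directly is to count how many times `j` actually runs. The sketch below is mine; note that in current data.table (>= 1.9.4) the grouped-by-`i` semantics described here must be requested explicitly with `by = .EACHI`, so that is what the example uses:

```r
library(data.table)

# Tiny tables so we can count j evaluations by hand.
d1 <- data.table(name = 1:5, value = rnorm(5), key = "name")
d2 <- data.table(name = 1:5, value1 = rnorm(5))

counter <- 0L
invisible(d1[d2, {counter <<- counter + 1L; value1 - value}, by = .EACHI])
counter  # 5: j ran once per row of d2

counter <- 0L
invisible(d1[d2][, {counter <<- counter + 1L; value1 - value}])
counter  # 1: j ran once, after the join
```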
This overhead from repeated subsetting seems to be overcome once you have many columns, and then the join on many columns overtakes it as the time-consuming operation. You can check this by profiling the other code as well.
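For completeness, here is the same `Rprof()` recipe applied to the many-column setup from the question. This is a sketch; exact numbers will vary by machine, but the expectation is that `[.data.table` dominates the two-step timing once the join has to materialize every column:

```r
# Profile the two-step form on the wide tables built above.
Rprof()
t3 <- dt1[dt2][, value1 - value]
Rprof(NULL)
summaryRprof()  # compare $by.self: "[.data.table"'s share vs "-"
```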