image processing - Improve performance of runtime-determined nested for-loops in CUDA
Question: In CUDA, is there a general way of improving the performance of nested for-loops whose bounds are determined at runtime (and which therefore can't be unrolled by the compiler)?
Background: I am working on a CUDA implementation of a 2D image filter algorithm. For each pixel of the input, the value of the output is calculated by looking at the (2*r+1) * (2*r+1) neighbouring pixels. Although r is constant for each image, the shape of the filter depends on the value at each pixel, and hence the operation can't be converted to a true convolution or decomposed into two 1D operations.
I have an efficient implementation for the case where the filter radius r is known at compile time, based on a scatter approach (which turned out faster than the gather approach I came up with): each pixel in the input is assigned a thread, and the output is divided into tiles kept in shared memory. At the heart of the algorithm is a nested for-loop executed by each thread:
    for (int i = -r; i < r + 1; i++) {
        for (int j = -r; j < r + 1; j++) {
            // calculate and scatter value to output[offsetj + j][offseti + i]
        }
    }
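For context, a minimal sketch of what the compile-time variant can look like; the kernel name, tile sizes and FILTER_RADIUS value are hypothetical placeholders, only the unrollable loop structure matters:

    #define TILE_W 16               // hypothetical tile size
    #define TILE_H 16
    #define FILTER_RADIUS 3         // hypothetical compile-time radius

    __global__ void scatterFilterFixed(const float *input, float *output,
                                       int width, int height)
    {
        // output tile plus an apron of FILTER_RADIUS pixels on each side
        __shared__ float tile[TILE_H + 2 * FILTER_RADIUS]
                             [TILE_W + 2 * FILTER_RADIUS];

        // ... load this thread's input pixel, compute offseti/offsetj ...

        // with constant bounds the compiler can fully unroll both loops
        #pragma unroll
        for (int i = -FILTER_RADIUS; i < FILTER_RADIUS + 1; i++) {
            #pragma unroll
            for (int j = -FILTER_RADIUS; j < FILTER_RADIUS + 1; j++) {
                // calculate and scatter the value into the shared tile
                // (accumulation/atomics elided in this sketch)
            }
        }

        // ... copy the finished tile back to global output ...
    }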
I have generalised the code so that r can be given at runtime, using dynamically allocated shared memory. Although the result is still correct, execution is between 1.5 and 3 times slower, depending on the value of r. Through tests I have concluded that the slow-down is due to the loop bounds being determined at runtime, meaning the compiler can't unroll the loops as it otherwise does.
If anyone has suggestions on how to improve performance in this particular case, or knows of a similar implementation, tips are welcome. My ideas so far are either to compile a different kernel for each value of r, or to get rid of the inner loop (though I am not sure how that would help). A sketch of the first idea follows below.
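A minimal sketch of compiling a different kernel per radius via a template parameter, assuming a hypothetical launchFilter helper; each instantiation gets compile-time bounds and can therefore be unrolled:

    template <int R>
    __global__ void scatterFilter(const float *input, float *output,
                                  int width, int height)
    {
        // same body as above: R is a compile-time constant here, so the
        // nested loops over [-R, R] can still be unrolled by the compiler
    }

    // host-side dispatch: one explicit instantiation per supported radius
    void launchFilter(int r, dim3 grid, dim3 block, size_t shmem,
                      const float *in, float *out, int w, int h)
    {
        switch (r) {
            case 1: scatterFilter<1><<<grid, block, shmem>>>(in, out, w, h); break;
            case 2: scatterFilter<2><<<grid, block, shmem>>>(in, out, w, h); break;
            // ... one case per radius, up to 25 ...
            default: break; // fall back to the slow runtime-r kernel
        }
    }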
Edit: As gathered from the comments, the options seem to be manual unrolling (which is not applicable in my case), the use of templates, and runtime code generation. The best option in my particular case, where r is in the range of 1 to 25, seems to be to create explicit templates for a few different cases and pad the in-between values with zeros. Since the complexity grows quadratically with r, if all values of r are equally common it seems reasonable to sample the range more densely at the higher end, e.g. to create templates for r equal to 8, 14, 18, 21, 23, and 25.
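A sketch of that rounding-up step; the radius list comes from the text above, the helper name is made up:

    // instantiated radii, sampled more densely at the costly high end
    static const int kRadii[] = {8, 14, 18, 21, 23, 25};

    // round a runtime radius up to the nearest instantiated one; the taps
    // beyond the actual radius contribute zero inside the kernel, so the
    // result is unchanged while the loop bounds stay compile-time constants
    int templateRadiusFor(int r)
    {
        for (int k : kRadii)
            if (r <= k)
                return k;
        return kRadii[5];
    }

Inside the kernel, the taps between the actual radius and the template radius are simply treated as zeros, so correctness is preserved at the cost of some wasted (but fully unrollable) iterations.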