optimization - CUDA optimisation - kernel launch conditions

July 15, 2010

I am new to CUDA and want to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario, but I'll try to generalise it as much as possible so that anyone else with a similar question can gain from it in future.

Assume I've got an array of 300 elements (array A) that is sent to the kernel as input. This array is made up of a few repeating integers, and each integer has a device function specific to it. For example, every time 5 appears in array A, the kernel performs a function specific to 5. These functions are device functions.

The way I have parallelised the problem is by launching 320 blocks (probably not the best number), so that each block performs the device function relevant to its element in parallel.

The CPU would handle the entire problem serially, taking the elements one by one and calling each function one after the other, whereas the GPU allocates one element to each block, so that all 320 blocks can access the relevant device functions and calculate simultaneously.
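
The serial CPU approach described above can be sketched in plain C++ as follows. This is only an illustration: the function names, their bodies and the extra input `x` are hypothetical, since the question does not say what the per-integer functions actually compute.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-value functions; the real ones are not given in the question.
inline float op_for_1(float x) { return x + 1.0f; }
inline float op_for_5(float x) { return x * 5.0f; }
inline float op_default(float x) { return x; }

// Pick the function that belongs to the integer stored in the array.
inline float dispatch(int code, float x) {
    switch (code) {
        case 1:  return op_for_1(x);
        case 5:  return op_for_5(x);
        default: return op_default(x);
    }
}

// Serial CPU baseline: visit the elements one by one.
std::vector<float> run_serial(const std::vector<int>& a, float x) {
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = dispatch(a[i], x);
    }
    return out;
}
```

In the GPU version, the body of the loop is what a single block (or thread) would execute for its own index `i`, with the per-value functions declared as device functions.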

In theory, for a large number of elements the GPU should be faster, or at least I thought so, but in my case it isn't. My assumption is that since 300 elements is a small number, the CPU will be faster than the GPU.

This is acceptable, but I want to know how I can cut down the GPU execution time at least a little. Currently, the CPU takes 2.5 ms and the GPU around 12 ms.

Question 1 - How can I choose the optimum number of blocks/threads to launch at the start? I first tried 320 blocks with 1 thread per block, then 1 block with 320 threads. There was no real change in execution time. Would tweaking the number of blocks/threads improve the speed?

Question 2 - If 300 elements is too small, why is that, and how many elements do I need before I see the GPU outperforming the CPU?

Question 3 - What optimisation techniques should I look into?

Please let me know if any of this isn't clear and I'll expand on it.

Thanks in advance.

Internally, CUDA manages threads in groups of 32 (so-called warps). If you have 1 thread per block, the device will still execute a full warp of 32, with the other 31 threads masked off and doing no useful work. This is potentially an occupancy issue, though you may not observe it, depending on your device and problem size. There is also a limit on the number of blocks a given multiprocessor (SM) can execute at once. AFAIR, a GeForce 4x can run 8 blocks on one SM. Hence, if you have a device with 8 SMs, you can run only 64 threads simultaneously if your block size is 1. You can use a tool called the occupancy calculator to estimate a better block size, or you can use the Visual Profiler.
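
The warp and block-limit arithmetic above can be sketched in host-side C++ like this. It is a rough model, not a replacement for the occupancy calculator: the struct and function names are made up for illustration, the per-SM thread limit is an assumed parameter that is not mentioned above, and register and shared-memory limits are ignored entirely.

```cpp
#include <algorithm>

// Hypothetical per-SM limits; real values depend on the device.
struct SmLimits {
    int warp_size;           // threads per warp (32, as stated above)
    int max_blocks_per_sm;   // resident-block limit (8 in the example above)
    int max_threads_per_sm;  // resident-thread limit (assumed for illustration)
};

// A block always occupies whole warps, so its size rounds up to a multiple of warp_size.
int warps_per_block(int threads_per_block, const SmLimits& lim) {
    return (threads_per_block + lim.warp_size - 1) / lim.warp_size;
}

// Threads doing useful work on one SM when it holds as many blocks as the limits allow.
int useful_threads_per_sm(int threads_per_block, const SmLimits& lim) {
    if (threads_per_block <= 0) return 0;
    // Thread slots one block consumes, including the padding up to a full warp.
    int slots_per_block = warps_per_block(threads_per_block, lim) * lim.warp_size;
    int blocks_by_threads = lim.max_threads_per_sm / slots_per_block;
    int resident_blocks = std::min(lim.max_blocks_per_sm, blocks_by_threads);
    return resident_blocks * threads_per_block;
}
```

With assumed limits of `{32, 8, 1536}`, a block size of 1 gives 8 useful threads per SM, which is the 64 threads on 8 SMs mentioned above, while a block size of 64 gives 512 per SM under the same block limit.
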
This can only be decided by profiling. There are too many unknowns, e.g. the ratio of memory accesses to actual computation, how parallelisable your task is, etc.
I recommend you start with the CUDA Best Practices Guide.
