memory management - OpenCL Matrix Average calculation optimizing?
I am trying to calculate the average of a webcam stream in Python using PyOpenCL. As a test, I am trying to calculate the average of a representative matrix over a number of frames, as can be seen below:

    import os
    import time

    import numpy as np
    import pyopencl as cl

    os.environ['PYOPENCL_CTX'] = '0'

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    length = 480
    width = 320
    nframes = 60

    # Test data: frame i is filled with the constant value i.
    matrix = np.zeros(shape=(length, width, nframes)).astype(np.float32)
    for i in range(nframes):
        matrix[:, :, i] = float(i)

    matrix_gpu = np.zeros(shape=(length, width)).astype(np.float32)
    matrix_cpu = np.zeros_like(matrix_gpu)
    matrix_gpu_vector = np.reshape(matrix_gpu, matrix_gpu.size)
    # Running sum, flattened to match the kernel's 1-D buffers.
    final_matrix = np.zeros_like(matrix_gpu_vector)

    mf = cl.mem_flags
    dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, matrix_gpu.nbytes)

    prg = cl.Program(ctx, """
    __kernel void summatrices(const unsigned int size,
                              __global float * a,
                              __global float * b,
                              __global float * sum)
    {
        int i = get_global_id(0);
        sum[i] = a[i] + b[i];
    }
    """).build()

    t0 = time.time()
    for i in range(nframes):
        matrix_gpu = matrix[:, :, i].astype(np.float32)
        matrix_gpu_vector = np.reshape(matrix_gpu, matrix_gpu.size)
        a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=matrix_gpu_vector)
        b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=final_matrix)
        prg.summatrices(queue, matrix_gpu_vector.shape, None,
                        np.int32(len(matrix_gpu_vector)), a_buf, b_buf, dest_buf)
        temp_matrix = np.empty_like(matrix_gpu_vector)
        cl.enqueue_copy(queue, temp_matrix, dest_buf)
        final_matrix = temp_matrix

    final_matrix = final_matrix / nframes
    final_matrix = np.reshape(final_matrix, (length, width))
    delta_t = time.time() - t0
    print('opencl gpu multiplication: ' + str(delta_t))

    matrix_cpu = np.sum(matrix[:, :, :], axis=2) / nframes
    delta_t = time.time() - (t0 + delta_t)
    print('opencl cpu multiplication: ' + str(delta_t))

    #print(matrix)
    #print(final_matrix)
    #print(matrix_cpu)

    eq = (final_matrix == matrix_cpu).all()
    print(eq)

It appears, however, that this code is a factor of 30 slower on the GPU than on the CPU. This may be due to the use of the for-loop and the lack of workgroup allocation.

Is it possible to strip out the Python for-loop and allocate the workgroups properly?
Since you said this is a test, I guess that at the end of the day you want to add more computation. Here are two things you can try to enhance your code:
- Do not create a_buf and b_buf on every iteration. Creating buffers is costly. Create them once outside the loop, and inside the loop use the cl.enqueue_write_buffer() function or cl.enqueue_copy(). It seems the first function is deprecated and has been replaced by the second one, so the call should be cl.enqueue_copy(queue, a_buf, matrix_gpu_vector).
- I guess it doesn't cost much, but you don't need to reshape the matrix matrix_gpu: it's already aligned in memory.
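The first suggestion could look roughly like the sketch below. It keeps the question's kernel and buffer names (a_buf, b_buf, dest_buf), but the function names average_frames_opencl and average_frames_numpy, the running_sum array, and the bounds check in the kernel are assumptions added for illustration. pyopencl is imported inside the function so that the NumPy reference still runs on a machine without an OpenCL runtime. This is not a tuned implementation; it only moves the buffer allocation out of the loop.

```python
import numpy as np

KERNEL_SRC = """
__kernel void summatrices(const unsigned int size,
                          __global const float *a,
                          __global const float *b,
                          __global float *sum)
{
    const unsigned int i = get_global_id(0);
    if (i < size)
        sum[i] = a[i] + b[i];
}
"""


def average_frames_numpy(frames):
    """CPU reference: mean over the last axis (the frame index)."""
    return frames.sum(axis=2, dtype=np.float32) / frames.shape[2]


def average_frames_opencl(frames):
    """Average frames on the device, allocating every buffer only once."""
    import pyopencl as cl  # lazy import: needs an OpenCL runtime and device

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    length, width, nframes = frames.shape
    running_sum = np.zeros(length * width, dtype=np.float32)

    # Allocate the device buffers once, outside the loop.
    a_buf = cl.Buffer(ctx, mf.READ_ONLY, running_sum.nbytes)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY, running_sum.nbytes)
    dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, running_sum.nbytes)
    prg = cl.Program(ctx, KERNEL_SRC).build()

    for i in range(nframes):
        # A frame slice is strided, so make one contiguous float32 copy.
        frame_vector = np.ascontiguousarray(frames[:, :, i], dtype=np.float32).ravel()
        # Refill the existing buffers instead of creating new ones.
        cl.enqueue_copy(queue, a_buf, frame_vector)
        cl.enqueue_copy(queue, b_buf, running_sum)
        prg.summatrices(queue, (running_sum.size,), None,
                        np.uint32(running_sum.size), a_buf, b_buf, dest_buf)
        cl.enqueue_copy(queue, running_sum, dest_buf)

    return (running_sum / nframes).reshape(length, width)
```

With the question's test data, average_frames_opencl(matrix) can be compared against average_frames_numpy(matrix). Each iteration still copies the running sum to the device and back, so this removes only the allocation cost, not the transfer cost.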
BTW, if you are making this test to simulate an app that receives a stream from a webcam, shouldn't you use an RGB matrix? I mean, you would receive a 24-bit RGB image, which would be simulated better with int8, no?
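To make the RGB point concrete, here is a small NumPy-only sketch of what such test data might look like. It assumes unsigned 8-bit channels (np.uint8, since pixel values run from 0 to 255) and a frames-first layout; the function names make_rgb_frames and average_rgb_frames are made up for this example. The sum is accumulated in a wider integer type because adding 8-bit values directly would overflow.

```python
import numpy as np


def make_rgb_frames(nframes, length=480, width=320, seed=0):
    """Random stand-in for a webcam stream: nframes x length x width x 3, uint8."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(nframes, length, width, 3), dtype=np.uint8)


def average_rgb_frames(frames):
    """Per-pixel, per-channel mean, accumulated in uint32 to avoid overflow."""
    acc = np.zeros(frames.shape[1:], dtype=np.uint32)
    for frame in frames:
        acc += frame  # uint8 values are widened to uint32 on addition
    return acc.astype(np.float32) / len(frames)
```

A 480 x 320 frame stored this way takes 480 * 320 * 3 bytes, a quarter of the size of the same frame stored as three float32 channels, which also reduces the amount of data to transfer to the device.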