memory management - OpenCL Matrix Average calculation optimizing?

August 15, 2015

I am trying to calculate the average of a webcam stream in Python using PyOpenCL. As a test, I am calculating the average of a representative matrix over a number of frames, as can be seen below:

    import pyopencl as cl
    import numpy as np
    import time
    import os

    # Pick the first OpenCL platform/device without an interactive prompt.
    os.environ['PYOPENCL_CTX'] = '0'

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    length = 480
    width = 320
    nframes = 60

    # Test data: frame i is filled with the value i.
    matrix = np.zeros(shape=(length, width, nframes)).astype(np.float32)
    for i in range(nframes):
        matrix[:, :, i] = float(i)

    matrix_gpu = np.zeros(shape=(length, width)).astype(np.float32)
    matrix_cpu = np.zeros_like(matrix_gpu)
    matrix_gpu_vector = np.reshape(matrix_gpu, matrix_gpu.size)
    # The original read np.zeros_like(matrix2t), but matrix2t is never defined;
    # a flat accumulator the size of one frame is assumed here.
    final_matrix = np.zeros_like(matrix_gpu_vector)

    mf = cl.mem_flags
    dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, matrix_gpu.nbytes)

    prg = cl.Program(ctx, """
        __kernel void summatrices(const unsigned int size,
                       __global float * a,
                       __global float * b,
                       __global float * sum)
        {
        int i = get_global_id(0);
        sum[i] = a[i] + b[i];
        }
        """).build()

    t0 = time.time()

    # Add one frame per iteration to the running sum held in final_matrix.
    for i in range(nframes):
        matrix_gpu = matrix[:, :, i].astype(np.float32)
        matrix_gpu_vector = np.reshape(matrix_gpu, matrix_gpu.size)
        a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=matrix_gpu_vector)
        b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=final_matrix)
        prg.summatrices(queue, matrix_gpu_vector.shape, None, np.int32(len(matrix_gpu_vector)), a_buf, b_buf, dest_buf)
        temp_matrix = np.empty_like(matrix_gpu_vector)
        cl.enqueue_copy(queue, temp_matrix, dest_buf)
        final_matrix = temp_matrix

    final_matrix = final_matrix / nframes
    final_matrix = np.reshape(final_matrix, (length, width))
    delta_t = time.time() - t0

    print('opencl gpu multiplication: ' + str(delta_t))
    matrix_cpu = np.sum(matrix[:, :, :], axis=2) / nframes
    delta_t = time.time() - (t0 + delta_t)
    print('opencl cpu multiplication: ' + str(delta_t))
    # print(matrix)
    # print(final_matrix)
    # print(matrix_cpu)

    eq = (final_matrix == matrix_cpu).all()
    print(eq)

It appears, however, that this code is a factor of 30 slower on the GPU than on the CPU. This may be due to the use of the for-loop and the lack of workgroup allocation.

Is it possible to strip out the Python for-loop and allocate the workgroups properly?

Since you said this is a test, I guess that at the end of the day you want to add more computation. Here are 2 things you can try to enhance your code:

Do not create a_buf and b_buf each time. Creating buffers is costly. Create them outside of the loop, and inside the loop use the cl.enqueue_write_buffer() function or cl.enqueue_copy(). It seems the first function is deprecated and has been replaced by the second one, so it should be something like cl.enqueue_copy(queue, a_buf, matrix_gpu_vector).
I guess it doesn't cost much, but you don't need to reshape the matrix matrix_gpu, since it's already aligned in memory.
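
The first suggestion could be sketched roughly as follows. This is only a sketch under a few assumptions, not the asker's code: the names KERNEL_SRC, accumulate, average_frames_opencl and average_frames_numpy are made up for illustration, and the kernel is changed so that it adds each frame into a running sum that stays on the device. Both buffers are created once, one frame is uploaded per iteration with cl.enqueue_copy(), and the result is read back a single time at the end. pyopencl is imported inside the function so that the NumPy reference can be used on a machine without an OpenCL runtime; note that cl.create_some_context() may prompt for a device unless PYOPENCL_CTX is set.

```python
import numpy as np

# Hypothetical kernel: adds one frame into an accumulator kept on the device.
KERNEL_SRC = """
__kernel void accumulate(__global const float *frame,
                         __global float *acc)
{
    int i = get_global_id(0);
    acc[i] += frame[i];
}
"""


def average_frames_numpy(frames):
    """CPU reference: mean over the last axis of a (length, width, nframes) array."""
    return frames.mean(axis=2)


def average_frames_opencl(frames):
    """Average frames on an OpenCL device, creating the buffers only once."""
    import pyopencl as cl  # imported here so the CPU reference works without OpenCL

    length, width, nframes = frames.shape
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    # Host-side accumulator, also used to zero-initialise the device buffer.
    acc_host = np.zeros(length * width, dtype=np.float32)

    # Buffers are allocated once, outside the loop.
    frame_buf = cl.Buffer(ctx, mf.READ_ONLY, size=acc_host.nbytes)
    acc_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=acc_host)

    prg = cl.Program(ctx, KERNEL_SRC).build()

    for i in range(nframes):
        # A slice along the last axis is strided, so make a contiguous copy.
        frame = np.ascontiguousarray(frames[:, :, i], dtype=np.float32).ravel()
        cl.enqueue_copy(queue, frame_buf, frame)  # reuse the existing buffer
        prg.accumulate(queue, (frame.size,), None, frame_buf, acc_buf)

    # Read the sum back once, after all frames have been added.
    cl.enqueue_copy(queue, acc_host, acc_buf)
    queue.finish()
    return acc_host.reshape(length, width) / nframes
```

With the test data from the question, np.allclose(average_frames_opencl(matrix), average_frames_numpy(matrix)) would be one way to compare the two results. Whether this beats the CPU still depends on the device and on the transfer cost per frame.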

By the way, if you are making this test to simulate an app that receives a stream from a webcam, shouldn't you use an RGB matrix? I mean, you would receive a 24-bit RGB image, which would be simulated better with int8, no?
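
To make that last point concrete, here is a small NumPy-only sketch of what such test data might look like. It assumes unsigned 8-bit channels (np.uint8, since pixel values run from 0 to 255) and a frame layout of (nframes, height, width, 3); the function name average_rgb_frames and the synthetic frame values are invented for illustration. The sum is kept in float32 because adding many 8-bit frames would overflow an 8-bit accumulator.

```python
import numpy as np


def average_rgb_frames(frames):
    """Average a stack of uint8 RGB frames shaped (nframes, height, width, 3)."""
    # Accumulate in float32 so the running sum cannot overflow 8 bits.
    acc = np.zeros(frames.shape[1:], dtype=np.float32)
    for frame in frames:
        acc += frame
    return acc / len(frames)


# Synthetic stand-in for a webcam stream: frame k is filled with the value 4 * k.
nframes, height, width = 60, 48, 32
frames = np.zeros((nframes, height, width, 3), dtype=np.uint8)
for k in range(nframes):
    frames[k] = 4 * k

mean_image = average_rgb_frames(frames)
print(mean_image.shape, mean_image.dtype)
```

Each 8-bit frame takes a quarter of the memory of its float32 equivalent, which would also reduce the amount of data sent to the device per frame if the conversion to float were done in the kernel.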
