memory management - OpenCL matrix average calculation: how to optimize?


I am trying to calculate the average of a webcam stream in Python using PyOpenCL. As a test, I am calculating the average of a representative matrix over a number of frames, as can be seen below:

    import os
    import time

    import numpy as np
    import pyopencl as cl

    os.environ['PYOPENCL_CTX'] = '0'

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    length = 480
    width = 320
    nframes = 60

    # Test data: frame i is filled with the constant value i.
    matrix = np.zeros(shape=(length, width, nframes)).astype(np.float32)
    for i in range(nframes):
        matrix[:, :, i] = float(i)

    matrix_gpu = np.zeros(shape=(length, width)).astype(np.float32)
    matrix_cpu = np.zeros_like(matrix_gpu)
    matrix_gpu_vector = np.reshape(matrix_gpu, matrix_gpu.size)
    final_matrix = np.zeros_like(matrix_gpu_vector)  # running sum, flattened

    mf = cl.mem_flags
    dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, matrix_gpu.nbytes)

    prg = cl.Program(ctx, """
        __kernel void summatrices(const unsigned int size,
                                  __global float * a,
                                  __global float * b,
                                  __global float * sum)
        {
            int i = get_global_id(0);
            sum[i] = a[i] + b[i];
        }
        """).build()

    t0 = time.time()

    for i in range(nframes):
        # Flatten the current frame and upload it together with the running sum.
        matrix_gpu = matrix[:, :, i].astype(np.float32)
        matrix_gpu_vector = np.reshape(matrix_gpu, matrix_gpu.size)
        a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=matrix_gpu_vector)
        b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=final_matrix)
        prg.summatrices(queue, matrix_gpu_vector.shape, None,
                        np.uint32(len(matrix_gpu_vector)), a_buf, b_buf, dest_buf)
        # Read the new running sum back to the host.
        temp_matrix = np.empty_like(matrix_gpu_vector)
        cl.enqueue_copy(queue, temp_matrix, dest_buf)
        final_matrix = temp_matrix

    final_matrix = final_matrix / nframes
    final_matrix = np.reshape(final_matrix, (length, width))
    delta_t = time.time() - t0
    print('OpenCL GPU average: ' + str(delta_t))

    t1 = time.time()
    matrix_cpu = np.sum(matrix[:, :, :], axis=2) / nframes
    delta_t = time.time() - t1
    print('NumPy CPU average: ' + str(delta_t))

    # print(matrix)
    # print(final_matrix)
    # print(matrix_cpu)

    eq = (final_matrix == matrix_cpu).all()
    print(eq)

It appears, however, that this code is a factor of 30 slower on the GPU than on the CPU. This might be due to the use of the for-loop and the lack of workgroup allocation.

Is it possible to strip out the Python for-loop and allocate the workgroups properly?

Since you said this is a test, I guess that at the end of the day you want to do more computation than an add. Here are two things you can try to enhance your code:

  1. Do not create a_buf and b_buf on every iteration. Creating buffers is costly. Create them once outside the loop, and inside the loop use cl.enqueue_write_buffer() or cl.enqueue_copy() to refresh their contents. It seems the first function is deprecated and has been replaced by the second one. It should look something like cl.enqueue_copy(queue, a_buf, matrix_gpu_vector).
  2. I guess it doesn't cost much, but you don't need to reshape the matrix matrix_gpu, since it's already aligned in memory.
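The two suggestions above can be sketched roughly as follows. This is only an illustration under a few assumptions, not the original poster's code: the frames are stored as (nframes, length, width) so that each frame is contiguous and can be copied without a reshape, the kernel is renamed to a hypothetical `accumulate` that adds each frame into a running sum kept on the device, and the helper names `average_frames_opencl` and `average_frames_numpy` are made up for this example. The pyopencl import is guarded so the file can still be loaded on a machine without OpenCL.

```python
import numpy as np

try:
    import pyopencl as cl
except ImportError:  # no OpenCL on this machine; only the NumPy path works
    cl = None

# Hypothetical kernel: adds one frame into an accumulator that stays on the device.
KERNEL_SRC = """
__kernel void accumulate(const unsigned int size,
                         __global const float *frame,
                         __global float *acc)
{
    int i = get_global_id(0);
    if (i < size)
        acc[i] += frame[i];
}
"""


def average_frames_numpy(frames):
    """CPU reference: mean over the first axis of a (nframes, length, width) array."""
    frames = np.asarray(frames, dtype=np.float32)
    return frames.sum(axis=0) / frames.shape[0]


def average_frames_opencl(frames):
    """Average frames on an OpenCL device, creating the buffers only once."""
    if cl is None:
        raise RuntimeError("pyopencl is not installed")
    frames = np.ascontiguousarray(frames, dtype=np.float32)
    nframes = frames.shape[0]

    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    prg = cl.Program(ctx, KERNEL_SRC).build()

    acc_host = np.zeros(frames.shape[1:], dtype=np.float32)
    # Suggestion 1: both buffers are allocated once, outside the loop.
    frame_buf = cl.Buffer(ctx, mf.READ_ONLY, acc_host.nbytes)
    acc_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=acc_host)

    n = acc_host.size
    for k in range(nframes):
        # Suggestion 2: frames[k] is contiguous, so it is copied without a reshape.
        cl.enqueue_copy(queue, frame_buf, frames[k])
        prg.accumulate(queue, (n,), None, np.uint32(n), frame_buf, acc_buf)

    # Only one device-to-host transfer, after all frames have been added.
    cl.enqueue_copy(queue, acc_host, acc_buf)
    queue.finish()
    return acc_host / nframes
```

With a device available, `average_frames_opencl(frames)` should agree with `average_frames_numpy(frames)` up to float32 rounding. Whether this beats the CPU for a plain sum is not guaranteed, since every frame still has to cross the bus once.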

By the way, if you are making this test to simulate an app that receives a stream from a webcam, shouldn't you use an RGB matrix? I mean, you will receive a 24-bit RGB image, which would be better simulated with int8, no?
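As a rough illustration of that last point, the test data could be generated as 8-bit RGB frames instead of float32 planes. The sketch below uses NumPy only and makes some assumptions that are not in the question: the frames are random, the channels are stored unsigned (np.uint8, the usual layout for 24-bit RGB), and the running sum is kept in float32 because summing 60 frames directly in 8 bits would overflow.

```python
import numpy as np

length, width, nframes = 480, 320, 60

# Simulated webcam stream: nframes RGB images, one unsigned byte per channel.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(nframes, length, width, 3), dtype=np.uint8)

# Accumulate in float32: a uint8 accumulator would wrap around after a few frames.
acc = np.zeros((length, width, 3), dtype=np.float32)
for frame in frames:
    acc += frame

average = acc / nframes
```

Uploading the frames as bytes also means a quarter of the transfer volume per channel value compared with float32, at the cost of a cast inside the kernel.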

