OpenCL - How do I make a strided copy from global to local memory?
I want to copy a data buffer in global device memory to the local memory of a processing core, but with a twist.
I know about async_work_group_copy, and it's nice (or rather, it's clunky and annoying, but working). However, my data is not contiguous: it is strided, i.e. there might be X bytes between every two consecutive Y bytes I want to copy.
Obviously I'm not going to copy all the useless data, and it might not even fit in local memory. What can I do instead? I want to avoid writing actual kernel code to do the copying, e.g.
    size_t threadid = get_local_id(0);
    if (threadid < length) {
        unsigned offset = threadid * stride;
        localdata[threadid] = globaldata[offset];
    }
You can use the OpenCL built-in function async_work_group_strided_copy().
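The question describes the layout in bytes (Y wanted bytes, then X skipped bytes), whereas the stride argument of async_work_group_strided_copy() is counted in elements of the pointed-to type. A minimal sketch of that conversion follows. The helper name element_stride is hypothetical, and it assumes each wanted chunk is exactly one element of a type the built-in supports:

```python
def element_stride(chunk_bytes, gap_bytes):
    """Stride in elements for chunks of chunk_bytes separated by gap_bytes.

    Hypothetical helper: assumes one chunk is exactly one element of the
    type passed to the strided copy (e.g. 4 bytes for an int).
    """
    if chunk_bytes <= 0 or gap_bytes < 0:
        raise ValueError("chunk_bytes must be positive and gap_bytes non-negative")
    period = chunk_bytes + gap_bytes  # distance between the starts of two chunks
    if period % chunk_bytes != 0:
        # The layout cannot be expressed as a whole number of elements.
        raise ValueError("chunk + gap is not a multiple of the element size")
    return period // chunk_bytes


# A 4-byte int followed by 8 skipped bytes, as with the R channel of int RGB pixels
print(element_stride(4, 8))  # 3
```

If the layout is not a whole number of elements, the built-in probably cannot express it directly, and a hand-written copy loop may still be needed.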
Here is a small example in PyOpenCL, as per @DarkZeros' comment. Let's assume a small stripe of an RGB image, say 4 by 1, like this:

    img = np.array([58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6])

and you want to access the 4 red channels, i.e. [58 157 64 214]. You'd do:
    import numpy as np
    import pyopencl as cl

    def test_asyc_copy_stride_to_local(self):
        # create context, queue, program first ....
        # (ctx, queue, prog and mf = cl.mem_flags are assumed to come from that setup)
        # number of R channels
        nb_of_el = 4
        # int32 so that the host data matches the kernel's `int` pointers
        img = np.array([58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6], dtype=np.int32)
        cl_input = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img)
        # buffer used to check if the copy is correct
        cl_output = cl.Buffer(ctx, mf.WRITE_ONLY, size=nb_of_el * np.dtype('int32').itemsize)
        lcl_buf = cl.LocalMemory(nb_of_el * np.dtype('int32').itemsize)
        prog.asyncopytolocalwithstride(queue, (nb_of_el,), None, cl_input, cl_output, lcl_buf)
        result = np.zeros(nb_of_el, dtype=np.int32)
        cl.enqueue_copy(queue, result, cl_output).wait()
        print(result)

The kernel:
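    kernel void asyncopytolocalwithstride(global int *in, global int *out, local int *localbuf)
    {
        const int idx = get_global_id(0);
        localbuf[idx] = 0;
        // make sure every work-item has zeroed its slot before the copy writes to localbuf
        barrier(CLK_LOCAL_MEM_FENCE);
        // copy 4 elements, stride = 3 (RGB); the stride is in elements, not bytes
        event_t ev = async_work_group_strided_copy(localbuf, in, 4, 3, 0);
        wait_group_events(1, &ev);
        out[idx] = localbuf[idx];
    }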
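To check what the kernel should write to the output buffer, the global-to-local strided copy can be modelled on the host as a plain gather, dst[i] = src[i * stride]. This is only a sketch of the expected result, not of how a device performs the copy, and the function name strided_gather is made up for illustration:

```python
def strided_gather(src, num_elements, src_stride):
    """Host-side model of a global-to-local strided copy.

    Hypothetical reference function: picks num_elements values from src,
    one every src_stride elements, starting at index 0.
    """
    return [src[i * src_stride] for i in range(num_elements)]


img = [58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6]

# Same arguments as in the kernel: 4 elements, stride 3 (RGB)
expected = strided_gather(img, 4, 3)
print(expected)  # [58, 157, 64, 214]
```

Comparing this list with the array read back from the device is a quick way to confirm that the stride was given in elements rather than bytes.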