C++ - SSE/NEON table lookup optimization


I have the following lookup and interpolation code to optimize (the float table has 128 entries). I use the Intel compiler on Windows, GCC on OS X, and GCC with NEON on OS X.

    for (unsigned int i = 0; i < 4; i++) {
        const int iidx = (int)m_findex[i];           // integer part selects the table entry
        const float frac = m_findex[i] - iidx;       // fractional part drives the interpolation
        m_fresult[i] = sftable[iidx].val + sftable[iidx].val2 * frac;
    }
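For reference, here is a minimal self-contained sketch of the scalar loop above. The `Entry` layout (two packed floats, named `val` and `val2` after the loop), the reading of `val2` as a per-entry slope, and the helper `build_square_table` with its x*x example function are all assumptions made for illustration; the real `sftable` contents are not shown in the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical entry layout: field names follow the loop above.
// Treating val2 as the slope towards the next entry is an assumption.
struct Entry {
    float val;   // table value at the integer index
    float val2;  // slope used for linear interpolation
};

constexpr std::size_t kTableSize = 128;

// Fill the table with f(x) = x*x, chosen only because the results are exact in float.
inline void build_square_table(Entry* table) {
    for (std::size_t i = 0; i < kTableSize; ++i) {
        const float x = static_cast<float>(i);
        table[i].val = x * x;
        table[i].val2 = 2.0f * x + 1.0f;  // (x+1)^2 - x^2
    }
}

// Scalar baseline: four lookups with linear interpolation.
inline void lookup4_scalar(const Entry* table, const float* findex, float* result) {
    for (unsigned int i = 0; i < 4; ++i) {
        const int iidx = static_cast<int>(findex[i]);  // truncates towards zero
        assert(iidx >= 0 && static_cast<std::size_t>(iidx) < kTableSize);
        const float frac = findex[i] - static_cast<float>(iidx);
        result[i] = table[iidx].val + table[iidx].val2 * frac;
    }
}
```

A baseline like this is useful mainly as a correctness and timing reference for any vectorized variant.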

I vectorized it with SSE/NEON (the macros map to the corresponding SSE or NEON instructions):

    vec_int iidx = vec_float2int(m_findex);
    vec_float frac = vec_sub(m_findex, vec_int2float(iidx));
    // gather the four val2 entries lane by lane
    m_fresult[0] = sftable[iidx[0]].val2;
    m_fresult[1] = sftable[iidx[1]].val2;
    m_fresult[2] = sftable[iidx[2]].val2;
    m_fresult[3] = sftable[iidx[3]].val2;
    m_fresult = vec_mul(m_fresult, frac);
    // reuse frac as scratch for the four val1 entries
    frac[0] = sftable[iidx[0]].val1;
    frac[1] = sftable[iidx[1]].val1;
    frac[2] = sftable[iidx[2]].val1;
    frac[3] = sftable[iidx[3]].val1;
    m_fresult = vec_add(m_fresult, frac);

I think the table access and the move to aligned memory are the real bottleneck here. I am not good at assembler, but there are a lot of unpcklps and mov instructions:

    10026751  mov         eax,dword ptr [esp+4270h]
    10026758  movaps      xmm3,xmmword ptr [eax+16640h]
    1002675f  cvttps2dq   xmm5,xmm3
    10026763  cvtdq2ps    xmm4,xmm5
    10026766  movd        edx,xmm5
    1002676a  movdqa      xmm6,xmm5
    1002676e  movdqa      xmm1,xmm5
    10026772  psrldq      xmm6,4
    10026777  movdqa      xmm2,xmm5
    1002677b  movd        ebx,xmm6
    1002677f  subps       xmm3,xmm4
    10026782  psrldq      xmm1,8
    10026787  movd        edi,xmm1
    1002678b  psrldq      xmm2,0ch
    10026790  movdqa      xmmword ptr [esp+4f40h],xmm5
    10026799  mov         ecx,dword ptr [eax+edx*8+10cf4h]
    100267a0  movss       xmm0,dword ptr [eax+edx*8+10cf4h]
    100267a9  mov         dword ptr [eax+166b0h],ecx
    100267af  movd        ecx,xmm2
    100267b3  mov         esi,dword ptr [eax+ebx*8+10cf4h]
    100267ba  movss       xmm4,dword ptr [eax+ebx*8+10cf4h]
    100267c3  mov         dword ptr [eax+166b4h],esi
    100267c9  mov         edx,dword ptr [eax+edi*8+10cf4h]
    100267d0  movss       xmm7,dword ptr [eax+edi*8+10cf4h]
    100267d9  mov         dword ptr [eax+166b8h],edx
    100267df  movss       xmm1,dword ptr [eax+ecx*8+10cf4h]
    100267e8  unpcklps    xmm0,xmm7
    100267eb  unpcklps    xmm4,xmm1
    100267ee  unpcklps    xmm0,xmm4
    100267f1  mulps       xmm0,xmm3
    100267f4  movaps      xmmword ptr [eax+166b0h],xmm0
    100267fb  mov         ebx,dword ptr [esp+4f40h]
    10026802  mov         edi,dword ptr [esp+4f44h]
    10026809  mov         ecx,dword ptr [esp+4f48h]
    10026810  mov         esi,dword ptr [esp+4f4ch]
    10026817  movss       xmm2,dword ptr [eax+ebx*8+10cf0h]
    10026820  movss       xmm5,dword ptr [eax+edi*8+10cf0h]
    10026829  movss       xmm3,dword ptr [eax+ecx*8+10cf0h]
    10026832  movss       xmm6,dword ptr [eax+esi*8+10cf0h]
    1002683b  unpcklps    xmm2,xmm3
    1002683e  unpcklps    xmm5,xmm6
    10026841  unpcklps    xmm2,xmm5
    10026844  mulps       xmm2,xmm0
    10026847  movaps      xmmword ptr [eax+166b0h],xmm2

When profiling, there is not much benefit from the SSE version on Windows.

Do you have any suggestions on how to improve this? Are any side effects to be expected with NEON/GCC?

Currently I am considering vectorizing only the first part and doing the table readout and interpolation in a loop, hoping that it benefits from compiler optimization.
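The hybrid idea described above could look roughly like the following sketch: the index and fraction are computed with vector instructions, then spilled once to aligned scratch arrays, and the table readout plus interpolation run in a plain loop. The function name `lookup4_hybrid`, the `Entry` layout, and the `table_size` parameter are hypothetical. The SSE2 path uses standard intrinsics from `<emmintrin.h>`; on targets without SSE2 the sketch falls back to scalar code, and a NEON path is not shown. Whether this is actually faster is not something the sketch can tell you; it would need profiling.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HYBRID_USE_SSE2 1
#endif

// Hypothetical entry layout; val2 is assumed to be the interpolation slope.
struct Entry {
    float val;
    float val2;
};

inline void lookup4_hybrid(const Entry* table, std::size_t table_size,
                           const float* findex, float* result) {
    alignas(16) int iidx[4];
    alignas(16) float frac[4];

#if defined(HYBRID_USE_SSE2)
    // Vector part: truncate to int and compute the fraction for all four lanes.
    const __m128 fidx = _mm_loadu_ps(findex);
    const __m128i idx = _mm_cvttps_epi32(fidx);
    const __m128 fr = _mm_sub_ps(fidx, _mm_cvtepi32_ps(idx));
    // Spill once to aligned scratch memory instead of extracting lane by lane.
    _mm_store_si128(reinterpret_cast<__m128i*>(iidx), idx);
    _mm_store_ps(frac, fr);
#else
    // Scalar fallback with the same semantics.
    for (int i = 0; i < 4; ++i) {
        iidx[i] = static_cast<int>(findex[i]);
        frac[i] = findex[i] - static_cast<float>(iidx[i]);
    }
#endif

    // Scalar part: table readout and interpolation, left to the compiler to schedule.
    for (int i = 0; i < 4; ++i) {
        assert(iidx[i] >= 0 && static_cast<std::size_t>(iidx[i]) < table_size);
        const Entry& e = table[iidx[i]];
        result[i] = e.val + e.val2 * frac[i];
    }
}
```

The design choice here is to keep the gather in scalar code, since plain SSE has no gather instruction, and to avoid the repeated shift-and-extract sequence visible in the disassembly.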

OS X? That has nothing to do with NEON.

BTW, NEON cannot handle LUTs that large anyway. (I don't know about SSE in that matter.)

Verify first whether SSE can handle LUTs of that size. If yes, I suggest using a different compiler, since GCC tends to make intrinsucks out of intrinsics.
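The size check suggested above can be done as a small compile-time calculation. This sketch assumes the table entry is two packed floats (consistent with the `*8` index scaling in the disassembly) and compares the table size with the capacity of the in-register byte-lookup instructions: PSHUFB (SSSE3) indexes one 16-byte register, ARMv7 NEON VTBL takes up to four 64-bit registers, and AArch64 TBL takes up to four 128-bit registers. The constant and function names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical entry layout: two packed floats per table entry.
struct Entry {
    float val;
    float val2;
};

static_assert(sizeof(Entry) == 8, "expected two packed 4-byte floats per entry");

constexpr std::size_t kTableEntries = 128;
constexpr std::size_t kTableBytes = kTableEntries * sizeof(Entry);

// Capacity of in-register byte table lookups.
constexpr std::size_t kPshufbBytes = 16;     // SSSE3 PSHUFB: one 128-bit register
constexpr std::size_t kVtblArmv7Bytes = 32;  // ARMv7 NEON VTBL: up to four 64-bit registers
constexpr std::size_t kTblAarch64Bytes = 64; // AArch64 TBL: up to four 128-bit registers

constexpr bool fits_in_register_lookup(std::size_t table_bytes, std::size_t capacity) {
    return table_bytes <= capacity;
}

// The 128-entry table is far larger than any of these, so it has to be read from memory.
static_assert(!fits_in_register_lookup(kTableBytes, kPshufbBytes), "");
static_assert(!fits_in_register_lookup(kTableBytes, kVtblArmv7Bytes), "");
static_assert(!fits_in_register_lookup(kTableBytes, kTblAarch64Bytes), "");
```

Under these assumptions neither instruction set can hold the table in registers, so the lookups remain per-lane memory loads on both platforms.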

