c++ - Penalty for switching from SSE to AVX?
I'm aware of the existing penalty for switching from AVX instructions to SSE instructions without first zeroing out the upper halves of the ymm registers, but in my particular case, on my machine (i7-3939K 3.2GHz), there seems to be a large penalty for going the other way around (SSE to AVX), even if I explicitly use _mm256_zeroupper before and after the AVX code section.
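(For context, here is a minimal sketch of the standard mitigation pattern I mean; this is textbook usage, not code from my project, and it assumes buf is 32-byte aligned:)

#include <immintrin.h>

void AvxSectionThenSse(float* buf)
{
    __m256 v = _mm256_load_ps(buf);   // 256-bit AVX work
    v = _mm256_add_ps(v, v);
    _mm256_store_ps(buf, v);
    _mm256_zeroupper();               // clear dirty upper ymm halves before legacy SSE code

    __m128 w = _mm_load_ps(buf);      // legacy SSE work, now free of the AVX->SSE transition penalty
    w = _mm_mul_ps(w, w);
    _mm_store_ps(buf, w);
}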
I have written functions for converting between 32-bit floats and 32-bit fixed-point integers, operating on two buffers 32768 elements wide. I ported the SSE2 intrinsic version directly to AVX to process 8 elements at once instead of SSE's 4, expecting to see a significant performance increase, but unfortunately, the opposite happened.
So, I have two functions:
void ConvertPcm32FloatToPcm32Fixed(int32* outBuffer, const float* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U << 31);

    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(fScale);
        const __m256 vVolMax = _mm256_set1_ps(fScale - 1);
        const __m256 vVolMin = _mm256_set1_ps(-fScale);

        for (uint i = 0; i < sampleCount; i += 8)
        {
            const __m256 vIn0 = _mm256_load_ps(inBuffer + i); // aligned load
            const __m256 vVal0 = _mm256_mul_ps(vIn0, vScale);
            const __m256 vClamped0 = _mm256_min_ps(_mm256_max_ps(vVal0, vVolMin), vVolMax);
            const __m256i vFinal0 = _mm256_cvtps_epi32(vClamped0);
            _mm256_store_si256((__m256i*)(outBuffer + i), vFinal0); // aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(fScale);
        const __m128 vVolMax = _mm_set1_ps(fScale - 1);
        const __m128 vVolMin = _mm_set1_ps(-fScale);

        for (uint i = 0; i < sampleCount; i += 4)
        {
            const __m128 vIn0 = _mm_load_ps(inBuffer + i); // aligned load
            const __m128 vVal0 = _mm_mul_ps(vIn0, vScale);
            const __m128 vClamped0 = _mm_min_ps(_mm_max_ps(vVal0, vVolMin), vVolMax);
            const __m128i vFinal0 = _mm_cvtps_epi32(vClamped0);
            _mm_store_si128((__m128i*)(outBuffer + i), vFinal0); // aligned store
        }
    }
}

void ConvertPcm32FixedToPcm32Float(float* outBuffer, const int32* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U << 31);

    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(1 / fScale);

        for (uint i = 0; i < sampleCount; i += 8)
        {
            __m256i vIn0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(inBuffer + i)); // aligned load
            __m256 vVal0 = _mm256_cvtepi32_ps(vIn0);
            vVal0 = _mm256_mul_ps(vVal0, vScale);
            _mm256_store_ps(outBuffer + i, vVal0); // aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(1 / fScale);

        for (uint i = 0; i < sampleCount; i += 4)
        {
            __m128i vIn0 = _mm_load_si128(reinterpret_cast<const __m128i*>(inBuffer + i)); // aligned load
            __m128 vVal0 = _mm_cvtepi32_ps(vIn0);
            vVal0 = _mm_mul_ps(vVal0, vScale);
            _mm_store_ps(outBuffer + i, vVal0); // aligned store
        }
    }
}
So I run my test: start the timer, run ConvertPcm32FloatToPcm32Fixed followed by ConvertPcm32FixedToPcm32Float to convert straight back, and end the timer. The SSE2 versions of the functions execute in a total of 15-16 microseconds, but the AVX versions take 22-23 microseconds. A bit perplexed, I dug a bit further, and I have discovered how to make the AVX versions go faster than the SSE2 versions, but it's cheating: I run ConvertPcm32FloatToPcm32Fixed once before starting the timer, then start the timer, run ConvertPcm32FloatToPcm32Fixed again followed by ConvertPcm32FixedToPcm32Float, and stop the timer. As if there were a massive penalty for switching from SSE to AVX, if I "prime" the AVX version with a trial run first, the AVX execution time drops to 12 microseconds, while doing the same thing with the SSE equivalents only drops their time by a microsecond, down to 14, making AVX the marginal winner here, but only if I cheat. I considered that maybe AVX doesn't play as nicely with the cache as SSE does, but using _mm_prefetch does nothing either.
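(In case it matters, this is roughly the shape of my harness; the original timer mechanism isn't shown here, so std::chrono and the TimeRoundTrip wrapper are stand-ins, and the buffers and the int32/uint typedefs are assumed to match the code above:)

#include <chrono>
#include <cstdio>

void TimeRoundTrip(int32* fixedBuf, float* floatBuf, uint sampleCount, bool bUseAvx)
{
    // The untimed "priming" call that reproduces the cheat described above:
    ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, sampleCount, bUseAvx);

    const auto t0 = std::chrono::high_resolution_clock::now();
    ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, sampleCount, bUseAvx);
    ConvertPcm32FixedToPcm32Float(floatBuf, fixedBuf, sampleCount, bUseAvx);
    const auto t1 = std::chrono::high_resolution_clock::now();

    std::printf("%lld us\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
}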
What am I missing here?
I did not test your code, but since your test appears quite short, maybe you're seeing the floating-point warm-up effect that Agner Fog discusses on p. 101 of his microarchitecture manual (this applies to the Sandy Bridge architecture). I quote:
The processor is in a cold state when it has not seen any floating point instructions for a while. The latency for 256-bit vector additions and multiplications is initially two clocks longer than the ideal number, then one clock longer, and after several hundred floating point instructions the processor goes to the warm state where the latencies are 3 and 5 clocks respectively. The throughput is half the ideal value for 256-bit vector operations in the cold state. 128-bit vector operations are less affected by this warm-up effect. The latency of 128-bit vector additions and multiplications is at most one clock cycle longer than the ideal value, and the throughput is not reduced in the cold state.
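If that warm-up effect is the culprit, one way to check (a sketch of my suggestion, not code I have benchmarked) is to issue a few hundred dummy 256-bit floating point operations immediately before the timed region, so the execution units are already in the warm state; the iteration count here is a guess:

#include <immintrin.h>

static void WarmUpAvx()
{
    __m256 v = _mm256_set1_ps(1.0f);
    for (int i = 0; i < 500; ++i)
        v = _mm256_mul_ps(v, v); // dependent chain of 256-bit FP multiplies

    // Consume the result so the optimizer cannot delete the loop.
    volatile float sink = _mm_cvtss_f32(_mm256_castps256_ps128(v));
    (void)sink;
    _mm256_zeroupper(); // leave the registers clean for any subsequent SSE code
}

Calling WarmUpAvx() right before your timed AVX loops should have much the same effect as your "priming" trial run; if the timings then match the primed numbers, the warm-up effect is your answer.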