Accelerating Multimedia Applications using the Intel SSE and AVX ISA. Min Li, 05/08/2013.
Intel SSE and AVX ISA
• Intel ISA extensions
  • SSE1, SSE2, SSE3, SSE4 (SSE4.1, SSE4.2)
  • SSE4.2 is specialized for string and text processing (suitable for applications like template matching and genome sequence comparison)
• AVX (mainly for floating-point operations)
  • AVX1: 256 bits
  • AVX2: 256 bits (with some instruction extensions)
• XMM and YMM registers
  • XMM: 128 bits
  • YMM: 256 bits
Intel OpenCV Library
• OpenCV library
  • Covers a wide variety of multimedia applications
  • Object detection, face recognition, image processing…
• Good candidate for speedup with the Intel SSE or AVX ISA
  • Computationally intensive
• I made videos on YouTube showing some tricks for using the OpenCV library:
  https://www.youtube.com/watch?v=ISap9zEGE2I
  https://www.youtube.com/watch?v=pqSgT0quMBc
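Guidelines for enabling the ISA
• Intel SSE and AVX
  • Run cat /proc/cpuinfo and make sure the SSE and AVX flags are listed. Otherwise enable them.
• As the output shows:
  • All SSE ISA extensions are available
  • However, only AVX1 is available, so the 256-bit YMM registers can be used for floating-point operations only; integer SIMD is still limited to the 128-bit XMM registers
• Note: AVX2 first shipped with Intel's Haswell processors in mid-2013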
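The flag check above can also be done from code. The following is a minimal sketch, assuming a Linux system whose /proc/cpuinfo has a line starting with "flags"; the helper names has_cpu_flag and cpuinfo_has_flag are hypothetical and not part of any library. Flags are matched as whole words so that "sse" does not match inside "sse4_1".

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` occurs as a whole
 * whitespace-delimited word in `flags_line`, otherwise 0. */
static int has_cpu_flag(const char *flags_line, const char *name)
{
    size_t len = strlen(name);
    const char *p = flags_line;

    assert(len > 0);
    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == flags_line) || isspace((unsigned char)p[-1]);
        int ends = (p[len] == '\0') || isspace((unsigned char)p[len]);
        if (starts && ends)
            return 1;
        p++; /* keep searching after a partial match such as "ssse3" */
    }
    return 0;
}

/* Hypothetical helper, Linux only (assumption): scans /proc/cpuinfo for the
 * first "flags" line. Returns 1 if the flag is listed, 0 if it is not,
 * and -1 if the file or the flags line is unavailable. */
static int cpuinfo_has_flag(const char *name)
{
    char line[8192];
    int result = -1;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            result = has_cpu_flag(line, name);
            break;
        }
    }
    fclose(f);
    return result;
}
```

For example, cpuinfo_has_flag("avx") and cpuinfo_has_flag("avx2") distinguish a machine with AVX1 only from one that also supports AVX2.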
Acceleration Case I
After modification:
    int chunk = length / 4;                 /* elements beyond 4 * chunk are not covered by this loop */
    for (i = 0; i < chunk; i++) {
        __m128 m0, m1, m2;
        m0 = _mm_load_ps(&d1[4 * i]);       /* aligned load: d1 and d2 must be 16-byte aligned */
        m1 = _mm_load_ps(&d2[4 * i]);
        m1 = _mm_sub_ps(m0, m1);            /* four differences */
        m1 = _mm_mul_ps(m1, m1);            /* four squared differences */
        m1 = _mm_hadd_ps(m1, m1);           /* SSE3: pairwise sums */
        m2 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2,3,0,1));
        m1 = _mm_add_ps(m1, m2);            /* every lane now holds the sum of the four squares */
        total_cost += _mm_cvtss_f32(m1);    /* read lane 0 */
        if (total_cost > best) break;
    }
Original:
    for (int i = 0; i < length; i += 4) {
        double t0 = d1[i] - d2[i];
        double t1 = d1[i+1] - d2[i+1];
        double t2 = d1[i+2] - d2[i+2];
        double t3 = d1[i+3] - d2[i+3];
        total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
    }
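The Case I loop can be sketched as a complete function. This is a minimal sketch under stated assumptions, not the code from the slides: the function name ssd_early_exit and its signature are hypothetical, unaligned loads replace _mm_load_ps so the inputs need no special alignment, and the horizontal sum uses SSE1 shuffles instead of the SSE3 _mm_hadd_ps so that it builds without extra compiler flags. A scalar path is used when SSE is not available, and leftover elements (length not a multiple of 4) are handled at the end.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: sum of squared differences, processed four elements
 * at a time. Returns early once the running total exceeds `best`. */
static float ssd_early_exit(const float *d1, const float *d2,
                            size_t length, float best)
{
    float total = 0.0f;
    size_t i = 0;

    assert(d1 != NULL && d2 != NULL);
    for (; i + 4 <= length; i += 4) {
#if defined(__SSE__)
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(d1 + i), _mm_loadu_ps(d2 + i));
        __m128 sq = _mm_mul_ps(diff, diff);
        /* lanes 0 and 1 become s0+s2 and s1+s3 */
        __m128 t = _mm_add_ps(sq, _mm_movehl_ps(sq, sq));
        /* lane 0 becomes (s0+s2) + (s1+s3) */
        t = _mm_add_ss(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1)));
        total += _mm_cvtss_f32(t);
#else
        float s = 0.0f;
        for (size_t k = 0; k < 4; k++) {
            float t = d1[i + k] - d2[i + k];
            s += t * t;
        }
        total += s;
#endif
        if (total > best)
            return total; /* this candidate can no longer beat the best one */
    }
    for (; i < length; i++) { /* scalar tail */
        float t = d1[i] - d2[i];
        total += t * t;
    }
    return total;
}
```

Checking the threshold once per group of four keeps the branch out of the arithmetic, at the cost of possibly processing up to three elements more than strictly necessary.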
Acceleration Case II
After modification:
    /* minArray, maxArray, minPos and maxPos are assumed to be initialized from the first chunk (i = 0) */
    __m128 m0, m1, m2, m3, m4, minArray, maxArray;
    int chunk = N / 4;
    for (i = 1; i < chunk; i++) {
        m0 = _mm_load_ps((const float*)it.ptr);
        it += 4;
        m1 = _mm_min_ps(m0, minArray);                /* per-lane running minimum */
        m2 = _mm_max_ps(m0, maxArray);                /* per-lane running maximum */
        m3 = _mm_cmp_ps(m0, minArray, _CMP_LT_OS);    /* AVX: all-ones lanes where a new minimum appeared */
        m4 = _mm_cmp_ps(m0, maxArray, _CMP_GT_OS);    /* AVX: all-ones lanes where a new maximum appeared */
        int* mask1 = (int*) &m3;
        int* mask2 = (int*) &m4;
        for (int j = 0; j < 4; j++) {
            if (mask1[j] == -1) minPos[j] = 4 * i + j;
            if (mask2[j] == -1) maxPos[j] = 4 * i + j;
        }
        minArray = m1;                                /* carry the running minimum forward */
        maxArray = m2;                                /* carry the running maximum forward */
    }
Original:
    float minval = FLT_MAX, maxval = -FLT_MAX;
    for (i = 0; i < N; i++, ++it) {
        float v = *(const float*)it.ptr;
        if (v < minval) { minval = v; minidx = it.node()->idx; }
        if (v > maxval) { maxval = v; maxidx = it.node()->idx; }
    }
    if (_minval) *_minval = minval;
    if (_maxval) *_maxval = maxval;
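The Case II idea (per-lane running min/max plus per-lane positions, followed by a final reduction over the four lanes) can be sketched on a plain float array. This is a minimal sketch under stated assumptions: the type minmax_result and the function minmax_with_pos are hypothetical, the input is a dense array rather than the OpenCV iterator used on the slide, and the SSE1 comparisons _mm_cmplt_ps / _mm_cmpgt_ps with _mm_movemask_ps stand in for the AVX _mm_cmp_ps so that no extra compiler flags are needed.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical result type for this sketch. */
typedef struct {
    float minval, maxval;
    size_t minidx, maxidx;
} minmax_result;

/* Hypothetical helper: min/max and the index of their first occurrence. */
static minmax_result minmax_with_pos(const float *v, size_t n)
{
    minmax_result r = { FLT_MAX, -FLT_MAX, 0, 0 };
    size_t i = 0;

    assert(v != NULL || n == 0);
#if defined(__SSE__)
    if (n >= 4) {
        __m128 vmin = _mm_loadu_ps(v), vmax = vmin; /* first chunk seeds the lanes */
        size_t minpos[4] = {0, 1, 2, 3}, maxpos[4] = {0, 1, 2, 3};
        float lmin[4], lmax[4];

        for (i = 4; i + 4 <= n; i += 4) {
            __m128 x = _mm_loadu_ps(v + i);
            int lt = _mm_movemask_ps(_mm_cmplt_ps(x, vmin)); /* bit j: new min in lane j */
            int gt = _mm_movemask_ps(_mm_cmpgt_ps(x, vmax)); /* bit j: new max in lane j */
            for (int j = 0; j < 4; j++) {
                if (lt & (1 << j)) minpos[j] = i + j;
                if (gt & (1 << j)) maxpos[j] = i + j;
            }
            vmin = _mm_min_ps(x, vmin);
            vmax = _mm_max_ps(x, vmax);
        }
        _mm_storeu_ps(lmin, vmin);
        _mm_storeu_ps(lmax, vmax);
        /* Reduce the four lanes; ties go to the smallest index. */
        for (int j = 0; j < 4; j++) {
            if (lmin[j] < r.minval || (lmin[j] == r.minval && minpos[j] < r.minidx)) {
                r.minval = lmin[j];
                r.minidx = minpos[j];
            }
            if (lmax[j] > r.maxval || (lmax[j] == r.maxval && maxpos[j] < r.maxidx)) {
                r.maxval = lmax[j];
                r.maxidx = maxpos[j];
            }
        }
    }
#endif
    for (; i < n; i++) { /* scalar tail, or the whole array without SSE */
        if (v[i] < r.minval) { r.minval = v[i]; r.minidx = i; }
        if (v[i] > r.maxval) { r.maxval = v[i]; r.maxidx = i; }
    }
    return r;
}
```

The inner loop over the four mask bits is the scalar bookkeeping the slide refers to: the values update in one instruction, but the positions do not, which limits the speedup.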
Load of Structures
• Structures like this:
    typedef struct point_ { int x; int y; } point;
    point* points;
• _mm_load_ only takes consecutive memory!
• What does it look like inside the XMM register?
    points[0].x  points[0].y  points[1].x  points[1].y  . . .
• How to achieve the following using the SSE and AVX ISA?
    Loaded:  X0 Y0 X1 Y1 X2 Y2 X3 Y3
    Wanted:  X0 X1 X2 X3 Y0 Y1 Y2 Y3
Not easy!!!
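To make the layout problem concrete, here is a minimal sketch of one way to separate the x and y fields of four points with 128-bit registers only. It is an alternative illustration, not the method from the slides: two XMM loads each pick up two points, and _mm_shuffle_ps selects the even (x) or odd (y) elements from the pair. The function name split_xy_128 is hypothetical, and a scalar path is used where SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef struct point_ { int x; int y; } point;

/* Hypothetical helper: converts four points to floats and writes
 * xs = {x0, x1, x2, x3} and ys = {y0, y1, y2, y3}. */
static void split_xy_128(const point *pts, float xs[4], float ys[4])
{
    assert(pts != NULL && xs != NULL && ys != NULL);
#if defined(__SSE2__)
    /* lo = x0 y0 x1 y1, hi = x2 y2 x3 y3 (as floats) */
    __m128 lo = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&pts[0]));
    __m128 hi = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&pts[2]));
    /* elements 0 and 2 of lo, then elements 0 and 2 of hi */
    _mm_storeu_ps(xs, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)));
    /* elements 1 and 3 of lo, then elements 1 and 3 of hi */
    _mm_storeu_ps(ys, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)));
#else
    for (size_t k = 0; k < 4; k++) {
        xs[k] = (float)pts[k].x;
        ys[k] = (float)pts[k].y;
    }
#endif
}
```

Even in this simple form, every group of four points costs two loads, two conversions and two shuffles before any useful arithmetic starts.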
Permute and blend
    __m256i temp = _mm256_load_si256((__m256i*) &points[4 * i]);    /* 4 points = 8 ints; needs 32-byte alignment */
    __m256 temp2 = _mm256_cvtepi32_ps(temp);                         /* int -> float */
    v4si mask1 = {9,8,8,9};
    __m256 temp3 = _mm256_permutevar_ps(temp2, mask1);               /* reorder within each 128-bit half */
    __m256 temp4 = _mm256_permute2f128_ps(temp3, temp3, 0x01);       /* swap the two 128-bit halves */
    temp3 = _mm256_blend_ps(temp3, temp4, 0b00110011);               /* merge elements from both versions */
    v4si mask2 = {0xd,4,4,0xd};
    temp3 = _mm256_permutevar_ps(temp3, mask2);                      /* final reorder within each half */
    __m128 m1 = _mm256_extractf128_ps(temp3, 1);                     /* upper half */
    __m128 m2 = _mm256_extractf128_ps(temp3, 0);                     /* lower half */
(Diagram: register contents after each step, from the interleaved X0 Y0 X1 Y1 X2 Y2 X3 Y3 layout to separate X and Y vectors.)
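The permute / lane swap / blend / permute sequence can be written out end to end. The following is a minimal sketch under stated assumptions, not the exact code from the slide: the selector and blend constants are chosen here for illustration, the names split_xy_256, split_xy_avx and split_xy_scalar are hypothetical, unaligned loads are used, and the AVX path is compiled with a GCC/Clang target attribute and chosen at run time with __builtin_cpu_supports, falling back to scalar code elsewhere.

```c
#include <assert.h>
#include <stddef.h>

typedef struct point_ { int x; int y; } point;

/* Scalar reference and fallback. */
static void split_xy_scalar(const point *pts, float xs[4], float ys[4])
{
    for (size_t k = 0; k < 4; k++) {
        xs[k] = (float)pts[k].x;
        ys[k] = (float)pts[k].y;
    }
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define SPLIT_XY_HAVE_AVX_PATH 1
#include <immintrin.h>

__attribute__((target("avx")))
static void split_xy_avx(const point *pts, float xs[4], float ys[4])
{
    /* v = x0 y0 x1 y1 | x2 y2 x3 y3 */
    __m256 v = _mm256_cvtepi32_ps(_mm256_loadu_si256((const __m256i *)pts));
    /* p = x0 x1 y0 y1 | y2 y3 x2 x3 (indices are relative to each half) */
    __m256 p = _mm256_permutevar_ps(v, _mm256_setr_epi32(0, 2, 1, 3, 1, 3, 0, 2));
    /* s = y2 y3 x2 x3 | x0 x1 y0 y1 (halves swapped) */
    __m256 s = _mm256_permute2f128_ps(p, p, 0x01);
    /* b = x0 x1 x2 x3 | y2 y3 y0 y1 (elements 2, 3, 6, 7 taken from s) */
    __m256 b = _mm256_blend_ps(p, s, 0xCC);
    /* r = x0 x1 x2 x3 | y0 y1 y2 y3 */
    __m256 r = _mm256_permutevar_ps(b, _mm256_setr_epi32(0, 1, 2, 3, 2, 3, 0, 1));
    _mm_storeu_ps(xs, _mm256_extractf128_ps(r, 0));
    _mm_storeu_ps(ys, _mm256_extractf128_ps(r, 1));
}
#endif

/* Hypothetical entry point: uses AVX when the CPU reports it. */
static void split_xy_256(const point *pts, float xs[4], float ys[4])
{
    assert(pts != NULL && xs != NULL && ys != NULL);
#if defined(SPLIT_XY_HAVE_AVX_PATH)
    if (__builtin_cpu_supports("avx")) {
        split_xy_avx(pts, xs, ys);
        return;
    }
#endif
    split_xy_scalar(pts, xs, ys);
}
```

With AVX1 the in-register permutes cannot move elements across the two 128-bit halves, which is why the lane swap and blend are needed before the final reorder.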
Simulation Results
• Too much overhead for loading structures
• Not only the min/max values have to be found, but also their positions
Conclusion and future work
• OpenCV is well suited to SSE or AVX acceleration
• A single task has a better chance of getting a speedup
• Loading and arranging a structure is really cumbersome
• Hints for smart automated compilation (such as loading structures)
• Suggestions for expanding the ISA (introducing new instructions)