Accelerating Multimedia Applications using the Intel SSE and AVX ISA

Min Li, 05/08/2013

Intel SSE and AVX ISA
• Intel ISA
  • SSE1, SSE2, SSE3, SSE4 (SSE4.1, SSE4.2)
  • SSE4.2: specialized for string and text applications (suitable for applications like template matching and genome sequence comparison)
• AVX (mainly for floating-point operations)
  • AVX1: 256 bits
  • AVX2: 256 bits (with additional instructions, notably 256-bit integer operations)
• XMM and YMM registers
  • XMM: 128 bits
  • YMM: 256 bits

Intel OpenCV Library
• OpenCV library
  • Covers a variety of multimedia applications
  • Object detection, face recognition, image processing…
• Good candidate for speedup with the Intel SSE or AVX ISA
  • Computation-intensive
• I made two videos on YouTube showing some tricks for using the OpenCV library:
  https://www.youtube.com/watch?v=ISap9zEGE2I
  https://www.youtube.com/watch?v=pqSgT0quMBc

Guidelines for enabling the ISA
• Intel SSE and AVX
  • Run: cat /proc/cpuinfo
  • Make sure SSE and AVX appear in the flags. Otherwise enable them.
• As you can see from the flags
  • All SSE extensions are available
  • However, only AVX1 is available, so 256-bit YMM operations are limited to floating point; integer SIMD can only use the 128-bit XMM registers
• Note: AVX2 is very recent (it first ships with Intel's Haswell processors in 2013)
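The same check can also be done from inside a program. The sketch below is one possible way, assuming GCC or Clang on an x86 target: it uses the compiler builtins `__builtin_cpu_init` and `__builtin_cpu_supports`. The `simd_support` struct and the function names are made up for this illustration and are not part of the original slides.

```c
#include <stdio.h>

/* Hypothetical helper type for this sketch: one flag per ISA extension. */
typedef struct {
    int sse2;
    int sse4_2;
    int avx;
    int avx2;
} simd_support;

/* Query the running CPU using GCC/Clang builtins (x86 targets only). */
simd_support query_simd_support(void)
{
    simd_support s;
    __builtin_cpu_init();  /* safe to call more than once */
    s.sse2   = __builtin_cpu_supports("sse2")   != 0;
    s.sse4_2 = __builtin_cpu_supports("sse4.2") != 0;
    s.avx    = __builtin_cpu_supports("avx")    != 0;
    s.avx2   = __builtin_cpu_supports("avx2")   != 0;
    return s;
}

/* Print a short report, similar to scanning the flags in /proc/cpuinfo. */
void print_simd_support(void)
{
    simd_support s = query_simd_support();
    printf("sse2=%d sse4.2=%d avx=%d avx2=%d\n",
           s.sse2, s.sse4_2, s.avx, s.avx2);
}
```

A program could call `query_simd_support()` once at startup and pick the scalar, SSE or AVX code path accordingly.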

Acceleration Case I
After modification:
    /* Assumes d1 and d2 are 16-byte-aligned float arrays and length is a multiple of 4. */
    int chunk = length / 4;
    for (i = 0; i < chunk; i++) {
        __m128 m0, m1, m2;
        m0 = _mm_load_ps(&d1[4 * i]);
        m1 = _mm_load_ps(&d2[4 * i]);
        m1 = _mm_sub_ps(m0, m1);        /* four differences */
        m1 = _mm_mul_ps(m1, m1);        /* four squared differences */
        m1 = _mm_hadd_ps(m1, m1);       /* pairwise sums (needs SSE3) */
        m2 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2,3,0,1));
        m1 = _mm_add_ps(m1, m2);        /* every lane now holds the sum of all four */
        total_cost += ((float*)&m1)[0];
        if (total_cost > best)
            break;
    }
Original:
    for (int i = 0; i < length; i += 4) {
        double t0 = d1[i] - d2[i];
        double t1 = d1[i+1] - d2[i+1];
        double t2 = d1[i+2] - d2[i+2];
        double t3 = d1[i+3] - d2[i+3];
        total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
    }
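For reference, here is a self-contained variant of the Case I idea. It is only a sketch under a few assumptions that are not in the slides: the inputs are plain float arrays, they may be unaligned (so `_mm_loadu_ps` is used), the length need not be a multiple of 4 (a scalar loop handles the tail), and the horizontal sum uses SSE1 shuffles instead of `_mm_hadd_ps` so that no SSE3 flag is needed. The function names `hsum_ps` and `squared_distance_sse` are made up for this example.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Sum the four lanes of v using SSE1 shuffles only. */
static float hsum_ps(__m128 v)
{
    __m128 hi    = _mm_movehl_ps(v, v);   /* [v2, v3, v2, v3] */
    __m128 sum2  = _mm_add_ps(v, hi);     /* lane0 = v0+v2, lane1 = v1+v3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}

/* Sum of squared differences with early exit once the cost exceeds best. */
float squared_distance_sse(const float *d1, const float *d2,
                           size_t length, float best)
{
    float total_cost = 0.0f;
    size_t i = 0;

    /* Vector part: four elements per iteration. */
    for (; i + 4 <= length; i += 4) {
        __m128 a    = _mm_loadu_ps(d1 + i);
        __m128 b    = _mm_loadu_ps(d2 + i);
        __m128 diff = _mm_sub_ps(a, b);
        total_cost += hsum_ps(_mm_mul_ps(diff, diff));
        if (total_cost > best)
            return total_cost;  /* early exit, as in the slide */
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < length; i++) {
        float t = d1[i] - d2[i];
        total_cost += t * t;
    }
    return total_cost;
}
```

Doing the horizontal sum in every iteration keeps the early exit of the slide version; accumulating in a vector register and reducing once at the end would be cheaper per iteration but could only test the exit condition less often.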

Acceleration Case II
After modification:
    /* minArray, maxArray, minPos and maxPos are assumed to be initialized from
       the first four elements (chunk 0) before the loop, which starts at i = 1. */
    __m128 m0, m1, m2, m3, m4, minArray, maxArray;
    int chunk = N / 4;
    for (i = 1; i < chunk; i++) {
        m0 = _mm_load_ps((const float*)it.ptr);
        it += 4;
        m1 = _mm_min_ps(m0, minArray);               /* per-lane running minimum */
        m2 = _mm_max_ps(m0, maxArray);               /* per-lane running maximum */
        m3 = _mm_cmp_ps(m0, minArray, _CMP_LT_OS);   /* all-ones where a new minimum appears (AVX encoding) */
        m4 = _mm_cmp_ps(m0, maxArray, _CMP_GT_OS);   /* all-ones where a new maximum appears */
        int* mask1 = (int*)&m3;
        int* mask2 = (int*)&m4;
        for (int j = 0; j < 4; j++) {
            if (mask1[j] == -1) minPos[j] = 4 * i + j;
            if (mask2[j] == -1) maxPos[j] = 4 * i + j;
        }
        minArray = m1;   /* carry the values forward, not the compare masks */
        maxArray = m2;
    }
Original:
    float minval = FLT_MAX, maxval = -FLT_MAX;
    for (i = 0; i < N; i++, ++it) {
        float v = *(const float*)it.ptr;
        if (v < minval) {
            minval = v;
            minidx = it.node()->idx;
        }
        if (v > maxval) {
            maxval = v;
            maxidx = it.node()->idx;
        }
    }
    if (_minval) *_minval = minval;
    if (_maxval) *_maxval = maxval;
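The Case II idea can be tried out on a plain array with the sketch below. It is not the OpenCV code: it assumes a contiguous float array instead of the sparse-matrix iterator, uses the SSE1 compares `_mm_cmplt_ps` and `_mm_cmpgt_ps` with `_mm_movemask_ps` rather than the AVX-encoded `_mm_cmp_ps`, and adds the final reduction across the four lanes. The type `minmax_result` and the function `minmax_loc_sse` are names invented for this example.

```c
#include <float.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical result type: extreme values and the index of their first occurrence. */
typedef struct {
    float  minval, maxval;
    size_t minidx, maxidx;
} minmax_result;

minmax_result minmax_loc_sse(const float *v, size_t n)
{
    minmax_result r = { FLT_MAX, -FLT_MAX, 0, 0 };
    size_t i = 0;

    if (n >= 4) {
        /* Chunk 0 initializes the per-lane running min/max and positions. */
        __m128 minv = _mm_loadu_ps(v), maxv = minv;
        size_t minpos[4] = {0, 1, 2, 3}, maxpos[4] = {0, 1, 2, 3};
        float mins[4], maxs[4];

        for (i = 4; i + 4 <= n; i += 4) {
            __m128 x = _mm_loadu_ps(v + i);
            /* Bit j of each mask is set when lane j found a new extreme. */
            int lt = _mm_movemask_ps(_mm_cmplt_ps(x, minv));
            int gt = _mm_movemask_ps(_mm_cmpgt_ps(x, maxv));
            for (int j = 0; j < 4; j++) {
                if (lt & (1 << j)) minpos[j] = i + j;
                if (gt & (1 << j)) maxpos[j] = i + j;
            }
            minv = _mm_min_ps(x, minv);
            maxv = _mm_max_ps(x, maxv);
        }

        /* Reduce the four lanes; ties go to the smaller index. */
        _mm_storeu_ps(mins, minv);
        _mm_storeu_ps(maxs, maxv);
        for (int j = 0; j < 4; j++) {
            if (mins[j] < r.minval ||
                (mins[j] == r.minval && minpos[j] < r.minidx)) {
                r.minval = mins[j];
                r.minidx = minpos[j];
            }
            if (maxs[j] > r.maxval ||
                (maxs[j] == r.maxval && maxpos[j] < r.maxidx)) {
                r.maxval = maxs[j];
                r.maxidx = maxpos[j];
            }
        }
    }

    /* Scalar tail (or the whole array when n < 4). */
    for (; i < n; i++) {
        if (v[i] < r.minval) { r.minval = v[i]; r.minidx = i; }
        if (v[i] > r.maxval) { r.maxval = v[i]; r.maxidx = i; }
    }
    return r;
}
```

The inner loop over the mask bits is the position bookkeeping that the scalar version gets for free; it is a large part of why tracking the position costs more than finding the min/max values alone.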

Load of Structures
• Structures like this:
    typedef struct point_ { int x; int y; } point;
    point* points;
• _mm_load_ only takes consecutive memory!
• What does it look like inside the register?
    points[0].x points[0].y points[1].x points[1].y . . .
    X0 Y0 X1 Y1 X2 Y2 X3 Y3
• How to achieve the following using the SSE and AVX ISA?
    X0 X1 X2 X3 and Y0 Y1 Y2 Y3
• Not easy!!!
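To make the problem concrete, the sketch below separates the x and y fields of four points using only 128-bit SSE2 instructions. This is a different route from the AVX permute-and-blend sequence on the next slide, shown here only as an illustration; the function name `load4_points_soa` is made up, and the conversion to float follows what the slides do with `_mm256_cvtepi32_ps`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

typedef struct point_ { int x; int y; } point;

/* Load four consecutive points and split them into xs[] and ys[] as floats. */
void load4_points_soa(const point *p, float xs[4], float ys[4])
{
    /* Two unaligned 128-bit loads: a = [x0 y0 x1 y1], b = [x2 y2 x3 y3]. */
    __m128i a = _mm_loadu_si128((const __m128i *)&p[0]);
    __m128i b = _mm_loadu_si128((const __m128i *)&p[2]);

    /* Convert the 32-bit integers to single-precision floats. */
    __m128 af = _mm_cvtepi32_ps(a);
    __m128 bf = _mm_cvtepi32_ps(b);

    /* Even lanes of both registers are the x values, odd lanes the y values. */
    __m128 x = _mm_shuffle_ps(af, bf, _MM_SHUFFLE(2, 0, 2, 0)); /* x0 x1 x2 x3 */
    __m128 y = _mm_shuffle_ps(af, bf, _MM_SHUFFLE(3, 1, 3, 1)); /* y0 y1 y2 y3 */

    _mm_storeu_ps(xs, x);
    _mm_storeu_ps(ys, y);
}
```

Even in this short form, every group of four points costs two loads, two conversions and two shuffles before any real computation starts, which is the overhead the slides complain about.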

Permute and blend
    __m256i temp = _mm256_load_si256((__m256i*) &points[4 * i]);   /* four points = eight ints */
    __m256 temp2 = _mm256_cvtepi32_ps(temp);
    v4si mask1 = {9,8,8,9};
    __m256 temp3 = _mm256_permutevar_ps(temp2, mask1);
    __m256 temp4 = _mm256_permute2f128_ps(temp3, temp3, 0x01);     /* swap the two 128-bit halves */
    temp3 = _mm256_blend_ps(temp3, temp4, 0b00110011);
    v4si mask2 = {0xd,4,4,0xd};
    temp3 = _mm256_permutevar_ps(temp3, mask2);                    /* reorder the blended result */
    __m128 m1 = _mm256_extractf128_ps(temp3, 1);
    __m128 m2 = _mm256_extractf128_ps(temp3, 0);
(Diagram: X/Y contents of the registers after each step.)

Simulation Results
• Too much overhead for loading structures
• Not only finding the min/max, but also the position

Conclusion and future work
• OpenCV is suitable for SSE or AVX acceleration
• A single task has a better chance of getting a speedup
• Loading and arranging a structure is a really cumbersome task
• Hints for smart automated compilation (such as loading structures)
• Suggestions for the expansion of the ISA (new instructions introduced)
