Implementation of MPEG2 Codec with MMX/SSE/SSE2 Technology

Speakers: Rong Jiang, Xu Jin
Instructor: Yu-Hen Hu

Outline
• Introduction
• MMX/SSE/SSE2
• MPEG 2 Video Compression
• What we have done
• Conclusion

MMX/SSE/SSE2
• MMX
  • 57 new instructions;
  • 8 64-bit wide MMX registers;
  • 4 new data types (3 packed data types and 1 64-bit entity).
• SSE
  • 8 new 128-bit SIMD floating-point registers;
  • 50 new instructions that work on packed floating-point data;
  • 8 new instructions to control data cacheability;
  • 12 new instructions that extend the MMX instruction set.
• SSE2
  • Supports 64-bit (double-precision) floating-point values.

MPEG 2 video compression

Project outline
1. Obtain an MPEG2 encoder/decoder implementation in C
2. Generate profiling information
3. Identify the kernels
4. Rewrite the kernels using SSE
5. Measure performance results

Profiling results of the original code
[Figure: profiling charts for mpeg2decode and mpeg2encode; labeled kernels: idct(), dist1(), fdct()]

Example 1 – optimizing dist1()

This code segment calculates the residual matrices in the prediction stage of the encoder.

Original C code (one 16-pixel row, fully unrolled):

    if ((v = p1[0] - p2[0]) < 0) v = -v; s += v;
    if ((v = p1[1] - p2[1]) < 0) v = -v; s += v;
    if ((v = p1[2] - p2[2]) < 0) v = -v; s += v;
    if ((v = p1[3] - p2[3]) < 0) v = -v; s += v;
    if ((v = p1[4] - p2[4]) < 0) v = -v; s += v;
    if ((v = p1[5] - p2[5]) < 0) v = -v; s += v;
    if ((v = p1[6] - p2[6]) < 0) v = -v; s += v;
    if ((v = p1[7] - p2[7]) < 0) v = -v; s += v;
    if ((v = p1[8] - p2[8]) < 0) v = -v; s += v;
    if ((v = p1[9] - p2[9]) < 0) v = -v; s += v;
    if ((v = p1[10] - p2[10]) < 0) v = -v; s += v;
    if ((v = p1[11] - p2[11]) < 0) v = -v; s += v;
    if ((v = p1[12] - p2[12]) < 0) v = -v; s += v;
    if ((v = p1[13] - p2[13]) < 0) v = -v; s += v;
    if ((v = p1[14] - p2[14]) < 0) v = -v; s += v;
    if ((v = p1[15] - p2[15]) < 0) v = -v; s += v;

SSE2 replacement (GCC inline assembly):

    asm volatile (
        "movdqu  (%1), %%xmm0\n\t"      /* load 16 bytes from p1 */
        "movdqu  (%2), %%xmm1\n\t"      /* load 16 bytes from p2 */
        "psadbw  %%xmm0, %%xmm1\n\t"    /* two partial sums, one per 64-bit half */
        "movdq2q %%xmm1, %%mm0\n\t"     /* low partial sum */
        "psrldq  $8, %%xmm1\n\t"        /* shift the high half down to the low half */
        "movdq2q %%xmm1, %%mm1\n\t"     /* high partial sum */
        "paddd   %%mm1, %%mm0\n\t"      /* add the two partial sums */
        "movd    %%mm0, %0"             /* store the result in s */
        : "=r"(s)
        : "r"(p1), "r"(p2)
        : "xmm0", "xmm1", "mm0", "mm1", "memory");

4-5X speed-up, but it can be faster!
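The same 16-byte sum of absolute differences can also be written with compiler intrinsics instead of inline assembly. The following is a minimal sketch, not the code used in the project: the function names `sad16_scalar` and `sad16_sse2` are hypothetical, and the SSE2 path is only compiled when the compiler defines `__SSE2__` (otherwise it falls back to the scalar loop).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Scalar reference: sum of absolute differences over one 16-pixel row. */
static int sad16_scalar(const uint8_t *p1, const uint8_t *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += abs((int)p1[i] - (int)p2[i]);
    return s;
}

/* Intrinsics version (hypothetical name). _mm_sad_epu8 is the intrinsic
 * form of psadbw: it yields two partial sums, one per 64-bit half. */
static int sad16_sse2(const uint8_t *p1, const uint8_t *p2)
{
    assert(p1 != NULL && p2 != NULL);
#if defined(__SSE2__)
    __m128i a   = _mm_loadu_si128((const __m128i *)p1);  /* unaligned load */
    __m128i b   = _mm_loadu_si128((const __m128i *)p2);
    __m128i sad = _mm_sad_epu8(a, b);
    __m128i hi  = _mm_srli_si128(sad, 8);   /* bring the upper half down */
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
#else
    return sad16_scalar(p1, p2);            /* fallback when SSE2 is not enabled */
#endif
}
```

Because the result never leaves the XMM registers, this variant needs no MMX registers at all.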

Four ways to write super-fast code
• Rearrange data fetching to maximize cache hits;
• Unroll loops to eliminate unnecessary branches;
• Utilize SSE instructions to take full advantage of parallelism;
• Apply code scheduling to exploit the multiple-issue capability of the Pentium 4's superscalar microarchitecture.
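As an illustration of the unrolling and scheduling points, the sketch below unrolls an absolute-difference loop by four and keeps four independent accumulators, so consecutive additions do not depend on each other. It is only a sketch under stated assumptions: the names `sad_simple`, `absdiff`, and `sad_unrolled4` are hypothetical, and the unrolled version assumes the length is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Absolute difference of two bytes. */
static unsigned absdiff(uint8_t x, uint8_t y)
{
    return x > y ? (unsigned)(x - y) : (unsigned)(y - x);
}

/* Straightforward loop: one loop branch and one dependent add per element. */
static unsigned sad_simple(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned s = 0;
    for (size_t i = 0; i < n; i++)
        s += absdiff(a[i], b[i]);
    return s;
}

/* Unrolled by four: one loop branch per four elements, and four
 * accumulators that a superscalar core can update independently.
 * Assumes n is a multiple of 4. */
static unsigned sad_unrolled4(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 4 == 0);
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += absdiff(a[i],     b[i]);
        s1 += absdiff(a[i + 1], b[i + 1]);
        s2 += absdiff(a[i + 2], b[i + 2]);
        s3 += absdiff(a[i + 3], b[i + 3]);
    }
    return (s0 + s1) + (s2 + s3);
}
```

Whether this helps in practice depends on the compiler and CPU; an optimizing compiler may already perform a similar transformation.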

Example 2 – optimizing idct()

Three nested loops form the kernel of the IDCT:

    for (i=0; i<8; i++)
      for (j=0; j<8; j++) {
        partial_product = 0.0;
        for (k=0; k<8; k++)
          partial_product += c[k][j]*block[i][k];
        tmp[i][j] = partial_product;
      }

A verbatim translation from C to assembly doesn’t do much better. It misses the whole point of manually writing an assembly procedure.

We need parallelism!
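One way to expose that parallelism is to reorder the loop nest so that each coefficient row c[k][0..7] is read contiguously and four outputs tmp[i][j..j+3] are accumulated at once. The sketch below is an assumption-laden illustration, not the project's actual kernel: it uses single-precision `float` and flat 64-element arrays, the names `rowpass_scalar` and `rowpass_sse` are hypothetical, and the SSE path is compiled only when `__SSE__` is defined.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* Scalar reference: same loop nest as the original kernel, on flat
 * row-major 8x8 arrays (element [r][col] is stored at index r*8 + col). */
static void rowpass_scalar(const float *c, const float *block, float *tmp)
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float partial_product = 0.0f;
            for (int k = 0; k < 8; k++)
                partial_product += c[k * 8 + j] * block[i * 8 + k];
            tmp[i * 8 + j] = partial_product;
        }
}

/* SIMD sketch: for each output row i, broadcast block[i][k] and multiply
 * it with the contiguous coefficient row c[k][0..7], four lanes at a time. */
static void rowpass_sse(const float *c, const float *block, float *tmp)
{
    assert(c != NULL && block != NULL && tmp != NULL);
#if defined(__SSE__)
    for (int i = 0; i < 8; i++) {
        __m128 lo = _mm_setzero_ps();   /* accumulates tmp[i][0..3] */
        __m128 hi = _mm_setzero_ps();   /* accumulates tmp[i][4..7] */
        for (int k = 0; k < 8; k++) {
            __m128 b = _mm_set1_ps(block[i * 8 + k]);
            lo = _mm_add_ps(lo, _mm_mul_ps(b, _mm_loadu_ps(&c[k * 8])));
            hi = _mm_add_ps(hi, _mm_mul_ps(b, _mm_loadu_ps(&c[k * 8 + 4])));
        }
        _mm_storeu_ps(&tmp[i * 8], lo);
        _mm_storeu_ps(&tmp[i * 8 + 4], hi);
    }
#else
    rowpass_scalar(c, block, tmp);      /* fallback when SSE is not enabled */
#endif
}
```

The reordering also turns the strided reads of c[k][j] in the inner loop into sequential reads, which ties in with the cache-friendly data fetching mentioned earlier.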

Results
• 25X speed-up in idct()
• 4X speed-up in dist1()
[Figure: results chart; labels in order of appearance: 68.72%, 50.1s, 34.39%, 16.34s, 13.04%, 9.99%, 2.45s, 3.83s]
Experimental results are averaged over 3 runs.

Platform Compatibility (1)

Algorithm for Checking Availability of MMX

    bool isMMXSupported()
    {
        int fSupported;
        asm {
            mov eax,1               // CPUID level 1
            cpuid                   // EDX = feature flag
            and edx,0x800000        // test bit 23 of feature flag
            mov fSupported,edx      // != 0 if MMX is supported
        }
        return fSupported != 0;
    }
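With GCC or Clang, the same CPUID query can be made without inline assembly through the `<cpuid.h>` helper `__get_cpuid`. The following is a rough sketch under that assumption; the names `cpu_features_edx`, `has_mmx`, `has_sse`, and `has_sse2` are hypothetical, and on other compilers or non-x86 targets the sketch simply reports no features.

```c
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>       /* GCC/Clang wrapper around the CPUID instruction */
#define HAVE_CPUID_H 1
#endif

/* Returns the EDX feature flags of CPUID leaf 1, or 0 when CPUID
 * cannot be queried on this compiler or target. */
static unsigned cpu_features_edx(void)
{
#ifdef HAVE_CPUID_H
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return edx;
#endif
    return 0;
}

/* Feature bits in EDX of CPUID leaf 1: bit 23 = MMX, 25 = SSE, 26 = SSE2. */
static int has_mmx(void)  { return (int)((cpu_features_edx() >> 23) & 1u); }
static int has_sse(void)  { return (int)((cpu_features_edx() >> 25) & 1u); }
static int has_sse2(void) { return (int)((cpu_features_edx() >> 26) & 1u); }
```

Each helper re-runs CPUID; a real program would more likely query once at start-up and cache the flags.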

Platform Compatibility (2)

[Flowchart: SSE? Y: SSE Routine. N: MMX? Y: MMX Routine. N: Normal Routine. END]

Algorithm for Checking Availability of SSE

    bool isISSESupported()
    {
        int processor;
        int features;
        int extfeatures = 0;
        asm {
            pusha
            mov eax,1
            cpuid
            mov processor,eax       // Store processor family/model/step
            mov features,edx        // Store features bits
            mov eax,080000000h
            cpuid                   // Check which extended functions can be called
            cmp eax,080000001h      // Extended Feature Bits
            jb nofeatures           // Jump if not supported
            mov eax,080000001h      // Select function 0x80000001
            cpuid
            mov extfeatures,edx     // Store extended features bits
        nofeatures:
            popa
        }
        if (((features >> 25) & 1) != 0)
            return true;
        else if (((extfeatures >> 22) & 1) != 0)
            return true;
        else
            return false;
    }
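The routine-selection flow (SSE routine if available, otherwise MMX routine, otherwise the normal routine) can be sketched as a one-time function-pointer choice. Everything named here is hypothetical: `struct cpu_caps`, `select_sad16`, and the two placeholder routines stand in for real capability checks and real SSE/MMX kernels, which this sketch does not implement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*sad16_fn)(const uint8_t *, const uint8_t *);

/* Normal (plain C) routine: sum of absolute differences over 16 bytes. */
static int sad16_normal(const uint8_t *p1, const uint8_t *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++) {
        int v = (int)p1[i] - (int)p2[i];
        s += v < 0 ? -v : v;
    }
    return s;
}

/* Placeholders standing in for real SSE and MMX kernels (assumption:
 * a real build would supply optimized implementations here). */
static int sad16_sse_placeholder(const uint8_t *p1, const uint8_t *p2)
{
    return sad16_normal(p1, p2);
}
static int sad16_mmx_placeholder(const uint8_t *p1, const uint8_t *p2)
{
    return sad16_normal(p1, p2);
}

/* Hypothetical capability flags, e.g. filled in from CPUID at start-up. */
struct cpu_caps { int sse; int mmx; };

/* Follows the selection flow: SSE first, then MMX, then the normal routine. */
static sad16_fn select_sad16(struct cpu_caps caps,
                             sad16_fn sse_routine, sad16_fn mmx_routine)
{
    if (caps.sse && sse_routine != NULL)
        return sse_routine;
    if (caps.mmx && mmx_routine != NULL)
        return mmx_routine;
    return sad16_normal;
}
```

Selecting once and calling through the pointer keeps the capability test out of the inner loops.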

Thank you!
