Context-Aware Fast 3D DCT/IDCT Algorithm for Low-power Video Codec in Mobile Embedded Systems

STREAMING DAY 2010, UDINE
Sergio Saponara, Luca Fanucci
University of Pisa, Italy
Contact: sergio.saponara@iet.unipi.it
STDAY2010, Udine, Sept. 2010

Outline
• Application of Multidimensional DCT in video coding
• Fast algorithm for 3D DCT
• Fast techniques based on radix-factorization
• Fast techniques based on context-aware processing
• Algorithmic results
• VLSI Architectures for 3D DCT
• CMOS implementation results
• Conclusion


2D DCT for video coding
The 2D DCT reduces spatial data redundancy.
- Conventional algorithm adopted in the H.26x (ITU-T) and MPEG-x (ISO/IEC) video codecs (encoder/decoder)
- The 2D DCT is applied to image blocks of NxN pixels (usually N=8)
[Figure: core of a motion-compensated H.26x encoder]

3D DCT for video coding (1/2)
The 3D DCT extends the spatial compression properties of the DCT to the time dimension.
With respect to H.26x/MPEG-x codecs, the 3D DCT offers:
- Lower cost: a single 3D DCT (spatio-temporal compression) replaces the 2D DCT (spatial compression) plus motion estimation (temporal compression)
- Symmetric encoder and decoder complexity, much lower than that of motion-compensated H.26x/MPEG-x encoders
- A good fit for applications that require real-time encoding and decoding in the same terminal: interactive TV and web services, video telephony, video conferencing, face recognition
- Same coding efficiency for slow-motion videos or small/medium image formats, with higher error resilience

3D DCT for video coding (2/2)
- The 3D DCT is applied to cubes of NxNxN pixels (usually N=8)
- As in H.26x/MPEG-x, each frame of a video is divided into blocks
- 1 cube of NxNxN pixels = the co-located image blocks of NxN pixels belonging to N consecutive frames
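
The cube construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_into_cubes`, the `cube[t][y][x]` index order and the requirement that the frame size be a multiple of N are assumptions, not details taken from the slides.

```python
def split_into_cubes(frames, n=8):
    """Group n consecutive frames into n x n x n cubes.

    frames: list of n frames, each a list of pixel rows.
    Returns a dict mapping (block_row, block_col) to a cube indexed
    as cube[t][y][x] (index order is an assumption of this sketch).
    """
    if len(frames) != n:
        raise ValueError("expected exactly n frames")
    height, width = len(frames[0]), len(frames[0][0])
    if height % n or width % n:
        raise ValueError("frame size must be a multiple of n")
    cubes = {}
    for by in range(0, height, n):
        for bx in range(0, width, n):
            # Take the same n x n block position from each of the n frames.
            cubes[(by // n, bx // n)] = [
                [row[bx:bx + n] for row in frame[by:by + n]]
                for frame in frames
            ]
    return cubes
```

With N=8, a group of 8 QCIF frames (176x144 pixels) yields 22 x 18 = 396 cubes.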

3D-DCT radix-factorization (1/2)
• [Equation: N^3-point 3D DCT]
• A direct implementation of the equation requires N^3 multiply-accumulate operations (MAC) per output coefficient
• The N^3-point 3D DCT is implemented with three N-point 1D DCT stages plus suitable transposition matrices
• Complexity drops to 3N MAC per output coefficient
• Memory cost: T1 of N^2 words plus T2 of N^3 words
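
The row-column decomposition above can be checked numerically. The sketch below assumes the standard orthonormal DCT-II as the 1D kernel and an x, y, t processing order; the slide's own equation did not survive in this text, so both choices are assumptions. It compares the three-pass form with the direct triple sum.

```python
import math

def _scale(k, n):
    # Orthonormal DCT-II scaling factor (assumed normalization).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def _cos(i, k, n):
    return math.cos(math.pi * (2 * i + 1) * k / (2 * n))

def dct_1d(x):
    """N-point DCT-II, direct form: N MACs per output coefficient."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * _cos(i, k, n) for i in range(n))
            for k in range(n)]

def dct_3d_separable(cube):
    """3D DCT as three passes of 1D DCTs; cube is indexed cube[t][y][x]."""
    n = len(cube)
    # Pass 1: along x, one row at a time.
    a = [[dct_1d(cube[t][y]) for y in range(n)] for t in range(n)]
    # Pass 2: along y; needs an N x N transposition (the role of T1).
    b = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for t in range(n):
        for x in range(n):
            col = dct_1d([a[t][y][x] for y in range(n)])
            for y in range(n):
                b[t][y][x] = col[y]
    # Pass 3: along t; needs the whole N x N x N cube (the role of T2).
    c = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for y in range(n):
        for x in range(n):
            line = dct_1d([b[t][y][x] for t in range(n)])
            for t in range(n):
                c[t][y][x] = line[t]
    return c

def dct_3d_direct(cube):
    """Reference triple sum: N^3 MACs per output coefficient."""
    n = len(cube)
    r = range(n)
    return [[[_scale(u, n) * _scale(v, n) * _scale(w, n) *
              sum(cube[t][y][x] * _cos(x, u, n) * _cos(y, v, n) * _cos(t, w, n)
                  for t in r for y in r for x in r)
              for u in r] for v in r] for w in r]
```

Each output coefficient costs N MACs in each of the three passes, which gives the 3N figure, against N^3 for the triple sum.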

3D-DCT radix-factorization (2/2)
[Figure: processing chain, Blocks 0,..,N-1 -> 1D DCT -> T1 -> 1D DCT -> T2 -> 1D DCT]
• Each N-point 1D DCT is factorized into simpler radix-2 butterflies
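
One way to see the radix-2 idea is the classic even/odd split of the DCT-II: a butterfly stage forms sums and differences of mirrored samples, the sums feed an N/2-point DCT that yields the even-indexed outputs, and the differences yield the odd-indexed ones. The sketch below uses an unnormalized DCT-II and factorizes only the even branch recursively; the flow graph actually used by the authors is not given in the slides, so this is an illustration of the principle rather than their design.

```python
import math

def dct_direct(x):
    """Unnormalized N-point DCT-II, direct form (reference)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
            for k in range(n)]

def dct_radix2(x):
    """Unnormalized DCT-II using radix-2 butterflies; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [x[0]]
    half = n // 2
    # Butterfly stage: sums feed the even outputs, differences the odd ones.
    sums = [x[i] + x[n - 1 - i] for i in range(half)]
    diffs = [x[i] - x[n - 1 - i] for i in range(half)]
    # Even-indexed outputs: an N/2-point DCT-II of the sums, split again recursively.
    even = dct_radix2(sums)
    # Odd-indexed outputs: N/2 MACs each on the differences (left unfactorized here).
    odd = [sum(diffs[i] * math.cos(math.pi * (2 * i + 1) * (2 * m + 1) / (2 * n))
               for i in range(half))
           for m in range(half)]
    out = [0.0] * n
    out[0::2] = even
    out[1::2] = odd
    return out
```

The butterfly stage alone halves the multiplications of the direct form, since every output now depends on N/2 products instead of N.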

3D-DCT/IDCT data correlation
• Switching bits between consecutive input samples
• With MissAmerica, Akiyo, Foreman and Coastguard, up to 60-70% of the rows are null in IDCT mode
[Figure: distribution of the amplitude of AC coefficients for Foreman vs. the coefficient number (1 to 512 in the 8x8x8 cube)]

Context-aware 3D-DCT
• Insert before each 1D stage a pre-processor that, for each row Xi of N samples:
  - analyzes the statistics of the DCT/IDCT input samples in that computing stage
  - decides, based on heuristic rules, whether the DCT/IDCT computation can be avoided
• If A = 0 and SAD = 0, or if A ≠ 0 and SAD < TH1, the transform result is forced to zero
• In these cases the transform result is estimated to have a small residual energy and would most likely be cancelled by the quantizer
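
The decision rule above can be sketched as a small wrapper around any 1D transform. The slides do not define A and SAD in this text, so the rule function takes them as inputs, and `toy_stats` is a placeholder definition invented for this example (first sample and sum of absolute differences between neighbouring samples). The names `skip_transform`, `context_aware_stage` and `toy_stats` are likewise hypothetical.

```python
def skip_transform(a, sad, th1):
    """Rule quoted on the slide: True when the 1D transform can be skipped."""
    return (a == 0 and sad == 0) or (a != 0 and sad < th1)

def context_aware_stage(rows, transform, stats, th1):
    """Run `transform` on each row unless the pre-processor skips it.

    stats: function returning (A, SAD) for a row; its definition is
    left to the caller because the slides do not spell it out.
    Returns the output rows and the number of skipped transforms.
    """
    out, skipped = [], 0
    for row in rows:
        a, sad = stats(row)
        if skip_transform(a, sad, th1):
            out.append([0] * len(row))  # transform result forced to zero
            skipped += 1
        else:
            out.append(transform(row))
    return out, skipped

def toy_stats(row):
    """Placeholder statistics, NOT taken from the slides."""
    a = row[0]
    sad = sum(abs(row[i] - row[i - 1]) for i in range(1, len(row)))
    return a, sad
```

The count of skipped rows is what a computation-saving percentage would be derived from; TH1 trades that saving against the distortion introduced by zeroing rows that are not exactly null.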

Computation saving (%): context-aware vs. classic 3D DCT/IDCT

Rate-distortion performance of context-aware fast 3D DCT
[Figure: rate-distortion curve for Akiyo]
[Figure: PSNR variation at fixed bit-rate, context-aware vs. classic 3D DCT]

Why VLSI HW design?
• A software-optimized 3D DCT/IDCT reaches real-time VGA at 24 Hz on an Intel Core 2 6300 @ 1.86 GHz [T. Fryza et al.]
• The Core 2 6300 processor, in 65 nm CMOS, integrates two cores and up to 4 MB of L2 cache. The die size is 143 mm^2 for 290 M transistors; at 1.86 GHz the power consumption is up to 65 W
• For battery-powered terminals, a VLSI HW design is needed

1D-DCT circuit engine
Distributed arithmetic: a RAC (ROM + Accumulator) is used instead of a MAC (Multiplier + Accumulator)
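
The ROM + accumulator idea can be modelled in software. Because the DCT coefficients are constants, every possible sum of a subset of them can be precomputed in a ROM; the inner product is then obtained bit-serially by addressing the ROM with one bit of every input sample per step and accumulating shifted ROM words. The sketch below is a generic textbook model of distributed arithmetic with integer coefficients and two's-complement inputs; the function names, word lengths and coefficient values are assumptions, not the authors' circuit.

```python
def build_rom(coeffs):
    """ROM with 2^K words: word[addr] = sum of the coefficients whose address bit is 1."""
    k = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << k)]

def rac_dot(rom, samples, bits):
    """Bit-serial inner product of `samples` with the coefficients stored in `rom`.

    samples: signed integers that fit in `bits`-bit two's complement.
    No multiplier is used: only ROM look-ups, shifts and additions.
    """
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]  # two's-complement encoding
    acc = 0
    for b in range(bits):
        addr = 0
        for j, w in enumerate(words):
            addr |= ((w >> b) & 1) << j  # bit b of every sample forms the ROM address
        partial = rom[addr] << b
        # The sign bit carries weight -2^(bits-1) in two's complement.
        acc += -partial if b == bits - 1 else partial
    return acc
```

For K coefficients the ROM holds 2^K words, which is why the technique suits short transforms such as an 8-point DCT.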

3D-DCT architectures: schemes
3D architectures with different degrees of parallelism and power vs. area trade-offs
[Figure: block diagrams of the FULL PARALLEL (PA), CASCADE (CS) and ITERATIVE (IT) schemes, built from 1D DCT/IDCT engines and the transposition memories T1 and T2, fed with blocks 0,..,N-1]

3D-DCT architectures: performance and complexity

CMOS implementation results
Technology: 0.18 µm standard-cell CMOS, 1.6 V, 6 metal levels
[Figure: circuit complexity (Kgates) vs. power consumption (mW) for the FULL PARALLEL, CASCADE and ITERATIVE architectures with the QCIF, CIF, 4CIF and 16CIF video formats]
Dotted lines connect the points that process the same video format

Power consumption with context-aware saving
[Figure: power consumption of the CS architecture]
[Figure: power consumption of the PA architecture]

Conclusions
3D-DCT/IDCT is a promising solution for the real-time, low-power, low-complexity implementation of video encoders and decoders in battery-powered terminals

Thanks for your attention!
