Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder. Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2005.
Outline • Introduction • H.264/AVC Intra Coding • Computation Reduction • Hardware Architecture
Introduction [Figure: H.264/AVC encoder block diagram. Blocks: input video signal split into 16×16-pixel macroblocks, coder control, transform/scaling/quantization, scaling and inverse transform, de-blocking filter, intra-frame prediction, motion compensation, motion estimation, intra/inter selection, entropy coding, output video signal. Annotation: multiple reference frames and variable block sizes.]
Introduction • Coding flow: Source → Prediction → Transform → Quantization → Entropy Coding → Compressed Data • Prediction: 4×4 / 16×16 luma, 8×8 chroma • Transform: 4×4 DCT • Quantization: scalar, nonuniform (lossy) • Entropy coding: CAVLC / CABAC (lossless)
Introduction • H.264/AVC I-frame coder (CAVLC) vs. JPEG 2000 (DWT 5/3) • Computational complexity • Block-based coding (hardware-friendly) vs. frame-based coding (memory-wasting)
Introduction • Comparison between different image coding standards [Figure: decoded images at 0.225 bpp for JPEG, JPEG 2000 (DWT 5/3), and H.264 I-frame (CAVLC)]
Introduction • Two solutions for platform-based design of an H.264/AVC intra frame coder • Fast algorithm for software implementation: reduces complexity by 45% with a PSNR drop of 0.3 dB • Hardware accelerator: max. clock rate 55 MHz, 31 fps for 4:2:0 SDTV (all intra frames)
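H.264/AVC Intra Coding • Intra Prediction • I4MB (4×4): directional modes 0, 1, 3, 4, 5, 6, 7, 8 plus DC • I16MB (16×16): directional modes 0 and 1 plus DC and plane [Figure: prediction directions relative to the current block]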
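To make the prediction modes concrete, here is a minimal Python sketch of three of the I4MB modes (vertical, horizontal, DC), assuming all neighbors are available. The function name `predict_4x4` and the argument layout are illustrative choices, not taken from the slides; the mode numbering (0 = vertical, 1 = horizontal, 2 = DC) and the DC rounding follow the H.264 standard.

```python
def predict_4x4(mode, top, left):
    """Return a 4x4 predictor block (list of rows).

    top  : the 4 reconstructed pixels above the block (A..D)
    left : the 4 reconstructed pixels to the left of the block (I..L)
    Only modes 0 (vertical), 1 (horizontal) and 2 (DC) are sketched here.
    """
    if mode == 0:  # vertical: copy the top neighbors down each column
        return [list(top) for _ in range(4)]
    if mode == 1:  # horizontal: copy the left neighbor across each row
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:  # DC: rounded mean of the 8 neighbors
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise NotImplementedError("diagonal modes 3..8 are not sketched")


top, left = [10, 20, 30, 40], [1, 2, 3, 4]
print(predict_4x4(2, top, left)[0])  # DC value repeated across the row
```

The six diagonal modes are omitted because they need the filtered neighbor taps defined in the standard.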
H.264/AVC Intra Coding • Mode Decision • Low complexity mode: SATD (original pels − predictors) and rate (bits of mode information) • High complexity mode: MSE (original pels − reconstructed pels) and rate (mode information + residual)
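The low-complexity cost can be sketched as follows. This is a minimal illustration, assuming SATD means the sum of absolute values of the 4×4 Hadamard-transformed difference block. The helper names and the Lagrangian weight `lam` are hypothetical, and real encoders often apply an additional scaling to the SATD that is not shown.

```python
# 4x4 Hadamard matrix (symmetric, so H == H^T)
H4 = [[1,  1,  1,  1],
      [1,  1, -1, -1],
      [1, -1, -1,  1],
      [1, -1,  1, -1]]


def matmul4(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    diff = [[orig[r][c] - pred[r][c] for c in range(4)] for r in range(4)]
    coeffs = matmul4(matmul4(H4, diff), H4)
    return sum(abs(v) for row in coeffs for v in row)


def low_complexity_cost(orig, pred, mode_bits, lam=1.0):
    """Hypothetical cost: distortion (SATD) plus weighted mode-information bits."""
    return satd_4x4(orig, pred) + lam * mode_bits
```

A mode decision loop would evaluate this cost for each candidate predictor and keep the minimum.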
H.264/AVC Intra Coding • Transform and Quantization • 4×4 integer transforms: DCT-based integer transform and Hadamard transform
H.264/AVC Intra Coding • Entropy Coding • Context-Based Adaptive Binary Arithmetic Coding (CABAC) • Context-Based Adaptive Variable Length Coding (CAVLC)
H.264/AVC Intra Coding • Run-time percentage • 720×480 4:2:0 at 30 fps • 10829 MIPS
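To put the 10829 MIPS figure in perspective, the following back-of-the-envelope calculation converts it into an instruction budget per macroblock. The variable names are illustrative; only the frame size, frame rate, and MIPS value come from the slide.

```python
WIDTH, HEIGHT, FPS = 720, 480, 30
MB_SIZE = 16            # macroblocks are 16x16 luma pixels
TOTAL_MIPS = 10829      # profiled software workload from the slide

mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
mbs_per_second = mbs_per_frame * FPS
instructions_per_mb = TOTAL_MIPS * 1_000_000 / mbs_per_second

print(mbs_per_frame, mbs_per_second, round(instructions_per_mb))
```

Each macroblock therefore costs on the order of a quarter of a million instructions in software, which motivates both the fast algorithm and the hardware accelerator.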
Computation Reduction • Intra Prediction • Table look-up • Cost generation • Sub-sampling
Computation Reduction • Fast Intra Prediction • The smaller the mode number, the more likely the mode is to occur. • Global statistics cannot reflect the correlation of local modes. • Local statistics of neighboring blocks are applied instead.
Computation Reduction • Fast Intra Prediction • Skip unlikely candidates
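One way to picture skipping unlikely candidates is to order the nine I4MB modes by local evidence and search only the first few. The ordering below (modes of the left and upper blocks, then DC, then the remaining modes by ascending number) is a hypothetical illustration and not the candidate table used by the authors.

```python
DC_MODE = 2       # I4MB DC mode number in H.264
NUM_I4_MODES = 9


def candidate_modes(left_mode, up_mode, num_candidates):
    """Return the first `num_candidates` modes of a hypothetical priority order.

    left_mode / up_mode: best modes of the neighboring 4x4 blocks,
    or None when a neighbor is unavailable.
    """
    ordered = []
    for mode in (left_mode, up_mode, DC_MODE, *range(NUM_I4_MODES)):
        if mode is not None and mode not in ordered:
            ordered.append(mode)
    return ordered[:num_candidates]


print(candidate_modes(5, 5, 4))
```

Falling back to ascending mode numbers reflects the observation that smaller mode numbers occur more often.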
Computation Reduction • Rate-distortion under different numbers of local-searched I4MB modes without insertion of full-search blocks [Figure: rate-distortion curves; legend entries: 6, 4, 2, 1, All modes, DC]
Computation Reduction • Fast Intra Prediction • Prevention of error propagation • Periodic insertion of full-search 4×4 blocks • Adaptive threshold on the distortion for a MB • If min SATD of P > TH_MinSATD, then search all modes. • TH_MinSATD = 2.0 × (min SATD of F) [Figure: 4×4-block pattern within a macroblock, F = full-search block, P = partial-search block: F P F P / P P P P / F P F P / P P P P]
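The two safeguards above can be sketched together. The code assumes the F/P pattern is listed in raster order over the sixteen 4×4 blocks of a macroblock and that "min SATD of F" refers to the most recent full-search block; both are assumptions, as are all names. SATD values are passed in as a precomputed dictionary purely for illustration.

```python
FULL_SEARCH_BLOCKS = {0, 2, 8, 10}   # assumed raster positions of the F blocks
THRESHOLD_FACTOR = 2.0               # TH_MinSATD = 2.0 * (min SATD of F)
ALL_MODES = range(9)


def search_block(block_index, satd_by_mode, candidates, last_full_min_satd):
    """Return (best_mode, min_satd, searched_all) for one 4x4 block.

    satd_by_mode: dict mapping each of the 9 modes to its SATD (illustrative).
    candidates:   modes tried first for a partial-search (P) block.
    """
    def best(modes):
        mode = min(modes, key=lambda m: satd_by_mode[m])
        return mode, satd_by_mode[mode]

    if block_index in FULL_SEARCH_BLOCKS:
        return (*best(ALL_MODES), True)

    mode, cost = best(candidates)
    if cost > THRESHOLD_FACTOR * last_full_min_satd:
        # Partial search looks unreliable: fall back to all modes.
        return (*best(ALL_MODES), True)
    return mode, cost, False
```

The fallback bounds how far a poor local prediction can drift before a full search corrects it.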
Computation Reduction • Subsampling Patterns
Computation Reduction • Saved Computation and PSNR Drop • PSNR drop < 0.3 dB • Global: subsampling + partial search using global statistics • Local: subsampling + partial search • Proposed: subsampling + partial search + periodic insertion of full search + adaptive SATD threshold
Hardware Architecture • Assumptions • A RISC can execute one instruction per cycle, except multiplication, which requires two. • A processing element (PE) can generate the predictor of one pixel per cycle.
Hardware Architecture • Solutions [Table: two options (produce all modes per cycle, produce one mode per cycle) evaluated for luma and chroma at 30 fps; columns: number of modes, average cycles per predictor]
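A rough clock-rate estimate follows from the assumption that one PE produces one predictor pixel per cycle. The sketch below counts only predictor generation for every mode (9 I4MB, 4 I16MB, 4 chroma modes per H.264) on 720×480 4:2:0 video at 30 fps. It ignores transform, quantization, and entropy coding, so the numbers are illustrative and are not the figures from the slide's table.

```python
LUMA_PIXELS_PER_MB = 16 * 16
CHROMA_PIXELS_PER_MB = 2 * 8 * 8          # Cb + Cr in 4:2:0
I4_MODES, I16_MODES, CHROMA_MODES = 9, 4, 4

MBS_PER_FRAME = (720 // 16) * (480 // 16)
FPS = 30

predictor_pixels_per_mb = (
    I4_MODES * LUMA_PIXELS_PER_MB
    + I16_MODES * LUMA_PIXELS_PER_MB
    + CHROMA_MODES * CHROMA_PIXELS_PER_MB
)


def required_mhz(num_pes):
    """Clock rate needed if each PE emits one predictor pixel per cycle."""
    cycles_per_second = predictor_pixels_per_mb * MBS_PER_FRAME * FPS
    return cycles_per_second / num_pes / 1e6


print(predictor_pixels_per_mb, required_mhz(1), required_mhz(4))
```

Under these assumptions a single PE would need a clock well above 100 MHz, while four PEs in parallel bring the requirement down to a few tens of MHz.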
Hardware Architecture • Comparisons in different degrees of parallelism
Hardware Architecture [Figure: neighboring pixels A to M of a 4×4 block; storage labeled DRAM and Register]
Hardware Architecture • Four-Parallel Reconfigurable Intra Prediction Generator [Figure: datapath built from 8-bit and 9-bit adders]
Hardware Architecture • Intra Prediction Generator [Figure: neighboring pixels A to M of a 4×4 block]
Hardware Architecture • I16MB DC Prediction Mode (T = top neighbors, L = left neighbors)
Cycle    PE0              PE1              PE2               PE3
1        T0+T4+T8+T12     T1+T5+T9+T13     T2+T6+T10+T14     T3+T7+T11+T15
2        +L0+L4+L8        +L1+L5+L9        +L2+L6+L10        +L3+L7+L11
3        +L12             +L13             +L14              +L15
4        sum of the four partial sums (PE0 + PE1 + PE2 + PE3)
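The four-cycle schedule can be checked with a small simulation. The sketch assumes each PE accumulates the top and left neighbors whose index is congruent to its own number modulo 4, and that the final DC value uses the rounding from the H.264 standard, (sum + 16) >> 5; the function name is illustrative.

```python
def i16_dc_by_schedule(top, left):
    """Simulate the 4-PE accumulation of the 16 top and 16 left neighbors."""
    acc = [0, 0, 0, 0]
    for pe in range(4):
        # Cycle 1: four top neighbors per PE
        acc[pe] = top[pe] + top[pe + 4] + top[pe + 8] + top[pe + 12]
        # Cycle 2: three left neighbors per PE
        acc[pe] += left[pe] + left[pe + 4] + left[pe + 8]
        # Cycle 3: the remaining left neighbor
        acc[pe] += left[pe + 12]
    # Cycle 4: combine the four partial sums, then round and divide by 32
    total = acc[0] + acc[1] + acc[2] + acc[3]
    return (total + 16) >> 5


top = list(range(16))
left = list(range(100, 116))
print(i16_dc_by_schedule(top, left))
```

The result matches a direct sum over all 32 neighbors, which confirms that the partitioning across PEs does not change the predictor.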
Hardware Architecture • I16MB Plane Prediction Mode
$$\mathrm{Pred}[y, x] = \mathrm{Clip1}\big((a + b\,(x - 7) + c\,(y - 7) + 16) \gg 5\big)$$
$$a = 16\,(p[-1, 15] + p[15, -1])$$
$$b = (5H + 32) \gg 6, \qquad c = (5V + 32) \gg 6$$
$$H = \sum_{x'=0}^{7} (x' + 1)\,(p[-1, 8 + x'] - p[-1, 6 - x'])$$
$$V = \sum_{y'=0}^{7} (y' + 1)\,(p[8 + y', -1] - p[6 - y', -1])$$
[Figure: accumulators A0 to A3; labels Pred[0,0], Pred[0,4], Pred[0,8], Pred[0,12]]
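The plane-mode equations translate directly into code. This is a minimal software sketch, not the hardware datapath: it uses the slide's [row, column] index order, assumes 8-bit samples for Clip1, and includes the +16 rounding offset from the H.264 standard. The argument names (`top`, `left`, `corner`) are illustrative.

```python
def clip1(value):
    """Clip to the 8-bit sample range (assumes 8-bit video)."""
    return max(0, min(255, value))


def i16_plane_prediction(top, left, corner):
    """top[j] = p[-1, j], left[i] = p[i, -1], corner = p[-1, -1]."""
    def top_at(j):
        return corner if j == -1 else top[j]

    def left_at(i):
        return corner if i == -1 else left[i]

    h = sum((k + 1) * (top_at(8 + k) - top_at(6 - k)) for k in range(8))
    v = sum((k + 1) * (left_at(8 + k) - left_at(6 - k)) for k in range(8))
    a = 16 * (top[15] + left[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6
    return [[clip1((a + b * (x - 7) + c * (y - 7) + 16) >> 5)
             for x in range(16)]
            for y in range(16)]


# Neighbors lying on a horizontal ramp 50 + 2x (left column and corner at x = -1)
ramp_top = [50 + 2 * x for x in range(16)]
pred = i16_plane_prediction(ramp_top, [48] * 16, 48)
print(pred[0][:4])
```

Because b and c are constant over the macroblock, each row can be produced by repeated addition of b, which is what makes the mode suitable for accumulator-based PEs.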
Hardware Architecture [Figure: accumulators A0 to A3]
Hardware Architecture • Transform (implemented by shifters and adders): DCT, iDCT, Hadamard
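To show why shifters and adders suffice, here is a sketch of the forward 4×4 DCT-based integer transform written as a butterfly. It is a software illustration of the standard H.264 core transform (rows [1,1,1,1], [2,1,-1,-2], [1,-1,-1,1], [1,-2,2,-1]) and not a description of the authors' circuit; the function names are illustrative.

```python
def butterfly_1d(x):
    """1-D forward core transform of 4 samples using only add, subtract, shift."""
    s0, s1 = x[0] + x[3], x[1] + x[2]
    d0, d1 = x[0] - x[3], x[1] - x[2]
    return [
        s0 + s1,           # row [1,  1,  1,  1]
        (d0 << 1) + d1,    # row [2,  1, -1, -2]
        s0 - s1,           # row [1, -1, -1,  1]
        d0 - (d1 << 1),    # row [1, -2,  2, -1]
    ]


def forward_transform_4x4(block):
    """2-D transform: 1-D butterfly on every row, then on every column."""
    rows = [butterfly_1d(r) for r in block]
    cols = [butterfly_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]   # transpose back to row-major


print(butterfly_1d([1, 2, 3, 4]))
```

The only multiplications are by 2, which become one-bit left shifts. A Hadamard transform uses the same butterfly with the shifts removed.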