H.264 Intra Frame Coder System Design Özgür Taşdizen Microelectronics Program at Sabanci University 4/8/2005
OUTLINE • Introduction • Hardware Architectures For Intra Frame Coder Modules • Top Level Intra Frame Coder Hardware • H.264 Intra Frame Coder System • Conclusions and Future Work
H.264 VIDEO CODING STANDARD
• The latest video coding standard
• Developed with the collaboration of ITU-T and MPEG
• Includes 3 Profiles and 14 Levels
[Figure: timeline of video coding standards, 1984 to 2004. ITU-T: H.261, H.263, H.263+, H.263++. MPEG: MPEG-1, MPEG-4. Joint ITU-T / MPEG: H.262 / MPEG-2, H.264 / MPEG-4 Part 10.]
H.264 VIDEO CODING STANDARD
It Provides Significant Performance Gains

Average Bit Rate Savings of H.264 relative to other coders:
Coder            MPEG-4 ASP   H.263 HLP   MPEG-2
H.264            38.62%       48.80%      64.46%

90-minute DVD-quality movie (Download time at 700 Kbps):
                              MPEG-2   MPEG-4 (ASP)   H.264
Bandwidth Required (Mbps)     3.0      1.8            1.1
Storage Utilization (MB)      2025     1234           727
Download Time (Minutes)       386      235            139
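The storage and download figures follow from the bit rates by simple arithmetic. The sketch below reproduces the MPEG-2 column; the function names are illustrative, and it assumes 1 MB = 8 megabits and 1 Mbps = 1000 Kbps. The 1.8 and 1.1 Mbps figures appear to be rounded on the slide, so the other two columns only reproduce approximately.

```python
def storage_mb(bitrate_mbps, minutes):
    """Megabytes needed to store `minutes` of video at `bitrate_mbps`."""
    return bitrate_mbps * minutes * 60 / 8


def download_minutes(size_mb, link_kbps):
    """Minutes needed to download `size_mb` megabytes over a `link_kbps` link."""
    seconds = size_mb * 8 * 1000 / link_kbps
    return seconds / 60


# MPEG-2 column: 3.0 Mbps, 90-minute movie, 700 Kbps link
mpeg2_size = storage_mb(3.0, 90)
mpeg2_time = download_minutes(mpeg2_size, 700)
print(mpeg2_size, round(mpeg2_time))
```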
H.264 Encoder Block Diagram
[Figure: encoder block diagram. Blocks: Current Frame, Reference Frame, Motion Estimation, Motion Compensation, Intra Prediction, Choose Intra Mode, Mode Decision, Residue, Transform, Quant, Reorder, Entropy Coder, Inverse Quant, Inverse Transform, Reconstruction, Deblocking Filter, Reconstructed Frame. Part of the diagram is labelled Intra Frame Coder.]
OUTLINE • Introduction • Hardware Architectures For Intra Frame Coder Modules • Top Level Intra Frame Coder Hardware • H.264 Intra Frame Coder System • Conclusions and Future Work
Transform and Quantization Algorithms
[Figure: transform and quantization data flow. Blocks: Residue, Forward Transform, Hadamard Transform, Quantizer, VLC, Inverse Quantizer, Inverse Hadamard Transform, Inverse Transform, Reconstruction.]
H.264 Transform Algorithm
• A multiply-free 4x4 integer transform is used. It only requires additions and shifts.
• For 16x16 intra coded luminance blocks and for 8x8 chrominance blocks, a second transform, the Hadamard Transform, is applied to the DC coefficients.
[Figure: transform matrices for the 4x4 Forward Integer Transform, 4x4 Hadamard Transform, 2x2 Hadamard Transform and 4x4 Inverse Integer Transform.]
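The matrices on this slide did not survive as text. The sketch below assumes the standard H.264 forward core matrix with rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1), (1, -2, 2, -1), which agrees with the register expressions on the Transform Hardware slide. It shows how the 2-D transform can be computed with additions and shifts only; the function names are illustrative and this is not the author's hardware description.

```python
def butterfly_1d(v):
    """1-D forward core transform of four samples using adds and shifts only."""
    a, b, c, d = v
    s0, s1 = a + d, b + c      # sums
    d0, d1 = a - d, b - c      # differences
    return [s0 + s1,           #  a +  b +  c +  d
            (d0 << 1) + d1,    # 2a +  b -  c - 2d
            s0 - s1,           #  a -  b -  c +  d
            d0 - (d1 << 1)]    #  a - 2b + 2c -  d


def forward_transform_4x4(block):
    """2-D transform: the 1-D butterfly applied to every row, then every column."""
    rows = [butterfly_1d(r) for r in block]
    cols = [butterfly_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


flat = [[1] * 4 for _ in range(4)]
print(forward_transform_4x4(flat))  # only the DC coefficient is non-zero
```

A constant block puts all of its energy in the DC position, which is why the DC coefficients are worth a second (Hadamard) pass in the 16x16 luma and chroma cases.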
H.264 Transform Algorithm
[Figure: block numbering within a macroblock]
DC blocks: -1 (luma), 16 and 17 (chroma)
LUMA             CHROMA CB    CHROMA CR
 0  1  4  5      18 19        22 23
 2  3  6  7      20 21        24 25
 8  9 12 13
10 11 14 15
• 4x4 Forward Integer Transform is applied to all the blocks except -1, 16, 17
• 4x4 Hadamard Transform is applied to -1 if intra 16x16 mode is selected
• 2x2 Hadamard Transform is applied to 16, 17
Transform Hardware
• Register 0 stores: (x0+x4+x8+x12)
• Register 1 stores: (x1+x5+x9+x13)
• Register 2 stores: (x2+x6+x10+x14)
• Register 3 stores: (x3+x7+x11+x15)
• Pipelining registers are used to increase the maximum clock frequency
• Register 4 stores the result of the transform operations:
  (x0+x4+x8+x12) + (x1+x5+x9+x13) + (x2+x6+x10+x14) + (x3+x7+x11+x15)
  2*(x0+x4+x8+x12) + (x1+x5+x9+x13) - (x2+x6+x10+x14) - 2*(x3+x7+x11+x15)
  (x0+x4+x8+x12) - (x1+x5+x9+x13) - (x2+x6+x10+x14) + (x3+x7+x11+x15)
  (x0+x4+x8+x12) - 2*(x1+x5+x9+x13) + 2*(x2+x6+x10+x14) - (x3+x7+x11+x15)
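Quantization Hardware
• QP ranges from 0 to 51.
• $qbits = 15 + \lfloor QP/6 \rfloor$
• AC coefficients: $|Z_{ij}| = (|W_{ij}| \cdot MF + f) \gg qbits$, $\mathrm{sign}(Z_{ij}) = \mathrm{sign}(W_{ij})$
• DC coefficients: $|Z_{ij}| = (|Y_{ij}| \cdot MF + 2f) \gg (qbits + 1)$, $\mathrm{sign}(Z_{ij}) = \mathrm{sign}(Y_{ij})$
Inverse Quantization
• AC coefficients: $W'_{ij} = Z_{ij} \cdot V \cdot 2^{\lfloor QP/6 \rfloor}$
• DC coefficients:
  If $QP \geq 12$: $W'_{ij} = Wq_{ij} \cdot V \cdot 2^{\lfloor QP/6 \rfloor - 2}$
  Else: $W'_{ij} = [\, Wq_{ij} \cdot V + 2^{1 - \lfloor QP/6 \rfloor} \,] \gg (2 - \lfloor QP/6 \rfloor)$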
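A worked example of the AC formulas may help. The slide does not give values for MF, V or f, so the sketch below makes assumptions: MF and V are taken from the commonly published H.264 tables for coefficient position (0,0), indexed by QP mod 6, and f is set to 2^qbits / 3, a rounding offset often used for intra blocks. The function names are illustrative.

```python
# Assumed table values for position (0,0), indexed by QP % 6.
MF_POS00 = [13107, 11916, 10082, 9362, 8192, 7282]
V_POS00 = [10, 11, 13, 14, 16, 18]


def quantize_ac(w, qp):
    """|Z| = (|W| * MF + f) >> qbits, with sign(Z) = sign(W)."""
    qbits = 15 + qp // 6
    mf = MF_POS00[qp % 6]
    f = (1 << qbits) // 3          # assumed intra rounding offset
    z = (abs(w) * mf + f) >> qbits
    return -z if w < 0 else z


def dequantize_ac(z, qp):
    """W' = Z * V * 2^floor(QP/6)."""
    return z * V_POS00[qp % 6] * (1 << (qp // 6))


z = quantize_ac(100, 10)
print(z, dequantize_ac(z, 10))
```

With QP = 10, qbits is 16 and the multiply-shift pair stands in for a division, which is why the hardware needs a multiplier but no divider.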
Hardware Implementation Results

FPGA implementation
Resource               Excluding I/O Register Files   Including I/O Register Files
Function Generators    2497                           4054
CLB Slices             1249                           2027
Dffs or Latches        581                            583
Block Multipliers      1                              1

0.18µ ASIC implementation
Design                                                             Critical Path Delay [ns]   Gate Count
Transform part of the Datapath                                     2.77                       1978
Datapath                                                           4.78                       12773
Datapath + Control Unit                                            4.8                        23162
Datapath + Control + Input Register File + Output Register File    4.8                        130505

• In the worst case, it takes 2500 cycles to complete the TQIQIT operations of a macroblock
• FPGA implementation works at 81 MHz and it can code 27 VGA frames per second
• 0.18µ ASIC implementation works at 210 MHz and it can code 70 VGA frames per second
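The quoted frame rates can be checked with a short calculation. This sketch assumes the 2500-cycle worst case covers one 16x16 macroblock (the reading under which both quoted rates come out exactly) and that VGA means 640x480 pixels; the function name is illustrative.

```python
def frames_per_second(clock_hz, cycles_per_mb, width, height):
    """Frame rate sustained when every macroblock takes `cycles_per_mb` cycles."""
    mbs_per_frame = (width // 16) * (height // 16)   # 16x16 macroblocks
    return clock_hz / (cycles_per_mb * mbs_per_frame)


print(frames_per_second(81_000_000, 2500, 640, 480))    # FPGA clock
print(frames_per_second(210_000_000, 2500, 640, 480))   # ASIC clock
```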
Context Adaptive Variable Length Encoder Hardware
1) After prediction, transformation and quantization, blocks typically contain mostly zeros.
2) The highest-frequency non-zero coefficients after the zig-zag scan are often sequences of +/-1.
3) The number of non-zero coefficients in neighbouring blocks is correlated.
4) The magnitude of non-zero coefficients tends to be higher at the start of the zig-zag scanned array.
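The first two observations correspond to two quantities a CAVLC encoder signals for each block: the number of non-zero coefficients and the number of trailing +/-1 values (at most three). The sketch below computes both after a zig-zag scan. It assumes the standard 4x4 frame-mode zig-zag order; the function names are illustrative and the code does not model the author's hardware.

```python
# Raster indices of a 4x4 block visited in zig-zag order.
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag_scan(block):
    """Reorder a 4x4 block (list of rows) into a 16-entry zig-zag list."""
    flat = [c for row in block for c in row]
    return [flat[i] for i in ZIGZAG_4X4]


def cavlc_counts(block):
    """Return (total non-zero coefficients, trailing +/-1 count capped at 3)."""
    nonzero = [c for c in zigzag_scan(block) if c != 0]
    trailing_ones = 0
    for c in reversed(nonzero):    # walk back from the highest frequency
        if abs(c) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break
    return len(nonzero), trailing_ones


block = [[0, 3, -1, 0],
         [0, -1, 1, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 0]]
print(zigzag_scan(block))
print(cavlc_counts(block))
```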
Intra Prediction Hardware
• 9 prediction modes for 4x4 luma blocks
• 4 prediction modes for 16x16 luma and 8x8 chroma blocks
[Figure: intra prediction hardware block diagram. Blocks: Inputs from Top-Level, Reconstructed Pixels, Address Generation Hardware, Neighbouring Buffers, Internal Buffers, Top Level Mode Controller, a Controller and a Datapath each for the 4x4 Luma, 16x16 Luma and 8x8 Chroma Prediction Modes, Output MUX, Prediction Buffer (384x8).]
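As an illustration of what a prediction mode computes, the sketch below implements three of the nine 4x4 luma modes (vertical, horizontal and DC) from the reconstructed neighbours above and to the left of the block. It assumes both neighbour sets are available; the function name and string mode labels are illustrative, and the six directional modes are omitted.

```python
def predict_4x4(mode, top, left):
    """Predict a 4x4 block from 4 reconstructed pixels above and 4 to the left."""
    if mode == "vertical":       # copy the row above downwards
        return [list(top) for _ in range(4)]
    if mode == "horizontal":     # copy the left column across
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":             # rounded mean of the eight neighbours
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("mode not covered by this sketch: " + mode)


top = [10, 20, 30, 40]
left = [50, 50, 50, 50]
for mode in ("vertical", "horizontal", "dc"):
    print(mode, predict_4x4(mode, top, left)[0])
```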
OUTLINE • Introduction • Hardware Architectures For Intra Frame Coder Modules • Top Level Intra Frame Coder Hardware • H.264 Intra Frame Coder System • Conclusions and Future Work
Top Level Intra Frame Coder Hardware
Cycles available per macroblock at each clock frequency:
Level                @30MHz   @40MHz   @50MHz   @60MHz   @70MHz   @80MHz
2.0 (CIF @30 fps)    2525     3367     4208     5050     5892     6734
• CIF @ 30 fps requires processing 11880 macroblocks per second
[Figure: top-level block diagram. Input Register File, SEARCH HARDWARE, Pipelining Register File, CODER HARDWARE, Output Register File.]
[Figure: macroblock pipeline timing. Functional units (Search Hardware, Coder Hardware) against time in cycles (4000, 8000, 12000, 16000) for the 1st to 4th MBs.]
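The per-macroblock cycle budget for Level 2.0 can be reproduced from the frame size and frame rate. This sketch assumes CIF is 352x288 pixels, which gives 396 macroblocks per frame; the function name is illustrative.

```python
def cycles_per_macroblock(clock_hz, width, height, fps):
    """Whole clock cycles available to process one 16x16 macroblock."""
    mbs_per_second = (width // 16) * (height // 16) * fps
    return clock_hz // mbs_per_second


for mhz in (30, 40, 50, 60, 70, 80):
    print(mhz, cycles_per_macroblock(mhz * 1_000_000, 352, 288, 30))
```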
Search Hardware
[Figure: search hardware block diagram. Luma 16x16 and Chroma 8x8 path: 384 x 8 Current MB, Neighbors, Intra Pred., 384 x 8 Predicted MB, Residue, Hadamard Transform, Reg. for 16 DC coefs., Mux. Luma 4x4 path: 256 x 8 Current MB, Neighbors, Intra Pred., 256 x 8 Predicted MB, Residue, Hadamard Transform. Other labels: Mode Decision, QP, Mode.]
Mode Decision
SATD based mode decision algorithm
1) Compute the cost of each 4x4 mode. Select the 4x4 mode with the lowest cost.
2) Compute the cost of each 16x16 mode. Select the 16x16 mode with the lowest cost.
3) Compute the cost of each 8x8 mode. Select the 8x8 mode with the lowest cost.
4) Compare the selected 4x4 and 16x16 costs and select the best mode.
5) Start the coder hardware with the selected mode information.
Intra 4x4 vs Intra 16x16 Cost Comparator
1. Cycle: Register = 8 x
2. Cycle: Register = 16 x
3. Cycle: Register = 24 x
4. Cycle: Register = 4x4cost + 24 x
5. Cycle: Register = 16x16cost - (4x4cost + 24 x )
[Figure: cost comparator datapath. Labels: Cost4x4, Cost16x16, << 3, Mux, Add_sub, Add/Sub, Register, Result; bus widths of 9, 18 and 19 bits.]
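The five comparator cycles can be sketched in software. The operand that is multiplied by 8, 16 and 24 is not legible in the source, so it appears here as a hypothetical parameter `bias_unit`. The sketch also assumes that a negative final register value selects Intra 16x16 and that ties go to Intra 4x4; neither rule is stated on the slide.

```python
def best_mode(costs):
    """Steps 1 to 3: pick the mode with the lowest cost from a {mode: cost} dict."""
    return min(costs, key=costs.get)


def compare_4x4_vs_16x16(cost_4x4, cost_16x16, bias_unit):
    """Step 4: mimic the five register updates of the cost comparator."""
    register = 0
    for _ in range(3):                 # cycles 1 to 3: 8x, 16x, 24x the operand
        register += bias_unit << 3
    register += cost_4x4               # cycle 4: 4x4cost + 24x
    register = cost_16x16 - register   # cycle 5: 16x16cost - (4x4cost + 24x)
    # Assumed decision rule: a negative result means 16x16 is cheaper.
    return "intra16x16" if register < 0 else "intra4x4"


print(best_mode({"vertical": 120, "horizontal": 95, "dc": 101}))
print(compare_4x4_vs_16x16(1000, 1100, 2))
print(compare_4x4_vs_16x16(1000, 1030, 2))
```

Building 24x from three shifted additions keeps the comparator to one adder/subtractor and one register, which matches the labels visible in the datapath figure.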
High Speed Hadamard Transform Hardware
• Performs SATD computation
• Requires only 18 cycles for a 4x4 block
• 13-bit adders/subtractors
• Two-stage pipeline
[Figure: Hadamard transform datapath. Inputs z0 to z15, a network of add/sub units, two pipeline registers (P. Register) and an output Register.]
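For reference, SATD (sum of absolute transformed differences) for a 4x4 block can be sketched as follows: take the difference between the current and predicted blocks, apply a 4x4 Hadamard transform in both dimensions, and sum the absolute values. The slide does not state whether the sum is normalised, so this sketch applies no scaling; the function names are illustrative.

```python
H4 = [[1,  1,  1,  1],
      [1,  1, -1, -1],
      [1, -1, -1,  1],
      [1, -1,  1, -1]]


def matmul(a, b):
    """Product of two 4x4 integer matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(current, predicted):
    """Sum of absolute Hadamard-transformed differences (unnormalised)."""
    diff = [[current[r][c] - predicted[r][c] for c in range(4)] for r in range(4)]
    h_t = [list(col) for col in zip(*H4)]          # transpose of H4
    transformed = matmul(matmul(H4, diff), h_t)
    return sum(abs(v) for row in transformed for v in row)


cur = [[10] * 4 for _ in range(4)]
pred = [[8] * 4 for _ in range(4)]
print(satd_4x4(cur, pred))
```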
Coder Hardware
[Figure: coder hardware block diagram. Blocks: 384 x 8 Current MB, 384 x 8 Predicted MB, Intra Pred., Residue, 384 x 9 Reg. file, Transform, HT, Quant, 384 x 16 Reg. file, CAVLC, 192 x 32 Reg. File, Bitstream, InverseQuant, IHT, Inverse Transform, 16 x 16 Reg. File, Reconstruct, 384 x 8 Reconstructed MB.]
Scheduling of Intra 4x4 modes
Worst-case cycle counts required to complete a 4x4 block: TQIQIT = 100, CAVLC = 120, Residue & Reconstruction = 18, Intra Prediction = 24
[Figure: timing diagram of the modules (Intra Prediction, Residue, TQ, IQIT, CAVLC, Reconstruction) for the 1st and 2nd blocks. Time axis marks (cycles): 0, 24, 42, 86, 142, 160, 202, 246, 302, 320.]
Scheduling of Intra 16x16 modes
[Figure: timing diagram of the modules (Intra Prediction, Residue, TQ, HT, IQIT, CAVLC, Reconstruction) for the 1st, 2nd and 16th blocks. Time axis marks (cycles): 0, 24, 42, 48, 75, 86, 130, 384, 402, 746, 800, 860, 880, 920, 1040.]
Implementation Results for H.264 Intra Frame Coder Hardware
• Synthesized at 61.4 MHz and Placed & Routed at 53.8 MHz.
• The total equivalent gate count is 1,051,458

Device Utilizations for XC2V8000 FPGA
Resources             Used     Available   Utilization
IOs                   418      1108        37.73%
Global Buffers        2        16          12.50%
Function Generators   21404    93184       22.97%
CLB Slices            10702    46592       22.97%
Dffs or Latches       3881     96508       4.02%
Block RAMs            1        168         0.60%
Block Multipliers     1        168         0.60%
OUTLINE • Introduction • Hardware Architectures For Intra Frame Coder Modules • Top Level Intra Frame Coder Hardware • H.264 Intra Frame Coder System • Conclusions and Future Work
System Overview
• PC is used to develop Verilog modules and debug the system
• Multi-ICE Debugger communicates with the development board
• Development Board is used for testing the designed hardware
• Color LCD Panel is used for visual verification
ARM-based Development Platform
• Logic Tile: Xilinx Virtex II 8000 FPGA
• Versatile Platform Baseboard: ARM926EJ-S Processor based Development Chip, Xilinx Virtex II 2000 FPGA
Software Implementation
• MATLAB and C code is developed
• ARM AXD Tool is used to debug the system
• C code runs on the ARM926EJ-S processor
• SRAM available on the Logic Tile is used to store image data
[Figure: processing flow, with SRAM buffering the data around the hardware. Stages: capturing the image in RGB format, converting the image from RGB format to YCbCr format, 4:2:0 sampling, partitioning the image into macroblocks, H.264 Intra Frame Coder Hardware, reconstructing the image in raster-scan order, converting the image from YCbCr format to RGB format, displaying the reconstructed image.]
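Two of the software stages, colour conversion and 4:2:0 sampling, can be sketched briefly. The source does not say which conversion matrix or subsampling filter the C code uses; the sketch below assumes full-range BT.601 (JFIF) coefficients and a rounded 2x2 average, with illustrative function names.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (assumed full-range BT.601)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return tuple(max(0, min(255, int(round(v)))) for v in (y, cb, cr))


def subsample_420(plane):
    """Halve a chroma plane in both directions by averaging each 2x2 group."""
    height, width = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1] +
              plane[y + 1][x] + plane[y + 1][x + 1] + 2) >> 2
             for x in range(0, width, 2)]
            for y in range(0, height, 2)]


print(rgb_to_ycbcr(255, 255, 255))
print(subsample_420([[10, 20], [30, 40]]))
```

After 4:2:0 sampling, each 16x16 macroblock carries 256 luma samples plus two 8x8 chroma blocks, 384 samples in total, which matches the 384 x 8 buffers in the hardware figures.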
Hardware Implementation
The ARM Development Board implements tri-state AHB buses.
An AHB master is designed for reading and writing the image data to the SRAMs available on the logic tile.
2 SRAM controllers are instantiated in the design as slaves on the AHB M1 and AHB M2 buses.
The System Arbiter controls the multiplexing.
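Design Flow
[Figure: design flow chart. Synthesis in Leonardo Spectrum: Verilog modules and Synthesis Constraints (High Effort for Speed) pass through the Compiler and Logic Optimizer. If the constraints are not met, modify the HDL files; if they are met, the result is a Netlist for XC2V8000. Place and route in Xilinx Project Navigator: the netlist and Place and Route Constraints (High Effort for Speed) pass through the Translator, Mapper, Placer and Router. If the constraints are not met, modify; if they are met, apply the Bitstream Options to produce the resulting Bitstream for XC2V8000.]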
OUTLINE • Introduction • Hardware Architectures For Intra Frame Coder Modules • Top Level Intra Frame Coder Hardware • H.264 Intra Frame Coder System • Conclusions and Future Work
Conclusions
• Transform-Quant architecture is designed and verified to work at 81 MHz.
• Mode Decision, Intra Prediction and CAVLC are integrated.
• Top-Level design is synthesized at 61.4 MHz and placed & routed at 53.8 MHz.
• Device utilization for XC2V8000 FPGA is approximately 23% with a total equivalent gate count of 1,051,458.
• The H.264 Intra Frame Coder System is verified to work on an ARM Versatile Platform development board.
Future Work
• Implementing header generation functionality
• Further verification by decoding the generated bitstream using an H.264 compliant decoder
• Implementing low-power techniques such as clock gating
• Adding a camera to the system for real-time video capturing and coding
• Developing an ASIC implementation and fabricating a prototype
• Creating a complete H.264 video coding system by integrating motion estimation, motion compensation, deblocking filter, intra vs. inter mode decision and rate control units
Thanks. Questions?