Low Power and Low Area Transform–Quant & Inverse Quant–Inverse Transform Hardware Design for

Low Power and Low Area Transform–Quant & Inverse Quant–Inverse Transform Hardware Design for H.264 Encoder

Outline I. H.264 TQ & IQIT II. DESIGNED HARDWARE III. RESULTS

H.264 TQ & IQIT • Each residual macroblock is transformed, quantized. • Previous standards such as MPEG-1,MPEG-2, MPEG-4 and H.263 made use of the 8x8 Discrete Cosine Transform (DCT) as the basictransform. • The “baseline” profile of H.264 uses three transforms depending on “the type of residualdata : • A transform for the 4x4 array of luma DC coefficients in intra macroblocks(predicted in 16x16 mode), • A transform for the 2x2 array of chroma DC coefficients (in anymacroblock) • 3) A transform for all other 4x4 blocks in the residual data.

Work accomplished ... ( T, Q, IQ, IT) ... Future work ( MC, toplevel, ...)

Data within a macroblock are transmitted in the order shown in Figure If the macroblock is codedin 16x16 Intra mode, then the block labelled “-1” is transmitted first, containing the DC coefficient ofeach 4x4 luma block. Next, the luma residual blocks 0-15 are transmitted in the order shown (with theDC coefficient set to zero in a 16x16 Intra macroblock). Blocks 16 and 17 contain a 2x2 array of DCcoefficients from the Cb and Cr chroma components respectively. Finally, chroma residual blocks 18-25 (with zero DC coefficients) are sent. The entire process of transform and quantization can be carried out using 16-bit integer arithmetic

4x4 Integer Transform &Inverse Transform It is an integer transform The core part of the transform is multiply-free, it only requires additions and shifts. A scaling multiplication (part of the complete transform) is integrated into the quantizer (reducingthe total number of multiplications).

4x4 Forward Integer Transform [(x0+x4+x8+x12) + (x1+x5+x9+x13) + (x2+x6+x10+x14) + (x3+x7+x11+x15), 2*(x0+x4+x8+x12) + (x1+x5+x9+x13) - (x2+x6+x10+x14) - 2*(x3+x7+x11+x15), (x0+x4+x8+x12) - (x1+x5+x9+x13) - (x2+x6+x10+x14) + (x3+x7+x11+x15), (x0+x4+x8+x12) - 2*(x1+x5+x9+x13) + 2*(x2+x6+x10+x14) - (x3+x7+x11+x15); (2*x0+x4-x8-2*x12) + (2*x1+x5-x9-2*x13) + (2*x2+x6-x10-2*x14) + (2*x3+x7-x11-2*x15), 2*(2*x0+x4-x8-2*x12) + (2*x1+x5-x9-2*x13) - (2*x2+x6-x10-2*x14) - 2*(2*x3+x7-x11-2*x15), (2*x0+x4-x8-2*x12) - (2*x1+x5-x9-2*x13) - (2*x2+x6-x10-2*x14) + (2*x3+x7-x11-2*x15), (2*x0+x4-x8-2*x12) - 2*(2*x1+x5-x9-2*x13) + 2*(2*x2+x6-x10-2*x14) - (2*x3+x7-x11-2*x15); (x0-x4-x8+x12) + (x1-x5-x9+x13) + (x2-x6-x10+x14) + (x3-x7-x11+x15), 2*(x0-x4-x8+x12) + (x1-x5-x9+x13) - (x2-x6-x10+x14) - 2*(x3-x7-x11+x15), (x0-x4-x8+x12) - (x1-x5-x9+x13) - (x2-x6-x10+x14) + (x3-x7-x11+x15), (x0-x4-x8+x12) - 2*(x1-x5-x9+x13) + 2*(x2-x6-x10+x14) - (x3-x7-x11+x15); (x0-2*x4+2*x8-x12) + (x1-2*x5+2*x9-x13) + (x2-2*x6+2*x10-x14) + (x3-2*x7+2*x11-x15), 2*(x0-2*x4+2*x8-x12) + (x1-2*x5+2*x9-x13) - (x2-2*x6+2*x10-x14) - 2*(x3-2*x7+2*x11-x15), (x0-2*x4+2*x8-x12) - (x1-2*x5+2*x9-x13) - (x2-2*x6+2*x10-x14) + (x3-2*x7+2*x11-x15), (x0-2*x4+2*x8-x12) - 2*(x1-2*x5+2*x9-x13) + 2*(x2-2*x6+2*x10-x14) - (x3-2*x7+2*x11-x15)]

4x4 Inverse Integer Transform [(y0 + y4 + y8 + y12/2) + (y1 + y5 + y9 + y13/2) + (y2 + y6 + y10 + y14/2) + 1/2 * (y3 + y7 + y11 + y15/2), (y0 + y4 + y8 + y12/2) + 1/2 * (y1 + y5 + y9 + y13/2) - (y2 + y6 + y10 + y14/2) - (y3 + y7 + y11 + y15/2), (y0 + y4 + y8 + y12/2) - 1/2 * (y1 + y5 + y9 + y13/2) - (y2 + y6 + y10 + y14/2) + (y3 + y7 + y11 + y15/2), (y0 + y4 + y8 + y12/2) - (y1 + y5 + y9 + y13/2) + (y2 + y6 + y10 + y14/2) - 1/2 * (y3 + y7 + y11 + y15/2); (y0 + y4/2 - y8 - y12) + (y1 + y5/2 - y9 - y13) + (y2 + y6/2 - y10 - y14) + 1/2 * (y3 + y7/2 - y11 - y15), (y0 + y4/2 - y8 - y12) + 1/2 * (y1 + y5/2 - y9 - y13) - (y2 + y6/2 - y10 - y14) - (y3 + y7/2 - y11 - y15), (y0 + y4/2 - y8 - y12) - 1/2 * (y1 + y5/2 - y9 - y13) - (y2 + y6/2 - y10 - y14) + (y3 + y7/2 - y11 - y15), (y0 + y4/2 - y8 - y12) - (y1 + y5/2 - y9 - y13) + (y2 + y6/2 - y10 - y14) - 1/2 * (y3 + y7/2 - y11 - y15); (y0 - y4/2 - y8 + y12) + (y1 - y5/2 - y9 + y13) + (y2 - y6/2 - y10 + y14) + 1/2 * (y3 - y7/2 - y11 + y15), (y0 - y4/2 - y8 + y12) + 1/2 * (y1 - y5/2 - y9 + y13) - (y2 - y6/2 - y10 + y14) - (y3 - y7/2 - y11 + y15), (y0 - y4/2 - y8 + y12) - 1/2 * (y1 - y5/2 - y9 + y13) - (y2 - y6/2 - y10 + y14) + (y3 - y7/2 - y11 + y15), (y0 - y4/2 - y8 + y12) - (y1 - y5/2 - y9 + y13) + (y2 - y6/2 - y10 + y14) - 1/2 * (y3 - y7/2 - y11 + y15); (y0 - y4 + y8 - y12/2) + (y1 - y5 + y9 - y13/2) + (y2 - y6 + y10 - y14/2) + 1/2 * (y3 - y7 + y11 - y15/2), (y0 - y4 + y8 - y12/2) + 1/2 * (y1 - y5 + y9 - y13/2) - (y2 - y6 + y10 - y14/2) - (y3 - y7 + y11 - y15/2), (y0 - y4 + y8 - y12/2) - 1/2 * (y1 - y5 + y9 - y13/2) - (y2 - y6 + y10 - y14/2) + (y3 - y7 + y11 - y15/2), (y0 - y4 + y8 - y12/2) - (y1 - y5 + y9 - y13/2) + (y2 - y6 + y10 - y14/2) - 1/2 * (y3 - y7 + y11 - y15/2)]

Quantization For QP>5, the factors MF remain unchanged but the divisor 2qbits increases by a factor of 2 for eachincrement of 6 in QP. >> indicates a binary shift right. In the reference model software, f is 2qbits/3 for Intra blocks or 2qbits/6 for Inter blocks.

Inverse Quantization a further constant scalingfactor of 64 to avoid rounding errors the rescaled output increase by a factor of 2 for everyincrement of 6 in QP. The values at the output of the inverse transform are divided by 64 to remove thescaling factor

4x4 luma DC coefficient Transform & Quantization 16x16 Intra-mode only an inverse Hadamard transform is applied followed by rescaling (note that the order isnot reversed as might be expected) If QP is greater than or equal to 12, rescaling is performed by: If QP is less than 12, rescaling is performed by:

4x4Forward & InverseHadamard Transform [(z0+z4+z8+z12) + (z1+z5+z9+z13) + (z2+z6+z10+z14) + (z3+z7+z11+z15), (z0+z4+z8+z12) + (z1+z5+z9+z13) - (z2+z6+z10+z14) - (z3+z7+z11+z15), (z0+z4+z8+z12) - (z1+z5+z9+z13) - (z2+z6+z10+z14) + (z3+z7+z11+z15), (z0+z4+z8+z12) - (z1+z5+z9+z13) + (z2+z6+z10+z14) - (z3+z7+z11+z15); (z0+z4-z8-z12) + (z1+z5-z9-z13) + (z2+z6-z10-z14) + (z3+z7-z11-z15), (z0+z4-z8-z12) + (z1+z5-z9-z13) - (z2+z6-z10-z14) - (z3+z7-z11-z15), (z0+z4-z8-z12) - (z1+z5-z9-z13) - (z2+z6-z10-z14) + (z3+z7-z11-z15), (z0+z4-z8-z12) - (z1+z5-z9-z13) + (z2+z6-z10-z14) - (z3+z7-z11-z15); (z0-z4-z8+z12) + (z1-z5-z9+z13) + (z2-z6-z10+z14) + (z3-z7-z11+z15), (z0-z4-z8+z12) + (z1-z5-z9+z13) - (z2-z6-z10+z14) - (z3-z7-z11+z15), (z0-z4-z8+z12) - (z1-z5-z9+z13) - (z2-z6-z10+z14) + (z3-z7-z11+z15), (z0-z4-z8+z12) - (z1-z5-z9+z13) + (z2-z6-z10+z14) - (z3-z7-z11+z15); (z0-z4+z8-z12) + (z1-z5+z9-z13) + (z2-z6+z10-z14) + (z3-z7+z11-z15), (z0-z4+z8-z12) + (z1-z5+z9-z13) - (z2-z6+z10-z14) - (z3-z7+z11-z15), (z0-z4+z8-z12) - (z1-z5+z9-z13) - (z2-z6+z10-z14) + (z3-z7+z11-z15), (z0-z4+z8-z12) - (z1-z5+z9-z13) + (z2-z6+z10-z14) - (z3-z7+z11-z15)]

2x2 chroma DC coefficient Transform &Quantization [ (z0+z2) + (z1+z3), (z0+z2) - (z1+z3); (z0-z2) + (z1-z3), (z0-z2) - (z1-z3)] During decoding, the inverse transform is applied before rescaling Inverse transform is identical If QP is greater than or equal to 6, rescaling is performed by: If QP is less than 6, rescaling is performed by: The rescaled coefficients are replaced in their respective 4x4 blocks of chroma coefficients

DESIGNED HARDWARE

Problems encountered Signed arithmetic Initially designed for 100Mhz Due to creating a dual purpose datapath we get extra MUX delays Hardware specified in the standart to avoid rounding errors Error of the book “H.264 and MPEG-4 Video Compression” ! Unpredicted and unbelievable routing error !

Designed hardware supports up to H.264 level 2.2 (SDTV @ 15 fps).A dual purpose datapath is designed.Transform and Quantization of a 4x4 block is completed in 36 clock cycles.Inverse Quantization of a 4x4 block takes 18 clock cycles.Inverse Transform of a 4x4 block is done in 36 clock cycles.It takes nearly 2400 cycles to complete an intra 16x16 predicted macroblock.Working at 80Mhz designed hardware can process up to 33000 mb’s per second. RESULTS

Synthesis Results Synthesis is done with LeonardoSpectrum Clock frequency is 80MHz Number of ports : 68 Number of nets : 212 Number of instances : 30 Number of references to this view : 0 Total accumulated area : Number of Dffs or Latches : 493 Number of Function Generators : 2688 Number of MUX CARRYs : 148 Number of MUXF5 : 608 Number of MUXF6 : 184 Number of accumulated instances : 3847 Number of global buffers used: 0

Device Utilization for 2V8000ff1152 Resource Used Avail Utilization ----------------------------------------------- IOs 68 824 8.25% Global Buffers 0 16 0.00% Function Generators2688 93184 2.88% CLB Slices 1344 46592 2.88% Dffs or Latches 493 95656 0.52% Block RAMs 0 168 0.00% Block Multipliers 1 168 0.60%

FPGA & ASIC The design can be used either for FPGA or for ASIC. Only one multiplier is used (2V8000ff1152 has 168 block multipliers). A clock frequency of 80 MHz for FPGA is achieved. To be able to reach 80MHz lots of pipelining stages are added. Designed hardware may work at a clock frequency up to 200MHz in ASIC. Removing pipelining registers will decrease the area and power consumption.

Thanks ... ? Questions

Low Power and Low Area Transform–Quant & Inverse Quant–Inverse Transform Hardware Design for