
Modular Refinement of H.264 Kermin Fleming



Presentation Transcript


  1. Modular Refinement of H.264 Kermin Fleming

  2. What is H.264? • Mobile Devices • Low bit-rate • Video Decoder • Follow-on to MPEG-2 and H.26x • Operates on pixel blocks • Smaller blocks: 4x4, 8x4, 4x8 • In-loop deblocking filter • Base profile Bluespec implementation • Works on FPGA!

  3. H.264 Overview

  4. H.264 Modules • NAL unwrap • Unwraps network packets • Byte stream separated by special tags • Entropy Decoder • Decodes various slices, parameters • Primarily Golomb encoded • Residual data uses CAVLC • Inverse Transform • Reconstructs whole blocks • Quantized frequency coefficients
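
The "primarily Golomb encoded" bullet can be illustrated with a minimal sketch of unsigned Exp-Golomb decoding, the variable-length code H.264 uses for most syntax elements. The function names `decode_ue` and `decode_all` are hypothetical, and the bitstream is modelled as a string of '0'/'1' characters purely for readability; this is not the Bluespec implementation from the talk.

```python
def decode_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb code from a '0'/'1' string.

    Returns (value, next_pos). Assumes the input is well formed.
    """
    # Count leading zeros up to the terminating '1'.
    leading_zeros = 0
    while bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    # Skip the zeros and the '1', then read `leading_zeros` suffix bits.
    start = pos + leading_zeros + 1
    suffix = bits[start:start + leading_zeros]
    value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, start + leading_zeros


def decode_all(bits):
    """Decode back-to-back codes until the string is exhausted."""
    values, pos = [], 0
    while pos < len(bits):
        value, pos = decode_ue(bits, pos)
        values.append(value)
    return values


if __name__ == "__main__":
    print(decode_all("1" + "010" + "011" + "00100"))
```

Short codes go to small values, which is why the scheme suits parameters that are usually close to zero.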

  5. H.264 Modules • Intra-prediction • Prediction based on previously decoded blocks • Corrected by residual • Inter-prediction • Correlation between frames • Motion vectors • Deblocking filter • Removes prediction artifacts • Frame Buffer • Maintains cache of previous frames
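
"Corrected by residual" amounts to adding the decoded residual to the predicted samples and clipping to the valid sample range. A minimal sketch, assuming 8-bit samples and blocks stored as lists of rows; `reconstruct_block` is a hypothetical name, not one taken from the slides:

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add the residual to the prediction and clip to [0, 2**bit_depth - 1]."""
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


if __name__ == "__main__":
    prediction = [[100, 250], [3, 128]]
    residual = [[5, 10], [-7, 0]]
    print(reconstruct_block(prediction, residual))
```

The same step applies whether the prediction came from neighbouring blocks (intra) or from a motion-compensated reference frame (inter).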

  6. Modular Refinement • Latency insensitive design • Data centric • Swap functionally equivalent modules • Design exploration easy • Bluespec generates control • Design timing change? • No problem.
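
The latency-insensitive idea can be sketched in software: if modules only communicate through FIFOs, a module can be swapped for a functionally equivalent one with different timing and the output stream stays the same. The model below is a toy under stated assumptions (unbounded queues, a hypothetical `Stage` class with a fixed per-item latency of at least one cycle); in Bluespec the equivalent handshaking logic is generated by the compiler.

```python
from collections import deque


class Stage:
    """Applies `fn` to each item, taking `latency` cycles per item (latency >= 1)."""

    def __init__(self, fn, latency):
        self.fn = fn
        self.latency = latency
        self.remaining = 0
        self.current = None

    def step(self, inq, outq):
        # Accept a new item only when idle and input is available.
        if self.current is None and inq:
            self.current = inq.popleft()
            self.remaining = self.latency
        if self.current is not None:
            self.remaining -= 1
            if self.remaining == 0:
                outq.append(self.fn(self.current))
                self.current = None


def run(stages, inputs):
    """Simulate the pipeline; returns (outputs, cycles taken)."""
    queues = [deque(inputs)] + [deque() for _ in stages]
    cycles = 0
    while len(queues[-1]) < len(inputs):
        # Step downstream stages first so an item advances one stage per cycle.
        for i in reversed(range(len(stages))):
            stages[i].step(queues[i], queues[i + 1])
        cycles += 1
    return list(queues[-1]), cycles


if __name__ == "__main__":
    fast = [Stage(lambda x: x + 1, 1), Stage(lambda x: x * 2, 1)]
    slow = [Stage(lambda x: x + 1, 3), Stage(lambda x: x * 2, 5)]
    print(run(fast, [1, 2, 3]))
    print(run(slow, [1, 2, 3]))
```

Both pipelines produce the same values in the same order; only the cycle count differs, which is what makes design exploration cheap.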

  7. Deblocking Filter Details • Block prediction leaves artifacts • Apply a smoothing filter across macroblock boundaries • Highly configurable • (Figure: Macroblock Filter Order)

  8. Original Implementation • Store the whole macroblock • Iteratively filter the macroblock • Store and stream left macroblock • Simple to reason about – much like software • BAD!!!! • Highly sequential • Large storage requirements • Wiring:

  9. Pipelining • Sequential execution was a problem • Unclear how to pipeline design • Data stored in row-major order • Can be rotated to column-major order • 16-stage pipeline • Horizontal Filter • Row-to-Column • Vertical Filter • Column-to-Row
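
The row-to-column rotation is a transpose of the pixel block: once the data is rotated, a filter that works along rows can also smooth edges in the other direction. The sketch below uses a toy averaging filter as a stand-in; it is not the H.264 deblocking filter, and `smooth_rows` and `filter_other_direction` are hypothetical names.

```python
def transpose(block):
    """Rotate a row-major block to column-major (and back)."""
    return [list(col) for col in zip(*block)]


def smooth_rows(block):
    """Toy stand-in filter: average the two samples that straddle the
    middle edge of each 4-sample row. NOT the real deblocking filter."""
    out = []
    for row in block:
        avg = (row[1] + row[2]) // 2
        out.append([row[0], avg, avg, row[3]])
    return out


def filter_other_direction(block):
    """Reuse the row filter along columns: rotate, filter, rotate back."""
    return transpose(smooth_rows(transpose(block)))


if __name__ == "__main__":
    block = [[0, 0, 0, 0],
             [0, 0, 0, 0],
             [8, 8, 8, 8],
             [8, 8, 8, 8]]
    for row in filter_other_direction(block):
        print(row)
```

Reusing one filter datapath for both directions is one plausible reading of why the pipelined design needs fewer filters.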

  10. Pipelining • Parallelism Improved • Two filtrations per cycle • Memory Reduced • 5/8 of macroblock stored • Accesses simplified • Fewer Filters • Only need one… • Design now far more complex • 2x code size

  11. Pipeline Issues • Throughput improved, but not perfect • Structural Hazards • Loads and Stores to the Above memory • Third and Fourth Macroblocks conflict • Both need to be rotated at the same time • Outputting Left Blocks • Pipeline drain • Control data shared • Pipeline control state

  12. Relaxed Memory Ordering • Original Sequential Ordering too conservative • Above data is not immediately used • Allowing stores to bypass loads • Separate load and store request queues • Stalls eliminated • Design complexity stays the same • Artificial dependency removed
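
A rough way to see why separate load and store queues remove stalls is to compare a single in-order request queue with a two-queue scheme in which a store may pass an older load that is not yet ready, provided the addresses differ. Everything in this sketch is an assumption made for illustration: the one-request-per-cycle port, the `(kind, addr, ready_cycle)` request format, and the address-conflict rule. The slides do not describe the actual mechanism at this level of detail.

```python
from collections import deque


def run_in_order(requests):
    """Single queue: a head request that is not ready blocks everything behind it."""
    queue = deque(requests)
    cycle = 0
    while queue:
        _kind, _addr, ready = queue[0]
        if cycle >= ready:
            queue.popleft()
        cycle += 1
    return cycle


def run_relaxed(requests):
    """Separate load and store queues; stores are treated as always ready.

    A request may issue only if no older request of the other kind
    targets the same address, so same-address ordering is preserved.
    """
    tagged = list(enumerate(requests))
    loads = deque(t for t in tagged if t[1][0] == "load")
    stores = deque(t for t in tagged if t[1][0] == "store")
    cycle = 0
    while loads or stores:
        candidates = []
        if loads:
            seq, (_, addr, ready) = loads[0]
            blocked = any(s < seq and r[1] == addr for s, r in stores)
            if cycle >= ready and not blocked:
                candidates.append((seq, loads))
        if stores:
            seq, (_, addr, _ready) = stores[0]
            blocked = any(l < seq and r[1] == addr for l, r in loads)
            if not blocked:
                candidates.append((seq, stores))
        if candidates:
            # One memory port: issue the oldest eligible request.
            _, chosen = min(candidates, key=lambda c: c[0])
            chosen.popleft()
        cycle += 1
    return cycle


if __name__ == "__main__":
    reqs = [("load", 0, 3), ("store", 1, 0), ("store", 2, 0), ("load", 3, 0)]
    print(run_in_order(reqs), run_relaxed(reqs))
```

In the example, the stores to unrelated addresses fill the cycles in which the first load is waiting, so the relaxed scheme finishes sooner without reordering any same-address pair.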

  13. Side Buffering • Frequent conflicts between 4x4 blocks • Store one of them in a side buffer • When the resource is available, release the stored data • Sometimes ordering matters – sometimes not • Memory acts as a reorder buffer • Encode priority in rule • Deadlock can be a problem…

  14. Other Refinement • Pipelined Interpredict rules • Chroma interpolation • Improved Interpolator filter implementation • Improved memory subsystem • Previously too general • Needless crossbar • (Figure: Interpolation Sampling)

  15. Results

  16. Results • Nearly 60 fps at 1080p • Power, area, and throughput improvements • Fast Deblocking filter implementation • Faster than any known implementation • Does it really matter?

  17. Questions?
