Modular Refinement of H.264 Kermin Fleming
What is H.264? • Mobile devices • Low bit-rate • Video decoder • Follow-on to MPEG-2 and H.26x • Operates on pixel blocks • Smaller blocks: 4x4, 8x4, 4x8 • In-loop deblocking filter • Baseline profile Bluespec implementation • Works on FPGA!
H.264 Modules • NAL unwrap • Unwraps network packets • Byte stream separated by special tags • Entropy Decoder • Decodes various slices, parameters • Primarily Golomb encoded • Residual data uses CAVLC • Inverse Transform • Reconstructs whole blocks • Quantized frequency coefficients
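The "Golomb encoded" bullet can be illustrated with a small decoder. This is a minimal sketch that assumes the codes in question are the Exp-Golomb codes H.264 uses for most syntax elements (unsigned `ue(v)` and signed `se(v)`). The function names and the use of a `'0'`/`'1'` string as the bitstream are illustrative choices, not taken from the slides or from the Bluespec design.

```python
def decode_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb code word from a '0'/'1' string.

    Returns (value, next_position).
    """
    leading_zeros = 0
    while bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    # A code word is N zeros followed by an (N + 1)-bit binary number
    # whose value is the decoded number plus one.
    start = pos + leading_zeros
    end = start + leading_zeros + 1
    return int(bits[start:end], 2) - 1, end


def decode_se(bits, pos=0):
    """Decode one signed Exp-Golomb code word: 0, 1, -1, 2, -2, ..."""
    k, nxt = decode_ue(bits, pos)
    value = (k + 1) // 2 if k % 2 else -(k // 2)
    return value, nxt


def decode_ue_stream(bits):
    """Decode back-to-back unsigned code words until the string is consumed."""
    values, pos = [], 0
    while pos < len(bits):
        value, pos = decode_ue(bits, pos)
        values.append(value)
    return values
```

For example, `decode_ue_stream("101001100100")` splits the string into the code words `1`, `010`, `011`, `00100`. Short code words go to small values, which is why the scheme suits parameters that are usually close to zero.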
H.264 Modules • Intra-prediction • Prediction based on previously decoded blocks • Corrected by residual • Inter-prediction • Correlation between frames • Motion vectors • Deblocking filter • Removes prediction artifacts • Frame Buffer • Maintains cache of previous frames
Modular Refinement • Latency insensitive design • Data centric • Swap functionally equivalent modules • Design exploration easy • Bluespec generates control • Design timing change? • No problem.
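The idea of swapping functionally equivalent modules can be sketched in software. The model below is a rough illustration, not Bluespec: modules talk only through bounded FIFOs and fire only when their input has data and their output has room, so a stage with a different latency can be dropped in without touching its neighbours. `Fifo`, `FastDoubler`, `SlowDoubler`, and `run` are hypothetical names, and "doubling" stands in for any real computation.

```python
from collections import deque


class Fifo:
    """Bounded FIFO; the only way modules communicate."""

    def __init__(self, depth=2):
        self.q = deque()
        self.depth = depth

    def can_enq(self):
        return len(self.q) < self.depth

    def can_deq(self):
        return len(self.q) > 0

    def enq(self, x):
        self.q.append(x)

    def deq(self):
        return self.q.popleft()


class FastDoubler:
    """Produces one result per cycle."""

    def __init__(self, inp, out):
        self.inp, self.out = inp, out

    def step(self):
        if self.inp.can_deq() and self.out.can_enq():
            self.out.enq(2 * self.inp.deq())


class SlowDoubler:
    """Same function, but holds each item for several cycles."""

    def __init__(self, inp, out, latency=3):
        self.inp, self.out, self.latency = inp, out, latency
        self.held = None
        self.wait = 0

    def step(self):
        if self.held is None:
            if self.inp.can_deq():
                self.held = self.inp.deq()
                self.wait = self.latency - 1
        elif self.wait > 0:
            self.wait -= 1
        elif self.out.can_enq():
            self.out.enq(2 * self.held)
            self.held = None


def run(stage_cls, data):
    """Drive one stage between a producer and a consumer; return (results, cycles)."""
    inp, out = Fifo(), Fifo()
    stage = stage_cls(inp, out)
    pending, results, cycles = deque(data), [], 0
    while len(results) < len(data):
        if out.can_deq():                  # consumer
            results.append(out.deq())
        stage.step()                       # module under test
        if pending and inp.can_enq():      # producer
            inp.enq(pending.popleft())
        cycles += 1
    return results, cycles
```

Running `run(FastDoubler, [1, 2, 3, 4])` and `run(SlowDoubler, [1, 2, 3, 4])` yields the same result list with different cycle counts. The producer and consumer code is identical in both runs, which is the property that makes design exploration cheap.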
Deblocking Filter Details • Block prediction leaves artifacts • Apply a smoothing filter across macroblock boundaries • Highly configurable • [Figure: Macroblock Filter Order]
Original Implementation • Store the whole macroblock • Iteratively filter the macroblock • Store and stream left macroblock • Simple to reason about – much like software • BAD!!!! • Highly sequential • Large storage requirements • Wiring:
Pipelining • Sequential execution was a problem • Unclear how to pipeline design • Data stored in row major • Can be rotated to column major • 16-stage pipeline • Horizontal Filter • Row-to-Column • Vertical Filter • Column-to-Row
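Why a row-to-column rotation lets a single filter serve both directions can be shown on a small block. This is a sketch under stated assumptions: the rotation is modelled as a plain transpose, and `filter_rows` is a hypothetical stand-in (a rounded average with the right-hand neighbour), not the H.264 deblocking filter.

```python
def filter_rows(block):
    """Stand-in horizontal filter: average each sample with its right neighbour."""
    out = []
    for row in block:
        new_row = [(row[i] + row[i + 1] + 1) // 2 for i in range(len(row) - 1)]
        new_row.append(row[-1])  # last sample has no right neighbour
        out.append(new_row)
    return out


def transpose(block):
    """Row-major to column-major (and back)."""
    return [list(col) for col in zip(*block)]


def filter_cols_direct(block):
    """The same stand-in filter written explicitly in the vertical direction."""
    out = [row[:] for row in block]
    for r in range(len(block) - 1):
        for c in range(len(block[0])):
            out[r][c] = (block[r][c] + block[r + 1][c] + 1) // 2
    return out


def filter_cols_via_rotation(block):
    """Vertical filtering using only the horizontal filter plus two rotations."""
    return transpose(filter_rows(transpose(block)))
```

For any rectangular block, `filter_cols_via_rotation(block)` equals `filter_cols_direct(block)`, so the pipeline stages Horizontal Filter, Row-to-Column, Vertical Filter, Column-to-Row can in principle reuse one filter datapath.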
Pipelining • Parallelism improved • Two filter operations per cycle • Memory reduced • 5/8 of macroblock stored • Accesses simplified • Fewer filters • Only need one… • Design now far more complex • 2x code size
Pipeline Issues • Throughput improved, but not perfect • Structural Hazards • Loads and Stores to the Above memory • Third and Fourth Macroblocks conflict • Both need to be rotated at the same time • Outputting Left Blocks • Pipeline drain • Control data shared • Pipeline control state
Relaxed Memory Ordering • Original Sequential Ordering too conservative • Above data is not immediately used • Allowing stores to bypass loads • Separate load and store request queues • Stalls eliminated • Design complexity stays the same • Artificial dependency removed
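The effect of separate load and store request queues can be approximated with a cycle-counting model. This is a simplified sketch, not the actual memory subsystem: it assumes one request is served per cycle, that a load can only issue when its consumer is ready (`load_ready`, a hypothetical callback), and that bypassing stores never touch an address a waiting load still needs.

```python
from collections import deque


def run_single_queue(requests, load_ready):
    """Strict program order: a blocked load at the head stalls everything behind it."""
    q = deque(requests)
    cycle = 0
    while q:
        if q[0] == "store" or load_ready(cycle):
            q.popleft()
        cycle += 1
    return cycle


def run_split_queues(requests, load_ready):
    """Separate queues: stores may bypass a load that cannot issue yet."""
    loads = deque(r for r in requests if r == "load")
    stores = deque(r for r in requests if r == "store")
    cycle = 0
    while loads or stores:
        if loads and load_ready(cycle):
            loads.popleft()      # a ready load takes the memory port
        elif stores:
            stores.popleft()     # otherwise a store uses the idle port
        cycle += 1
    return cycle


# Assumed scenario: the load consumer is busy for the first three cycles.
requests = ["load", "store", "store", "store", "load"]
consumer_ready = lambda cycle: cycle >= 3
```

With this scenario, the single queue spends its first three cycles stalled behind the head load, while the split queues use those cycles for the stores. The bypass is only safe because of the no-aliasing assumption, which matches the slide's point that the Above data is not immediately used.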
Side Buffering • Frequent conflicts between 4x4 blocks • Store one of them in a side buffer • When the resource is available, release the stored data • Sometimes ordering matters – sometimes not • Memory acts as a reorder buffer • Encode priority in rule • Deadlock can be a problem…
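Side buffering can be sketched as a small arbiter. The policy below is an assumption made for illustration: the shared resource serves one block per cycle, a newly arriving block has priority over the parked one, and the side buffer holds a single block. The real design's priorities and buffer depth are not given in the slides.

```python
def arbitrate(arrivals):
    """Return the order in which blocks use a shared one-block-per-cycle resource.

    `arrivals[c]` lists the blocks (zero, one, or two) requesting the
    resource in cycle c.
    """
    side = None          # single-entry side buffer
    served = []
    cycle = 0
    while cycle < len(arrivals) or side is not None:
        incoming = list(arrivals[cycle]) if cycle < len(arrivals) else []
        if len(incoming) == 2:
            if side is not None:
                # Nowhere to park the loser: the real pipeline would stall here,
                # and a careless design could deadlock.
                raise RuntimeError("side buffer full")
            served.append(incoming[0])
            side = incoming[1]
        elif len(incoming) == 1:
            served.append(incoming[0])   # new arrival has priority
        elif side is not None:
            served.append(side)          # resource idle: release parked block
            side = None
        cycle += 1
    return served
```

With `arrivals = [["A", "B"], ["C"], [], ["D"]]`, block B is parked in cycle 0, C overtakes it in cycle 1, and B is released in the idle cycle 2. The output order differs from the arrival order, which is acceptable only where ordering does not matter.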
Other Refinement • Pipelined Interpredict rules • Chroma interpolation • Improved Interpolator filter implementation • Improved memory subsystem • Previously too general • Needless crossbar • [Figure: Interpolation Sampling]
Results • Nearly 60 fps at 1080p • Power, area, and throughput improvements • Fast Deblocking filter implementation • Faster than any known implementation • Does it really matter?
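To put the 60 fps at 1080p figure in per-macroblock terms, the arithmetic can be worked through as follows. The 16x16 macroblock size is standard for H.264. The 100 MHz clock in the usage note is a purely hypothetical value chosen for illustration, since the slides do not state a clock frequency.

```python
import math

MACROBLOCK_SIZE = 16  # H.264 macroblocks are 16x16 luma samples


def macroblocks_per_frame(width, height):
    """Frame dimensions are rounded up to whole macroblocks."""
    return math.ceil(width / MACROBLOCK_SIZE) * math.ceil(height / MACROBLOCK_SIZE)


def macroblocks_per_second(width, height, fps):
    return macroblocks_per_frame(width, height) * fps


def cycle_budget_per_macroblock(clock_hz, width, height, fps):
    """Whole clock cycles available per macroblock at a given (assumed) clock."""
    return clock_hz // macroblocks_per_second(width, height, fps)
```

A 1920x1080 frame covers 120 x 68 = 8160 macroblocks, so 60 fps means 489,600 macroblocks per second. At an assumed 100 MHz clock, `cycle_budget_per_macroblock(100_000_000, 1920, 1080, 60)` gives 204 cycles per macroblock for every stage that must keep up, the deblocking filter included.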