A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER • Kefalas Nikolaos, Theodoridis George • VLSI Design Lab., Electrical & Computer Eng. Department, University of Patras, Greece
Outline • Deblocking filter algorithm • Filtering ordering • Memory organization • Pipelined architecture • Synthesis results and comparisons • Conclusions and future work
Deblocking Filter Algorithm (1/3) • The deblocking filter is used in H.264/AVC to reduce blocking artifacts • It improves subjective and objective quality and typically reduces the bit rate by 5-10% • It is performed on a macroblock (MB) basis after the macroblock reconstruction stage completes • It includes a large number of data-dependent branches, and each 4x4 pixel area is processed up to four times • It takes over one-third (1/3) of the total decoding time
Deblocking Filter Algorithm (2/3) • Each MB is processed in 4x4 blocks • The vertical edges are filtered first, from left to right • from edge V0 to edge V3 • Then the horizontal edges are filtered, from top to bottom • from edge H0 to edge H3 • Eight pixels from two adjacent 4x4 sub-blocks are filtered at the same time • The same process is repeated for the chroma components
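The luma edge order described on this slide can be written down as a short sketch. The function name and the string labels are illustrative only; the V0..V3 and H0..H3 names are the ones used on the slide.

```python
def luma_edge_order():
    """Return the luma edge labels of one macroblock in filtering order."""
    vertical = [f"V{i}" for i in range(4)]    # vertical edges, left to right
    horizontal = [f"H{i}" for i in range(4)]  # horizontal edges, top to bottom
    return vertical + horizontal              # all vertical edges come first


print(luma_edge_order())
```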
Deblocking Filter Algorithm (3/3) • All pixel lines of a sub-edge share one boundary strength (BS) value • The BS, along with two thresholds α and β, decides the filtering strength of each sub-edge • A filter-samples flag is calculated • Three filter types are used • Strong filter (4- or 5-tap filter) • Weak filter • No filtering
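As a minimal sketch, the filter-samples decision for one line of pixels across an edge can be expressed as follows. The condition is the one defined in the H.264/AVC standard; the function and parameter names are hypothetical, and α and β are assumed to have been looked up already.

```python
def filter_samples_flag(bs, alpha, beta, p1, p0, q0, q1):
    """Decide whether one line of samples across an edge is filtered.

    p1, p0 are the two samples nearest the edge on one side,
    q0, q1 the two nearest on the other side.
    """
    return (
        bs != 0                      # BS = 0 means no filtering at all
        and abs(p0 - q0) < alpha     # step across the edge is small enough
        and abs(p1 - p0) < beta      # each side is locally smooth
        and abs(q1 - q0) < beta
    )
```

A large step across the edge is treated as a real image feature rather than a blocking artifact, so the flag stays false and the samples are left untouched.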
Filtering Order • During filtering all four sub-edges of each sub-block are filtered and almost all pixels are involved and updated • A suitable filtering order is needed to: • Reduce the size of the on-chip memory for buffering intermediate data • Increase data reuse • Reduce the external memory accesses • Simplify control and steering logic • Avoid pipeline stalls due to data and resource hazards
Proposed Filtering Order • The vertical sub-edges are filtered in raster-scan order, followed by the horizontal ones • The filtering direction does not change until all vertical edges of luma and chroma have been filtered • The proposed order complies with the standard
Memory Organization (1/2) • Four single-port memories are employed (sizes in bits) • Current-A (CM-A): 96x32 • Current-B (CM-B): 96x32 • Left_mem (LM): 32x32 • Upper_mem (UM): 2xFWx32 + 2x(FW/16)x32, where FW is the frame width • Transpose buffers TR-P and TR-Q (4x32): typical systolic array • All internal buses are 32 bits wide
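To get a feel for these sizes, the sketch below evaluates the expressions for a given frame width. It assumes that each "AxB" entry is a plain product in bits and that FW is the frame width in pixels and a multiple of 16; the function name is hypothetical and the transpose buffers are left out.

```python
def on_chip_memory_bits(frame_width):
    """Evaluate the memory-size expressions from the slide for one frame width."""
    sizes = {
        "CM-A": 96 * 32,
        "CM-B": 96 * 32,
        "LM": 32 * 32,
        # Upper memory grows with the frame width (FW)
        "UM": 2 * frame_width * 32 + 2 * (frame_width // 16) * 32,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


print(on_chip_memory_bits(1920))
```

For a 1920-pixel-wide frame the upper memory dominates, at 130560 of 137728 bits under these assumptions.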
Algorithm Features • Computationally intensive operations of the deblocking filter algorithm • LUT operations: retrieve the values α(IndexA), β(IndexB), c1(IndexA, BS) • BS calculation • Weak filter (BS 1~3): filtering, δ calculation, and clipping operations • Strong filter (BS 4) • The introduced pipeline exploits specific algorithmic features • BS is the same for all micro-edges of a sub-edge for the luma component • The BS of the luma component is reused for the chroma components • For the 4:2:0 format, BS changes every 2 micro-edges in the chroma components
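The reuse of luma BS values for chroma can be illustrated with a small sketch. It assumes a micro-edge is one line of pixels across an edge, so a 16-pixel luma edge has four sub-edges of four micro-edges each, and the matching 4:2:0 chroma edge has eight micro-edges. The function name is hypothetical.

```python
def chroma_bs_for_edge(luma_bs):
    """Map the 4 luma sub-edge BS values of one edge to 8 chroma micro-edges.

    In 4:2:0, chroma line k corresponds to luma line 2k, which lies in
    luma sub-edge (2k) // 4 == k // 2, so the BS changes every 2 lines.
    """
    if len(luma_bs) != 4:
        raise ValueError("expected one BS value per luma sub-edge (4 values)")
    return [luma_bs[k // 2] for k in range(8)]


print(chroma_bs_for_edge([4, 3, 1, 0]))
```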
Pipeline Operation • Each sub-block needs 4 cycles to be processed • The BS unit takes 4 cycles (BS calculation and LUT operations) • BS and LUT operations do not depend on pixel values • BS calculation and LUT operations are overlapped with the filtering operations for the luma component • Four initialization cycles are needed to calculate the BS and the α, β, c1 values for the first luma sub-block
BS=4 Filtering • Filter equations are modified to improve delay and area • BS=4 filter: 13 adders instead of 28 • Total adders: 13+14+4=31
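The slide does not reproduce the modified equations, so the following is only a sketch of the general idea of sharing partial sums. It compares the standard luma strong-filter equations for the p side with a rewritten form that reuses two intermediate sums. The rewritten form is an illustration and is not claimed to be the authors' version or to match their adder count; the selection condition between the strong and the reduced filter is omitted.

```python
def strong_filter_p_reference(p3, p2, p1, p0, q0, q1):
    """Standard H.264 luma strong-filter equations for the p side."""
    new_p0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    new_p1 = (p2 + p1 + p0 + q0 + 2) >> 2
    new_p2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    return new_p0, new_p1, new_p2


def strong_filter_p_shared(p3, p2, p1, p0, q0, q1):
    """Same results, computed from shared partial sums (illustrative only)."""
    t = p1 + p0 + q0                       # used by all three outputs
    u = p2 + t                             # used by all three outputs
    new_p0 = (u + t + q1 + 4) >> 3         # p2 + 2*t + q1 + 4
    new_p1 = (u + 2) >> 2
    new_p2 = (2 * (p3 + p2) + u + 4) >> 3  # 2*p3 + 3*p2 + t + 4
    return new_p0, new_p1, new_p2
```

The q side follows by symmetry, swapping the roles of the p and q samples.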
Pipeline Benefits • LUT operations and BS calculation are not squeezed into a single pipeline stage • The BS unit takes 4 cycles • The filtering operations are spread over three pipeline stages • The BS values are reused for filtering the chroma components • The original filtering equations are modified to improve performance and area • The proposed ordering simplifies control logic and memory addressing, avoiding any potential increase in the critical path
Vertical Edge Filter Process • Total cycles = 4x27 = 108 • If a two-port memory were used, the total would be 4x24 = 96 cycles, which is the optimum
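One way to read the figure of 24 is as the number of 4-pixel sub-edges filtered in one direction of a 4:2:0 macroblock. The sketch below works through that count. This interpretation is an assumption, as are the names; the 27 slots of the single-port case are taken directly from the slide.

```python
CYCLES_PER_SUB_EDGE = 4  # from the slides: each sub-block needs 4 cycles


def sub_edges_per_direction():
    """Count the 4-pixel sub-edges filtered in one direction of a 4:2:0 MB."""
    luma = 4 * 4        # 16x16 luma: 4 edges, each split into 4 sub-edges
    chroma = 2 * 2 * 2  # Cb and Cr, 8x8 each: 2 edges x 2 sub-edges
    return luma + chroma


optimum_cycles = CYCLES_PER_SUB_EDGE * sub_edges_per_direction()
single_port_cycles = CYCLES_PER_SUB_EDGE * 27  # 27 slots, as given on the slide
print(optimum_cycles, single_port_cycles)
```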
Processing Cycles • Vertical edges: 108 cycles • Horizontal edges: 108 cycles • Initialization: 10 cycles • 6 cycles to fetch coding info and initialize control • 4 cycles for the first BS calculation • Normal operation: 108+108+10 = 226 cycles • For the last row (edges 27, 31, 35, 41, 45): 5x4=20 extra cycles • Resource hazard (bus conflict) • For the last MB in the frame, 12 extra cycles are needed (edges 39, 43, 47) • Resource hazard (bus conflict) • Worst-case total: 226+20+12 = 258 cycles
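The cycle budget on this slide adds up as sketched below. The constants are the figures from the slide; the function name and the three position labels are hypothetical, and the last MB of the frame is assumed to pay the last-row penalty as well.

```python
VERTICAL_CYCLES = 108
HORIZONTAL_CYCLES = 108
INIT_CYCLES = 6 + 4          # fetch coding info / init control + first BS
LAST_ROW_EXTRA = 5 * 4       # five edges, 4 cycles each
LAST_MB_EXTRA = 12           # additional cycles for the last MB in the frame


def cycles_per_mb(position="normal"):
    """Cycles needed for one macroblock: 'normal', 'last_row' or 'last_mb'."""
    cycles = VERTICAL_CYCLES + HORIZONTAL_CYCLES + INIT_CYCLES
    if position in ("last_row", "last_mb"):
        cycles += LAST_ROW_EXTRA
    if position == "last_mb":
        cycles += LAST_MB_EXTRA
    return cycles


for pos in ("normal", "last_row", "last_mb"):
    print(pos, cycles_per_mb(pos))
```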
Experimental Setup • Synthesis setup • Synopsys Design Compiler • TSMC 0.18 um • FPGA proven • Standalone, compared with the JM reference software • It has also been verified as part of an H.264 hardware encoder • It achieves 280 MHz on a Virtex-5, speed grade 3
Synthesis Results and Comparisons • Table notes • 1: 1P: single-port, 2P: two-port • 2: FW: frame width • 3: Filtering cycles only • 4: Filtering cycles only • 5: It takes 246 cycles to filter a MB at the right frame boundary • 6: It takes 246 cycles to filter a MB at the bottom frame row • 7: The 2x(FW/16)x32 bits are stored in the upper memory
Conclusions • A novel high-speed pipelined architecture for the H.264/AVC deblocking filter is proposed • It operates at 400 MHz and occupies 19.2 Kgates in 0.18 um CMOS technology • It achieves 216 and 54 fps for Full HD and Ultra HD frames, respectively • Only single-port memories are employed • No external memory accesses are needed during filtering • Parameters and neighbors are stored internally • Only fully filtered data are written to external memory
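The frame rates quoted here can be checked with a rough calculation. The sketch assumes 226 cycles per MB (the normal-operation figure from the deck), ignores the extra cycles at frame boundaries, and takes Full HD as 1920x1080 and Ultra HD as 3840x2160 with dimensions padded up to whole macroblocks. None of these choices is stated on the slide, and the function name is hypothetical.

```python
import math


def frames_per_second(width, height, clock_hz=400_000_000, cycles_per_mb=226):
    """Estimate the frame rate from the clock and the per-MB cycle count."""
    mbs_per_frame = math.ceil(width / 16) * math.ceil(height / 16)
    return clock_hz / (mbs_per_frame * cycles_per_mb)


print(frames_per_second(1920, 1080))  # Full HD
print(frames_per_second(3840, 2160))  # Ultra HD
```

Under these assumptions the estimates come out just under 217 and just under 55 fps, which is consistent with the 216 and 54 fps quoted above once rounded down.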
Hardware Architecture (Pipeline organization) 5/ Threshold Calculation
Deblocking Filter Algorithm 3/3 • Each sub-edge between two adjacent 4x4 luma sub-blocks shares a Boundary Strength (BS) • The BS value, along with two threshold variables α and β, decides the filtering strength of the sub-edge
Hardware Architecture (Pipeline organization) 5/ Bs 1,2,3 filter
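For reference, the core of the BS 1 to 3 (weak) filter for the two samples nearest the edge can be sketched as follows. The δ equation and the clipping are those of the H.264/AVC standard; the function names are hypothetical, 8-bit samples are assumed, and the clipping bound tc is taken as an input that has already been derived from the looked-up c1 value.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def weak_filter_p0_q0(p1, p0, q0, q1, tc, max_val=255):
    """Filter the two samples adjacent to the edge for BS in 1..3.

    Returns (new_p0, new_q0, delta).
    """
    # Python's >> on negative integers is an arithmetic (floor) shift
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    new_p0 = clip3(0, max_val, p0 + delta)  # move p0 towards q0
    new_q0 = clip3(0, max_val, q0 - delta)  # move q0 towards p0
    return new_p0, new_q0, delta
```

The clip on δ limits how far each sample can move, which is what keeps this filter weaker than the BS=4 case.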
Deblocking Filter Algorithm 4/4 • Boundary strength across horizontal edges • The boundary strength is calculated for each sub-edge of the luma component • It is reused for the chroma components in a 2:1 ratio for the 4:2:0 format