170 likes | 321 Views
Sum of Absolute Differences Hardware Accelerator. Mark Lodermeier. Outline. Overview of Motion Estimation and MPEG4 – Part 10 - AVC Approach Tasks Performed To Do. Motion Estimation. Used for video compression – block matching between successive frames
E N D
Sum of Absolute Differences Hardware Accelerator Mark Lodermeier
Outline • Overview of Motion Estimation and MPEG4 – Part 10 - AVC • Approach • Tasks Performed • To Do
Motion Estimation • Used for video compression – block matching between successive frames • Search for best matching block (find motion vectors) • Used to create model of current frame using reference frame(s) from either previous or future frames • Motion vectors found by determining the minimum SAD MV(r,s) = argmin[SAD(x,y,r,s)]
Motion Estimation • Full Search algorithm produces the best results. • However, computationally expensive. • Motion Estimation accounts for 50-70% of computational complexity in MPEG-4 video encoding/decoding • For a small search range of [-8, +7] in each direction with 16x16 macroblocks, there are 16*16 pixel comparisons performed 16*16 times = 65,536 additions of absolute differences for a single 16x16 block • Real-time video of a 480x640
MPEG-4 Part 10 – AVC • Variable Block Sizes • Each 16x16 Macroblock can be split in half into two 16x8 or 8x16 blocks or into four 8x8 sub-blocks • These sub-blocks can then be split in half into two 8x4 or 4x8 blocks or into four 4x4 blocks.
MPEG-4 Part 10 – AVC • Previous 16x16 macroblock split into smaller blocks
Purpose: Many small blocks requires large amount of bits to encode Few large blocks may produce poor quality Can produce higher efficiency with same quality Challenges: Generate motion vectors for all block sizes - Increase computation cost to an already intensive algorithm Choose the correct block size among many choices to balance bandwidth and quality MPEG4 Variable Block Sizes
Approach • How to efficiently generate all MV’s for variable sized blocks? • Take full advantage of parallel nature of both Motion Estimation and the generation of variable sized blocks • Maintain high processor utilization
Tasks Performed • Implemented 1-D systolic array in VHDL • 16 Processing Elements, each with: • Absolute Difference unit • 9 to 2 Compressor • 3 to 1 Compressor
Absolute Difference • Math behind absolute difference unit: • Just check condition - A > B • B + B_not = 2n – 1 B_not = 2n – 1 – B • 2n-1+|A-B| is the value of the sum of the two outputs of the absolute difference unit. • Need to add a correction term of m to get rid of the 2n-1, where m is equal to the number of absolute difference units used.
C B A Abs Diff Unit Abs Diff Unit Abs Diff Unit Abs Diff Unit 9 to 2 Adder Reduction Tree 3 to 1 Reduction Tree Latch Single Processing Element Correction term - 4
Systolic Array C D D D D B A … PE0 PE1 PE2 PE15 4x4 SAD and MV 4x4 SAD and MV 4x4 SAD and MV 4x4 SAD and MV Shift Register Shift Register Shift Register Shift Register … control 16 4x4 SAD values and MV’s 8 4x8 SAD values and MV’s 8 8x4 SAD values and MV’s Back-End Adder Array for Variable Blocks 4 8x8 SAD values and MV’s 2 8x16 SAD values and MV’s 2 16x8 SAD values and MV’s 1 16x16 SAD value and MV
Back-End Adder Array Diagram 12 13 14 15 8 9 10 11 4 5 6 7 0 1 2 3 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 4x4 SAD’s and MV’s 8 8x4 SAD’s and MV’s 8 4x8 SAD’s and MV’s 4 8x8 SAD’s and MV’s 2 8x16 and 2 16x8 SAD’s and MV’s 16x16 SAD and MV Macroblock with 16 4x4 sub-blocks - Each dot represents the following: Min Latch
Another way to represent the generation of motion vectors for the variable sized blocks p x p Matrix • First you take the pxp matrices from the specified 4x4 blocks to generate the matrix for the 4x8 block (where p is the total search range) • To find the corresponding motion vector, you just search for the minimum value in the matrix SAD values for one 4x4 block in all search positions p x p Matrix SAD values for one 4x8 block in all search positions p x p Matrix SAD values for one 4x4 block in all search positions
The top box is the current frame data and the bottom box is the reference frame data Every 4 clock cycles a PE will produce a 4x4 SAD value It takes 16 clock cycles to fill the systolic array and have 100% Processor Utilization After 64 clock cycles PE0 will have completed a 16x16 macroblock search for one position, Every other PE will then do the same on the following cycle So, to do a full search of (-8, 7) in the x and y positions, a total of 64x16 = 1024 cycles are needed. The Schedule for the Accelerator: