ELEC692 VLSI Signal Processing Architecture
Lecture 7: VLSI Architecture for Block Matching Algorithm for Video Compression
* Part of the notes is taken from the course notes of Prof. Bing Zeng’s ELEC 533.
Reference
• P. Pirsch, N. Demassieux, W. Gehrke, “VLSI Architectures for Video Compression: A Survey”, Proceedings of the IEEE, vol. 83, no. 2, pp. 220-246, Feb. 1995
• T. Komarek, P. Pirsch, “Array Architectures for Block Matching Algorithms”, IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1301-1310, Oct. 1989
Interframe Transform/Predictive Coding
• Prediction is based on a previously processed frame
• Prediction is accomplished by motion estimation (ME)
• Motion estimation is done in the spatial domain
• The 2-D DCT therefore has to be inside the coding loop, and a 2-D IDCT is needed to convert the frequency-domain information back to the spatial domain
Block Matching Criterion
• Mean Square Error (MSE)
• Mean Absolute Difference (MAD)
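Written in the notation used later in these notes (x for the reference block, y for the search-window data, N for the block size, (m,n) for the candidate displacement), the two criteria are commonly defined as below. This is the usual textbook form and is an assumption about what the slide showed, not a transcription of it.

```latex
\mathrm{MSE}(m,n) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\bigl(x(i,j) - y(i+m,\,j+n)\bigr)^2

\mathrm{MAD}(m,n) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\bigl|x(i,j) - y(i+m,\,j+n)\bigr|
```

The factor 1/N^2 is the same for every candidate, so it does not change which displacement gives the minimum. Hardware implementations typically accumulate the plain sum, and MAD is usually preferred over MSE because it needs no multiplier.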
Important factors for BM Motion Estimation
• Block size: 8×8, 16×16, or variable
• Size of the search window
  • Depends on frame differences, speed of moving objects, resolution, etc.
• Matching criterion
  • Accuracy vs. complexity, use of truncated pixels
• Search strategy
  • Full search, hierarchical search, subsampling of the motion field
• Hardware considerations
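Real-time processing for BMA
• Let block size = 16×16 and window size = 32×32, assuming CIF frames at 30 frames/s; the required operation rate follows from the number of blocks per frame, the number of candidates per block, and the operations per candidate
• For CCIR 601 or HDTV, full search requires several to tens of GOPS, so it has to be implemented in dedicated hardware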
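The operation rate can be estimated with a few lines of Python. This is only a rough sketch under stated assumptions: the CIF luminance frame is taken as 352×288 pixels, the 32×32 window around a 16×16 block is read as a maximum displacement w = 8 (32 = N + 2w), and each pixel comparison is counted as 3 operations (subtract, absolute value, accumulate). The function name and the operation count per pixel are illustrative choices, not taken from the slides.

```python
def full_search_ops_per_second(width, height, fps, block, w, ops_per_pixel=3):
    """Estimate the arithmetic operations per second needed by full-search block matching."""
    blocks_per_frame = (width // block) * (height // block)
    candidates_per_block = (2 * w + 1) ** 2          # all displacements in [-w, +w]^2
    ops_per_candidate = block * block * ops_per_pixel  # assumed: sub, abs, add per pixel
    return blocks_per_frame * candidates_per_block * ops_per_candidate * fps


# Assumed CIF luminance format: 352 x 288 pixels at 30 frames/s.
cif_ops = full_search_ops_per_second(352, 288, 30, block=16, w=8)
print(f"CIF full search: {cif_ops / 1e9:.2f} GOPS")  # prints "CIF full search: 2.64 GOPS"
```

Under these assumptions CIF already needs a few GOPS, and the count scales linearly with the number of pixels per second and quadratically with w.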
Exhaustive Search Block Matching
• A block of size N×N of the current image (the reference block, denoted by X) is matched with all the blocks located within a search window (the candidate blocks, denoted by Y)
• Maximum displacement: w
• The mean absolute difference (MAD) between the blocks is computed
• The matching distance D is given by $D(m,n) = \sum_{i=1}^{N}\sum_{j=1}^{N} |x(i,j) - y(i+m,\,j+n)|$, with $-w \le m, n \le w$
• The motion vector v is the displacement (m,n) that minimizes D(m,n)
• Number of candidate blocks to be considered: $(2w+1)^2$
Algorithm to find the motion vector
Dmin = MAXVALUE
Vmin = (0,0)
for m = -w to +w
  for n = -w to +w
    D(m,n) = 0
    for i = 1 to N
      for j = 1 to N
        D(m,n) = D(m,n) + |x(i,j) - y(i+m,j+n)|
      endfor
    endfor
    if D(m,n) < Dmin then
      Dmin = D(m,n)
      Vmin = (m,n)
    endif
  endfor
endfor
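The pseudocode above translates almost directly into Python. The following is a minimal sketch, assuming the reference block and the search window are given as lists of lists and that the window is stored with an offset of w, so that displacement (m,n) starts at y[m+w][n+w]. The function name and the synthetic test data are illustrative only.

```python
def full_search(x, y, N, w):
    """Exhaustive block matching.

    x: N x N reference block, y: (N+2w) x (N+2w) search window.
    Returns (motion vector, minimum distance).
    """
    d_min = float("inf")
    v_min = (0, 0)
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = 0
            for i in range(N):
                for j in range(N):
                    # 0-based indices; the +w offset maps displacement -w to index 0
                    d += abs(x[i][j] - y[i + m + w][j + n + w])
            if d < d_min:  # strict '<' keeps the first minimum in scan order
                d_min, v_min = d, (m, n)
    return v_min, d_min


# Synthetic data: every window pixel is unique, and the reference block
# is cut out of the window at displacement (1, -2).
N, w = 4, 2
y = [[100 * r + c for c in range(N + 2 * w)] for r in range(N + 2 * w)]
x = [[y[i + 1 + w][j - 2 + w] for j in range(N)] for i in range(N)]
print(full_search(x, y, N, w))  # ((1, -2), 0)
```

The four nested loops make the 4-dimensional index space (i,j,m,n) explicit: the two inner loops compute D(m,n), the two outer loops perform the minimum search.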
Dependency graph
[Figure: dependency graph of the block matching algorithm, with parts for calculating s_i(m,n) and s(m,n), calculating the MAD, and calculating Dmin and v]
Dependency graph
• The BM algorithm can be described by several different dependency graphs
• Example 1 [Figure: AD = absolute difference and addition, M = minimum value computation]
Dependency graph
• Example 2
Data input
• Line scan and block scan
• Line scan
  • TV lines are run through as a whole, from the upper to the lower side of the frame
• Block scan
  • Square blocks of n×n pixels are run through in a block-line manner
  • Well suited if the data are supplied by a memory with block scan output
  • Pixels within a block are traversed column by column
  • E.g. for a (3×3)-pixel block, data are read in the order x(1,1), x(2,1), x(3,1), x(1,2), x(2,2), x(3,2), x(1,3), x(2,3), x(3,3)
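The column-by-column traversal inside a block can be reproduced with a short sketch. The function name is illustrative, and the indices are 1-based to match the x(i,j) notation of the slide, with i as the row index and j as the column index.

```python
def block_scan_order(n):
    """Return the 1-based (i, j) indices of an n x n block, traversed column by column."""
    # The column index j is the outer loop, so all rows i of a column come first.
    return [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]


order = block_scan_order(3)
print(", ".join(f"x({i},{j})" for i, j in order))
# x(1,1), x(2,1), x(3,1), x(1,2), x(2,2), x(3,2), x(1,3), x(2,3), x(3,3)
```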
Mapping BMA onto Systolic Arrays
• Decompose the algorithm into its basic operations and convert it into a form where each result is assigned to a unique variable (single-assignment form)
• Formulate it as an n-dimensional dependence graph (DG) of computation nodes and data dependence arcs
• One straightforward mapping dedicates a PE to each node of the DG and a communication link to each edge of the DG
• A more efficient design with higher processor utilization results if each PE executes the operations of multiple computation nodes
• This needs a time schedule and an assignment of multiple nodes to a single PE by projection. The PEs need to be programmable to some extent.
Mapping BMA onto Systolic Arrays
• The BMA is defined over a 4-dimensional index space (i,j,m,n)
• The BMA can be decomposed into two parts, each defined over a two-dimensional index space
• The first is spanned by the indices i and j: the summation that yields D(m,n)
• The second is defined over m and n: the minimum search and the selection of the displacement vector
Transform it into a 2-D array
• The computation of D(m,n) is mapped onto a 2-D array of PEs
• The computation of v(X,Y) is mapped into time
Realistic implementation of 2-D array
• Reduction of the cycle time
  • Pipelining of the computation of D(m,n)
• I/O management
  • Each of the AD-PEs receives a new value of y(m+i,n+j) at each clock cycle
  • Transmitting the N^2 values from an external memory is not feasible. We can take advantage of the fact that these values belong to the search window.
  • A portion of the search window of size N·(2w+N) is stored in the circuit in a 2-D bank of shift registers, able to shift in the up, down, and right directions
  • Each AD-PE has one of these registers and can obtain at each cycle the value of y(m+i,n+j) that it needs
  • To update this register bank, a new column of 2w+N pixels of the search area is serially entered into the circuit and inserted into the bank of registers
  • To load a new reference block with low I/O overhead, double buffering of x(i,j) is required, with the pixels x’(i,j) of the new reference block serially loaded during the computation of the current reference block
2-D array
• An alternative projection of the DG onto the (i,j)-plane provides the architecture AB2
• The current frame data x(i,j) remain fixed in the AD PEs, so they have to be loaded into the array beforehand
• Time required = (2w+1)*(2w+1)
Mapping to a 1-D array
• A more efficient design with higher processor utilization results if each PE executes the operations of multiple computation nodes
• The DG is mapped onto a 1-D array of PEs, which computes in parallel the partial distortion along one row
• D(m,n) is computed in N cycles
1-D array
• Project the DG along the i-axis onto a one-dimensional signal flow graph
• Called the AB1 array, it has the size of one side of a block (N PEs)
• Consecutive computation of all (2w+1)^2 candidate blocks for one displacement vector takes N*(2w+1)^2 time instances
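The timing of the 1-D array can be illustrated with a small behavioral model. This is a sketch only, not a description of the actual AB1 data flow: it assumes N PEs that each handle one column j of the block, process one block row i per cycle, and feed a common accumulator. Pipeline latency and I/O are ignored, and the function name and test data are illustrative.

```python
def ab1_style_search(x, y, N, w):
    """Behavioral model: N PEs process one block row per cycle.

    x is the N x N reference block; y is the (N+2w) x (N+2w) search window,
    stored so that displacement (m, n) starts at y[m + w][n + w].
    Returns (motion vector, minimum distance, cycle count).
    """
    cycles = 0
    d_min, v_min = None, (0, 0)
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = 0
            for i in range(N):  # one cycle per row i of the block
                # The N PEs (one per column j) work in parallel in this cycle.
                row = [abs(x[i][j] - y[i + m + w][j + n + w]) for j in range(N)]
                d += sum(row)
                cycles += 1
            if d_min is None or d < d_min:
                d_min, v_min = d, (m, n)
    return v_min, d_min, cycles


N, w = 4, 2
y = [[100 * r + c for c in range(N + 2 * w)] for r in range(N + 2 * w)]
x = [[y[i + 1 + w][j - 2 + w] for j in range(N)] for i in range(N)]  # true shift (1, -2)
vector, distance, cycles = ab1_style_search(x, y, N, w)
print(vector, distance, cycles)  # (1, -2) 0 100
```

The cycle count of 100 equals N·(2w+1)^2 = 4·25 for this small example.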
Another way of mapping: search area based
• The dependency graph for computing v(X,Y) is mapped onto a 2-D array of (2w+1)^2 PEs, while the dependency graph for computing D(m,n) is mapped into time
• Each PE, working in parallel, keeps track of one particular distortion computation and sequentially explores the reference block
• At each cycle, each PE receives a different value of y(m+i,n+j), and all the PEs receive the value of one pixel of the reference block, which is broadcast to the array
• After N^2 cycles, each of the (2w+1)^2 PEs holds one value of D(m,n) corresponding to a particular displacement (m,n)
• To find the minimum distortion value, find the minimum of each column by down-shifting the D(m,n) in the PEs, then find the final minimum value by left-shifting the resulting D(m,n) in the M-PEs
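A behavioral sketch of the search-area-based mapping: one accumulator per displacement, one broadcast reference pixel per cycle, then a two-stage minimum search (per column, then across columns). The shifting of data between PEs is not modeled, and the function name, index layout, and test data are assumptions for illustration.

```python
def search_area_array(x, y, N, w):
    """Model a (2w+1) x (2w+1) PE array; PE (pm, pn) handles displacement (pm-w, pn-w).

    Returns (motion vector, minimum distance, accumulation cycles).
    """
    side = 2 * w + 1
    acc = [[0] * side for _ in range(side)]  # one distortion accumulator per PE
    cycles = 0
    for i in range(N):
        for j in range(N):
            ref = x[i][j]  # one reference pixel is broadcast to all PEs
            for pm in range(side):
                for pn in range(side):
                    # Each PE sees its own search-area pixel in this cycle.
                    acc[pm][pn] += abs(ref - y[i + pm][j + pn])
            cycles += 1
    # Stage 1: minimum of each column of PEs; stage 2: minimum over the column results.
    col_min = [min((acc[pm][pn], (pm - w, pn - w)) for pm in range(side))
               for pn in range(side)]
    d_min, v_min = min(col_min)
    return v_min, d_min, cycles


N, w = 4, 2
y = [[100 * r + c for c in range(N + 2 * w)] for r in range(N + 2 * w)]
x = [[y[i + 1 + w][j - 2 + w] for j in range(N)] for i in range(N)]  # true shift (1, -2)
vector, distance, cycles = search_area_array(x, y, N, w)
print(vector, distance, cycles)  # (1, -2) 0 16
```

All distortions are ready after N^2 = 16 accumulation cycles here; the extra cycles a real array spends on the shift-based minimum search are not counted.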
2-D search area based architecture
• Part of the search area, of size w.(2w+N), needs to be stored in order to reduce I/O
1-D search area based architecture
• An array of (2w+1) processing elements executes in N^2 cycles the computation of the distortions D(m,n) corresponding to one line (resp. column) of possible motion vectors
• This process is repeated sequentially 2w+1 times to compute all the distortions
Another architecture
• Requires only a sequential data input
• Dummy data, denoted by dots, are inserted into the stream of reference data to guarantee a regular data flow without any data permutation within the array
• Time required = (2w+1)*(2w+1)*N
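The time figures quoted for the different architectures can be compared side by side. The sketch below simply evaluates the cycle counts stated in these notes for one reference block; the labels are shorthand chosen here, and pipeline fill, I/O, and the final minimum search are not included.

```python
def cycle_counts(N, w):
    """Cycle counts per reference block, as stated for each architecture in the notes."""
    side = 2 * w + 1
    return {
        "AB2 (2-D array, block based)": side * side,
        "AB1 (1-D array, block based)": N * side * side,
        "2-D search-area based": N * N,
        "1-D search-area based": N * N * side,
        "sequential-input architecture": side * side * N,
    }


for name, cycles in cycle_counts(N=16, w=8).items():
    print(f"{name}: {cycles} cycles")
```

For N = 16 and w = 8 the two 2-D arrays need a few hundred cycles per block (289 and 256), while the 1-D variants need several thousand, which reflects the usual trade-off between the number of PEs and processing time.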