540 likes | 764 Views
Motion Estimation Based Frame Rate Conversion Hardware Design s. by Özgür Taşdizen PhD Thesis Sabancı University May 2010. Outline. Introduction Motion Estimation Hexagon Based Motion Estimation Algorithm and Hardware Architectures for Its Implementation
E N D
Motion Estimation Based Frame Rate Conversion Hardware Designs by Özgür Taşdizen PhD Thesis Sabancı University May 2010
Outline • Introduction • Motion Estimation • Hexagon Based Motion Estimation Algorithm and Hardware Architectures for Its Implementation • Dynamic Step Search Motion Estimation Algorithms and a Hardware Architecture for Their Implementation • Computation Reductions for Vector Median Filtering • Frame Interpolation Hardware • Conclusions
Frame Rate Up-Conversion (FRC) • FRC is the conversion of a lower frame rate video signal to a higher frame rate video signal. • Broadcasting standards are fixed to 50 Hz for PAL and 60 Hz for NTSC and movies are recorded either in 24,25 or 30fps, whereas currently available flat panel displays support frame rates of up to 240 Hz. • Simple techniques degrade the quality of the video, whereas motion estimation based techniques producing high quality results are difficult to implement in real-time.
Motion Estimation (ME) • ME is the process of finding similarities between adjacent frames. • Application areas: Video Compression, FRC, De-interlacing, De-Noising and Super Resolution. • Block Based ME is the most preferred method due to its simplicity. Previous Frame Current Frame Motion Vectors (MVs)
Block Based ME The goal is finding the closest matching block. M, N: MacroBlock (MB) of Size MxN d(dx, dy): MV c : Current Frame r : Reference Frame [-p,p] : Search Window
Full Search (FS) • FS computes SADs for each search location in the search window. • Processing of HD videos require larger search ranges due to larger motions between consecutive frames. There are (2p + 1)2 search locations in a (±p, ±p) search window. • A search range of (±63, ±63) pixels consists of 16129 search locations.Implementing FS in this search range for 1920x1080 resolution and 25 fps video requires 2.51 TOPS. Search locations of FS (p=4)
Fast Search Algorithms • They are developed for low bit-rate applications and they try to reach the PSNR of FS by checking fewer search locations. • However, they obtain lower PSNR results for larger search ranges which is necessary for HD video. • In addition, they are not suitable for HW implementation due to their sequential nature. • Successful fast search algorithms: TSS, 2D-LS, NTSS, FSS, BBGDS, DS, HEXBS, ARPS, ADCS, FTS. Pixel Location Search Location of First Step Search Location of Second Step Search Location of Third Step Selected Location of First Step Selected Location of Second Step Selected Location of Third Step
Motivation • Fast search ME algorithms perform very well for low bit-rate applicationssuch as video phone and video conferencing. • However, fast search algorithms do not produce satisfactory results for the recently available consumer electronics devices such as high frame rate HD flat panel display. • The computational complexity of FS algorithm is very high, especially for the recently available HD applications. • Therefore, we propose new ME algorithms, which have similar performance with FS algorithm, and hardware architectures for efficiently implementing these algorithms in order to support real-time processing of HD videos on low cost FPGA devices.
Hexagon Based Search (HEXBS) • Consists of two search patterns; coarse and fine. • Each new hexagon brings three new search locations. Pixel Location Search Location of First Step Search Location of Second Step Search Location of Third Step Search Location of Fourth Step Selected Location of First Step Selected Location of Second and Third Steps CoarsePattern FinePattern Selected Location of Fourth
Proposed Hexagon Based ME Algorithm • Generalization of the HEXBS ME algorithm • Consists of two search steps: Main and Fine • 32x16 main search pattern consist of all the search locations that can be checked by HEXBS algorithm during several iterations • Used with various fine search patterns Fine Search Patterns Plus Side DoubleCross 32x16 Main Search Pattern
Proposed Smaller Search Patterns • 10x9, 12x12, 14x15 main search patterns have two pixel gap in the vertical direction. 10x9 Main Search Pattern
Benchmark Suite • Used to evaluate the performances of proposed search patterns. • The video sequences are 100 frames long. Foreman (352x288,15 fps) TableTennis (704x480,15 fps) Flowers (704x480,15 fps) Susie (704x480,15 fps) Spiderman (720x576,25 fps) Gladiator (720x576,25 fps) IRobot (720x576,25 fps)
Performance Evaluation MAD (u,v) = MAD results (Frame Distance = 2)
BRAM 0 BRAM BRAM 1 BRAM BRAM 2 BRAM BRAM 3 BRAM 4 32 32 32 32 32 Horizontal Rotator 128 BRAM 5 BRAM BRAM 6 BRAM BRAM 7 BRAM BRAM 8 BRAM 9 32 32 32 32 32 128 Horizontal Rotator 128 128 128 128 128 BRAM 10 BRAM 11 BRAM 12 BRAM 13 BRAM 14 128 128 32 32 32 32 32 128 128 Horizontal Rotator 128 128 128 128 128 128 128 Vertical Rotator 128x16 16 16x16 PE Array 128 128 Adder Tree SAD 128 128 128 128 128 128 128 128 128 128 128 128 128 128 BRAM 75 BRAM 76 BRAM 77 BRAM 78 BRAM 79 32 32 32 32 32 Horizontal Rotator 128 Proposed 16x16 Generic Architecture
Datapath of The Systolic Hardware Architecture • 256 Processing Elements (PE) to process 16x16 MB • 16 Block RAMs (BRAM), each configured as 16-bit wide 01234567............... 16 8 BRAM 0 PE 0,0 PE 1,0 PE 2,0 PE 3,0 PE 4,0 PE 5,0 PE 6,0 PE 7,0 PE 8,0 PE 9,0 PE 10,0 PE 11,0 PE 12,0 PE 13,0 PE 14,0 PE 15,0 8 8 8 16 8 8 BRAM 1 PE 0,1 PE 1,1 PE 2,1 PE 3,1 PE 4,1 PE 5,1 PE 6,1 PE 7,1 PE 8,1 PE 9,1 PE 10,1 PE 11,1 PE 12,1 PE 13,1 PE 14,1 PE 15,1 8 8 16 BRAM 2 PE 0,2 PE 1,2 PE 2,2 PE 3,2 PE 4,2 PE 5,2 PE 6,2 PE 7,2 PE 8,2 PE 9,2 PE 10,2 PE 11,2 PE 12,2 PE 13,2 PE 14,2 PE 15,2 VERTICAL ROTATOR 16 from each PE ADDER TREE SAD 16 BRAM 15 PE 0,15 PE 1,15 PE 2,15 PE 3,15 PE 4,15 PE 5,15 PE 6,15 PE 7,15 PE 8,15 PE 9,15 PE 10,15 PE 11,15 PE 12,15 PE 13,15 PE 14,15 PE 15,15 8 254 255 8
Dataflow Through The Systolic PE Array • Loading the reference data of the initial search location takes 8 clock cycles. • Consequtive search locations require only 1 clock cycle . • Therefore, completing the search for a single MB takes 672, 236, 176, and 122 clock cyles for 32x16, 14x15, 12x12 and 10x9 patterns, respectively.
Implementation Results • Proposed architectures are implemented in VHDL, verified with simulations using Modelsim and mapped to a low-cost XC3S1200E-5 FPGA using Xilinx ISE. • The systolic architecture consumes 6648 LUTs and 16 BRAMs. • Both hardware architectures can run at 144 MHz when implemented on a low cost XC3S1200E-5 FPGA, and they can process 25 1920x1080 frames per second for the largest search range of (±32,±16) pixels.
Summary • A Hexagon-Based ME algorithm having lower computational complexity than FS ME algorithm is proposed. • The simulation results showed that the PSNR obtained by this algorithm is better than the PSNR obtained by other fast search algorithms. • Two high performance hardware architectures, generic and systolic, for implementing this algorithm are proposed. • Generic architecture can be used for implementing any search pattern but its on-chip memory area and on-chip memory bandwidth requirement is high. • Systolic architecture is an effective way of implementing proposed search patterns.
Dynamically Variable Step Search (DVSS) • DVSS has different number and size of search steps which can be dynamically reconfigured. • DVSS algorithm decreases the computational complexity by adaptively changing between search patterns A1, A2, A3. • The number of steps and the search range of each step are determined for the current block based on the size and SAD value of the previously found MV for the left neighboring block, which is called as Left Neighboring Motion Vector (LNMV). Search Range of First Step Search Range of Second Step Search Range of Third Step Number of Search Locations
Pseudo Code of DVSS Algorithm • If LNMV falls within a smaller search range, DVSS decreases the search granularity and search range size, because for small motions doing the search in a smaller search range is sufficient and doing a finer granularity search in a smaller search range can give better MAD results. • If there is no left neighboring block • Do Pattern A1 • Else if SAD value of LNMV exceeds the threshold (τ) • Switch to next coarser pattern • Else • If LNMV is within (±8, ±4) pixels • Do FS in (±10, ±5) search range • Else if LNMV is within (±16, ±8) pixels • Do Pattern A3 • Else if LNMV is within (±24, ±12) pixels • Do Pattern A2 • Else • Do Pattern A1
MAD Results of DVSS Algorithm • The MAD performance gap between the search pattern A1 (405 search locations) and FS (4753 search locations) is only 7.5% on the average. • DVSS decreases the computational complexity significantly with a small decrease in the MAD performance.
4 select 4 load 9addr B 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 256 x8 32 16x16 Reconfigurable PE Array 16 BRAMs Dual Port (512 x 32bit) 32 32 56 32 32 Multiplexing Unit Vertical Rotators Shift Registers Adder Tree data A 32 32 56 data B 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 32 32 56 32 32 addr A 16 9 4 7 2 SAD start Top-Level Controller 11 coordinate rotate amount shift enable select signals jump amount jump 2 Control Unit Min. SAD done 16 13 MV clk start done Proposed Reconfigurable and SystolicHardware Architecture
PE 0,0 PE 0,1 PE 0,2 PE 0,3 PE 0,4 PE 0,5 PE 0,13 PE 0,14 PE 0,15 m m m m PE 1,0 PE 1,1 PE 1,2 PE 1,3 PE 1,4 PE 1,5 PE 1,13 PE 1,14 PE 1,15 m m m m PE 15,0 PE 15,1 PE 15,2 PE 15,3 PE 15,4 PE 15,5 PE 15,13 PE 15,14 PE 15,15 m m m m 16x16 Reconfigurable PE Array • Multiplexers are placed between PEs to implement reconfigurability.
Dataflow Through The Reconfigurable & Systolic PE Array • The reference pixels for the first search location in a line of the search window are loaded in 4 clock cycles. • SAD value of next search location is calculated in 1 cycle. • Reference data is shifted to right in the PE array in each consecutive clock cycle and shift amount can be 4, 2 or 1 pixels depending on the type of the step; coarse, medium or fine respectively.
Performance of Proposed Hardware Architecture Search Pattern Supported Frame Size & Rate A1 633 205371 1920x1080, 25.3 fps B 957 135841 1366x768, 33.1 fps C 1221 106470 1366x768, 25.9 fps 10x9 122 1180327 1920x1080, 145.7 fps 14x15 236 610169 1920x1080, 75.3 fps 32x16 672 214285 1920x1080, 26.4 fps 48x24 1425 101052 1366x768, 24.6 fps FS 5103 25475 720x576, 15.7 fps • Works at 130MHz on a XC3S1500-5, and consumes 9128 slices and 16 BRAMs. • A1, A2, and A3 require 633, 357, and 380 clock cycles, respectively. • DVSS requires on the average 467 cycles per MB when τ is set to 256 and can support 34.3 HD fps. Required Clock Cycles per MB Processed MBs per Second
Performance of Proposed Hardware Architecture for DVSS Algorithm Required Cycles for 100 Frames MBs per Frame Average Cycles per MB Video Sequence Supported 1920x1080 fps
Recursive Dynamically Variable Step Search (RDVSS) • RDVSS has three distinct search paths: • Temporal Search: Can track the camera movement, applied if Motion Vector Fields (MVFs) are similar • Spatial Search: Can track the object movement, applied if Temporal path fails • Main Search: Performs well for complex motions, applied if Spatial and Temporal paths fail Main Search has a maximum range of (±64,±64) pixels Spatially Searched Areas MB (i-1,j-1, t) MB (i,j-1, t) MB (i+1,j-1,t) Current MB Temporally Searched Area MB (i-1,j, t) MB (i,j, t)
RDVSS Search Patterns • As in the DVSS algorithm, each search pattern has a maximum of three different granularity search steps with different size search ranges. • In the first, second, and third steps, horizontal and vertical distances between search locations are 4, 2, and 1 pixels, respectively.
Early Search Termination Based on an Adaptive Threshold Level • The minimum SAD values of four neighboring MBs (i-1,j-1,t), (i,j-1,t), (i+1,j-1,t), and (i-1,j,t) are selected and compared with a pre-determined threshold (τ). The comparison selects the greater value as the early search termination level.
Pseudo Code of RDVSS Algorithm Iteration 1: If (TD is equal or less than (±4,±4) pixels) Do Recursive Small Pattern around MV(i,j,t-1) Else if (TD is equal or less than (±8,±8) pixels) Do Recursive Medium Pattern around MV(i,j,t-1) Else if (TD is equal or less than (±16,±16) pixels) Do Recursive Large Pattern around MV(i,j,t-1) Else Do 1x1 Full Search Pattern around MV(i,j,t-1) Iteration 2: If (SD is equal or less than (±3,±3) pixels) Do 3x3 Full Search Pattern around ASNMV Else Do 1x1 Full Search Pattern around MV(i-1,j-1,t), MV(i,j-1,t), MV(i+1,j-1,t), and MV(i-1,j,t) Iteration 3: If (SD is equal or less than (±16,±16) pixels) Main Small Pattern around (0,0) Else if (SD is equal or less than (±32,±32) pixels) Do Main Medium Pattern around (0,0) Else Do Main Large Pattern around (0,0) Until (Main Large Pattern is used) Do next larger Main Pattern around (0,0) Temporal Search Spatial Search Main Search
HD Videos Added to The Benchmark Suite “IceAge2”, “ParkJoy1080p”, “Ducks”, and “ParkJoy720p” are 50 frames long and remaining videos are 100 frames long. IceAge2 (1920x1080,25 fps) ParkJoy (1920x1080,25 fps) Spiderman3 (1280x576,25 fps) Ducks (1280x760,25 fps) SthlmPan (1280x760,25 fps)
MAD Performance Results of RDVSS • When threshold level is set to 256, RDVSS performs 14.7% close to FS. • RDVSS has nearly the same performance with the main coarse pattern, which checks 1113 search locations. • DVSS gives slightly better results for videoscontaining very small motions, because it checks moresearch locations in the fine search step.
Computation Savings of RDVSS • FS checks 16641 search locations within (±64,±64) search window, whereas RDVSS checks 418 (97.5% less) search locations on the average when the threshold value is set to 1024. • Whencompared with DVSS for a maximum search rangeof (±48, ±24) pixels and for the same threshold level(τ=256), RDVSS searches 34% less search locationson the average while giving nearly the same MADresults. Number of search locations
Vector Median Filter (VMF) • VMFs are non-linear filters. • They are mainly used for removing the noise from a signal by smoothing out the signal. Recently, they are used forFRC for finding the true motion information. • The output of the VMF is chosen as the vector among the input vectors that minimizes the sum of distances to all the other vectors. • VMFs are difficult to implement in real-time because of their high computational complexity.
Smoothing Current frame and its MVF Original MVF and smoothed MVF
Proposed Data-Reuse Technique Some of the vector distances for consecutive filtering windows : (a) tn , (b) tn+1 , (c) tn+2 Vectors belonging to 3x3 windows Required number of computations for various window sizes Savings of data-reuse technique
Proposed Spatial Correlations Techniques • Correlation 1 compares new vectors entering the window among themselves. • Correlation 2 compares new vectors with vector in the middle of the current filtering window. • Correlation 3 compares new vectors with the remaining vectors of the current filtering window. • Correlation 1, 2 and 3 require (N2-N), 2N and 2(N3-N2) comparisons, respectively. • Correlation 2 and 3 require 2 and 2(N2-N) storage operations, respectively. Number of comparison operations for proposed techniques Number of storage operations for proposed techniques
Computation Reductions for 3x3 VMF • For a 3x3 window size, the proposed data-reuse and spatial correlations 1 and 2 techniques together reduced the number of operations from 414 to 114.
“Dif” Parameter Computation reductions for 3x3 VMF for various “dif” values Average computation reductions
VMF Hardware • To the best of our knowledge, no hardware architecture is presented in the literature implementing theproposed techniques. • The proposed architecture is implemented for a 3x3 filtering window but it is scalable to any filtering window size. Top-level Block Diagram Weighting & Minimum Selector
Implementation Results • The proposed hardware architecture is implemented in VHDL, and mapped to a low cost Xilinx XC3S400A-5 FPGA. • It consumes 1426 slices. • It can work at 145 MHz. • Since processing a 1920x1080 HD frame with MVs corresponding to 4x4 MBs requires 1569440 clock cycles, processing a frame takes 10.824ms. Therefore, the proposed hardware can process 92 HD frames per second when there is no computation reduction by using spatial correlations. When spatial correlations 1 and 2 are used, on the average, 110 HD frames per second can be processed.
Frame Interpolation • Real-time interpolation of HD frames is a major design challenge. • We propose a low cost reconfigurable hardwarearchitecture forreal-timeimplementation of frame interpolation algorithms. • The main bottleneck in an FRC system is accessing the frame memory. The required memory bandwidth of the example system shown below is 6.5 GB/s.
Frame Interpolation Techniques Linear Interpolation Motion Compensated Averaging Static Median Filtering Dynamic Median Filtering Soft Switching Cascaded Median Filtering
PSNR Results • All the ME based techniques perform better than LI. • Although checking much fewer search locations, DVSS obtains similar PSNR results with FS. The performance gap between DVSS and FS is less than 1%.
Frame Interpolation Hardware • The proposed architecture implements LI, MCA, SMF, DMF, SS and CMF. • It allows adaptive selection of these algorithms for each MB. • It takes the selected interpolation algorithm and the MV as inputs.
Datapath and BRAMs • Each box labeled R, G, B represents a processing element. • 30 16-bit rotators are used for aligning the interpolated MB.