COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION 03/26/2012
OUTLINE • Introduction • Motivation • Network-on-Chip (NoC) • ASIC based approaches • Coarse grain architectures • Proposed Architecture • Results
INTRODUCTION • Goal • An application-specific hybrid coarse-grained reconfigurable architecture built on a NoC • Purpose • Support Variable Block Size Motion Estimation (VBSME) • Is this the first approach? • No: ASICs and other coarse-grained reconfigurable architectures exist • Differences • Use of intelligent NoC routers • Support for both full and fast search algorithms
MOTIVATION • [Figure: profile of H.264 encoder computation, highlighting motion estimation]
MOTION ESTIMATION • Each 16x16 block of the current frame is matched against candidate blocks inside a search window of the previous frame • The candidate minimizing the Sum of Absolute Differences (SAD) determines the motion vector
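The SAD criterion and full (exhaustive) search can be sketched in plain Python. This is only a software model of the matching arithmetic, not the hardware discussed later; `full_search` and its parameters are names chosen here for illustration.

```python
def sad(cur, ref):
    """Sum of Absolute Differences between two equally sized blocks
    (each block is a list of rows of pixel values)."""
    return sum(abs(c - r)
               for crow, rrow in zip(cur, ref)
               for c, r in zip(crow, rrow))

def full_search(current, previous, bx, by, block=16, radius=8):
    """Exhaustive search: evaluate every candidate in a
    (2*radius+1)^2 window of the previous frame and return the
    motion vector (dx, dy) with minimum SAD, plus that SAD."""
    h, w = len(previous), len(previous[0])
    cur = [row[bx:bx + block] for row in current[by:by + block]]
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            if 0 <= x and 0 <= y and x + block <= w and y + block <= h:
                cand = [row[x:x + block] for row in previous[y:y + block]]
                s = sad(cur, cand)
                if best is None or s < best[0]:
                    best = (s, (dx, dy))
    return best[1], best[0]
```

A block copied from the previous frame at an offset of (2, 1) is found exactly, with SAD 0.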
SYSTEM-ON-CHIP (SOC) • Single-chip systems • Common components • Microprocessor • Memory • Co-processor • Other blocks • With growing processing power and data-intensive applications, facilitating communication between the individual blocks has become a challenge
NETWORK-ON-CHIP (NOC) • Efficient communication via transfer protocols • Must respect the strict constraints of the SoC environment • Types of communication structure • Bus • Point-to-point • Network
BACKGROUND • Motion estimation has been mapped to general-purpose processors, ASICs, FPGAs, and coarse-grained architectures • Most support only fixed block size ME (FBSME) • VBSME supported only through redundant hardware • General-purpose processors • Can exploit parallelism • Limited by their inherently sequential nature and data access through registers
CONTINUED… • ASIC • No support for all block sizes of H.264 • Where support exists, it comes at the cost of high area overhead • Coarse-grained • Overcomes the drawbacks of LUT-based FPGAs • Elements of coarser granularity • Fewer configuration bits • But under-utilization of resources
ASIC Approaches • Topology and SAD accumulation style • 1D systolic array (partial sum) • Reference pixels broadcast; SAD computation for each 4x4 block pipelined • Each processing element computes a pixel difference, accumulates it onto the incoming partial SAD, and forwards the result to the next processing element • Large number of registers to store partial SADs -> area overhead and high latency • Mesh-based architecture (partial sum) • Also stores partial SADs -> area overhead, high latency, no VBSME • 2D systolic array (parallel sum) • All pixel differences of a 4x4 block computed in parallel • Reference pixels are reused • Direction of data transfer depends on the search pattern
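The two accumulation styles above can be contrasted in a small software sketch (function names are chosen here, not taken from the paper): partial sum forwards a running SAD through a chain of PEs, while parallel sum computes all differences at once and reduces them with an adder tree. Both produce the same value.

```python
def partial_sum_sad(cur_row, ref_row):
    """Model a chain of PEs: each PE computes one absolute pixel
    difference, adds it to the partial SAD received from the previous
    PE, and forwards the new partial SAD to the next PE."""
    partial = 0
    for c, r in zip(cur_row, ref_row):
        partial += abs(c - r)       # PE adds its difference, forwards the sum
    return partial

def parallel_sum_sad(cur_row, ref_row):
    """Model the parallel-sum style: every difference is available at
    once and the total is formed by a pairwise adder-tree reduction."""
    diffs = [abs(c - r) for c, r in zip(cur_row, ref_row)]
    while len(diffs) > 1:
        pairs = [diffs[i] + diffs[i + 1] for i in range(0, len(diffs) - 1, 2)]
        if len(diffs) % 2:
            pairs.append(diffs[-1])  # odd element passes through this level
        diffs = pairs
    return diffs[0]
```

The chained version takes as many sequential steps as there are pixels, while the tree reduces in logarithmic depth, which is the latency difference the table alludes to.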
OU’S APPROACH • 16 SAD modules process the 16 4x4 motion vectors • VBSME processor • Chain of adders and comparators computes the larger SADs • PE array • Basic computational element of a SAD module • Cascade of 4 1D arrays • 1D array • 1D systolic array of 4 PEs • Each PE computes a one-pixel absolute difference
[Figure: SAD modules. Sixteen modules, each fed 32-bit current_block_data_i and search_block_data_i; a PE array of four cascaded 1D arrays computes SAD_i, and a MUX driven by strip_sel selects among SAD_0..SAD_15 and MV_0..MV_15. Block strips A and B are addressed via read_addr_A, read_addr_B, and write_addr.]
[Figure: 1D array. Four PEs separated by delay registers (D) on 32-bit data paths, feeding an accumulator (ACCM).]
PUTTING IT TOGETHER • Each clock cycle • Columns of the current 4x4 sub-block are scheduled through a delay line • Two sets of search-block columns are broadcast • 4 block-matching operations execute concurrently per SAD module • 4x4 SADs -> 4x4 motion vectors • A chain of adders and comparators merges 4x4 SADs -> 4x8 SADs -> … -> 16x16 SADs • Drawbacks • No reuse of search data between modules • Resource wastage
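The merging performed by the chain of adders can be modeled in software. The sketch below (function name chosen here) shows only the 8x8 and 16x16 levels; the 4x8, 8x4, 16x8, and 8x16 partitions merge analogously, since every larger SAD is just the sum of its constituent 4x4 SADs at the same candidate position.

```python
def merge_sads(sad4x4):
    """sad4x4: a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16
    macroblock at a single candidate position. Returns the four 8x8
    SADs (as a 2x2 grid) and the 16x16 SAD."""
    sad8x8 = [[sad4x4[2 * i][2 * j] + sad4x4[2 * i][2 * j + 1] +
               sad4x4[2 * i + 1][2 * j] + sad4x4[2 * i + 1][2 * j + 1]
               for j in range(2)] for i in range(2)]
    sad16x16 = sum(sum(row) for row in sad8x8)
    return sad8x8, sad16x16
```

Because every level reuses the sums of the level below, all block sizes come out of one pass over the sixteen 4x4 SADs, which is why a single adder/comparator chain suffices.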
ALTERNATIVE SOLUTION: COARSE GRAIN ARCHITECTURES • Performance in clock cycles for a frame of size M x 0.8M • ChESS: (M x 0.8M)/256 x 17 x 17 • MATRIX: (M x 0.8M)/256 x 17 x 17 • RaPiD: 272 + 32M + 14.45M^2 • Drawbacks • Resource utilization • Generic interconnect
PROPOSED ARCHITECTURE • 2D architecture • 16 CPEs • 4 PE2s • 1 PE3 • Main Memory • Memory Interface • CPE (Configurable Processing Element) • PE1 • NoC router • Network Interface • Current and reference blocks are supplied from main memory through the Memory Interface
[Figure: top-level architecture. The Memory Interface (MI) connects Main Memory to a 4x4 grid of CPEs, distributing 32-bit current-block (c_d) and reference-block (r_d) data, a 16-bit data_load_control signal, and a 5-bit reference_block_id; the four PE2s and the single PE3 sit between the CPE rows.]
[Figure: PE1 datapath. A 4x4 grid of sixteen 8-bit subtractors, each with a current-pixel register (CPR) and a reference-pixel register (RPR), feeds a tree of 10-bit and 12-bit adders and a comparator/register (COMP, REG) pair that produces the 4x4 motion vector; c_d and r_d ports connect to/from the NI and the east and south neighbors.]
NETWORK INTERFACE • Control Unit: sends reference_block_id and data_load_control to the MI • Packetization Unit • Depacketization Unit
NOC ROUTER • Input Controller: receives packets from the NI or an adjacent router (PE1, East, West, North, South ports) • Output Controller: sends packets to the NI or an adjacent router • Header Decoder • XY routing protocol • Extracts the direction of data transfer from the header packet • Updates the number of hops • Ring buffer stores packets (tracked by first and last index pointers)
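XY routing is deterministic and easy to model in software: a packet travels along the X dimension until it reaches the destination column, then along Y. The sketch below is a behavioral model (not the router RTL); the hop count the header decoder maintains is simply the path length minus one.

```python
def xy_route(src, dst):
    """XY routing between router coordinates src and dst.
    Returns the list of routers visited, in order."""
    x, y = src
    tx, ty = dst
    path = [(x, y)]
    while x != tx:                  # route along X first
        x += 1 if tx > x else -1
        path.append((x, y))
    while y != ty:                  # then along Y
        y += 1 if ty > y else -1
        path.append((x, y))
    return path
```

Because X is always exhausted before Y, the route between any two routers is unique, which keeps the header decoder trivial and makes the routing deadlock-free on a mesh.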
Handshake between adjacent routers • Step 1: Router 1 has a message for Router 2 • Step 2: Router 1 sends a 1-bit request signal to Router 2 • Step 3: Router 2 first checks whether it is busy; if not, it checks for available buffer space • Step 4: Router 2 sends a 1-bit ack if space is available • Step 5: Router 1 sends the 32-bit packet
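The five-step req/ack exchange can be modeled behaviorally (class and function names are chosen here, and timing is ignored):

```python
class Router:
    """Receive side of one NoC router (behavioral model)."""
    def __init__(self, buffer_size=8):
        self.buffer = []                 # models the ring buffer
        self.buffer_size = buffer_size
        self.busy = False

    def handle_request(self):
        """Steps 3-4: on a 1-bit request, ack only if the router is
        not busy and buffer space is available."""
        return (not self.busy) and len(self.buffer) < self.buffer_size

    def receive(self, packet):
        """Step 5: store the 32-bit packet in the buffer."""
        self.buffer.append(packet)

def send_packet(packet, dst):
    """Steps 1-5: request, wait for ack, then transfer the packet.
    Returns True on success, False if the receiver cannot accept."""
    if dst.handle_request():             # req -> ack
        dst.receive(packet)
        return True
    return False                         # no ack: sender must retry
```

The point of checking busy/space before the transfer is flow control: a packet is never dropped mid-link, only deferred at the sender.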
PE2 AND PE3 De-muxes Muxes Adders Comparators Registers
FAST SEARCH ALGORITHM • Diamond Search • 9 candidate search points • Numbers represent the order in which the candidate points are processed • Directed edges are labeled with data transmission equations derived from the data dependencies
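Diamond search itself can be sketched in a few lines. The 9-point large diamond (LDSP) and 5-point small diamond (SDSP) patterns below are the standard ones; `cost` stands in for the SAD evaluation at a candidate motion vector, and the function names are chosen here for illustration.

```python
# standard diamond search patterns (offsets from the current center)
LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
        (-1, -1), (1, -1), (-1, 1), (1, 1)]
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]

def best_point(cost, center, pattern):
    """Evaluate `cost` at every pattern offset around `center` and
    return the minimizing point (ties go to the earliest offset)."""
    cands = [(center[0] + dx, center[1] + dy) for dx, dy in pattern]
    return min(cands, key=cost)

def diamond_search(cost, start=(0, 0), max_steps=64):
    """Move the large diamond until its best point is its own center,
    then refine once with the small diamond."""
    center = start
    for _ in range(max_steps):
        nxt = best_point(cost, center, LDSP)
        if nxt == center:
            break
        center = nxt
    return best_point(cost, center, SDSP)
```

On a convex cost surface with its minimum at (3, 1), the search walks there from the origin in a handful of diamond moves instead of scanning the whole window.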
EXAMPLE • [Figure: an example frame, one macro-block, and its SAD computation]
DATA TRANSFER • Example: data transfer between PE1(1,1) and PE1(1,3) • [Figure: individual search points vs. intersecting points]
OTHER FAST SEARCH ALGORITHMS • Hexagon • Spiral • Big Hexagon