
COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION




  1. COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION 03/26/2012

  2. OUTLINE • Introduction • Motivation • Network-on-Chip (NoC) • ASIC-based approaches • Coarse grain architectures • Proposed Architecture • Results

  3. INTRODUCTION • Goal: an application-specific hybrid coarse-grained reconfigurable architecture using a NoC • Purpose: support Variable Block Size Motion Estimation (VBSME) • Is this the first such approach? No — ASIC and other coarse-grained reconfigurable architectures exist • Differences: use of intelligent NoC routers, and support for both full and fast search algorithms

  4. MOTIVATION [Figure: H.264 encoder block diagram — motion estimation is the dominant computational stage]

  5. MOTION ESTIMATION [Figure: block matching — a current 16x16 block of the current frame is compared against candidates inside a search window of the previous frame; the displacement of the best match, scored by the Sum of Absolute Differences (SAD), is the motion vector]
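The SAD-based block matching on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the hardware algorithm: the helper names (`sad`, `motion_vector`) and the full-search loop over a small window are assumptions for clarity.

```python
def sad(cur, ref, dx, dy, bx, by, n=16):
    """Sum of absolute differences for the n x n block at (bx, by) of the
    current frame against the candidate displaced by (dx, dy) in the
    reference (previous) frame."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n) for i in range(n)
    )

def motion_vector(cur, ref, bx, by, search=2, n=16):
    """Full search: evaluate every displacement in [-search, search]^2
    and return the one with the minimum SAD."""
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cost = sad(cur, ref, dx, dy, bx, by, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best[1], best[0]
```

Full search evaluates (2·search + 1)² candidates per block, which is exactly the cost the fast-search algorithms later in the deck try to avoid.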

  6. SYSTEM-ON-CHIP (SOC) • Single-chip systems • Common components: microprocessor, memory, co-processor, other blocks • With increased processing power and data-intensive applications, facilitating communication between the individual blocks has become a challenge

  7. TECHNOLOGY ADVANCEMENT

  8. DELAY VS. PROCESS TECHNOLOGY

  9. NETWORK-ON-CHIP (NOC) • Efficient communication via use of transfer protocols • Need to take into consideration the strict constraints of SoC environment • Types of communication structure • Bus • Point-to-point • Network

  10. COMMUNICATION STRUCTURES

  11. BUS VS. NETWORK

  12. EXAMPLE

  13. EXAMPLE OF NOC

  14. ROUTER ARCHITECTURE

  15. BACKGROUND • ME has been implemented on general-purpose processors, ASICs, FPGAs, and coarse-grained architectures • Many designs support only fixed block size ME (FBSME), or provide VBSME only with redundant hardware • General-purpose processors can exploit parallelism, but are limited by their inherently sequential nature and by data access through registers

  16. CONTINUED… • ASIC: no support for all block sizes of H.264, or support provided at the cost of high area overhead • Coarse-grained: overcomes the drawbacks of LUT-based FPGAs — elements with coarser granularity and fewer configuration bits — but suffers from underutilization of resources

  17. ASIC APPROACHES [Table: prior designs by topology (1D systolic array, 1D systolic array, 2D systolic array, 2D systolic array, 2D systolic array) and SAD accumulation (partial sum vs. parallel sum)] • Partial-sum designs: a large number of registers to store the partial SADs, area overhead, and high latency; a mesh-based variant additionally lacks VBSME support • Reference pixels are broadcast, and the SAD computation for each 4x4 block is pipelined: each processing element computes a pixel difference, accumulates it into the incoming partial SAD, and sends the result to the next processing element • Parallel-sum designs: all pixel differences of a 4x4 block are computed in parallel, reference pixels are reused, and the direction of data transfer depends on the search pattern, again at the cost of a large number of registers
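The accumulate-and-forward behavior of the systolic processing elements described above can be modeled in a few lines. This is an illustrative sketch of the dataflow, not any specific ASIC design: each loop iteration plays the role of one PE that adds its pixel difference to the partial SAD received from its upstream neighbor.

```python
def systolic_row_sad(cur_row, ref_row):
    """Model of a 1D systolic chain: each PE (loop step) computes one
    pixel difference |c - r|, accumulates it into the partial SAD
    arriving from the previous PE, and forwards the new partial sum."""
    partial = 0
    for c, r in zip(cur_row, ref_row):  # one PE per pixel column
        partial += abs(c - r)           # accumulate and forward
    return partial
```

In hardware the forwarding registers give one result per cycle once the pipeline fills; the software loop only captures the arithmetic, not the latency.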

  18. OU’S APPROACH • 16 SAD modules to process 16 4x4 motion vectors • VBSME processor • Chain of adders and comparators to compute larger SADs • PE array • Basic computational element of SAD module • Cascade of 4 1D arrays • 1D array • 1D systolic array of 4 PEs • Each PE computes a 1 pixel SAD

  19. [Figure: SAD modules and PE array — 16 SAD modules (Module 0–15), each fed 32-bit current_block_data_i and search_block_data_i from block strips A/B; each module's PE Array is a cascade of 1D arrays (1D Array 0 … 1D Array 3); per-module outputs SAD_i and MV_i are selected through a MUX; control signals include strip_sel, read_addr_A, read_addr_B, and write_addr]

  20. [Figure: 1D array — four PEs separated by delay registers (D) on two 32-bit input paths, feeding an accumulator (ACCM)]

  21. PUTTING IT TOGETHER • Per clock cycle: the columns of the current 4x4 sub-block are scheduled using a delay line, and two sets of search-block columns are broadcast • 4 block-matching operations execute concurrently per SAD module • A chain of adders and comparators reduces the 4x4 SADs to 4x4 motion vectors and merges them upward: 4x4 SADs -> 4x8 SADs -> … -> 16x16 SADs • Drawbacks: no reuse of search data between modules, and resource wastage
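The merging chain above can be sketched as follows. The 4x4-grid layout of sub-block SADs and the pairing order are assumptions for illustration; the actual adder tree pairs blocks according to the H.264 partition geometry.

```python
def merge_sads(sad4x4):
    """Merge sixteen 4x4 SADs (a 4x4 grid of sub-blocks, sad4x4[row][col])
    into the SADs of larger partitions, mirroring the adder chain:
    4x4 -> 4x8 -> 8x8 -> 16x16."""
    s = sad4x4
    # 4x8: vertically adjacent pairs of 4x4 sub-blocks
    sad4x8 = [[s[r][c] + s[r + 1][c] for c in range(4)] for r in (0, 2)]
    # 8x8: horizontally adjacent pairs of 4x8 blocks
    sad8x8 = [[sad4x8[r][c] + sad4x8[r][c + 1] for c in (0, 2)]
              for r in range(2)]
    # 16x16: sum of all four 8x8 blocks
    sad16x16 = sum(sad8x8[r][c] for r in range(2) for c in (0, 1))
    return sad4x8, sad8x8, sad16x16
```

Per candidate displacement the hardware also runs each level through a comparator to keep the minimum SAD seen so far; that comparison step is omitted here for brevity.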

  22. ALTERNATIVE SOLUTION: COARSE GRAIN ARCHITECTURES • Resource utilization • Generic interconnect
Performance in clock cycles [frame size: M x 0.8M]:
ChESS: (M x 0.8M)/256 x 17 x 17
MATRIX: (M x 0.8M)/256 x 17 x 17
RaPiD: 272 + 32M + 14.45M^2

  23. PROPOSED ARCHITECTURE • 2D architecture • 16 CPEs • 4 PE2s • 1 PE3 • Main Memory • Memory Interface • CPE (Configurable Processing Element) • PE1 • NoC router • Network Interface • Current and reference block from main memory

  24. [Figure: proposed architecture — Main Memory feeds a Memory Interface (MI) that distributes 32-bit current data (c_d) and reference data (r_d), a 16-bit data_load_control, and a 5-bit reference_block_id; a 4x4 grid of CPEs, four PE2s at the edges, and one PE3 at the center, connected by 12- and 14-bit SAD links]

  25. [Figure: CPE datapath — sixteen 8-bit subtractors (1–16), each with a current-pixel register (CPR) and a reference-pixel register (RPR), feeding a tree of 10-bit and 12-bit adders and a comparator/register (COMP REG) that produces the 4x4 motion vector; ports to/from the NI, East, and South]

  26. NETWORK INTERFACE [Figure: Network Interface — a control unit drives reference_block_id and data_load_control to the MI; packetization and depacketization units handle outgoing and incoming packets]

  27. NOC ROUTER [Figure: router with an input controller (receives packets from the NI or an adjacent router) and an output controller (sends packets to the NI or an adjacent router), request/ack signals, ports for PE1, East, West, North, and South, a header decoder, and a ring buffer with first/last indices that stores packets] • XY routing protocol • The header decoder extracts the direction of data transfer from the header packet and updates the number of hops
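Dimension-ordered XY routing, as used by these routers, can be sketched as follows. This is a generic model of the protocol, not the router's RTL; the coordinate convention (x grows eastward, y grows southward) is an assumption.

```python
def xy_route(src, dst):
    """XY routing: move along X (East/West) until the destination column
    matches, then along Y (North/South). Returns the hop list as
    (direction, next_position) pairs."""
    x, y = src
    hops = []
    while x != dst[0]:                        # X dimension first
        step = 1 if dst[0] > x else -1
        x += step
        hops.append(("E" if step > 0 else "W", (x, y)))
    while y != dst[1]:                        # then Y dimension
        step = 1 if dst[1] > y else -1
        y += step
        hops.append(("S" if step > 0 else "N", (x, y)))
    return hops
```

Because every packet resolves X before Y, XY routing is deterministic and deadlock-free on a mesh, which keeps the header decoder simple.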

  28. [Figure: handshake between Router 1 and Router 2 — input/output controllers exchanging a 1-bit req, a 1-bit ack, and a 32-bit packet] Step 1: Router 1 wants to send a message to Router 2. Step 2: Router 1 sends a 1-bit request signal to Router 2. Step 3: Router 2 first checks whether it is busy; if not, it checks for available buffer space. Step 4: Router 2 sends an ack if space is available. Step 5: Router 1 sends the packet.
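The five-step req/ack exchange above can be modeled with two small objects. All names here (`Router`, `handle_request`, `send`) are illustrative, not taken from the actual design.

```python
class Router:
    """Receiving side: grants a request only when it is not busy and
    its ring buffer still has space (steps 3-4)."""
    def __init__(self, buffer_size=4):
        self.buffer = []
        self.buffer_size = buffer_size
        self.busy = False

    def handle_request(self):
        return (not self.busy) and len(self.buffer) < self.buffer_size

    def receive(self, packet):
        self.buffer.append(packet)

def send(packets, dst):
    """Sending side: for each packet, raise req (steps 1-2); if the
    receiver acks, transfer the 32-bit packet (step 5)."""
    sent = 0
    for pkt in packets:
        if dst.handle_request():  # req -> ack?
            dst.receive(pkt)      # packet transfer
            sent += 1
    return sent
```

The handshake gives backpressure for free: when the receiver's buffer fills, the missing ack simply stalls the sender.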

  29. PE2 AND PE3 • De-muxes • Muxes • Adders • Comparators • Registers

  30. FAST SEARCH ALGORITHM • Diamond Search • 9 candidate search points • Numbers represent order of processing the reference frames • Directed edges labeled with data transmission equations derived based on data dependencies
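The diamond search on this slide can be sketched as follows. The cost callback `sad_at(dx, dy)` is an assumed interface standing in for the SAD hardware; the large-diamond-then-small-diamond flow is the standard form of the algorithm.

```python
# Large diamond pattern (9 candidate points) and small diamond pattern.
LDP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
       (-1, -1), (1, -1), (-1, 1), (1, 1)]
SDP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]

def diamond_search(sad_at, max_iters=32):
    """Repeat the large diamond centered on the best point so far;
    when the minimum stays at the center, refine with the small diamond."""
    cx, cy = 0, 0
    for _ in range(max_iters):
        best = min(LDP, key=lambda p: sad_at(cx + p[0], cy + p[1]))
        if best == (0, 0):
            break
        cx, cy = cx + best[0], cy + best[1]
    best = min(SDP, key=lambda p: sad_at(cx + p[0], cy + p[1]))
    return (cx + best[0], cy + best[1])
```

Compared with full search, only a handful of candidates is evaluated per step, which is why the data-transfer schedule between PEs (next slides) matters: successive diamonds overlap, so already-loaded reference pixels can be reused.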

  31. EXAMPLE [Figure: a frame divided into macro-blocks, with the SAD computed per macro-block]

  32. CONTINUED…

  33. DATA TRANSFER [Figure: data transfer between PE1(1,1) and PE1(1,3) — individual points vs. intersecting points]

  34. DATA LOAD SCHEDULE

  35. OTHER FAST SEARCH ALGORITHMS Hexagon Spiral Big Hexagon

  36. FULL SEARCH

  37. CONTINUED…

  38. RESULTS

  39. CONTINUED…
