COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION 03/26/2012
OUTLINE • Introduction • Motivation • Network-on-Chip (NoC) • ASIC based approaches • Coarse grain architectures • Proposed Architecture • Results
INTRODUCTION • Goal • An application-specific hybrid coarse-grained reconfigurable architecture built on a NoC • Purpose • Support Variable Block Size Motion Estimation (VBSME) • Is this the first approach? • No: ASICs and other coarse-grained reconfigurable architectures exist • Differences • Use of intelligent NoC routers • Support for both full and fast search algorithms
MOTIVATION • [Figure: profile of H.264 encoder computation, highlighting motion estimation]
MOTION ESTIMATION • Each 16x16 block of the current frame is matched against candidate blocks inside a search window of the previous frame • The candidate minimizing the Sum of Absolute Differences (SAD) determines the motion vector
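The SAD criterion and full (exhaustive) search can be sketched in plain Python. This is only a software model of the matching arithmetic, not the hardware discussed later; `full_search` and its parameters are names chosen here for illustration.

```python
def sad(cur, ref):
    """Sum of Absolute Differences between two equally sized blocks
    (each block is a list of rows of pixel values)."""
    return sum(abs(c - r)
               for crow, rrow in zip(cur, ref)
               for c, r in zip(crow, rrow))

def full_search(current, previous, bx, by, block=16, radius=8):
    """Exhaustive search: evaluate every candidate in a
    (2*radius+1)^2 window of the previous frame and return the
    motion vector (dx, dy) with minimum SAD, plus that SAD."""
    h, w = len(previous), len(previous[0])
    cur = [row[bx:bx + block] for row in current[by:by + block]]
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            if 0 <= x and 0 <= y and x + block <= w and y + block <= h:
                cand = [row[x:x + block] for row in previous[y:y + block]]
                s = sad(cur, cand)
                if best is None or s < best[0]:
                    best = (s, (dx, dy))
    return best[1], best[0]
```

A block copied from the previous frame at an offset of (2, 1) is found exactly, with SAD 0.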
SYSTEM-ON-CHIP (SOC) • Single-chip systems • Common components • Microprocessor • Memory • Co-processor • Other blocks • With growing processing power and data-intensive applications, facilitating communication between the individual blocks has become a challenge
NETWORK-ON-CHIP (NOC) • Efficient communication via transfer protocols • Must respect the strict constraints of the SoC environment • Types of communication structure • Bus • Point-to-point • Network
BACKGROUND • Motion estimation has been mapped to general-purpose processors, ASICs, FPGAs, and coarse-grained architectures • Most support only fixed block size ME (FBSME) • VBSME supported only through redundant hardware • General-purpose processors • Can exploit parallelism • Limited by their inherently sequential nature and data access through registers
CONTINUED… • ASIC • No support for all block sizes of H.264 • Where support exists, it comes at the cost of high area overhead • Coarse-grained • Overcomes the drawbacks of LUT-based FPGAs • Elements of coarser granularity • Fewer configuration bits • But under-utilization of resources
ASIC Approaches • Topology and SAD accumulation style • 1D systolic array (partial sum) • Reference pixels broadcast; SAD computation for each 4x4 block pipelined • Each processing element computes a pixel difference, accumulates it onto the incoming partial SAD, and forwards the result to the next processing element • Large number of registers to store partial SADs -> area overhead and high latency • Mesh-based architecture (partial sum) • Also stores partial SADs -> area overhead, high latency, no VBSME • 2D systolic array (parallel sum) • All pixel differences of a 4x4 block computed in parallel • Reference pixels are reused • Direction of data transfer depends on the search pattern
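The two accumulation styles above can be contrasted in a small software sketch (function names are chosen here, not taken from the paper): partial sum forwards a running SAD through a chain of PEs, while parallel sum computes all differences at once and reduces them with an adder tree. Both produce the same value.

```python
def partial_sum_sad(cur_row, ref_row):
    """Model a chain of PEs: each PE computes one absolute pixel
    difference, adds it to the partial SAD received from the previous
    PE, and forwards the new partial SAD to the next PE."""
    partial = 0
    for c, r in zip(cur_row, ref_row):
        partial += abs(c - r)       # PE adds its difference, forwards the sum
    return partial

def parallel_sum_sad(cur_row, ref_row):
    """Model the parallel-sum style: every difference is available at
    once and the total is formed by a pairwise adder-tree reduction."""
    diffs = [abs(c - r) for c, r in zip(cur_row, ref_row)]
    while len(diffs) > 1:
        pairs = [diffs[i] + diffs[i + 1] for i in range(0, len(diffs) - 1, 2)]
        if len(diffs) % 2:
            pairs.append(diffs[-1])  # odd element passes through this level
        diffs = pairs
    return diffs[0]
```

The chained version takes as many sequential steps as there are pixels, while the tree reduces in logarithmic depth, which is the latency difference the table alludes to.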
OU’S APPROACH • 16 SAD modules process the 16 4x4 motion vectors • VBSME processor • Chain of adders and comparators computes the larger SADs • PE array • Basic computational element of a SAD module • Cascade of 4 1D arrays • 1D array • 1D systolic array of 4 PEs • Each PE computes a one-pixel absolute difference
[Figure: SAD modules. Sixteen modules, each fed 32-bit current_block_data_i and search_block_data_i; a PE array of four cascaded 1D arrays computes SAD_i, and a MUX driven by strip_sel selects among SAD_0..SAD_15 and MV_0..MV_15. Block strips A and B are addressed via read_addr_A, read_addr_B, and write_addr.]
[Figure: 1D array. Four PEs separated by delay registers (D) on 32-bit data paths, feeding an accumulator (ACCM).]
PUTTING IT TOGETHER • Each clock cycle • Columns of the current 4x4 sub-block are scheduled through a delay line • Two sets of search-block columns are broadcast • 4 block-matching operations execute concurrently per SAD module • 4x4 SADs -> 4x4 motion vectors • A chain of adders and comparators merges 4x4 SADs -> 4x8 SADs -> … -> 16x16 SADs • Drawbacks • No reuse of search data between modules • Resource wastage
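The merging performed by the chain of adders can be modeled in software. The sketch below (function name chosen here) shows only the 8x8 and 16x16 levels; the 4x8, 8x4, 16x8, and 8x16 partitions merge analogously, since every larger SAD is just the sum of its constituent 4x4 SADs at the same candidate position.

```python
def merge_sads(sad4x4):
    """sad4x4: a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16
    macroblock at a single candidate position. Returns the four 8x8
    SADs (as a 2x2 grid) and the 16x16 SAD."""
    sad8x8 = [[sad4x4[2 * i][2 * j] + sad4x4[2 * i][2 * j + 1] +
               sad4x4[2 * i + 1][2 * j] + sad4x4[2 * i + 1][2 * j + 1]
               for j in range(2)] for i in range(2)]
    sad16x16 = sum(sum(row) for row in sad8x8)
    return sad8x8, sad16x16
```

Because every level reuses the sums of the level below, all block sizes come out of one pass over the sixteen 4x4 SADs, which is why a single adder/comparator chain suffices.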
ALTERNATIVE SOLUTION: COARSE GRAIN ARCHITECTURES • Performance in clock cycles for a frame of size M x 0.8M • ChESS: (M x 0.8M)/256 x 17 x 17 • MATRIX: (M x 0.8M)/256 x 17 x 17 • RaPiD: 272 + 32M + 14.45M^2 • Drawbacks • Resource utilization • Generic interconnect
PROPOSED ARCHITECTURE • 2D architecture • 16 CPEs • 4 PE2s • 1 PE3 • Main Memory • Memory Interface • CPE (Configurable Processing Element) • PE1 • NoC router • Network Interface • Current and reference blocks are supplied from main memory through the Memory Interface
[Figure: top-level architecture. The Memory Interface (MI) connects Main Memory to a 4x4 grid of CPEs, distributing 32-bit current-block (c_d) and reference-block (r_d) data, a 16-bit data_load_control signal, and a 5-bit reference_block_id; the four PE2s and the single PE3 sit between the CPE rows.]
[Figure: PE1 datapath. A 4x4 grid of sixteen 8-bit subtractors, each with a current-pixel register (CPR) and a reference-pixel register (RPR), feeds a tree of 10-bit and 12-bit adders and a comparator/register (COMP, REG) pair that produces the 4x4 motion vector; c_d and r_d ports connect to/from the NI and the east and south neighbors.]
NETWORK INTERFACE • Control Unit: sends reference_block_id and data_load_control to the MI • Packetization Unit • Depacketization Unit
NOC ROUTER • Input Controller: receives packets from the NI or an adjacent router (PE1, East, West, North, South ports) • Output Controller: sends packets to the NI or an adjacent router • Header Decoder • XY routing protocol • Extracts the direction of data transfer from the header packet • Updates the number of hops • Ring buffer stores packets (tracked by first and last index pointers)
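XY routing is deterministic and easy to model in software: a packet travels along the X dimension until it reaches the destination column, then along Y. The sketch below is a behavioral model (not the router RTL); the hop count the header decoder maintains is simply the path length minus one.

```python
def xy_route(src, dst):
    """XY routing between router coordinates src and dst.
    Returns the list of routers visited, in order."""
    x, y = src
    tx, ty = dst
    path = [(x, y)]
    while x != tx:                  # route along X first
        x += 1 if tx > x else -1
        path.append((x, y))
    while y != ty:                  # then along Y
        y += 1 if ty > y else -1
        path.append((x, y))
    return path
```

Because X is always exhausted before Y, the route between any two routers is unique, which keeps the header decoder trivial and makes the routing deadlock-free on a mesh.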
Handshake between adjacent routers • Step 1: Router 1 has a message for Router 2 • Step 2: Router 1 sends a 1-bit request signal to Router 2 • Step 3: Router 2 first checks whether it is busy; if not, it checks for available buffer space • Step 4: Router 2 sends a 1-bit ack if space is available • Step 5: Router 1 sends the 32-bit packet
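The five-step req/ack exchange can be modeled behaviorally (class and function names are chosen here, and timing is ignored):

```python
class Router:
    """Receive side of one NoC router (behavioral model)."""
    def __init__(self, buffer_size=8):
        self.buffer = []                 # models the ring buffer
        self.buffer_size = buffer_size
        self.busy = False

    def handle_request(self):
        """Steps 3-4: on a 1-bit request, ack only if the router is
        not busy and buffer space is available."""
        return (not self.busy) and len(self.buffer) < self.buffer_size

    def receive(self, packet):
        """Step 5: store the 32-bit packet in the buffer."""
        self.buffer.append(packet)

def send_packet(packet, dst):
    """Steps 1-5: request, wait for ack, then transfer the packet.
    Returns True on success, False if the receiver cannot accept."""
    if dst.handle_request():             # req -> ack
        dst.receive(packet)
        return True
    return False                         # no ack: sender must retry
```

The point of checking busy/space before the transfer is flow control: a packet is never dropped mid-link, only deferred at the sender.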
PE2 AND PE3 De-muxes Muxes Adders Comparators Registers
FAST SEARCH ALGORITHM • Diamond Search • 9 candidate search points • Numbers represent the order in which the candidate points are processed • Directed edges are labeled with data transmission equations derived from the data dependencies
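Diamond search itself can be sketched in a few lines. The 9-point large diamond (LDSP) and 5-point small diamond (SDSP) patterns below are the standard ones; `cost` stands in for the SAD evaluation at a candidate motion vector, and the function names are chosen here for illustration.

```python
# standard diamond search patterns (offsets from the current center)
LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
        (-1, -1), (1, -1), (-1, 1), (1, 1)]
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]

def best_point(cost, center, pattern):
    """Evaluate `cost` at every pattern offset around `center` and
    return the minimizing point (ties go to the earliest offset)."""
    cands = [(center[0] + dx, center[1] + dy) for dx, dy in pattern]
    return min(cands, key=cost)

def diamond_search(cost, start=(0, 0), max_steps=64):
    """Move the large diamond until its best point is its own center,
    then refine once with the small diamond."""
    center = start
    for _ in range(max_steps):
        nxt = best_point(cost, center, LDSP)
        if nxt == center:
            break
        center = nxt
    return best_point(cost, center, SDSP)
```

On a convex cost surface with its minimum at (3, 1), the search walks there from the origin in a handful of diamond moves instead of scanning the whole window.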
EXAMPLE • [Figure: an example frame, one macro-block, and its SAD computation]
DATA TRANSFER • Example: data transfer between PE1(1,1) and PE1(1,3) • [Figure: individual search points vs. intersecting points]
OTHER FAST SEARCH ALGORITHMS • Hexagon • Spiral • Big Hexagon