Overview

ECE 697F Reconfigurable Computing • Lecture 17: Dynamic Reconfiguration I • Acknowledgement: Andre DeHon

Overview • Motivation for Dynamic Reconfiguration • Limitations of Reconfigurable Approaches • Hardware support for reconfiguration • Example application • ASCII to Hex conversion (DeHon)

DPGA • DPGA – Dynamically Programmable Gate Array • Configuration selects operation of computation unit • Context identifier changes over time to allow change in functionality • [Figure labels: Context Identifier; Address Inputs (Inst. Store); Computation Unit (LUT); in; out. Programming may differ for each element.]

Computations that Benefit from Reconfiguration • Low throughput tasks • Data dependent operations • Effective if not all resources are active simultaneously. Possible to time-multiplex both logic and routing resources. • [Figure: non-pipelined example with labels A, F0, F1, F2, B]

Resource Reuse • Example circuit: Part of ASCII -> Hex design • Computation can be broken up by considering this a data flow graph.

Resource Reuse • Resources must be directed to do different things at different times through instructions. • Different local configurations can be thought of as instructions. • Minimizing the number and size of instructions is key to achieving an efficient design. • What are the implications for the hardware?

Actxt80Kl2 dense encoding Abase800Kl2 Previous Study (DeHon) Interconnect Mux Logic Reuse • * Each context no overly costly compared to base cost of wire, switches, IO circuitry Question: How does this effect scale?

Exploring the Tradeoffs • Assume ideal packing: Nactive = Ntotal/L • Reminder: c*Actxt = Abase • Difficult to exactly balance resources/demands • Needs for contexts may vary across applications • Robust point where critical path length equals # contexts.

Implementation Choices • Implementation #1: NA = 3 • Implementation #2: NA = 4 • Both require the same amount of execution time • Implementation #1 is more resource efficient.

Scheduling Limitations • NA = size of largest stage in terms of active LUTs • Precedence -> a LUT can only be evaluated after its predecessors have been evaluated. • Need to assign design LUTs to device LUTs at specific contexts. • Consider formulation for scheduling. What are the choices?

Scheduling • ASAP (as soon as possible) • Propagate depth forward from primary inputs • Depth = 1 + max input depth • ALAP (as late as possible) • Propagate distance from outputs backwards towards inputs • Level = 1 + max output consumption level • Slack • Slack = L + 1 – (depth + level), where L is the critical path length • PI depth = 0, PO level = 0
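The depth, level, and slack definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `slack_analysis`, the `fanin` dictionary format, and the four-LUT example netlist are all hypothetical. It assumes any name that is not a key of `fanin` is a primary input, and that a LUT with no LUT consumers drives a primary output.

```python
from collections import defaultdict

def slack_analysis(fanin):
    """fanin maps each LUT name to the names of its inputs.
    Names that are not keys of fanin are treated as primary inputs (depth 0).
    A LUT with no LUT consumers is assumed to drive a primary output (level 0)."""
    fanout = defaultdict(list)
    for node, inputs in fanin.items():
        for src in inputs:
            if src in fanin:
                fanout[src].append(node)

    depth, level = {}, {}

    def get_depth(node):  # ASAP: depth = 1 + max input depth
        if node not in fanin:
            return 0
        if node not in depth:
            depth[node] = 1 + max((get_depth(s) for s in fanin[node]), default=0)
        return depth[node]

    def get_level(node):  # ALAP: level = 1 + max consumer level
        if node not in level:
            level[node] = 1 + max((get_level(c) for c in fanout[node]), default=0)
        return level[node]

    for node in fanin:
        get_depth(node)
        get_level(node)

    crit = max(depth.values())  # L, the critical path length in LUTs
    slack = {n: crit + 1 - (depth[n] + level[n]) for n in fanin}
    return depth, level, slack

# Hypothetical netlist: a -> b -> c is the critical path; d is a lone LUT.
example = {"a": ["x", "y"], "b": ["a", "z"], "c": ["b", "w"], "d": ["x", "z"]}
depth, level, slack = slack_analysis(example)
print(slack)
```

LUTs on the critical path get zero slack, while `d` can be placed in any of the three time slots.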

Slack Example • Note connection from C1 to O1 • Critical path will have 0 slack • Admittedly small example

Sequentialization • Adding time slots allows for potential increase in hardware efficiency • This comes at the cost of increased latency • Adding slack allows better balance. • L = 4 -> NA = 2 (4 or 3 contexts)
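One way to see how slack helps balance is a greedy list scheduler that assigns LUTs to contexts under the precedence rule. The sketch below is an assumed heuristic for illustration, not the scheduling formulation from the lecture: `schedule_contexts` and the six-LUT netlist are hypothetical, each context is assumed to evaluate one level of logic, and the result is the smallest capacity for which this greedy heuristic succeeds, which is not guaranteed to be optimal.

```python
def schedule_contexts(fanin, num_contexts):
    """Greedy list scheduling: return (NA, {lut: context}).
    A LUT may run only in a context strictly after all of its LUT predecessors."""
    luts = list(fanin)
    fanout = {n: [] for n in luts}
    for n, inputs in fanin.items():
        for s in inputs:
            if s in fanout:
                fanout[s].append(n)

    level = {}

    def get_level(n):  # longest LUT path from n to an output, inclusive
        if n not in level:
            level[n] = 1 + max((get_level(c) for c in fanout[n]), default=0)
        return level[n]

    for n in luts:
        get_level(n)

    for capacity in range(1, len(luts) + 1):
        assigned = {}
        for ctx in range(1, num_contexts + 1):
            # Ready list is built before assigning, so all predecessors
            # found in `assigned` sit in strictly earlier contexts.
            ready = [n for n in luts if n not in assigned
                     and all(s not in fanin or s in assigned for s in fanin[n])]
            ready.sort(key=lambda n: -level[n])  # most urgent first
            for n in ready[:capacity]:
                assigned[n] = ctx
        if len(assigned) == len(luts):
            return capacity, assigned
    raise ValueError("fewer contexts than the critical path length")

# Hypothetical netlist: chain a -> b -> c plus three independent LUTs d, e, f.
netlist = {"a": ["x"], "b": ["a"], "c": ["b"], "d": ["x"], "e": ["y"], "f": ["z"]}
na3, plan3 = schedule_contexts(netlist, 3)
na6, plan6 = schedule_contexts(netlist, 6)
print(na3, na6)
```

In this example an ASAP placement would put `a`, `d`, `e`, and `f` all in the first context (4 active LUTs), whereas using the slack of `d`, `e`, and `f` brings NA down to 2 with three contexts, and adding time slots brings it down further at the cost of latency.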

Full ASCII -> Hex Circuit • Logically three levels of dependence • Single context: 21 LUTs @ 880Kλ² = 18.5Mλ²

Time-multiplexed version • Three contexts: 12 LUTs @ 1040Kλ² = 12.5Mλ² • Pipelining needed for dependent paths.

Context Optimization • With enough contexts only one LUT is needed. This leads to poor latency. • Increased LUT area due to additional stored configuration information • Eventually the additional interconnect savings are taken up by LUT configuration overhead • Ideal = perfect scheduling spread + no retime overhead
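The area figures quoted for the ASCII -> Hex circuit are consistent with a simple per-LUT model: base area plus one context-memory increment per stored context. The Python sketch below assumes that model (it is inferred from the 880K and 1040K figures, not stated explicitly in the slides), uses the 800K and 80K lambda-squared values from the previous-study slide, and uses hypothetical function names. The `ideal_area` sweep assumes perfect packing and ignores precedence and retiming registers, so it is only a lower bound.

```python
import math

A_BASE = 800_000  # base area per LUT incl. interconnect, in lambda^2 (from the slides)
A_CTXT = 80_000   # added area per stored context, in lambda^2 (from the slides)

def lut_area(contexts):
    """Assumed model: area of one multi-context LUT."""
    return A_BASE + contexts * A_CTXT

def total_area(active_luts, contexts):
    return active_luts * lut_area(contexts)

def ideal_area(total_luts, contexts):
    """Ideal packing: N_active = ceil(N_total / contexts), no retiming overhead."""
    return total_area(math.ceil(total_luts / contexts), contexts)

single = total_area(21, 1)  # single-context ASCII -> Hex
multi = total_area(12, 3)   # three-context version

print(single, multi)
for c in (1, 3, 7, 21, 42):
    print(c, ideal_area(21, c))
```

Under this model the real three-context design (12 LUTs) sits well above the ideal bound for three contexts (7 LUTs), and once a single LUT suffices, every further context only adds configuration storage.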

General Throughput Mapping • Useful if only limited throughput is desired. • Target produces new result every t cycles (e.g. a t LUT path) • Spatially pipeline every t stages. • Cycle = t • Retime to minimize register requirement • Multi-context evaluation within a spatial stage. Retime to minimize resource usage • Map for depth, i, and contexts, C

Dharma Architecture (UC Berkeley) • Allows for levelized circuit to be executed • Design parameters - #DLM • K -> number of DLM inputs • L -> number of levels

Example Dharma Circuit

Levelization of Circuit • Levelization performed on the basis of the dependency graph. • Functions implemented as 3-input LUTs

Detailed View of Dharma

Example: DPGA Prototype

Example: DPGA Area

Summary • Multiple contexts can be used to combat wire inactivity and logic latency • Too many contexts lead to inefficiencies due to retiming registers and extra LUTs • Architectures such as DPGA and Dharma address these issues through contexts • Run-time system needed to handle dynamic reconfiguration.
