ECE 697F Reconfigurable Computing • Lecture 17: Dynamic Reconfiguration I • Acknowledgement: Andre DeHon
Overview • Motivation for Dynamic Reconfiguration • Limitations of Reconfigurable Approaches • Hardware support for reconfiguration • Example application • ASCII to Hex conversion (DeHon)
DPGA • DPGA – Dynamically Programmable Gate Array • [Figure: a context identifier addresses the instruction store of each computation unit (LUT); programming may differ for each element] • Configuration selects operation of computation unit • Context identifier changes over time to allow change in functionality
Computations that Benefit from Reconfiguration • Low throughput tasks • Data dependent operations • Effective if not all resources are active simultaneously; possible to time-multiplex both logic and routing resources • [Figure: non-pipelined example with functions F0, F1, F2 between inputs A and B]
Resource Reuse • Example circuit: part of the ASCII -> Hex design • Computation can be broken up by treating the circuit as a dataflow graph
Resource Reuse • Resources must be directed to do different things at different times through instructions • Different local configurations can be thought of as instructions • Minimizing the number and size of instructions is key to achieving an efficient design • What are the implications for the hardware?
Previous Study (DeHon): Interconnect Mux Logic Reuse • Actxt ≈ 80 Kλ² (dense encoding); Abase ≈ 800 Kλ² • Each context is not overly costly compared to the base cost of wire, switches, and I/O circuitry • Question: how does this effect scale?
Exploring the Tradeoffs • Assume ideal packing: Nactive = Ntotal / L • Reminder: c · Actxt = Abase • Difficult to exactly balance resources and demands • Needs for contexts may vary across applications • Robust point where critical path length equals the number of contexts
Implementation Choices • [Figure: Implementation #1 with NA = 3 vs. Implementation #2 with NA = 4] • Both require the same amount of execution time • Implementation #1 is more resource efficient
Scheduling Limitations • NA = size of largest stage in terms of active LUTs • Precedence: a LUT can only be evaluated after its predecessors have been evaluated • Need to assign design LUTs to device LUTs at specific contexts • Consider a formulation for scheduling. What are the choices?
Scheduling • ASAP (as soon as possible) • Propagate depth forward from primary inputs • Depth = 1 + max input length • ALAP (as late as possible) • Propagate distance from outputs backwards towards inputs • Level = 1 + max output consumption level • Slack • Slack = L + 1 – (depth + level) • PI depth = 0, PO level = 0
Slack Example • Note the connection from C1 to O1 • The critical path will have 0 slack • Admittedly a small example
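The ASAP/ALAP/slack computation from the previous slides can be sketched directly. The netlist below is a made-up four-LUT example (not the slide's circuit), but the conventions match: PI depth = 0, PO level = 0, slack = L + 1 − (depth + level), with L the critical path length in LUTs.

```python
# ASAP depth, ALAP level, and slack for a LUT network (toy example).
# preds[n]/succs[n] list only LUT neighbors; primary inputs and
# outputs are implicit, so a PI-fed LUT gets depth 1 and a
# PO-feeding LUT gets level 1.

def schedule(preds, succs, order):
    """Return (depth, level, slack) dicts; `order` is topological."""
    depth = {}
    for n in order:                      # ASAP: forward from inputs
        depth[n] = 1 + max((depth[p] for p in preds[n]), default=0)
    level = {}
    for n in reversed(order):            # ALAP: backward from outputs
        level[n] = 1 + max((level[s] for s in succs[n]), default=0)
    L = max(depth.values())              # critical path length
    slack = {n: L + 1 - (depth[n] + level[n]) for n in order}
    return depth, level, slack

# Chain a -> b -> c on the critical path; d feeds c off the path.
preds = {"a": [], "d": [], "b": ["a"], "c": ["b", "d"]}
succs = {"a": ["b"], "b": ["c"], "d": ["c"], "c": []}
depth, level, slack = schedule(preds, succs, ["a", "d", "b", "c"])
print(depth)   # {'a': 1, 'd': 1, 'b': 2, 'c': 3}
print(slack)   # d is off the critical path, so only it has slack
```

A node with nonzero slack (here d) can be placed in any of several contexts, which is exactly the freedom the scheduler exploits to balance stage sizes.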
Sequentialization • Adding time slots allows for a potential increase in hardware efficiency • This comes at the cost of increased latency • Adding slack allows better balance • L = 4, NA = 2 (4 or 3 contexts)
Full ASCII -> Hex Circuit • Logically three levels of dependence • Single context: 21 LUTs @ 880 Kλ² = 18.5 Mλ²
Time-multiplexed version • Three contexts: 12 LUTs @ 1040 Kλ² = 12.5 Mλ² • Pipelining needed for dependent paths
Context Optimization • With enough contexts only one LUT is needed, but this leads to poor latency • Increased LUT area due to additional stored configuration information • Eventually the additional interconnect savings are taken up by LUT configuration overhead • Ideal = perfect scheduling spread + no retiming overhead
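The overhead argument can be made concrete with the rough cost figures from the earlier DeHon study (Abase ≈ 800 Kλ², Actxt ≈ 80 Kλ² per stored context). The sketch below assumes ideal packing, which real mappings do not achieve (the slide's 3-context ASCII -> Hex mapping needs 12 LUTs, not the ideal 7, because precedence limits packing):

```python
# Simplified area model for a C-context mapping: ideal packing means
# only ceil(N/C) physical LUTs, but each LUT stores C contexts.
# Constants are the approximate figures from the earlier cost study.
A_BASE = 800   # base LUT + interconnect area, in K-lambda^2
A_CTXT = 80    # incremental area per stored context, in K-lambda^2

def mapped_area(n_luts, contexts):
    n_active = -(-n_luts // contexts)    # ceil(n_luts / contexts)
    return n_active * (A_BASE + contexts * A_CTXT)

# Sweep context counts for the 21-LUT ASCII -> Hex circuit:
for c in (1, 3, 7, 21):
    print(c, mapped_area(21, c))
# c=1 reproduces the single-context figure: 21 * 880 = 18480 K-lambda^2.
# Savings shrink as c grows: context storage overhead eats the gain.
```

The single-context point matches the 18.5 Mλ² on the earlier slide; the flattening of the curve at large C is the "configuration overhead" effect described above.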
General Throughput Mapping • Useful if only limited throughput is desired • Target produces a new result every t cycles (e.g. a t-LUT path) • Spatially pipeline every t stages; cycle = t • Retime to minimize register requirement • Multi-context evaluation within a spatial stage; retime to minimize resource usage • Map for depth i and contexts C
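The bookkeeping behind this recipe is small enough to sketch: for a circuit of given logic depth and a target of one result every t cycles, pipeline spatially every t LUT levels and evaluate each spatial stage over t contexts. This only counts stages and contexts; it is an illustration, not the full mapping flow.

```python
import math

def throughput_map(depth, t):
    """For logic depth `depth` and one result every t cycles,
    return (spatial pipeline stages, contexts per stage)."""
    return math.ceil(depth / t), t

# Depth-3 ASCII -> Hex circuit, one result every 3 cycles:
print(throughput_map(3, 3))   # one spatial stage over 3 contexts
# Same circuit, fully spatial (new result every cycle):
print(throughput_map(3, 1))   # 3 pipeline stages, 1 context each
```

The two extremes recover the earlier comparison: t = 1 is the fully spatial, single-context pipeline; t = depth folds the whole circuit into one multi-context stage.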
Dharma Architecture (UC Berkeley) • Allows a levelized circuit to be executed • Design parameters: #DLM; K -> number of DLM inputs; L -> number of levels
Levelization of Circuit • Levelization is performed on the basis of the dependency graph • Functions implemented as 3-input LUTs
Summary • Multiple contexts can be used to combat wire inactivity and logic latency • Too many contexts lead to inefficiencies due to retiming registers and extra LUT configuration area • Architectures such as DPGA and Dharma address these issues through contexts • A run-time system is needed to handle dynamic reconfiguration