Presentation Transcript


  1. ECE 697F Reconfigurable Computing, Lecture 17: Dynamic Reconfiguration I. Acknowledgement: Andre DeHon

  2. Overview • Motivation for Dynamic Reconfiguration • Limitations of Reconfigurable Approaches • Hardware support for reconfiguration • Example application: ASCII to Hex conversion (DeHon)

  3. DPGA • [Figure: a context identifier addresses the instruction store, which drives the configuration of each computation unit (LUT); programming may differ for each element] • Configuration selects operation of computation unit • Context identifier changes over time to allow change in functionality • DPGA – Dynamically Programmable Gate Array

  4. Computations that Benefit from Reconfiguration • Low throughput tasks • Data dependent operations • Effective if not all resources are active simultaneously; possible to time-multiplex both logic and routing resources • [Figure: non-pipelined example with functions F0, F1, F2 between signals A and B]

  5. Resource Reuse • Example circuit: part of the ASCII -> Hex design • Computation can be broken up by treating the circuit as a data flow graph

  6. Resource Reuse • Resources must be directed to do different things at different times through instructions • Different local configurations can be thought of as instructions • Minimizing the number and size of instructions is key to achieving an efficient design • What are the implications for the hardware?

  7. Actxt80Kl2 dense encoding Abase800Kl2 Previous Study (DeHon) Interconnect Mux Logic Reuse • * Each context no overly costly compared to base cost of wire, switches, IO circuitry Question: How does this effect scale?

  8. Exploring the Tradeoffs • Assume ideal packing: Nactive = Ntotal/L • Reminder: c*Actxt = Abase • Difficult to exactly balance resources and demands • Needs for contexts may vary across applications • Robust point: where critical path length equals the number of contexts
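The tradeoff on this slide can be sketched numerically. The short Python sketch below assumes the rough per-cell areas from the previous slide (Actxt ≈ 80 Kλ², Abase ≈ 800 Kλ²) and the ideal-packing assumption Nactive = Ntotal/L; the design size (64 LUTs, critical path 8) is made up for illustration.

```python
# Sketch of the context-count area tradeoff, using the rough numbers
# from the previous slide: A_ctxt ~ 80 K-lambda^2 per stored context,
# A_base ~ 800 K-lambda^2 of fixed per-cell area (wires, switches, I/O).
# The design size and ideal-packing assumption are illustrative.

A_CTXT = 80_000   # area of one stored context (lambda^2)
A_BASE = 800_000  # fixed area per cell: wire, switches, I/O circuitry

def total_area(n_total, path_length, contexts):
    """Area of a design with n_total LUTs and critical path length
    `path_length`, mapped onto cells with `contexts` contexts each.
    Ideal packing: only n_total / contexts physical cells are needed,
    but a cell cannot be reused more than path_length times."""
    c = min(contexts, path_length)   # extra contexts give no more reuse
    n_active = -(-n_total // c)      # ceil division: physical cells
    return n_active * (A_BASE + contexts * A_CTXT)

# Area is minimized near contexts == critical path length (the slide's
# "robust point"); too many contexts pay for unused configuration store.
for c in (1, 2, 4, 8, 16):
    print(c, total_area(n_total=64, path_length=8, contexts=c))
```

With these constants the minimum falls at 8 contexts for an 8-level path, matching the slide's claim that the robust point is where the number of contexts equals the critical path length.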

  9. Implementation Choices • [Figure: Implementation #1 with NA = 3 vs. Implementation #2 with NA = 4] • Both require the same amount of execution time • Implementation #1 is more resource efficient

  10. Scheduling Limitations • NA = size of largest stage in terms of active LUTs • Precedence: a LUT can only be evaluated after its predecessors have been evaluated • Need to assign design LUTs to device LUTs at specific contexts • Consider a formulation for scheduling: what are the choices?

  11. Scheduling • ASAP (as soon as possible) • Propagate depth forward from primary inputs • Depth = 1 + max input depth • ALAP (as late as possible) • Propagate distance from outputs backward toward inputs • Level = 1 + max output consumption level • Slack • Slack = L + 1 - (depth + level) • PI depth = 0, PO level = 0
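The ASAP/ALAP/slack rules above can be sketched in a few lines of Python. The 5-LUT dependence graph below is made up (it is not the slide's circuit); the conventions follow the slide: primary inputs have depth 0 and primary outputs have level 0, so a LUT fed only by primary inputs has depth 1 and a LUT driving a primary output has level 1, giving critical-path LUTs zero slack.

```python
# ASAP depth, ALAP level, and slack for a LUT dependence graph.
# preds[n] lists the LUTs feeding LUT n; primary inputs are omitted,
# so preds[n] == [] means n is fed only by primary inputs.
# This example graph is hypothetical.
preds = {
    "F1": [],            # fed only by primary inputs
    "F2": [],
    "F5": [],
    "F3": ["F1", "F2"],
    "F4": ["F3", "F5"],  # drives a primary output
}
succs = {n: [] for n in preds}
for n, ps in preds.items():
    for p in ps:
        succs[p].append(n)

def depth(n):
    # ASAP: depth = 1 + max input depth, propagated forward (PI depth = 0)
    return 1 + max((depth(p) for p in preds[n]), default=0)

def level(n):
    # ALAP: level = 1 + max output consumption level, backward (PO level = 0)
    return 1 + max((level(s) for s in succs[n]), default=0)

L = max(depth(n) for n in preds)  # critical path length in LUTs
slack = {n: L + 1 - (depth(n) + level(n)) for n in preds}
print(slack)  # F5 has slack 1; all other LUTs lie on the critical path
```

Here F5 can be scheduled in context 1 or 2 without stretching the schedule, which is exactly the freedom the sequentialization slide exploits to balance stage sizes.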

  12. Slack Example • Note connection from C1 to O1 • Critical path will have 0 slack • Admittedly small example

  13. Sequentialization • Adding time slots allows a potential increase in hardware efficiency • This comes at the cost of increased latency • Adding slack allows better balance • Example: L = 4, NA = 2 (4 or 3 contexts)

  14. Full ASCII -> Hex Circuit • Logically three levels of dependence • Single context: 21 LUTs @ 880 Kλ² = 18.5 Mλ²

  15. Time-multiplexed Version • Three contexts: 12 LUTs @ 1040 Kλ² = 12.5 Mλ² • Pipelining needed for dependent paths
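The area arithmetic on these two slides is easy to check directly (units are Kλ² per LUT cell, with the slides' per-cell figures: a 3-context cell is larger than a single-context one, but far fewer cells are needed):

```python
# Sanity check of the ASCII -> Hex area numbers (units: K-lambda^2).
single_context = 21 * 880    # 21 LUTs, single-context cells
three_context = 12 * 1040    # 12 LUTs, larger 3-context cells
print(single_context)  # 18480 -> ~18.5 M-lambda^2
print(three_context)   # 12480 -> ~12.5 M-lambda^2
```

The three-context version is about 1.5x smaller despite each cell carrying extra configuration storage, which is the payoff of time-multiplexing.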

  16. Context Optimization • With enough contexts, only one LUT is needed, but this leads to poor latency • Increased LUT area due to additional stored configuration information • Eventually the additional interconnect savings are taken up by LUT configuration overhead • Ideal = perfect scheduling spread + no retime overhead

  17. General Throughput Mapping • Useful if only limited throughput is desired • Target produces a new result every t cycles (e.g. a t-LUT path) • Spatially pipeline every t stages, so cycle = t • Retime to minimize register requirement • Multi-context evaluation within a spatial stage; retime to minimize resource usage • Map for depth, i, and contexts, C
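The recipe above reduces to simple arithmetic: for a target of one result every t cycles, cut a depth-d design into spatial pipeline stages of t LUT levels each and evaluate the t levels inside each stage across t contexts. The numbers in this sketch are made up for illustration.

```python
# Sketch of the general throughput-mapping arithmetic: spatially
# pipeline every t stages (cycle = t), multi-context within a stage.
import math

def throughput_map(path_depth, t):
    stages = math.ceil(path_depth / t)  # spatial pipeline stages
    contexts_per_stage = t              # cycle = t -> t contexts per stage
    return stages, contexts_per_stage

print(throughput_map(path_depth=12, t=4))  # (3, 4)
```

Retiming then minimizes the registers needed at the stage boundaries, per the slide.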

  18. Dharma Architecture (UC Berkeley) • Allows a levelized circuit to be executed • Design parameters: #DLM • K -> number of DLM inputs • L -> number of levels

  19. Example Dharma Circuit

  20. Levelization of Circuit • Levelization performed on the basis of a dependency graph • Functions implemented as 3-input LUTs

  21. Detailed View of Dharma

  22. Example: DPGA Prototype

  23. Example: DPGA Area

  24. Summary • Multiple contexts can be used to combat wire inactivity and logic latency • Too many contexts lead to inefficiencies due to retiming registers and extra LUTs • Architectures such as DPGA and Dharma address these issues through contexts • Run-time system needed to handle dynamic reconfiguration.