
Reconfigurable Computing



  1. Reconfigurable Computing • Dr. Christophe Bobda • CSCE Department • University of Arkansas

  2. Chapter 4: High-Level Synthesis for Reconfigurable Devices

  3. Agenda • Modeling • Dataflow graphs • Sequencing graphs • Finite State Machine with Datapath • High-level synthesis and Temporal partitioning • The List-Scheduling-based approach • Exact methods: Integer Linear Programming • Network-flow-based TP

  4. 1. Modeling • The success of a hardware platform depends on the ease of programming • Microprocessors • DSPs • High-level descriptions increase the acceptance of a platform • Modeling is a key aspect in the design of a system • The models used must be powerful enough to • Capture all the user's needs • Describe all system parameters • Be easy to understand and manipulate • Several powerful models exist (FSM, State Charts, DFG, Petri Nets, etc.) • We focus in this course on three models: dataflow graphs, sequencing graphs and the Finite State Machine with Datapath (FSMD)

  5. 1. Definitions • Given a node v_i and its implementation as a rectangular module in hardware: • l_i and h_i denote the length and the height of the module • a_i = l_i × h_i denotes the area • The latency t_i of v_i is the time it takes to compute the function of v_i using the module • For a given edge e_ij = (v_i, v_j), w_ij is the weight of e_ij, i.e. the width of the bus connecting the two components v_i and v_j • The latency t_ij of e_ij is the time needed to transmit data from v_i to v_j • Simplification: we will use v_i both for a node and for its hardware implementation

  6. 1.1 Dataflow Graph • Means to describe a computing task in a streaming mode • Nodes represent the operators • Inputs of the nodes are the operands • A node's output can be used as input to other nodes • Edges capture the data dependencies in the graph • Given a set of tasks T = {T_1, …, T_n}, a dataflow graph (DFG) is a directed acyclic graph G = (V, E), where • V is the set of nodes representing operators and • E is the set of edges: an edge e = (v_i, v_j) is defined through the (data) dependency between task T_i and task T_j • Dataflow graph for the quadratic root computation using: x = (−b + √(b² − 4ac)) / (2a)
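As an illustration, this DFG can be written down directly as an adjacency structure. The sketch below (Python; the node labels m1, s1, … are hypothetical, not from the slides) encodes each operator as a node and each data dependency as an edge for x = (−b + √(b² − 4ac)) / (2a):

# DFG for x = (-b + sqrt(b^2 - 4ac)) / (2a); each entry maps node -> consumers.
# Node labels (m1 ... d1) are illustrative only.
dfg = {
    "m1": ["s1"],   # b * b
    "m2": ["m3"],   # 4 * a
    "m3": ["s1"],   # (4 * a) * c
    "s1": ["r1"],   # b*b - 4*a*c
    "r1": ["a1"],   # sqrt(b*b - 4*a*c)
    "n1": ["a1"],   # -b
    "a1": ["d1"],   # -b + sqrt(...)
    "m4": ["d1"],   # 2 * a
    "d1": [],       # final division: result x
}

# An edge (u, v) means: operator v consumes the output of operator u,
# i.e. v can only be scheduled after u has produced its result.
assert all(v in dfg for succ in dfg.values() for v in succ)  # graph is closed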

  7. 1.2 Sequencing graph • Hierarchical dataflow graph with two different types of nodes: • The operation nodes corresponding to normal "task nodes" in a dataflow graph • The link nodes or branching nodes that point to another sequencing graph at a lower level of the hierarchy • Linking nodes evaluate conditional clauses • They are placed at the tail of alternative paths corresponding to possible branches • Loops can be modeled by using a branching node as the tail of two paths: • one for the exit from the loop • the other for the return to the body of the loop

  8. 1.2 Sequencing graph • Example: • According to the condition that node BR evaluates, one of the two sub-sequencing graphs (1 or 2) is activated • Loop description: the body of the loop is described in only one sub-sequencing graph; BR evaluates the loop exit condition

  9. 1.3 Finite State Machine with Datapath (FSMD) • Extension of the dataflow graph with a finite state machine • Formally, an FSMD is a 7-tuple F = ⟨S, I, O, V, f, h, s0⟩ where: • S is a set of states, • I is a set of inputs, • O is a set of outputs, • V is a set of variables, • f is a transition function that maps a tuple (state, input, variable) to a state, • h is an action function that maps the current state to outputs and variables, • s0 ∈ S is the initial state • FSMD vs FSM: • An FSMD operates on arbitrarily complex data types • Transitions may include arithmetic operations

  10. 1.3 Finite State Machine with Datapath (FSMD) • Modeling with FSMD: • The transformation of a program into an FSMD is done by transforming the statements of the program into FSMD states • The statements are first classified into three categories: • the assignment statements, • the branch statements and • the loop statements • For an assignment statement • a single state is created that executes the assignment action • An arc is created connecting this state with the state corresponding to the next program statement

  11. 1.3 Finite State Machine with Datapath (FSMD) • For a loop statement • a condition state C and a join state J, both with no action are created. • An arc, labeled with the loop condition and connecting the conditional state C with the state corresponding to the first statement in the loop body is added to the FSMD. • Accordingly another arc is added from C to the state corresponding to the first statement after the loop body. • This arc is labeled with the complement of the loop condition. • Finally an edge is added from the state corresponding to the last statement in the loop to the join state and • another edge is added from the join state back to the conditional state.

  12. 1.3 Finite State Machine with Datapath (FSMD) • For a branch statement, • a condition state C and a join state J, both with no action, are created • An arc is added from the conditional state to the first statement of the first branch • This arc is labeled with the first branch condition • A second arc, labeled with the complement of the first condition ANDed with the second branch condition, is added from the conditional state to the first statement of the second branch • This process is repeated until the last branch condition • Each state corresponding to the last statement in a branch is then connected to the join state • The join state is finally connected to the state corresponding to the first statement after the branch
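To make the three construction rules concrete, here is a minimal sketch (an assumed example, not taken from the slides) that builds and simulates the FSMD for the loop while (i < n) { s = s + i; i = i + 1; }. The condition state C and join state J carry no action, exactly as described above:

# FSMD for: while (i < n) { s = s + i; i = i + 1; }   (illustrative example)
# C = condition state, J = join state; both carry no action, per the rules.
fsmd = {
    "C":    {"action": None,         "next": [("i < n", "S1"), ("not (i < n)", "EXIT")]},
    "S1":   {"action": "s := s + i", "next": [(None, "S2")]},
    "S2":   {"action": "i := i + 1", "next": [(None, "J")]},   # last loop statement -> J
    "J":    {"action": None,         "next": [(None, "C")]},   # J feeds back to C
    "EXIT": {"action": None,         "next": []},
}

def run(env, state="C"):
    # Execute datapath actions on the variable set V (env) and follow guarded arcs.
    while state != "EXIT":
        action = fsmd[state]["action"]
        if action:
            var, expr = action.split(" := ")
            env[var] = eval(expr, {}, env)
        for guard, nxt in fsmd[state]["next"]:
            if guard is None or eval(guard, {}, env):
                state = nxt
                break
    return env

print(run({"i": 0, "n": 3, "s": 0}))  # -> i = 3, s = 0 + 1 + 2 = 3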

  13. 1.3 Finite State Machine with Datapath (FSMD)

  14. 2. High-Level synthesis • Next: compilation (synthesis) onto a set of hardware components • Usually done in three different steps: • The allocation defines the number of resource types required by the design, and for each type the number of instances • The binding maps each operation to an instance of a given resource • The scheduling sets the temporal assignment of resources to operators • Fundamental difference in reconfigurable computing: • In contrast to common high-level synthesis, only binding and scheduling are important • Resources can be created as needed in RC

  15. 2. High-Level synthesis • Dataflow graph of the two example functions • Assumptions on a fixed-resource device: • A multiplication needs 100 basic resources (e.g. 100 NAND gates or 100 LUTs) • The adder and the subtractor need 50 LUTs each • Despite the availability of resources, two subtractors cannot be used in the first level • The adder cannot be used in the first step: • it depends on a subtractor that must be executed first • Minimum execution time is 4 steps

  16. 2. High-Level synthesis • Dataflow graph of the two example functions • Assumptions on a reconfigurable device: • A multiplication needs 100 basic resources (e.g. 100 NAND gates or 100 LUTs) • The adder and the subtractor need 50 LUTs each • Total available amount of resources: 200 LUTs • The two subtractors can be assigned in the first step • Minimum execution time: 3 steps

  17. 2. Temporal partitioning – Problem definition • Temporal partitioning: given a DFG G = (V, E) and a reconfigurable device R, a temporal partitioning of G for R is a partition P = {P1, …, Pn} of V such that each partition fits on R • A temporal partition can also be defined as an ordered partition of G with the constraints imposed by R • With the ordering relation imposed on the partition, we reduce the solution space to only those partitions that can be scheduled on the device for execution • Therefore, cycles are not allowed in the dataflow graph; otherwise, the resulting partition may not be schedulable on the device [Figure: a cycle between two partitions makes the schedule infeasible]

  18. 2. Temporal partitioning • Goal: • Computation and scheduling of a configuration graph • A configuration graph is a graph in which: • Nodes are partitions or bitstreams • Edges reflect the precedence constraints in a given DFG • Communication through inter-configuration registers • The configuration sequence is controlled by the host processor • Before reconfiguration, register values are saved • After reconfiguration, the values are copied back into the registers [Figures: a configuration graph with partitions P1–P5 connected through inter-configuration registers; mapping of the device's IO registers into the processor's address space]

  19. 2. Temporal partitioning • Objectives: the following objectives are pursued: • Minimize the number of interconnections: this is one of the most important, since it minimizes: • the amount of exchanged data • the amount of memory for temporarily storing the data • Minimize the number of produced blocks • Minimize the overall computation delay • Quality of the result: a means to measure how well an algorithm performs • Connectivity of a graph G = (V, E): con(G) = 2·|E| / (|V|² − |V|) • Quality of a partitioning P = {P1, …, Pn}: average connectivity over P • High quality means the algorithm performs well • Low quality means the algorithm performs poorly [Figure: a 10-node graph with connectivity 0.24 and two partitionings of it with quality 0.25 and 0.45]
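From these definitions, both metrics are one-liners. The sketch below (Python; the (|V_i|, |E_i|) input format per partition is an assumption) computes the connectivity of each partition and the quality as their average:

def connectivity(num_nodes, num_edges):
    # con(G) = 2*|E| / (|V|^2 - |V|): fraction of possible edges actually present.
    return 0.0 if num_nodes < 2 else 2 * num_edges / (num_nodes ** 2 - num_nodes)

def quality(partitions):
    # Average connectivity over P = {P1, ..., Pn};
    # each partition is given as a pair (|V_i|, |E_i|).
    return sum(connectivity(v, e) for v, e in partitions) / len(partitions)

# Example: three partitions of sizes 4, 3, 3 with one internal edge each.
print(quality([(4, 1), (3, 1), (3, 1)]))  # (1/6 + 1/3 + 1/3) / 3 ≈ 0.28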

  20. 2. Temporal partitioning • Scheduling: given a DFG and an architecture, which is a set of processing elements • Compute the starting time of each node on a given resource • Temporal partitioning: given a DFG and a reconfigurable device • The starting time of each node is the starting time of the partition to which it belongs • Compute the starting time of each node on the device • Solution approaches: • List scheduling • Integer Linear Programming • Network flow • Spectral methods

  21. 2.1 Unconstrained Scheduling • ASAP (as soon as possible) • Defines the earliest starting time for each node in the DFG • Computes a minimal latency • ALAP (as late as possible) • Defines the latest starting time for each node in the DFG according to a given latency • The mobility of a node is the difference between its ALAP starting time and its ASAP starting time • Mobility is 0 ⇒ the node is on a critical path

  22. 2.1 ASAP Example • Unconstrained scheduling with optimal latency: L = 4 [Figure: ASAP schedule of the example DFG over time steps 0–4]

  23. 2.1 ASAP Algorithm

ASAP(G(V,E), d) {
   FOREACH (v_i without predecessor)
      s(v_i) := 0;
   REPEAT {
      choose a node v_i whose predecessors are all planned;
      s(v_i) := max_{j : (v_j, v_i) ∈ E} { s(v_j) + d_j };
   } UNTIL (all nodes v_i are planned);
   RETURN s;
}
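The same algorithm in executable form, as a sketch (Python; the edge-list and delay-dict input format is an assumption):

import collections

def asap(edges, delays):
    # edges: list of dependencies (u, v); delays: dict node -> latency d_i.
    preds = collections.defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    nodes = set(delays)
    start = {v: 0 for v in nodes if not preds[v]}  # nodes without predecessors
    while len(start) < len(nodes):
        for v in nodes - start.keys():
            if all(p in start for p in preds[v]):  # all predecessors planned
                start[v] = max(start[p] + delays[p] for p in preds[v])
    return start

# Two unit-delay multiplications feeding a subtraction:
print(asap([("m1", "s1"), ("m2", "s1")], {"m1": 1, "m2": 1, "s1": 1}))
# -> m1: 0, m2: 0, s1: 1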

  24. 2.1 ALAP Example • Unconstrained scheduling with optimal latency: L = 4 [Figure: ALAP schedule of the example DFG over time steps 0–4]

  25. 2.1 Mobility [Figure: ASAP and ALAP schedules side by side over time steps 0–4, each node annotated with its mobility: 0 on the critical path, up to 2 elsewhere]

  26. 2.1 ALAP Algorithm

ALAP(G(V,E), d, L) {
   FOREACH (v_i without successor)
      s(v_i) := L − d_i;
   REPEAT {
      choose a node v_i whose successors are all planned;
      s(v_i) := min_{j : (v_i, v_j) ∈ E} { s(v_j) } − d_i;
   } UNTIL (all nodes v_i are planned);
   RETURN s;
}
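And the symmetric ALAP version plus the mobility, under the same assumed input format:

import collections

def alap(edges, delays, L):
    # L is the target latency (e.g. the one obtained by ASAP).
    succs = collections.defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
    nodes = set(delays)
    start = {v: L - delays[v] for v in nodes if not succs[v]}  # no successors
    while len(start) < len(nodes):
        for v in nodes - start.keys():
            if all(s in start for s in succs[v]):  # all successors planned
                start[v] = min(start[s] for s in succs[v]) - delays[v]
    return start

def mobility(asap_times, alap_times):
    # Mobility 0 identifies nodes on a critical path.
    return {v: alap_times[v] - asap_times[v] for v in asap_times}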

  27. 2.1 Constrained scheduling • Extended ASAP, ALAP • Compute ASAP or ALAP • Assign the tasks earlier (ALAP) or later (ASAP) until the resource constraints are fulfilled • List scheduling • A list L of ready-to-run tasks is created • Tasks are placed in L in decreasing priority order • At a given step, each free resource is assigned the ready task with the highest priority • Criteria can be: number of successors, mobility, connectivity, etc.
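A compact sketch of resource-constrained list scheduling (Python; the unit-step time loop, the input format and the priority callback are assumptions, the criterion being e.g. the number of successors):

import collections

def list_scheduling(edges, delays, op_type, capacity, priority):
    # op_type: node -> resource type; capacity: type -> available instances;
    # priority: node -> rank (higher = scheduled first), e.g. successor count.
    preds = collections.defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    nodes, start, t = set(delays), {}, 0
    while len(start) < len(nodes):
        # tasks whose predecessors have all finished by time t are ready
        ready = sorted((v for v in nodes if v not in start
                        and all(p in start and start[p] + delays[p] <= t
                                for p in preds[v])),
                       key=priority, reverse=True)
        busy = collections.Counter(op_type[v] for v in start
                                   if start[v] <= t < start[v] + delays[v])
        for v in ready:  # assign free resources to the highest-priority tasks
            if busy[op_type[v]] < capacity[op_type[v]]:
                start[v] = t
                busy[op_type[v]] += 1
        t += 1
    return start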

  28. 2.1 Extended ASAP, ALAP • Resources: 2 multipliers, 2 ALUs (+, −, <) [Figure: resource-constrained schedule over time steps 0–4]

  29. 2.1 Constrained scheduling • Criterion: number of successors • Resources: 1 multiplier, 1 ALU (+, −, <) [Figure: the example DFG annotated with each node's priority (number of successors)]

  30. 2.1 Constrained scheduling [Figure: resulting schedule with 1 multiplier and 1 ALU over time steps 0–7]

  31. 2.1 Temporal partitioning vs constrained scheduling • List scheduling is used to keep an ordering between the nodes • List Scheduling (LS) partitioning: • 1. Construct a list L of all nodes, ordered by priority • 2. Create a new empty partition Pact • 2.1 Remove the next node from the list and place it in Pact • 2.2 If size(Pact) <= size(R) and T(Pact) <= T(R), go to 2.1; else go to 2.3 (the node that caused the overflow is moved into the next partition) • 2.3 If the list is empty, stop; else go to 2
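A minimal sketch of this partitioning loop (Python; only the size constraint is checked, the timing constraint T(Pact) <= T(R) is omitted for brevity):

def ls_partition(node_list, size, device_size):
    # node_list: nodes in decreasing priority order; size: node -> area.
    partitions, current, used = [], [], 0
    for v in node_list:
        if used + size[v] > device_size:   # node would overflow the device
            partitions.append(current)     # close Pact ...
            current, used = [], 0          # ... and open the next partition
        current.append(v)
        used += size[v]
    if current:
        partitions.append(current)
    return partitions

# Sizes from the example below: device 250, mult 100, add/sub 20, comp 10.
sizes = {"mul1": 100, "mul2": 100, "mul3": 100, "add1": 20, "cmp1": 10}
print(ls_partition(["mul1", "mul2", "mul3", "add1", "cmp1"], sizes, 250))
# -> [['mul1', 'mul2'], ['mul3', 'add1', 'cmp1']]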

  32. 2.1 Temporal partitioning vs constrained scheduling • Criterion: number of successors • size(FPGA) = 250, size(mult) = 100, size(add) = size(sub) = 20, size(comp) = 10 [Figure: the example DFG annotated with node priorities]

  33. 2.1 Temporal partitioning vs constrained scheduling • Connectivity: c(P1) = 1/6, c(P2) = 1/3, c(P3) = 2/6 • Quality: 0.83 [Figure: partitioning of the example DFG into P1, P2, P3]

  34. 2.1 Temporal partitioning vs constrained scheduling • Connectivity: c(P1) = 2/10, c(P2) = 2/3, c(P3) = 2/3 • Quality: 1.2 • The connectivity (and hence the quality) is better [Figure: an alternative partitioning of the example DFG into P1, P2, P3]

  35. 2.1 Temporal partitioning vs constrained scheduling • ASAP • Processing in topological order • Assigning level numbers to the nodes • Scheduling of nodes according to level number • Drawback: • "Levelization": assignment of nodes to partitions based on level numbers (increases data exchange) • Advantages: • Fast (linear run-time) • Local optimization possible [Figure: a DFG levelized into levels 0–3]

  36. 2.1 Temporal partitioning vs constrained scheduling • Local optimization by configuration switching • If two consecutive partitions P1 and P2 share a common set of operators, then: • Implement the minimal set of operators needed for the two partitions • Use signal multiplexing to switch from one partition to the next • Drawback: more resources are needed to implement the signal switching • Advantages: • the reconfiguration time is reduced • the device's operation is not interrupted

  37. 2.1 Temporal partitioning vs constrained scheduling [Figure: Configuration 1 and Configuration 2 implemented with shared operators (Mult, Add, Sub); intermediate values a–j pass through the inter-configuration registers]


  40. 2.1 Temporal partitioning vs constrained scheduling • Improved List Scheduling algorithm: • 1. Generate the list of nodes node_list • 2. Build a first partition P1 • 3. while (!node_list.empty()) • 3.1 build a new partition P2 • 3.2 if union(P1, P2) fits on the device, then implement configuration switching with P1 and P2 • 3.3 else set P1 = P2 and go to 3 • 4. Exit
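The merge test in step 3.2 can be sketched as follows (Python; modeling a partition as a dict node → size is an assumption, chosen so that operators shared by P1 and P2 are counted only once in the union, which is the point of configuration switching):

def improved_ls(partitions, device_size):
    # Greedily merge consecutive partitions when their union (with shared
    # operators counted once) still fits on the device.
    merged = [partitions[0]]
    for p2 in partitions[1:]:
        union = {**merged[-1], **p2}       # shared operator names collapse
        if sum(union.values()) <= device_size:
            merged[-1] = union             # one bitstream + signal switching
        else:
            merged.append(p2)              # P1 = P2, continue
    return merged

p1 = {"mul": 100, "add": 20}
p2 = {"mul": 100, "sub": 20}               # shares the multiplier with p1
print(improved_ls([p1, p2], 250))          # -> [{'mul': 100, 'add': 20, 'sub': 20}]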

  41. 2.2 Temporal partitioning – ILP • With ILP (Integer Linear Programming), • the temporal partitioning constraints are formulated as equations • The equations are then solved using an ILP solver • The constraints usually considered are: • Uniqueness constraint • Temporal order constraint • Memory constraint • Resource constraint • Latency constraint • Notation: a binary variable y_ik indicates whether node v_i is placed in partition P_k (see the formulas below)

  42. 2.2 Temporal partitioning – ILP • Unique assignment constraint: each task must be placed in exactly one partition • Precedence constraint: for each edge (v_i, v_j) in the graph, v_i must be placed either in the same partition as v_j or in an earlier partition than the one in which v_j is placed • Resource constraint: the sum of the resources needed to implement the modules in one partition must not exceed the total amount of available resources • Device area constraint: the total area of the modules in a partition must not exceed the device area • Device terminal constraint: the number of terminals (wires crossing partition boundaries) must not exceed the number of available device pins
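The formulas themselves did not survive in this transcript; a standard reconstruction (with y_ik ∈ {0,1} indicating that node v_i is placed in partition P_k, a_i the area of v_i, and a(R) the device area; the exact notation of the original slides may differ) reads:

\begin{align*}
\text{unique assignment:} \quad & \sum_{k=1}^{n} y_{ik} = 1 && \forall v_i \in V \\
\text{precedence:} \quad & \sum_{k=1}^{n} k \cdot y_{ik} \;\le\; \sum_{k=1}^{n} k \cdot y_{jk} && \forall (v_i, v_j) \in E \\
\text{device area:} \quad & \sum_{v_i \in V} a_i \cdot y_{ik} \;\le\; a(R) && \forall k \in \{1, \dots, n\}
\end{align*}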

  43. 2.3 Temporal partitioning – Network-flow • Recursive bipartitioning: • The goal at each step is the generation of a unidirectional bipartition, • i.e. to compute a bipartition which minimizes the edge-cut size between the two partitions • Network flow methods are used to compute a bipartition with minimal edge-cut size • Directly applying the min-cut max-flow theorem may lead to a cut that is not unidirectional • Therefore, the original graph G is first transformed into a new graph G' in which each cut is unidirectional [Figure: unidirectional recursive bipartitioning vs a bidirectional cut]

  44. 2.3 Temporal partitioning – Network-flow • Two-terminal net transformation • Replace an edge (v1, v2) with two edges: (v1, v2) with capacity 1 and (v2, v1) with infinite capacity • Multi-terminal net transformation • For a multi-terminal net {v1, v2, …, vn}: • Introduce a dummy node v with no weight and a bridging edge (v1, v) with capacity 1 • Introduce the edges (v, v2), …, (v, vn), each of which is assigned a capacity of 1 • Introduce the edges (v2, v1), …, (vn, v1), each of which is assigned an infinite capacity • Having computed a min-cut in the transformed graph G', a min-cut can be derived in G: for each node of G' assigned to a partition, its counterpart in G is assigned to the corresponding partition in G
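Both transformations can be written down mechanically; in this sketch (Python; the nets input format is an assumption), each net contributes exactly the capacitated arcs described above:

def transform_nets(nets):
    # nets: dict source -> list of sinks; returns arcs (u, v, capacity)
    # of the transformed graph G' in which every finite cut is unidirectional.
    INF = float("inf")
    arcs = []
    for v1, sinks in nets.items():
        if len(sinks) == 1:                  # two-terminal net
            v2 = sinks[0]
            arcs.append((v1, v2, 1))         # forward edge, capacity 1
            arcs.append((v2, v1, INF))       # reverse edge, infinite capacity
        elif len(sinks) > 1:                 # multi-terminal net
            dummy = ("net", v1)              # fresh dummy node, no weight
            arcs.append((v1, dummy, 1))      # bridging edge, capacity 1
            for vk in sinks:
                arcs.append((dummy, vk, 1))  # dummy -> sink, capacity 1
                arcs.append((vk, v1, INF))   # sink -> source, infinite capacity
    return arcs

print(transform_nets({"a": ["b"], "c": ["d", "e"]}))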
