Exploiting Loop-Level Parallelism for Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins Presented By: Nikhil Bansal
Outline • Introduction • coarse-grained reconfigurable architectures • core problem: exploiting parallelism • modulo scheduling problem • Compiler Framework • Modulo Scheduling Algorithm • Conclusions and Future Work
Example of Coarse-Grained Architectures: MorphoSys • [Figures: topology of MorphoSys; architecture of a reconfigurable cell (Ming-Hau Lee et al., University of California, Irvine)] • Other examples: REMARC, PACT, Chameleon, KressArray, QuickSilver ...
Core Problem: Exploiting Parallelism Which kind of parallelism makes a difference? • Instruction-level parallelism • limited parallelism (constrained by dependences) • VLIW does a good job • Task(thread)-level parallelism • hard to automate • lacks support in coarse-grained architectures • Loop-level parallelism (pipelining) • fits coarse-grained architectures well • higher parallelism than ILP
Pipelining Using Modulo Scheduling Modulo Scheduling (general): • A way of pipelining • Iterations are overlapped • Each iteration is initiated at a fixed interval (the initiation interval, II) For coarse-grained architectures: • Where to place an operation? (placement) • When to schedule an operation? (scheduling) • How to connect operations? (routing) • Modulo constraints
Modulo Scheduling Problem (cont.) • [Figure a: an example — a dataflow graph with operations n1..n4 mapped onto a 2x2 matrix of FUs fu1..fu4] • [Figure b: space-time representation — prologue (t=0..1), steady state (kernel) at t=2, epilogue (t=3..4)] • II = 1, pipeline stages = 3, 4 operations/cycle in the kernel
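The example above can be sketched in a few lines: with II = 1 every operation folds onto the same kernel time slot, so all four FUs fire every cycle once the pipeline has filled. The start times below are read off the space-time figure and are illustrative.

```python
# Sketch of the 2x2 example: operations n1..n4 placed on fu1..fu4 with II = 1.
II = 1
placement = {"n1": "fu1", "n2": "fu2", "n3": "fu3", "n4": "fu4"}
start_time = {"n1": 0, "n2": 1, "n3": 1, "n4": 2}  # within one iteration

def kernel_slot(op):
    """Modulo constraint: an op occupies its FU at (start_time mod II)."""
    return start_time[op] % II

# Fold the per-iteration schedule into the steady-state kernel.
kernel = {}
for op, fu in placement.items():
    kernel.setdefault(kernel_slot(op), []).append(fu)

pipeline_stages = max(start_time.values()) - min(start_time.values()) + 1

print(kernel)           # {0: ['fu1', 'fu2', 'fu3', 'fu4']} -> 4 ops/cycle
print(pipeline_stages)  # 3
```

With II = 1 and 3 pipeline stages, a new iteration starts every cycle and three iterations are in flight in the steady state, which is exactly the kernel row of the figure.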
Outline • Introduction • Compiler Framework • structure of compiler • architecture description and abstraction • Modulo Scheduling Algorithm • Conclusion and Future Work
The Structure of DRESC Compiler DRESC stands for Dynamically Reconfigurable Embedded Systems Compiler • [Figure: C program → IMPACT frontend (external tool) → Lcode IR → dataflow analysis & transformation → modulo scheduling algorithm; architecture description → architecture parser → architecture abstraction; simulator (under development)]
The Target Architecture Template • [Figures: example of an FU and register file (muxa/muxb/muxc, src1/src2/pred inputs, dst1/pred_dst1/pred_dst2 outputs, configuration RAM); examples of topology] • Generalizing common features of other architectures • Using an XML-based language to specify topology, resource allocation, operations and timing
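To make the XML-based description concrete, here is a minimal sketch of parsing one. The element and attribute names (`architecture`, `fu`, `connection`, `ops`, `src`, `dst`) are hypothetical, invented for illustration; they are not the actual DRESC schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical architecture description; tag/attribute names are illustrative.
description = """
<architecture rows="2" cols="2">
  <fu id="fu1" ops="add sub mul"/>
  <fu id="fu2" ops="add sub"/>
  <connection src="fu1.out1" dst="fu2.src1"/>
</architecture>
"""

root = ET.fromstring(description)
# Operation binding: which operations each FU supports.
fus = {fu.get("id"): fu.get("ops").split() for fu in root.iter("fu")}
# Topology: directed port-to-port connections.
links = [(c.get("src"), c.get("dst")) for c in root.iter("connection")]

print(fus)    # {'fu1': ['add', 'sub', 'mul'], 'fu2': ['add', 'sub']}
print(links)  # [('fu1.out1', 'fu2.src1')]
```

A parser like this is the natural front end for the architecture abstraction step: the parsed resources and links are what get expanded into the MRRG described on the next slides.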
Architecture Description and Abstraction • XML-based architecture description: specifies resource allocation, operation binding, topology and timing; fed through the architecture parser into the architecture abstraction • The Modulo Routing Resource Graph (MRRG) abstracts the architecture for modulo scheduling. It combines features of: • the modulo reservation table (MRT) from VLIW compilation • the routing resource graph from FPGA P&R
Definitions of MRRG The MRRG is defined as a 3-tuple G = {V, E, II}: • each node v = (r, t), where r refers to a resource and t to a time stamp • E = {(vm, vn) | t(vm) <= t(vn)} • II is the initiation interval Important properties: • modulo: if node (r, tj) is used, all nodes {(r, tk) | tj mod II = tk mod II} are used too • asymmetric: no route from vi to vj if t(vi) > t(vj) The modulo scheduling problem is thus transformed into a placement and routing (P&R) problem on the MRRG
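The modulo property is the key difference from an ordinary routing resource graph, and it can be sketched directly: reserving one (resource, time) node reserves every time slot congruent to it modulo II. This is a minimal illustration, not the DRESC data structure.

```python
II = 2

def congruent_nodes(resource, t, num_cycles):
    """All (resource, t') occupied together with (resource, t): the modulo
    property ties together every slot with t' mod II == t mod II."""
    return [(resource, tk) for tk in range(num_cycles) if tk % II == t % II]

occupied = set()

def place(resource, t, num_cycles=6):
    """Placing an op on (r, t) reserves every congruent slot, or fails."""
    nodes = congruent_nodes(resource, t, num_cycles)
    if any(n in occupied for n in nodes):
        return False  # modulo resource conflict
    occupied.update(nodes)
    return True

r1 = place("fu1", 0)  # True: reserves (fu1,0), (fu1,2), (fu1,4)
r2 = place("fu1", 4)  # False: 4 mod 2 == 0, conflicts with the first op
r3 = place("fu1", 1)  # True: the odd slots are still free
print(r1, r2, r3)
```

The asymmetric property (edges only go forward in time) is what makes routing on this graph a feasibility question rather than a shortest-path one: a value can only be routed toward later time stamps.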
Transform Components to MRRG • [Figures: MRRG models of an FU (pred/src1/src2 inputs, source/sink nodes, pred_dst1/pred_dst2/dst outputs) and of a register file unrolled over cycles (in, cap, out1/out2)] • Register allocation is transformed into part of the P&R problem and solved implicitly by the modulo scheduling algorithm • Register modeling is based on Roos2001
Outline • Introduction • Compiler Framework • Modulo Scheduling Algorithm • combined placement and routing • congestion negotiation • simulated annealing • results and related work • Conclusions and Future Work
Combined Placement and Routing • [Figure: flowchart — init placement & routing; rip up an op; re-placement; routing; repeat until success. For normal FPGA P&R, placement (LUT1, LUT2) and routing through switch blocks are separate phases] • On the space-time routing resource graph, routability can't be guaranteed during placement, so placement and routing must be combined
Proposed Algorithm • Sort the operations • For each II, first generate an initial schedule that respects dependence constraints only • The algorithm then iteratively reduces resource overuse, trying to reach a legal schedule: • at every iteration, an operation is ripped up from the existing schedule and re-placed randomly • connected nets are rerouted accordingly • a cost function (next slide) is computed to evaluate the new placement and routing • a simulated annealing strategy decides whether the new placement is accepted or restored • [Figure: flowchart — init temperature and penalty; init P&R; rip up op; re-P&R op; evaluate new P&R; accept or restore; update temperature and penalty; repeat until success]
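The rip-up/re-place loop with Metropolis acceptance can be sketched as below. The cost model and move generator are passed in as placeholders; this is a hedged sketch of the annealing strategy the slide describes, not the DRESC implementation, and the demo `target`/`random_move` at the bottom are invented for illustration.

```python
import math
import random

def anneal(ops, initial_schedule, cost, random_move,
           temp=10.0, cooling=0.95, steps_per_temp=50, min_temp=1e-3):
    """Iteratively rip up one op, re-place it randomly, and accept or
    restore the move with the Metropolis rule; cost 0 = legal schedule."""
    schedule = dict(initial_schedule)
    current = cost(schedule)
    while temp > min_temp:
        for _ in range(steps_per_temp):
            op = random.choice(ops)          # rip up one operation
            old_slot = schedule[op]
            schedule[op] = random_move(op)   # re-place it randomly
            delta = cost(schedule) - current
            # Always accept improvements; sometimes accept worse placements
            # to escape local minima (less often as the temperature drops).
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                current += delta
            else:
                schedule[op] = old_slot      # restore the old placement
            if current == 0:
                return schedule, current     # legal schedule found
        temp *= cooling                      # cool down
    return schedule, current

# Toy demo: cost counts ops not yet in their (hypothetical) legal slots.
random.seed(0)
ops = ["n1", "n2", "n3"]
target = {"n1": 0, "n2": 1, "n3": 2}
schedule, final_cost = anneal(
    ops, {op: 0 for op in ops},
    cost=lambda s: sum(s[op] != target[op] for op in ops),
    random_move=lambda op: random.randrange(4))
print(final_cost)  # 0: a conflict-free schedule was found
```

In the real scheduler the cost is the congestion-based function on the next slide and a move re-places an op on the MRRG, but the accept/restore skeleton is the same.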
Cost Function • Resources are allowed to be overused during P&R • The cost of using one node grows with its overuse, where base is the base cost of the node in the MRRG, occ its occupancy, cap its capacity, and p a penalty factor • The penalty factor p is increased over time, so persistent congestion becomes progressively more expensive until a legal (overuse-free) solution emerges
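A negotiation-style cost model in this spirit (as in PathFinder-like FPGA routers) can be sketched as follows; the exact formula used in the paper may differ, so treat this as an assumed instance of the scheme described above.

```python
def node_cost(base, occ, cap, penalty):
    """Cost of using one MRRG node: overused nodes (occ > cap) become
    progressively more expensive as the penalty factor grows."""
    overuse = max(0, occ - cap)
    return base * (1 + penalty * overuse)

def update_penalty(penalty, growth=1.5):
    """Raise the penalty factor each iteration to squeeze out congestion."""
    return penalty * growth

p = 1.0
print(node_cost(base=1.0, occ=1, cap=1, penalty=p))  # 1.0: no overuse
print(node_cost(base=1.0, occ=3, cap=1, penalty=p))  # 3.0: overused by 2
p = update_penalty(p)
print(node_cost(base=1.0, occ=3, cap=1, penalty=p))  # 4.0: penalty has grown
```

Allowing overuse early keeps the annealer mobile; the rising penalty then forces contested nodes to be resolved, which is the compromise between scheduling quality and speed mentioned on the next slide.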
Parameters to Tune the Algorithm • Ordering of operations • techniques from Llosa2001 • Relaxing factor of schedule length • difficulty of moving operations vs. more pipeline stages • Parameters of the SA algorithm • Costs associated with different resources • register files get a lower base cost • Penalty factor associated with overused resources • a compromise between scheduling quality and speed • ...
Scheduling Results Scheduling results on an 8x8 matrix resembling the topology of MorphoSys Algorithm limitations: • scheduling speed is relatively slow • scheduling quality still has room to improve • can't handle pipelined FUs • can only handle the inner loop of a loop nest
Related Work • Modulo scheduling on clustered VLIWs • the problem is simpler in nature (no routing) • RaPiD, Garp • row-based architectures and scheduling techniques • no multiplexing • PipeRench • the ring-like architecture is very specific; the scheduling techniques are not general • Z. Huang, S. Malik, DAC2002 • either uses a full crossbar or generates a dedicated datapath for several loops for pipelining
Outline • Introduction • Compiler Framework • Modulo Scheduling Algorithm • Conclusions and Future Work
Conclusions and Future Work • Conclusions: • coarse-grained architectures have distinct features; compilers are possible and needed • loop-level parallelism is the right kind for coarse-grained reconfigurable architectures • a novel modulo scheduling algorithm and an abstract architecture representation have been developed • Future Work: • improve the quality and speed of the scheduling algorithm • enlarge the scope of pipelineable loops • techniques to reduce the bottlenecks of pipelineable loops, e.g., taking distributed memory into account