Exploiting Loop-Level Parallelism for Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins Presented By: Nikhil Bansal
Outline • Introduction • coarse-grained reconfigurable architectures • core problem: exploiting parallelism • modulo scheduling problem • Compiler Framework • Modulo Scheduling Algorithm • Conclusions and Future Work
Example of Coarse-Grained Architectures: MorphoSys • [Figures: topology of MorphoSys; architecture of a reconfigurable cell (Ming-Hau Lee et al., University of California, Irvine)] • Other examples: REMARC, PACT, Chameleon, KressArray, QuickSilver ...
Core Problem: Exploiting Parallelism Which kind of parallelism makes a difference? • Instruction-level parallelism • limited parallelism (constrained by dependences) • VLIW does a good job • Task(thread)-level parallelism • hard to automate • lacks support in coarse-grained architectures • Loop-level parallelism (pipelining) • fits coarse-grained architectures well • higher parallelism than ILP
Pipelining Using Modulo Scheduling Modulo Scheduling (general): • A way of pipelining • Iterations are overlapped • Each iteration is initiated at a fixed interval (the initiation interval, II) For coarse-grained architectures: • Where to place an operation? (placement) • When to schedule an operation? (scheduling) • How to connect operations? (routing) • Modulo constraints
Modulo Scheduling Problem (cont.) • [Figure a: an example — a dataflow graph with operations n1..n4 mapped onto a 2x2 matrix of FUs fu1..fu4] • [Figure b: space-time representation — prologue (t=0..1), steady state (kernel) at t=2, epilogue (t=3..4)] • II = 1, pipeline stages = 3, 4 operations/cycle in the kernel
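The example above can be sketched in a few lines: with II = 1 every operation folds onto the same kernel time slot, so all four FUs fire every cycle once the pipeline has filled. The start times below are read off the space-time figure and are illustrative.

```python
# Sketch of the 2x2 example: operations n1..n4 placed on fu1..fu4 with II = 1.
II = 1
placement = {"n1": "fu1", "n2": "fu2", "n3": "fu3", "n4": "fu4"}
start_time = {"n1": 0, "n2": 1, "n3": 1, "n4": 2}  # within one iteration

def kernel_slot(op):
    """Modulo constraint: an op occupies its FU at (start_time mod II)."""
    return start_time[op] % II

# Fold the per-iteration schedule into the steady-state kernel.
kernel = {}
for op, fu in placement.items():
    kernel.setdefault(kernel_slot(op), []).append(fu)

pipeline_stages = max(start_time.values()) - min(start_time.values()) + 1

print(kernel)           # {0: ['fu1', 'fu2', 'fu3', 'fu4']} -> 4 ops/cycle
print(pipeline_stages)  # 3
```

With II = 1 and 3 pipeline stages, a new iteration starts every cycle and three iterations are in flight in the steady state, which is exactly the kernel row of the figure.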
Outline • Introduction • Compiler Framework • structure of compiler • architecture description and abstraction • Modulo Scheduling Algorithm • Conclusion and Future Work
The Structure of DRESC Compiler DRESC stands for Dynamically Reconfigurable Embedded Systems Compiler • [Figure: C program → IMPACT frontend (external tool) → Lcode IR → dataflow analysis & transformation → modulo scheduling algorithm; architecture description → architecture parser → architecture abstraction; simulator (under development)]
The Target Architecture Template • [Figures: example of an FU and register file (muxa/muxb/muxc, src1/src2/pred inputs, dst1/pred_dst1/pred_dst2 outputs, configuration RAM); examples of topology] • Generalizing common features of other architectures • Using an XML-based language to specify topology, resource allocation, operations and timing
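To make the XML-based description concrete, here is a minimal sketch of parsing one. The element and attribute names (`architecture`, `fu`, `connection`, `ops`, `src`, `dst`) are hypothetical, invented for illustration; they are not the actual DRESC schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical architecture description; tag/attribute names are illustrative.
description = """
<architecture rows="2" cols="2">
  <fu id="fu1" ops="add sub mul"/>
  <fu id="fu2" ops="add sub"/>
  <connection src="fu1.out1" dst="fu2.src1"/>
</architecture>
"""

root = ET.fromstring(description)
# Operation binding: which operations each FU supports.
fus = {fu.get("id"): fu.get("ops").split() for fu in root.iter("fu")}
# Topology: directed port-to-port connections.
links = [(c.get("src"), c.get("dst")) for c in root.iter("connection")]

print(fus)    # {'fu1': ['add', 'sub', 'mul'], 'fu2': ['add', 'sub']}
print(links)  # [('fu1.out1', 'fu2.src1')]
```

A parser like this is the natural front end for the architecture abstraction step: the parsed resources and links are what get expanded into the MRRG described on the next slides.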
Architecture Description and Abstraction • XML-based architecture description: specifies resource allocation, operation binding, topology and timing; fed through the architecture parser into the architecture abstraction • The Modulo Routing Resource Graph (MRRG) abstracts the architecture for modulo scheduling. It combines features of: • the modulo reservation table (MRT) from VLIW compilation • the routing resource graph from FPGA P&R
Definitions of MRRG The MRRG is defined as a 3-tuple G = {V, E, II}: • each node v = (r, t), where r refers to a resource and t to a time stamp • E = {(vm, vn) | t(vm) <= t(vn)} • II is the initiation interval Important properties: • modulo: if node (r, tj) is used, all nodes {(r, tk) | tj mod II = tk mod II} are used too • asymmetric: no route from vi to vj if t(vi) > t(vj) The modulo scheduling problem is thus transformed into a placement and routing (P&R) problem on the MRRG
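The modulo property is the key difference from an ordinary routing resource graph, and it can be sketched directly: reserving one (resource, time) node reserves every time slot congruent to it modulo II. This is a minimal illustration, not the DRESC data structure.

```python
II = 2

def congruent_nodes(resource, t, num_cycles):
    """All (resource, t') occupied together with (resource, t): the modulo
    property ties together every slot with t' mod II == t mod II."""
    return [(resource, tk) for tk in range(num_cycles) if tk % II == t % II]

occupied = set()

def place(resource, t, num_cycles=6):
    """Placing an op on (r, t) reserves every congruent slot, or fails."""
    nodes = congruent_nodes(resource, t, num_cycles)
    if any(n in occupied for n in nodes):
        return False  # modulo resource conflict
    occupied.update(nodes)
    return True

r1 = place("fu1", 0)  # True: reserves (fu1,0), (fu1,2), (fu1,4)
r2 = place("fu1", 4)  # False: 4 mod 2 == 0, conflicts with the first op
r3 = place("fu1", 1)  # True: the odd slots are still free
print(r1, r2, r3)
```

The asymmetric property (edges only go forward in time) is what makes routing on this graph a feasibility question rather than a shortest-path one: a value can only be routed toward later time stamps.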
Transform Components to MRRG • [Figures: MRRG models of an FU (pred/src1/src2 inputs, source/sink nodes, pred_dst1/pred_dst2/dst outputs) and of a register file unrolled over cycles (in, cap, out1/out2)] • Register allocation is transformed into part of the P&R problem and solved implicitly by the modulo scheduling algorithm • Register modeling is based on Roos2001
Outline • Introduction • Compiler Framework • Modulo Scheduling Algorithm • combined placement and routing • congestion negotiation • simulated annealing • results and related work • Conclusions and Future Work
Combined Placement and Routing • [Figure: flowchart — init placement & routing; rip up an op; re-placement; routing; repeat until success. For normal FPGA P&R, placement (LUT1, LUT2) and routing through switch blocks are separate phases] • On the space-time routing resource graph, routability can't be guaranteed during placement, so placement and routing must be combined
Proposed Algorithm • Sort the operations • For each II, first generate an initial schedule that respects dependence constraints only • The algorithm then iteratively reduces resource overuse, trying to reach a legal schedule: • at every iteration, an operation is ripped up from the existing schedule and re-placed randomly • connected nets are rerouted accordingly • a cost function (next slide) is computed to evaluate the new placement and routing • a simulated annealing strategy decides whether the new placement is accepted or restored • [Figure: flowchart — init temperature and penalty; init P&R; rip up op; re-P&R op; evaluate new P&R; accept or restore; update temperature and penalty; repeat until success]
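The rip-up/re-place loop with Metropolis acceptance can be sketched as below. The cost model and move generator are passed in as placeholders; this is a hedged sketch of the annealing strategy the slide describes, not the DRESC implementation, and the demo `target`/`random_move` at the bottom are invented for illustration.

```python
import math
import random

def anneal(ops, initial_schedule, cost, random_move,
           temp=10.0, cooling=0.95, steps_per_temp=50, min_temp=1e-3):
    """Iteratively rip up one op, re-place it randomly, and accept or
    restore the move with the Metropolis rule; cost 0 = legal schedule."""
    schedule = dict(initial_schedule)
    current = cost(schedule)
    while temp > min_temp:
        for _ in range(steps_per_temp):
            op = random.choice(ops)          # rip up one operation
            old_slot = schedule[op]
            schedule[op] = random_move(op)   # re-place it randomly
            delta = cost(schedule) - current
            # Always accept improvements; sometimes accept worse placements
            # to escape local minima (less often as the temperature drops).
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                current += delta
            else:
                schedule[op] = old_slot      # restore the old placement
            if current == 0:
                return schedule, current     # legal schedule found
        temp *= cooling                      # cool down
    return schedule, current

# Toy demo: cost counts ops not yet in their (hypothetical) legal slots.
random.seed(0)
ops = ["n1", "n2", "n3"]
target = {"n1": 0, "n2": 1, "n3": 2}
schedule, final_cost = anneal(
    ops, {op: 0 for op in ops},
    cost=lambda s: sum(s[op] != target[op] for op in ops),
    random_move=lambda op: random.randrange(4))
print(final_cost)  # 0: a conflict-free schedule was found
```

In the real scheduler the cost is the congestion-based function on the next slide and a move re-places an op on the MRRG, but the accept/restore skeleton is the same.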
Cost Function • Resources are allowed to be overused during P&R • The cost of using one node grows with its overuse, where base is the base cost of the node in the MRRG, occ its occupancy, cap its capacity, and p a penalty factor • The penalty factor p is increased over time, so persistent congestion becomes progressively more expensive until a legal (overuse-free) solution emerges
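A negotiation-style cost model in this spirit (as in PathFinder-like FPGA routers) can be sketched as follows; the exact formula used in the paper may differ, so treat this as an assumed instance of the scheme described above.

```python
def node_cost(base, occ, cap, penalty):
    """Cost of using one MRRG node: overused nodes (occ > cap) become
    progressively more expensive as the penalty factor grows."""
    overuse = max(0, occ - cap)
    return base * (1 + penalty * overuse)

def update_penalty(penalty, growth=1.5):
    """Raise the penalty factor each iteration to squeeze out congestion."""
    return penalty * growth

p = 1.0
print(node_cost(base=1.0, occ=1, cap=1, penalty=p))  # 1.0: no overuse
print(node_cost(base=1.0, occ=3, cap=1, penalty=p))  # 3.0: overused by 2
p = update_penalty(p)
print(node_cost(base=1.0, occ=3, cap=1, penalty=p))  # 4.0: penalty has grown
```

Allowing overuse early keeps the annealer mobile; the rising penalty then forces contested nodes to be resolved, which is the compromise between scheduling quality and speed mentioned on the next slide.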
Parameters to Tune the Algorithm • Ordering of operations • techniques from Llosa2001 • Relaxing factor of schedule length • difficulty of moving operations vs. more pipeline stages • Parameters of the SA algorithm • Costs associated with different resources • register files get a lower base cost • Penalty factor associated with overused resources • a compromise between scheduling quality and speed • ...
Scheduling Results Scheduling results on an 8x8 matrix resembling the topology of MorphoSys Algorithm limitations: • scheduling speed is relatively slow • scheduling quality still has room to improve • can't handle pipelined FUs • can only handle the inner loop of a loop nest
Related Work • Modulo scheduling on clustered VLIWs • the problem is simpler in nature (no routing) • RaPiD, Garp • row-based architectures and scheduling techniques • no multiplexing • PipeRench • the ring-like architecture is very specific; the scheduling techniques are not general • Z. Huang, S. Malik, DAC2002 • either uses a full crossbar or generates a dedicated datapath for several loops for pipelining
Outline • Introduction • Compiler Framework • Modulo Scheduling Algorithm • Conclusions and Future Work
Conclusions and Future Work • Conclusions: • coarse-grained architectures have distinct features; compilers are possible and needed • loop-level parallelism is the right kind for coarse-grained reconfigurable architectures • a novel modulo scheduling algorithm and an abstract architecture representation have been developed • Future Work: • improve the quality and speed of the scheduling algorithm • enlarge the scope of pipelineable loops • techniques to reduce the bottlenecks of pipelineable loops, e.g., taking distributed memory into account