Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures. Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke. Advanced Computer Architecture Lab, University of Michigan.
Coarse-Grained Reconfigurable Architecture (CGRA)
• Array of processing elements (PEs) connected in a mesh-like interconnect
• Characterized by array size, node functionalities, interconnect, register file configurations
• Execute compute-intensive kernels in multimedia applications
[Figure: a PE with blocks labeled Config, FU, and LRF]
CGRA: Attractive Alternative to ASICs
• Suitable for running multimedia applications on embedded systems
• High computation throughput
• Low power consumption and scalability
• High flexibility with fast configuration
• MorphoSys: 8x8 array with RISC processor
  • SIMD-style execution of loops
• PipeRench: 1-D reconfigurable hardware
  • Virtualize hardware pipeline
• ADRES: 8x8 array with tightly coupled VLIW
  • Modulo scheduling with simulated annealing
Scheduling in CGRA
• Different from conventional VLIW
  • Sparse interconnect and distributed register files
  • No dedicated routing resources
• Need a good compiler to exploit the abundance of computing resources
[Figure: a CGRA in which FU0 to FU3 each have their own LRF, beside a conventional VLIW in which FU0 to FU3 share a central RF]
Objectives of This Work
• Modulo scheduling technique for CGRAs
  • Exploit loop-level parallelism by overlapping execution of iterations
• Targeting low-cost CGRAs
  • Achieve a quality schedule under hardware restrictions
• Fast compilation time
Modulo Scheduling Basics
• Expose loop-level parallelism by overlapping execution of iterations
• Initiation interval (II)
  • A new iteration starts every II cycles
[Figure: overlapped execution of several iterations of a loop body A, B, C, with successive iterations offset by II]
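A minimal sketch of the timing behind overlapped execution, assuming every iteration uses the same schedule of length `schedule_length` cycles and iteration i starts at cycle i x II. The function names and the example numbers (three one-cycle operations A, B, C with II = 1) are illustrative assumptions, not values taken from the slide figure.

```python
def start_cycles(num_iterations: int, ii: int) -> list[int]:
    """Cycle at which each iteration begins: iteration i starts at i * ii."""
    return [i * ii for i in range(num_iterations)]


def modulo_schedule_cycles(num_iterations: int, schedule_length: int, ii: int) -> int:
    """Total cycles when a new iteration is started every `ii` cycles."""
    if num_iterations <= 0:
        return 0
    # The last iteration starts at (n - 1) * ii and runs for schedule_length cycles.
    return (num_iterations - 1) * ii + schedule_length


def sequential_cycles(num_iterations: int, schedule_length: int) -> int:
    """Total cycles when iterations run back to back with no overlap."""
    return max(num_iterations, 0) * schedule_length


if __name__ == "__main__":
    # Assumed example: loop body A, B, C (3 cycles), 6 iterations, II = 1.
    print(start_cycles(6, 1))
    print(modulo_schedule_cycles(6, 3, 1), "cycles overlapped")
    print(sequential_cycles(6, 3), "cycles sequential")
```

A smaller II means more overlap between iterations, which is why a modulo scheduler tries to find a valid schedule at the smallest II the resources and dependences allow.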
Modulo Scheduling for CGRA
• Mapping the data-flow graph (DFG) onto a 3-D scheduling space
  • Limited number of scheduling slots: (number of PEs) x II
• Minimize routing cost (number of slots used for routing)
• Sparse interconnect and distributed register files
  • Ensure routability of operands
[Figure: a DFG mapped onto the scheduling space of a 4x4 CGRA, with time (II) as the third axis]
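The slot limit can be made concrete with a modulo reservation table: because the schedule repeats every II cycles, an operation placed on a PE at cycle t occupies the slot (PE, t mod II). The sketch below is a hypothetical illustration; the class and method names are assumptions and are not taken from the authors' compiler.

```python
class ModuloReservationTable:
    """Tracks which (PE, cycle mod II) slots are occupied."""

    def __init__(self, num_pes: int, ii: int) -> None:
        if num_pes <= 0 or ii <= 0:
            raise ValueError("num_pes and ii must be positive")
        self.num_pes = num_pes
        self.ii = ii
        self.slots: dict[tuple[int, int], str] = {}

    @property
    def capacity(self) -> int:
        """Total scheduling slots: (number of PEs) x II."""
        return self.num_pes * self.ii

    def is_free(self, pe: int, time: int) -> bool:
        return (pe, time % self.ii) not in self.slots

    def reserve(self, pe: int, time: int, op: str) -> bool:
        """Claim a slot for `op`; return False if it is already taken."""
        if not 0 <= pe < self.num_pes:
            raise ValueError("PE index out of range")
        if not self.is_free(pe, time):
            return False
        self.slots[(pe, time % self.ii)] = op
        return True


if __name__ == "__main__":
    mrt = ModuloReservationTable(num_pes=16, ii=2)  # 4x4 array, II = 2
    print(mrt.capacity)              # number of available slots
    print(mrt.reserve(0, 0, "op_a"))
    print(mrt.reserve(0, 2, "op_b"))  # cycle 2 wraps onto cycle 0 of PE 0
```

Every slot spent on routing an operand is a slot that cannot hold a computation, which is why routing cost is the quantity to minimize.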
Our Approach
• Systematic approach to generate a good schedule in reasonable time
• Minimize routing cost
  • Convert scheduling problem into graph embedding
  • Leverage graph embedding algorithm
• Ensure routability of operands
  • Skewed scheduling space
  • Create a narrow, but tall scheduling space
1: Minimize Routing Cost
• Routing cost: number of PEs used for routing
  • Determined by positions of producer and consumer
  • Minimize distance between producers and consumers
• Height-based list scheduling
  • Schedule operations in the order of dependence height
  • Place consumers close to producers
  • Need to carefully place operations of the same height
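The ordering step of height-based list scheduling can be sketched as follows. This assumes dependence height means the longest path from an operation to a sink, counting the operation itself, so sinks have height 1. The seven-operation DFG is a hypothetical example and does not reproduce the edges of any slide figure.

```python
def dependence_heights(consumers: dict[int, list[int]]) -> dict[int, int]:
    """Height of each op: 1 for sinks, else 1 + the tallest consumer."""
    heights: dict[int, int] = {}

    def height(op: int) -> int:
        if op not in heights:
            heights[op] = 1 + max(
                (height(c) for c in consumers.get(op, [])), default=0
            )
        return heights[op]

    for op in consumers:
        height(op)
    return heights


def scheduling_order(consumers: dict[int, list[int]]) -> list[int]:
    """Ops sorted from greatest height to smallest (ties broken by op id)."""
    heights = dependence_heights(consumers)
    return sorted(heights, key=lambda op: (-heights[op], op))


if __name__ == "__main__":
    # Hypothetical DFG: ops 0 and 2 feed 4, ops 1 and 3 feed 5, ops 4 and 5 feed 6.
    dfg = {0: [4], 1: [5], 2: [4], 3: [5], 4: [6], 5: [6], 6: []}
    print(dependence_heights(dfg))
    print(scheduling_order(dfg))
```

The order alone says nothing about how to arrange operations that share a height, which is the gap the last bullet points to.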
Scheduling Example: Routing Cost
[Figure: a DFG with operations 0 to 6 scheduled onto a 1x4 CGRA (PE 0 to PE 3) in two ways. One placement adds routing slots 4' and 5' (routing cost = 2); the other needs none (routing cost = 0).]
• Common consumer information is important!
Affinity Graph Heuristic
• Consider placement of operations with the same height together
  • Use common consumer information
• Affinity value between operations
  • Measured by the distance of common consumers in DFG
• Construct affinity graph
  • Nodes: operations, edges: affinity values
  • Place operations with affinity edges close to each other
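One way to turn "distance of common consumers" into a number is sketched below. The slides do not give the formula, so the weighting here is an assumption made for illustration: each common descendant at distance d (the larger of the two distances) contributes 2^(max_dist - d), so nearer common consumers count more. The DFG is the same hypothetical seven-operation example used for illustration only.

```python
from collections import deque
from itertools import combinations


def descendant_distances(consumers: dict[int, list[int]], src: int) -> dict[int, int]:
    """Shortest DFG distance from `src` to each op reachable from it."""
    dist: dict[int, int] = {}
    queue = deque([(src, 0)])
    while queue:
        op, d = queue.popleft()
        for c in consumers.get(op, []):
            if c not in dist:
                dist[c] = d + 1
                queue.append((c, d + 1))
    return dist


def affinity(consumers: dict[int, list[int]], a: int, b: int, max_dist: int = 2) -> int:
    """Assumed weighting: closer common consumers contribute more."""
    da = descendant_distances(consumers, a)
    db = descendant_distances(consumers, b)
    total = 0
    for c in da.keys() & db.keys():
        d = max(da[c], db[c])
        if d <= max_dist:
            total += 2 ** (max_dist - d)
    return total


def affinity_graph(consumers, ops, max_dist: int = 2) -> dict[tuple[int, int], int]:
    """Edges between same-height ops, keyed by (a, b) with a < b."""
    edges = {}
    for a, b in combinations(sorted(ops), 2):
        w = affinity(consumers, a, b, max_dist)
        if w > 0:
            edges[(a, b)] = w
    return edges


if __name__ == "__main__":
    dfg = {0: [4], 1: [5], 2: [4], 3: [5], 4: [6], 5: [6], 6: []}
    print(affinity_graph(dfg, [0, 1, 2, 3]))
```

In this example ops 0 and 2 share a direct consumer and get a heavier edge than ops 0 and 1, which only meet at op 6, so a placer would favor putting 0 next to 2.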
Affinity Graph Example
[Figure: a DFG with operations 0 to 5 and levels labeled height 3, height 2, and height 1; the affinity graph built from it; and two mappings onto a 2x4 CGRA, one bad and one good]
• Drawing affinity graph onto scheduling space
Leveraging Graph Embedding
• Graph embedding
  • Drawing a graph onto a target space
• Grid layout algorithm by Li & Kurata
  • Embed complicated biochemical networks onto 2-D grid space
  • Simulated annealing
• Our scheduling problem is a graph embedding problem
  • Draw affinity graph onto scheduling space minimizing edge length
[Figure: process flow of grid layout [Li 2005]]
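The idea of drawing a weighted graph onto a grid while minimizing edge length can be sketched with a generic simulated-annealing loop. This is not the Li & Kurata grid layout algorithm and not the authors' scheduler; the cost function (weight times Manhattan distance), the swap move, the cooling schedule, and the edge weights are all assumptions chosen to keep the example short.

```python
import math
import random


def layout_cost(edges, pos) -> int:
    """Sum of edge weight x Manhattan distance between endpoints."""
    return sum(
        w * (abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
        for (a, b), w in edges.items()
    )


def anneal_layout(nodes, edges, rows, cols, steps=2000, t0=2.0, seed=0):
    """Place `nodes` on a rows x cols grid (at least 2 cells) by annealing."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    if len(nodes) > len(cells):
        raise ValueError("grid is too small for the graph")
    # grid[i] is the node sitting in cells[i], or None for an empty cell.
    grid = list(nodes) + [None] * (len(cells) - len(nodes))
    rng.shuffle(grid)

    def positions(g):
        return {n: cells[i] for i, n in enumerate(g) if n is not None}

    cost = layout_cost(edges, positions(grid))
    best, best_cost = list(grid), cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling
        i, j = rng.sample(range(len(cells)), 2)
        grid[i], grid[j] = grid[j], grid[i]  # propose a swap
        new_cost = layout_cost(edges, positions(grid))
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(grid), cost
        else:
            grid[i], grid[j] = grid[j], grid[i]  # reject: undo the swap
    return positions(best), best_cost


if __name__ == "__main__":
    # Hypothetical affinity edges among four same-height operations.
    edges = {(0, 1): 1, (0, 2): 3, (0, 3): 1, (1, 2): 1, (1, 3): 3, (2, 3): 1}
    pos, cost = anneal_layout([0, 1, 2, 3], edges, rows=1, cols=4)
    print(pos, cost)
```

Heavier affinity edges pull their endpoints together, so operations with close common consumers end up on neighboring PEs.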
2: Ensure Routability of Operands
• Resources are repeatedly used every II cycles
  • Routing can fail due to previously scheduled operations
• Backtracking: hard to make forward progress for CGRA
  • Take preventative approach
[Figure: a DFG with operations 0 to 7 scheduled onto a 1x3 CGRA (PE 0 to PE 2), with slots reused every II cycles; routing failed for Op 7!]
Skewed Scheduling Space
• Should prevent routing failures in advance
• Skew scheduling space
  • Staggering down to the right
  • Create a narrow, but tall scheduling space
  • Operations can be routed to the right
• Dynamically adjust scheduling space
[Figure: operations 0 to 7 placed in a scheduling space before and after skewing]
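A rough sketch of why staggering helps, under assumptions that are not stated on the slide: a 1xN array, one hop to the right per cycle, and a fixed window of allowed cycles per PE that starts `skew` cycles later for each PE further right. The function names are hypothetical. With skew = 1, a value produced inside its PE's window always arrives inside the window of any PE to its right.

```python
def skewed_windows(num_pes: int, start: int, height: int, skew: int = 1) -> dict[int, range]:
    """Allowed cycles per PE; each PE to the right starts `skew` cycles later."""
    return {
        pe: range(start + pe * skew, start + pe * skew + height)
        for pe in range(num_pes)
    }


def can_route_right(windows: dict[int, range], src_pe: int, src_time: int, dst_pe: int) -> bool:
    """True if a value can hop right one PE per cycle and stay inside every window."""
    if dst_pe < src_pe:
        raise ValueError("this sketch only routes to the right")
    # The hop at PE `pe` happens (pe - src_pe) cycles after the value is produced.
    return all(
        src_time + (pe - src_pe) in windows[pe]
        for pe in range(src_pe, dst_pe + 1)
    )


if __name__ == "__main__":
    skewed = skewed_windows(num_pes=3, start=0, height=2, skew=1)
    flat = skewed_windows(num_pes=3, start=0, height=2, skew=0)
    print(can_route_right(skewed, 0, 0, 2))  # stays within the staggered windows
    print(can_route_right(flat, 0, 0, 2))    # runs past the unskewed window
```

The window height stands in for the "narrow, but tall" shape: few cycles are open per PE at any moment, but the open region keeps moving down as it moves right.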
System Flow
Experimental Setup
• Twelve innermost loop kernels from various domains
• Three designs with different RF configurations
  • Evaluate the impact of register file sharing
[Figure: the three designs: Dedicated RF, Shared RF, Central RF]
Evaluation of Affinity Heuristic
• Results of acyclic scheduling
• Average of 59% reduction in routing cost
Modulo Graph Embedding vs. Simulated Annealing
• Utilization = (# slots used for computation) / (# total slots)
• Compilation time: about 5 sec (modulo graph embedding) vs. 5 min to 3 hours (simulated annealing)
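The utilization metric is a simple ratio, sketched here on the assumption that the total slot count is (number of PEs) x II, as in the scheduling-space definition. The function name and the numbers in the usage example are illustrative and are not results from the evaluation.

```python
def utilization(compute_slots: int, num_pes: int, ii: int) -> float:
    """Fraction of scheduling slots that hold computation rather than routing or nothing."""
    total_slots = num_pes * ii
    if total_slots <= 0:
        raise ValueError("num_pes and ii must be positive")
    if not 0 <= compute_slots <= total_slots:
        raise ValueError("compute_slots must be between 0 and total slots")
    return compute_slots / total_slots


if __name__ == "__main__":
    # Assumed example: 20 operations on a 4x4 array scheduled at II = 2.
    print(utilization(compute_slots=20, num_pes=16, ii=2))
```

For a fixed loop body, a lower II shrinks the denominator, so higher utilization goes hand in hand with higher throughput.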
Conclusions
• Modulo scheduler targeting low-cost CGRAs
  • Provide high computation throughput, scalability, power efficiency
• Two heuristics to generate a good schedule
  • Affinity graph heuristic
  • Skewed scheduling space
• Average utilizations of 56-68% for three designs
• Systematic approach allows fast compilation time
  • All benchmarks finished within 5 s
Questions?