This paper discusses exploiting pseudo-schedules to guide data dependence graph partitioning in clustered architectures. It focuses on minimizing inter-cluster communication delays and leveraging communication locality in VLIW architectures.
PACT 2002, Charlottesville, Virginia – September 2002

Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning

Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli
{aaleta, jmcodina, fran, antonio}@ac.upc.es, kaeli@ece.neu.edu
Clustered Architectures
• Current/future challenges in processor design
  • Delay in the transmission of signals
  • Power consumption
  • Architecture complexity
• Clustering: divide the system into semi-independent units
  • Each unit is a cluster
  • Fast intra-cluster interconnects
  • Slow inter-cluster interconnects
• A common trend in commercial VLIW processors
  • TI's C6x
  • Analog's TigerSHARC
  • HP's LX
  • Equator's MAP1000
Architecture Overview
[Figure: n clusters, each with a local register file, functional units (FU), and memory units (MEM), connected by register buses and sharing the L1 cache]
Instruction Scheduling
• For non-clustered architectures, the scheduler deals with:
  • Resources
  • Dependences
• For clustered architectures, it must also perform cluster assignment:
  • Minimize inter-cluster communication delays
  • Exploit communication locality
• This work focuses on modulo scheduling, a technique to schedule loops, for clustered VLIW architectures
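As background, modulo scheduling initiates a new loop iteration every II cycles, where II is bounded below by the MII. The sketch below is a hypothetical illustration of the standard MII lower bound (not code from the paper): the maximum of a resource-constrained bound and a recurrence-constrained bound over the data dependence graph.

```python
from math import ceil

def resource_mii(uses, counts):
    # uses: resource kind -> number of operations in the loop needing it
    # counts: resource kind -> number of functional units of that kind
    return max(ceil(n / counts[r]) for r, n in uses.items())

def recurrence_mii(cycles):
    # cycles: list of (total_latency, total_distance) for each
    # elementary cycle (recurrence) in the data dependence graph
    return max(ceil(lat / dist) for lat, dist in cycles)

def mii(uses, counts, cycles):
    return max(resource_mii(uses, counts),
               recurrence_mii(cycles) if cycles else 0)

# e.g. 6 memory ops on 2 memory units, one recurrence of latency 3, distance 1
print(mii({"mem": 6, "alu": 4}, {"mem": 2, "alu": 4}, [(3, 1)]))  # 3
```

Here the resource bound (6 memory operations on 2 units gives 3 cycles) dominates and sets MII = 3.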
Talk Outline
• Previous work
• Proposed algorithm
  • Overview
  • Graph partitioning
  • Pseudo-scheduling
• Performance evaluation
• Conclusions
MS for Clustered Architectures
In previous work, two different approaches were proposed:
• Two steps (Cluster Assignment → Scheduling, with II++ on failure)
  • Data dependence graph partitioning: each instruction is assigned to a cluster
  • Scheduling: instructions are scheduled in a suitable slot, but only in the preassigned cluster
• One step (Cluster Assignment + Scheduling, with II++ on failure)
  • There is no initial cluster assignment
  • The scheduler is free to choose any cluster
Goal of the Work
• Both approaches have benefits
  • Two steps: global vision of the data dependence graph
    • Workload is better split among the clusters
    • Number of communications is reduced
  • One step: local vision of the partial scheduling
    • Cluster assignment is performed with information from the partial scheduling
• Goal: obtain an algorithm that combines the benefits of both approaches
Baseline
• Baseline scheme: GP [Aletà et al., MICRO-34]
  • Cluster assignment performed with a graph partitioning algorithm
  • Feedback between the partitioner and the scheduler
  • Results outperformed previous approaches
  • Still, little information is available for cluster assignment
• New algorithm: a better partition
  • Pseudo-schedules are used to guide the partition
    • Global vision of the data dependence graph
    • More information to perform cluster assignment
Algorithm Overview
1. Compute an initial partition; set II := MII and start scheduling
2. Schedule each operation Opj based on the current partition
3. If Opj cannot be scheduled in its assigned cluster, try moving it to another cluster
4. If it still cannot be scheduled, increase II (II++), refine the partition, and restart scheduling
5. Otherwise, select the next operation (j++)
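The flow above can be sketched as a driver loop. This is a toy model, not the paper's implementation: clusters are reduced to a per-cluster slot budget of II, and partition refinement is elided.

```python
def modulo_schedule(ops, partition, n_clusters, mii, max_ii=64):
    """Toy driver mirroring the flow: place each op in its assigned
    cluster; on failure try another cluster; if nothing fits,
    increase II and restart (partition refinement omitted)."""
    ii = mii
    while ii <= max_ii:
        slots = {c: ii for c in range(n_clusters)}  # free slots per cluster
        placed = {}
        ok = True
        for op in ops:
            cluster = partition[op]
            if slots[cluster] == 0:  # move op to another cluster
                free = [c for c in range(n_clusters) if slots[c] > 0]
                if not free:
                    ok = False  # triggers II++ and a restart
                    break
                cluster = free[0]
            slots[cluster] -= 1
            placed[op] = cluster
        if ok:
            return ii, placed
        ii += 1
    return None

# 5 ops, 2 clusters: II = 2 gives only 4 slots, so the loop settles at II = 3
ii, placed = modulo_schedule(list("ABCDE"), {o: 0 for o in "ABCDE"}, 2, mii=2)
```

The real algorithm differs in that a failed placement can also trigger partition refinement rather than only II++.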
Graph Partitioning Background
• Problem statement: split the nodes into a predetermined number of sets while optimizing some objective function
• Multilevel strategy
  • Coarsen the graph: iteratively fuse pairs of nodes into new macro-nodes
• Enhancing heuristics
  • Avoid excess load in any one set
  • Reduce the execution time of the loops
Graph Coarsening
• Previous definitions: matching, slack
• Iterate until the number of macro-nodes equals the number of clusters:
  • Weight the edges according to:
    • The impact on execution time of adding a bus delay to the edge
    • The slack of the edge
  • Select the maximum-weight matching
  • Fuse the nodes linked by edges in the matching into single macro-nodes
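One coarsening round might be sketched as follows. This is a hypothetical illustration: it uses a greedy approximation instead of an exact maximum-weight matching, and the edge weights are given rather than derived from bus delays and slack.

```python
def greedy_matching(edges):
    """Greedy approximation of maximum-weight matching.
    edges: list of (weight, u, v); heavier edges are tried first."""
    matched, matching = set(), []
    for w, u, v in sorted(edges, reverse=True):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched |= {u, v}
    return matching

def coarsen(nodes, edges):
    """Fuse each matched pair into a macro-node; unmatched nodes survive."""
    macro = {}
    for u, v in greedy_matching(edges):
        macro[u] = macro[v] = (u, v)
    return [macro.get(n, n) for n in nodes]

# Heavy edges (a,b) and (c,d) are matched; the light edge (b,c) is not
print(coarsen(["a", "b", "c", "d"],
              [(4, "a", "b"), (1, "b", "c"), (4, "c", "d")]))
```

Repeating this round on the macro-node graph shrinks it until one macro-node per cluster remains.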
Coarsening Example
[Figure: an edge-weighted graph is repeatedly coarsened; at each step a maximum-weight matching is found and the matched nodes are fused, yielding the final graph]
Coarsening Example (II)
[Figure: the partition of the final coarsened graph induces a partition of the original graph]
Reducing Execution Time
• An estimation of execution time is needed → pseudo-schedules
• Information obtained from a pseudo-schedule:
  • II (initiation interval)
  • SC (stage count)
  • Lifetimes
  • Spills
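Given the II and stage count (SC) of a pseudo-schedule, loop execution time can be approximated with the standard software-pipelining estimate below. This is a common textbook approximation offered as a sketch, not necessarily the paper's exact formula.

```python
def estimated_cycles(ii, sc, iterations):
    # The kernel initiates one iteration every II cycles; filling and
    # draining the pipeline adds (sc - 1) extra stages of II cycles each.
    return (iterations + sc - 1) * ii

print(estimated_cycles(ii=2, sc=3, iterations=100))  # 204
```

A partition that lowers II (or SC, for short loops) therefore directly lowers the estimated execution time.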
Building Pseudo-schedules
• Dependences
  • Respected if possible
  • Otherwise, a penalty on register pressure and/or execution time is assessed
• Cluster assignment
  • The partition is strictly followed
Pseudo-schedule: Example
• 2 clusters, 1 FU per cluster, 1 bus of latency 1, II = 2, instruction latency = 3
[Figure: the partition induced on nodes A, B, C, D and the resulting pseudo-schedule]
Heuristic Description
• While there is improvement, iterate:
  • Different partitions are obtained by moving nodes among clusters
  • Partitions that overload the resources of any cluster are discarded
  • The partition minimizing execution time is chosen
  • In case of a tie, the one that minimizes register pressure is selected
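The move-based heuristic above might be sketched as follows. The names `estimate_time`, `pressure`, and `overloaded` are hypothetical stand-ins for the pseudo-schedule-based estimates; this is an illustrative hill-climbing loop, not the paper's code.

```python
def refine(partition, nodes, n_clusters, estimate_time, pressure, overloaded):
    """One refinement phase: try moving each node to every other cluster,
    discard overloaded candidates, keep the best one (minimum estimated
    time, ties broken by register pressure), and repeat while improving."""
    best = dict(partition)
    best_key = (estimate_time(best), pressure(best))
    improved = True
    while improved:
        improved = False
        for n in nodes:
            for c in range(n_clusters):
                if c == best[n]:
                    continue
                cand = dict(best)
                cand[n] = c
                if overloaded(cand):
                    continue  # discarded: some cluster's resources exceeded
                key = (estimate_time(cand), pressure(cand))
                if key < best_key:
                    best, best_key, improved = cand, key, True
    return best

# Toy usage: "time" is just the load of the fullest cluster, so the
# heuristic balances four nodes across two clusters
balanced = refine({"a": 0, "b": 0, "c": 0, "d": 0}, "abcd", 2,
                  estimate_time=lambda p: max(
                      sum(1 for v in p.values() if v == c) for c in range(2)),
                  pressure=lambda p: 0,
                  overloaded=lambda p: False)
```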
The Scheduling Step
• To schedule the partition we use URACAM [Codina et al., PACT'01]
  • Uses dynamic transformations, driven by a figure of merit, to improve the partial schedule
    • Register communications through the bus or through memory
    • Spill code generated on the fly
    • Register pressure traded against memory
• If an instruction cannot be scheduled in the cluster assigned by the partition:
  • Try all other clusters
  • Select the best one according to the figure of merit
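The fallback in the last bullet might look like the sketch below; `can_schedule` and `figure_of_merit` are hypothetical stand-ins, not URACAM's actual interfaces.

```python
def place(op, preferred, clusters, can_schedule, figure_of_merit):
    """Try the cluster chosen by the partition first; if the op does not
    fit there, rank all other feasible clusters by the figure of merit."""
    if can_schedule(op, preferred):
        return preferred
    candidates = [c for c in clusters
                  if c != preferred and can_schedule(op, c)]
    if not candidates:
        return None  # no cluster fits: triggers II++ and refinement
    return max(candidates, key=lambda c: figure_of_merit(op, c))

# Toy usage: cluster 0 is full, so the op lands in the best alternative
chosen = place("x", 0, [0, 1, 2],
               can_schedule=lambda op, c: c != 0,
               figure_of_merit=lambda op, c: -c)
```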
Partition Refinement
• II has increased, so a better partition may exist for the new II:
  • New slots have been generated in each cluster
  • More lifetimes are available
  • A larger number of bus communications is allowed
• The coarsening process is repeated
  • Only edges between nodes in the same set can appear in the matching
  • After coarsening, the induced partition is the last partition that could not be scheduled
• The execution-time-reduction heuristic is then reapplied
Benchmarks and Configurations
• Benchmarks: all of SPECfp95, using the ref input set
• Two schedulers evaluated:
  • GP (previous work)
  • PSP (pseudo-schedule-based)
• Resources:

  Resource      | Unified | 2-cluster | 4-cluster
  INT/cluster   |    4    |     2     |     1
  FP/cluster    |    4    |     2     |     1
  MEM/cluster   |    4    |     2     |     1

• Latencies:

  Operation     | INT | FP
  MEM           |  2  |  2
  ARITH         |  1  |  3
  MUL/ABS       |  2  |  6
  DIV/SQR/TRG   |  6  | 18
GP vs PSP
[Charts: GP vs PSP results for 32 registers split into 2 clusters, 1 bus (L=1), and for 32 registers split into 4 clusters, 1 bus (L=1)]
GP vs PSP
[Charts: GP vs PSP results for 64 registers split into 4 clusters, 1 bus (L=2), and for 32 registers split into 4 clusters, 1 bus (L=2)]
Conclusions
• A new algorithm to perform modulo scheduling for clustered VLIW architectures
  • Cluster assignment based on multilevel graph partitioning
• The partitioning algorithm is improved
  • Based on pseudo-schedules
  • Reliable information is available to guide the partition
• Outperforms previous work
  • 38.5% speedup for some configurations
GP vs PSP
[Charts: GP vs PSP results for 64 registers split into 2 clusters, 1 bus (L=1), and for 64 registers split into 4 clusters, 1 bus (L=1)]
Different Alternatives
• Two steps (Cluster Assignment → Scheduling, II++)
  • Global vision when assigning clusters
  • The schedule follows the assignment exactly
  • Re-scheduling does not take newly available resources into account
• One step (Cluster Assignment + Scheduling, II++)
  • Local vision when assigning and scheduling
  • Assignment is based on current resource usage
  • No global view of the graph
• This work (Cluster Assignment ↔ Scheduling, II++)
  • Global and local views of the graph
  • If an operation cannot be scheduled, depending on the reason:
    • Re-schedule
    • Re-compute the cluster assignment