Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning

This paper discusses exploiting pseudo-schedules to guide data dependence graph partitioning in clustered architectures. It focuses on minimizing inter-cluster communication delays and leveraging communication locality in VLIW architectures.

Presentation Transcript


  1. Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning • PACT 2002, Charlottesville, Virginia, September 2002 • Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli • {aaleta, jmcodina, fran, antonio}@ac.upc.es, kaeli@ece.neu.edu

  2. Clustered Architectures • Current/future challenges in processor design • Delay in the transmission of signals • Power consumption • Architecture complexity • Clustering: divide the system into semi-independent units • Each unit → Cluster • Fast interconnects intra-cluster • Slow interconnects inter-cluster • Common trend in commercial VLIW processors • TI’s C6x • Analog’s TigerSHARC • HP’s LX • Equator’s MAP1000

  3. Architecture Overview • [Figure: clusters 1 to n, each with a local register file, functional units (FU) and a memory unit (MEM); labels also show register buses and an L1 cache]

  4. Instruction Scheduling • For non-clustered architectures • Resources • Dependences • For clustered architectures • Cluster assignment • Minimize inter-cluster communication delays • Exploit communication locality • This work focuses on modulo scheduling for clustered VLIW architectures • Technique to schedule loops

  5. Talk Outline • Previous work • Proposed algorithm • Overview • Graph partitioning • Pseudo-scheduling • Performance evaluation • Conclusions

  6. MS for Clustered Architectures • In previous work, two different approaches were proposed: • Two steps [diagram: Cluster Assignment, then Scheduling, with II++] • Data Dependence Graph partitioning: each instruction is assigned to a cluster • Scheduling: instructions are scheduled in a suitable slot but only in the preassigned cluster • One step [diagram: Cluster Assignment + Scheduling, with II++] • There is no initial cluster assignment • The scheduler is free to choose any cluster

  7. Goal of the Work • Both approaches have benefits • Two steps • Global vision of the Data Dependence Graph • Workload is better split among different clusters • Number of communications is reduced • One step • Local vision of partial scheduling • Cluster assignment is performed with information of the partial scheduling • Goal: obtain an algorithm taking advantage of the benefits of both approaches

  8. Baseline • Baseline scheme: GP [Aletà et al., MICRO-34] • Cluster assignment performed with a graph partitioning algorithm • Feedback between the partitioner and the scheduler • Results outperformed previous approaches • Still little information available for cluster assignment • New algorithm: better partition • Pseudo-schedules are used to guide the partition • Global vision of the Data Dependence Graph • More information to perform cluster assignment

  9. Algorithm Overview • [Flowchart] Compute initial partition • II := MII • Start scheduling • Schedule Op_j based on the current partition • Able to schedule? YES: select next operation (j++) • NO: move Op_j to another cluster • Able to schedule? YES: select next operation (j++) • NO: II++, refine partition, and start scheduling again
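
The flowchart on this slide can be read as the driver loop sketched below. This is only an illustration under stated assumptions: `modulo_schedule`, `try_schedule` and `refine` are hypothetical names, the scheduler and the refinement step are passed in as callables rather than implemented, and the first feasible alternative cluster is taken instead of the best one according to a figure of merit.

```python
from typing import Callable, Dict, List, Tuple

Partition = Dict[str, int]   # operation name -> cluster chosen by the partitioner
Placement = Dict[str, int]   # operation name -> cluster it was actually scheduled in


def modulo_schedule(
    ops: List[str],
    mii: int,
    n_clusters: int,
    partition: Partition,
    try_schedule: Callable[[str, int, int, Placement], bool],
    refine: Callable[[Partition, int], Partition],
    max_ii: int = 64,
) -> Tuple[int, Placement]:
    """Hypothetical driver: schedule at II = MII, raise II and refine on failure."""
    ii = mii
    while ii <= max_ii:
        placed: Placement = {}
        for op in ops:
            home = partition[op]
            others = [c for c in range(n_clusters) if c != home]
            # Try the cluster given by the partition first, then the others.
            for cluster in [home] + others:
                if try_schedule(op, cluster, ii, placed):
                    placed[op] = cluster
                    break
            else:
                break  # no cluster can host this operation at the current II
        else:
            return ii, placed  # every operation was scheduled
        ii += 1
        partition = refine(partition, ii)  # new partition for the larger II
    raise RuntimeError("no schedule found up to max_ii")
```

With a toy `try_schedule` that accepts an operation while a cluster holds fewer than II operations, five operations on two clusters fail at II = 2 and succeed at II = 3.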

  11. Graph Partitioning Background • Problem statement • Split the nodes into a pre-determined number of sets while optimizing some function • Multilevel strategy • Coarsen the graph • Iteratively fuse pairs of nodes into new macro-nodes • Enhancing heuristics • Avoid excess load in any one set • Reduce execution time of the loops

  12. Graph Coarsening • Previous definitions • Matching • Slack • Iterate until the number of nodes equals the number of clusters: • The edges are weighted according to • Impact on execution time of adding a bus delay to the edge • Slack of the edge • Then, select the maximum weight matching • Nodes linked by edges in the matching are fused into a single macro-node
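
A minimal sketch of the coarsening loop on this slide, under several assumptions: edge weights are taken as given (the slide derives them from the bus-delay impact and the slack, which is not modelled here), a greedy heaviest-edge-first matching stands in for a true maximum weight matching, and parallel edges created by fusing nodes have their weights summed. The names `greedy_matching` and `coarsen` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

Weights = Dict[FrozenSet[str], float]  # undirected edge {u, v} -> weight


def greedy_matching(weights: Weights) -> List[FrozenSet[str]]:
    """Pick edges heaviest first, skipping any edge that touches a matched node."""
    matched: Set[str] = set()
    matching: List[FrozenSet[str]] = []
    for edge in sorted(weights, key=lambda e: (-weights[e], sorted(e))):
        if not edge & matched:
            matching.append(edge)
            matched |= edge
    return matching


def coarsen(nodes: Iterable[str], weights: Weights, n_clusters: int) -> Dict[str, Set[str]]:
    """Fuse matched pairs until n_clusters macro-nodes remain (or no edges are left).

    Returns a map from each macro-node to the original nodes it contains,
    i.e. the partition induced in the original graph.
    """
    groups = {n: {n} for n in nodes}
    weights = dict(weights)
    while len(groups) > n_clusters and weights:
        for edge in greedy_matching(weights):
            if len(groups) <= n_clusters:
                break
            a, b = sorted(edge)
            fused = f"{a}+{b}"
            groups[fused] = groups.pop(a) | groups.pop(b)
            # Redirect edges to the macro-node; drop the fused edge itself.
            merged: Weights = {}
            for e, w in weights.items():
                e2 = frozenset(fused if x in (a, b) else x for x in e)
                if len(e2) == 2:
                    merged[e2] = merged.get(e2, 0.0) + w
            weights = merged
    return groups
```

A disconnected graph can run out of edges before reaching `n_clusters` macro-nodes; the sketch then simply returns what it has.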

  13. Coarsening Example • [Figure: an initial graph with weighted edges is reduced to the final graph by two "Find matching" steps]

  14. Coarsening Example (II) • 1st STEP: Partition induced in the original graph • [Figure: initial graph, final graph and induced partition]

  15. Reducing Execution Time • Estimation of execution time needed → Pseudo-schedules • Information obtained • II (initiation interval) • SC (stage count) • Lifetimes • Spills
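
To illustrate how II and SC turn into an execution-time estimate, the sketch below uses the standard modulo-scheduling relation: a loop of N iterations takes about (N + SC - 1) * II cycles. It is not the pseudo-scheduler of the talk but a much cruder stand-in: it assumes an acyclic dependence graph, a single instruction latency, a bus occupied for `bus_latency` cycles per communication, and it ignores lifetimes and spills. The function name `estimate` and all its parameters are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def estimate(
    partition: Dict[str, int],
    edges: List[Tuple[str, str]],
    op_latency: int,
    fus_per_cluster: int,
    bus_latency: int,
    n_buses: int,
    n_iterations: int,
) -> Tuple[int, int, int]:
    """Return (II, SC, estimated cycles) for a partitioned acyclic graph."""
    # Resource bound: the busiest cluster limits the II.
    loads: Dict[int, int] = {}
    for cluster in partition.values():
        loads[cluster] = loads.get(cluster, 0) + 1
    res_ii = max(math.ceil(n / fus_per_cluster) for n in loads.values())

    # Bus bound: every edge crossing clusters needs one communication.
    cut = [(u, v) for u, v in edges if partition[u] != partition[v]]
    bus_ii = math.ceil(len(cut) * bus_latency / n_buses)
    ii = max(1, res_ii, bus_ii)

    # Schedule length: longest path, adding the bus latency on cut edges.
    preds: Dict[str, List[str]] = {n: [] for n in partition}
    for u, v in edges:
        preds[v].append(u)
    finish: Dict[str, int] = {}

    def finish_time(node: str) -> int:
        if node not in finish:
            start = 0
            for p in preds[node]:
                delay = bus_latency if partition[p] != partition[node] else 0
                start = max(start, finish_time(p) + delay)
            finish[node] = start + op_latency
        return finish[node]

    length = max(finish_time(n) for n in partition)
    sc = math.ceil(length / ii)
    return ii, sc, (n_iterations + sc - 1) * ii
```

Comparing the third component across candidate partitions gives the kind of ranking the partitioner needs; the graph and partition used to try it out are up to the reader, since the figure of the talk is not recoverable from the transcript.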

  16. Building pseudo-schedules • Dependences • Respected if possible • Else a penalty on register pressure and/or in execution time is assessed • Cluster assignment • Partition strictly followed

  17. Pseudo-schedule: example • 2 clusters, 1 FU/cluster, 1 bus of latency 1, II = 2 • Instruction latency = 3 • [Figure: induced partition of nodes A, B, C and D]

  18. Pseudo-schedule: example • [Figure: induced partition of nodes A, B, C and D]

  19. Heuristic description • While there is improvement, iterate: • Different partitions are obtained by moving nodes among clusters • Partitions that overload the resources of any cluster are discarded • The partition minimizing execution time is chosen • In case of a tie, the one that minimizes register pressure is selected
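
The heuristic above can be sketched as a simple local search. This is an illustration only, assuming that candidate partitions differ from the current one by a single node move and that the cost of a partition is supplied by the caller as a pair (estimated execution time, register pressure); comparing such pairs as tuples gives the tie-break on register pressure. `refine_partition`, `cost` and `overloaded` are hypothetical names.

```python
from typing import Callable, Dict, Optional, Tuple

Partition = Dict[str, int]   # node -> cluster
Cost = Tuple[int, int]       # (estimated execution time, register pressure)


def refine_partition(
    partition: Partition,
    n_clusters: int,
    cost: Callable[[Partition], Cost],
    overloaded: Callable[[Partition], bool],
) -> Tuple[Partition, Cost]:
    """Repeat single-node moves while they strictly improve the cost."""
    current, current_cost = dict(partition), cost(partition)
    while True:
        best: Optional[Partition] = None
        best_cost = current_cost
        for node, home in current.items():
            for cluster in range(n_clusters):
                if cluster == home:
                    continue
                trial = dict(current)
                trial[node] = cluster
                if overloaded(trial):
                    continue  # discard partitions that overload a cluster
                trial_cost = cost(trial)
                # Tuple order: execution time first, register pressure on ties.
                if trial_cost < best_cost:
                    best, best_cost = trial, trial_cost
        if best is None:
            return current, current_cost  # no move improves: stop
        current, current_cost = best, best_cost
```

In the talk the cost comes from a pseudo-schedule of each candidate partition; here any callable with the same shape can be plugged in.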

  21. The Scheduling Step • To schedule the partition we use URACAM [Codina et al., PACT’01] • Figure of merit • Uses dynamic transformations to improve the partial schedule • Register communications • Bus → memory • Spill code on-the-fly • Register pressure → memory • If an instruction cannot be scheduled in the cluster assigned by the partition • Try all other clusters • Select the best one according to a figure of merit

  23. Partition Refinement • II has increased • A better partition can be found for the new II • New slots have been generated in each cluster • More lifetimes are available • A larger number of bus communications is allowed • Coarsening process is repeated • Only edges between nodes in the same set can appear in the matching • After coarsening, the induced partition will be the last partition that could not be scheduled • The heuristic for reducing execution time is reapplied

  24. Benchmarks and Configurations • Resources (Unified / 2-cluster / 4-cluster) • INT/cluster: 4 / 2 / 1 • FP/cluster: 4 / 2 / 1 • MEM/cluster: 4 / 2 / 1 • Latencies (INT / FP) • MEM: 2 / 2 • ARITH: 1 / 3 • MUL/ABS: 2 / 6 • DIV/SQR/TRG: 6 / 18 • Benchmarks: all of SPECfp95, using the ref input set • Two schedulers evaluated: • GP (previous work) • Pseudo-schedule (PSP)

  25. GP vs PSP • [Chart] 32 registers split into 2 clusters, 1 bus (L=1) • [Chart] 32 registers split into 4 clusters, 1 bus (L=1)

  26. GP vs PSP • [Chart] 64 registers split into 4 clusters, 1 bus (L=2) • [Chart] 32 registers split into 4 clusters, 1 bus (L=2)

  27. Conclusions • A new algorithm to perform MS for clustered VLIW architectures • Cluster assignment based on multilevel graph partitioning • The partition algorithm is improved • Based on pseudo-schedules • Reliable information available to guide the partition • Outperforms previous work • 38.5% speedup for some configurations

  28. Any questions?

  29. GP vs PSP • [Chart] 64 registers split into 2 clusters, 1 bus (L=1) • [Chart] 64 registers split into 4 clusters, 1 bus (L=1)

  30. Different Alternatives • [Diagram: Cluster Assignment, then Scheduling, with II++] • Global vision when assigning clusters • The schedule follows the assignment exactly • Re-scheduling does not take into account that more resources are available • [Diagram: Cluster Assignment + Scheduling, with II++] • Local vision when assigning and scheduling • Assignment is based on current resource usage • No global view of the graph • [Diagram: Cluster Assignment, then Scheduling, with II++ and "?"] • Global and local views of the graph • If an operation cannot be scheduled, depending on the reason • Re-schedule • Re-compute cluster assignment

  31. Clustered Architectures • Current/future challenges in processor design • Delay in the transmission of signals • Power consumption • Architecture complexity • Solutions: • VLIW architectures • Clustering: divide the system into semi-independent units • Fast interconnects intra-cluster • Slow interconnects inter-cluster • Common trend in commercial VLIW processors • TI’s C6x • Analog’s TigerSHARC • HP’s LX • Equator’s MAP1000

  32. Example (I) • 1st STEP: Coarsening the graph • [Figure: an initial graph with weighted edges is reduced to a new graph and then to the final graph by two "Find matching" steps]

  33. Example (I) • 1st STEP: Partition induced in the original graph • [Figure: initial graph, coarsened graph and induced partition]

  34. Reducing Execution Time • Heuristic description • Different partitions are obtained by moving nodes among clusters • Partitions overloading resources in any of the clusters are discarded • The partition minimizing execution time is chosen • In case of a tie, the one that minimizes register pressure is selected • Estimation of execution time needed → Pseudo-schedules

  35. Execution time → Pseudo-schedules • Building pseudo-schedules • Dependences • Respected if possible • Else a penalty on register pressure and/or on execution time is assumed • Cluster assignment • Partition strictly followed • Valuable information can be estimated • II • Length of the pseudo-schedule • Register pressure
