
CS 7960-4 Lecture 5

CS 7960-4 Lecture 5: Overview of Steering Algorithms, based on "Dynamic Code Partitioning for Clustered Architectures", R. Canal, J-M. Parcerisa, A. Gonzalez, UPC-Barcelona, IJPP ’01. Clustered Processors. Two primary motivations: hard to design 8-way machines in future technologies




  1. CS 7960-4 Lecture 5: Overview of Steering Algorithms, based on "Dynamic Code Partitioning for Clustered Architectures", R. Canal, J-M. Parcerisa, A. Gonzalez, UPC-Barcelona, IJPP ’01

  2. Clustered Processors
     • Two primary motivations:
       • it is hard to design 8-way machines in future technologies
       • the FP cluster is idle most of the time
     • Advantages:
       • Few entries, few ports → low delays → fast clocks, simple pipelines
       • Not every instruction is penalized for wire delays
       • Potential for large windows and high ILP
       • Design and verification costs do not scale up (?)

  3. Dependences
       r1 ← r2 + r3    cl-1
       r4 ← r1 + r2    cl-1
       r5 ← r6 + r7    cl-2
       r8 ← r5 + r1    ?
     • During rename, steer dependent instructions to the same cluster
     • However, we do not know about converging chains (can have workarounds – traces/compilers)
     • If the assigned cluster is full, do we stall or go elsewhere? – not clarified in the paper
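The four-instruction example on this slide can be traced with a minimal sketch. It assumes each instruction is a `(dest, sources)` pair, that a new dependence chain is started on the next cluster in turn, and that cluster ids are 1 and 2; the function name `steer_by_dependence` and the alternation rule are illustrative assumptions, not something the paper specifies.

```python
def steer_by_dependence(instrs, clusters=(1, 2)):
    """Steer each instruction to the cluster that produces its sources.

    Returns one entry per instruction: a cluster id, or None when the
    sources come from different clusters (a converging chain).
    """
    producer = {}          # logical register -> cluster that produces it
    next_new_chain = 0     # assumption: new chains alternate between clusters
    assignment = []
    for dest, srcs in instrs:
        homes = {producer[s] for s in srcs if s in producer}
        if len(homes) == 1:
            cluster = next(iter(homes))      # follow the producer
        elif not homes:
            cluster = clusters[next_new_chain % len(clusters)]
            next_new_chain += 1              # start of a new chain
        else:
            cluster = None                   # converging chains: undecided
        if cluster is not None:
            producer[dest] = cluster
        else:
            producer.pop(dest, None)
        assignment.append(cluster)
    return assignment


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r2")),
    ("r5", ("r6", "r7")),
    ("r8", ("r5", "r1")),
]
print(steer_by_dependence(program))  # [1, 1, 2, None]
```

The final `None` corresponds to the "?" on the slide: r8 reads r5 from cl-2 and r1 from cl-1, so dependence information alone cannot pick a cluster.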

  4. Load Imbalance
     • All instructions in 1 cluster → zero communication, but zero utilization of other resources
     • Six ready instructions in cl-1 and two in cl-2 → more contention and wasted issue slots
     • The number of ready instructions in each cluster should be equal – however, instruction readiness happens long after instruction steering

  5. Load Imbalance Metrics
     • Metrics:
       • Instrs in each cluster
       • Unissued instrs that could have issued elsewhere (note latency between steer & issue)
     • The second metric does not help much
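Both metrics can be written as small functions. This is only a sketch: the function names are hypothetical, and the issue width of 4 per cluster is an assumption chosen to match the "six ready in cl-1, two in cl-2" example from the previous slide, not a number from the paper.

```python
def occupancy_imbalance(instrs_per_cluster):
    """Metric 1: difference in the number of instructions held by clusters."""
    return max(instrs_per_cluster) - min(instrs_per_cluster)


def could_have_issued_elsewhere(ready_per_cluster, issue_width):
    """Metric 2: ready instructions left unissued in one cluster while
    another cluster had idle issue slots in the same cycle."""
    excess = sum(max(r - issue_width, 0) for r in ready_per_cluster)
    idle = sum(max(issue_width - r, 0) for r in ready_per_cluster)
    return min(excess, idle)


# Six ready instructions in cl-1 and two in cl-2, assumed issue width 4.
print(occupancy_imbalance([6, 2]))             # 4
print(could_have_issued_elsewhere([6, 2], 4))  # 2
```

Note that the second metric is only known at issue time, several cycles after the steering decision it would inform, which is consistent with the slide's remark that it does not help much.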

  6. Instruction Assignment
     [Diagram: Reg-rename & Instr steer stage feeding two clusters, each with an IQ and a Regfile, plus functional units (F); 40 regs in each cluster]
     Before renaming:
       r1 ← r2 + r3
       r4 ← r1 + r2
       r5 ← r6 + r7
       r8 ← r1 + r5
     After renaming:
       p21 ← p2 + p3
       p22 ← p21 + p2
       p42 ← p21
       p41 ← p56 + p57
       p43 ← p42 + p41
     • r1 is mapped to p21 and p42 – will influence steering and instr commit – on average, only 8 replicated regs
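The renaming on this slide, including the copy that replicates r1 into the second cluster, can be sketched as follows. The class `ClusteredRenamer`, its free-list layout, and the initial mappings for r2, r3, r6, and r7 are assumptions made so that the physical register numbers match the slide; freeing registers at commit is not modelled.

```python
class ClusteredRenamer:
    """Rename table in which a logical register may be mapped in several clusters."""

    def __init__(self, free_regs, initial_map):
        # free_regs: {cluster: iterable of free physical registers}
        # initial_map: {logical reg: {cluster: physical reg}}
        self.free = {c: list(regs) for c, regs in free_regs.items()}
        self.map = {r: dict(homes) for r, homes in initial_map.items()}

    def rename(self, dest, srcs, cluster):
        """Return the renamed ops as (phys_dest, phys_sources) tuples."""
        ops, phys_srcs = [], []
        for s in srcs:
            homes = self.map[s]
            if cluster not in homes:
                # Value lives only in another cluster: emit a copy and
                # record the replica so later readers here reuse it.
                remote_phys = next(iter(homes.values()))
                replica = self.free[cluster].pop(0)
                ops.append((replica, (remote_phys,)))
                homes[cluster] = replica
            phys_srcs.append(homes[cluster])
        new_phys = self.free[cluster].pop(0)
        self.map[dest] = {cluster: new_phys}   # a write drops old replicas
        ops.append((new_phys, tuple(phys_srcs)))
        return ops


renamer = ClusteredRenamer(
    free_regs={1: range(21, 41), 2: range(41, 56)},   # assumed free lists
    initial_map={"r2": {1: 2}, "r3": {1: 3}, "r6": {2: 56}, "r7": {2: 57}},
)
trace = [
    renamer.rename("r1", ("r2", "r3"), cluster=1),
    renamer.rename("r4", ("r1", "r2"), cluster=1),
    renamer.rename("r5", ("r6", "r7"), cluster=2),
    renamer.rename("r8", ("r1", "r5"), cluster=2),
]
print(trace[3])           # [(42, (21,)), (43, (42, 41))]
print(renamer.map["r1"])  # {1: 21, 2: 42}
```

The last instruction is steered to cluster 2 here by hand; in a real design the steering heuristic makes that choice, and the table shows why the choice matters: every cross-cluster read costs a copy and a replicated register.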

  7. Assignment by the Compiler
     • ISA modification
     • Less accurate notion of load
     • Depends on good branch prediction, memory dependence prediction, cache miss prediction, contention modeling, etc.
     • Dynamic mechanisms can add pipeline stages

  8. Steering Heuristics
     • Simple Register Mapping Based Steering (Simple-RMBS): if communication cannot be avoided, pick a random cluster
     • Balanced-RMBS: if communication cannot be avoided, pick the less-loaded cluster
     • Advanced-RMBS: if significant imbalance, pick the less-loaded cluster, else use Balanced-RMBS
     • Modulo-steering: assignment alternates between clusters
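The four heuristics can be compared side by side in a short sketch that follows the slide's wording. Several details are assumptions, since the slide does not give them: `src_homes` is a list with one set per source operand naming the clusters where that operand is mapped, `load` is the instruction count per cluster, `threshold` stands in for "significant imbalance", and the choice made when communication can be avoided (lowest id for Simple, least loaded for Balanced) is illustrative only.

```python
import random


def local_clusters(src_homes, n_clusters):
    """Clusters where every source operand is already mapped (no copy needed)."""
    candidates = set(range(n_clusters))
    for homes in src_homes:
        candidates &= set(homes)
    return candidates


def least_loaded(clusters, load):
    """Least-loaded cluster among the given ones; ties go to the lowest id."""
    return min(sorted(clusters), key=lambda c: load[c])


def simple_rmbs(src_homes, load, rng=random):
    local = local_clusters(src_homes, len(load))
    if local:
        return min(local)                 # assumption: lowest id when avoidable
    return rng.randrange(len(load))       # unavoidable: random cluster


def balanced_rmbs(src_homes, load):
    local = local_clusters(src_homes, len(load))
    # Unavoidable communication: fall back to all clusters, pick least loaded.
    return least_loaded(local or range(len(load)), load)


def advanced_rmbs(src_homes, load, threshold):
    if max(load) - min(load) > threshold:  # significant imbalance
        return least_loaded(range(len(load)), load)
    return balanced_rmbs(src_homes, load)


def modulo_steering(instr_index, n_clusters):
    return instr_index % n_clusters        # alternate regardless of dependences


print(balanced_rmbs([{0}, {1}], load=[5, 3]))                # 1
print(advanced_rmbs([{0}, {0}], load=[12, 2], threshold=8))  # 1
print(advanced_rmbs([{0}, {0}], load=[5, 3], threshold=8))   # 0
```

The second call shows the trade-off Advanced-RMBS makes: both operands live in cluster 0, yet the instruction goes to cluster 1 and pays for communication because the imbalance exceeds the threshold.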

  9. Results
     • Modulo steering: too much communication
     • Balanced and Simple RMBS do well (27 and 22% better than the base) – less than 3 comms per 100 instructions (a single bus is enough) – assuming zero comm-cost isolates effect of workload imbalance (Fig. 5)
     • Advanced RMBS performs 35% better than base
     • The max possible improvement (UB model) is 44%

  10. Other Results
     • Scheduling constraints limit improvements for FP programs
     • The compiler can do better than what Fig. 10 indicates
     • Palacharla algorithm doesn’t do as well – no load considerations and few FIFOs → more communication

  11. Optimizations
     • Information on converging chains (slices)
     • First-fit and Mod-N
     • Identify critical source operands
     • Interconnect-sensitive steering
     • Stalls in dispatch

  12. Future Trends
     • Increased wire delays and more transistors →
       • each cluster is smaller
       • more clusters
       • latency across clusters is higher
     • Load imbalance and communication become worse – the best heuristic/threshold will depend on the assumed model/latency
     • Data cache access time increases

  13. Dynamic Cluster Allocation
     • At some point, using more clusters can increase communication costs and worsen performance
     • More clusters → larger windows/FUs → more ILP → more communication penalties
     • Steering heuristic should take degree of ILP into account (ISCA ’03)

  14. Other Recent Papers
     • Hierarchical interconnect designs – Aggarwal and Franklin
     • Distributed data caches – UPC
     • Power-efficiency of clustered designs – Zyuban and Kogge
     • TRIPS processor – UT-Austin (compiler mapping)

  15. Important Problems
     [Diagram labels: L2, L1D, F, E]
     • Cluster allocation to threads
     • Design of interconnects
     • Latency tolerance
     • Exploiting heterogeneity
     • Power efficiency

  16. Next Week’s Paper
     • “The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays”, UT-Austin/Compaq, ISCA ’02
     • How far will deep pipelining take you?
     • Project discussions on Mar 2nd

