CS 7960-4 Lecture 5: Overview of Steering Algorithms, based on “Dynamic Code Partitioning for Clustered Architectures”, R. Canal, J-M. Parcerisa, A. Gonzalez, UPC-Barcelona, IJPP ’01
Clustered Processors • Two primary motivations: • hard to design 8-way machines in future technologies • the FP cluster is idle most of the time • Advantages: • Few entries, few ports → low delays → fast clocks, simple pipelines • Not every instruction is penalized for wire delays • Potential for large windows and high ILP • Design and verification costs do not scale up (?)
Dependences • Example (destination ← sources, assigned cluster): r1 ← r2 + r3 (cl-1); r4 ← r1 + r2 (cl-1); r5 ← r6 + r7 (cl-2); r8 ← r5 + r1 (?) • During rename, steer dependent instructions to the same cluster • However, we do not know about converging chains (workarounds exist: traces, compilers) • If the assigned cluster is full, do we stall or go elsewhere? Not clarified in the paper
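A minimal sketch of dependence-based steering on the four-instruction example above. The helper name `candidate_clusters` and the round-robin placement of instructions with no in-flight producer are assumptions made for illustration, not details taken from the paper.

```python
from itertools import cycle

def candidate_clusters(instrs, clusters=(1, 2)):
    """Return, per instruction, the set of clusters that produce its sources.

    instrs: list of (dest, (src, ...)) logical register names.
    """
    producer = {}                 # logical reg -> cluster of its latest producer
    next_free = cycle(clusters)   # assumption: round-robin for independent instrs
    result = []
    for dest, srcs in instrs:
        cands = {producer[s] for s in srcs if s in producer}
        if not cands:
            cands = {next(next_free)}
        result.append(cands)
        # When the choice is ambiguous, arbitrarily take the lowest cluster id
        # so that later instructions still see a producer for `dest`.
        producer[dest] = min(cands)
    return result

example = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r2")),
    ("r5", ("r6", "r7")),
    ("r8", ("r5", "r1")),   # converging chains: sources live in both clusters
]
print(candidate_clusters(example))
```

The last instruction ends up with two candidate clusters, which is the "?" case on the slide: dependence information alone does not say where a converging instruction should go.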
Load Imbalance • All instructions in 1 cluster → zero communication, but zero utilization of other resources • Six ready instructions in cl-1 and two in cl-2 → more contention and wasted issue slots • The number of ready instructions in each cluster should be equal; however, an instruction becomes ready long after it is steered
Load Imbalance Metrics • Metrics: • Instrs in each cluster • Unissued instrs that could have issued elsewhere (note latency between steer & issue) • The second metric does not help much
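The two metrics listed above can be sketched as follows. This is an illustrative reading of the slide, not the paper's exact definitions; the function names and the per-cluster issue width of 4 used in the example are assumptions.

```python
def occupancy_imbalance(counts):
    """Metric 1: spread in the number of instructions held by each cluster."""
    return max(counts) - min(counts)

def stranded_ready(ready, issue_width):
    """Metric 2: ready instructions left unissued this cycle that idle
    issue slots in other clusters could have absorbed.

    ready[i] is the number of ready instructions in cluster i.
    """
    unissued = sum(max(r - issue_width, 0) for r in ready)
    idle_slots = sum(max(issue_width - r, 0) for r in ready)
    return min(unissued, idle_slots)

# Slide example: six ready instructions in cl-1 and two in cl-2,
# with an assumed issue width of 4 per cluster.
print(occupancy_imbalance([6, 2]), stranded_ready([6, 2], issue_width=4))
```

Note that the second metric is measured at issue time, while the steering decision it would feed was made several cycles earlier, which is one plausible reason it helps little.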
Instruction Assignment • [Figure: a reg-rename & instr-steer stage feeding two clusters, each with an IQ, a Regfile, and two functional units (F); 40 regs in each cluster] • Original code: r1 ← r2 + r3; r4 ← r1 + r2; r5 ← r6 + r7; r8 ← r1 + r5 • Renamed code: p21 ← p2 + p3; p22 ← p21 + p2; p42 ← p21 (copy); p41 ← p56 + p57; p43 ← p42 + p41 • r1 is mapped to p21 and p42, which will influence steering and instr commit • On average, only 8 replicated regs
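A minimal sketch of renaming with replicated register mappings, in the spirit of the example above. The class `ClusteredRenamer`, its physical register numbering, and the treatment of source values with no in-flight producer (assumed to be available locally) are hypothetical simplifications, so the physical register numbers differ from those on the slide.

```python
class ClusteredRenamer:
    def __init__(self, num_clusters=2, regs_per_cluster=40):
        n = regs_per_cluster
        # Cluster c owns physical registers c*n .. c*n + n - 1.
        self.free = [list(range(c * n, (c + 1) * n)) for c in range(num_clusters)]
        self.map = {}   # logical reg -> {cluster: physical reg}

    def _alloc(self, cluster):
        return self.free[cluster].pop(0)

    def rename(self, dest, srcs, cluster):
        """Rename one instruction steered to `cluster`; return the emitted ops."""
        ops, phys_srcs = [], []
        for s in srcs:
            copies = self.map.setdefault(s, {})
            if not copies:
                # Assumption: a value with no in-flight producer is local.
                copies[cluster] = self._alloc(cluster)
            elif cluster not in copies:
                # Value lives only in another cluster: emit a copy, keep a replica.
                remote = next(iter(copies.values()))
                copies[cluster] = self._alloc(cluster)
                ops.append(("copy", copies[cluster], remote))
            phys_srcs.append(copies[cluster])
        p = self._alloc(cluster)
        self.map[dest] = {cluster: p}   # a new definition drops older replicas
        ops.append(("add", p, *phys_srcs))
        return ops

    def replicated(self):
        """Logical registers currently mapped in more than one cluster."""
        return [r for r, m in self.map.items() if len(m) > 1]

rn = ClusteredRenamer()
rn.rename("r1", ("r2", "r3"), cluster=0)
rn.rename("r4", ("r1", "r2"), cluster=0)
rn.rename("r5", ("r6", "r7"), cluster=1)
last_ops = rn.rename("r8", ("r1", "r5"), cluster=1)
print(last_ops, rn.replicated())
```

As on the slide, only r1 ends up with a mapping in both clusters, and the copy is generated at rename time when the consumer is steered away from the producer.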
Assignment by the Compiler • ISA modification • Less accurate notion of load • Depends on good branch prediction, memory dependence prediction, cache miss prediction, contention modeling, etc. • Dynamic mechanisms can add pipeline stages
Steering Heuristics • Simple Register Mapping Based Steering (Simple-RMBS): if communication cannot be avoided, pick a random cluster • Balanced-RMBS: if communication cannot be avoided, pick the less-loaded cluster • Advanced-RMBS: if significant imbalance, pick the less-loaded cluster, else use Balanced-RMBS • Modulo-steering: assignment alternates between clusters
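The heuristics above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the function names, the use of in-flight instruction counts as "load", the `threshold` value of 8, and the tie-breaking when several clusters need no communication are all hypothetical.

```python
import random

def least_loaded(clusters, load):
    return min(clusters, key=lambda c: load[c])

def rmbs(src_clusters, load, policy="balanced", threshold=8, rng=random):
    """Pick a cluster for one instruction.

    src_clusters: one set per source operand, the clusters where it is mapped.
    load: in-flight instruction count per cluster (assumed load measure).
    policy: "simple", "balanced", or "advanced".
    threshold: hypothetical imbalance limit used by the advanced policy.
    """
    all_clusters = list(range(len(load)))
    if policy == "advanced" and max(load) - min(load) > threshold:
        return least_loaded(all_clusters, load)   # rebalance first
    # Clusters where every source is already mapped need no communication.
    local = [c for c in all_clusters if all(c in s for s in src_clusters)]
    if len(local) == 1:
        return local[0]
    # Either communication cannot be avoided (no local cluster) or several
    # clusters are equally good; the policy decides.
    candidates = local or all_clusters
    if policy == "simple":
        return rng.choice(candidates)
    return least_loaded(candidates, load)

def modulo_steer(index, num_clusters=2):
    """Modulo steering: consecutive instructions alternate between clusters."""
    return index % num_clusters

print(rmbs([{0}, {1}], load=[5, 1], policy="balanced"))
print(rmbs([{0}, {0}], load=[20, 1], policy="advanced"))
```

In the second call both sources live in cluster 0, yet the advanced policy sends the instruction to cluster 1 and accepts the communication because the imbalance exceeds the threshold.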
Results • Modulo steering: too much communication • Balanced and Simple RMBS do well (27% and 22% better than the base, respectively), with fewer than 3 communications per 100 instructions (a single bus is enough); assuming zero communication cost isolates the effect of workload imbalance (Fig. 5) • Advanced RMBS performs 35% better than the base • The maximum possible improvement (UB model) is 44%
Other Results • Scheduling constraints limit improvements for FP programs • The compiler can do better than what Fig. 10 indicates • The Palacharla algorithm doesn’t do as well: no load considerations and few FIFOs → more communication
Optimizations • Information on converging chains (slices) • First-fit and Mod-N • Identify critical source operands • Interconnect-sensitive steering • Stalls in dispatch
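The slide only names First-fit and Mod-N, so the sketch below follows how these heuristics are usually described in the clustered-processor literature: Mod-N sends N consecutive instructions to one cluster before moving on, and First-fit keeps filling one cluster until its window is full. That reading, the function names, and the default parameters are assumptions.

```python
def mod_n(index, n=3, num_clusters=4):
    """Mod-N: N consecutive instructions go to one cluster, then the next."""
    return (index // n) % num_clusters

def first_fit(occupancy, capacity, current):
    """First-fit: stay on `current` until its window is full, then advance.

    Returns the chosen cluster, or None if every window is full (stall).
    """
    k = len(occupancy)
    for step in range(k):
        c = (current + step) % k
        if occupancy[c] < capacity:
            return c
    return None

print([mod_n(i, n=3, num_clusters=2) for i in range(7)])
print(first_fit([4, 1], capacity=4, current=0))
```

Both heuristics keep neighbouring instructions together, which tends to keep dependent instructions in the same cluster without consulting the register map.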
Future Trends • Increased wire delays and more transistors: • each cluster is smaller • more clusters • latency across clusters is higher • Load imbalance and communication become worse; the best heuristic/threshold will depend on the assumed model/latency • Data cache access time increases
Dynamic Cluster Allocation • At some point, using more clusters can increase communication costs and worsen performance • More clusters → larger windows/FUs → more ILP, but also more communication penalties • Steering heuristic should take degree of ILP into account (ISCA ’03)
Other Recent Papers • Hierarchical interconnect designs: Aggarwal and Franklin • Distributed data caches: UPC • Power-efficiency of clustered designs: Zyuban and Kogge • TRIPS processor: UT-Austin (compiler mapping)
Important Problems • Cluster allocation to threads • Design of interconnects • Latency tolerance • Exploiting heterogeneity • Power efficiency • [Figure: four F/E block pairs, four L1D banks, and two L2 banks]
Next Week’s Paper • “The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays”, UT-Austin/Compaq, ISCA’02 • How far will deep pipelining take you? • Project discussions on Mar 2nd