CS 7960-4 Lecture 5

Overview of Steering Algorithms
Based on "Dynamic Code Partitioning for Clustered Architectures", R. Canal, J-M. Parcerisa, A. Gonzalez, UPC-Barcelona, IJPP ’01

Clustered Processors
• Two primary motivations:
  • hard to design 8-way machines in future technologies
  • the FP cluster is idle most of the time
• Advantages:
  • Few entries, few ports → low delays → fast clocks, simple pipelines
  • Not every instruction is penalized for wire delays
  • Potential for large windows and high ILP
  • Design and verification costs do not scale up (?)

Dependences
r1 ← r2 + r3    cl-1
r4 ← r1 + r2    cl-1
r5 ← r6 + r7    cl-2
r8 ← r5 + r1    ?
• During rename, steer dependent instructions to the same cluster
• However, we do not know about converging chains (can have workarounds: traces/compilers)
• If the assigned cluster is full, do we stall or go elsewhere? Not clarified in the paper
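The dependence-based steering rule above can be sketched in a few lines. This is only an illustration, not the paper's mechanism: the function name `steer_by_dependence`, the round-robin placement of instructions with no known producer, and the "follow the first source operand" tie-break for converging chains are all assumptions made for the example.

```python
def steer_by_dependence(instrs, num_clusters=2):
    """Steer each (dest, sources) instruction to the cluster of its producers.

    Returns a list of (dest, cluster, converging) tuples; clusters are numbered
    from 1 to match the cl-1/cl-2 labels on the slide.
    """
    home = {}        # register -> cluster that produces its value
    next_fresh = 0   # round-robin pointer for instructions with no known producer
    placement = []
    for dest, srcs in instrs:
        producers = [home[s] for s in srcs if s in home]
        converging = len(set(producers)) > 1
        if producers:
            # Assumed tie-break: follow the first source with a known producer.
            cluster = producers[0]
        else:
            # Assumed policy: start a new chain in the next cluster, round-robin.
            cluster = next_fresh % num_clusters + 1
            next_fresh += 1
        home[dest] = cluster
        placement.append((dest, cluster, converging))
    return placement


program = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r1", "r2"]),
    ("r5", ["r6", "r7"]),
    ("r8", ["r5", "r1"]),
]
for dest, cluster, converging in steer_by_dependence(program):
    print(dest, f"cl-{cluster}", "(converging chain)" if converging else "")
```

The last instruction is flagged as converging because its two sources live in different clusters, which is the "?" case on the slide: whichever cluster is chosen, one operand has to be communicated.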

Load Imbalance
• All instructions in 1 cluster → zero communication, but zero utilization of other resources
• Six ready instructions in cl-1 and two in cl-2 → more contention and wasted issue slots
• The number of ready instructions in each cluster should be equal; however, instruction readiness happens long after instruction steering

Load Imbalance Metrics
• Metrics:
  • Instrs in each cluster
  • Unissued instrs that could have issued elsewhere (note latency between steer & issue)
• The second metric does not help much
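As a rough illustration of the two metrics, the sketch below computes them for a single cycle. The function names and the per-cluster issue width of 4 are assumptions chosen for the example (an 8-way machine split into two equal clusters); the paper's exact definitions may differ.

```python
def occupancy_imbalance(instrs_per_cluster):
    """Metric 1: spread in the number of instructions held by each cluster."""
    return max(instrs_per_cluster) - min(instrs_per_cluster)


def issuable_elsewhere(ready_per_cluster, issue_width):
    """Metric 2: ready instructions left unissued this cycle that would have
    found a free issue slot in some other cluster."""
    excess = sum(max(0, r - issue_width) for r in ready_per_cluster)
    free_slots = sum(max(0, issue_width - r) for r in ready_per_cluster)
    return min(excess, free_slots)


# Six ready instructions in cl-1 and two in cl-2, assuming 4 issue slots each.
print(occupancy_imbalance([6, 2]))    # difference in instruction counts
print(issuable_elsewhere([6, 2], 4))  # ready instructions stuck behind a full cluster
```

Note that the second metric is measured at issue time, while steering happens several cycles earlier, which is one reason it is a weak guide for the steering decision.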

Instruction Assignment
[Figure: a register-rename and instruction-steering stage feeding two clusters, each with an issue queue (IQ), a register file, and functional units (F); 40 regs in each cluster]
Original code      Renamed code
r1 ← r2 + r3       p21 ← p2 + p3
r4 ← r1 + r2       p22 ← p21 + p2
                   p42 ← p21
r5 ← r6 + r7       p41 ← p56 + p57
r8 ← r1 + r5       p43 ← p42 + p41
• r1 is mapped to p21 and p42; this will influence steering and instr commit
• On average, only 8 replicated regs
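The renaming on this slide can be reproduced with a small sketch. It assumes a rename map that keeps one physical register per cluster for each logical register and inserts a copy when a source is missing from the consuming cluster. The function `rename`, the initial mappings, and the free-list starting points (p21 in cluster 1, p41 in cluster 2) are hypothetical choices made so the output lines up with the slide; the cluster assignments are taken as given.

```python
from itertools import count


def rename(instrs, initial_map, free_start):
    """Rename (dest, sources, cluster) instructions, replicating registers on demand.

    maps[reg][cluster] holds the physical register for `reg` in that cluster.
    """
    maps = {reg: dict(per_cluster) for reg, per_cluster in initial_map.items()}
    free = {cluster: count(start) for cluster, start in free_start.items()}
    renamed = []
    for dest, srcs, cluster in instrs:
        phys_srcs = []
        for src in srcs:
            copies = maps[src]
            if cluster not in copies:
                # Source lives only in another cluster: allocate a local
                # register and emit a copy instruction to fill it.
                remote = next(iter(copies.values()))
                local = f"p{next(free[cluster])}"
                renamed.append(f"{local} <- {remote}")
                copies[cluster] = local
            phys_srcs.append(copies[cluster])
        new_reg = f"p{next(free[cluster])}"
        maps[dest] = {cluster: new_reg}  # a new definition drops old replicas
        renamed.append(f"{new_reg} <- {' + '.join(phys_srcs)}")
    return renamed, maps


initial = {"r2": {1: "p2"}, "r3": {1: "p3"}, "r6": {2: "p56"}, "r7": {2: "p57"}}
program = [
    ("r1", ["r2", "r3"], 1),
    ("r4", ["r1", "r2"], 1),
    ("r5", ["r6", "r7"], 2),
    ("r8", ["r1", "r5"], 2),
]
renamed, maps = rename(program, initial, {1: 21, 2: 41})
print("\n".join(renamed))
print(maps["r1"])  # r1 ends up with a physical register in both clusters
```

The sketch emits the copy in program order (just before the consumer), whereas the slide lists it earlier; the register numbers are the same either way.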

Assignment by the Compiler
• ISA modification
• Less accurate notion of load
• Depends on good branch prediction, memory dependence prediction, cache miss prediction, contention modeling, etc.
• Dynamic mechanisms can add pipeline stages

Steering Heuristics
• Simple Register Mapping Based Steering (Simple-RMBS): if communication cannot be avoided, pick a random cluster
• Balanced-RMBS: if communication cannot be avoided, pick the less-loaded cluster
• Advanced-RMBS: if significant imbalance, pick the less-loaded cluster, else use Balanced-RMBS
• Modulo-steering: assignment alternates between clusters
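The four heuristics can be compared side by side in a short sketch. It follows the slide's one-line descriptions only. The helper `_candidates`, the use of per-cluster instruction counts as the load measure, the handling of instructions with no mapped sources, and the `threshold` value of 8 are assumptions for illustration, not parameters from the paper.

```python
import random


def _candidates(src_clusters, num_clusters):
    """Clusters worth considering: the one holding all sources if there is one,
    otherwise every source cluster (or any cluster if no source is mapped)."""
    distinct = sorted(set(src_clusters))
    return distinct if distinct else list(range(num_clusters))


def simple_rmbs(src_clusters, load, rng=random):
    # A single candidate means communication is avoidable; otherwise pick at random.
    return rng.choice(_candidates(src_clusters, len(load)))


def balanced_rmbs(src_clusters, load):
    # Same candidates, but break the choice by picking the less-loaded cluster.
    return min(_candidates(src_clusters, len(load)), key=lambda c: load[c])


def advanced_rmbs(src_clusters, load, threshold=8):
    # Under significant imbalance, ignore the register mapping altogether.
    if max(load) - min(load) > threshold:
        return min(range(len(load)), key=lambda c: load[c])
    return balanced_rmbs(src_clusters, load)


def modulo_steering(instr_index, num_clusters):
    # Alternate between clusters regardless of dependences or load.
    return instr_index % num_clusters


load = [20, 2]  # instructions currently in cluster 0 and cluster 1
print(balanced_rmbs([0, 0], load))   # sources both in cluster 0: stay there
print(advanced_rmbs([0, 0], load))   # imbalance exceeds threshold: go to cluster 1
print([modulo_steering(i, 2) for i in range(4)])
```

The example shows where Balanced-RMBS and Advanced-RMBS part ways: with both sources in an overloaded cluster, the former still avoids communication while the latter accepts a copy to rebalance.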

Results
• Modulo steering: too much communication
• Balanced and Simple RMBS do well (27 and 22% better than the base)
  • fewer than 3 communications per 100 instructions (a single bus is enough)
  • assuming zero communication cost isolates the effect of workload imbalance (Fig. 5)
• Advanced RMBS performs 35% better than base
• The max possible improvement (UB model) is 44%

Other Results
• Scheduling constraints limit improvements for FP programs
• The compiler can do better than what Fig. 10 indicates
• Palacharla algorithm doesn’t do as well: no load considerations and few FIFOs → more communication

Optimizations
• Information on converging chains (slices)
• First-fit and Mod-N
• Identify critical source operands
• Interconnect-sensitive steering
• Stalls in dispatch

Future Trends
• Increased wire delays and more transistors →
  • each cluster is smaller
  • more clusters
  • latency across clusters is higher
• Load imbalance and communication become worse; the best heuristic/threshold will depend on the assumed model/latency
• Data cache access time increases

Dynamic Cluster Allocation
• At some point, using more clusters can increase communication costs and worsen performance
• More clusters → larger windows/FUs → more ILP → more communication penalties
• Steering heuristic should take degree of ILP into account (ISCA ’03)

Other Recent Papers
• Hierarchical interconnect designs: Aggarwal and Franklin
• Distributed data caches: UPC
• Power-efficiency of clustered designs: Zyuban and Kogge
• TRIPS processor: UT-Austin (compiler mapping)

Important Problems
[Figure: floorplan with blocks labeled F, E, L1D, and L2]
• Cluster allocation to threads
• Design of interconnects
• Latency tolerance
• Exploiting heterogeneity
• Power efficiency

Next Week’s Paper
• “The Optimal Logic Depth per Pipeline Stage is 6 to 8 FO4 Inverter Delays”, UT-Austin/Compaq, ISCA ’02
• How far will deep pipelining take you?
• Project discussions on Mar 2nd
