This lecture discusses balanced scheduling, a method that spreads out instructions to cover load latency and improve performance by exploiting load-level parallelism. The lecture covers the algorithm, weight calculation, and limitations of balanced scheduling.
Advanced Compilers (CMPSCI 710, Spring 2003): Balanced Scheduling. Emery Berger, University of Massachusetts, Amherst
Topics • Last time • Instruction scheduling • Gibbons & Muchnick • This time • Balanced scheduling • Kerns & Eggers
List Scheduling, Redux • Build dependence dag • Choose instructions from ready list • Schedule using heuristics[Gibbons & Muchnick] • Instruction with greatest latency • Instruction with most successors • Instruction on critical path
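As a concrete illustration, here is a minimal list-scheduling sketch in Python (the `Node` class, latencies, and the small example dag are my own, not from the lecture); it applies the three heuristics above as a single tie-breaking key:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    latency: int = 1                      # assumed issue-to-result latency
    succs: list = field(default_factory=list)
    preds: list = field(default_factory=list)

def edge(a, b):
    a.succs.append(b); b.preds.append(a)

def cp(n, memo):
    # length of the longest (critical) path starting at n
    if n.name not in memo:
        memo[n.name] = n.latency + max((cp(s, memo) for s in n.succs), default=0)
    return memo[n.name]

def list_schedule(nodes):
    memo, done, order = {}, set(), []
    while len(order) < len(nodes):
        ready = [n for n in nodes if n.name not in done
                 and all(p.name in done for p in n.preds)]
        # heuristics: greatest latency, most successors, critical path
        best = max(ready, key=lambda n: (n.latency, len(n.succs), cp(n, memo)))
        order.append(best.name); done.add(best.name)
    return order
```

For a dag where a 2-cycle load L0 feeds X1 while X2 and X3 are independent, the latency heuristic makes the scheduler issue L0 first.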
Fly in the Ointment • When scheduling loads, assume hit in primary cache • On older architectures, this makes sense: • Stall execution on cache miss • But newer architectures are nonblocking: • Processor executes other instructions while load in progress • Good – creates more ILP – but…
Scheduling Options • Now what? • Assume cache miss takes N cycles • N typically 10 or more • Do we schedule load: • Anticipating 1 cycle delay (a hit)? • optimistic • Or N cycle delay (a miss)? • pessimistic
Optimistic vs. Pessimistic • Optimistic: fine for hits, inferior for misses • Pessimistic: fine for hits, better for misses • Optimistic schedule: L0, X2, X1, X3, X4 • Pessimistic schedule: L0, X2, X3, X1, X4
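To make the hit/miss trade-off concrete, here is a toy single-issue stall model (my own illustration, not from the lecture): instructions issue one per cycle, a load's result is ready `latency` cycles after issue, and a consumer stalls until its load completes.

```python
def stalls(schedule, uses, latency):
    """Count stall cycles for a single-issue pipeline: `uses` maps a
    consumer to the load it reads; loads (names starting with 'L')
    take `latency` cycles, everything else takes 1."""
    ready, cycle, total = {}, 0, 0
    for instr in schedule:
        # stall until the load this instruction consumes is complete
        if instr in uses and ready[uses[instr]] > cycle:
            total += ready[uses[instr]] - cycle
            cycle = ready[uses[instr]]
        ready[instr] = cycle + (latency if instr.startswith("L") else 1)
        cycle += 1
    return total
```

With the single-load schedules above and X1 consuming L0, both orders run stall-free when the load hits (latency 1), but on a 3-cycle miss the optimistic order stalls while the pessimistic order does not.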
Optimistic vs. Pessimistic, Multiple Loads • Optimistic: better for hits, same for misses • Pessimistic: worse for hits, same for misses • Optimistic schedule: L1, X1, L2, X2, X3 • Pessimistic schedule: L1, X1, X2, L2, X3
Balanced Scheduling • Key insights: • No fixed estimate of memory latency is best • Schedule based on the available parallelism in the code • Load-level parallelism • Balanced scheduling: • Computes each load's weight separately • Takes the other schedulable instructions into account • Spaces out loads, using available instructions as “filler”
Balanced Scheduling, Example • Maximizes the distance between L0 & X1 • Good in case of a miss • Balanced schedule: L0, X2, X3, X1, X4
Balanced Scheduling, Example • W: load instruction weight • W=5: over-estimate, greedy schedule • W=1: under-estimate, lazy schedule • Balanced scheduler: W=3 (= load-level parallelism)
Balanced Scheduling, Results • Always achieves the fewest interlocks
Algorithm Idea • Examine each instruction i in dag • Determine which loads can run in parallel with i • Use all (or part) of i’s execution time to cover latency of loads
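A much-simplified sketch of that idea (my own simplification of the Kerns & Eggers computation, with hypothetical names): every non-load instruction i splits one unit of “cover” evenly among the k loads it could run in parallel with, i.e. the loads with no dependence path to or from i.

```python
def reachable(a, succs):
    # all nodes reachable from a along dependence edges
    seen, stack = set(), [a]
    while stack:
        for s in succs.get(stack.pop(), []):
            if s not in seen:
                seen.add(s); stack.append(s)
    return seen

def balanced_weights(nodes, succs, loads):
    weights = {l: 1.0 for l in loads}        # base cost of 1 cycle per load
    for i in nodes:
        if i in loads:
            continue
        below = reachable(i, succs)          # instructions that depend on i
        above = {n for n in nodes if i in reachable(n, succs)}
        parallel = [l for l in loads if l not in below and l not in above]
        for l in parallel:                   # i covers each such load equally
            weights[l] += 1.0 / len(parallel)
    return weights
```

For a dag where L0 feeds X1 while X2 and X3 are independent, X2 and X3 each contribute 1 to L0's weight and X1 (which depends on L0) contributes nothing, so L0's weight comes out to 3.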
Balanced Scheduling, Weight Calculation • Time complexity?
Balanced Scheduling, Example • Locate the longest load paths in the connected components • Add 1/(# of loads) to each load's weight
Balanced Scheduling, Example II • Consider instruction X1 • Locate the longest load paths in the connected components • Add 1/(# of loads) to each load's weight • These are the “contributions of X1”
Balanced Scheduling Algorithm • After computing weights, perform list scheduling where: • Priority = weight plus max priority of successors • Break ties: • Largest delta between consumed & defined registers • Rank based on successors in dag that would be exposed • Select instruction generated earliest • Bottom-up scheduler: • Reverse-order, schedule from leaves toward roots
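The priority rule on this slide (an instruction's weight plus the maximum priority of any successor) can be sketched as follows (hypothetical helper; I assume non-load instructions have weight 1):

```python
def priorities(nodes, succs, weight):
    memo = {}
    def pri(n):
        # priority = own weight + max priority among successors
        if n not in memo:
            memo[n] = weight.get(n, 1) + max(
                (pri(s) for s in succs.get(n, [])), default=0)
        return memo[n]
    return {n: pri(n) for n in nodes}
```

With weight 3 on L0 and the edge L0 → X1, L0's priority is 4 while the independent instructions get 1, so list scheduling issues L0 first and can fill the latency gap with X2 and X3.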
Balanced Scheduling, Example I • Balanced schedule: L0, X2, X3, X1, X4
Limitations • Performed after register allocation • But: introduces false dependences • Reuse of registers ⇒ dag has extra edges • Can be fixed with software register renaming • Had to modify gcc's RTL • Approach required manual pipelining • Profile-based feedback… • Benchmarks based on FORTRAN converted to C with f2c • Can't disambiguate memory • Adds many edges to dag
“Workaround”: Simulate Fortran • Modify code to avoid aliases • Improves results, but incorrect! • Needs advanced alias analysis
Empirical Results • Evaluated using simulation • 3% to 18% improvement over regular scheduler across different models • Mean: 9.9% • Unfortunately: • No results presented without above-mentioned modifications…
Conclusion • Balanced scheduling • Spreads out instructions to cover load latency • Based on exploitable load-level parallelism • Effective at improving performance • Modulo methodological limitations… • Not so great for C/C++, possibly useful for Java • Next time: interprocedural analysis • ACDI: Ch. 19, pp. 607-636, 641-656