This lecture discusses balanced scheduling, a method that spreads out instructions to cover load latency and improve performance by exploiting load-level parallelism. The lecture covers the algorithm, weight calculation, and limitations of balanced scheduling.
Advanced Compilers (CMPSCI 710, Spring 2003): Balanced Scheduling. Emery Berger, University of Massachusetts, Amherst
Topics • Last time • Instruction scheduling • Gibbons & Muchnick • This time • Balanced scheduling • Kerns & Eggers
List Scheduling, Redux • Build dependence dag • Choose instructions from ready list • Schedule using heuristics[Gibbons & Muchnick] • Instruction with greatest latency • Instruction with most successors • Instruction on critical path
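As a concrete illustration, here is a minimal list-scheduling sketch in Python (the `Node` class, latencies, and the small example dag are my own, not from the lecture); it applies the three heuristics above as a single tie-breaking key:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    latency: int = 1                      # assumed issue-to-result latency
    succs: list = field(default_factory=list)
    preds: list = field(default_factory=list)

def edge(a, b):
    a.succs.append(b); b.preds.append(a)

def cp(n, memo):
    # length of the longest (critical) path starting at n
    if n.name not in memo:
        memo[n.name] = n.latency + max((cp(s, memo) for s in n.succs), default=0)
    return memo[n.name]

def list_schedule(nodes):
    memo, done, order = {}, set(), []
    while len(order) < len(nodes):
        ready = [n for n in nodes if n.name not in done
                 and all(p.name in done for p in n.preds)]
        # heuristics: greatest latency, most successors, critical path
        best = max(ready, key=lambda n: (n.latency, len(n.succs), cp(n, memo)))
        order.append(best.name); done.add(best.name)
    return order
```

For a dag where a 2-cycle load L0 feeds X1 while X2 and X3 are independent, the latency heuristic makes the scheduler issue L0 first.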
Fly in the Ointment • When scheduling loads, assume hit in primary cache • On older architectures, this makes sense: • Stall execution on cache miss • But newer architectures are nonblocking: • Processor executes other instructions while load in progress • Good – creates more ILP – but…
Scheduling Options • Now what? • Assume cache miss takes N cycles • N typically 10 or more • Do we schedule load: • Anticipating 1 cycle delay (a hit)? • optimistic • Or N cycle delay (a miss)? • pessimistic
Optimistic vs. Pessimistic • Optimistic: fine for hits, inferior for misses • Pessimistic: fine for hits, better for misses • Optimistic schedule: L0, X2, X1, X3, X4 • Pessimistic schedule: L0, X2, X3, X1, X4
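To make the hit/miss trade-off concrete, here is a toy single-issue stall model (my own illustration, not from the lecture): instructions issue one per cycle, a load's result is ready `latency` cycles after issue, and a consumer stalls until its load completes.

```python
def stalls(schedule, uses, latency):
    """Count stall cycles for a single-issue pipeline: `uses` maps a
    consumer to the load it reads; loads (names starting with 'L')
    take `latency` cycles, everything else takes 1."""
    ready, cycle, total = {}, 0, 0
    for instr in schedule:
        # stall until the load this instruction consumes is complete
        if instr in uses and ready[uses[instr]] > cycle:
            total += ready[uses[instr]] - cycle
            cycle = ready[uses[instr]]
        ready[instr] = cycle + (latency if instr.startswith("L") else 1)
        cycle += 1
    return total
```

With the single-load schedules above and X1 consuming L0, both orders run stall-free when the load hits (latency 1), but on a 3-cycle miss the optimistic order stalls while the pessimistic order does not.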
Optimistic vs. Pessimistic, Multiple Loads • Optimistic: better for hits, same for misses • Pessimistic: worse for hits, same for misses • Optimistic schedule: L1, X1, L2, X2, X3 • Pessimistic schedule: L1, X1, X2, L2, X3
Balanced Scheduling • Key insights: • No fixed estimate of memory latency is best • Schedule based on the available parallelism in the code • Load-level parallelism • Balanced scheduling: • Computes each load's weight separately • Takes the other schedulable instructions into account • Spaces out loads, using available instructions as “filler”
Balanced Scheduling, Example • Maximizes the distance between L0 & X1 • Good in case of a miss • Balanced schedule: L0, X2, X3, X1, X4
Balanced Scheduling, Example • W: load instruction weight • W=5: over-estimate, greedy schedule • W=1: under-estimate, lazy schedule • Balanced scheduler: W=3 (= load-level parallelism)
Balanced Scheduling, Results • Always achieves the fewest interlocks
Algorithm Idea • Examine each instruction i in dag • Determine which loads can run in parallel with i • Use all (or part) of i’s execution time to cover latency of loads
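A much-simplified sketch of that idea (my own simplification of the Kerns & Eggers computation, with hypothetical names): every non-load instruction i splits one unit of “cover” evenly among the k loads it could run in parallel with, i.e. the loads with no dependence path to or from i.

```python
def reachable(a, succs):
    # all nodes reachable from a along dependence edges
    seen, stack = set(), [a]
    while stack:
        for s in succs.get(stack.pop(), []):
            if s not in seen:
                seen.add(s); stack.append(s)
    return seen

def balanced_weights(nodes, succs, loads):
    weights = {l: 1.0 for l in loads}        # base cost of 1 cycle per load
    for i in nodes:
        if i in loads:
            continue
        below = reachable(i, succs)          # instructions that depend on i
        above = {n for n in nodes if i in reachable(n, succs)}
        parallel = [l for l in loads if l not in below and l not in above]
        for l in parallel:                   # i covers each such load equally
            weights[l] += 1.0 / len(parallel)
    return weights
```

For a dag where L0 feeds X1 while X2 and X3 are independent, X2 and X3 each contribute 1 to L0's weight and X1 (which depends on L0) contributes nothing, so L0's weight comes out to 3.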
Balanced Scheduling, Weight Calculation • Time complexity?
Balanced Scheduling, Example • Locate the longest load paths in the connected components • Add 1/(# of loads) to each load's weight
Balanced Scheduling, Example II • Consider instruction X1 • Locate the longest load paths in the connected components • Add 1/(# of loads) to each load's weight • These are the “contributions of X1”
Balanced Scheduling Algorithm • After computing weights, perform list scheduling where: • Priority = weight plus max priority of successors • Break ties: • Largest delta between consumed & defined registers • Rank based on successors in dag that would be exposed • Select instruction generated earliest • Bottom-up scheduler: • Reverse-order, schedule from leaves toward roots
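The priority rule on this slide (an instruction's weight plus the maximum priority of any successor) can be sketched as follows (hypothetical helper; I assume non-load instructions have weight 1):

```python
def priorities(nodes, succs, weight):
    memo = {}
    def pri(n):
        # priority = own weight + max priority among successors
        if n not in memo:
            memo[n] = weight.get(n, 1) + max(
                (pri(s) for s in succs.get(n, [])), default=0)
        return memo[n]
    return {n: pri(n) for n in nodes}
```

With weight 3 on L0 and the edge L0 → X1, L0's priority is 4 while the independent instructions get 1, so list scheduling issues L0 first and can fill the latency gap with X2 and X3.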
Balanced Scheduling, Example I • Balanced schedule: L0, X2, X3, X1, X4
Limitations • Performed after register allocation • But: introduces false dependences • Reuse of registers ⇒ dag has extra edges • Can be fixed with software register renaming • Had to modify gcc's RTL • Approach required manual pipelining • Profile-based feedback… • Benchmarks based on FORTRAN converted to C with f2c • Can't disambiguate memory • Adds many edges to dag
“Workaround”: Simulate Fortran • Modify code to avoid aliases • Improves results, but incorrect! • Needs advanced alias analysis
Empirical Results • Evaluated using simulation • 3% to 18% improvement over regular scheduler across different models • Mean: 9.9% • Unfortunately: • No results presented without above-mentioned modifications…
Conclusion • Balanced scheduling • Spreads out instructions to cover load latency • Based on exploitable load-level parallelism • Effective at improving performance • Modulo methodological limitations… • Not so great for C/C++, possibly useful for Java • Next time: interprocedural analysis • ACDI: Ch. 19, pp. 607-636, 641-656