Compiler Estimation of Load Imbalance Overhead in Speculative Parallelization University of Edinburgh http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA Jialin Dou and Marcelo Cintra
Contributions • Integrated compiler cost model to estimate speculative parallelization overheads • Evaluation of the model with load imbalance and thread dispatch & commit overheads • accuracy: correctly predicts 82% of speedup/slowdown cases • performance: 5% improvement on average, and up to 38% better than a naive approach that speculates on all loops considered PACT 04
Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions PACT 04
Why a Compiler Cost Model? • Speculative parallelization can deliver significant speedup or slowdown • several speculation overheads • some code segments could slow down the program • we need a smart compiler that chooses which program regions to run speculatively based on the expected outcome • A quantitative prediction of the speedup can also be useful • e.g., in a multi-tasking environment • program A wants to run speculatively in parallel (predicted speedup 1.2) • other programs are waiting to be scheduled • the OS decides speculation does not pay off PACT 04
Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions PACT 04
Speculative Parallelization • Threads are speculatively executed in parallel, assuming no dependences • At run time, the system tracks memory references • If a cross-thread data-dependence (RAW) violation is detected, the offending threads are squashed and restarted for(i=0; i<100; i++) { … = A[L[i]]+… A[K[i]] = … } Iteration J: … = A[4]+… A[5] = … Iteration J+1: … = A[2]+… A[2] = … Iteration J+2: … = A[5]+… A[6] = … (iteration J+2 reads A[5], which iteration J writes: a RAW violation if the read executes too early) PACT 04
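The violation detection described above can be illustrated with a small conservative sketch (hypothetical: the `find_squashes` helper and the set-of-addresses representation are assumptions for illustration, not the actual hardware protocol):

```python
def find_squashes(threads):
    """Conservative cross-thread RAW check. `threads` is a list of
    (reads, writes) address sets in original program order. A later
    thread that reads an address an earlier thread writes may have
    read a stale value, so it must be squashed and restarted."""
    squashed = set()
    for i, (_, writes_i) in enumerate(threads):
        for j in range(i + 1, len(threads)):
            reads_j, _ = threads[j]
            if writes_i & reads_j:
                squashed.add(j)
    return sorted(squashed)

# The slide's example: each iteration reads A[L[i]] and writes A[K[i]]
threads = [
    ({4}, {5}),  # iteration J:   reads A[4], writes A[5]
    ({2}, {2}),  # iteration J+1: reads A[2], writes A[2]
    ({5}, {6}),  # iteration J+2: reads A[5], writes A[6]
]
print(find_squashes(threads))  # iteration J+2 read A[5], which J writes
```

Note that iteration J+1 reads and writes A[2] within the same thread, so it causes no cross-thread violation; real hardware would also only squash when the read actually executes before the predecessor's write.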
Speculative Parallelization Overheads • Squash & restart: re-executing squashed threads • Speculative buffer overflow: when the speculative buffer is full, the thread stalls until it becomes non-speculative • Inter-thread communication: waiting for a value from a predecessor thread • Dispatch & commit: writing back speculative data to memory • Load imbalance: a processor waits for its thread to become non-speculative before committing PACT 04
Load Imbalance Overhead • A problem specific to speculative parallelization • Due to the in-order-commit requirement, a processor cannot start a new thread before its current thread commits • It remains idle waiting for the predecessor to commit • Can account for a large fraction of execution time PACT 04
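The idle time caused by the in-order-commit rule can be illustrated with a small scheduling sketch (hypothetical: the round-robin assignment and the `schedule` helper are assumptions for illustration, not the Hydra mechanism):

```python
def schedule(durations, P):
    """Assign threads round-robin to P processors and enforce the
    in-order-commit rule: thread i commits only after thread i-1,
    and a processor starts its next thread only after the current
    one commits. Returns (total time, total idle waiting time)."""
    pe_free = [0.0] * P       # time at which each PE becomes available
    prev_commit = 0.0         # commit time of the previous thread
    idle = 0.0
    for i, d in enumerate(durations):
        pe = i % P
        finish = pe_free[pe] + d
        commit = max(finish, prev_commit)  # wait to become non-speculative
        idle += commit - finish            # load-imbalance wait on this PE
        pe_free[pe] = commit
        prev_commit = commit
    return prev_commit, idle
```

With durations [4, 1, 1, 1] on 4 processors, each of the three short threads sits idle for 3 time units waiting for thread 0 to commit, even though every processor finished its own work.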
Factors Causing Load Imbalance • Difference in thread workload • different control paths (intrinsic load imbalance): e.g. for() { if () { … /* workload W1 */ } else { … /* workload W2 */ } } • influence from other overheads: e.g. a speculative buffer overflow on one thread leads to longer waiting times on its successor threads PACT 04
Factors Causing Load Imbalance (cont.) • Assignment (locations) of the threads on the processors: the same thread sizes placed differently across PE0–PE3 produce different idle times PACT 04
Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions PACT 04
Proposed Compiler Model • Idea: • 1. Compute a table of thread sizes based on all possible execution paths (base thread sizes) • 2. Generate new thread sizes for the execution paths that incur speculative overheads • 3. Consider all possible assignments of these sizes onto P processors, each weighted by its probability • 4. Compute the expected sequential (Tseq_est) and parallel (Tpar_est) execution times • 5. Sest = Tseq_est / Tpar_est PACT 04
1. Compute Thread Sizes Based on Execution Paths for() { … if () { … … = X[A[i]] … X[B[i]] = … … } else { … Y[C[i]] = … … } } • The common code has size w1; the then-path adds w2 and is taken with probability p1, while the else-path adds w3 and is taken with probability p2 • Base thread sizes: W1 = w1 + w2 (probability p1), W2 = w1 + w3 (probability p2) PACT 04
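The table of base thread sizes can be sketched as a small path-enumeration helper (hypothetical: the `path_sizes` name and the restriction to a common block followed by independent two-way branches are assumptions for illustration):

```python
def path_sizes(w_common, branches):
    """Enumerate base thread sizes (W_i, p_i) for a thread body made of
    a common part of size w_common followed by independent two-way
    branches, each given as ((w_then, p_then), (w_else, p_else))."""
    sizes = [(w_common, 1.0)]
    for arms in branches:
        sizes = [(w + wa, p * pa) for (w, p) in sizes for (wa, pa) in arms]
    return sizes
```

For the slide's loop body, with common size w1 = 10 and branch arms of size w2 = 5 (probability 0.7) and w3 = 2 (probability 0.3), this yields W1 = 15 with p1 = 0.7 and W2 = 12 with p2 = 0.3; all numbers here are made up for illustration.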
2. Generating New Thread Sizes for Speculative Overheads • For each execution path that incurs a speculative overhead (e.g. at its loads and stores), a new thread size is generated by adding the overhead cost w to the base size: W1 = W1 + w, W2 = W2 + w, W3 = W3 + w, each keeping its path probability (p1, p2, p3) PACT 04
3. Consider All Assignments: the Thread Tuple Model (figure: all possible assignments of thread sizes 1, 2, and 3 onto processors PE0–PE3) PACT 04
3. Consider All Assignments: the Thread Tuple Model (cont.) • Three thread sizes W1, W2, and W3 assigned onto 4 processors → 81 variations, each called a tuple (1111, 1112, 1113, …, 3332, 3333, with probabilities p1·p1·p1·p1, p1·p1·p1·p2, …, p3·p3·p3·p3) • In general: N thread sizes and P processors → N^P tuples PACT 04
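The tuple enumeration can be sketched directly (a brute-force illustration; the function name and the list-based inputs are assumptions):

```python
from itertools import product

def expected_times_bruteforce(W, p, P):
    """Enumerate all N**P thread tuples on P processors. A tuple's
    probability is the product of its members' path probabilities;
    its Tseq is the sum of the sizes, its Tpar the largest (slowest)
    size, since all threads in the tuple run in parallel."""
    Tseq = Tpar = 0.0
    for tup in product(range(len(W)), repeat=P):
        prob = 1.0
        for i in tup:
            prob *= p[i]
        Tseq += prob * sum(W[i] for i in tup)
        Tpar += prob * max(W[i] for i in tup)
    return Tseq, Tpar
```

With three sizes on 4 processors this walks all 81 tuples; the estimated speedup is then the ratio Tseq / Tpar, and the enumeration can serve as a sanity check for the O(N) closed forms.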
4. Compute Sequential and Parallel Execution Times • Within a tuple: Tseq_tuple = Σ (i in tuple) W_i and Tpar_tuple = max (i in tuple) W_i • For example, with sizes W1 ≤ W2 ≤ W3 on 4 processors:
Tuple | Assignment | Probability | Tseq_tuple | Tpar_tuple
1 | 1111 | p1·p1·p1·p1 | 4·W1 | W1
2 | 1112 | p1·p1·p1·p2 | 3·W1 + W2 | W2
3 | 1113 | p1·p1·p1·p3 | 3·W1 + W3 | W3
… | … | … | … | …
80 | 3332 | p3·p3·p3·p2 | 3·W3 + W2 | W3
81 | 3333 | p3·p3·p3·p3 | 4·W3 | W3
PACT 04
4. Compute Sequential and Parallel Execution Times (cont.) • Estimated sequential execution time: Tseq_est = Σ over all tuples of p_tuple · Tseq_tuple • Estimated parallel execution time: Tpar_est = Σ over all tuples of p_tuple · Tpar_tuple PACT 04
4. Compute Sequential and Parallel Execution Times (cont.)
• Estimated sequential execution time: Tseq_est = P · Σ (i=1..N) W_i · p_i
• Estimated parallel execution time: Tpar_est = Σ (i=1..N) p(Tpar_tuple = W_i) · W_i
where, with the sizes sorted so that W1 ≤ W2 ≤ … ≤ WN:
p(Tpar_tuple = W_1) = p_1^P
p(Tpar_tuple = W_i) = (p_1 + … + p_i)^P − (p_1 + … + p_{i-1})^P, for 2 ≤ i ≤ N
• Cost is O(N), with no tuple enumeration, where N is the number of thread sizes
PACT 04
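The closed forms above can be sketched in a few lines (a hypothetical helper; it assumes the thread sizes are passed sorted in ascending order together with their path probabilities):

```python
def expected_times_closed_form(W, p, P):
    """O(N) estimates of the expected sequential and parallel times.
    W must be sorted ascending, so (p_1 + ... + p_i)**P is the
    probability that all P threads in a tuple have size <= W_i, and
    the difference of consecutive powers is P(max size == W_i)."""
    Tseq = P * sum(w * q for w, q in zip(W, p))
    Tpar = 0.0
    cum_prev = 0.0
    for w, q in zip(W, p):
        cum = cum_prev + q
        Tpar += (cum ** P - cum_prev ** P) * w  # P(Tpar_tuple == w) * w
        cum_prev = cum
    return Tseq, Tpar
```

With the made-up values W = [1, 2, 3], p = [0.5, 0.3, 0.2] and P = 4, this gives Tseq_est = 6.8 and Tpar_est ≈ 2.53, hence Sest ≈ 2.7, matching a brute-force enumeration of all 81 tuples.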
5. Computing the Estimated Speedup • Sest = Tseq_est / Tpar_est PACT 04
Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions PACT 04
Evaluation Environment • Implementation: in the IR of SUIF1 • high-level control structure retained • instructions within basic blocks dismantled • Simulation: trace-driven, with Simics • Architecture: Stanford Hydra CMP • 4 single-issue processors • private 16KB L1 caches • private fully associative 2KB speculative buffers • shared on-chip 2MB L2 cache PACT 04
Applications • Subset of the SPEC2000 benchmarks • 4 floating-point and 5 integer • MinneSPEC reduced input sets: • input size: 2–3 billion instructions • simulated instructions: 100M to 600M • Focus on the base model for load imbalance • loops with squash or speculative buffer overflow overheads are not considered • Total of 101 loops • most account for about 40% to 90% of the sequential execution time of their application PACT 04
Speedup Distribution • Very varied speedup/slowdown behavior PACT 04
Model Accuracy (I): Outcomes • Most speedups/slowdowns correctly predicted by the model • Only 17% false positives (performance degradation) • Negligible false negatives (missed opportunities) PACT 04
Model Accuracy (II): Cumulative Error Distribution • Error less than 50% for 84% of the loops • Acceptable errors, but room for improvement PACT 04
Performance Improvements • Curbs the performance degradation of the naive policy • Mostly better performance than previous policies • Very close to the performance of the oracle PACT 04
Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions PACT 04
Related Work • Architectures supporting speculative parallelization: • Multiscalar processor (Wisconsin) • Hydra (Stanford) • Clustered speculative multithreaded processor (UPC) • Thread-Level Data Speculation (CMU) • MAJC (Sun) • Superthreaded processor (Minnesota) • Multiplex (Purdue) • CMP with speculative multithreading (Illinois) PACT 04
Related Work (cont.) • Compiler support for speculative parallelization: • most of the above projects have a compiler branch • thread partitioning and optimizations based on simple heuristics • Recent publications on compiler cost models: • Chen et al. (PPoPP'03): a mathematical model, concentrated on probabilistic points-to analysis • Du et al. (PLDI'04): a cost model of squash overhead based on the probability of dependences • No prior work found on a cost model that includes load imbalance PACT 04
Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions PACT 04
Conclusions • Compiler cost model of speculative multithreaded execution • Fairly accurate quantitative predictions of speedup: • correctly identifies speedup/slowdown in 82% of cases • errors of less than 50% in 84% of cases • Good model-driven selection policy: • 5% faster on average, and as much as 38% faster, than the naive policy • within 5% of an oracle policy • Could potentially accommodate all other speculative execution overheads (work in progress) PACT 04
Sources of Largest Errors (top 10%)
Source of error | Number of loops | Error (%)
Unknown inner loop iteration count | 2 | 54~116
Incorrect IR workload estimation | 4 | 98~161
Unknown iteration count (i<P) | 3 | 54~61
Biased conditional | 1 | 136
PACT 04