Compiler Estimation of Load Imbalance Overhead in Speculative Parallelization
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA
Jialin Dou and Marcelo Cintra

Contributions
• Integrated compiler cost model to estimate speculative parallelization overheads
• Evaluation of the model with load imbalance and thread dispatch & commit overheads
  • accuracy: correctly predicts speedup or slowdown in 82% of cases
  • performance: 5% improvement on average, and up to 38% better than a naive approach that speculates on all loops considered
PACT 04

Outline
• Motivation
• Speculative Parallelization
• Novel Compiler Cost Model
• Evaluation
• Related Work
• Conclusions

Why a Compiler Cost Model?
• Speculative parallelization can deliver significant speedup or slowdown
  • several speculation overheads
  • some code segments could slow down the program
  • we need a smart compiler to choose which program regions to run speculatively, based on the expected outcome
• A prediction of the value of the speedup can be useful
  • e.g. in a multi-tasking environment
    • program A wants to run speculatively in parallel (predicted speedup of 1.2)
    • other programs are waiting to be scheduled
    • the OS decides that speculation does not pay off

Speculative Parallelization
• Threads are speculatively executed in parallel, assuming no dependences
• At run time, the system tracks memory references
• If a cross-thread data-dependence violation is detected, the offending threads are squashed and restarted

for (i = 0; i < 100; i++) {
    … = A[L[i]] + …
    A[K[i]] = …
}

Iteration J:    … = A[4] + …    A[5] = …
Iteration J+1:  … = A[2] + …    A[2] = …
Iteration J+2:  … = A[5] + …    A[6] = …

[Figure: iterations J, J+1 and J+2 on CPU0, CPU1 and CPU2; the write to A[5] in iteration J and the read of A[5] in iteration J+2 form the marked RAW dependence]

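The dependence in the loop above can be found offline with a small sketch. This is only an illustration, not the run-time mechanism of any particular architecture: the function name `find_raw_dependences` is hypothetical, and it assumes each iteration performs one read of `A[L[i]]` followed by one write of `A[K[i]]`, as in the slide's loop. It reports potential violations; whether a squash actually happens depends on run-time timing.

```python
def find_raw_dependences(read_idx, write_idx):
    """Return (writer_iter, reader_iter, element) triples where a later
    iteration reads an element that an earlier iteration writes."""
    dependences = []
    last_writer = {}  # element -> most recent earlier iteration that wrote it
    for i, (r, w) in enumerate(zip(read_idx, write_idx)):
        # The read comes first in the loop body, so only writes from
        # strictly earlier iterations are visible here.
        if r in last_writer:
            dependences.append((last_writer[r], i, r))
        last_writer[w] = i
    return dependences


# Index values from the slide: iterations J, J+1, J+2 are 0, 1, 2 here.
reads = [4, 2, 5]   # L[i]
writes = [5, 2, 6]  # K[i]
print(find_raw_dependences(reads, writes))
```

With the slide's indices, the only cross-iteration dependence is on `A[5]`, written by the first iteration and read by the third.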
Speculative Parallelization Overheads
• Squash & restart: re-executing the threads
• Speculative buffer overflow: the speculative buffer is full, so the thread stalls until it becomes non-speculative
• Inter-thread communication: waiting for a value from a predecessor thread
• Dispatch & commit: writing back speculative data into memory
• Load imbalance: a processor waits for its thread to become non-speculative before it can commit

Load Imbalance Overhead
• A different problem in speculative parallelization
• Due to the in-order commit requirement, a processor cannot start a new thread before its current thread commits
• The processor remains idle while waiting for the predecessor thread to commit
• Could account for a large fraction of execution time
[Figure: timeline of threads running on PE0 to PE3 and committing in order]

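A minimal sketch of how in-order commit turns unequal thread sizes into idle time. The assumptions are mine, not the slides': threads are assigned to processors round-robin, commit itself costs nothing, and no other overhead occurs. The name `simulate_in_order_commit` is hypothetical.

```python
def simulate_in_order_commit(sizes, num_pes):
    """Return (total_time, idle_time_per_pe) for threads of the given sizes.

    Assumes round-robin assignment (thread i runs on PE i % num_pes) and
    zero-cost commits; a thread commits only after its predecessor has.
    """
    pe_free = [0.0] * num_pes   # time at which each PE can start a new thread
    idle = [0.0] * num_pes      # time each PE spends waiting to commit
    prev_commit = 0.0
    for i, work in enumerate(sizes):
        pe = i % num_pes
        finish = pe_free[pe] + work
        commit = max(finish, prev_commit)  # in-order commit requirement
        idle[pe] += commit - finish
        pe_free[pe] = commit               # PE is busy until its thread commits
        prev_commit = commit
    return prev_commit, idle


total, idle = simulate_in_order_commit([4, 1, 1, 1], num_pes=2)
print(total, idle)
```

In this run the short second thread finishes early but must wait for the long first thread to commit, so its processor accumulates idle time.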
Factors Causing Load Imbalance
• Difference in thread workload
  • different control paths (intrinsic load imbalance)
  • influence from other overheads
    • e.g. a speculative buffer overflow in one thread leads to longer waiting times in successor threads

for () {
    if () { … }
    else { … }
}

[Figure: the two control paths give Workload 1 (W1) and Workload 2 (W2)]

Factors Causing Load Imbalance (cont)
• Assignment (locations) of the threads on the processors
[Figure: two thread-to-processor assignments on PE0 to PE3 with their commit points]

Proposed Compiler Model
• Idea:
  • Compute a table of thread sizes based on all possible execution paths (base thread sizes)
  • Generate new thread sizes for those execution paths that have speculative overheads
  • Consider all possible assignments of the above sizes on P processors, each weighted by its probability
  • Compute the expected sequential ($T_{seq}^{est}$) and parallel ($T_{par}^{est}$) execution times

$$S^{est} = T_{seq}^{est} / T_{par}^{est}$$

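1. Compute Thread Sizes Based on Execution Paths

for () {
    …
    if () {
        …
        … = X[A[i]]
        …
        X[B[i]] = …
        …
    } else {
        …
        Y[C[i]] = …
        …
    }
}

[Figure: control-flow graph with block weights w1, w2 and w3 and load (ld) and store (st) nodes; the two paths give the thread sizes W1 = w1 + w2 with probability p1 and W2 = w1 + w3 with probability p2]
Resulting thread size table: (W1, p1), (W2, p2)
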
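A minimal sketch of this step for the single if/else shape in the slide, assuming block weights and branch probabilities are already known (for example from profiling). The function name `thread_size_table` and the numbers in the usage are hypothetical.

```python
def thread_size_table(common_weight, branch_weights, branch_probs):
    """Return [(thread_size, probability)], one entry per execution path.

    common_weight is the work shared by all paths (w1 in the slide);
    branch_weights are the path-specific weights (w2, w3).
    """
    if len(branch_weights) != len(branch_probs):
        raise ValueError("need one probability per branch")
    return [(common_weight + w, p)
            for w, p in zip(branch_weights, branch_probs)]


# Hypothetical weights: w1 = 10, w2 = 30, w3 = 5; p1 = 0.25, p2 = 0.75.
print(thread_size_table(10, [30, 5], [0.25, 0.75]))
```

Each entry pairs a path's total size (W1 = w1 + w2, W2 = w1 + w3) with the probability of taking that path.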
2. Generating New Thread Sizes for Speculative Overheads
[Figure: dependence graphs of loads (ld) and stores (st) across threads; an overhead w is added to a base thread size (e.g. W1 + w), and the thread size table grows from (W1, p1), (W2, p2) to (W1, p1), (W2, p2), (W3, p3)]

3. Consider All Assignments: the Thread Tuple Model
[Figure: example assignments of thread sizes 1, 2 and 3 to processors PE0 to PE3]

3. Consider All Assignments: the Thread Tuple Model (cont)
• Three thread sizes W1, W2 and W3 assigned onto 4 processors give $3^4 = 81$ variations, each called a tuple
• In general: N thread sizes and P processors give $N^P$ tuples

Tuple   Assignment   Probability
1       1111         p1·p1·p1·p1
2       1112         p1·p1·p1·p2
3       1113         p1·p1·p1·p3
…       …            …
80      3332         p3·p3·p3·p2
81      3333         p3·p3·p3·p3

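The tuple table can be generated mechanically. The sketch below assumes, as the product-form probabilities in the table suggest, that the thread size on each processor is drawn independently; the name `enumerate_tuples` and the probability values are hypothetical.

```python
from itertools import product


def enumerate_tuples(probs, num_pes):
    """Yield (assignment, probability) for every tuple.

    An assignment is a tuple of 1-based thread-size indices, one per
    processor; its probability is the product of the per-size probabilities.
    """
    for assignment in product(range(1, len(probs) + 1), repeat=num_pes):
        prob = 1.0
        for idx in assignment:
            prob *= probs[idx - 1]
        yield assignment, prob


# Hypothetical probabilities p1, p2, p3 for three thread sizes on 4 processors.
tuples = list(enumerate_tuples([0.5, 0.3, 0.2], num_pes=4))
print(len(tuples))             # number of tuples, N**P
print(tuples[0], tuples[-1])   # assignments 1111 and 3333
```

The enumeration order matches the table: 1111, 1112, 1113, and so on up to 3333.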
4. Compute Sequential and Parallel Execution Times
• Within a tuple:

$$T_{seq}^{tuple} = \sum_{i \in tuple} W_i \qquad T_{par}^{tuple} = \max_{i \in tuple} W_i$$

(thread sizes indexed so that W1 ≤ W2 ≤ W3)

Tuple   Assignment   Probability    Tseqtuple    Tpartuple
1       1111         p1·p1·p1·p1    4·W1         W1
2       1112         p1·p1·p1·p2    3·W1 + W2    W2
3       1113         p1·p1·p1·p3    3·W1 + W3    W3
…       …            …              …            …
80      3332         p3·p3·p3·p2    W2 + 3·W3    W3
81      3333         p3·p3·p3·p3    4·W3         W3

4. Compute Sequential and Parallel Execution Times (cont)
• Estimated sequential execution time: $T_{seq}^{est}$
• Estimated parallel execution time: $T_{par}^{est}$
[Table: the tuple table of the previous slide, with columns Tuple, Assignment, Probability, Tseqtuple and Tpartuple]

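One reading of this step, assuming the two estimates are the probability-weighted averages of the per-tuple times over all tuples (in line with "each weighted by its probability" in the model overview). The function name and the sample values are hypothetical, and the cost grows as N^P.

```python
from itertools import product
from math import prod


def expected_times_by_enumeration(sizes, probs, num_pes):
    """Return (t_seq_est, t_par_est) by enumerating all N**P tuples."""
    t_seq_est = 0.0
    t_par_est = 0.0
    for assignment in product(range(len(sizes)), repeat=num_pes):
        p_tuple = prod(probs[i] for i in assignment)
        # Per-tuple times: sum of the sizes (sequential), largest size (parallel).
        t_seq_est += p_tuple * sum(sizes[i] for i in assignment)
        t_par_est += p_tuple * max(sizes[i] for i in assignment)
    return t_seq_est, t_par_est


# Hypothetical input: two equally likely thread sizes on 2 processors.
print(expected_times_by_enumeration([10, 20], [0.5, 0.5], num_pes=2))
```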
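4. Compute Sequential and Parallel Execution Times (cont)
• Estimated sequential execution time:

$$T_{seq}^{est} = P \cdot \sum_{i=1}^{N} W_i \, p_i$$

• Estimated parallel execution time:

$$T_{par}^{est} = \sum_{i=1}^{N} p(T_{par}^{tuple} = W_i) \cdot W_i$$

where, with the thread sizes indexed in nondecreasing order ($W_1 \le \dots \le W_N$):

$$p(T_{par}^{tuple} = W_i) = \begin{cases} p_i^{P}, & \text{if } i = 1 \\ \left(\sum_{j=1}^{i} p_j\right)^{P} - \left(\sum_{j=1}^{i-1} p_j\right)^{P}, & \text{if } 2 \le i \le N \end{cases}$$

• Cost is O(N), where N is the number of thread sizes: no enumeration of tuples is needed
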
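The closed forms on this slide can be sketched directly. The sketch sorts the thread sizes first, since the probability that the largest size in a tuple equals W_i is a difference of cumulative probabilities raised to the power P only when sizes are in nondecreasing order. The function name `estimate_times_closed_form` and the sample values are hypothetical.

```python
def estimate_times_closed_form(sizes, probs, num_pes):
    """Return (t_seq_est, t_par_est, s_est) without enumerating tuples."""
    pairs = sorted(zip(sizes, probs))  # nondecreasing thread size
    t_seq_est = num_pes * sum(w * p for w, p in pairs)
    t_par_est = 0.0
    cum_prev = 0.0                     # sum of p_j for j < i
    for w, p in pairs:
        cum = cum_prev + p             # sum of p_j for j <= i
        # p(Tpar_tuple = W_i); for i = 1 this reduces to p_1 ** P.
        t_par_est += (cum ** num_pes - cum_prev ** num_pes) * w
        cum_prev = cum
    return t_seq_est, t_par_est, t_seq_est / t_par_est


# Hypothetical input: two equally likely thread sizes on 2 processors.
print(estimate_times_closed_form([10, 20], [0.5, 0.5], num_pes=2))
```

Apart from the sort, the loop is a single pass over the N thread sizes, which is where the O(N) cost comes from; the last returned value is the estimated speedup, the ratio of the two estimates.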
5. Computing the Estimated Speedup

$$S^{est} = T_{seq}^{est} / T_{par}^{est}$$

Evaluation Environment
• Implementation: IR of SUIF1
  • high-level control structure retained
  • instructions within basic blocks dismantled
• Simulation: trace-driven with Simics
• Architecture: Stanford Hydra CMP
  • 4 single-issue processors
  • private 16KB L1 cache
  • private fully associative 2KB speculative buffer
  • shared on-chip 2MB L2 cache

Applications
• Subset of SPEC2000 benchmarks
  • 4 floating point and 5 integer
• MinneSPEC reduced input sets:
  • input size: 2 to 3 billion instructions
  • simulated instructions: 100 million to 600 million
• Focus on the base model for load imbalance
  • loops with squash or speculative buffer overflow overheads are not considered
• Total of 101 loops
  • most of them account for about 40% to 90% of the sequential execution time of their application

Speedup Distribution
• Very varied speedup/slowdown behavior

Model Accuracy (I): Outcomes
• Only 17% false positives (performance degradation)
• Negligible false negatives (missed opportunities)
• Most speedups/slowdowns correctly predicted by the model

Model Accuracy (II): Cumulative Errors Distribution
• Error less than 50% for 84% of the loops
• Acceptable errors, but room for improvement

Performance Improvements
• Can curb the performance degradation of the naive policy
• Mostly better performance than previous policies
• Very close to the performance of an oracle

Related Work
• Architectural support for speculative parallelization:
  • Multiscalar processor (Wisconsin)
  • Hydra (Stanford)
  • Clustered Speculative multithreaded processor (UPC)
  • Thread-level Data Speculation (CMU)
  • MAJC (Sun)
  • Superthreaded processor (Minnesota)
  • Multiplex (Purdue)
  • CMP with speculative multithreading (Illinois)

Related Work
• Compiler support for speculative parallelization:
  • most of the above projects have a compiler branch
  • thread partitioning and optimizations based on simple heuristics
• Recent publications on compiler cost models:
  • Chen et al. (PPoPP'03)
    • a mathematical model, concentrated on probabilistic points-to analysis
  • Du et al. (PLDI'04)
    • cost model of squash overhead based on the probability of dependences
• No literature found on a cost model that includes load imbalance

Conclusions
• Compiler cost model of speculative multithreaded execution
• Fairly accurate quantitative predictions of speedup:
  • correctly identifies speedup/slowdown in 82% of cases
  • errors of less than 50% for 84% of the cases
• Good model-driven selection policy:
  • 5% faster on average and as much as 38% faster than the naive policy
  • within 5% of an oracle policy
• Could potentially accommodate all other speculative execution overheads (we are working on it)

Sources of largest errors (top 10%)

Source of error                        Number   Error (%)
Unknown inner loop iteration count     2        54~116
Incorrect IR workload estimation       4        98~161
Unknown iteration count (i<P)          3        54~61
Biased conditional                     1        136