
Compiler Estimation of Load Imbalance Overhead in Speculative Parallelization




  1. Compiler Estimation of Load Imbalance Overhead in Speculative Parallelization University of Edinburgh http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA Jialin Dou and Marcelo Cintra

  2. Contributions • Integrated compiler cost model to estimate speculative parallelization overheads • Evaluated the model with load imbalance and thread dispatch & commit overheads • accuracy: correctly predicts speedup/slowdown in 82% of cases • performance: 5% improvement on average, and up to 38%, over a naive approach that speculates on all loops considered PACT 04

  3. Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions

  4. Why a Compiler Cost Model? • Speculative parallelization can deliver significant speedup or slowdown • several speculation overheads • some code segments could slow down the program • we need a smart compiler that chooses which program regions to run speculatively based on the expected outcome • A prediction of the value of the speedup can also be useful • e.g. in a multi-tasking environment • program A wants to run speculatively in parallel (predicted speedup 1.2) • other programs are waiting to be scheduled • the OS decides it does not pay off

  5. Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions

  6. Speculative Parallelization • Threads are speculatively executed in parallel, assuming no cross-thread dependences • At run time, the system tracks memory references • If a cross-thread data-dependence violation is detected, the offending threads are squashed and restarted for(i=0; i<100; i++) { … = A[L[i]]+… A[K[i]] = … } (Figure: iteration J reads A[4] and writes A[5]; iteration J+1 reads and writes A[2]; iteration J+2 reads A[5] and writes A[6]. The write of A[5] in iteration J and the read of A[5] in the later iteration J+2 form a cross-thread RAW dependence, so iteration J+2 is squashed.)
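A minimal sketch of the run-time dependence tracking described above (a toy model for illustration, not the Hydra protocol):

```python
# Toy model: detect cross-thread RAW violations when loop iterations run
# speculatively in parallel. Thread order = iteration order; here we
# conservatively squash any later (more speculative) iteration that reads
# a location written by an earlier, not-yet-committed iteration.

def find_violations(reads, writes):
    """reads[i]/writes[i]: sets of addresses touched by iteration i
    (all executing concurrently). Returns iterations to squash."""
    squashed = set()
    for earlier in range(len(writes)):
        for later in range(earlier + 1, len(reads)):
            if writes[earlier] & reads[later]:
                # the later thread may have read a stale value
                squashed.add(later)
    return sorted(squashed)

# The A[L[i]] / A[K[i]] example from the slide: iteration J writes A[5],
# iteration J+2 reads A[5], so iteration J+2 must be squashed.
reads  = [{4}, {2}, {5}]   # A[L[i]] for iterations J, J+1, J+2
writes = [{5}, {2}, {6}]   # A[K[i]]
print(find_violations(reads, writes))  # -> [2]
```

Note that iteration J+1 both reads and writes A[2]: a same-thread access, so no cross-thread violation is flagged.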

  7. Speculative Parallelization Overheads • Squash & restart: re-executing squashed threads • Speculative buffer overflow: when the speculative buffer is full, the thread stalls until it becomes non-speculative • Inter-thread communication: waiting for a value from a predecessor thread • Dispatch & commit: writing back speculative data into memory • Load imbalance: a processor waiting for its thread to become non-speculative in order to commit

  8. Load Imbalance Overhead • A problem specific to speculative parallelization • Due to the in-order-commit requirement, a processor cannot start a new thread before its current thread commits • It remains idle waiting for predecessor threads to commit • Can account for a large fraction of execution time (Figure: threads 1–8 running on PE0–PE3, with idle time on each processor before its thread's commit)
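The idle time caused by in-order commit can be illustrated with a toy batch model (assuming, as the tuple model later in the talk does, that P threads are dispatched at a time and a batch finishes when its last thread commits; all numbers below are invented):

```python
# Toy model of load imbalance under in-order commit: a processor that
# finishes its thread early stays idle until all predecessors commit,
# so each batch of P threads takes as long as its largest thread.

def load_imbalance(thread_sizes, P):
    """Return (parallel_time, idle_fraction) for threads assigned
    round-robin to P processors, committing in thread order."""
    total_busy = sum(thread_sizes)
    parallel_time = 0.0
    for start in range(0, len(thread_sizes), P):
        batch = thread_sizes[start:start + P]
        # the batch ends when its longest thread finishes and commits
        parallel_time += max(batch)
    idle = parallel_time * P - total_busy
    return parallel_time, idle / (parallel_time * P)

# One long thread among four: three processors sit idle most of the time.
t, f = load_imbalance([1, 4, 1, 1], P=4)
print(t, round(f, 4))  # -> 4.0 0.5625
```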

  9. Factors Causing Load Imbalance • Difference in thread workload • different control paths: intrinsic load imbalance • influence from other overheads • e.g. speculative buffer overflow in one thread leads to longer waiting times in successor threads for() { if () { … } else { … } } (Figure: the two branches give Workload 1 (W1) and Workload 2 (W2))

  10. Factors Causing Load Imbalance (cont) • Assignment (locations) of the threads on the processors (Figure: the same thread workloads assigned to PE0–PE3 in two different orders lead to different commit and idle times)

  11. Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions

  12. Proposed Compiler Model • Idea: • Compute a table of thread sizes based on all possible execution paths (base thread sizes) • Generate new thread sizes for those execution paths that incur speculative overheads • Consider all possible assignments of the above sizes on P processors, each weighted by its probability • Compute the expected sequential (Tseq_est) and parallel (Tpar_est) execution times, and the estimated speedup S_est = Tseq_est / Tpar_est

  13. 1. Compute Thread Sizes Based on Execution Paths for() { … if () { … …=X[A[i]] … X[B[i]]=… … } else { … Y[C[i]]=… … } } • then-path (probability p1): thread size W1 = w1 + w2 • else-path (probability p2): thread size W2 = w1 + w3 (w1 is the workload of the blocks common to both paths; w2 and w3 are the workloads of the two branches)
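The per-path bookkeeping of step 1 can be sketched as follows (the weights w1–w3 and branch probabilities are invented values for illustration, not from the paper):

```python
# Toy sketch of step 1: one (size, probability) entry per control path.
# w1 = workload of the blocks common to both paths, w2/w3 = workloads of
# the then/else branches, p1/p2 = branch probabilities (invented values).
w1, w2, w3 = 5, 8, 3
p1, p2 = 0.7, 0.3

base_sizes = [
    (w1 + w2, p1),   # W1: path through the then-branch
    (w1 + w3, p2),   # W2: path through the else-branch
]
print(base_sizes)  # -> [(13, 0.7), (8, 0.3)]
```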

  14. 2. Generating New Thread Sizes for Speculative Overheads (Figure: starting from the base sizes W1 (p1) and W2 (p2), each speculative overhead attached to a path's loads and stores, e.g. an extra cost w, produces a new thread size such as W3 = W2 + w with its own probability p3, and the base probabilities are adjusted accordingly)

  15. 3. Consider All Assignments: the Thread Tuple Model (Figure: the three thread sizes, labeled 1, 2 and 3, assigned in all possible ways to processors PE0–PE3)

  16. 3. Consider All Assignments: the Thread Tuple Model (cont) • Three thread sizes W1, W2 and W3 assigned onto 4 processors → 81 variations, each called a tuple • In general: N thread sizes and P processors → N^P tuples

  #    Tuple   Probability
  1    1111    p1.p1.p1.p1
  2    1112    p1.p1.p1.p2
  3    1113    p1.p1.p1.p3
  …    …       …
  80   3332    p3.p3.p3.p2
  81   3333    p3.p3.p3.p3
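The tuple enumeration can be sketched directly (the sizes and probabilities below are invented example values):

```python
# Sketch of the thread-tuple model: enumerate all N^P assignments of
# thread sizes to P processors, each weighted by the product of the
# probabilities of the sizes it contains.
from itertools import product

def enumerate_tuples(sizes, probs, P):
    """Yield (tuple_of_sizes, probability) for all N^P tuples."""
    for combo in product(range(len(sizes)), repeat=P):
        prob = 1.0
        for i in combo:
            prob *= probs[i]
        yield tuple(sizes[i] for i in combo), prob

# Three sizes on 4 processors -> 3**4 = 81 tuples; probabilities sum to 1.
tuples = list(enumerate_tuples([10, 20, 30], [0.5, 0.3, 0.2], P=4))
print(len(tuples))                          # -> 81
print(round(sum(p for _, p in tuples), 6))  # -> 1.0
```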

  17. 4. Compute Sequential and Parallel Execution Times • Within a tuple: Tseq_tuple = ∑ of the W_i in the tuple, Tpar_tuple = max of the W_i in the tuple

  #    Tuple   Probability    Tseq_tuple   Tpar_tuple
  1    1111    p1.p1.p1.p1    4.W1         W1
  2    1112    p1.p1.p1.p2    3.W1 + W2    W2
  3    1113    p1.p1.p1.p3    3.W1 + W3    W3
  …    …       …              …            …
  80   3332    p3.p3.p3.p2    W2 + 3.W3    W3
  81   3333    p3.p3.p3.p3    4.W3         W3

  18. 4. Compute Sequential and Parallel Execution Times (cont) • Weighting every tuple's Tseq_tuple and Tpar_tuple by the tuple's probability and summing over all tuples yields the estimated sequential and parallel execution times, Tseq_est and Tpar_est (figure: the table from the previous slide, annotated with Tseq_est and Tpar_est)

  19. 4. Compute Sequential and Parallel Execution Times (cont) • Estimated sequential execution time:

  Tseq_est = P . ∑_{i=1..N} W_i . p_i

• Estimated parallel execution time (computed in O(N), with no tuple enumeration; N = number of thread sizes):

  Tpar_est = ∑_{i=1..N} p(Tpar_tuple = W_i) . W_i

where, with the sizes sorted so that W_1 ≤ … ≤ W_N:

  p(Tpar_tuple = W_i) = p_1^P                                        if i = 1
  p(Tpar_tuple = W_i) = (∑_{j=1..i} p_j)^P − (∑_{j=1..i-1} p_j)^P    if 2 ≤ i ≤ N
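A sketch of this O(N) closed form, cross-checked against brute-force enumeration of all N^P tuples (the sizes and probabilities are invented example values; sizes are assumed sorted ascending, so the max of a tuple is the largest size drawn):

```python
from itertools import product
from math import prod

def estimates(W, p, P):
    """O(N) closed form (W sorted ascending): (Tseq_est, Tpar_est)."""
    tseq = P * sum(w * q for w, q in zip(W, p))
    tpar, cum_prev = 0.0, 0.0
    for w, q in zip(W, p):
        cum = cum_prev + q
        # (sum_{j<=i} p_j)^P - (sum_{j<i} p_j)^P = P(tuple max == W_i)
        tpar += (cum ** P - cum_prev ** P) * w
        cum_prev = cum
    return tseq, tpar

def brute_force(W, p, P):
    """Enumerate all N^P tuples to check the closed form."""
    combos = list(product(range(len(W)), repeat=P))
    tseq = sum(sum(W[i] for i in c) * prod(p[i] for i in c) for c in combos)
    tpar = sum(max(W[i] for i in c) * prod(p[i] for i in c) for c in combos)
    return tseq, tpar

W, p, P = [10, 20, 30], [0.5, 0.3, 0.2], 4
print(estimates(W, p, P))    # same values as the brute force below
print(brute_force(W, p, P))
```

The estimated speedup then follows as S_est = Tseq_est / Tpar_est; for these invented numbers, roughly 68 / 25.3 ≈ 2.7.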

  20. 5. Computing the Estimated Speedup • S_est = Tseq_est / Tpar_est

  21. Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions

  22. Evaluation Environment • Implementation: in the IR of SUIF1 • high-level control structure retained • instructions within basic blocks dismantled • Simulation: trace-driven, with Simics • Architecture: Stanford Hydra CMP • 4 single-issue processors • private 16KB L1 caches • private fully associative 2KB speculative buffers • shared on-chip 2MB L2 cache

  23. Applications • Subset of the SPEC2000 benchmarks • 4 floating-point and 5 integer • MinneSPEC reduced input sets: • input size: 2~3 billion instructions • simulated instructions: 100M to 600M • Focus on the base model for load imbalance • loops with squash or speculative buffer overflow overheads are not considered • Total of 101 loops • most of them accounting for about 40% to 90% of the sequential execution time of their application

  24. Speedup Distribution • Very varied speedup/slowdown behavior

  25. Model Accuracy (I): Outcomes • Most speedups/slowdowns correctly predicted by the model • Only 17% false positives (performance degradation) • Negligible false negatives (missed opportunities)

  26. Model Accuracy (II): Cumulative Error Distribution • Error less than 50% for 84% of the loops • Acceptable errors, but room for improvement

  27. Model Accuracy (II): Cumulative Error Distribution (cont)

  28. Performance Improvements • Curbs the performance degradation of the naive policy • Mostly better performance than previous policies • Very close to the performance of an oracle

  29. Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions

  30. Related Work • Architectures supporting speculative parallelization: • Multiscalar processor (Wisconsin) • Hydra (Stanford) • Clustered speculative multithreaded processor (UPC) • Thread-Level Data Speculation (CMU) • MAJC (Sun) • Superthreaded processor (Minnesota) • Multiplex (Purdue) • CMP with speculative multithreading (Illinois)

  31. Related Work (cont) • Compiler support for speculative parallelization: • most of the above projects have a compiler branch • thread partitioning and optimizations based on simple heuristics • Recent publications on compiler cost models • Chen et al. (PPoPP'03) • a mathematical model, concentrated on probabilistic points-to analysis • Du et al. (PLDI'04) • a cost model of squash overhead based on the probability of dependences • No published cost model includes load imbalance

  32. Outline • Motivation • Speculative Parallelization • Novel Compiler Cost Model • Evaluation • Related Work • Conclusions

  33. Conclusions • Compiler cost model of speculative multithreaded execution • Fairly accurate quantitative predictions of speedup: • correctly identifies speedup/slowdown in 82% of cases • errors of less than 50% in 84% of the cases • Good model-driven selection policy: • 5% faster on average, and as much as 38% faster, than the naive policy • within 5% of an oracle policy • Could potentially accommodate all other speculative execution overheads (work in progress)

  34. Compiler Estimation of Load Imbalance Overhead in Speculative Parallelization University of Edinburgh http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA Jialin Dou and Marcelo Cintra

  35. Sources of Largest Errors (top 10%)

  Source of error                        Number   Error (%)
  Unknown inner loop iteration count     2        54~116
  Incorrect IR workload estimation       4        98~161
  Unknown iteration count (i<P)          3        54~61
  Biased conditional                     1        136
