
Mixed Speculative Multithreaded Execution Models



  1. Mixed Speculative Multithreaded Execution Models University of Edinburgh http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA Marcelo Cintra

  2. Context and Motivation • Multi-cores are here to stay and many-cores are coming • Excellent performance for embarrassingly parallel or throughput workloads. Otherwise, ... (Chip photos: Core Duo (Yonah, 2006), Core i7 (Lynnfield, 2010), SCC (2015?))

  3. Context and Motivation • Future many-cores will have many idle cores too often • Not enough applications • Not enough benefit from using more “explicit user threads” • Our proposal: use spare cores to accelerate whatever threads are available • Create implicit threads to run in parallel with main explicit user threads • Accelerate user threads through increased coarse-grain overlap (i.e., TLP) or increased fine-grain overlap (i.e., ILP) • Combine previously proposed speculative multithreading techniques: thread-level speculation (TLS), helper threads (HT), run-ahead execution (RA), and multi-path execution (MP)

  4. Context and Motivation • Why is combining SM schemes a good idea? • No speculative multithreading scheme alone is good enough • Hardware support for all schemes is very similar • Expected end-result? • Better performance • More “effective” many-core experience: • With 4-way speculative multithreading (i.e., 4 implicit SM threads for each explicit user thread) a 64 core “unwieldy” many-core is as “easy to handle” as a 16 core system • How about power efficiency? • Speculation can be made less inefficient (we’re working on it) • Power can be smartly allocated (see our IPDPS’10 paper)

  5. Contributions • Introduce mixed Speculative Multithreading (SM) Execution Models • Design and evaluate two combinations: TLS+HT+RA [ICS’09] and TLS+MP [HPCA’10] • Propose a performance model able to quantify ILP and TLP benefits • Combined approaches outperform state-of-the-art SM models: • TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%) • TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)

  6. Outline • Introduction • Speculative multithreading models • Combined TLS+HT+RA scheme • Combined TLS+MP scheme • Performance model • Experimental setup and results • Conclusions

  7. Speculative Multithreading • Basic Idea: Use idle cores/contexts to speculate on future application needs • TLS: speculatively execute parallel threads • HT/RA: speculatively perform future memory operations • MP: speculatively execute along multiple branch targets • Speculative threads supported in hardware • Compiler support not essential, but can be useful • Hardware infrastructure is very similar

  8. Thread Level Speculation • Compiler deals with: • Task selection • Code generation • HW deals with: • Different context • Spawn threads • Detecting violations • Replaying • Arbitrate commit • Benefit: TLP/ILP • TLP (Overlapped Execution) • + ILP (Prefetching)
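
As a rough illustration of how these responsibilities divide, here is a conceptual sketch of what compiler-generated TLS code looks like for a loop. The tls_spawn, tls_violated, and tls_commit calls are hypothetical stand-ins (stubbed out and executed sequentially here), not the actual POSH/SESC interface:

    #include <cstdio>

    // Hypothetical stand-ins for the primitives named on the slide; in the real
    // scheme they are implemented in hardware, not as function calls.
    static void tls_spawn(int task)    { std::printf("spawn speculative task %d\n", task); }
    static bool tls_violated(int task) { (void)task; return false; }  // HW sets this on a cross-task dependence
    static void tls_commit(int task)   { std::printf("commit task %d in order\n", task); }

    int a[16];

    int main() {
        // Compiler: task selection and code generation; here, one task per iteration.
        for (int i = 0; i < 16; ++i) {
            tls_spawn(i);               // HW: run the task speculatively on another context
            do {
                a[i] = a[i] * 2 + 1;    // task body
            } while (tls_violated(i));  // HW: detect violations and replay
            tls_commit(i);              // HW: arbitrate in-order commit (TLP from overlap)
        }
        return 0;
    }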

  9. Helper Threads • Compiler deals with: • Memory ops that miss / hard-to-predict branches • Backward slices • HW deals with: • Spawn threads • Different context • Discard when finished • Benefit: • ILP (Prefetch/Warmup)
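
To make the backward-slice idea concrete, here is a minimal software sketch (illustrative only; in the paper the helper is spawned and discarded by hardware, and __builtin_prefetch is a GCC/Clang-specific hint standing in for the prefetches the helper would generate). The main thread traverses a linked list; the helper keeps only the pointer chase that computes the delinquent load's address and touches the data early:

    #include <atomic>
    #include <thread>

    struct Node { long payload; Node* next; };

    std::atomic<bool> main_done{false};

    // Main (explicit) thread: p->payload is the delinquent load.
    long main_loop(Node* head) {
        long sum = 0;
        for (Node* p = head; p != nullptr; p = p->next)
            sum += p->payload;
        main_done = true;
        return sum;
    }

    // Helper thread: only the backward slice of the load survives; results are discarded.
    void helper(Node* head) {
        for (Node* p = head; p != nullptr && !main_done; p = p->next)
            __builtin_prefetch(&p->payload);
    }

    int main() {
        static Node nodes[1000];
        for (int i = 0; i < 999; ++i) nodes[i] = {i, &nodes[i + 1]};
        nodes[999] = {999, nullptr};
        std::thread ht(helper, &nodes[0]);  // the paper spawns this in HW, not via the OS
        long sum = main_loop(&nodes[0]);
        ht.join();                          // the helper is simply discarded when finished
        return static_cast<int>(sum & 0xff);
    }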

  10. RunAhead Execution • Compiler deals with: • Nothing • HW deals with: • Different context • When to do RA • VP Memory • Commit/Discard • Benefit: • ILP (Prefetch/Warmup)

  11. MultiPath Execution • Compiler deals with: • Nothing • HW deals with: • Different context • When to do MP • Discard wrong path • Benefit: • ILP (Branch Pred.) (Figure: main-thread timeline with correct and wrong paths and the branch misprediction cost)
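
A minimal sketch of the decision MP makes at fetch (illustrative only; the addresses and the confidence flag are made up):

    #include <cstdio>

    struct FetchPath { unsigned pc; };

    int main() {
        unsigned taken_target = 0x400200, fall_through = 0x400128;
        bool predicted_taken = true;
        bool low_confidence  = true;  // from a branch confidence estimator

        FetchPath primary{ predicted_taken ? taken_target : fall_through };
        std::printf("fetch primary path at 0x%x\n", primary.pc);
        if (low_confidence) {
            // Fork a second context down the other target instead of betting on one.
            FetchPath alternate{ predicted_taken ? fall_through : taken_target };
            std::printf("fork alternate path at 0x%x on a spare context\n", alternate.pc);
        }
        // When the branch resolves, the wrong path is discarded, so no misprediction
        // flush is paid on this branch.
        return 0;
    }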

  12. Outline • Introduction • Speculative multithreading models • Combined TLS+HT+RA scheme • Combined TLS+MP scheme • Performance model • Experimental setup and results • Conclusions

  13. Combining TLS, HT and RA • Start with TLS • Provide support to clone TLS threads and convert them to HT • Conversion to HT means: • Put them in RA mode • Suppress squashes and do not cause additional squashes • Discard them when they finish • No compiler slicing → purely HW approach

  14. Intricacies to be Handled • HT may not prefetch effectively! • Dealing with contention • HT threads much faster → saturate BW • Dealing with thread ordering • TLS imposes total thread order • HT killed → squashes TLS threads

  15. Creating and Terminating HT • Create an HT on an L2 miss we can VP • Use a mem.-address-based confidence estimator • VP only if confident • Create an HT only if we have a free processor • Only allow the most speculative thread to clone • Seamless integration of HT with TLS • BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed • Additionally kill the HT when: • Parent/HT thread finishes • HT causes an exception
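
The spawn and kill conditions on this slide can be summarized as two predicates. This is a compilable sketch with hypothetical names and stubbed-out estimators, not the actual SESC implementation:

    #include <cstdint>
    #include <cstdio>

    struct TlsThread    { bool isMostSpeculative; bool finished; };
    struct HelperThread { const TlsThread* parent; bool finished; bool raisedException; };

    // Stubs for the memory-address-based confidence estimator of the value predictor
    // and for core availability; the real hardware keeps per-address counters.
    static bool vpConfident(std::uint64_t missAddr) { (void)missAddr; return true; }
    static bool freeProcessorAvailable()            { return true; }

    // Spawn rule: only the most speculative TLS thread may clone itself into an HT,
    // and only on an L2 miss whose value we are confident we can predict.
    static bool shouldSpawnHelper(const TlsThread& t, std::uint64_t l2MissAddr) {
        return t.isMostSpeculative && vpConfident(l2MissAddr) && freeProcessorAvailable();
    }

    // Kill rule: discard the HT as soon as it can no longer help safely.
    static bool shouldKillHelper(const HelperThread& ht) {
        return !ht.parent->isMostSpeculative  // parent lost most-speculative status
            || ht.parent->finished            // parent thread finished
            || ht.finished                    // the HT itself finished
            || ht.raisedException;            // the HT caused an exception
    }

    int main() {
        TlsThread t{true, false};
        HelperThread ht{&t, false, false};
        std::printf("spawn HT: %d, kill HT: %d\n",
                    shouldSpawnHelper(t, 0x1000) ? 1 : 0, shouldKillHelper(ht) ? 1 : 0);
        return 0;
    }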

  16. Outline • Introduction • Speculative multithreading models • Combined TLS+HT+RA scheme • Combined TLS+MP scheme • Performance model • Experimental setup and results • Conclusions

  17. Mixed Execution Model • When there are idle resources: • Try MP on top of TLS!! • Map TLS threads onto empty cores • Map MP threads onto empty contexts (same core) • Minimal extra HW: • Branch confidence estimator • MP bit – thread is in MP mode • PATHS – how many outstanding branches • DIR – which path the thread followed
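
For concreteness, the extra per-thread state could be pictured as below. The field names follow the slide; the 3-bit widths match the values shown on slides 18-21, and the encoding is an assumption, not the documented hardware layout:

    #include <cstdint>
    #include <cstdio>

    struct MpThreadState {
        std::uint8_t mp    : 1;  // MP bit: this context is running in multi-path mode
        std::uint8_t paths : 3;  // PATHS: outstanding low-confidence branches being followed
        std::uint8_t dir   : 3;  // DIR: which direction this context took at each of them
    };

    int main() {
        // Mirrors the example on slides 19-21: Thread 1 forks at one low-confidence branch.
        MpThreadState t1a{1, 0b001, 0b000};  // the path that fell through
        MpThreadState t1b{1, 0b001, 0b001};  // the path that took the branch
        std::printf("1a: MP=%d PATHS=%d DIR=%d\n", t1a.mp, t1a.paths, t1a.dir);
        std::printf("1b: MP=%d PATHS=%d DIR=%d\n", t1b.mp, t1b.paths, t1b.dir);
        return 0;
    }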

  18. Combined TLS/MP Model (Figure: TLS timeline with Thread 1 and a speculative Thread 2)

  19. Combined TLS/MP Model (Figure: Thread 1 reaches a low-confidence branch; Thread 1 state: MP: 0, PATHS: 000, DIR: 000)

  20. Combined TLS/MP Model (Figure: multi-path mode; Thread 1 forks into Thread 1a with state MP: 1, PATHS: 001, DIR: 000 and Thread 1b with state MP: 1, PATHS: 001, DIR: 001)

  21. Combined TLS/MP Model (Figure: the branch resolves; Thread 1a state: MP: 1, PATHS: 001, DIR: 000, Thread 1b state: MP: 0, PATHS: 000, DIR: 000)

  22. Intricacies to be Handled • How do we map TLS/MP threads? • Different mapping policies for TLS threads • Dealing with thread ordering • Correct data forwarding • Dealing with violations • While in “MP-Mode” delay restarts/kills/commits • No squashes on the wrong path • Thread spawning: • Delayed as well – to keep contention low

  23. Outline • Introduction • Speculative multithreading models • Combined TLS+HT+RA scheme • Combined TLS+MP scheme • Performance model • Experimental setup and results • Conclusions

  24. Understanding Performance Benefits • Complex TLS thread interactions obscure performance benefits • Even more true for mixed execution models • We need a way to quantify ILP and TLP contributions to bottom-line performance • Proposed model: • Able to break benefits down into ILP and TLP contributions

  25. Performance Model • Sall = Sseq x Silp x Sovl • Compute overall speedup: Sall = Tseq / Tmt

  26. Performance Model • Sall = Sseq x Silp x Sovl • Compute overall speedup (Sall) • Compute sequential TLS speedup: Sseq = Tseq / T1p

  27. Performance Model • Sall = Sseq x Silp x Sovl • Compute overall speedup (Sall) • Compute sequential TLS speedup (Sseq) • Compute speedup due to ILP: Silp = (T1 + T2) / (T1’ + T2’)

  28. Performance Model • Sall = Sseq x Silp x Sovl • Compute overall speedup (Sall) • Compute sequential TLS speedup (Sseq) • Compute speedup due to ILP (Silp) • Use everything to compute the TLP overlap: Sovl = Sall / (Sseq x Silp)
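
Putting the four steps together (T1, T2 and T1’, T2’ are the per-thread times referred to on slide 27; how they are measured is detailed in the papers):

    \[
      S_{all} = \frac{T_{seq}}{T_{mt}}, \qquad
      S_{seq} = \frac{T_{seq}}{T_{1p}}, \qquad
      S_{ilp} = \frac{T_1 + T_2}{T_1' + T_2'}, \qquad
      S_{ovl} = \frac{S_{all}}{S_{seq} \cdot S_{ilp}}
    \]

so that, by construction, Sall = Sseq x Silp x Sovl. With made-up numbers: Tseq = 100, Tmt = 50 and T1p = 95 give Sall = 2.0 and Sseq ≈ 1.05; if the ILP measurement yields Silp = 1.25, the remaining factor Sovl = 2.0 / (1.05 x 1.25) ≈ 1.52 is what the model attributes to coarse-grain overlap (TLP).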

  29. Outline • Introduction • Speculative multithreading models • Combined TLS+HT+RA scheme • Combined TLS+MP scheme • Performance model • Experimental setup and results • Conclusions

  30. Experimental Setup • Simulator, Compiler and Benchmarks: • SESC (http://sesc.sourceforge.net/) • POSH (Liu et al. PPoPP ‘06) • SPEC 2000 Int. • Architecture: (for TLS+HT+RA scheme) • Four-way CMP, 4-issue cores • 16KB L1 Data (multi-versioned) and Instruction Caches • 1MB unified L2 Caches • Inst. window/ROB – 80/104 entries • 16KB Last Value Predictor

  31. Experimental Setup • Simulator, Compiler and Benchmarks: • SESC (http://sesc.sourceforge.net/) • POSH (Liu et al. PPoPP ‘06) • SPEC 2000 Int. • Architecture: (for TLS+MP scheme) • Four-way CMP, 4-issue cores, 6 contexts / core • 32K-bit OGEHL, 1KByte BTB, 32-Entry RAS • 8 Kbit enhanced JRS confidence estimator • 32KB L1 Data (multi-versioned) and Instruction Caches • 1MB unified L2 Caches

  32. Results I: TLS + HT + RA

  33. Comparing TLS, RunAhead and Unified Scheme

  34. Comparing TLS, RunAhead and Unified Scheme • Almost additive benefits

  35. Comparing TLS, RunAhead and Unified Scheme • Almost additive benefits • 10.2% over TLS, 18.3% over RA

  36. Understanding the Extra ILP • Improvements in ILP come from: • Mainly memory • Branch prediction (0.5% improvement) • Focus on memory: • Miss rate on the committed path • Clustering of misses (different cost)

  37. Normalized Shared Cache Misses • All schemes better than sequential • Unified 41% better than sequential

  38. Isolated vs. Clustered Misses • Both TLS + RA → large-window machines • Unified does even better
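
Why the isolated/clustered distinction matters (illustrative numbers only, not from the paper): two isolated misses of 300 cycles each expose roughly 2 x 300 = 600 stall cycles to the main thread, whereas the same two misses clustered, i.e. overlapped with each other as a very large instruction window or a prefetching RA/TLS thread can arrange, expose closer to 300. Moving misses from the isolated to the clustered category therefore lowers their effective cost even when the total miss count is unchanged.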

  39. Results II: TLS + MP

  40. Impact of Branch Prediction on TLS • TLS emulates a wider processor: • Removing mispredictions is important (Amdahl’s law)

  41. Branch Entropy for TLS • Branch prediction is much harder for TLS: • History partitioning • History re-ordering

  42. Increasing the Size of the Branch Predictor • Aliasing not much of a problem • Fundamental limitation is lack of history

  43. Designing a Better Predictor • Predictors that exploit longer histories are not necessarily better

  44. Comparing TLS, MP and Combined TLS/MP

  45. Comparing TLS, MP and Combined TLS/MP • Additive benefits; no point in doubling the predictor

  46. Comparing TLS, MP and Combined TLS/MP • Additive benefits; no point in doubling the predictor • 9.2% over TLS, 28.2% over MP

  47. Pipeline Flushes • Significant reduction in pipeline flushes • More than base MP!

  48. Outline • Introduction • Speculative multithreading models • Combined TLS+HT+RA scheme • Combined TLS+MP scheme • Performance model • Experimental setup and results • Conclusions

  49. Also in the ICS’09 paper … • Dealing with the load of the system • Converting TLS threads to HT • Multiple HTs • Effect of a better VP • Detailed comparison of the performance model against existing models (Renau et al., ICS ’05)

  50. Also in the HPCA’10 paper … • Detailed HW description • Impact of scheduling • Limiting MP to DP • Effect of scaling • Effect of a better CE
