Mixed Speculative Multithreaded Execution Models

Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA

Context and Motivation
• Multi-cores are here to stay, and many-cores are coming
• Excellent performance for embarrassingly parallel or throughput workloads. Otherwise, ...
[Figure: Core Duo (Yonah, 2006); Core i7 (Lynnfield, 2010); SCC (2015?)]
University of Manchester - 2010

Context and Motivation
• Future many-cores will too often have many idle cores
  • Not enough applications
  • Not enough benefit from using more “explicit user threads”
• Our proposal: use spare cores to accelerate whatever threads are available
  • Create implicit threads to run in parallel with the main explicit user threads
  • Accelerate user threads through increased coarse-grain overlap (i.e., TLP) or increased fine-grain overlap (i.e., ILP)
  • Combine previously proposed speculative multithreading techniques: thread-level speculation (TLS), helper threads (HT), run-ahead execution (RA), and multi-path execution (MP)

Context and Motivation
• Why is combining SM schemes a good idea?
  • No speculative multithreading scheme alone is good enough
  • Hardware support for all schemes is very similar
• Expected end result?
  • Better performance
  • More “effective” many-core experience:
    • With 4-way speculative multithreading (i.e., 4 implicit SM threads for each explicit user thread), a 64-core “unwieldy” many-core is as “easy to handle” as a 16-core system
• How about power efficiency?
  • Speculation can be made less inefficient (we’re working on it)
  • Power can be smartly allocated (see our IPDPS’10 paper)

Contributions
• Introduce mixed Speculative Multithreading (SM) Execution Models
• Designed and evaluated two combinations: TLS+HT+RA [ICS’09] and TLS+MP [HPCA’10]
• Propose a performance model able to quantify ILP and TLP benefits
• Combined approaches outperform state-of-the-art SM models:
  • TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
  • TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)

Outline
• Introduction
• Speculative multithreading models
• Combined TLS+HT+RA scheme
• Combined TLS+MP scheme
• Performance model
• Experimental setup and results
• Conclusions

Speculative Multithreading
• Basic idea: use idle cores/contexts to speculate on future application needs
  • TLS: speculatively execute parallel threads
  • HT/RA: speculatively perform future memory operations
  • MP: speculatively execute along multiple branch targets
• Speculative threads supported in hardware
  • Compiler support not essential, but can be useful
  • Hardware infrastructure is very similar

Thread Level Speculation
• Compiler deals with:
  • Task selection
  • Code generation
• HW deals with:
  • Different context
  • Spawning threads
  • Detecting violations
  • Replaying
  • Arbitrating commit
• Benefit: TLP/ILP
  • TLP (overlapped execution)
  • + ILP (prefetching)
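
The "detecting violations" and "replaying" duties listed for the hardware can be illustrated with a small software sketch. It is not the mechanism of any particular TLS design: the `SpecThread` class, the per-thread read sets, and the policy of squashing the violating thread together with all more speculative threads are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SpecThread:
    order: int                                   # position in the sequential thread order
    read_set: set = field(default_factory=set)   # addresses read speculatively
    squashed: bool = False


def record_read(thread, addr):
    """Remember that a speculative thread consumed the value at addr."""
    thread.read_set.add(addr)


def record_write(threads, writer, addr):
    """Apply a write and return the orders of the threads it squashes.

    A violation occurs when a more speculative thread has already read
    addr before this less speculative thread wrote it (assumed policy:
    squash that thread and every thread after it, then replay them).
    """
    victims = [t.order for t in threads
               if t.order > writer.order and addr in t.read_set]
    if not victims:
        return []
    first = min(victims)
    squashed = []
    for t in threads:
        if t.order >= first:
            t.squashed = True
            t.read_set.clear()   # the replayed thread starts with a clean read set
            squashed.append(t.order)
    return sorted(squashed)
```

A write by a more speculative thread never squashes a less speculative one in this sketch, which mirrors the total thread order that TLS imposes.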

Helper Threads
• Compiler deals with:
  • Identifying memory operations that miss and hard-to-predict branches
  • Backward slices
• HW deals with:
  • Spawning threads
  • Different context
  • Discarding threads when finished
• Benefit:
  • ILP (prefetch/warm-up)

RunAhead Execution
• Compiler deals with:
  • Nothing
• HW deals with:
  • Different context
  • When to do RA
  • VP memory
  • Commit/discard
• Benefit:
  • ILP (prefetch/warm-up)

MultiPath Execution
• Compiler deals with:
  • Nothing
• HW deals with:
  • Different context
  • When to do MP
  • Discarding the wrong path
• Benefit:
  • ILP (branch prediction)
[Figure: timeline of the main thread showing correct paths, wrong paths, and the branch misprediction cost]

Combining TLS, HT and RA
• Start with TLS
• Provide support to clone TLS threads and convert them to HT
• Conversion to HT means:
  • Put them in RA mode
  • Suppress squashes and do not cause additional squashes
  • Discard them when they finish
• No compiler slicing → purely HW approach

Intricacies to be Handled
• HT may not prefetch effectively!
• Dealing with contention
  • HT threads are much faster → they saturate BW
• Dealing with thread ordering
  • TLS imposes a total thread order
  • HT killed → squashes TLS threads

Creating and Terminating HT
• Create a HT on an L2 miss that we can value-predict (VP)
  • Use a memory-address-based confidence estimator
  • VP only if confident
• Create a HT only if we have a free processor
• Only allow the most speculative thread to clone
  • Seamless integration of HT with TLS
  • BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed
• Additionally, kill the HT when:
  • The parent or the HT thread finishes
  • The HT causes an exception
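
The creation and termination rules on this slide can be written down as two predicates. This is a minimal sketch that only restates the bullets: the function names, the numeric confidence value, and the `threshold` parameter are hypothetical and do not come from the slides.

```python
def should_create_ht(l2_miss, vp_confidence, threshold,
                     free_processors, is_most_speculative):
    """Return True when a helper thread (HT) may be cloned from a TLS thread.

    All four conditions must hold: the trigger is an L2 miss, the value
    predictor is confident enough (threshold is an assumed parameter),
    a processor is free, and the parent is the most speculative thread.
    """
    return (l2_miss
            and vp_confidence >= threshold
            and free_processors > 0
            and is_most_speculative)


def should_kill_ht(parent_is_most_speculative, parent_finished,
                   ht_finished, ht_exception):
    """Return True when an existing helper thread has to be discarded."""
    return (not parent_is_most_speculative
            or parent_finished
            or ht_finished
            or ht_exception)
```

For example, `should_create_ht(True, 0.9, 0.8, 1, True)` holds, while the same call with `free_processors=0` does not.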

Mixed Execution Model
• When there are idle resources:
  • Try MP on top of TLS!
  • Map TLS threads on empty cores
  • Map MP threads on empty contexts (same core)
• Minimal extra HW:
  • Branch confidence estimator
  • MP bit: thread is in MP mode
  • PATHS: how many branches are outstanding
  • DIR: which path the thread followed
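
To make the role of the branch confidence estimator concrete, here is a sketch of a basic resetting-counter estimator in the style of JRS: a table of saturating counters indexed by the branch PC XORed with the branch history, incremented on a correct prediction and reset on a misprediction. The experimental setup later in the talk uses an "enhanced JRS" estimator whose details are not given here; the table size, counter width, and threshold below are illustrative assumptions only.

```python
class ResettingCounterEstimator:
    """Basic resetting-counter confidence estimator (illustrative sizes)."""

    def __init__(self, entries=1024, bits=4, threshold=15):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.max_count = (1 << bits) - 1
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, pc, history):
        # Combine the branch address and the history to pick a counter.
        return (pc ^ history) & self.mask

    def is_confident(self, pc, history):
        """High confidence once enough consecutive predictions were correct."""
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc, history, prediction_correct):
        i = self._index(pc, history)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0   # a single misprediction resets the counter
```

In a combined scheme of this kind, a branch for which `is_confident` returns False would be the candidate for spawning a second path on an empty context.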

Combined TLS/MP Model
[Figure sequence: timeline of speculative Thread 1 and Thread 2]
• Initially: Thread 1 has MP: 0, PATHS: 000, DIR: 000
• Low-confidence branch: Thread 1 enters multi-path mode and splits into Thread 1a (MP: 1, PATHS: 001, DIR: 000) and Thread 1b (MP: 1, PATHS: 001, DIR: 001)
• Branch resolved: Thread 1b returns to MP: 0, PATHS: 000, DIR: 000 (Thread 1a is still shown with MP: 1, PATHS: 001, DIR: 000)
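
The MP, PATHS and DIR values shown on these slides can be reproduced with a small state model. This is a sketch under stated assumptions, not the hardware described in the HPCA’10 paper: the `MPState` class, the one-bit-per-outstanding-branch encoding, and the rule that the thread whose DIR bit matches the branch outcome survives are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class MPState:
    mp: int = 0        # 1 while the thread is in multi-path mode
    paths: int = 0     # bit vector of outstanding low-confidence branches
    dir_bits: int = 0  # bit vector: which path this thread followed


def fork(state: MPState, bit: int = 0) -> Tuple[MPState, MPState]:
    """Split a thread in two at a low-confidence branch."""
    mask = 1 << bit
    a = replace(state, mp=1, paths=state.paths | mask,
                dir_bits=state.dir_bits & ~mask)
    b = replace(state, mp=1, paths=state.paths | mask,
                dir_bits=state.dir_bits | mask)
    return a, b


def resolve(state: MPState, outcome: int, bit: int = 0) -> Optional[MPState]:
    """Resolve a branch: return the updated state, or None for the wrong path."""
    mask = 1 << bit
    followed = 1 if state.dir_bits & mask else 0
    if followed != outcome:
        return None                      # wrong path: the thread is discarded
    paths = state.paths & ~mask
    return MPState(mp=1 if paths else 0, paths=paths,
                   dir_bits=state.dir_bits & ~mask)
```

With `a, b = fork(MPState())`, the two states match Thread 1a and Thread 1b on the slides, and `resolve(b, 1)` returns the all-zero state shown after the branch is resolved.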

Intricacies to be Handled
• How do we map TLS/MP threads?
  • Different mapping policies for TLS threads
• Dealing with thread ordering
  • Correct data forwarding
• Dealing with violations
  • While in “MP mode”, delay restarts/kills/commits
  • No squashes on the wrong path
• Thread spawning:
  • Delayed as well, to keep contention low

Understanding Performance Benefits
• Complex TLS thread interactions obscure performance benefits
  • Even more true for mixed execution models
• We need a way to quantify ILP and TLP contributions to bottom-line performance
• Proposed model:
  • Able to break benefits down into ILP and TLP contributions

Performance Model
• Sall = Sseq x Silp x Sovl
• Compute overall speedup: Sall = Tseq/Tmt
• Compute sequential TLS speedup: Sseq = Tseq/T1p
• Compute speedup due to ILP: Silp = (T1+T2)/(T1’+T2’)
• Use everything to compute the speedup due to TLP: Sovl = Sall/(Sseq x Silp)
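
The decomposition can be checked with a few lines of arithmetic. The sketch below assumes that T1 and T2 are per-thread execution times when the TLS code runs on a single processor and that T1’ and T2’ are the corresponding per-thread times in the multithreaded run; the slides do not define these symbols. The function name and the example times are made up for illustration and are not results from the talk.

```python
def decompose_speedup(t_seq, t_1p, t_mt, thread_times_1p, thread_times_mt):
    """Split the overall speedup into sequential, ILP and overlap (TLP) factors.

    t_seq: time of the original sequential code
    t_1p:  time of the TLS code on a single processor
    t_mt:  time of the multithreaded (speculative) run
    thread_times_1p: per-thread times T1, T2, ... (assumed meaning)
    thread_times_mt: per-thread times T1', T2', ... (assumed meaning)
    """
    s_all = t_seq / t_mt
    s_seq = t_seq / t_1p
    s_ilp = sum(thread_times_1p) / sum(thread_times_mt)
    s_ovl = s_all / (s_seq * s_ilp)
    return s_all, s_seq, s_ilp, s_ovl


# Invented example times, in arbitrary units.
s_all, s_seq, s_ilp, s_ovl = decompose_speedup(
    t_seq=100.0, t_1p=110.0, t_mt=55.0,
    thread_times_1p=[60.0, 50.0], thread_times_mt=[50.0, 50.0])
```

By construction the three factors multiply back to Sall, so a value of Sseq below 1 shows directly how much of the gain is lost to TLS instrumentation overhead before ILP and overlap are counted.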

Experimental Setup
• Simulator, compiler and benchmarks:
  • SESC (http://sesc.sourceforge.net/)
  • POSH (Liu et al., PPoPP ’06)
  • SPEC 2000 Int.
• Architecture (for TLS+HT+RA scheme):
  • Four-way CMP, 4-issue cores
  • 16KB L1 data (multi-versioned) and instruction caches
  • 1MB unified L2 caches
  • Inst. window/ROB: 80/104 entries
  • 16KB Last Value Predictor

Experimental Setup
• Simulator, compiler and benchmarks:
  • SESC (http://sesc.sourceforge.net/)
  • POSH (Liu et al., PPoPP ’06)
  • SPEC 2000 Int.
• Architecture (for TLS+MP scheme):
  • Four-way CMP, 4-issue cores, 6 contexts per core
  • 32K-bit OGEHL, 1KByte BTB, 32-entry RAS
  • 8 Kbit enhanced JRS confidence estimator
  • 32KB L1 data (multi-versioned) and instruction caches
  • 1MB unified L2 caches

Results I: TLS + HT + RA

Comparing TLS, RunAhead and Unified Scheme
• Almost additive benefits
• 10.2% over TLS, 18.3% over RA

Understanding the Extra ILP
• Improvements in ILP come from:
  • Mainly memory
  • Branch prediction (improvement 0.5%)
• Focus on memory:
  • Miss rate on committed path
  • Clustering of misses (different cost)

Normalized Shared Cache Misses
• All schemes better than sequential
• Unified 41% better than sequential

Isolated vs. Clustered Misses
• Both TLS + RA → large-window machines
• Unified does even better

Results II: TLS + MP

Impact of Branch Prediction on TLS
• TLS emulates a wider processor:
  • Removing mispredictions is important (Amdahl)

Branch Entropy for TLS
• Much harder for TLS:
  • History partitioning
  • History re-ordering

Increasing the Size of the Branch Predictor
• Aliasing not much of a problem
• Fundamental limitation is lack of history

Designing a Better Predictor
• Predictors that exploit longer histories are not necessarily better ...

Comparing TLS, MP and Combined TLS/MP
• Additive benefits; no point in doubling the predictor
• 9.2% over TLS, 28.2% over MP

Pipeline Flushes
• Significant reduction in pipeline flushes
• More than base MP!

Also in the ICS’09 paper ...
• Dealing with the load of the system
• Converting TLS threads to HT
• Multiple HT
• Effect of a better VP
• Detailed comparison of the performance model against existing models (Renau et al., ICS ’05)

Also in the HPCA’10 paper ...
• Detailed HW description
• Impact of scheduling
• Limiting MP to DP
• Effect of scaling
• Effect of a better CE
