Handling Branches in TLS Systems with Multi-Path Execution
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA
Polychronis Xekalakis and Marcelo Cintra
Introduction
• Power efficiency, complexity and time-to-market reasons lead to CMPs
• Many simple cores = high TLP but low ILP
• OK for throughput computing and embarrassingly parallel applications
• Problem:
  • No benefits for sequential applications
  • Even for mostly parallel applications, Amdahl’s Law limits performance gains with many cores
• Solution: Speculative Multithreading (SM)
HPCA 2010
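The Amdahl's Law limit mentioned above can be made concrete with a short calculation. This is only an illustrative sketch: the 90% parallel fraction and the core counts are assumed values, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's Law for a fixed-size workload."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed example: 90% of the work parallelises perfectly.
    for cores in (4, 16, 64):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With a 10% sequential section the speedup can never reach 10x, however many cores are added, which is the gap speculative multithreading tries to close.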
Speculative Multithreading
• Basic idea: use idle cores/contexts to speculate on future application needs
  • TLS (Thread Level Speculation): speculatively execute parallel threads
  • HT/RA (Helper Threads / Runahead): speculatively perform future memory operations
  • MP (MultiPath Execution): speculatively execute along multiple branch targets
• No SM model works best at all times
• Hardware infrastructure is very similar
• Our idea: combine SM models and seamlessly exploit (speculative) TLP and/or ILP
• In this work: TLS + MP
  • (for TLS + HT/RA see [ICS’09])
ICS 2009
Key Contributions
• Analyze branch prediction for TLS systems
• Propose a mixed execution model that combines TLS with MP execution
• We show that TLS allows MP to be more aggressive
• Our approach outperforms state-of-the-art SM models:
  • TLS by 9.2% avg. (up to 23.2%)
  • MP by 28.2% avg. (up to 138%)
Outline
• Introduction
• Speculative Multithreaded Models
• Analysis of Branch Prediction in TLS
• Mixed Execution Model
• Experimental Setup and Results
• Conclusions
Thread Level Speculation
• Compiler deals with:
  • Task selection
  • Code generation
• HW deals with:
  • Different context
  • Spawn threads
  • Detecting violations
  • Replaying
  • Arbitrate commit
[Figure: timeline of Thread 1 and speculative Thread 2]
Thread Level Speculation
• Benefit: TLP/ILP
  • TLP (Overlapped Execution)
  • ILP (Prefetching)
[Figure: two timelines of Thread 1 and speculative Thread 2, labelled Overlapped Execution and Prefetching]
MultiPath Execution
• Compiler deals with:
  • Nothing
• HW deals with:
  • Different context
  • When to do MP
  • Discard wrong path
[Figure: timeline of the main thread in MP mode, showing correct and wrong paths]
MultiPath Execution
• Benefit:
  • ILP (Branch Pred.)
[Figure: main-thread timeline with correct and wrong paths, marking the branch misprediction cost]
Impact of Branch Prediction on TLS
• TLS emulates a wider processor:
  • Removing mispredictions is important (Amdahl)
Branch Entropy for TLS
• Much harder for TLS:
  • History partitioning
  • History re-order
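One way to picture the history-partitioning problem named above is to compare the global history a branch would see in sequential execution with what it sees when the branch stream is split across two threads. The sketch below is an interpretation, not the authors' experiment: the outcome stream, the 3-bit history length, the split point, and the assumption that the second thread starts with an empty history are all illustrative. History re-ordering is not modelled.

```python
def global_histories(outcomes, bits):
    """History register value seen before each branch, in program order."""
    mask = (1 << bits) - 1
    seen, history = [], 0
    for taken in outcomes:
        seen.append(history)
        history = ((history << 1) | taken) & mask  # shift in the newest outcome
    return seen


def partitioned_histories(outcomes, split, bits):
    """Histories when the stream is split into two threads at `split`.

    Assumption: each thread builds its own history starting from zero.
    """
    return (global_histories(outcomes[:split], bits)
            + global_histories(outcomes[split:], bits))


if __name__ == "__main__":
    outcomes = [1, 1, 0, 1, 1, 0, 1, 1]  # made-up taken/not-taken stream
    sequential = global_histories(outcomes, bits=3)
    split = partitioned_histories(outcomes, split=4, bits=3)
    differing = sum(a != b for a, b in zip(sequential, split))
    print(sequential, split, differing)
```

In this toy stream the second thread's first three branches index the predictor with a different history than they would sequentially, so whatever the predictor learned for the sequential pattern does not apply to them.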
Increasing the Size of the Branch Predictor
• Aliasing is not much of a problem
• Fundamental limitation is the lack of history
Designing a Better Predictor
• Predictors that exploit longer histories are not necessarily better
Mixed Execution Model
• When there are idle resources:
  • Try MP on top of TLS!!
  • Map TLS threads on empty cores
  • Map MP threads on empty contexts (same core)
• Minimal extra HW:
  • Branch confidence estimator
  • MP bit – thread in MP mode
  • PATHS – how many outstanding branches
  • DIR – which path the thread followed
Combined TLS/MP Model
[Figure, four build steps: timeline of Thread 1 and speculative Thread 2; Thread 1 splits into Thread 1a and Thread 1b]
• Step 1: Thread 1 and speculative Thread 2 execute under TLS
• Step 2 (Low Confidence Branch): Thread 1 has MP: 0, PATHS: 000, DIR: 000
• Step 3 (Multi-Path Mode): Thread 1a has MP: 1, PATHS: 001, DIR: 000; Thread 1b has MP: 1, PATHS: 001, DIR: 001
• Step 4 (Branch Resolved): Thread 1a has MP: 1, PATHS: 001, DIR: 000; Thread 1b has MP: 0, PATHS: 000, DIR: 000
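The MP / PATHS / DIR bookkeeping walked through above can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware described in the paper: `PathState`, `fork_on_low_confidence` and `resolve` are hypothetical names, PATHS and DIR are treated as bit masks with one bit per outstanding branch, and a DIR bit of 1 is assumed to mean "followed the taken path".

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PathState:
    name: str
    mp: int = 0         # MP bit: 1 while the thread is in multi-path mode
    paths: int = 0      # PATHS: assumed bit mask of outstanding branches
    direction: int = 0  # DIR: assumed bit mask of the path followed per branch


def fork_on_low_confidence(thread, slot):
    """Split a thread in two at a low-confidence branch tracked in `slot`."""
    bit = 1 << slot
    not_taken = replace(thread, name=thread.name + "a", mp=1,
                        paths=thread.paths | bit,
                        direction=thread.direction & ~bit)
    taken = replace(thread, name=thread.name + "b", mp=1,
                    paths=thread.paths | bit,
                    direction=thread.direction | bit)
    return not_taken, taken


def resolve(threads, slot, taken):
    """Keep only threads on the correct path and clear the branch's bits."""
    bit = 1 << slot
    survivors = []
    for t in threads:
        followed_taken = bool(t.direction & bit)
        if followed_taken != taken:
            continue  # wrong path: discarded
        remaining = t.paths & ~bit
        survivors.append(replace(t, mp=1 if remaining else 0,
                                 paths=remaining,
                                 direction=t.direction & ~bit))
    return survivors


if __name__ == "__main__":
    a, b = fork_on_low_confidence(PathState("Thread 1"), slot=0)
    print(a, b)
    print(resolve([a, b], slot=0, taken=True))
```

Forking sets MP and the PATHS bit in both copies and distinguishes them only by DIR; resolving the branch discards the copy whose DIR bit disagrees with the outcome and returns the survivor to the all-zero state.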
Intricacies to be Handled
• How do we map TLS/MP threads?
  • Different mapping policies for TLS threads
• Dealing with thread ordering
  • Correct data forwarding
• Dealing with violations
  • While in “MP-Mode” delay restarts/kills/commits
  • No squashes on the wrong path
• Thread spawning:
  • Delayed as well – keep contention low
Experimental Setup
• Simulator, Compiler and Benchmarks:
  • SESC (http://sesc.sourceforge.net/)
  • POSH (Liu et al. PPoPP ‘06)
  • SPEC 2000 Int.
• Architecture:
  • Four-way CMP, 4-issue cores, 6 contexts / core
  • 32 Kbit OGEHL, 1 KByte BTB, 32-entry RAS
  • 8 Kbit enhanced JRS confidence estimator
  • 32 KB L1 Data (multi-versioned) and Instruction Caches
  • 1 MB unified L2 Caches
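For readers unfamiliar with JRS-style confidence estimation, the sketch below models the basic idea: a table of resetting counters that count consecutive correct predictions and are cleared on a misprediction. It is a simplified illustration, not the "enhanced" estimator used in the evaluation. The class name, the 4-bit counter width, the threshold of 15 and the PC-XOR-history index are assumptions; with 4-bit counters, an 8 Kbit budget would correspond to 2048 entries.

```python
class ResettingCounterEstimator:
    """Simplified JRS-style branch confidence estimator (assumed parameters)."""

    def __init__(self, entries=2048, counter_bits=4, threshold=15):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, pc, history):
        # Assumed indexing: branch address XORed with global history.
        return (pc ^ history) % self.entries

    def is_high_confidence(self, pc, history):
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc, history, prediction_correct):
        i = self._index(pc, history)
        if prediction_correct:
            # Saturating increment on a correct prediction.
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            # Reset on a misprediction.
            self.table[i] = 0


if __name__ == "__main__":
    ce = ResettingCounterEstimator(entries=16, threshold=3)
    for _ in range(3):
        ce.update(pc=0x40, history=0b101, prediction_correct=True)
    print(ce.is_high_confidence(0x40, 0b101))
```

In a mixed TLS/MP design, a branch that this kind of estimator flags as low confidence is the natural trigger for entering MP mode.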
Comparing TLS, MP and Combined TLS/MP
• Additive benefits; no point in doubling the predictor
• 9.2% over TLS, 28.2% over MP
Pipeline Flushes
• Significant reduction in pipeline flushes
• More than base MP!
Also in the Paper …
• Detailed HW description
• Impact of scheduling
• Limiting MP to DP
• Effect of scaling
• Effect of a better CE
Conclusions
• CMPs are here to stay:
  • What about single-threaded apps and apps with significant sequential sections?
• We advocate the use of speculative multithreading
• Analyzed branch prediction for modern TLS systems
• Proposed a new mixed execution model
  • TLS is nicely complemented by MP
• Unified scheme outperforms existing SM models
  • TLS by 9.2% avg. (up to 23.2%)
  • MP by 28.2% avg. (up to 138%)
Backup Slides
Prediction Stats
Performance Model
• $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
• Compute overall speedup: $S_{all} = T_{seq} / T_{mt}$
• Compute sequential TLS speedup: $S_{seq} = T_{seq} / T_{1p}$
• Compute speedup due to ILP: $S_{ilp} = (T_1 + T_2) / (T_1' + T_2')$
• Use everything to compute TLP: $S_{ovl} = S_{all} / (S_{seq} \times S_{ilp})$
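The decomposition above can be followed step by step with a small calculation. The timings below are invented purely for illustration, and the parameter names are hypothetical; the slides do not define the individual T terms, so the sketch only mirrors the ratios as written.

```python
def decompose_speedup(t_seq, t_1p, t_mt, t1, t2, t1_prime, t2_prime):
    """Split the overall speedup into the three factors of the model."""
    s_all = t_seq / t_mt                       # overall speedup
    s_seq = t_seq / t_1p                       # sequential TLS speedup
    s_ilp = (t1 + t2) / (t1_prime + t2_prime)  # speedup due to ILP
    s_ovl = s_all / (s_seq * s_ilp)            # remainder attributed to TLP
    return {"s_all": s_all, "s_seq": s_seq, "s_ilp": s_ilp, "s_ovl": s_ovl}


if __name__ == "__main__":
    # Made-up timings in arbitrary units.
    result = decompose_speedup(t_seq=100.0, t_1p=110.0, t_mt=60.0,
                               t1=55.0, t2=55.0,
                               t1_prime=50.0, t2_prime=45.0)
    for name, value in result.items():
        print(name, round(value, 3))
```

Because the overlap factor is computed last as a quotient, the three factors multiply back to the overall speedup by construction; any measurement error in the other terms ends up in the overlap factor.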