Handling Branches in TLS Systems with Multi-Path Execution

Polychronis Xekalakis and Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA

Introduction
• Power efficiency, complexity, and time-to-market pressures have led to chip multiprocessors (CMPs)
• Many simple cores = high TLP but low ILP
• Fine for throughput computing and embarrassingly parallel applications
• Problem:
  • No benefit for sequential applications
  • Even for mostly parallel applications, Amdahl’s Law limits performance gains with many cores
• Solution: Speculative Multithreading (SM)
HPCA 2010
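
To make the Amdahl's Law bullet concrete, here is a minimal sketch of the standard formula. The function name and the example fractions are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the run time parallelizes.

    parallel_fraction: share of sequential run time that can use all cores.
    cores: number of cores applied to that share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application that is 90% parallel.
    for n in (1, 4, 16, 64):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

Even with very many cores, the 10% sequential share in this made-up example caps the speedup below 10x, which is the gap speculative multithreading aims at.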

Speculative Multithreading
• Basic idea: use idle cores/contexts to speculate on future application needs
  • TLS: speculatively execute parallel threads
  • HT/RA: speculatively perform future memory operations
  • MP: speculatively execute along multiple branch targets
• No single SM model works best at all times
• The hardware infrastructure is very similar
• Our idea: combine SM models and seamlessly exploit (speculative) TLP and/or ILP
  • In this work: TLS + MP
  • (For TLS + HT/RA, see [ICS’09])

Key Contributions
• Analyze branch prediction for TLS systems
• Propose a mixed execution model that combines TLS with MP execution
• Show that TLS allows MP to be more aggressive
• Our approach outperforms state-of-the-art SM models:
  • TLS by 9.2% on average (up to 23.2%)
  • MP by 28.2% on average (up to 138%)

Outline
• Introduction
• Speculative Multithreaded Models
• Analysis of Branch Prediction in TLS
• Mixed Execution Model
• Experimental Setup and Results
• Conclusions

Thread Level Speculation
• Compiler deals with:
  • Task selection
  • Code generation
• HW deals with:
  • Different contexts
  • Spawning threads
  • Detecting violations
  • Replaying
  • Arbitrating commits
[Figure: timeline of Thread 1 and speculative Thread 2]

Thread Level Speculation
• Benefit: TLP/ILP
  • TLP (overlapped execution)
  • ILP (prefetching)
[Figure: two timelines of Thread 1 and speculative Thread 2, labeled "Overlapped Execution" and "Prefetching"]

MultiPath Execution
• Compiler deals with:
  • Nothing
• HW deals with:
  • Different context
  • When to do MP
  • Discard wrong path
[Figure: timeline of the main thread in MP mode, showing correct and wrong paths]

MultiPath Execution
• Benefit:
  • ILP (branch prediction)
[Figure: timeline of the main thread with correct and wrong paths, marking the branch misprediction cost]

Impact of Branch Prediction on TLS
• TLS emulates a wider processor:
  • Removing mispredictions is important (Amdahl)

Branch Entropy for TLS
• Branch prediction is much harder for TLS:
  • History partitioning
  • History reordering

Increasing the Size of the Branch Predictor
• Aliasing is not much of a problem
• The fundamental limitation is lack of history

Designing a Better Predictor
• Predictors that exploit longer histories are not necessarily better

Mixed Execution Model
• When there are idle resources:
  • Try MP on top of TLS
  • Map TLS threads onto empty cores
  • Map MP threads onto empty contexts (same core)
• Minimal extra HW:
  • Branch confidence estimator
  • MP bit: thread is in MP mode
  • PATHS: how many outstanding branches
  • DIR: which path the thread followed

Combined TLS/MP Model
[Figure: timeline of Thread 1 and speculative Thread 2]

Combined TLS/MP Model
• Low-confidence branch
• Thread 1: MP: 0, PATHS: 000, DIR: 000
[Figure: timeline of Thread 1 and speculative Thread 2]

Combined TLS/MP Model
• Multi-path mode
• Thread 1a: MP: 1, PATHS: 001, DIR: 000
• Thread 1b: MP: 1, PATHS: 001, DIR: 001
[Figure: timeline of Thread 1a, Thread 1b, and speculative Thread 2]

Combined TLS/MP Model
• Branch resolved
• Thread 1a: MP: 1, PATHS: 001, DIR: 000
• Thread 1b: MP: 0, PATHS: 000, DIR: 000
[Figure: timeline of Thread 1a, Thread 1b, and speculative Thread 2]
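
The three steps above (low-confidence branch, multi-path mode, branch resolved) can be mimicked with a small software sketch. This is only an illustration of the field values shown on the slides, not the hardware design from the paper: the class and function names are hypothetical, the sketch handles a single outstanding branch, and it assumes the wrong-path thread is simply discarded.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PathState:
    """Hypothetical per-thread bookkeeping named after the slide fields."""
    mp: int = 0          # MP bit: 1 while the thread is in MP mode
    paths: int = 0       # PATHS: 3-bit field for outstanding branches
    direction: int = 0   # DIR: 3-bit field for the path this thread followed

    def __str__(self) -> str:
        return f"MP: {self.mp} PATHS: {self.paths:03b} DIR: {self.direction:03b}"


def fork_on_low_confidence(parent: PathState):
    """Split a thread into two paths at one low-confidence branch."""
    if parent.mp:
        raise ValueError("this sketch models a single outstanding branch only")
    path_a = replace(parent, mp=1, paths=0b001, direction=0b000)
    path_b = replace(parent, mp=1, paths=0b001, direction=0b001)
    return path_a, path_b


def resolve(path_a: PathState, path_b: PathState, correct_direction: int) -> PathState:
    """Return the surviving path with its MP fields cleared; the other is dropped."""
    for candidate in (path_a, path_b):
        if candidate.direction == correct_direction:
            return replace(candidate, mp=0, paths=0, direction=0)
    raise ValueError("no path followed the resolved direction")


if __name__ == "__main__":
    thread_1 = PathState()
    thread_1a, thread_1b = fork_on_low_confidence(thread_1)
    print("1a:", thread_1a)
    print("1b:", thread_1b)
    # As in the slides, assume the branch resolves in favor of Thread 1b.
    print("survivor:", resolve(thread_1a, thread_1b, correct_direction=0b001))
```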

Intricacies to be Handled
• How do we map TLS/MP threads?
  • Different mapping policies for TLS threads
• Dealing with thread ordering
  • Correct data forwarding
• Dealing with violations
  • While in "MP mode", delay restarts/kills/commits
  • No squashes on the wrong path
• Thread spawning:
  • Delayed as well, to keep contention low

Experimental Setup
• Simulator, compiler, and benchmarks:
  • SESC (http://sesc.sourceforge.net/)
  • POSH (Liu et al., PPoPP ’06)
  • SPEC 2000 Int.
• Architecture:
  • Four-way CMP, 4-issue cores, 6 contexts per core
  • 32 Kbit OGEHL, 1 KByte BTB, 32-entry RAS
  • 8 Kbit enhanced JRS confidence estimator
  • 32 KB L1 data (multi-versioned) and instruction caches
  • 1 MB unified L2 caches
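
For readers unfamiliar with the confidence estimator listed above, the following is a rough sketch of a basic JRS-style estimator: a table of resetting counters indexed by the branch address XORed with global history, where a counter is incremented on a correct prediction and reset on a misprediction. It does not model the "enhanced" variant used in the talk. The table size (2048 entries of 4 bits, which happens to total 8 Kbit), the threshold, and all names are assumptions made for illustration.

```python
class JRSConfidenceEstimator:
    """Sketch of a resetting-counter branch confidence estimator."""

    def __init__(self, entries=2048, counter_bits=4, threshold=15, history_bits=11):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.index_mask = entries - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.history_mask = (1 << history_bits) - 1
        self.history = 0              # global branch outcome history
        self.table = [0] * entries    # one resetting counter per entry

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.index_mask

    def is_high_confidence(self, pc: int) -> bool:
        """True if the branch at pc has a long enough streak of correct predictions."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, prediction_correct: bool, taken: bool) -> None:
        """Train on a resolved branch, then shift its outcome into the history."""
        i = self._index(pc)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

In a mixed TLS/MP scheme of the kind described here, a branch for which `is_high_confidence` returns False would be a candidate for entering MP mode.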

Comparing TLS, MP and Combined TLS/MP
• Additive benefits; no point in doubling the predictor
• 9.2% over TLS, 28.2% over MP

Pipeline Flushes
• Significant reduction in pipeline flushes
• A larger reduction than with base MP!

Also in the Paper …
• Detailed HW description
• Impact of scheduling
• Limiting MP to DP
• Effect of scaling
• Effect of a better CE

Conclusions
• CMPs are here to stay:
  • What about single-threaded applications and applications with significant sequential sections?
• We advocate the use of speculative multithreading
• Analyzed branch prediction for modern TLS systems
• Proposed a new mixed execution model
  • TLS is nicely complemented by MP
• The unified scheme outperforms existing SM models:
  • TLS by 9.2% on average (up to 23.2%)
  • MP by 28.2% on average (up to 138%)


Backup Slides

Prediction Stats

Performance Model
• S_all = S_seq × S_ilp × S_ovl
• Compute overall speedup: S_all = T_seq / T_mt
• Compute sequential TLS speedup: S_seq = T_seq / T_1p
• Compute speedup due to ILP: S_ilp = (T_1 + T_2) / (T_1' + T_2')
• Use everything to compute TLP: S_ovl = S_all / (S_seq × S_ilp)
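
The decomposition above is straightforward to compute once the timings are available. The sketch below only restates the slide's ratios; the function and parameter names are hypothetical, the slide does not spell out exactly how the per-thread times T1, T2 and T1', T2' are measured (here they are just two lists of numbers), and the sample values are made up.

```python
def speedup_breakdown(t_seq, t_1p, t_mt, thread_times, thread_times_prime):
    """Split overall speedup into sequential, ILP, and overlap (TLP) factors.

    t_seq: run time of the sequential program (Tseq).
    t_1p:  run time entering the Sseq ratio (T1p on the slide).
    t_mt:  run time of the multithreaded run (Tmt).
    thread_times:       per-thread times T1, T2, ... (numerator of Silp).
    thread_times_prime: per-thread times T1', T2', ... (denominator of Silp).
    """
    s_all = t_seq / t_mt
    s_seq = t_seq / t_1p
    s_ilp = sum(thread_times) / sum(thread_times_prime)
    s_ovl = s_all / (s_seq * s_ilp)  # whatever speedup the other factors leave
    return {"S_all": s_all, "S_seq": s_seq, "S_ilp": s_ilp, "S_ovl": s_ovl}


if __name__ == "__main__":
    # Made-up timings, purely to exercise the formulas.
    result = speedup_breakdown(
        t_seq=100.0, t_1p=125.0, t_mt=50.0,
        thread_times=[60.0, 65.0], thread_times_prime=[50.0, 50.0],
    )
    for name, value in result.items():
        print(name, round(value, 3))
```

By construction the three factors multiply back to S_all, so S_ovl captures whatever speedup the sequential and ILP factors leave unexplained.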
