The potential for Software-only thread-level speculation

Depth Oral Presentation
Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza
Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik
By: Chuck (Chengyan) Zhao
April 25, 2005

From all major companies:
• IBM: Power 4, Power 5, …
• Intel: Montecito, Smithfield, …
• AMD: Dual-core Opteron
• Sun: MAJC
• Sony, Toshiba, IBM: Cell
• …
Chip Multi-Processor (CMP) is now everywhere
[Figure: Power 4, dual-core Intel chip, dual-core Opteron, Cell]
Abundant Chip Multiprocessors

Improving Throughput with a Chip Multi-Processor
[Figure: multiprogramming workload; applications running on processors (P) with caches (C); axis: execution time]
improve throughput

Improving Single Application Performance with a Chip Multi-Processor
[Figure: single application running on processors (P) with caches (C); axis: Exec. Time]
need parallel threads to reduce execution time

Using Chip Multi-Processor for improvements
• Improve throughput for multi-programming workload
  • Easy
  • CMP behaves like a normal MP
• Improve single-application performance
  • Hard
  • Control and Data Dependence
  • Proposed approach: Thread-Level Speculation (TLS)
CMP trade-offs

Thread-Level Speculation (TLS)
• Enables the compiler to create parallel threads despite ambiguous data dependences
• Optimistically parallelize at compile time
• Detect violations and recover at runtime
[Flowchart: Compile time: parallelize without dependence detection. Run time: detect violation? No: commit modification. Yes: squash and re-execute.]
Optimistic at compile time, detect and recover at runtime
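The commit / squash-and-re-execute cycle on this slide can be sketched as a small sequential simulation in C. Everything here is an assumption made for illustration, not the talk's implementation: the names `struct epoch`, `run_sequentially`, and `run_speculatively` are hypothetical, and the loop body (`data[w] = data[r] + 1`) is invented so that the result can be checked against the plain sequential loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop iteration (epoch): data[w] = data[r] + 1. */
struct epoch {
    size_t r, w;      /* index read and index written by this iteration */
    int write_val;    /* buffered result, not yet visible in memory */
};

/* Reference semantics: the original sequential loop. */
void run_sequentially(int *data, const struct epoch *e, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[e[i].w] = data[e[i].r] + 1;
}

/* Sequential simulation of TLS. Returns the number of squashed epochs. */
int run_speculatively(int *data, struct epoch *e, size_t n)
{
    int squashes = 0;
    assert(data != NULL && e != NULL);

    /* Speculate: every epoch runs against pre-loop memory, as if all
     * iterations started at once; writes are buffered, not stored. */
    for (size_t i = 0; i < n; i++)
        e[i].write_val = data[e[i].r] + 1;

    /* Commit in original loop order. */
    for (size_t i = 0; i < n; i++) {
        int violated = 0;
        /* Violation: an earlier epoch wrote the location this one read. */
        for (size_t j = 0; j < i; j++)
            if (e[j].w == e[i].r)
                violated = 1;
        if (violated) {
            /* Squash and re-execute now that earlier epochs have committed. */
            squashes++;
            e[i].write_val = data[e[i].r] + 1;
        }
        data[e[i].w] = e[i].write_val;   /* commit modification */
    }
    return squashes;
}
```

Running both functions on the same input should leave identical memory; the return value of `run_speculatively` counts how often the optimistic guess was wrong.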

Example of Thread-Level Speculation
Code to parallelize:
    for ( … ) {
        …
        *p = …;
        …
        … = … … *q;
        …
    }
• Un-parallelizable through parallelizing compilers
  • Uncertain dependence between *p and *q
  • Might be runtime or user-input dependent
Break loop iterations into threads, explore uncertainty in each thread
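As a concrete, hypothetical instance of this loop, suppose `*p` and `*q` are elements of one array selected through index arrays that are only filled in at run time. The C sketch below uses illustrative names (`run_loop`, `has_cross_iteration_raw`, `w_idx`, `r_idx`) that do not come from the talk. The check covers only the read-after-write case, where a later iteration reads a location that an earlier iteration writes.

```c
#include <assert.h>
#include <stddef.h>

/* The slide's loop, made concrete: iteration i writes through
 * p = &data[w_idx[i]] and then reads through q = &data[r_idx[i]].
 * A compiler cannot tell whether p and q ever refer to the same
 * element, because the index arrays are run-time input. */
void run_loop(int *data, int *out,
              const size_t *w_idx, const size_t *r_idx, size_t n)
{
    assert(data != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int *p = &data[w_idx[i]];
        const int *q = &data[r_idx[i]];
        *p = (int)i + 1;    /* *p = ...;  */
        out[i] = *q;        /* ... = *q;  */
    }
}

/* Brute-force run-time check: returns 1 if some iteration j reads a
 * location written by an earlier iteration i < j, else 0. */
int has_cross_iteration_raw(const size_t *w_idx, const size_t *r_idx, size_t n)
{
    for (size_t j = 1; j < n; j++)
        for (size_t i = 0; i < j; i++)
            if (w_idx[i] == r_idx[j])
                return 1;
    return 0;
}
```

With `w_idx = {0, 1, 2}` the loop is dependence-free for `r_idx = {3, 4, 5}` but not for `r_idx = {3, 0, 5}`, which is the kind of input-dependent outcome the slide describes.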

[Figure: TLS execution timeline (axis: Exec. Time) with labels "…*q", "violation", "*p…", "Recover", "…*q"]
exploit available thread-level parallelism
How Thread-Level Speculation works

Thread-Level Speculation quick summary
• Benefits
  • Reduce inter-thread communication time among cores
  • Scale
  • New parallel programming model
• Types of implementations
  • Hardware only
  • Combined hardware and software
  • Software only
Thread-Level Speculation is good for Chip Multi-Processor

Thread-Level Speculation Implementation Diagram
[Diagram: Thread-Level Speculation; SW-only approach, HW-only approach, our approach]
Overall picture of Thread-Level Speculation

Thread-Level Speculation Implementation Comparison
• Hardware-only approach
  • Lots of research
  • Good speedup in simulation
  • Nobody has built it yet
    • costly, risky
    • needs both HW + SW at the same time
  • Outcome
    • HW-only TLS looks promising
    • Significant hardware changes
• Software-only approach: limited work, limited progress
  • Major problem: high overhead
    • Buffer memory for speculative states
    • Track each memory read + write: violation detection
    • Recover from failed speculation: re-execution
Quick summary on HW-only and SW-only approaches

Outline for the rest of the talk
• Hardware TLS schemes
• Software TLS schemes
• Our scheme
• Our goals
• Starting point
• Potential applications
• Conclusion

Hardware-only Thread-Level Speculation
[Diagram: Thread-Level Speculation; SW-only approach, HW-only approach, our approach]
Overall picture of HW-only TLS approach

Hardware Thread-Level Speculation Schemes
• Lots of hardware TLS research
  • CMU Stampede
  • Stanford Hydra
  • Wisconsin Multiscalar
  • UIUC I-ACOMA
  • UMN Super-threaded architecture
  • …
• Convergence of hardware schemes
  • Use cache to buffer speculative state
  • Extend cache coherence protocol to track data dependence
Convergence of HW-only Thread-Level Speculation

Hardware TLS Schemes: quick summary
• Result: TLS is promising
  • SPECint improvement: 30% - 100%
  • Depends on aggressiveness of the hardware support
[Figure: processors (P) with caches (C) holding speculative state (Sp-state), plus a non-speculative cache (C). Caption: CMP with hardware speculative buffer and enhanced cache coherence protocol]
Convergence of HW-only Thread-Level Speculation

Software-only Thread-Level Speculation
[Diagram: Thread-Level Speculation; SW-only approach, HW-only approach, our approach]
Overall picture of SW-only TLS approach

Software-only Thread-Level Speculation Schemes
• LRPD Test: UIUC
• VM for dependence tracking: Spiros’s, CMU
• Cintra’s SW TLS: U Edinburgh
• Problem of software-only approach: high overhead
  • Try to reduce it
overview of SW-only TLS approach

LRPD Test (UIUC)
[Figure: execution timeline (axis: Exec. Time); software dependence tracking during parallel execution, then the check "was parallel execution safe?"]
+ implemented entirely in software
– applies only to array-based code
– no partial parallelism: entire loop will re-execute sequentially if there is any dependence
Pros + Cons of LRPD

Dependence tracking using Virtual Memory
[Figure: execution timeline (axis: Exec. Time); software dependence tracking through VM pages; "Synchronize: transfer VM pages?"]
Pros + Cons of VM Tracking

CMU Spiros’s approach: Dependence tracking using Virtual Memory
• Coarse-grain, software-only
• Based on memory tracking
  • virtual memory page protection mechanism
  • use software DSM (TreadMarks)
• Synchronization through VM pages
• Cost analysis
  • Overhead is prohibitive
  • 2 sec sequential vs. 5 min parallel
  • Not a viable approach at this coarse level of granularity
SW-TLS through VM Tracking is not attractive

Cintra’s SW TLS: Memory tracking tuned for performance
[Figure: execution timeline (axis: Exec. Time); efficient tracking for array references]
Efficient but custom-made for arrays only

Cintra’s software-only Thread-Level Speculation: quick summary
• Features
  • Software simulation of an extended cache coherence protocol
  • Provide speculative state transition table
  • Violation detection through speculative state comparison
  • Instrument each load and store
• Pros + Cons:
  • + advanced implementation of LRPD test
  • + implemented entirely in software
  • + covers partial parallelism
  • – hand-crafted code for performance
  • – applies only to array-based code
Summary of Cintra’s work

Problems with Software Thread-Level Speculation
• High overhead
  • Buffer speculative state
  • Track data dependence for all memory references
  • Re-execute in case of failed speculation
• Potential speedup
  • largely unexplored
• Possible directions for future research
  • Reduce overhead
  • Achieve speedup from TLS parallelism
Summary of Software TLS

Our current Thread-Level Speculation approach
[Diagram: Thread-Level Speculation; SW-only approach, HW-only approach, our approach]
Overall position for our SW TLS approach

Long term future plan
• Goals
  • Target
    • Chip Multi-Processors
    • Tightly-coupled MPs
  • Apply to general-purpose code: not only arrays
  • Minimize overhead
• Capitalize on compiler analysis and optimizations
  • Idempotency analysis <done>
  • Synchronization and communications <done>
  • PPA: Probabilistic pointer analysis Framework (Jeff’s work) <progressing>
  • Minimal backup and buffer retrieval analysis <progressing>
  • … more analysis we will invent <todo>
• SW-only approach: room to improve
• Starting point: highly efficient software checkpointing
Goals and Plans

Starting point: efficient software checkpointing
Software checkpointing
• Some program points in source code
• Buffer state changes between the current execution point and its latest checkpoint
• Execution can always efficiently rewind to its latest checkpoint
[Figure: program execution timeline with checkpoints; "Buffer memory changes", then "Buffer more memory changes"]
Introduce software checkpointing

Potential use of Software checkpointing
• Software Rollback
  • automatic software TLS support
  • foundation of future automatic TLS parallelization
• Debug
  • controlled rewind
• Enhance application reliability
• Speculative optimizations in uni-processor program
  • larger window size
  • deep branch speculation
  • speculative code motion
what can software checkpointing do

Software checkpointing schemes
• Compiler analysis
  • Local: Basic Block level
    • Back up only needed memory writes
    • Optimize to minimize
      • number of backups
      • number of buffer retrievals
  • Global: procedural level
    • Populate buffers through control-flow graph
    • Iterate until buffer stabilizes
  • Inter-procedural level
• Potential approaches for software backup
  • Undo backup
  • Todo backup
build software checkpointing

Undo backup
• Compile-time analysis
• Backup once
  • per distinct memory write
  • per Basic Block
• Program continues to operate on non-backup memory
• Action upon execution completion
  • Commit: trash buffer
  • Rollback: restore from buffer
undo backup properties

Undo backup example
Program, Basic Block level:
    …
    a = 10;
    b = 12;
    …
    c = a + b;
    …
Undo backup memory: (&a, [a]), (&b, [b]), (&c, [c])
Undo backup action: conflicts check
• Y: restore undo memory
• N: trash undo memory
Next Basic Block …
undo backup process
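The undo scheme on this slide can be sketched as a small log of (address, old value) pairs in C. This is a minimal illustration under stated assumptions, not the talk's implementation: the names `undo_log`, `undo_backup`, `undo_commit`, `undo_rollback`, and `basic_block` are hypothetical, the fixed capacity is arbitrary, and the run-time duplicate check stands in for the "backup once per distinct memory write" decision that the talk assigns to compile-time analysis.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_CAPACITY 16

/* One (address, old value) pair, written (&a, [a]) on the slide. */
struct undo_entry {
    int *addr;
    int  old_value;
};

struct undo_log {
    struct undo_entry entries[UNDO_CAPACITY];
    size_t count;
};

/* Save the current value at addr, once per distinct address. */
void undo_backup(struct undo_log *undo, int *addr)
{
    for (size_t i = 0; i < undo->count; i++)
        if (undo->entries[i].addr == addr)
            return;                         /* already backed up */
    assert(undo->count < UNDO_CAPACITY);
    undo->entries[undo->count].addr = addr;
    undo->entries[undo->count].old_value = *addr;
    undo->count++;
}

/* Commit: the program already wrote to real memory, so drop the log. */
void undo_commit(struct undo_log *undo)
{
    undo->count = 0;
}

/* Rollback: restore every saved value, newest first, then drop the log. */
void undo_rollback(struct undo_log *undo)
{
    while (undo->count > 0) {
        undo->count--;
        *undo->entries[undo->count].addr = undo->entries[undo->count].old_value;
    }
}

/* The basic block from the slide, with backups placed before the writes. */
void basic_block(struct undo_log *undo, int *a, int *b, int *c)
{
    undo_backup(undo, a);
    undo_backup(undo, b);
    undo_backup(undo, c);
    *a = 10;
    *b = 12;
    *c = *a + *b;   /* reads go straight to memory, no buffer lookup */
}
```

After `basic_block` runs, a conflict would call `undo_rollback` to restore `a`, `b`, and `c`; otherwise `undo_commit` discards the log before the next basic block.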

Todo backup
• Performed at runtime
• Happens on every memory write inside a Basic Block
• Each following read might need to retrieve from buffer
• Action upon completion (reverse of Undo type)
  • Commit: write-back from buffer
  • Rollback: trash buffer
todo backup properties

Todo backup example
Program, Basic Block level:
    …
    *p = a;
    *q = b;
    …
    … *p + *q;
    …
Todo backup memory: (p, a), (q, b)
Conflicts check
• Y: trash todo backup
• N: write todo backup to memory
Next Block …
todo backup process
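The todo scheme on this slide can be sketched as a write buffer of (address, new value) pairs in C. This is an illustration under stated assumptions rather than the talk's design: the names `todo_buffer`, `todo_store`, `todo_load`, `todo_commit`, and `todo_rollback` are hypothetical, the capacity is arbitrary, and repeated stores to one address are coalesced into a single entry, which is one possible choice and also keeps aliased `p` and `q` consistent.

```c
#include <assert.h>
#include <stddef.h>

#define TODO_CAPACITY 16

/* One (address, new value) pair, written (p, a) on the slide. */
struct todo_entry {
    int *addr;
    int  new_value;
};

struct todo_buffer {
    struct todo_entry entries[TODO_CAPACITY];
    size_t count;
};

/* Speculative store: memory stays untouched, the value goes to the buffer. */
void todo_store(struct todo_buffer *buf, int *addr, int value)
{
    for (size_t i = 0; i < buf->count; i++) {
        if (buf->entries[i].addr == addr) {
            buf->entries[i].new_value = value;   /* overwrite earlier store */
            return;
        }
    }
    assert(buf->count < TODO_CAPACITY);
    buf->entries[buf->count].addr = addr;
    buf->entries[buf->count].new_value = value;
    buf->count++;
}

/* Speculative load: prefer a buffered value, else fall back to memory. */
int todo_load(const struct todo_buffer *buf, const int *addr)
{
    for (size_t i = 0; i < buf->count; i++)
        if (buf->entries[i].addr == addr)
            return buf->entries[i].new_value;
    return *addr;
}

/* Commit: write every buffered value back to memory. */
void todo_commit(struct todo_buffer *buf)
{
    for (size_t i = 0; i < buf->count; i++)
        *buf->entries[i].addr = buf->entries[i].new_value;
    buf->count = 0;
}

/* Rollback: memory was never modified, so discarding the buffer suffices. */
void todo_rollback(struct todo_buffer *buf)
{
    buf->count = 0;
}
```

In this sketch the slide's `*p = a; *q = b;` becomes two `todo_store` calls and `*p + *q` becomes `todo_load(&buf, p) + todo_load(&buf, q)`. The buffer search on every load is the read-side cost that the talk lists against the todo scheme.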

Backup Comparison
• Undo
  • Pro: fast
    • Few backups
    • No need to retrieve from buffer for read
  • Con: Memory address needs to be known statically
    • Scalar
    • Pointer to fixed location
• Todo
  • Pro
    • Handles both scalar and general-purpose pointer cases
  • Con: slow
    • Backup once per memory write
    • Need to retrieve each following read from buffer
• In reality: both types are used
pros + cons of undo and todo

An example in reality: mixed mode
Code to execute:
    int a, b, c;
    int *p, *q;
    …
    (d) a = 1;
    (d) b = 2;
    (d) *p = 5;
    …
    (u) c = a + b;
    …
    (u) … = *q;
    …
Undo buffer: (&a, [a]), (&b, [b]), (&c, [c])
Todo buffer: (p, 5)
combined-backup process in reality

Selection of backups in reality
• Combined approach
  • Undo: memory address known
    • Scalars
    • Pointers to fixed address
    • Compile-time analysis
  • Todo: memory address unknown
    • Normal pointers
    • Run-time analysis
• Plan for implementation
  • put into SUIF, as an optimization pass
  • Minimize performance drop
use both types together in reality

Conclusion
• Thread-Level Speculation is compelling
  • Potential large performance gains
• Challenge
  • Software overhead
• Limited SW TLS work
  • No previous SW TLS working on general-purpose programs
• Killer advantage: compiler analyses
• Modest starting point
  • efficient software checkpointing
summary

Questions and Answers

Concurrent HW-only Related Work
Another view of HW-only Thread-Level Speculation Schemes
