A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University
Outline • Motivation • Thread-level speculation (TLS) • Coherence scheme • Optimizations • Methodology • Results • Conclusion
Motivation • Leading chip manufacturers are moving to multi-core architectures • These are usually used to increase throughput • To exploit these parallel resources for higher single-program performance, programs need to be parallelized • Integer programs are hard to parallelize • Use speculation – thread-level speculation (TLS)!
Scalable Approach • The paper aims at a scalable approach that applies to a wide variety of multiprocessor architectures • The only requirement is that the architecture be shared-memory based • TLS is implemented on top of the invalidation-based cache coherence protocol
Example • Each cache line has special bits • SL – a speculative load has accessed the line • SM – the line has been speculatively modified • A thread is squashed when an incoming invalidation finds that • the line is present, • its SL bit is set, and • the invalidation's epoch number indicates a logically earlier thread
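As a rough illustration (not the paper's exact hardware logic), the squash condition on an incoming invalidation can be expressed as a simple predicate; the struct layout and names below are assumptions made for this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;  /* line is present in this cache */
    bool     sl;     /* SL: speculatively loaded      */
    bool     sm;     /* SM: speculatively modified    */
    uint32_t tag;    /* usual address tag             */
} cache_line_t;

/* An arriving invalidation squashes the local epoch only if the line is
 * present, it was speculatively loaded, and the invalidation comes from a
 * logically earlier epoch. */
bool must_squash(const cache_line_t *line, bool from_earlier_epoch)
{
    return line->valid && line->sl && from_earlier_epoch;
}
```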
Speculation level • We are concerned only with the speculation level – the level in the cache hierarchy where the speculative coherence protocol begins • All other levels can be ignored
Cache line states • Apart from the usual cache state bits, we need the SL and SM bits • A cache line with a speculative bit set cannot be replaced • Instead, the thread is squashed or the offending operation is delayed
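A minimal sketch of that replacement rule, assuming a hypothetical on_replacement hook; whether to stall the access or squash the epoch is a policy choice, and all names are illustrative:

```c
#include <stdbool.h>

typedef enum { EVICT_OK, EVICT_STALL, EVICT_SQUASH } evict_action_t;

typedef struct {
    bool sl;   /* speculatively loaded   */
    bool sm;   /* speculatively modified */
} spec_bits_t;

/* A line holding speculative state cannot simply be written back or
 * dropped: either the access that needs the frame is delayed until the
 * epoch commits or squashes, or the epoch is squashed right away. */
evict_action_t on_replacement(const spec_bits_t *bits, bool can_stall)
{
    if (!bits->sl && !bits->sm)
        return EVICT_OK;             /* no speculative state: evict normally */
    return can_stall ? EVICT_STALL   /* delay the conflicting access         */
                     : EVICT_SQUASH; /* otherwise squash the current epoch   */
}
```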
Basic cache coherence protocol • When a processor wants to load a value, it needs at least shared access to the line • When it wants to write, it needs exclusive access • The coherence mechanism issues invalidation messages when it receives a request for exclusive access
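For reference, a toy sketch of those underlying invalidation-based protocol decisions (shared access for loads, exclusive access for stores, invalidations on exclusive requests); the helper names are assumptions, not the paper's interface:

```c
#include <stdbool.h>

typedef enum { REQ_NONE, REQ_SHARED, REQ_EXCLUSIVE } bus_request_t;
typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE } base_state_t;

/* What must be requested from the coherence mechanism for this access? */
bus_request_t required_request(base_state_t state, bool is_store)
{
    if (is_store)
        return (state == LINE_EXCLUSIVE) ? REQ_NONE : REQ_EXCLUSIVE;
    return (state == LINE_INVALID) ? REQ_SHARED : REQ_NONE;
}

/* Directory/bus side: an exclusive request triggers invalidation messages
 * to every other current sharer (modeled here as a callback). */
typedef void (*invalidate_fn)(int sharer_id);

void handle_exclusive_request(const bool *sharers, int n_sharers,
                              int requester, invalidate_fn invalidate)
{
    for (int i = 0; i < n_sharers; i++)
        if (sharers[i] && i != requester)
            invalidate(i);
}
```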
Commit • Once the homefree token arrives, there is no possibility of further squashes • SpE is changed to E and SpS to S • Lines with the SM bit set must have the D (dirty) bit set • If a line was speculatively modified while only shared, exclusive access must be obtained for that line • The ownership required buffer (ORB) is used to track such lines
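A minimal software sketch of commit, assuming an illustrative SpE/SpS state encoding and an orb_request_ownership stand-in for the hardware upgrade request; this is not the paper's state machine, only the sequence the slide describes:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { INVALID, S, E, SpS, SpE } line_state_t;

typedef struct {
    line_state_t state;
    bool sl, sm;    /* speculative bits */
    bool dirty;     /* D bit            */
} spec_line_t;

/* Stand-in for the hardware upgrade request issued for each ORB entry: a
 * speculatively modified line held only in shared state needs exclusive
 * access before its modifications may become visible. */
static void orb_request_ownership(size_t line_index)
{
    printf("requesting ownership of line %zu\n", line_index);
}

void commit_epoch(spec_line_t *lines, size_t n,
                  const size_t *orb, size_t orb_entries)
{
    /* 1. Obtain exclusive access for every line recorded in the ORB. */
    for (size_t i = 0; i < orb_entries; i++)
        orb_request_ownership(orb[i]);

    /* 2. Convert speculative states to committed states. */
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm) {
            lines[i].state = E;       /* modified lines end up exclusive... */
            lines[i].dirty = true;    /* ...with the D bit set              */
        } else if (lines[i].state == SpE) {
            lines[i].state = E;       /* SpE -> E */
        } else if (lines[i].state == SpS) {
            lines[i].state = S;       /* SpS -> S */
        }
        lines[i].sl = false;          /* clear the speculative bits */
        lines[i].sm = false;
    }
}
```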
Squash • All speculatively modified lines have to be invalidated • SpE is changed to E and SpS to S
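A companion sketch for squash, using the same illustrative types as the commit sketch above: speculatively modified lines are invalidated, and the remaining speculative states fall back to their non-speculative counterparts.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { INVALID, S, E, SpS, SpE } line_state_t;

typedef struct {
    line_state_t state;
    bool sl, sm;    /* speculative bits */
} spec_line_t;

void squash_epoch(spec_line_t *lines, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm) {
            lines[i].state = INVALID;   /* discard speculative modifications */
        } else if (lines[i].state == SpE) {
            lines[i].state = E;         /* SpE -> E */
        } else if (lines[i].state == SpS) {
            lines[i].state = S;         /* SpS -> S */
        }
        lines[i].sl = false;
        lines[i].sm = false;
    }
}
```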
Performance Optimizations • Forwarding Data Between Epochs: predictable data dependences are synchronized rather than speculated on (a software analogy is sketched below) • Dirty and Speculatively Loaded State: usually, a dirty line that is speculatively loaded must be flushed first – this can be avoided • Suspending Violations: when a speculative line must be evicted, the epoch can be suspended rather than squashed
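Below is a software analogy (pthreads, not the hardware forwarding mechanism) of forwarding one predictable cross-epoch value with an explicit wait/signal pair instead of speculating on it; all names are illustrative:

```c
#include <pthread.h>

static pthread_mutex_t fwd_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  fwd_ready = PTHREAD_COND_INITIALIZER;
static int forwarded_value;
static int value_is_ready = 0;

/* Producer epoch: compute the dependent value early and signal it to the
 * logically next epoch. */
void signal_forwarded_value(int v)
{
    pthread_mutex_lock(&fwd_lock);
    forwarded_value = v;
    value_is_ready = 1;
    pthread_cond_signal(&fwd_ready);
    pthread_mutex_unlock(&fwd_lock);
}

/* Consumer epoch: wait for the forwarded value instead of loading it
 * speculatively and risking a violation. */
int wait_for_forwarded_value(void)
{
    pthread_mutex_lock(&fwd_lock);
    while (!value_is_ready)
        pthread_cond_wait(&fwd_ready, &fwd_lock);
    int v = forwarded_value;
    pthread_mutex_unlock(&fwd_lock);
    return v;
}
```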
Multiple writers • If two epochs write to the same line, one would have to be squashed to avoid the multiple-writer problem • This can be avoided by maintaining fine-grained disambiguation bits
Epoch numbers • Each has two parts – a TID and a sequence number • To avoid a costly comparison on every access, the difference is precomputed and a logically-later mask is formed • Epoch numbers are maintained in one place per chip
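A sketch of how such a precomputed logically-later mask could work, assuming MAX_CONTEXTS contexts per chip and a simplified ordering based only on sequence numbers (the real scheme also accounts for TIDs and fork order); all names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_CONTEXTS 32   /* assumed number of thread contexts per chip */

typedef struct {
    uint32_t tid;   /* thread identifier                 */
    uint64_t seq;   /* sequence number within the thread */
} epoch_num_t;

/* Full comparison: is 'other' logically later than 'mine'?  Simplified to
 * a sequence-number comparison for this sketch. */
bool logically_later(const epoch_num_t *mine, const epoch_num_t *other)
{
    return other->seq > mine->seq;
}

/* Precomputed once when an epoch starts: one bit per context, set if that
 * context is currently running a logically later epoch. */
uint32_t build_later_mask(const epoch_num_t *mine,
                          const epoch_num_t ctx_epochs[MAX_CONTEXTS])
{
    uint32_t mask = 0;
    for (int c = 0; c < MAX_CONTEXTS; c++)
        if (logically_later(mine, &ctx_epochs[c]))
            mask |= 1u << c;
    return mask;
}

/* The per-access check then reduces to a single bit test on the sender's
 * context id, instead of a full epoch-number comparison. */
bool sender_is_logically_later(uint32_t later_mask, int sender_ctx)
{
    return (later_mask >> sender_ctx) & 1u;
}
```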
Multiple writers - implementation • Fine-grained SM and SL bits record which words of a line were speculatively accessed • False violations are also handled in the same way
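A sketch of fine-grained (per-word) SL/SM bits, showing how writes by different epochs to disjoint words of a line can be merged at commit and how false violations on untouched words are filtered out; the line size and names are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 8

typedef struct {
    uint8_t  sl_bits;               /* one SL bit per word */
    uint8_t  sm_bits;               /* one SM bit per word */
    uint32_t data[WORDS_PER_LINE];
} fg_line_t;

/* Record a speculative store to a single word. */
void spec_store(fg_line_t *line, int word, uint32_t value)
{
    line->data[word] = value;
    line->sm_bits |= (uint8_t)(1u << word);
}

/* An invalidation from an earlier epoch only causes a violation if it
 * touches a word this epoch actually speculatively loaded. */
bool violates(const fg_line_t *line, uint8_t written_words_mask)
{
    return (line->sl_bits & written_words_mask) != 0;
}

/* On commit, merge only the speculatively modified words into the committed
 * copy of the line, so multiple writers can coexist without a squash. */
void merge_on_commit(const fg_line_t *spec, uint32_t committed[WORDS_PER_LINE])
{
    for (int w = 0; w < WORDS_PER_LINE; w++)
        if ((spec->sm_bits >> w) & 1u)
            committed[w] = spec->data[w];
}
```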
Correctness considerations • Speculation fails if the speculative state is lost • Exceptions are handled only once the homefree token has been received • System calls are likewise postponed
Methodology • A detailed out-of-order simulator based on the MIPS R10000 is used • Fork and other synchronization overheads are 10 cycles
Results • Results are reported as normalized execution cycles
Results • For buk and equake, memory performance is the bottleneck • ijpeg performance degrades beyond 4 processors • the number of available threads is small • there are some cache conflicts
Overheads • Violations • Cache locality is important • ORB size can be further reduced by releasing ORB entries early
Communication overhead • buk is insensitive to communication latency
Multiprocessor performance • Advantage • More cache storage • Disadvantage • Increased communication latency
Conclusion • Using TLS, even integer programs can be parallelized to obtain speedup • The approach is scalable and can be applied to various other architectures that support multiple threads • Some applications are insensitive to communication latency – so large-scale parallel architectures using TLS are possible