A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University
Outline • Motivation • Thread-level speculation (TLS) • Coherence scheme • Optimizations • Methodology • Results • Conclusion
Motivation • Leading chip manufacturers are moving to multi-core architectures • These are usually used to increase throughput • To exploit these parallel resources for higher single-program performance, programs need to be parallelized • Integer programs are hard to parallelize • Use speculation – thread-level speculation (TLS)!
Scalable Approach • The paper aims at a scalable approach that applies to a wide variety of multiprocessor architectures • The only requirement is that the architecture be shared-memory based • TLS is implemented on top of the invalidation-based cache coherence protocol
Example • Each cache line has special bits • SL – a speculative load has accessed the line • SM – the line has been speculatively modified • A thread is squashed when an incoming invalidation finds that • the line is present, • its SL bit is set, and • the invalidation's epoch number indicates a logically earlier thread
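As a rough illustration (not the paper's exact hardware logic), the squash condition on an incoming invalidation can be expressed as a simple predicate; the struct layout and names below are assumptions made for this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;  /* line is present in this cache */
    bool     sl;     /* SL: speculatively loaded      */
    bool     sm;     /* SM: speculatively modified    */
    uint32_t tag;    /* usual address tag             */
} cache_line_t;

/* An arriving invalidation squashes the local epoch only if the line is
 * present, it was speculatively loaded, and the invalidation comes from a
 * logically earlier epoch. */
bool must_squash(const cache_line_t *line, bool from_earlier_epoch)
{
    return line->valid && line->sl && from_earlier_epoch;
}
```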
Speculation level • We are concerned only with the speculation level – the level in the cache hierarchy where the speculative coherence protocol begins • All other levels can be ignored
Cache line states • Apart from the usual cache state bits, we need the SL and SM bits • A cache line with a speculative bit set cannot be replaced • Instead, the thread is squashed or the offending operation is delayed
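A minimal sketch of that replacement rule, assuming a hypothetical on_replacement hook; whether to stall the access or squash the epoch is a policy choice, and all names are illustrative:

```c
#include <stdbool.h>

typedef enum { EVICT_OK, EVICT_STALL, EVICT_SQUASH } evict_action_t;

typedef struct {
    bool sl;   /* speculatively loaded   */
    bool sm;   /* speculatively modified */
} spec_bits_t;

/* A line holding speculative state cannot simply be written back or
 * dropped: either the access that needs the frame is delayed until the
 * epoch commits or squashes, or the epoch is squashed right away. */
evict_action_t on_replacement(const spec_bits_t *bits, bool can_stall)
{
    if (!bits->sl && !bits->sm)
        return EVICT_OK;             /* no speculative state: evict normally */
    return can_stall ? EVICT_STALL   /* delay the conflicting access         */
                     : EVICT_SQUASH; /* otherwise squash the current epoch   */
}
```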
Basic cache coherence protocol • When a processor wants to load a value, it needs at least shared access to the line • When it wants to write, it needs exclusive access • The coherence mechanism issues invalidation messages when it receives a request for exclusive access
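For reference, a toy sketch of those underlying invalidation-based protocol decisions (shared access for loads, exclusive access for stores, invalidations on exclusive requests); the helper names are assumptions, not the paper's interface:

```c
#include <stdbool.h>

typedef enum { REQ_NONE, REQ_SHARED, REQ_EXCLUSIVE } bus_request_t;
typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE } base_state_t;

/* What must be requested from the coherence mechanism for this access? */
bus_request_t required_request(base_state_t state, bool is_store)
{
    if (is_store)
        return (state == LINE_EXCLUSIVE) ? REQ_NONE : REQ_EXCLUSIVE;
    return (state == LINE_INVALID) ? REQ_SHARED : REQ_NONE;
}

/* Directory/bus side: an exclusive request triggers invalidation messages
 * to every other current sharer (modeled here as a callback). */
typedef void (*invalidate_fn)(int sharer_id);

void handle_exclusive_request(const bool *sharers, int n_sharers,
                              int requester, invalidate_fn invalidate)
{
    for (int i = 0; i < n_sharers; i++)
        if (sharers[i] && i != requester)
            invalidate(i);
}
```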
Commit • Once the homefree token arrives, there is no possibility of further squashes • SpE is changed to E and SpS to S • Lines with the SM bit set must have the D (dirty) bit set • If a line was speculatively modified while only shared, exclusive access must be obtained for that line • The ownership required buffer (ORB) is used to track such lines
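A minimal software sketch of commit, assuming an illustrative SpE/SpS state encoding and an orb_request_ownership stand-in for the hardware upgrade request; this is not the paper's state machine, only the sequence the slide describes:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { INVALID, S, E, SpS, SpE } line_state_t;

typedef struct {
    line_state_t state;
    bool sl, sm;    /* speculative bits */
    bool dirty;     /* D bit            */
} spec_line_t;

/* Stand-in for the hardware upgrade request issued for each ORB entry: a
 * speculatively modified line held only in shared state needs exclusive
 * access before its modifications may become visible. */
static void orb_request_ownership(size_t line_index)
{
    printf("requesting ownership of line %zu\n", line_index);
}

void commit_epoch(spec_line_t *lines, size_t n,
                  const size_t *orb, size_t orb_entries)
{
    /* 1. Obtain exclusive access for every line recorded in the ORB. */
    for (size_t i = 0; i < orb_entries; i++)
        orb_request_ownership(orb[i]);

    /* 2. Convert speculative states to committed states. */
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm) {
            lines[i].state = E;       /* modified lines end up exclusive... */
            lines[i].dirty = true;    /* ...with the D bit set              */
        } else if (lines[i].state == SpE) {
            lines[i].state = E;       /* SpE -> E */
        } else if (lines[i].state == SpS) {
            lines[i].state = S;       /* SpS -> S */
        }
        lines[i].sl = false;          /* clear the speculative bits */
        lines[i].sm = false;
    }
}
```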
Squash • All speculatively modified lines have to be invalidated • SpE is changed to E and SpS to S
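A companion sketch for squash, using the same illustrative types as the commit sketch above: speculatively modified lines are invalidated, and the remaining speculative states fall back to their non-speculative counterparts.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { INVALID, S, E, SpS, SpE } line_state_t;

typedef struct {
    line_state_t state;
    bool sl, sm;    /* speculative bits */
} spec_line_t;

void squash_epoch(spec_line_t *lines, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm) {
            lines[i].state = INVALID;   /* discard speculative modifications */
        } else if (lines[i].state == SpE) {
            lines[i].state = E;         /* SpE -> E */
        } else if (lines[i].state == SpS) {
            lines[i].state = S;         /* SpS -> S */
        }
        lines[i].sl = false;
        lines[i].sm = false;
    }
}
```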
Performance Optimizations • Forwarding Data Between Epochs: predictable data dependences are synchronized rather than speculated on (a software analogy is sketched below) • Dirty and Speculatively Loaded State: usually, a dirty line that is speculatively loaded must be flushed first – this can be avoided • Suspending Violations: when a speculative line must be evicted, the epoch can be suspended rather than squashed
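Below is a software analogy (pthreads, not the hardware forwarding mechanism) of forwarding one predictable cross-epoch value with an explicit wait/signal pair instead of speculating on it; all names are illustrative:

```c
#include <pthread.h>

static pthread_mutex_t fwd_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  fwd_ready = PTHREAD_COND_INITIALIZER;
static int forwarded_value;
static int value_is_ready = 0;

/* Producer epoch: compute the dependent value early and signal it to the
 * logically next epoch. */
void signal_forwarded_value(int v)
{
    pthread_mutex_lock(&fwd_lock);
    forwarded_value = v;
    value_is_ready = 1;
    pthread_cond_signal(&fwd_ready);
    pthread_mutex_unlock(&fwd_lock);
}

/* Consumer epoch: wait for the forwarded value instead of loading it
 * speculatively and risking a violation. */
int wait_for_forwarded_value(void)
{
    pthread_mutex_lock(&fwd_lock);
    while (!value_is_ready)
        pthread_cond_wait(&fwd_ready, &fwd_lock);
    int v = forwarded_value;
    pthread_mutex_unlock(&fwd_lock);
    return v;
}
```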
Multiple writers • If two epochs write to the same line, one would have to be squashed to avoid the multiple-writer problem • This can be avoided by maintaining fine-grained disambiguation bits
Epoch numbers • Each has two parts – a TID and a sequence number • To avoid a costly comparison on every access, the difference is precomputed and a logically-later mask is formed • Epoch numbers are maintained in one place per chip
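A sketch of how such a precomputed logically-later mask could work, assuming MAX_CONTEXTS contexts per chip and a simplified ordering based only on sequence numbers (the real scheme also accounts for TIDs and fork order); all names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_CONTEXTS 32   /* assumed number of thread contexts per chip */

typedef struct {
    uint32_t tid;   /* thread identifier                 */
    uint64_t seq;   /* sequence number within the thread */
} epoch_num_t;

/* Full comparison: is 'other' logically later than 'mine'?  Simplified to
 * a sequence-number comparison for this sketch. */
bool logically_later(const epoch_num_t *mine, const epoch_num_t *other)
{
    return other->seq > mine->seq;
}

/* Precomputed once when an epoch starts: one bit per context, set if that
 * context is currently running a logically later epoch. */
uint32_t build_later_mask(const epoch_num_t *mine,
                          const epoch_num_t ctx_epochs[MAX_CONTEXTS])
{
    uint32_t mask = 0;
    for (int c = 0; c < MAX_CONTEXTS; c++)
        if (logically_later(mine, &ctx_epochs[c]))
            mask |= 1u << c;
    return mask;
}

/* The per-access check then reduces to a single bit test on the sender's
 * context id, instead of a full epoch-number comparison. */
bool sender_is_logically_later(uint32_t later_mask, int sender_ctx)
{
    return (later_mask >> sender_ctx) & 1u;
}
```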
Multiple writers - implementation • Fine-grained SM and SL bits record which words of a line were speculatively accessed • False violations are also handled in the same way
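A sketch of fine-grained (per-word) SL/SM bits, showing how writes by different epochs to disjoint words of a line can be merged at commit and how false violations on untouched words are filtered out; the line size and names are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 8

typedef struct {
    uint8_t  sl_bits;               /* one SL bit per word */
    uint8_t  sm_bits;               /* one SM bit per word */
    uint32_t data[WORDS_PER_LINE];
} fg_line_t;

/* Record a speculative store to a single word. */
void spec_store(fg_line_t *line, int word, uint32_t value)
{
    line->data[word] = value;
    line->sm_bits |= (uint8_t)(1u << word);
}

/* An invalidation from an earlier epoch only causes a violation if it
 * touches a word this epoch actually speculatively loaded. */
bool violates(const fg_line_t *line, uint8_t written_words_mask)
{
    return (line->sl_bits & written_words_mask) != 0;
}

/* On commit, merge only the speculatively modified words into the committed
 * copy of the line, so multiple writers can coexist without a squash. */
void merge_on_commit(const fg_line_t *spec, uint32_t committed[WORDS_PER_LINE])
{
    for (int w = 0; w < WORDS_PER_LINE; w++)
        if ((spec->sm_bits >> w) & 1u)
            committed[w] = spec->data[w];
}
```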
Correctness considerations • Speculation fails if the speculative state is lost • Exceptions are handled only once the homefree token has been received • System calls are likewise postponed
Methodology • A detailed out-of-order simulator based on the MIPS R10000 is used • Fork and other synchronization overheads are 10 cycles
Results • Results are reported as normalized execution cycles
Results • For buk and equake, memory performance is the bottleneck • ijpeg performance degrades beyond 4 processors • the number of available threads is small • there are some cache conflicts
Overheads • Violations • Cache locality is important • ORB size can be further reduced by releasing ORB entries early
Communication overhead • buk is insensitive to communication latency
Multiprocessor performance • Advantage • More cache storage • Disadvantage • Increased communication latency
Conclusion • Using TLS, even integer programs can be parallelized to obtain speedup • The approach is scalable and can be applied to various other architectures that support multiple threads • Some applications are insensitive to communication latency – so large-scale parallel architectures using TLS are possible