
A Scalable Approach to Thread-Level Speculation

A Scalable Approach to Thread-Level Speculation. J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University. Outline. Motivation Thread level speculation (TLS) Coherence scheme Optimizations Methodology Results Conclusion. Motivation.



  1. A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University

  2. Outline • Motivation • Thread level speculation (TLS) • Coherence scheme • Optimizations • Methodology • Results • Conclusion

  3. Motivation • Leading chip manufacturers are moving to multi-core architectures • Multiple cores are usually used to increase throughput • To exploit these parallel resources for single-program performance, programs must be parallelized • Integer programs are hard to parallelize • Use speculation: thread-level speculation (TLS)!

  4. Thread level speculation (TLS)

  5. Scalable Approach • The paper aims to design a scalable approach that applies to a wide variety of multiprocessor-like architectures • The only limitation is that the architecture must be shared-memory based • TLS is implemented on top of an invalidation-based cache coherence protocol

  6. Example • Each cache line has special bits • SL: a speculative load has accessed the line • SM: the line is speculatively modified • When an invalidation arrives, the thread is squashed if all of the following hold • The line is present • SL is set • The epoch number indicates a logically earlier thread
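
The squash conditions on this slide can be sketched in Python as follows. This is a minimal illustration, not the paper's hardware logic: the `CacheLine` class, the `is_violation` function, and the use of plain integers as epoch numbers (smaller meaning logically earlier) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    present: bool = False  # line is currently in the cache
    sl: bool = False       # SL: a speculative load has accessed the line
    sm: bool = False       # SM: the line is speculatively modified


def is_violation(line: CacheLine, sender_epoch: int, my_epoch: int) -> bool:
    """Return True if an incoming invalidation should squash the receiver.

    Assumption: epochs are plain integers and a smaller value is
    logically earlier.
    """
    return line.present and line.sl and sender_epoch < my_epoch
```

For instance, `is_violation(CacheLine(present=True, sl=True), 1, 2)` evaluates to `True`, while the same call with `sl=False` does not report a violation.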

  7. Speculation level • We are concerned only with the speculation level: the level in the cache hierarchy where the cache protocol begins • All other levels can be ignored

  8. Cache line states • Apart from the cache state bits, we need the SL and SM bits • A cache line with speculative bits set cannot be replaced • If such a replacement is needed, either the thread is squashed or the operation is delayed

  9. Basic cache coherence protocol • When a processor wants to load a value, it needs at least shared access to the line • When it wants to write, it needs exclusive access • The coherence mechanism issues invalidation messages when it receives a request for exclusive access
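
The shared/exclusive rule above can be illustrated with a toy model. This is only a sketch under assumptions: the slides do not say how sharers are tracked, so the `Directory` class, its `read` and `write` methods, and the idea of returning the list of processors to invalidate are all hypothetical.

```python
class Directory:
    """Toy tracker of which processors hold each line (hypothetical)."""

    def __init__(self):
        self.sharers = {}  # address -> set of CPU ids with shared access
        self.owner = {}    # address -> CPU id with exclusive access

    def read(self, cpu, addr):
        """Grant shared access; a previous exclusive owner becomes a sharer."""
        holders = self.sharers.setdefault(addr, set())
        previous_owner = self.owner.pop(addr, None)
        if previous_owner is not None:
            holders.add(previous_owner)
        holders.add(cpu)

    def write(self, cpu, addr):
        """Grant exclusive access and return the CPUs that must be invalidated."""
        holders = set(self.sharers.pop(addr, set()))
        previous_owner = self.owner.get(addr)
        if previous_owner is not None:
            holders.add(previous_owner)
        holders.discard(cpu)  # the requester keeps its own copy
        self.owner[addr] = cpu
        return sorted(holders)
```

In this model, two readers followed by a write from a third processor yields invalidations for both readers, which is the event TLS piggybacks on to detect violations.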

  10. Coherence mechanism

  11. Commit • Once the homefree token arrives, there is no possibility of further squashes • SpE is changed to E and SpS to S • Lines with the SM bit set must have the D bit set • If a line is speculatively modified and shared, exclusive access must be obtained for that line • The ownership required buffer (ORB) is used to track such lines
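
The commit steps can be sketched as a small state update. This is an illustrative model under assumptions: the dictionary layout for a line (`state`, `sl`, `sm`, `dirty`), the `commit` function, and representing the ORB as a list of addresses are hypothetical, and acquiring ownership is reduced to a state change instead of a real coherence request.

```python
def commit(lines, orb):
    """Apply the commit transitions once the homefree token has arrived.

    lines: dict mapping address -> {"state", "sl", "sm", "dirty"}
           (hypothetical layout)
    orb:   list of addresses that are speculatively modified but only shared
    """
    # Obtain exclusive access for every line recorded in the ORB.
    for addr in orb:
        lines[addr]["state"] = "SpE"  # stands in for a real ownership request
    orb.clear()

    for line in lines.values():
        # Speculative states fall back to their non-speculative counterparts.
        if line["state"] == "SpE":
            line["state"] = "E"
        elif line["state"] == "SpS":
            line["state"] = "S"
        # Speculative modifications become ordinary dirty data.
        if line["sm"]:
            line["dirty"] = True
        line["sl"] = False
        line["sm"] = False
```

Draining the ORB first means that every speculatively modified line is held exclusively before its modifications are marked dirty.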

  12. Squash • All speculatively modified lines have to be invalidated • For the remaining lines, SpE is changed to E and SpS to S
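
A squash can be sketched in the same style. As before, this is a hypothetical model: the per-line dictionary (`state`, `sl`, `sm`), the `squash` function, and the state name `"I"` for an invalidated line are assumptions made for illustration.

```python
def squash(lines):
    """Discard speculative state after a violation.

    lines: dict mapping address -> {"state", "sl", "sm"} (hypothetical layout)
    """
    for line in lines.values():
        if line["sm"]:
            # Speculatively modified data must not become visible.
            line["state"] = "I"
        elif line["state"] == "SpE":
            line["state"] = "E"
        elif line["state"] == "SpS":
            line["state"] = "S"
        # The speculative bits are cleared in every case.
        line["sl"] = False
        line["sm"] = False
```

Lines that were only speculatively loaded keep their data, so the re-executed epoch can still hit on them in the cache.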

  13. Performance Optimizations • Forwarding Data Between Epochs: • Predictable data dependences are synchronized • Dirty and Speculatively Loaded State: • Usually, if a dirty line is speculatively loaded, it is flushed; this can be avoided • Suspending Violations: • When a speculative line has to be evicted, the epoch does not need to be squashed; it can be suspended instead

  14. Multiple writers • If two epochs write to the same line, one of them has to be squashed to avoid the multiple-writer problem • This can be avoided by maintaining fine-grained disambiguation bits
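
One way to picture fine-grained disambiguation is a per-word write mask for each epoch's copy of a line. The sketch below is an assumption-laden illustration, not the paper's mechanism: the word granularity, `WORDS_PER_LINE`, the `merge_writers` function, and the rule that the logically later epoch wins on overlapping words are all hypothetical.

```python
WORDS_PER_LINE = 4  # assumed line size in words, for illustration only


def merge_writers(base, earlier, earlier_mask, later, later_mask):
    """Combine two epochs' copies of one line using per-word write masks.

    Bit i of a mask is set if that epoch wrote word i. Where both epochs
    wrote the same word, the logically later epoch's value is kept.
    """
    merged = list(base)
    for i in range(WORDS_PER_LINE):
        bit = 1 << i
        if later_mask & bit:
            merged[i] = later[i]
        elif earlier_mask & bit:
            merged[i] = earlier[i]
    return merged
```

With per-word masks, two epochs that write different words of the same line no longer conflict, so neither needs to be squashed.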

  15. Implementation

  16. Epoch numbers • An epoch number has two parts: a TID and a sequence number • To avoid costly comparisons on every access, the differences are precomputed and a logically-later mask is formed • Epoch numbers are maintained in one place per chip
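
The precomputed mask can be sketched as follows. This is a hedged illustration: the `Epoch` tuple, the `logically_later_mask` and `is_later` functions, the list of per-context epochs, and the choice to treat epochs with different TIDs as unordered are assumptions made for the example; sequence-number wraparound is ignored.

```python
from typing import NamedTuple


class Epoch(NamedTuple):
    tid: int  # thread identifier
    seq: int  # sequence number within that thread


def logically_later_mask(me, contexts):
    """Build a bit mask with bit i set if contexts[i] is logically later than me.

    Assumption: only epochs with the same TID are ordered, by sequence number.
    An empty context slot is represented by None.
    """
    mask = 0
    for i, other in enumerate(contexts):
        if other is not None and other.tid == me.tid and other.seq > me.seq:
            mask |= 1 << i
    return mask


def is_later(mask, context_index):
    """Check one context with a single bit test instead of a full comparison."""
    return bool(mask & (1 << context_index))
```

The mask is recomputed only when the set of active epochs changes, so each memory access pays for a bit test and not for a comparison of full epoch numbers.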

  17. Speculative state implementation

  18. Multiple writers - implementation • False violations are also handled in the same way

  19. Correctness considerations • Speculation fails if the speculative state is lost • Exceptions are handled only once the homefree token has been received • System calls are also postponed until then

  20. Methodology • Detailed out-of-order simulation, based on the MIPS R10000 • Fork and other synchronization overheads are 10 cycles

  21. Results • Normalized execution cycles

  22. Results • Buk and equake: memory performance is a bottleneck • Beyond 4 processors, ijpeg performance degrades • Fewer threads are available • Some conflicts occur in the cache

  23. Overheads • Violations • Cache locality is important • ORB size can be further reduced – early release of ORB

  24. Communication overhead • Buk is insensitive

  25. Multiprocessor performance • Advantages • More cache storage • Disadvantage • Increased communication latency

  26. Conclusion • By using TLS even integer programs can be parallelized to get speedup • The approach is scalable and can be applied to various other architectures which support multiple threads • There are applications that are insensitive to communication latency – so large scale parallel architectures using TLS are possible

  27. Thanks!
