
Scalable Thread-Level Speculation Approach for Multithreaded Architectures

This presentation describes a scalable approach to Thread-Level Speculation (TLS) for shared-memory multithreaded machines such as SUN MAJC, IBM Power4, and ALPHA 21464. TLS lets a compiler parallelize code whose threads it cannot prove independent: epochs run speculatively in parallel, the hardware detects data dependence violations, and an epoch that violated a dependence is retried. The goals are to handle arbitrary memory accesses, preserve the performance of non-speculative workloads, and apply to any scale of multithreaded architecture. The scheme extends invalidation-based cache coherence and uses the caches as speculative buffers; because coherence already works at a variety of scales, the scheme scales as well. Related schemes include Multiscalar and Dynamic Multithreading. The talk covers the life cycle of an epoch, speculative coherence, what happens at commit time, and the mechanisms to squash or commit speculative work.


Presentation Transcript


  1. A Scalable Approach to Thread-Level Speculation. J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. Computer Science Department, Carnegie Mellon University.

  2. Multithreaded Machines Are Everywhere. [Diagram: processors (P), caches (C), and threads on top of shared memory.] Examples: SUN MAJC, IBM Power4, ALPHA 21464, Dual Pentium, SGI Origin. How can we use them? Parallelism!

  3. Automatic Parallelization. Proving independence of threads is hard: • complex control flow • complex data structures • pointers, pointers, pointers • run-time inputs. How can we make the compiler’s job feasible? Thread-Level Speculation (TLS).

  4. Example. The loop: while (...) { x = hash[index1]; … hash[index2] = y; ... } On a single processor, successive iterations over time perform: = hash[3] … hash[10] = …; = hash[19] … hash[21] = …; = hash[33] … hash[30] = …; = hash[10] … hash[25] = …

  5. Example of Thread-Level Speculation. Four processors run the iterations in parallel as epochs. Epoch 1: = hash[3] … hash[10] = … Epoch 2: = hash[19] … hash[21] = … Epoch 3: = hash[33] … hash[30] = … Epoch 4: = hash[10] … hash[25] = …

  6. Example of Thread-Level Speculation. Epoch 1: = hash[3] … hash[10] = … Epoch 2: = hash[19] … hash[21] = … Epoch 3: = hash[33] … hash[30] = … Epoch 4: = hash[10] … hash[25] = … Violation! Epoch 4 reads hash[10], which Epoch 1 writes.

  7. Example of Thread-Level Speculation. Epoch 1: = hash[3] … hash[10] = … commit? Epoch 2: = hash[19] … hash[21] = … commit? Epoch 3: = hash[33] … hash[30] = … commit? Epoch 4: = hash[10] … hash[25] = … commit? Violation!

  8. Example of Thread-Level Speculation. Epoch 1: = hash[3] … hash[10] = … commit? Epoch 2: = hash[19] … hash[21] = … commit? Epoch 3: = hash[33] … hash[30] = … commit? Epoch 4: = hash[10] … hash[25] = … commit? Violation! Retry: Epoch 4 runs again: = hash[10] … hash[25] = … commit?
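The example in slides 4 to 8 can be sketched as a small model. This is an illustrative sketch only, not the authors' mechanism: the `EPOCHS` table, `find_violations`, and `run` are hypothetical names, and the model assumes every epoch performs its read before any epoch performs its write, as the slides draw it.

```python
# Hypothetical model of the four-epoch example: each epoch reads one
# hash-table slot early and writes another slot late.
EPOCHS = [
    {"id": 1, "reads": 3, "writes": 10},
    {"id": 2, "reads": 19, "writes": 21},
    {"id": 3, "reads": 33, "writes": 30},
    {"id": 4, "reads": 10, "writes": 25},
]


def find_violations(epochs):
    """Return ids of epochs that read a slot written by an earlier epoch.

    Assumption: all reads happen before any write, so such a read
    consumed a stale value.
    """
    violated = []
    for i, later in enumerate(epochs):
        for earlier in epochs[:i]:
            if later["reads"] == earlier["writes"]:
                violated.append(later["id"])
                break
    return violated


def run(epochs):
    """Commit epochs in order; squash and retry the ones that violated."""
    violated = set(find_violations(epochs))
    log = []
    for epoch in epochs:
        if epoch["id"] in violated:
            log.append((epoch["id"], "squash"))
            log.append((epoch["id"], "retry"))
        log.append((epoch["id"], "commit"))
    return log
```

With the indices from the slides, only Epoch 4 is flagged, because it reads hash[10] and Epoch 1 writes hash[10].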

  9. Goals of Our Approach. 1) Handle arbitrary memory accesses • i.e., not just array references. 2) Preserve performance of non-speculative workloads • keep hardware support minimal and simple. 3) Apply to any scale of multithreaded architecture • CMPs, SMT processors, more traditional MPs. Result: effective, simple, and scalable TLS.

  10. Overview of Our Approach. System requirements: 1) Detect data dependence violations • extend invalidation-based cache coherence. 2) Buffer speculative modifications • use the caches as speculative buffers. Coherence already works at a variety of scales; hence our scheme is also scalable.

  11. Related Schemes. • Wisconsin (Multiscalar, Trace Processor) • Stanford (Hydra) • U.P. Catalunya (Speculative Multithreading) • Intel/U. Portland (Dynamic Multithreading) • Illinois at U.C. (I-ACOMA). Our approach seamlessly scales both up and down.

  12. Outline. • Details of our Approach: life cycle of an epoch; speculative coherence; what happens at commit time; forwarding data between epochs • Performance • Conclusions

  13. Life Cycle of an Epoch. [Timeline diagram. Labels: Spawned; Init; Speculative Work; Becomes Complete, Speculative; Wait to be Homefree?; Pass Homefree; Commit?; Fast Commit; Slow Commit.]

  14. Life Cycle of an Epoch. [Timeline diagram. Labels: Spawned; Speculative Coherence; Becomes Complete, Speculative; Commit?; Pass Homefree; Mechanisms to Squash or Commit.]

  15. MESI Coherence Example. Initial state. Thread A's cache: state Invalid, tag -, data -. Thread B's cache: state Invalid, tag -, data -. Shared Memory: X=2.

  16. MESI Coherence Example. Thread A: Load X. A's cache issues a Read. Thread A's cache: Invalid. Thread B's cache: Invalid. Shared Memory: X=2.

  17. MESI Coherence Example. Thread A: Load X. Read, then Fill. Thread A's cache: state Excl., tag X, data 2. Thread B's cache: Invalid. Shared Memory: X=2.

  18. MESI Coherence Example. Thread A: Load X. Thread B: Store X=3. B's cache issues a Read-Exclusive. Thread A's cache: state Excl., tag X, data 2. Thread B's cache: Invalid. Shared Memory: X=2. Read-exclusive invalidates all other copies.

  19. MESI Coherence Example. Thread A: Load X. Thread B: Store X=3. Read-Exclusive, then Invalidation. Thread A's cache: Invalid. Thread B's cache: Invalid. Shared Memory: X=2. Read-exclusive invalidates all other copies.

  20. MESI Coherence Example. Thread A: Load X. Thread B: Store X=3. Read-Exclusive, Invalidation, then Fill. Thread A's cache: Invalid. Thread B's cache: state Dirty, tag X, data 3. Shared Memory: X, with no value shown. The state ‘dirty’ implies exclusiveness.
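The sequence in slides 15 to 20 can be replayed with a toy two-cache model. This is a minimal sketch under assumptions: it tracks a single line X, and the class and method names (`Cache`, `Bus`, `load`, `store`) are hypothetical. The behavior of a load when another cache already holds the line is not shown in the slides and follows ordinary invalidation-based coherence.

```python
class Cache:
    """One cache holding (at most) the single line X."""

    def __init__(self):
        self.state, self.data = "Invalid", None


class Bus:
    """Toy invalidation-based coherence for one line, two caches."""

    def __init__(self, memory_value):
        self.memory = memory_value
        self.caches = {"A": Cache(), "B": Cache()}

    def load(self, who):
        cache = self.caches[who]
        if cache.state == "Invalid":
            # Read: find other valid copies (not shown in the slides).
            sharers = [c for name, c in self.caches.items()
                       if name != who and c.state != "Invalid"]
            for other in sharers:
                if other.state == "Dirty":
                    self.memory = other.data  # write back before sharing
                other.state = "Shared"
            # Fill.
            cache.data = self.memory
            cache.state = "Shared" if sharers else "Exclusive"
        return cache.data

    def store(self, who, value):
        cache = self.caches[who]
        if cache.state not in ("Exclusive", "Dirty"):
            # Read-exclusive invalidates all other copies.
            for name, other in self.caches.items():
                if name != who:
                    if other.state == "Dirty":
                        self.memory = other.data
                    other.state, other.data = "Invalid", None
        # The state 'Dirty' implies exclusiveness; memory is now stale.
        cache.state, cache.data = "Dirty", value
```

Replaying the slides: after `load("A")` cache A is Exclusive with data 2, and after `store("B", 3)` cache A is Invalid while cache B is Dirty with data 3.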

  21. Speculative Coherence Example. Highlights of our scheme: • detection of a data dependence violation • speculatively modified and shared cache lines. Epoch 4: Load X. Epoch 5: Store X=3. Epoch 6: Load X.

  22. Speculative Coherence Example. Epoch 6: Load X. Epoch 6's cache issues a Read. Epoch 5's cache: Invalid. Epoch 6's cache: Invalid. Shared Memory: X=2.

  23. Speculative Coherence Example. Epoch 6: Load X. Read, then Fill. Epoch 5's cache: Invalid. Epoch 6's cache: state Excl., tag X, data 2, Spec. Loaded. Shared Memory: X=2. Track which lines are speculatively loaded.

  24. Speculative Coherence Example. Epoch 6: Load X. Epoch 5: Store X=3. Epoch 5's cache issues Sp Read-Ex (epoch5). Epoch 5's cache: Invalid. Epoch 6's cache: state Excl., tag X, data 2, Spec. Loaded. Shared Memory: X=2. Speculative messages piggyback the epoch number.

  25. Speculative Coherence Example. Epoch 6: Load X. Epoch 5: Store X=3. Sp Read-Ex (epoch5), then Sp Inv (epoch5) to Epoch 6's cache. Epoch 5's cache: Invalid. Epoch 6's cache: state Excl., tag X, data 2, Spec. Loaded. Shared Memory: X=2. epoch5 < epoch6, and the line is speculatively loaded.

  26. Speculative Coherence Example. Epoch 6: Load X, speculation failed! Epoch 5: Store X=3. Sp Read-Ex (epoch5), Sp Inv (epoch5). Epoch 5's cache: Invalid. Epoch 6's cache: Invalid. Shared Memory: X=2. Speculation fails for epoch 6.

  27. Speculative Coherence Example. Epoch 6: Load X, speculation failed! Epoch 5: Store X=3. Sp Read-Ex (epoch5), Sp Inv (epoch5), then Fill. Epoch 5's cache: state Excl., tag X, data 3, Spec. Modified. Epoch 6's cache: Invalid. Shared Memory: X=2. Track which lines are speculatively modified.
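The violation check in slides 22 to 27 can be sketched in a few lines. This is a hedged illustration, not the authors' hardware: `SpecLine`, `EpochCache`, and their methods are hypothetical names, each cache holds a single line X, and the sketch assumes that a speculative invalidation from a logically later epoch is simply ignored.

```python
from dataclasses import dataclass


@dataclass
class SpecLine:
    state: str = "Invalid"
    data: object = None
    spec_loaded: bool = False    # SL bit
    spec_modified: bool = False  # SM bit


class EpochCache:
    """Cache of one epoch, holding a single line X (assumption)."""

    def __init__(self, epoch):
        self.epoch = epoch
        self.line = SpecLine()
        self.violated = False

    def spec_load(self, memory_value):
        # Read + Fill; track that the line was speculatively loaded.
        self.line = SpecLine("Exclusive", memory_value, spec_loaded=True)
        return memory_value

    def receive_spec_invalidate(self, sender_epoch):
        # Speculative messages piggyback the sender's epoch number.
        if sender_epoch < self.epoch:
            if self.line.spec_loaded:
                self.violated = True  # speculation fails for this epoch
            self.line = SpecLine()
        # Assumption: an invalidation from a later epoch is ignored.

    def spec_store(self, value, others):
        # Sp Read-Ex, delivered to the other caches as Sp Inv.
        for other in others:
            other.receive_spec_invalidate(self.epoch)
        self.line = SpecLine("Exclusive", value, spec_modified=True)
```

Replaying the slides, epoch 6 loads X=2 and epoch 5 then stores X=3: because 5 < 6 and epoch 6's line is speculatively loaded, epoch 6 is marked as violated.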

  28. Speculative Coherence Example. Highlights of our scheme: • detection of a data dependence violation • speculatively modified and shared cache lines. Epoch 4: Load X. Epoch 5: Store X=3. Epoch 6: Load X.

  29. Speculative Coherence Example. Epoch 5: Store X=3. Epoch 4's cache: Invalid. Epoch 5's cache: state Excl., tag X, data 3, Spec. Modified. Shared Memory: X=2.

  30. Speculative Coherence Example. Epoch 4: Load X. Epoch 5: Store X=3. Epoch 4's cache issues a Read. Epoch 4's cache: Invalid. Epoch 5's cache: state Excl., tag X, data 3, Spec. Modified. Shared Memory: X=2.

  31. Speculative Coherence Example. Epoch 4: Load X. Epoch 5: Store X=3. Read, then notify shared. Epoch 4's cache: Invalid. Epoch 5's cache: tag X, data 3, Spec. Modified, Shared. Shared Memory: X=2. Both speculatively modified and shared!

  32. Speculative Coherence Example. Epoch 4: Load X. Epoch 5: Store X=3. Read, notify shared, then Fill. Epoch 4's cache: tag X, data 2, Spec. Loaded, Shared. Epoch 5's cache: tag X, data 3, Spec. Modified, Shared. Shared Memory: X=2. Multiple versions of the same cache line.

  33. Summary of New Speculative Line State. New cache line state: • has it been speculatively loaded? (detect dependence violations) • has it been speculatively modified? (buffer speculative modifications) • is it in a speculative shared or exclusive state? (important performance optimizations). What if a speculative cache line is replaced? • speculation fails for that epoch.

  34. Implementation of Speculative State. [Diagram: a processor's cache with Tag, State, and Data fields for each line.]

  35. Implementation of Speculative State. [Diagram: the same cache with two added bits per line, SL (Speculatively Loaded) and SM (Speculatively Modified).] Modest amount of extra space.

  36. Life Cycle of an Epoch. [Timeline diagram. Labels: Spawned; Speculative Coherence; Becomes Complete, Speculative; Commit?; Squash; Pass Homefree; Mechanisms to Squash or Commit.]

  37. When Speculation Fails. Flash Reset. Cache lines, written as State (SM, SL) with tag and data arbitrary (*): Sp Ex (0, 1); Sp Sh (0, 1); Sp Ex (1, 0); Sp Sh (1, 1).

  38. When Speculation Fails. If Set then Invalidate; Flash Reset. Cache lines, written as State (SM, SL): Excl (0, 0); Shared (0, 0); Sp Ex (1, 0); Sp Sh (1, 0).

  39. When Speculation Fails. Cache lines, written as State (SM, SL): Excl (0, 0); Shared (0, 0); Invalid (0, 0); Invalid (0, 0). Quick bit operation.
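The squash path in slides 37 to 39 amounts to two per-line rules, sketched below. The dictionary layout and the `squash` function are assumptions made for illustration; in hardware the slides describe this as a quick bit operation rather than a loop.

```python
def squash(lines):
    """Apply the 'speculation fails' rules to every cache line.

    Each line is a dict with 'state', 'sm', and 'sl' (hypothetical layout).
    """
    for line in lines:
        if line["sm"]:
            # Speculatively modified data must be discarded.
            line["state"] = "Invalid"
        elif line["state"] == "Sp Ex":
            line["state"] = "Excl"
        elif line["state"] == "Sp Sh":
            line["state"] = "Shared"
        # Flash reset of the speculative bits.
        line["sm"] = line["sl"] = 0
    return lines


# The four lines from slide 37, as State (SM, SL).
SLIDE_37 = [
    {"state": "Sp Ex", "sm": 0, "sl": 1},
    {"state": "Sp Sh", "sm": 0, "sl": 1},
    {"state": "Sp Ex", "sm": 1, "sl": 0},
    {"state": "Sp Sh", "sm": 1, "sl": 1},
]
```

Applied to the lines of slide 37, this yields the states shown in slide 39: Excl, Shared, Invalid, Invalid, with all SM and SL bits cleared.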

  40. Life Cycle of an Epoch. [Timeline diagram. Labels: Spawned; Speculative Coherence; Becomes Complete, Speculative; Commit?; Commit; Pass Homefree; Mechanisms to Squash or Commit.]

  41. When Speculation Succeeds. Flash Reset. Cache lines, written as State (SM, SL) with tag and data arbitrary (*): Sp Ex (0, 1); Sp Sh (0, 1); Sp Ex (1, 0); Sp Sh (1, 1).

  42. When Speculation Succeeds. SM & Exclusive: Become Dirty. Cache lines, written as State (SM, SL): Excl (0, 0); Shared (0, 0); Sp Ex (1, 0); Sp Sh (1, 0).

  43. When Speculation Succeeds. SM & Shared: Need Exclusive Access. Cache lines, written as State (SM, SL): Excl (0, 0); Shared (0, 0); Sp Ex (1, 0); Sp Sh (1, 0). Want to avoid searching entire cache.

  44. When Speculation Succeeds. Cache lines, written as State (SM, SL): Excl (0, 0); Shared (0, 0); Sp Ex (1, 0); Sp Sh with tag X (1, 0). ORB entries: -, -, X. Ownership required buffer (ORB).

  45. When Speculation Succeeds. Cache lines, written as State (SM, SL): Excl (0, 0); Shared (0, 0); Sp Ex (1, 0); Sp Sh with tag X (1, 0). ORB entries: -, -, X. The cache sends Upgrade-Request (X).

  46. When Speculation Succeeds. If SM, Become Dirty; Flash Reset. Cache lines, written as State (SM, SL): Excl (0, 0); Shared (0, 0); Sp Ex (1, 0); Sp Sh with tag X (1, 0). ORB entries: -, -, -. Upgrade-Request (X) is answered by Ack (X).

  47. When Speculation Succeeds. Cache lines, written as State (SM, SL): Excl (0, 0); Shared (0, 0); Dirty (0, 0); Dirty with tag X (0, 0). ORB entries: -, -, -. Flush the ORB, then quick bit operations.
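The commit path in slides 41 to 47 can be sketched the same way. This is an illustration under assumptions: the `commit` function, the dictionary layout, and the `request_upgrade` callback are hypothetical, the callback is assumed to return only after the Ack arrives, and the tags "P", "Q", and "R" stand in for the arbitrary (*) tags of the slides.

```python
def commit(lines, orb, request_upgrade):
    """Commit one epoch's speculative lines.

    lines: list of dicts with 'tag', 'state', 'sm', 'sl'.
    orb:   list of tags that are speculatively modified while shared.
    request_upgrade: callable(tag), assumed to block until the Ack.
    """
    # 1. Flush the ORB: gain exclusive access without searching the cache.
    while orb:
        tag = orb.pop()
        request_upgrade(tag)
        for line in lines:
            if line["tag"] == tag and line["state"] == "Sp Sh":
                line["state"] = "Sp Ex"
    # 2. If SM, become Dirty; then flash reset the speculative bits.
    for line in lines:
        if line["sm"]:
            line["state"] = "Dirty"
        elif line["state"] == "Sp Ex":
            line["state"] = "Excl"
        elif line["state"] == "Sp Sh":
            line["state"] = "Shared"
        line["sm"] = line["sl"] = 0
    return lines


def slide_41_lines():
    # Lines from slide 41; only the tag X appears in the slides.
    return [
        {"tag": "P", "state": "Sp Ex", "sm": 0, "sl": 1},
        {"tag": "Q", "state": "Sp Sh", "sm": 0, "sl": 1},
        {"tag": "R", "state": "Sp Ex", "sm": 1, "sl": 0},
        {"tag": "X", "state": "Sp Sh", "sm": 1, "sl": 1},
    ]
```

Starting from slide 41 with X in the ORB, one upgrade request is issued for X and the final states match slide 47: Excl, Shared, Dirty, Dirty.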

  48. Forwarding Data Between Epochs. • predictable dependences cause frequent violations • compiler inserts wait-signal synchronization. [Diagram: without forwarding, a Load X and a Store X in different epochs; with forwarding, Store X followed by Signal in one epoch, and Wait followed by Load X in the other.] Synchronize to avoid violations.
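The wait-signal pattern of slide 48 can be sketched with ordinary threads. This is an analogy, not the TLS primitives the compiler inserts: `run_with_forwarding`, the `shared` dictionary, and the use of `threading.Event` for Wait and Signal are all assumptions made for illustration.

```python
import threading


def run_with_forwarding():
    """Earlier epoch stores X then signals; later epoch waits then loads X."""
    shared = {"X": 2}
    forwarded = threading.Event()
    seen = []

    def earlier_epoch():
        shared["X"] = 3   # Store X
        forwarded.set()   # Signal

    def later_epoch():
        forwarded.wait()          # Wait
        seen.append(shared["X"])  # Load X

    # Start the later epoch first to show that ordering is enforced by
    # the synchronization, not by the order in which threads start.
    later = threading.Thread(target=later_epoch)
    earlier = threading.Thread(target=earlier_epoch)
    later.start()
    earlier.start()
    later.join()
    earlier.join()
    return seen[0]
```

Because the load cannot run until the signal is sent, the later epoch always observes the stored value, so no dependence violation can occur on X.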

  49. Outline. • Details of our Approach • Performance: simulation infrastructure; single-chip multiprocessor performance; scaling beyond chip boundaries • Conclusions

  50. Simulation Infrastructure. [Diagram: processors (P) and caches (C) connected by a crossbar.] Compiler system and tools based on SUIF • help analyze dependences, insert synchronization • produce MIPS binaries containing TLS primitives. Benchmarks (all run to completion) • buk, compress95, ijpeg, equake. Simulator • superscalar, similar to MIPS R10K • models all bandwidth and contention. Detailed simulation!
