A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Computer Science Department Carnegie Mellon University
Multithreaded Machines Are Everywhere
[Figure: processors (P) with caches (C) above a Shared Memory, running multiple Threads]
• Examples: SUN MAJC, IBM Power4, ALPHA 21464, Dual Pentium, SGI Origin
How can we use them? Parallelism!
Automatic Parallelization
Proving independence of threads is hard:
• complex control flow
• complex data structures
• pointers, pointers, pointers
• run-time inputs
How can we make the compiler’s job feasible? Thread-Level Speculation (TLS)
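Example
    while (...) {
        x = hash[index1];
        …
        hash[index2] = y;
        ...
    }
[Timeline: a single Processor runs the iterations one after another over Time]
• iteration 1: = hash[3] … hash[10] = …
• iteration 2: = hash[19] … hash[21] = …
• iteration 3: = hash[33] … hash[30] = …
• iteration 4: = hash[10] … hash[25] = …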
Example of Thread-Level Speculation
[Timeline: four Processors each run one epoch in parallel over Time]
• Epoch 1: = hash[3] … hash[10] = … commit?
• Epoch 2: = hash[19] … hash[21] = … commit?
• Epoch 3: = hash[33] … hash[30] = … commit?
• Epoch 4: = hash[10] … hash[25] = … commit? Violation! (Epoch 4 loaded hash[10] before Epoch 1 stored to it)
• Retry Epoch 4: = hash[10] … hash[25] = … commit?
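The violation in the example above can be sketched in a few lines of C. This is only an illustration under a simplifying assumption taken from how the timeline is drawn: all epochs start together, so every load has executed before any store does. The names `epoch_access`, `find_violated_epoch`, and `slide_epochs` are hypothetical and do not come from the talk.

```c
#include <stddef.h>

/* One loop iteration (epoch): it loads hash[read_idx], then stores hash[write_idx]. */
struct epoch_access {
    int read_idx;
    int write_idx;
};

/*
 * Assumed timing (as drawn on the slide): every load has already executed
 * by the time any store executes.
 * Returns the 0-based position of the first epoch that loaded a location
 * which a logically earlier epoch stores to, or -1 if no epoch did.
 */
int find_violated_epoch(const struct epoch_access *e, size_t n)
{
    for (size_t later = 1; later < n; later++) {
        for (size_t earlier = 0; earlier < later; earlier++) {
            if (e[later].read_idx == e[earlier].write_idx) {
                return (int)later;
            }
        }
    }
    return -1;
}

/* The four iterations from the slide: Epoch 4 loads hash[10], Epoch 1 stores it. */
const struct epoch_access slide_epochs[4] = {
    { 3, 10 }, { 19, 21 }, { 33, 30 }, { 10, 25 },
};
```

Calling `find_violated_epoch(slide_epochs, 4)` reports position 3, i.e. Epoch 4, which is the epoch the slide retries.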
Goals of Our Approach
1) Handle arbitrary memory accesses
• i.e., not just array references
2) Preserve performance of non-speculative workloads
• keep hardware support minimal and simple
3) Apply to any scale of multithreaded architecture
• CMPs, SMT processors, more traditional MPs
effective, simple, and scalable TLS
Overview of Our Approach
System requirements:
1) Detect data dependence violations
• extend invalidation-based cache coherence
2) Buffer speculative modifications
• use the caches as speculative buffers
Coherence already works at a variety of scales, hence our scheme is also scalable.
Related Schemes
• Wisconsin (Multiscalar, Trace Processor)
• Stanford (Hydra)
• U.P. Catalunya (Speculative Multithreading)
• Intel/U. Portland (Dynamic Multithreading)
• Illinois at U.C. (I-ACOMA)
our approach seamlessly scales both up and down
Outline
• Details of our Approach
  • life cycle of an epoch
  • speculative coherence
  • what happens at commit time
  • forwarding data between epochs
• Performance
• Conclusions
Life Cycle of an Epoch
[Timeline figure over Time; labels: Spawned, Init, Speculative Work, Becomes Complete, Speculative, Wait to be Homefree?, Commit?, Pass Homefree; two cases are shown, Slow Commit and Fast Commit]
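The life cycle can be read as a small state machine. The sketch below is one possible reading, not the hardware described in the talk: the ordering of the stages, the rule that an epoch holding the homefree token can no longer be squashed, and all identifiers (`epoch_state`, `epoch_event`, `epoch_step`) are assumptions made for illustration.

```c
/* Illustrative states and events; the names are hypothetical. */
enum epoch_state {
    EPOCH_SPAWNED,
    EPOCH_SPECULATIVE,   /* doing speculative work */
    EPOCH_COMPLETE,      /* work done, waiting to be homefree */
    EPOCH_COMMITTED,
    EPOCH_SQUASHED
};

enum epoch_event {
    EV_BEGIN_WORK,
    EV_WORK_DONE,
    EV_HOMEFREE,         /* homefree token arrives from the previous epoch */
    EV_VIOLATION
};

struct epoch {
    enum epoch_state state;
    int homefree;        /* nonzero once the token has arrived */
};

void epoch_step(struct epoch *e, enum epoch_event ev)
{
    switch (ev) {
    case EV_BEGIN_WORK:
        /* A squashed epoch restarts its work from the beginning. */
        if (e->state == EPOCH_SPAWNED || e->state == EPOCH_SQUASHED)
            e->state = EPOCH_SPECULATIVE;
        break;
    case EV_WORK_DONE:
        /* If the token is already here, commit at once; otherwise wait. */
        if (e->state == EPOCH_SPECULATIVE)
            e->state = e->homefree ? EPOCH_COMMITTED : EPOCH_COMPLETE;
        break;
    case EV_HOMEFREE:
        e->homefree = 1;
        if (e->state == EPOCH_COMPLETE)
            e->state = EPOCH_COMMITTED;
        break;
    case EV_VIOLATION:
        /* Assumption: only an epoch that is not yet homefree can be squashed. */
        if (!e->homefree && e->state != EPOCH_COMMITTED)
            e->state = EPOCH_SQUASHED;
        break;
    }
}
```

Passing the token on to the next epoch after commit is left out to keep the sketch short.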
Life Cycle of an Epoch
[Timeline figure over Time; labels: Spawned, Becomes Complete, Speculative, Pass Homefree, Commit?]
• Speculative Coherence
• Mechanisms to Squash or Commit
MESI Coherence Example
[Figure: Thread A and Thread B each run on a Processor with a Cache (Tag, State, Data); Shared Memory holds X=2]
1) Initially both caches are Invalid.
2) Thread A: Load X sends a Read to shared memory.
3) Fill: Thread A's cache holds X=2 in state Excl.
4) Thread B: Store X=3 sends a Read-Exclusive; read-exclusive invalidates all other copies.
5) Invalidation: Thread A's copy of X becomes Invalid.
6) Fill: Thread B's cache holds X=3 in state Dirty; the state ‘dirty’ implies exclusiveness.
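The sequence above can be replayed with a toy two-cache model. This is a minimal sketch assuming a single memory location X, two processors, and instantaneous messages; `machine`, `mesi_load`, and `mesi_store` are hypothetical names, and real protocols have more transitions than are modelled here.

```c
enum mesi_state { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_DIRTY };

struct mesi_line {
    enum mesi_state state;
    int data;
};

/* Two caches that each may hold the single location X, plus shared memory. */
struct machine {
    struct mesi_line cache[2];
    int memory;
};

/* Load X on processor `cpu`; a miss sends a Read and fills the line. */
int mesi_load(struct machine *m, int cpu)
{
    struct mesi_line *mine = &m->cache[cpu];
    struct mesi_line *other = &m->cache[1 - cpu];

    if (mine->state == MESI_INVALID) {
        if (other->state == MESI_DIRTY)
            m->memory = other->data;          /* write back the dirty copy */
        if (other->state != MESI_INVALID) {
            other->state = MESI_SHARED;       /* both copies become shared */
            mine->state = MESI_SHARED;
        } else {
            mine->state = MESI_EXCLUSIVE;     /* only copy in the system */
        }
        mine->data = m->memory;
    }
    return mine->data;
}

/* Store X=value on processor `cpu`; without ownership, send a Read-Exclusive. */
void mesi_store(struct machine *m, int cpu, int value)
{
    struct mesi_line *mine = &m->cache[cpu];
    struct mesi_line *other = &m->cache[1 - cpu];

    if (mine->state == MESI_INVALID || mine->state == MESI_SHARED) {
        if (other->state == MESI_DIRTY)
            m->memory = other->data;
        other->state = MESI_INVALID;          /* invalidate all other copies */
    }
    mine->state = MESI_DIRTY;                 /* dirty implies exclusiveness */
    mine->data = value;
}
```

Starting from memory X=2 with both caches Invalid, a load on processor 0 followed by a store of 3 on processor 1 leaves processor 0 Invalid and processor 1 Dirty, while memory still holds the old value until a write-back.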
Speculative Coherence Example
Highlights of our scheme:
• detection of a data dependence violation
• speculatively modified and shared cache lines
Epoch 4: Load X
Epoch 5: Store X=3
Epoch 6: Load X
Speculative Coherence Example
[Figure: Epoch 5 and Epoch 6 each run on a Processor with a Cache (Tag, State, Data); Shared Memory holds X=2]
1) Epoch 6: Load X sends a Read.
2) Fill: Epoch 6's cache holds X=2 in state Excl., marked Spec. Loaded; track which lines are speculatively loaded.
3) Epoch 5: Store X=3 sends Sp Read-Ex (epoch5); speculative msgs piggyback epoch number.
4) Epoch 6's cache receives Sp Inv (epoch5): epoch5 < epoch6, and the line is speculatively loaded.
5) Speculation failed! Speculation fails for epoch 6, and its copy of X becomes Invalid.
6) Fill: Epoch 5's cache holds X=3 in state Excl., marked Spec. Modified; track which lines are speculatively modified. Shared memory still holds X=2.
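The check made when the speculative invalidation arrives can be written down directly. The following is a sketch of that one decision only, assuming a cache that tracks a single line X; `spec_cache` and `on_spec_invalidate` are hypothetical names, and the cases the slides do not show (for example, an invalidation from a logically later epoch) are deliberately left untouched rather than guessed at.

```c
/* Speculative state for the single line X in one cache. */
struct spec_line {
    int valid;
    int spec_loaded;     /* set when the running epoch speculatively loaded X */
    int spec_modified;   /* set when the running epoch speculatively stored X */
};

struct spec_cache {
    unsigned epoch;      /* epoch number of the thread using this cache */
    struct spec_line x;
};

/*
 * Handle "Sp Inv (sender_epoch)" for line X.
 * Returns 1 if the receiving epoch must be squashed: the sender is logically
 * earlier and the receiver has already speculatively loaded the line.
 * Assumption: any other combination is not covered by the slides, so the
 * sketch leaves the line as it is and reports no violation.
 */
int on_spec_invalidate(struct spec_cache *c, unsigned sender_epoch)
{
    if (c->x.valid && c->x.spec_loaded && sender_epoch < c->epoch) {
        c->x.valid = 0;          /* the stale copy becomes Invalid */
        c->x.spec_loaded = 0;
        c->x.spec_modified = 0;
        return 1;                /* speculation fails for this epoch */
    }
    return 0;
}
```

With a cache running epoch 6 that has speculatively loaded X, an invalidation carrying epoch 5 returns 1, matching the "speculation failed!" step.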
Speculative Coherence Example
Highlights of our scheme:
• detection of a data dependence violation
• speculatively modified and shared cache lines
Epoch 4: Load X
Epoch 5: Store X=3
Epoch 6: Load X
Speculative Coherence Example
[Figure: Epoch 4 and Epoch 5 each run on a Processor with a Cache (Tag, State, Data); Shared Memory holds X=2]
1) Epoch 5: Store X=3 has left X=3 in Epoch 5's cache in state Excl., marked Spec. Modified.
2) Epoch 4: Load X sends a Read.
3) Epoch 5's cache is notified that the line is shared: both speculatively modified and shared!
4) Fill: Epoch 4's cache holds X=2, marked Spec. Loaded and Shared; Epoch 5's cache still holds X=3, marked Spec. Modified and Shared. Multiple versions of the same cache line.
Summary of New Speculative Line State
New cache line state:
• has it been speculatively loaded?
  • detect dependence violations
• has it been speculatively modified?
  • buffer speculative modifications
• is it in a speculative shared or exclusive state?
  • important performance optimizations
What if a speculative cache line is replaced?
• speculation fails for that epoch
Implementation of Speculative State
[Figure: each line in the Processor's Cache has Tag, State, and Data fields, extended with two bits: SL (Speculatively Loaded) and SM (Speculatively Modified)]
• modest amount of extra space
Life Cycle of an Epoch
[Timeline figure over Time; labels: Spawned, Becomes Complete, Speculative, Pass Homefree, Commit?]
• Speculative Coherence
• Mechanisms to Squash or Commit (this part: Squash)
When Speculation Fails
[Table: cache lines before the squash; * = any value]
State    Tag  Data  SM  SL
Sp Ex    *    *     0   1
Sp Sh    *    *     0   1
Sp Ex    *    *     1   0
Sp Sh    *    *     1   1
If SM is Set then Invalidate; Flash Reset SM and SL
[Table: cache lines after the squash]
State    Tag  Data  SM  SL
Excl     *    *     0   0
Shared   *    *     0   0
Invalid  *    *     0   0
Invalid  *    *     0   0
• quick bit operation
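The squash step amounts to two bit operations per line. The sketch below applies them in a loop as a software stand-in for the hardware flash reset, which would act on all lines at once; the type and function names are hypothetical, and "Sp Ex"/"Sp Sh" are modelled as the ordinary Exclusive/Shared states with the SM or SL bit set.

```c
#include <stddef.h>

enum line_state { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE, LINE_DIRTY };

struct cache_line {
    enum line_state state;
    unsigned char sm;    /* speculatively modified */
    unsigned char sl;    /* speculatively loaded */
};

/* Discard all speculative state after a failed epoch. */
void squash_epoch(struct cache_line *lines, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm)
            lines[i].state = LINE_INVALID;  /* throw away speculative data */
        lines[i].sm = 0;                    /* flash reset both bits */
        lines[i].sl = 0;
    }
}
```

Lines that were only speculatively loaded keep their data and fall back to plain Exclusive or Shared, which is why the first two rows of the table survive the squash.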
Life Cycle of an Epoch
[Timeline figure over Time; labels: Spawned, Becomes Complete, Speculative, Pass Homefree, Commit?]
• Speculative Coherence
• Mechanisms to Squash or Commit (this part: Commit)
When Speculation Succeeds
[Table: cache lines at commit time; * = any value]
State    Tag  Data  SM  SL
Sp Ex    *    *     0   1
Sp Sh    *    *     0   1
Sp Ex    *    *     1   0
Sp Sh    X    *     1   1
• SM & Exclusive: Become Dirty
• SM & Shared: Need Exclusive Access
  • want to avoid searching entire cache
  • ownership required buffer (ORB) holds the tags of these lines (here, X)
1) Send Upgrade-Request (X) for each ORB entry and receive Ack (X).
2) If SM, Become Dirty; Flash Reset.
[Table: cache lines after commit; the ORB is empty]
State    Tag  Data  SM  SL
Excl     *    *     0   0
Shared   *    *     0   0
Dirty    *    *     0   0
Dirty    X    *     0   0
• flush the ORB, then quick bit operations
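The commit path can be sketched the same way. This is an illustration under several assumptions: the ORB is modelled as a small array of line indices standing in for tags, each upgrade request is acknowledged immediately, and the loop stands in for the flash operation. `orb`, `ORB_CAPACITY`, and `commit_epoch` are hypothetical names.

```c
#include <stddef.h>

#define ORB_CAPACITY 4   /* arbitrary size chosen for the sketch */

enum line_state { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE, LINE_DIRTY };

struct cache_line {
    enum line_state state;
    unsigned char sm;    /* speculatively modified */
    unsigned char sl;    /* speculatively loaded */
};

/* Ownership required buffer: which lines are SM but only held Shared. */
struct orb {
    size_t index[ORB_CAPACITY];
    size_t count;
};

/* Commit a successful epoch; returns the number of upgrade requests issued. */
size_t commit_epoch(struct cache_line *lines, size_t n, struct orb *orb)
{
    size_t upgrades = 0;

    /* Step 1: flush the ORB, gaining exclusive access to each listed line. */
    for (size_t i = 0; i < orb->count; i++) {
        struct cache_line *l = &lines[orb->index[i]];
        if (l->sm && l->state == LINE_SHARED) {
            l->state = LINE_EXCLUSIVE;   /* Upgrade-Request sent, Ack received */
            upgrades++;
        }
    }
    orb->count = 0;

    /* Step 2: quick bit operations on every line. */
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm)
            lines[i].state = LINE_DIRTY; /* speculative data becomes real */
        lines[i].sm = 0;
        lines[i].sl = 0;
    }
    return upgrades;
}
```

Because the ORB already names the lines that need ownership, step 1 touches only those entries instead of searching the whole cache.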
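Forwarding Data Between Epochs
• predictable dependences cause frequent violations
• compiler inserts wait-signal synchronization
[Figure: two epoch timelines. In the first, one epoch's Load X and another epoch's Store X are not synchronized. With Forwarding, the storing epoch does Store X then Signal, and the loading epoch does Wait then Load X]
• synchronize to avoid violations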
Outline
• Details of our Approach
• Performance
  • simulation infrastructure
  • single-chip multiprocessor performance
  • scaling beyond chip boundaries
• Conclusions
Simulation Infrastructure
[Figure: processors (P) with caches (C) connected by a Crossbar]
Compiler system and tools based on SUIF
• help analyze dependences, insert synchronization
• produce MIPS binaries containing TLS primitives
Benchmarks (all run to completion)
• buk, compress95, ijpeg, equake
Simulator
• superscalar, similar to MIPS R10K
• models all bandwidth and contention
detailed simulation!