Shared Memory Consistency Models: A Tutorial

By SaritaAdve & KouroshGharachorloo Slides by Jim Larson Shared Memory Consistency Models:A Tutorial

Outline • Concurrent programming on a uniprocessor • The effect of optimizations on a uniprocessor • The effect of the same optimizations on a multiprocessor • Methods for restoring sequential consistency • Conclusion

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag2 = 0

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 1 Flag2 = 1 Critical Section is Protected Works the same if Process 2 runs first! Process 2 enters its Critical Section

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 1 Flag2 = 0 Arbitrary interleaving of Processes

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 1 Flag2 = 1 Arbitrary interleaving of Processes

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 1 Flag2 = 1 Both processes can block but the critical section remains protected. Deadlock can be fixed by extending the algorithm with turn-taking

Outline • Concurrent Programming on a Uniprocessor • The effect of optimizations on a Uniprocessor • The effect of the same optimizations on a Multiprocessor without Sequential Consistency • Methods for restoring Sequential Consistency • Conclusion

SpeedUp: Write takes 100 cycles, buffering takes 1 cycle. So Buffer and keep going. Problem: Read from a Location with a buffered Write pending?? (Single Processor Case) Optimization: Write Buffer with Bypass

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag1 = 1 Flag2 = 0 Write Buffering

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag2 = 1 Flag1 = 1 Flag2 = 0 Write Buffering

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag2 = 1 Flag1 = 1 Flag2 = 0 Write Buffering Uh-Oh!

SpeedUp: Write takes 100 cycles, buffering takes 1 cycle. Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Optimization: Write Buffer with Bypass

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section STALL! Flag1 = 0 Flag2 = 1 Flag1 = 1 Flag2 = 0 Write Buffering Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 1 Flag2 = 1 Flag2 = 0 Write Buffering Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag2 = 0 Does this work for Multiprocessors??

Outline • Concurrent programming on a uniprocessor • The effect of optimizations on a uniprocessor • The effect of the same optimizations on a multiprocessor • Methods for restoring sequential consistency • Conclusion

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag2 = 0 Does this work for Multiprocessors?

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag2 = 0 Multiprocessor Case Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag1 = 1 Flag2 = 0 Multiprocessor Case Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag1 = 1 Flag2 = 1 Flag2 = 0 Multiprocessor Case Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag1 = 1 Flag2 = 1 What Now?? Flag2 = 0 Multiprocessor Case Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag1 = 1 Flag2 = 1 Flag2 = 0 Multiprocessor Case Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag1 = 1 Flag2 = 1 How Did That Happen?? Flag2 = 0 Multiprocessor Case Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

What happens on a Processor stays on that Processor

Dekker's Algorithm: Global Flags Init to 0 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Flag1 = 0 Flag1 = 1 Flag2 = 1 Flag2 = 0 Processor 2 knows nothing about the write to Flag1, so has no reason to stall! Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

A more general way to look at the Problem: Reordering of Reads and Writes (Loads and Stores).

Consider the Instructions in these processes. Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Simplify as: WX WY RX RY

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 RY RY WY RX WY RX WX WX WX WX WX WX WY RX RY RY RX WY WY RX RY RY RX WY RX WY RX WY RY RY RX WY RX WY RY RY RX WY RX WY RY RY WX WX WX WX WX WX WX WX WX WX WX WX RY RY WY RX WY RX RY RY WY RX WY RX RY RY WY RX WY RX WY RX RY RY RX WY WY RX RY RY RX WY WX WX WX WX WX WX RX WY RX WY RY RY There are 4! or 24 possible orderings. If either WX<RX or WY<RY Then the Critical Section is protected (Correct Behavior).

WY RX RY RY RX WY WY RX RY RY RX WY WX WX WX WX WX WX RX WY RX WY RY RY 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 WX WX WX WX WX WX RY RY WY RX WY RX RY RY WY RX WY RX RY RY WY RX WY RX RY RY WY RX WY RX WX WX WX WX WX WX WY RX RY RY RX WY WY RX RY RY RX WY RX WY RX WY RY RY RX WY RX WY RY RY RX WY RX WY RY RY WX WX WX WX WX WX There are 4! or 24 possible orderings. If either WX<RX or WY<RY Then the Critical Section is protected (Correct Behavior) 18 of the 24 orderings are OK. But the other 6 are trouble!

Consider another example...

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Memory Interconnect Head = 0 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate.

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Memory Interconnect Data = 2000 Head = 0 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate.

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Memory Interconnect Head = 1 Data = 2000 Head = 0 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate.

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Memory Interconnect Data = 2000 Head = 1 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate.

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Memory Interconnect Data = 2000 Wrong Data! Head = 1 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate.

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Memory Interconnect Head = 1 Data = 2000 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate. Fix: Write must be acknowledged before another write (or read) from the same processor.

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Memory Interconnect Head = 0 Data = 0 Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data (0) Memory Interconnect Head = 0 Data = 0 Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Global Data Initialized to 0 Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data (0) Memory Interconnect Head = 0 Data = 2000 Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Shared Memory Consistency Models: A Tutorial