“Shared Memory Consistency Models: A Tutorial” – Adve & Gharachorloo
Presented by Robert T. Bauer
Shared Memory
• Shared memory: the single-address-space abstraction in a multiprocessor environment.
Memory Model
• Specifies how reads and writes appear to be executed
• May vary (and usually does) by level:
  • Programming language – a language can define its own memory model; Java, for example, has one (the JMM, JSR 133)
  • Processor
  • Memory subsystem
Definitions
• Sequential (processor): the result of an execution is the same as if the operations had been executed in the order specified by the program.
• Sequentially consistent (multiprocessor): the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
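To make the multiprocessor definition concrete, here is a minimal C11 litmus test (illustrative names x, y, r1, r2; C11's default memory_order_seq_cst plays the role of sequential consistency):

    #include <stdatomic.h>

    atomic_int x, y;   // static storage: both start at 0
    int r1, r2;

    void thread1(void) { atomic_store(&x, 1); r1 = atomic_load(&y); }
    void thread2(void) { atomic_store(&y, 1); r2 = atomic_load(&x); }

    // Under sequential consistency, a single interleaving of the four
    // operations must explain the result, and in every interleaving at
    // least one store precedes the other thread's load. The outcome
    // r1 == 0 && r2 == 0 is therefore impossible; it is exactly the
    // outcome the relaxed models below will permit.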
Uniprocessor
[Figure: a single processor issues memory operations to memory in program order – sequential memory]
Multiprocessor
[Figure: multiple processors share a single memory; sequential consistency constrains how their operations interleave]
Relaxing Sequential Consistency
• Program order
  • A write followed by a read to a different location can be reordered
  • A write followed by a write to a different location can be reordered
  • A read followed by a write to (or a read from) a different location can be reordered
• Write atomicity
  • A processor's write can be read by another processor before it becomes visible to all processors
  • A processor can read its own write before it becomes visible to other processors
Uniprocessor with Write Buffer
P1: flag1 = 1
    if (flag2 == 0) { critical section }
P2: flag2 = 1
    if (flag1 == 0) { critical section }
[Figure: a processor with a write buffer in front of memory – on a uniprocessor, letting the read bypass the buffered write preserves sequential semantics]
Multiprocessor with Write Buffer
P1: flag1 = 1
    if (flag2 == 0) { critical section }
P2: flag2 = 1
    if (flag1 == 0) { critical section }
[Figure: two processors, each with its own write buffer, in front of a shared memory – each read can bypass the other processor's write while it still sits in a buffer, so both processors may enter the critical section]
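This failure can be reproduced at the language level. Below is a hedged C11 sketch of the slide's example (pthreads; relaxed atomics stand in for the hardware write buffers; a single run rarely shows the failure, so in practice you would loop it):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int flag1, flag2;   // static storage: both start at 0
    int in1, in2;              // "I entered my critical section"

    void *p1(void *arg) {
        atomic_store_explicit(&flag1, 1, memory_order_relaxed);  // may sit in a buffer
        if (atomic_load_explicit(&flag2, memory_order_relaxed) == 0)
            in1 = 1;                                             // critical section
        return NULL;
    }

    void *p2(void *arg) {
        atomic_store_explicit(&flag2, 1, memory_order_relaxed);
        if (atomic_load_explicit(&flag1, memory_order_relaxed) == 0)
            in2 = 1;
        return NULL;
    }

    int main(void) {   // build: cc -pthread dekker.c
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (in1 && in2)
            puts("both entered: the reads bypassed the buffered writes");
        return 0;
    }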
Memory Barrier
P1: flag1 = 1
    mb()
    if (flag2 == 0) { critical section }
P2: flag2 = 1
    mb()
    if (flag1 == 0) { critical section }
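In C11 terms, mb() corresponds roughly to atomic_thread_fence(memory_order_seq_cst) (an assumed mapping; kernels spell the same barrier smp_mb() and similar). Reusing the globals from the previous sketch:

    void *p1_fenced(void *arg) {
        atomic_store_explicit(&flag1, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   // mb(): the write completes
                                                     // before the read below issues
        if (atomic_load_explicit(&flag2, memory_order_relaxed) == 0)
            in1 = 1;
        return NULL;
    }
    // p2_fenced is symmetric (store flag2, fence, read flag1). With both
    // fences in place, at most one thread can enter the critical section.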
Effect of Memory Barrier
P1: flag1 = 1
    mb()
    if (flag2 == 0) { critical section }
P2: flag2 = 1
    mb()
    if (flag1 == 0) { critical section }
[Figure: with mb(), each write buffer drains before the following read issues, so at most one processor enters the critical section]
Write Through & Memory Bus
P1: data = 2000
    head = 1
P2: while (head == 0) ;
    … = data
[Figure: P1 and P2 each have a write-through cache and reach memory through an interconnect; P1's write to head (1) can reach memory before its write to data (2)]
P2 sees the write to head before seeing the write to data: program order has been relaxed.
Late Cache-Invalidate Signal
[Figure: P1's writes to head (1) and data (2) arrive at memory in order, but the invalidate for data (3) reaches P2's write-through cache late]
• P1's writes arrive in order at memory
• P2's read of data occurs before the cache-invalidate signal arrives at P2
• P2 reads the "new" value of head
• P2 reads the "old" value of data from its cache
• ISSUE: memory operations need to "complete" – the cache-invalidate signal must propagate
• Write atomicity has been relaxed
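The same data/head example as a C11 sketch (hypothetical names data_loc and head; relaxed atomics make both hardware failures above legal at the language level too):

    #include <stdatomic.h>

    atomic_int data_loc, head;   // static storage: both start at 0

    void writer(void) {          // P1
        atomic_store_explicit(&data_loc, 2000, memory_order_relaxed);
        atomic_store_explicit(&head, 1, memory_order_relaxed);
    }

    void reader(void) {          // P2
        while (atomic_load_explicit(&head, memory_order_relaxed) == 0)
            ;                    // head == 1 has been observed...
        int d = atomic_load_explicit(&data_loc, memory_order_relaxed);
        (void)d;                 // ...but d may still be 0: the stale copy of
                                 // data survives until the buffered write or
                                 // the invalidate catches up
    }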
Relaxing Write to Read
• A read may be reordered with respect to previous writes
• IBM 370: prohibits a read from returning the value of a write before that write is visible to all processors
• TSO: a processor can read its own write early
  • It cannot read another processor's write early (the write must first be visible to all processors)
  • Our write-buffer example behaves this way
• IBM 370 provides serialization instructions (the writes propagate, so the reads are not reordered past them)
• TSO: the write→read order is maintained when the write is part of a RMW instruction, so order can be "enforced" with a read-modify-write
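A sketch of the TSO trick in C11 (reusing the flag globals from the earlier Dekker sketch): making the flag write a read-modify-write, here atomic_exchange, keeps the following read from moving ahead of it. This mirrors how compilers commonly implement sequentially consistent stores on x86, a TSO machine, with an xchg instruction:

    void *p1_rmw(void *arg) {
        // The exchange is a RMW, so under TSO the load of flag2 cannot be
        // performed before the write to flag1 is globally visible.
        atomic_exchange_explicit(&flag1, 1, memory_order_seq_cst);
        if (atomic_load_explicit(&flag2, memory_order_seq_cst) == 0)
            in1 = 1;
        return NULL;
    }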
Relaxing Write to Read/Write
• SPARC PSO
  • Writes to different locations can be pipelined or overlapped and reach memory (or caches) out of order
  • Like TSO, PSO allows a processor to read its own writes early
  • A processor cannot read another processor's writes before they are globally visible
  • STBAR (store barrier) prevents writes from being reordered across it
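An assumed C11 analogue of STBAR is a release fence between the two writes of the data/head sketch (in C11 the reader additionally needs a matching acquire, a pairing the hardware model leaves implicit):

    void publish(void) {
        atomic_store_explicit(&data_loc, 2000, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);  // STBAR-like: earlier writes
                                                    // complete before later writes
        atomic_store_explicit(&head, 1, memory_order_relaxed);
    }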
Weak Ordering
• Two categories of memory operations:
  • Data operations (reads/writes)
  • Synchronization operations (fences/barriers)
• The model allows reordering of operations between synchronization operations
• Each processor ensures that a synchronization instruction is not issued until all previous operations (data and sync) have completed
• The model ensures that writes always appear atomic, so no fence is required for write atomicity
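A sketch of the weak-ordering discipline in C11: relaxed accesses model data operations (freely reorderable between sync points), and seq_cst fences model the synchronization operations that wait for everything before them:

    atomic_int a, b, done;   // static storage: all start at 0

    void wo_producer(void) {
        atomic_store_explicit(&a, 1, memory_order_relaxed);  // data ops: may be
        atomic_store_explicit(&b, 2, memory_order_relaxed);  //   reordered freely
        atomic_thread_fence(memory_order_seq_cst);           // sync: all previous
                                                             //   ops complete first
        atomic_store_explicit(&done, 1, memory_order_relaxed);
    }

    void wo_consumer(void) {
        while (atomic_load_explicit(&done, memory_order_relaxed) == 0)
            ;
        atomic_thread_fence(memory_order_seq_cst);           // sync on the reader side
        // a == 1 and b == 2 are both guaranteed to be visible here
    }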
Release Consistency
• Acquire: a read operation that gains access to a set of shared locations
• Release: a write operation that grants permission to access a set of shared locations
• Two flavors:
  • RC–SC: maintains sequential consistency among "special" operations
  • RC–PC: maintains processor consistency among "special" operations
Release Consistency (continued)
• RC–SC: enforces acquire → all, all → release, and special → special
  • If an acquire appears before any operation, program order is enforced so that the acquire completes before the operations that follow it
• RC–PC: enforces acquire → all, all → release, and special → special, except for a special write followed by a special read
RC–PC
• Enforcing program order for a read that follows a write requires a RMW operation; if the write being ordered is "ordinary," the write in the RMW must be a release
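C11's acquire/release orderings are a direct descendant of release consistency; here is a sketch of the data/head example written in that style (same hypothetical globals as above):

    void rc_producer(void) {
        atomic_store_explicit(&data_loc, 2000, memory_order_relaxed); // ordinary op
        atomic_store_explicit(&head, 1, memory_order_release);        // release
    }

    void rc_consumer(void) {
        while (atomic_load_explicit(&head, memory_order_acquire) == 0) // acquire
            ;
        int d = atomic_load_explicit(&data_loc, memory_order_relaxed);
        (void)d;   // the release/acquire pair guarantees d == 2000
    }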
Just to make it more complicated
• Alpha
  • mb: enforces program order between any memory operations
  • wmb: enforces program order only among writes
• SPARC RMO
  • Barriers take the form (LD|ST) # (LD|ST)
  • e.g., LDST#LD means that load and store operations before the barrier must complete before any load after the barrier; stores after the barrier may still be reordered before it
• PowerPC
  • SYNC: like Alpha's mb, except that when placed between two reads to the same location, the second read may still be performed first
  • PowerPC allows writes to be seen early (write atomicity is relaxed)
  • RMW sequences are used to make writes appear atomic
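A rough, assumed mapping of these flavors onto C11 fences (real ISAs differ in the details; PowerPC's barriers, for example, are cumulative):

    void fence_flavors(void) {
        atomic_thread_fence(memory_order_seq_cst);  // ~ Alpha mb / PowerPC SYNC:
                                                    //   full barrier
        atomic_thread_fence(memory_order_release);  // ~ Alpha wmb / RMO ST#ST:
                                                    //   earlier stores ordered
                                                    //   before later stores
        atomic_thread_fence(memory_order_acquire);  // ~ RMO LD#(LD|ST): earlier
                                                    //   loads before later ops
    }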
Discussion/Conclusion
• System-centric models directly expose the ordering and write-atomicity relaxations: complicated to program against, and difficult to port.
• Programmer-centric models: the programmer supplies information about how particular variables are read and written, which determines what optimizations may be performed. Compiler complexity increases and debugging becomes more difficult.
• Relaxed memory models have proven effective at increasing performance; the cost of that higher performance is greater complexity.