Computer Architecture: Memory Coherency & Consistency
By Dan Tsafrir, 11/4/2011
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
Coherency – intro
(Diagram: Processor 1 and Processor 2, each with a private L1 cache, above a shared L2 cache and memory)
• When there's only one core, caching doesn't affect correctness
• But what happens when ≥ 2 cores work simultaneously on the same memory location?
• If both are reading, there is no problem
• Otherwise, one might use a stale, out-of-date copy of the data
• Such inconsistencies might lead to incorrect execution
• Terminology: memory coherency <=> cache coherency
The cache coherency problem for a single memory location
• CPU-2's cache ends up with a stale value, different from the corresponding memory location and from CPU-1's cache
• (The next read by CPU-2 will yield the stale "1")
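The stale-copy scenario above can be sketched in a few lines. This is a minimal illustrative model (all names are invented for this sketch, not from any real API): X starts at 1 in memory, both CPUs cache it, and CPU-1 then writes 0 with write-through. Without invalidation, CPU-2's next read still returns the stale copy.

```python
# Toy model of the single-location coherency problem (invented names).
memory = {"X": 1}
cache1, cache2 = {}, {}

def read(cache, addr):
    if addr not in cache:             # miss: fetch from memory
        cache[addr] = memory[addr]
    return cache[addr]                # hit: may return a stale copy

def write_through(cache, addr, value):
    cache[addr] = value               # update this CPU's own copy
    memory[addr] = value              # and memory -- but NOT other caches

read(cache1, "X")                     # CPU-1 caches X = 1
read(cache2, "X")                     # CPU-2 caches X = 1
write_through(cache1, "X", 0)         # CPU-1 writes 0

print(read(cache2, "X"))              # prints 1: stale, memory already holds 0
```

The point of the sketch: write-through keeps memory up to date, but does nothing about copies in *other* caches, so CPU-2's read violates the intuitive "most recently written value" expectation.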
A memory system is coherent if…
• Informally, we could say (or would like to say) that a memory system is coherent if any read of a data item returns the most recently written value of that data item
• (This definition is intuitive, but overly simplistic)
• More formally…
A memory system is coherent if…
• Processor P writes to location X, and later P reads from X, and no other processor writes to X between the above write & read
=> The read must return the value previously written by P
(This simply preserves program order; it is needed even on a uniprocessor)
• P1 writes to X, some time T elapses, and then P2 reads from X
=> For big enough T, P2 will read the value written by P1
(This defines the notion of what it means to have a coherent view of memory; if X is never updated, regardless of the duration of T, then the memory is not coherent)
• Two writes to the same location X by any two processors are serialized
=> They are seen in the same order by all processors (if "1" and then "2" are written, no processor will read "2" and then "1")
(If P1 writes to X and then P2 writes to X, serialization of writes ensures that every processor will eventually see P2's write; otherwise P1's value might be maintained indefinitely)
Memory Consistency
• The coherency definition alone is not enough to write correct programs
• It must be supplemented by a consistency model, which is critical for program correctness
• Coherency & consistency are two different, complementary aspects of memory systems
• Coherency: what values can be returned by a read; relates to the behavior of reads & writes to the same memory location
• Consistency: when a written value will be returned by a subsequent read; relates to the behavior of reads & writes to different memory locations
Memory Consistency (cont.)
• "How consistent is the memory system?" is a nontrivial question
• Assume: locations A & B are originally cached by P1 & P2, with initial value = 0
• If writes are immediately seen by other processors, it is impossible for both "if" conditions to be true: reaching an "if" means either A or B must already hold 1
• But suppose: (1) the "write invalidate" can be delayed, and (2) the processor is allowed to compute during this delay
• => It's possible that P1 & P2 haven't seen the invalidations of B & A until after the reads; thus, both "if" conditions are true
• Should this be allowed? That is determined by the consistency model
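The slide's scenario presumably follows the classic textbook example: P1 runs "A = 1; if (B == 0) …" while P2 runs "B = 1; if (A == 0) …". A minimal Python sketch (all names invented) can model each processor's cached view and show that both "if" conditions can hold only when invalidations are delayed:

```python
# Illustrative model: each processor has a private cached copy of A and B.
# With immediate visibility, writes propagate before either read happens;
# with delayed invalidation, each processor reads its own stale copy.

def both_ifs_true(delayed_invalidate):
    cache = {"P1": {"A": 0, "B": 0},
             "P2": {"A": 0, "B": 0}}
    cache["P1"]["A"] = 1              # P1: A = 1
    cache["P2"]["B"] = 1              # P2: B = 1
    if not delayed_invalidate:        # writes become visible before the reads
        cache["P2"]["A"] = 1
        cache["P1"]["B"] = 1
    p1_taken = cache["P1"]["B"] == 0  # P1: if (B == 0)
    p2_taken = cache["P2"]["A"] == 0  # P2: if (A == 0)
    return p1_taken and p2_taken

print(both_ifs_true(False))  # False: impossible with immediate visibility
print(both_ifs_true(True))   # True: possible once invalidations are delayed
```

This is a sketch of the reasoning, not of any real hardware mechanism; real delayed invalidation involves buffered coherence messages rather than a simple dict update.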
Consistency models • From most strict to most relaxed • Strict consistency • Sequential consistency • Weak consistency • Release consistency • […many…] • Stricter models are • Easier to understand • Harder to implement • Slower • Involve more communication • Waste more energy
Strict consistency ("linearizability") • All memory operations are ordered in time • Any read to location X returns the value of the most recent write to X • This is the intuitive notion of memory consistency • But it is too restrictive, and thus unused
Sequential consistency
• A relaxation of strict consistency (defined by Lamport)
• Requires that the result of any execution be the same as if all memory accesses were executed in some sequential order, with each processor's accesses appearing in that order as specified by its program
• Can be a different order upon each run
• Left is sequentially consistent (it can be ordered as on the right)
• Q: What if we flip the order of P2's reads (on the left)?
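Lamport's definition can be checked by brute force for tiny programs. The sketch below (invented names, assuming the same two programs as the earlier example: P1 does "write A; read B", P2 does "write B; read A", locations starting at 0) enumerates every interleaving that preserves each program's order and collects the possible read results:

```python
# Each op is (kind, variable, result-name): "w" writes 1, "r" reads.
P1 = [("w", "A", None), ("r", "B", "p1_reads_B")]
P2 = [("w", "B", None), ("r", "A", "p2_reads_A")]

def interleavings(p1, p2):
    """All merges of p1 and p2 that preserve each list's internal order."""
    if not p1:
        yield list(p2)
        return
    if not p2:
        yield list(p1)
        return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest

def outcomes():
    """Set of (value P1 reads from B, value P2 reads from A) pairs."""
    results = set()
    for order in interleavings(P1, P2):
        mem, got = {"A": 0, "B": 0}, {}
        for kind, var, name in order:
            if kind == "w":
                mem[var] = 1          # every write stores 1
            else:
                got[name] = mem[var]
        results.add((got["p1_reads_B"], got["p2_reads_A"]))
    return results

print(sorted(outcomes()))  # (0, 0) is absent: no interleaving lets both reads see 0
```

An execution where both reads return 0 is exactly the "both if conditions true" result from the previous slide; since no single interleaving produces it, that result is not sequentially consistent.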
Weak consistency
• Accesses to "synchronization variables" are sequentially consistent
• No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere
• No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed
• In other words, the processor doesn't need to broadcast values at all until a synchronization access happens; but then it broadcasts all values to all cores
Release consistency
• Before accessing a shared variable, an acquire op must be completed
• Before a release is allowed, all accesses must be completed
• Acquire/release calls are sequentially consistent
• The acquire/release pair serves as a "lock"
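The acquire/release discipline can be sketched as follows. This is a deliberately simplified model (all class and method names are invented): each processor buffers writes in a private view, release() publishes them to shared memory, and acquire() imports everything published so far:

```python
# Illustrative model of release consistency, not a real synchronization API.
shared = {"data": 0, "flag": 0}       # the "published" state

class Cpu:
    def __init__(self):
        self.local = dict(shared)     # private, possibly stale view
    def write(self, var, val):
        self.local[var] = val         # stays private until release
    def read(self, var):
        return self.local[var]        # may be stale until acquire
    def release(self):
        shared.update(self.local)     # all prior writes complete "everywhere"
    def acquire(self):
        self.local.update(shared)     # import all published writes

p1, p2 = Cpu(), Cpu()
p1.write("data", 42)
print(p2.read("data"))  # prints 0: p1 has not released yet
p1.release()            # publish p1's writes
p2.acquire()            # p2 synchronizes before touching shared data
print(p2.read("data"))  # prints 42
```

The design point the sketch illustrates: between synchronization operations, no propagation traffic is needed at all; correctness only requires that writes before a release become visible to readers that subsequently acquire.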
MESI Protocol
• Each cache line can be in one of 4 states:
• Invalid – line data is not valid (as in a simple cache)
• Shared – line is valid & not dirty; copies may exist in other caches
• Exclusive – line is valid & not dirty; other processors do not have the line in their local caches
• Modified – line is valid & dirty; other processors do not have the line in their local caches
• (MESI = Modified, Exclusive, Shared, Invalid)
• Achieves sequential consistency
Two classes of protocols to track sharing
• Directory based: the status of each memory block is kept in just one location (= the directory); directory-based coherence has bigger overhead, but can scale to bigger core counts
• Snooping: every cache holding a copy of the data also has a copy of its state; there is no centralized state; all caches are accessible via broadcast (bus or switch), and all cache controllers monitor (or "snoop") the broadcasts to determine whether they have a copy of what's requested
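A directory entry can be pictured as a per-block record of state plus a sharer set. The sketch below is a bare-bones illustration (invented names, heavily simplified protocol): reads add the requester to the sharer set, while a write invalidates every other sharer with point-to-point messages instead of a broadcast.

```python
# Toy directory: block address -> {"state": ..., "sharers": set of caches}.
# States here: "U" uncached, "S" shared (clean copies), "M" modified.
directory = {}

def dir_read(block, cpu):
    entry = directory.setdefault(block, {"state": "U", "sharers": set()})
    entry["sharers"].add(cpu)
    entry["state"] = "S"              # one or more clean copies exist
    return entry

def dir_write(block, cpu):
    entry = directory.setdefault(block, {"state": "U", "sharers": set()})
    # Invalidate every other sharer: one message per sharer in the set,
    # rather than a bus broadcast that every cache must snoop.
    entry["sharers"] = {cpu}
    entry["state"] = "M"              # exclusive dirty copy
    return entry

dir_read(1000, "P1")
dir_read(1000, "P2")
print(directory[1000])  # state "S", sharers {"P1", "P2"}
dir_write(1000, "P1")
print(directory[1000])  # state "M", sharers {"P1"}: P2 was invalidated
```

This shows why directories scale better: the sharer set bounds the invalidation traffic to the caches that actually hold the block, at the cost of storing and consulting the directory itself.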
Multi-processor system: example
(Diagram: Processor 1 and Processor 2, each with a private L1 cache, above a shared L2 cache and memory; line [1000] initially holds 5)
• P1 reads 1000 – miss; the line is filled and marked Exclusive (value 5)
• P1 writes 1000 – the line becomes Modified (value 6)
• P2 reads 1000 – miss; L2 snoops 1000
• P1 writes back 1000 (value 6) and both copies become Shared
• P2 gets 1000 (value 6)
• P2 requests ownership with write intent – P1's copy becomes Invalid, and P2 gains exclusive ownership of the line
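The walkthrough above can be reproduced with a toy two-cache MESI simulation. Only the transitions the example uses are modeled, all names are invented, and P2's final write (storing 7) is added here to show the ownership transfer completing:

```python
# Simplified two-cache MESI over an implicit snooping bus (illustrative).
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"
memory = {1000: 5}
caches = {"P1": {}, "P2": {}}         # addr -> [state, value]

def others(cpu):
    return [c for c in caches if c != cpu]

def read(cpu, addr):
    line = caches[cpu].get(addr)
    if line and line[0] != I:
        return line[1]                # hit
    for o in others(cpu):             # miss: snoop the other caches
        other = caches[o].get(addr)
        if other and other[0] == M:   # dirty elsewhere:
            memory[addr] = other[1]   # owner writes back...
            other[0] = S              # ...and its copy becomes Shared
    shared = any(caches[o].get(addr) and caches[o][addr][0] != I
                 for o in others(cpu))
    caches[cpu][addr] = [S if shared else E, memory[addr]]
    return memory[addr]

def write(cpu, addr, value):
    read(cpu, addr)                   # ensure a copy (request for ownership)
    for o in others(cpu):             # invalidate all other copies
        if addr in caches[o]:
            caches[o][addr][0] = I
    caches[cpu][addr] = [M, value]

read("P1", 1000)                      # P1: Exclusive, value 5
write("P1", 1000, 6)                  # P1: Modified
read("P2", 1000)                      # P1 writes back; both Shared, value 6
write("P2", 1000, 7)                  # P2: Modified, P1: Invalid
print(caches["P1"][1000][0], caches["P2"][1000][0])  # prints: Invalid Modified
```

A real snooping controller reacts to bus transactions asynchronously; here the snoop is folded into the miss path purely to keep the state transitions visible.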
The alternative: incoherent memory
• As core counts grow, many argue that maintaining coherence will slow down the machines, waste a lot of energy, and not scale
• Intel SCC (Single-chip Cloud Computer) – for research purposes; 48 cores; shared, incoherent memory; software is responsible for correctness
• The Barrelfish operating system, by Microsoft & ETH Zurich, assumes no coherency as the baseline
Intel SCC
(Diagram: the chip's shared, incoherent memory)