CSE 520 Advanced Computer Architecture
Lec 14 – Ch. 4 – Multiprocessor and Memory Coherence
Sandeep K. S. Gupta
Based on slides by H.-H. S. Lee
Outline • Ch4 CSE 520 Fall 2007
Memory Hierarchy in a Multiprocessor
• Bus-based shared memory: each processor has a private cache ($); the caches connect over a shared bus to a single memory
• Fully-connected shared memory (dancehall): private caches reach multiple memory modules through an interconnection network
• Distributed shared memory: each processor-cache node has a local memory; nodes communicate over an interconnection network
• Shared cache: all processors share one cache in front of memory
[Diagrams of the four organizations condensed]
Cache Coherency
• The cache level closest to the processor is private
• Multiple copies of a cache line can be present across different processor nodes
• Local updates lead to an incoherent state
• The problem appears in both write-through and writeback caches
• On a bus, writes are globally visible; on a point-to-point interconnect, they are visible only to the communicating processor nodes
Example (Writeback Cache)
One processor updates X to 505 in its cache, but memory still holds X = -100; the other processors' reads (Rd?) see the stale value.
[Diagram: three processors with private caches; one cache holds X=505 while memory holds X=-100]
Example (Write-through Cache)
The write propagates to memory (X = 505), but another cache still holds the stale X = -100, which a later read (Rd?) may return.
[Diagram: memory updated to X=505 while one cache retains X=-100]
Defining Coherence
• An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
• Write propagation: writes are visible to other processes
• Write serialization: all writes to a location are seen in the same order by all processes
• E.g., if a read from P1 sees w1 followed by w2, then reads by every other processor Pi see them in the same order
Sounds Easy?
Initially A = 0, B = 0. P0 writes A = 1 and B = 2; as the updates propagate over times T1-T3, some processors see A's update before B's while others see B's update before A's, violating write serialization.
[Timing diagram for P0-P3 condensed]
Bus Snooping based on Write-Through Cache
• Every write appears as a transaction on the shared bus to memory
• Two protocols:
• Update-based Protocol
• Invalidation-based Protocol
Bus Snooping (Update-based Protocol on Write-Through cache)
• Each processor's cache controller constantly snoops on the bus
• Upon a snoop hit, it updates its local copy
[Diagram: a write of X=505 appears as a bus transaction to memory; snooping caches update their copies of X from -100 to 505]
Bus Snooping (Invalidation-based Protocol on Write-Through cache)
• Each processor's cache controller constantly snoops on the bus
• Upon a snoop hit, it invalidates its local copy
[Diagram: a write of X=505 goes to memory; snooping caches invalidate their stale copies of X; a later Load X misses and fetches 505]
A Simple Snoopy Coherence Protocol for a WT, No-Write-Allocate Cache
Notation: Observed event / Bus transaction issued
• Processor-initiated: in Invalid, PrRd issues BusRd and moves to Valid; in Valid, PrRd hits with no transaction; PrWr issues BusWr from either state (no allocation on a write miss)
• Bus-snooper-initiated: an observed BusWr invalidates a Valid copy
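The two-state protocol above can be sketched as a small state machine. This is an illustrative Python model, not a hardware description; the class, state, and event names are assumptions of this sketch, not from the slides.

```python
# Minimal sketch of the Valid/Invalid snoopy protocol for a
# write-through, no-write-allocate cache.

VALID, INVALID = "Valid", "Invalid"

class WTCacheLine:
    def __init__(self):
        self.state = INVALID

    def processor_event(self, event):
        """Handle a processor request; return the bus transaction issued (or None)."""
        if event == "PrRd":
            if self.state == INVALID:
                self.state = VALID
                return "BusRd"       # read miss: fetch the line
            return None              # read hit: no bus traffic
        if event == "PrWr":
            return "BusWr"           # write-through: every write goes to the bus
        raise ValueError(event)

    def snoop(self, transaction):
        """Handle a transaction observed on the bus from another cache."""
        if transaction == "BusWr" and self.state == VALID:
            self.state = INVALID     # invalidation-based: drop our stale copy
```

Note that a PrWr from Invalid stays Invalid: with no write-allocate, a write miss never brings the line into the cache.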
How about Writeback Cache?
• A WB cache reduces the bandwidth requirement: the majority of local writes are hidden behind the processor nodes
• Issues: how to snoop? how to preserve write ordering?
Cache Coherence Protocols for WB caches
• A cache has an exclusive copy of a line if it is the only cache holding a valid copy (memory may or may not be up to date)
• Modified (dirty) cache line: the cache holding the line is its owner, because it must supply the block on a request
Cache Coherence Protocol (Update-based Protocol on Writeback cache)
• A Store X broadcasts the new value, updating the copies in all processor nodes sharing the line
• A later Load X in another sharing node then hits locally on the updated value
• If one processor node keeps updating the same memory location, a lot of bus traffic is incurred
[Animation: Store X=505, then Store X=333, update the sharers' copies; a subsequent Load X hits]
Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache)
• A Store X invalidates the copies held by the other sharing nodes
• A later Load X in an invalidated node misses; the snoop hit in the owner's cache supplies the data
• Subsequent stores by the owner hit locally: reduced traffic when a processor node keeps updating the same memory location
[Animation: Store X=505 invalidates sharers; a Load X misses and is supplied on a snoop hit; repeated stores (X=444, X=333, X=987) stay local]
MSI Writeback Invalidation Protocol
• Modified: dirty; only this cache has a valid copy
• Shared: memory is consistent; one or more caches have a valid copy
• Invalid
• Writeback protocol: a cache line can be written multiple times before memory is updated
MSI Writeback Invalidation Protocol
• Two types of requests from the processor: PrRd and PrWr
• Three types of bus transactions posted by the cache controller:
• BusRd: PrRd misses the cache; memory or another cache supplies the line
• BusRdX (Read-to-own): PrWr is issued to a line that is not in the Modified state
• BusWB: writeback due to replacement; the processor is not directly involved in initiating this operation
MSI Writeback Invalidation Protocol (Processor Request)
Processor-initiated transitions:
• Invalid: PrRd / BusRd to Shared; PrWr / BusRdX to Modified
• Shared: PrRd / --- (stay); PrWr / BusRdX to Modified
• Modified: PrRd / --- and PrWr / --- (stay)
MSI Writeback Invalidation Protocol (Bus Transaction)
Bus-snooper-initiated transitions:
• Modified: BusRd / Flush to Shared; BusRdX / Flush to Invalid
• Shared: BusRd / --- (stay); BusRdX / --- to Invalid
• Flush puts the data on the bus; both memory and the requestor grab the copy
• The requestor gets the data by cache-to-cache transfer, or from memory
MSI Writeback Invalidation Protocol (Bus transaction): another possible, valid implementation
• Modified: BusRd / Flush to Invalid instead of Shared, anticipating no more reads from this processor
• A performance concern: this saves the later "invalidation" trip if the requesting cache writes the shared line
MSI Writeback Invalidation Protocol (complete)
• Invalid: PrRd / BusRd to Shared; PrWr / BusRdX to Modified
• Shared: PrRd / ---; PrWr / BusRdX to Modified; snooped BusRd / ---; snooped BusRdX / --- to Invalid
• Modified: PrRd / --- and PrWr / ---; snooped BusRd / Flush to Shared; snooped BusRdX / Flush to Invalid
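The complete MSI diagram can be exercised with a toy simulation. The classes, method names, and the simplified bus below are illustrative assumptions of this sketch (memory is the default supplier; a snoop hit in Modified flushes and also updates memory), not a real controller design.

```python
# Minimal sketch: three MSI caches snooping one bus, memory as fallback supplier.

M, S, I = "M", "S", "I"

class MSICache:
    def __init__(self, name):
        self.name, self.state = name, I

    def snoop(self, transaction):
        """React to another cache's bus transaction; return True if we flush."""
        if self.state == M:
            self.state = S if transaction == "BusRd" else I  # BusRdX -> Invalid
            return True                  # the owner supplies the dirty line
        if self.state == S and transaction == "BusRdX":
            self.state = I               # invalidate our shared copy
        return False

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def read(self, cache):
        """PrRd: on a miss, issue BusRd and report who supplied the data."""
        if cache.state == I:
            others = [c for c in self.caches if c is not cache]
            flushers = [c for c in others if c.snoop("BusRd")]
            cache.state = S
            return flushers[0].name if flushers else "Memory"
        return None                      # hit: no bus transaction

    def write(self, cache):
        """PrWr: if not already Modified, issue BusRdX to invalidate sharers."""
        if cache.state != M:
            for c in self.caches:
                if c is not cache:
                    c.snoop("BusRdX")
            cache.state = M
```

Running the request sequence of the MSI example slide (P1 reads, P3 reads, P3 writes, P1 reads, P2 reads) reproduces the expected states and suppliers.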
MSI Example (X is initially 10 in memory)
Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X=-25  | I           | ---         | M           | BusRdX          | ---
P1 reads X       | S           | ---         | S           | BusRd           | P3's cache (Flush; memory updated to -25)
P2 reads X       | S           | S           | S           | BusRd           | Memory
MESI Writeback Invalidation Protocol
• Reduces two types of unnecessary bus transactions:
• BusRdX that converts a block from S to M when no other cache holds it
• BusRd that fetches a line into the S state when there are no sharers
• Introduces the Exclusive state: the holder can write the copy without generating a BusRdX
• Illinois Protocol: proposed by Papamarcos and Patel in 1984
• Employed in Intel, PowerPC, and MIPS processors
MESI Writeback Invalidation Protocol: Processor Request (Illinois Protocol)
S: shared signal on the bus
• Invalid: PrRd / BusRd(S) to Shared; PrRd / BusRd(not-S) to Exclusive; PrWr / BusRdX to Modified
• Shared: PrRd / ---; PrWr / BusRdX to Modified
• Exclusive: PrRd / ---; PrWr / --- to Modified (no bus transaction)
• Modified: PrRd, PrWr / ---
MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol)
Flush*: Flush for the data supplier; no action for the other sharers
• Modified: BusRd / Flush to Shared; BusRdX / Flush to Invalid
• Exclusive: BusRd / Flush to Shared; BusRdX / Flush to Invalid
• Shared: BusRd / Flush* (stay); BusRdX / Flush* to Invalid
• Whenever possible, the Illinois protocol performs a $-to-$ transfer rather than having memory supply the data
• A selection algorithm is used if there are multiple suppliers (alternatives: add an O state, or force an update of memory)
• Most MESI implementations simply write to memory
MESI Writeback Invalidation Protocol (Illinois Protocol)
[Combined state diagram: the processor-initiated transitions (PrRd / BusRd with the S or not-S shared signal, PrWr / BusRdX, the silent Exclusive-to-Modified upgrade) and the bus-snooper-initiated transitions (BusRd / Flush, BusRdX / Flush, Flush* for sharers) of the two preceding slides in one figure]
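The benefit of the Exclusive state can be seen in a few lines of code. This is an illustrative sketch (class and signal names are my own); it models only the processor-request side: the bus's shared signal decides between E and S on a read miss, and a write from E upgrades to M silently.

```python
# Minimal sketch of MESI's Exclusive-state optimization:
# no BusRdX is needed when we already hold the only copy.

M, E, S, I = "M", "E", "S", "I"

class MESICache:
    def __init__(self):
        self.state = I

    def read(self, shared_signal):
        """PrRd; shared_signal is the bus 'S' wire: other caches hold the line."""
        if self.state == I:
            self.state = S if shared_signal else E
            return "BusRd"
        return None                  # hit in S, E, or M

    def write(self):
        """PrWr; returns the bus transaction issued, if any."""
        if self.state in (E, M):
            self.state = M           # silent upgrade: no bus transaction
            return None
        self.state = M
        return "BusRdX"              # from S or I we must invalidate sharers
```

This is exactly the saved transaction named on the MESI slide: a block fetched with no sharers goes to E, so the first write costs nothing on the bus.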
MOESI Protocol
• Adds one additional state: the Owner state
• Similar to the Shared state, but the O-state processor is responsible for supplying the data (the copy in memory may be stale)
• Employed by Sun UltraSPARC and AMD Opteron
• In the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed
[Diagram: CPU0 and CPU1 with private L2s connected through the System Request Interface and crossbar to the memory controller and HyperTransport]
Implication on Multi-Level Caches
• How to guarantee coherence in a multi-level cache hierarchy?
• Snoop all cache levels?
• Maintain the inclusion property:
• Ensure that any data present in an inner level (e.g., L1) is also present in the outer level (e.g., L2)
• Then only the outermost level (e.g., L2) needs to snoop
• L2 needs to know when L1 has write hits:
• Use a write-through L1 cache, or
• Use writeback and maintain an additional "modified-but-stale" bit in L2
Inclusion Property
• Not so easy to maintain …
• Replacement: the two levels observe different access activities, e.g., L2 may replace a line that is frequently accessed in L1
• Split L1 caches: imagine all caches are direct-mapped
• Different cache line sizes
Inclusion Property
• Use specific cache configurations
• E.g., a DM L1 with a bigger DM or set-associative L2 of the same cache line size
• Explicitly propagate L2 actions to L1:
• An L2 replacement flushes the corresponding L1 line
• An observed BusRdX bus transaction invalidates the corresponding L1 line
• To avoid excess traffic, L2 maintains an Inclusion bit for filtering
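The explicit-propagation idea with an Inclusion bit can be sketched as follows. The two-level model and every name in it are illustrative assumptions; the point is only that the bit lets L2 skip back-invalidations for lines L1 never held.

```python
# Minimal sketch: an L2 replacement propagates to L1 only when the
# per-line Inclusion bit says L1 may hold a copy.

class TwoLevel:
    def __init__(self):
        self.l1 = set()                  # addresses cached in L1
        self.l2 = {}                     # address -> inclusion bit

    def fill(self, addr, into_l1=True):
        self.l2[addr] = into_l1          # inclusion bit: L1 may hold this line
        if into_l1:
            self.l1.add(addr)

    def l2_replace(self, addr):
        """Evict from L2; return True if L1 had to be back-invalidated."""
        inclusion = self.l2.pop(addr)
        if inclusion:                    # only bother L1 if the bit is set
            self.l1.discard(addr)        # flush/invalidate the L1 copy
        return inclusion
```

An observed BusRdX would be filtered the same way before being forwarded to L1.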
Directory-based Coherence Protocol
• Snooping-based protocols need N transactions for an N-node MP, and all caches must watch every memory request from each processor: not a scalable solution for maintaining coherence in large shared-memory systems
• Directory protocol: the directory tracks who has what
• HW overhead to keep the directory (~ #lines x #processors)
[Diagram: processors with caches on an interconnection network; memory holds a directory with one presence bit per node and a modified bit per block]
Directory-based Coherence Protocol (full map)
[Diagram: for each cache block C(k), C(k+1), …, C(k+j) in memory, the directory keeps 1 presence bit per processor plus 1 modified bit]
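The full-map storage above can be sketched directly; the class and method names are illustrative, and the overhead function just restates the "#lines x #processors" estimate from the previous slide.

```python
# Minimal sketch of full-map directory storage: one presence bit per
# processor plus one modified bit, per memory block.

class DirectoryEntry:
    def __init__(self, num_procs):
        self.presence = [0] * num_procs   # one bit per node
        self.modified = 0                 # set when one node holds it dirty

    def record_read(self, proc):
        self.presence[proc] = 1
        self.modified = 0                 # memory copy is clean/shared

    def record_write(self, proc):
        # writer becomes the sole (dirty) owner
        self.presence = [1 if i == proc else 0
                         for i in range(len(self.presence))]
        self.modified = 1

def overhead_bits(num_blocks, num_procs):
    """Directory overhead: (P presence bits + 1 modified bit) per block."""
    return num_blocks * (num_procs + 1)
```

The linear growth of `overhead_bits` with the processor count is what motivates the limited-directory encoding on the next slide.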
Directory-based Coherence Protocol (Limited Dir)
• Instead of a full bit vector, each entry stores a few encoded pointers of lg2(N) bits; in this example each cache line can reside in at most 2 of the 16 processors (P0 … P15)
• An extra encoding marks whether the presence set is NULL
• 1 modified bit for each cache block in memory
[Diagram: directory entries with two 4-bit processor pointers per block]
Distributed Directory Coherence Protocol
• A centralized directory is less scalable (contention)
• Distributed shared memory (DSM) for a large MP system
• The interconnection network is no longer a shared bus
• Cache coherence must still be maintained (CC-NUMA)
• Each address has a "home" node
[Diagram: each node pairs a processor and cache with a slice of memory and its directory on the interconnection network]
Distributed Directory Coherence Protocol
• Stanford DASH (4 CPUs in each cluster, 16 clusters in total); a snoop bus within each cluster, an interconnection network between clusters
• Invalidation-based cache coherence
• The directory keeps one of 3 states for each cache block at its home node:
• Uncached
• Shared (unmodified state)
• Dirty
DASH Memory Hierarchy
• Processor level
• Local cluster level
• Home cluster level (the address is at its home); if the block is dirty, it must be fetched from the remote node that owns it
• Remote cluster level
Directory Coherence Protocol: Read Miss
• A processor misses on a read of Z and sends the request to Z's home node
• Z is shared (clean) at home: the home supplies the data and sets the requester's presence bit
[Diagram: Miss Z (read); go to home node; data returned; presence bits updated]
Directory Coherence Protocol: Read Miss (dirty case)
• A processor misses on a read of Z and goes to Z's home node
• Z is dirty: the home responds with the owner's identity; the requester sends a data request to the owner, which supplies the block
• Afterwards Z is clean, shared by 3 nodes
[Diagram: go to home node; respond with owner info; data request; data returned]
Directory Coherence Protocol: Write Miss
• P0 misses on a write of Z and goes to Z's home node
• The home responds with the sharer list; P0 sends invalidations to the sharers
• Each sharer replies with an ACK; once all ACKs arrive, the write of Z can proceed in P0
[Diagram: go to home node; respond with sharers; invalidate; ACK]
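The four-step write-miss exchange above can be traced with a small function. Everything here is an illustrative sketch: the message strings are invented, the directory is a bare set of sharer IDs, and the dirty-owner case (where data must also be fetched) is deliberately left out.

```python
# Minimal sketch of the directory write-miss flow:
# request -> sharer list -> invalidations -> ACKs -> write may proceed.

def write_miss(directory, requester):
    """Return the message trace for a write miss; update the directory in place."""
    trace = [f"P{requester}->home: RdX request"]
    sharers = [p for p in directory if p != requester]
    trace.append(f"home->P{requester}: sharers {sharers}")
    for p in sharers:
        trace.append(f"P{requester}->P{p}: invalidate")
        trace.append(f"P{p}->P{requester}: ack")
    directory.clear()
    directory.add(requester)        # requester is now the sole (dirty) owner
    return trace
```

Only after the last ACK in the trace may the requester's write of Z actually proceed, which is the ordering point the slide emphasizes.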
Memory Consistency Issue
• What do you expect from the following code? Initial values: A = 0, B = 0.
Example 1:
P1: A = 1; Flag = 1;
P2: while (Flag == 0) {}; print A;
Is it possible that P2 prints A = 0?
Example 2:
P1: A = 1; B = 1;
P2: print B; print A;
Is it possible that P2 prints A = 0, B = 1?
Memory Consistency Model
• Programmers anticipate certain memory orderings and program behavior
• Things become very complex when running shared-memory programs on processors that support out-of-order execution
• A memory consistency model specifies the legal orderings of memory events when several processors access shared memory locations
Sequential Consistency (SC) [Leslie Lamport]
• An MP is sequentially consistent if "the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program"
• Two properties: program ordering and write atomicity
• Intuitive to programmers
[Diagram: the processors take turns issuing their operations to a single memory]
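The SC definition makes the first question on the Memory Consistency Issue slide checkable by brute force: enumerate every interleaving of the two programs that preserves each program's order. This sketch models only the loop-exiting load of Flag followed by the load of A; the operation encoding is my own.

```python
# Brute-force check: under SC, can P2 observe Flag == 1 but A == 0?

P1 = [("write", "A", 1), ("write", "Flag", 1)]
P2 = [("read", "Flag"), ("read", "A")]   # the load exiting the loop, then print A

def interleavings(a, b):
    """All merges of lists a and b that keep each list's internal order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

violation = False
for order in interleavings(P1, P2):
    mem, reads = {"A": 0, "Flag": 0}, {}
    for op in order:
        if op[0] == "write":
            mem[op[1]] = op[2]
        else:
            reads[op[1]] = mem[op[1]]
    if reads["Flag"] == 1 and reads["A"] == 0:
        violation = True

print(violation)   # False: no SC interleaving lets P2 see Flag==1 but A==0
```

So under SC the answer to the slide's question is no; weaker models that reorder P1's two stores (or P2's two loads) can print A = 0, which is why the question is interesting at all.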