Computer architecture II

Computer architecture II Computer Architecture II

Shared Memory Multiprocessors Computer Architecture II

Outline: NOW: • Bus based shared memory (chapters 5, 6 Culler) • Cache Coherence • Snooping Cache Coherence Protocols • Consistency • Synchronization LATER: • NUMA machines (chapters 7, 8) • Directory based coherency protocols Computer Architecture II

Shared Memory Multiprocessors • Symmetric Multiprocessors (SMPs) • Symmetric access to the whole main memory from any processor • Dominate the server market • Attractive as throughput servers and for parallel programs • Efficient resource sharing (processors, memory) • Uniform access via loads/stores • Automatic data movement and coherent replication in cache Computer Architecture II

Pr ogramming models Message passing Compilation Multipr ogramming Communication abstraction or library User/system boundary Shar ed addr ess space Operating systems support Har dwar e/softwar e boundary Communication har dwar e Physical communication medium Supporting Programming Models • Address translation and protection in hardware (hardware SAS) • Message passing using shared memory buffers • can be very high performance since no OS involvement necessary • We focus today on supporting coherent shared address space Computer Architecture II

Memory Systems in Multiprocessors Actual SMPs,UMA, 20-30 processors 80s, UMA, 2-8 processors • b, c, d : private caches • b: small scale configuration • d: scalable machines • a, c: rarely used Scalable, UMA, memory far Most scalable, NUMA Computer Architecture II

C P U T a p e C a c h e M a i n M e m o r y D i s k M e m o r y M e m o r y The Memory Hierarchy Compo- nent Access Random Random Random Direct Sequential Capa- 64-1024+ 8KB-8MB 64MB-2GB 8GB 1TB city, bytes Latency .4-10ns .4-20ns 10-50ns 10ms 10ms-10s Block 4 B 64 B 64 B 4KB 4KB size Band- System System 10-4000 50MB/s 1MB/s width clock Clock MB/s Rate rate-80MB/s Cost/MB High $10 $.25 $0.002 $0.01 Some Typical Values:† †As of 2003-4. They go out of date immediately. Computer Architecture II

Caches • Caches play a key role in all cases • Address temporal locality • Reduce average data access time to memory • Some data in cache, some only in memory • Reduce bandwidth demands placed on shared interconnect • Cache types based on write access • Write-through: the value is also written to memory • Write-back: the value is written to memory only at block replacement WRITE-THROUGH WRITE-BACK X=5 X=5 Cache X:0 Cache X:0 5 5 Memory X:0 5 X:0 Computer Architecture II

Write-through vs. write-back • Write-though • Every write from every processor goes to shared bus and memory • Easy protocol • Write-through unpopular for SMPs • Write-back • absorb most writes as cache hits • Write hits don’t go on bus • But now how do we ensure write propagation and serialization? • Need more sophisticated protocols WRITE-THROUGH WRITE-BACK X=5 X=5 Cache X:0 Cache X:0 5 5 Memory X:0 5 X:0 Computer Architecture II

Caches and Cache Coherence • Private processor caches create a problem • Copies of a variable can be present in multiple caches • A write by one processor may not become visible to others • Other processors may access stale value in their caches • Cache coherence problem Computer Architecture II

Example Cache Coherence Problem • How can this problem be solved? • P1 reads u • P3 reads u • P3 : u=7 • P1 reads u • P2 reads u What do P1 and P2 see in steps 4 and 5? Computer Architecture II

A Coherent Memory System: Intuition • Reading a location should return latest value written (by any process) • Easy in uniprocessors • Except for I/O: coherence between I/O devices and processors • Problem: DMA and the processor write to main memory, but the processor through the cache • Infrequent => software solutions work fine • uncacheable operations, uncached loads/store for I/O memory regions, flush pages, pass I/O data through caches • The coherence problem is more performance-critical in multiprocessors • Fast hardware solutions desired Computer Architecture II

Problems with the Intuition • Recall: • Value returned by read should be last value written • But “last” is not well-defined! • Even in sequential case: • “last” is defined in terms of program order, not time • Order of operations in the machine language presented to processor • “Subsequent” defined in analogous way, and well defined • In parallel case: • program order defined within a process, but need to make sense of orders across processes • Must define a meaningful semantics • the answer involves both “cache coherence” (TODAY) and an appropriate “memory consistency model” (NEXT TIME) Computer Architecture II

P P 1 2 (1a) A = 1; (2a) print B; (1b) B = 2; (2b) print A; Program order • Sequential program order • P1: 1a->1b • P2: 2a->2b • Parallel program order: a arbitrary interleaving of sequential orders of P1 and P2 • 1a->1b->2a->2b • 1a->2a->1b->2b • 2a->1a->1b->2b • 2a->2b->1a->1b Computer Architecture II

Formal Definition of Coherence • Results of a program: values returned by its read operations • A memory system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the given location that is consistent with the results of the execution and in which: 1. operations issued by any particular process occur in the order issued by that process, and 2. the value returned by a read is the value written by the last write to that location in the serial order • Two necessary features: • Write propagation: value written must become visible to others • Write serialization: writes to location seen in same order by all • if I see w1 after w2, you should not see w2 before w1 • no need for read serialization since reads not visible to others Computer Architecture II

Cache Coherence Solutions • Software Based: • often used in clusters of workstations or PCs (e.g., “Treadmarks DSM”) • extend virtual memory system to perform more work on page faults • send messages to remote machines if necessary • Hardware Based: • two most common variations: • “snoopy” schemes • rely on broadcast to observe all coherence traffic • well suited for buses and small-scale systems • example: Almost all dual-processor SMPs • directory schemes • uses centralized information to avoid broadcast • scales well to larger numbers of processors • example: SGI Origin 2000 Today Later Computer Architecture II

invalidate Write-through 7 Ex: Cache Coherence Problem Solution INVALIDATE write-through: • before completing P3 :u=7 • send a signal on the bus to invalidate all the copies of the block cache where u is stored ( cache of P1) • Write through the new value to the memory • P1 and P2 will read 7 from main memory (the copy of P1 has been invalidated) • P1 reads u • P3 reads u • P3 : u=7 • P1 reads u • P2 reads u What do P1 and P2 see in steps 4 and 5? Computer Architecture II

update Write-through 7 Ex: Cache Coherence Problem Solution UPDATE write-through: • before completing P3 :u=7 • send a signal on the bus to update all the copies of the block cache where u is stored ( cache of P1) • Write through the new value to the memory • P1 reads 7 from its cache (the copy has been updated) and P2 will read 7 from main memory • P1 reads u • P3 reads u • P3 : u=7 • P1 reads u • P2 reads u What do P1 and P2 see in steps 4 and 5? 7 Computer Architecture II

Outline: NOW: • Bus based shared memory (chapters 5, 6 Culler) • Cache Coherence • Snooping Cache Coherence Protocols • Consistency • Synchronization LATER: • NUMA machines (chapters 7, 8) • Directory based coherency protocols Computer Architecture II

Memory Systems in Multiprocessors Actual SMPs,UMA, 20-30 processors 80s, UMA, 2-8 processors WE WILL DISCUSS NOW b ! Scalable, UMA, memory far away Most scalable, NUMA Computer Architecture II

Bus-based Cache Coherence • Built on top of two fundamentals of uni-processor systems • Bus transactions • State transition diagram in cache • Uni-processor bus transaction: • Three phases • Arbitration: decide who will get control of the bus • command/address • data transfer • All devices observe addresses, one is responsible for answering • Uni-processor cache states: • A finite-state machine for every block • Write-through: two states: valid, invalid • Write-back: one more state: modified (“dirty”) Computer Architecture II

Snooping-based Coherence • Basic Idea • Transactions on bus are visible to all processors • Processors or their representatives can snoop (monitor) bus and take action on relevant events (e.g. change state) • Implementing a Protocol • Cache controller receives inputs: • Requests from processor • Bus requests/responses from snooper • Cache controller actions on input • Updates state, responds with data, generates new bus transactions • Protocol is a distributed algorithm: cooperating state machines • Set of states, state transition diagram, actions • Granularity of coherence is typically cache block (e.g.: 32 bytes) • The same with granularity of allocation in cache and of transfer to/from cache Computer Architecture II

Coherence with Write-through Caches • Key extensions to uniprocessor: snooping, invalidating/updating caches • no new states or bus transactions in this case • invalidation- versus update-based protocols • Write propagation: even in invalidation case, later reads will see new value • invalidation causes miss on later access, and memory up-to-date via write-through Computer Architecture II

PrRd/- PrWr/BusWr V PrRd/BusRd BusWr/- I PrWr/BusW Processor initiated action Snooper initiated action Invalidation-based write-through State Transition Diagram • One transition diagram per cache block • Block states V:valid, I:invalid • V: block contains a correct copy of main memory • I: block is not in the cache • Notation a/b • a : event • b: action taken on event • Events • Processor can Rd/Wr from/to block • Bus can Rd/Wr from/to block Computer Architecture II

PrRd/- PrWr/BusWr V PrRd/BusRd BusWr/- I PrWr/BusWr Processor initiated action Snooper initiated action Invalidation-based write-through State Transition Diagram • Write will invalidate all other caches (no local change of state) • can have multiple simultaneous readers of block, but write invalidates them • Implementation • Hardware state bits associated only with blocks that are in the cache • other blocks can be seen as being in invalid (not-present) state in that cache Computer Architecture II

Problems with Write Through • High bandwidth requirements • Every write from every processor goes to shared bus and memory • Write-through unpopular for SMPs • Write-back caches absorb most writes as cache hits • Write hits don’t go on bus • But now how do we ensure write propagation and serialization? • Need more sophisticated protocols: large design space Computer Architecture II

Write-Back Snoopy Protocols • No need to change processor, main memory, cache … • Extend cache controller (not only V and I states) and exploit bus (provides serialization) • Dirty state now also indicates exclusive ownership • Exclusive: only cache with a valid copy (main memory may be too) • Owner: responsible for supplying block upon a request for it • 2 types of protocols • Invalidation based • Update based Computer Architecture II

Basic MSI Write-back Invalidation Protocol • States • Invalid (I) • Shared (S): one or more • Dirty or Modified (M): one only • Processor Events: • PrRd (read) • PrWr (write) • Bus Transactions • BusRd (read): asks for copy with no intent to modify • BusRdX (read exclusive): asks for copy with intent to modify • BusWB (write back): updates memory • Actions • Update state, perform bus transaction, flush value onto bus Computer Architecture II

State Transition Diagram PrRd/— PrWr/- • Replacement and write backs are not shown • Rd/Wr in M and Rd in S state do not cause bus transaction • But: Rd/Wr in I state cause 2 bus transactions • Wr in S state 2 bus transactions • And data sent at RdX • Can spare the data transfer, because already have latest data: can use upgrade (BusUpgr) instead of BusRdX M BusRd/Flush PrWr/ BusRdX BusRdX/Flush S BusRdX/— PrRd/BusRd PrRd/— BusRd/— PrW r/BusRdX I Computer Architecture II

P1 P2 P3 PrRd/— PrW r/— PrRd U M U U U U U S S S S M 7 5 5 7 7 BusRd/Flush PrW r/BusRdX BusRd U BusRdX/Flush S BusRd U I/O devices 7 BusRdX/— u :5 BusRdx U PrRd/BusRd Memory PrRd/— BusRd/— PrW r/BusRdX I Example: Write-Back Protocol PrRd U PrRd U PrWr U 7 BusRd Flush • P1 reads u • P3 reads u • P3 : u=7 • P1 reads u • P2 reads u Computer Architecture II

MESI (4-state) Invalidation Protocol • Problem with MSI protocol • Reading and modifying data is 2 bus transactions, even if no sharing! • even in a sequential program! • BusRd (I->S) followed by BusRdX or BusUpgr (S->M) • Add exclusive state: not modified block resides only in local cache => write locally without bus transaction • States • invalid • exclusive or exclusive-clean (only this cache has copy, but not modified) • shared (two or more caches may have copies) • modified (dirty) Computer Architecture II

MESI State Transition Diagram PrRd/- PrWr/- • Goal: no bus transaction on E • When does I go to E and when to S? • I -> E on PrRd if no other processor has a copy • I -> S otherwise • S additional signal on bus • BusRd(S): on BusRd signal, if one processor holds the block it asserts S (makes it 1) M BusRdX/Flush BusRd/Flush PrWr/- PrWr/ BusRdX E BusRd/ Flush BusRdX/Flush PrRd/— PrWr/ S BusRdX ¢ BusRdX/Flush PrRd/ BusRd(S) ) PrRd/— ¢ BusRd/Flush PrRd/ BusRd(S) I Computer Architecture II

Dragon Write-Back Update Protocol • 4 states • Exclusive-clean or exclusive (E): locally and memory have it • Shared clean (Sc): locally, others, and maybe memory, but I’m not owner • Shared modified (Sm): locally and others but not memory, and I’m the owner • Sm and Sc can coexist in different caches, with only one Sm • Modified or dirty (D): locally and nowhere else • No invalid state • If in cache, cannot be invalid • If not present in cache, can view as being in not-present or invalid state • New processor events:PrRdMiss, PrWrMiss • Introduced to specify actions when block not present in cache • New bus transaction: BusUpd • Broadcasts single word written on bus; updates caches that hold a copy Computer Architecture II

Dragon State Transition Diagram PrRd/— PrRd/— BusUpd/Update BusRd/— Sc E PrRdMiss/BusRd(S) PrRdMiss/BusRd(S) PrW r/— PrW r/BusUpd(S) PrW r/BusUpd(S) BusUpd/Update BusRd/Flush PrW rMiss/BusRd(S) PrW rMiss/(BusRd(S); BusUpd) M Sm PrW r/BusUpd(S) PrRd/— PrRd/— PrW r/BusUpd(S) PrW r/— BusRd/Flush Computer Architecture II

Invalidate versus Update • Basic question of program behavior • Is a block written by one processor read by others before it is rewritten? • Invalidation: • Yes => readers will take a miss • No => multiple writes without additional traffic • Update: • Yes => readers will not miss if they had a copy previously • single bus transaction to update all copies • No => multiple useless updates, even to dead copies • Invalidate or update may be better depending on the application • Invalidation protocols much more popular • Some systems provide both, or even hybrid Computer Architecture II

Computer architecture II