1 / 35

Computer architecture II

Computer architecture II. Shared Memory Multiprocessors. Outline:. NOW: Bus based shared memory (chapters 5, 6 Culler) Cache Coherence Snooping Cache Coherence Protocols Consistency Synchronization LATER: NUMA machines (chapters 7, 8) Directory based coherency protocols.

agalia
Download Presentation

Computer architecture II

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Computer architecture II Computer Architecture II

  2. Shared Memory Multiprocessors Computer Architecture II

  3. Outline: NOW: • Bus based shared memory (chapters 5, 6 Culler) • Cache Coherence • Snooping Cache Coherence Protocols • Consistency • Synchronization LATER: • NUMA machines (chapters 7, 8) • Directory based coherency protocols Computer Architecture II

  4. Shared Memory Multiprocessors • Symmetric Multiprocessors (SMPs) • Symmetric access to the whole main memory from any processor • Dominate the server market • Attractive as throughput servers and for parallel programs • Efficient resource sharing (processors, memory) • Uniform access via loads/stores • Automatic data movement and coherent replication in cache Computer Architecture II

  5. Pr ogramming models Message passing Compilation Multipr ogramming Communication abstraction or library User/system boundary Shar ed addr ess space Operating systems support Har dwar e/softwar e boundary Communication har dwar e Physical communication medium Supporting Programming Models • Address translation and protection in hardware (hardware SAS) • Message passing using shared memory buffers • can be very high performance since no OS involvement necessary • We focus today on supporting coherent shared address space Computer Architecture II

  6. Memory Systems in Multiprocessors Actual SMPs,UMA, 20-30 processors 80s, UMA, 2-8 processors • b, c, d : private caches • b: small scale configuration • d: scalable machines • a, c: rarely used Scalable, UMA, memory far Most scalable, NUMA Computer Architecture II

  7. C P U T a p e C a c h e M a i n M e m o r y D i s k M e m o r y M e m o r y The Memory Hierarchy Compo- nent Access Random Random Random Direct Sequential Capa- 64-1024+ 8KB-8MB 64MB-2GB 8GB 1TB city, bytes Latency .4-10ns .4-20ns 10-50ns 10ms 10ms-10s Block 4 B 64 B 64 B 4KB 4KB size Band- System System 10-4000 50MB/s 1MB/s width clock Clock MB/s Rate rate-80MB/s Cost/MB High $10 $.25 $0.002 $0.01 Some Typical Values:† †As of 2003-4. They go out of date immediately. Computer Architecture II

  8. Caches • Caches play a key role in all cases • Address temporal locality • Reduce average data access time to memory • Some data in cache, some only in memory • Reduce bandwidth demands placed on shared interconnect • Cache types based on write access • Write-through: the value is also written to memory • Write-back: the value is written to memory only at block replacement WRITE-THROUGH WRITE-BACK X=5 X=5 Cache X:0 Cache X:0 5 5 Memory X:0 5 X:0 Computer Architecture II

  9. Write-through vs. write-back • Write-though • Every write from every processor goes to shared bus and memory • Easy protocol • Write-through unpopular for SMPs • Write-back • absorb most writes as cache hits • Write hits don’t go on bus • But now how do we ensure write propagation and serialization? • Need more sophisticated protocols WRITE-THROUGH WRITE-BACK X=5 X=5 Cache X:0 Cache X:0 5 5 Memory X:0 5 X:0 Computer Architecture II

  10. Caches and Cache Coherence • Private processor caches create a problem • Copies of a variable can be present in multiple caches • A write by one processor may not become visible to others • Other processors may access stale value in their caches • Cache coherence problem Computer Architecture II

  11. Example Cache Coherence Problem • How can this problem be solved? • P1 reads u • P3 reads u • P3 : u=7 • P1 reads u • P2 reads u What do P1 and P2 see in steps 4 and 5? Computer Architecture II

  12. A Coherent Memory System: Intuition • Reading a location should return latest value written (by any process) • Easy in uniprocessors • Except for I/O: coherence between I/O devices and processors • Problem: DMA and the processor write to main memory, but the processor through the cache • Infrequent => software solutions work fine • uncacheable operations, uncached loads/store for I/O memory regions, flush pages, pass I/O data through caches • The coherence problem is more performance-critical in multiprocessors • Fast hardware solutions desired Computer Architecture II

  13. Problems with the Intuition • Recall: • Value returned by read should be last value written • But “last” is not well-defined! • Even in sequential case: • “last” is defined in terms of program order, not time • Order of operations in the machine language presented to processor • “Subsequent” defined in analogous way, and well defined • In parallel case: • program order defined within a process, but need to make sense of orders across processes • Must define a meaningful semantics • the answer involves both “cache coherence” (TODAY) and an appropriate “memory consistency model” (NEXT TIME) Computer Architecture II

  14. P P 1 2 (1a) A = 1; (2a) print B; (1b) B = 2; (2b) print A; Program order • Sequential program order • P1: 1a->1b • P2: 2a->2b • Parallel program order: a arbitrary interleaving of sequential orders of P1 and P2 • 1a->1b->2a->2b • 1a->2a->1b->2b • 2a->1a->1b->2b • 2a->2b->1a->1b Computer Architecture II

  15. Formal Definition of Coherence • Results of a program: values returned by its read operations • A memory system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the given location that is consistent with the results of the execution and in which: 1. operations issued by any particular process occur in the order issued by that process, and 2. the value returned by a read is the value written by the last write to that location in the serial order • Two necessary features: • Write propagation: value written must become visible to others • Write serialization: writes to location seen in same order by all • if I see w1 after w2, you should not see w2 before w1 • no need for read serialization since reads not visible to others Computer Architecture II

  16. Cache Coherence Solutions • Software Based: • often used in clusters of workstations or PCs (e.g., “Treadmarks DSM”) • extend virtual memory system to perform more work on page faults • send messages to remote machines if necessary • Hardware Based: • two most common variations: • “snoopy” schemes • rely on broadcast to observe all coherence traffic • well suited for buses and small-scale systems • example: Almost all dual-processor SMPs • directory schemes • uses centralized information to avoid broadcast • scales well to larger numbers of processors • example: SGI Origin 2000 Today Later Computer Architecture II

  17. invalidate Write-through 7 Ex: Cache Coherence Problem Solution INVALIDATE write-through: • before completing P3 :u=7 • send a signal on the bus to invalidate all the copies of the block cache where u is stored ( cache of P1) • Write through the new value to the memory • P1 and P2 will read 7 from main memory (the copy of P1 has been invalidated) • P1 reads u • P3 reads u • P3 : u=7 • P1 reads u • P2 reads u What do P1 and P2 see in steps 4 and 5? Computer Architecture II

  18. update Write-through 7 Ex: Cache Coherence Problem Solution UPDATE write-through: • before completing P3 :u=7 • send a signal on the bus to update all the copies of the block cache where u is stored ( cache of P1) • Write through the new value to the memory • P1 reads 7 from its cache (the copy has been updated) and P2 will read 7 from main memory • P1 reads u • P3 reads u • P3 : u=7 • P1 reads u • P2 reads u What do P1 and P2 see in steps 4 and 5? 7 Computer Architecture II

  19. Outline: NOW: • Bus based shared memory (chapters 5, 6 Culler) • Cache Coherence • Snooping Cache Coherence Protocols • Consistency • Synchronization LATER: • NUMA machines (chapters 7, 8) • Directory based coherency protocols Computer Architecture II

  20. Memory Systems in Multiprocessors Actual SMPs,UMA, 20-30 processors 80s, UMA, 2-8 processors WE WILL DISCUSS NOW b ! Scalable, UMA, memory far away Most scalable, NUMA Computer Architecture II

  21. Bus-based Cache Coherence • Built on top of two fundamentals of uni-processor systems • Bus transactions • State transition diagram in cache • Uni-processor bus transaction: • Three phases • Arbitration: decide who will get control of the bus • command/address • data transfer • All devices observe addresses, one is responsible for answering • Uni-processor cache states: • A finite-state machine for every block • Write-through: two states: valid, invalid • Write-back: one more state: modified (“dirty”) Computer Architecture II

  22. Snooping-based Coherence • Basic Idea • Transactions on bus are visible to all processors • Processors or their representatives can snoop (monitor) bus and take action on relevant events (e.g. change state) • Implementing a Protocol • Cache controller receives inputs: • Requests from processor • Bus requests/responses from snooper • Cache controller actions on input • Updates state, responds with data, generates new bus transactions • Protocol is a distributed algorithm: cooperating state machines • Set of states, state transition diagram, actions • Granularity of coherence is typically cache block (e.g.: 32 bytes) • The same with granularity of allocation in cache and of transfer to/from cache Computer Architecture II

  23. Coherence with Write-through Caches • Key extensions to uniprocessor: snooping, invalidating/updating caches • no new states or bus transactions in this case • invalidation- versus update-based protocols • Write propagation: even in invalidation case, later reads will see new value • invalidation causes miss on later access, and memory up-to-date via write-through Computer Architecture II

  24. PrRd/- PrWr/BusWr V PrRd/BusRd BusWr/- I PrWr/BusW Processor initiated action Snooper initiated action Invalidation-based write-through State Transition Diagram • One transition diagram per cache block • Block states V:valid, I:invalid • V: block contains a correct copy of main memory • I: block is not in the cache • Notation a/b • a : event • b: action taken on event • Events • Processor can Rd/Wr from/to block • Bus can Rd/Wr from/to block Computer Architecture II

  25. PrRd/- PrWr/BusWr V PrRd/BusRd BusWr/- I PrWr/BusWr Processor initiated action Snooper initiated action Invalidation-based write-through State Transition Diagram • Write will invalidate all other caches (no local change of state) • can have multiple simultaneous readers of block, but write invalidates them • Implementation • Hardware state bits associated only with blocks that are in the cache • other blocks can be seen as being in invalid (not-present) state in that cache Computer Architecture II

  26. Problems with Write Through • High bandwidth requirements • Every write from every processor goes to shared bus and memory • Write-through unpopular for SMPs • Write-back caches absorb most writes as cache hits • Write hits don’t go on bus • But now how do we ensure write propagation and serialization? • Need more sophisticated protocols: large design space Computer Architecture II

  27. Write-Back Snoopy Protocols • No need to change processor, main memory, cache … • Extend cache controller (not only V and I states) and exploit bus (provides serialization) • Dirty state now also indicates exclusive ownership • Exclusive: only cache with a valid copy (main memory may be too) • Owner: responsible for supplying block upon a request for it • 2 types of protocols • Invalidation based • Update based Computer Architecture II

  28. Basic MSI Write-back Invalidation Protocol • States • Invalid (I) • Shared (S): one or more • Dirty or Modified (M): one only • Processor Events: • PrRd (read) • PrWr (write) • Bus Transactions • BusRd (read): asks for copy with no intent to modify • BusRdX (read exclusive): asks for copy with intent to modify • BusWB (write back): updates memory • Actions • Update state, perform bus transaction, flush value onto bus Computer Architecture II

  29. State Transition Diagram PrRd/— PrWr/- • Replacement and write backs are not shown • Rd/Wr in M and Rd in S state do not cause bus transaction • But: Rd/Wr in I state cause 2 bus transactions • Wr in S state 2 bus transactions • And data sent at RdX • Can spare the data transfer, because already have latest data: can use upgrade (BusUpgr) instead of BusRdX M BusRd/Flush PrWr/ BusRdX BusRdX/Flush S BusRdX/— PrRd/BusRd PrRd/— BusRd/— PrW r/BusRdX I Computer Architecture II

  30. P1 P2 P3 PrRd/— PrW r/— PrRd U M U U U U U S S S S M 7 5 5 7 7 BusRd/Flush PrW r/BusRdX BusRd U BusRdX/Flush S BusRd U I/O devices 7 BusRdX/— u :5 BusRdx U PrRd/BusRd Memory PrRd/— BusRd/— PrW r/BusRdX I Example: Write-Back Protocol PrRd U PrRd U PrWr U 7 BusRd Flush • P1 reads u • P3 reads u • P3 : u=7 • P1 reads u • P2 reads u Computer Architecture II

  31. MESI (4-state) Invalidation Protocol • Problem with MSI protocol • Reading and modifying data is 2 bus transactions, even if no sharing! • even in a sequential program! • BusRd (I->S) followed by BusRdX or BusUpgr (S->M) • Add exclusive state: not modified block resides only in local cache => write locally without bus transaction • States • invalid • exclusive or exclusive-clean (only this cache has copy, but not modified) • shared (two or more caches may have copies) • modified (dirty) Computer Architecture II

  32. MESI State Transition Diagram PrRd/- PrWr/- • Goal: no bus transaction on E • When does I go to E and when to S? • I -> E on PrRd if no other processor has a copy • I -> S otherwise • S additional signal on bus • BusRd(S): on BusRd signal, if one processor holds the block it asserts S (makes it 1) M BusRdX/Flush BusRd/Flush PrWr/- PrWr/ BusRdX E BusRd/ Flush BusRdX/Flush PrRd/— PrWr/ S BusRdX ¢ BusRdX/Flush PrRd/ BusRd(S) ) PrRd/— ¢ BusRd/Flush PrRd/ BusRd(S) I Computer Architecture II

  33. Dragon Write-Back Update Protocol • 4 states • Exclusive-clean or exclusive (E): locally and memory have it • Shared clean (Sc): locally, others, and maybe memory, but I’m not owner • Shared modified (Sm): locally and others but not memory, and I’m the owner • Sm and Sc can coexist in different caches, with only one Sm • Modified or dirty (D): locally and nowhere else • No invalid state • If in cache, cannot be invalid • If not present in cache, can view as being in not-present or invalid state • New processor events:PrRdMiss, PrWrMiss • Introduced to specify actions when block not present in cache • New bus transaction: BusUpd • Broadcasts single word written on bus; updates caches that hold a copy Computer Architecture II

  34. Dragon State Transition Diagram PrRd/— PrRd/— BusUpd/Update BusRd/— Sc E PrRdMiss/BusRd(S) PrRdMiss/BusRd(S) PrW r/— PrW r/BusUpd(S) PrW r/BusUpd(S) BusUpd/Update BusRd/Flush PrW rMiss/BusRd(S) PrW rMiss/(BusRd(S); BusUpd) M Sm PrW r/BusUpd(S) PrRd/— PrRd/— PrW r/BusUpd(S) PrW r/— BusRd/Flush Computer Architecture II

  35. Invalidate versus Update • Basic question of program behavior • Is a block written by one processor read by others before it is rewritten? • Invalidation: • Yes => readers will take a miss • No => multiple writes without additional traffic • Update: • Yes => readers will not miss if they had a copy previously • single bus transaction to update all copies • No => multiple useless updates, even to dead copies • Invalidate or update may be better depending on the application • Invalidation protocols much more popular • Some systems provide both, or even hybrid Computer Architecture II

More Related