
Understanding Multiprocessor Systems and Cache Coherency

Explore Flynn's Taxonomy of Parallel Machines, MIMD and multiprocessor architectures, shared vs. distributed memory, cache hierarchies, and coherence protocols in this comprehensive guide.

villasenor


Presentation Transcript


  1. Chip-Multiprocessor

  2. Multiprocessing • Flynn’s Taxonomy of Parallel Machines • How many instruction streams? • How many data streams? • SISD: Single I Stream, Single D Stream • A uniprocessor • SIMD: Single I, Multiple D Streams • Each “processor” works on its own data • But all execute the same instructions in lockstep • E.g., a vector processor or MMX

  3. Flynn’s Taxonomy • MISD: Multiple I, Single D Stream • Not used much • Stream processors are closest to MISD • MIMD: Multiple I, Multiple D Streams • Each processor executes its own instructions and operates on its own data • This is your typical off-the-shelf multiprocessor (made using a bunch of “normal” processors) • Includes multi-core processors
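
The four categories follow mechanically from the two questions on the slides (how many instruction streams, how many data streams). A minimal sketch, assuming a hypothetical helper named `flynn_class` that is not part of the original deck:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category (SISD, SIMD, MISD, or MIMD) for the stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"   # Single or Multiple I streams
    d = "S" if data_streams == 1 else "M"          # Single or Multiple D streams
    return f"{i}I{d}D"


print(flynn_class(1, 1))  # a uniprocessor
print(flynn_class(1, 8))  # e.g. a vector unit working on 8 data elements
print(flynn_class(4, 4))  # e.g. a 4-core chip
```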

  4. Multiprocessors • Why do we need multiprocessors? • Uniprocessor speed keeps improving • But there are things that need even more speed • Wait for a few years for Moore’s law to catch up? • Or use multiple processors and do it now? • Multiprocessor software problem • Most code is sequential (for uniprocessors) • MUCH easier to write and debug • Correct parallel code very, very difficult to write • Efficient and correct is even harder • Debugging even more difficult (Heisenbugs) • ILP limits reached?

  5. MIMD Multiprocessors • Centralized Shared Memory • Distributed Memory

  6. Centralized-Memory Machines • Also “Symmetric Multiprocessors” (SMP) • “Uniform Memory Access” (UMA) • All memory locations have similar latencies • Data sharing through memory reads/writes • P1 can write data to a physical address A, P2 can then read physical address A to get that data • Problem: Memory Contention • All processors share the one memory • Memory bandwidth becomes the bottleneck • Used only for smaller machines • Most often 2, 4, or 8 processors

  7. Distributed-Memory Machines • Two kinds • Distributed Shared-Memory (DSM) • All processors can address all memory locations • Data sharing like in SMP • Also called NUMA (non-uniform memory access) • Latencies of different memory locations can differ (local access faster than remote access) • Message-Passing • A processor can directly address only local memory • To communicate with other processors, must explicitly send/receive messages • Also called multicomputers or clusters • Most accesses local, so less memory contention (can scale to well over 1000 processors)

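  8. Memory Hierarchy in a Multiprocessor • [Figure: four organizations] • Shared cache: the processors (P) share one cache in front of memory • Bus-based shared memory: each processor has a private cache ($) and all share one memory • Fully-connected shared memory: processors with private caches reach the memories through an interconnection network • Distributed shared memory: each processor has a private cache and a local memory, and the nodes are linked by an interconnection network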

  9. What’s the problem with a shared cache?

  10. Cache Coherency • Closest cache level is private • Multiple copies of a cache line can be present across different processor nodes • Local updates • Lead to an incoherent state • The problem occurs with both write-through and writeback caches • Bus-based → globally visible • Point-to-point interconnect → visible only to the processor nodes communicated with

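  11. Example (Writeback Cache) • [Figure: three processors (P), each with a private cache, share one memory] • Every copy of X starts at -100 • One processor writes X = 505 into its own cache; memory still holds X = -100 • Reads (Rd?) by the other processors still return X = -100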

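  12. Example (Write-through Cache) • [Figure: three processors (P), each with a private cache, share one memory] • Every copy of X starts at -100 • One processor writes X = 505; the write goes through, so memory also holds X = 505 • Another cache still holds X = -100, so a read (Rd?) that hits there returns the stale value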
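
The two examples can be replayed with a toy model of private caches that have no coherence mechanism at all. This is only a sketch: the class `PrivateCache` and the function `stale_read_demo` are hypothetical names, and each cache is modeled as a dictionary with unlimited capacity.

```python
class PrivateCache:
    """A private cache with NO coherence support (illustration only)."""

    def __init__(self, memory, write_through):
        self.memory = memory              # shared dict: address -> value
        self.write_through = write_through
        self.lines = {}                   # this cache's private copies

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]           # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value          # always update the local copy
        if self.write_through:
            self.memory[addr] = value     # write-through also updates memory


def stale_read_demo(write_through):
    memory = {"X": -100}
    p1, p2, p3 = (PrivateCache(memory, write_through) for _ in range(3))
    p1.read("X")                          # P1 and P3 cache X = -100
    p3.read("X")
    p1.write("X", 505)                    # P1 updates X locally
    return {"p2_miss": p2.read("X"), "p3_hit": p3.read("X"), "memory": memory["X"]}


print("writeback:    ", stale_read_demo(write_through=False))
print("write-through:", stale_read_demo(write_through=True))
```

With a writeback cache both other processors read the old value; with write-through, a processor that misses gets the new value from memory, but a processor that still holds a copy keeps reading the stale one.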

  13. Defining Coherence • An MP is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order • Implicit definition of coherence • Write propagation • Writes are visible to other processes • Write serialization • All writes to the same location are seen in the same order by all processes (extending this to “all” locations is called write atomicity) • E.g., if w1 followed by w2 is seen by a read from P1, the two writes will be seen in the same order by all reads by other processors Pi
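
Write serialization can be phrased as a check: the orders in which different processors observe writes to one location must all fit into a single serial order. A minimal sketch, assuming each write has a unique label such as `w1` and that `writes_serialized` is a hypothetical helper; it uses `graphlib` from the Python standard library (3.9 or later) to look for a cycle in the combined "seen before" relation.

```python
from graphlib import CycleError, TopologicalSorter


def writes_serialized(observations):
    """Return True if all observers' views of one location fit a single order.

    `observations` maps an observer name to the sequence of writes it saw,
    in the order it saw them. An observer may miss some writes entirely.
    """
    predecessors = {}                     # write -> writes seen before it
    for seq in observations.values():
        for write in seq:
            predecessors.setdefault(write, set())
        for earlier, later in zip(seq, seq[1:]):
            if earlier != later:
                predecessors[later].add(earlier)
    try:
        # A valid serial order exists exactly when the relation is acyclic.
        tuple(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True


print(writes_serialized({"P1": ["w1", "w2"], "P2": ["w1", "w2"]}))  # same order
print(writes_serialized({"P1": ["w1", "w2"], "P2": ["w2", "w1"]}))  # opposite order
```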

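  14. Sounds Easy? • [Figure: four processors P0 to P3; initially A=0 and B=0. At T1 one processor writes A=1 and another writes B=2; the two updates reach the remaining processors at different times over T2 and T3] • One processor sees A’s update before B’s • Another sees B’s update before A’s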

  15. Bus Snooping based on Write-Through Cache • Every write appears as a transaction on the shared bus to memory • Two protocols • Update-based Protocol • Invalidation-based Protocol

  16. Bus Snooping (Update-based Protocol on Write-Through cache) • [Figure: one processor writes X = 505; the bus transaction updates memory, and the other caches’ copies change from X = -100 to X = 505] • Each processor’s cache controller constantly snoops on the bus • Update local copies upon snoop hit

  17. Bus Snooping (Invalidation-based Protocol on Write-Through cache) • [Figure: one processor writes X = 505; the bus transaction updates memory, and a snooping cache invalidates its copy of X = -100; its later Load X fetches X = 505] • Each processor’s cache controller constantly snoops on the bus • Invalidate local copies upon snoop hit

  18. A Simple Invalidate-based Snoopy Coherence Protocol for a WT, No Write-Allocate Cache • Notation: Observed / Transaction • Processor-initiated transitions • Valid: PrRd / --- (stay Valid) • Valid: PrWr / BusWr (stay Valid) • Invalid: PrRd / BusRd (go to Valid) • Invalid: PrWr / BusWr (stay Invalid) • Bus-snooper-initiated transitions • Valid: BusWr / --- (go to Invalid)
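
The Valid/Invalid protocol for a write-through, no-write-allocate cache is small enough to simulate directly. The sketch below is an illustration under simplifying assumptions (one value per address, unlimited capacity, an atomic bus); the class names `Bus` and `WTCache` are hypothetical.

```python
class Bus:
    """Shared bus in front of memory; every BusWr is seen by all other caches."""

    def __init__(self, memory):
        self.memory, self.caches, self.log = memory, [], []

    def bus_rd(self, addr):
        self.log.append("BusRd")
        return self.memory[addr]

    def bus_wr(self, writer, addr, value):
        self.log.append("BusWr")
        self.memory[addr] = value             # write-through: memory is updated
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_bus_wr(addr)


class WTCache:
    """Write-through, no-write-allocate cache; a line is Valid iff it is in `lines`."""

    def __init__(self, bus):
        self.bus, self.lines = bus, {}
        bus.caches.append(self)

    def pr_rd(self, addr):
        if addr not in self.lines:            # Invalid: PrRd / BusRd -> Valid
            self.lines[addr] = self.bus.bus_rd(addr)
        return self.lines[addr]               # Valid: PrRd / ---

    def pr_wr(self, addr, value):
        if addr in self.lines:                # Valid: update the local copy
            self.lines[addr] = value
        self.bus.bus_wr(self, addr, value)    # PrWr / BusWr in either state

    def snoop_bus_wr(self, addr):
        self.lines.pop(addr, None)            # Valid: BusWr / --- -> Invalid


bus = Bus({"X": -100})
p1, p2, p3 = WTCache(bus), WTCache(bus), WTCache(bus)
p1.pr_rd("X")
p3.pr_rd("X")
p1.pr_wr("X", 505)                            # invalidates P3's copy
print(p3.pr_rd("X"), bus.log)                 # P3 misses and refetches from memory
```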

  19. How about Writeback Cache? • WB cache to reduce bandwidth requirement • The majority of local writes are hidden behind the processor nodes • How to snoop? • Write Ordering

  20. Cache Coherence Protocols for WB caches • A cache has an exclusive copy of a line if • It is the only cache having a valid copy • Memory may or may not have it • Modified (dirty) cache line • The cache having the line is the owner of the line, because it must supply the block

  21. Cache Coherence Protocol (Update-based Protocol on Writeback cache) • [Figure: a Store X by one processor changes X from -100 to 505; bus update transactions bring the other caches’ copies to X = 505] • Update data for all processor nodes that share the same data • If a processor node keeps updating the same memory location, a lot of traffic is incurred

  22. Cache Coherence Protocol (Update-based Protocol on Writeback cache) • [Figure: a second Store X changes X from 505 to 333 and the update is broadcast again; a Load X on another processor hits (Hit!) on the updated copy X = 333] • Update data for all processor nodes that share the same data • If a processor node keeps updating the same memory location, a lot of traffic is incurred

  23. Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache) • [Figure: a Store X by one processor sets X = 505 in its own cache; invalidate transactions on the bus remove the other caches’ copies of X = -100] • Invalidate the data copies for the sharing processor nodes • Reduced traffic when a processor node keeps updating the same memory location

  24. Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache) • [Figure: a Load X on another processor misses (Miss!); its bus transaction gets a snoop hit in the cache holding X = 505, and the requester receives X = 505] • Invalidate the data copies for the sharing processor nodes • Reduced traffic when a processor node keeps updating the same memory location

  25. Cache Coherence Protocol (Invalidation-based Protocol on Writeback cache) • [Figure: all three processors issue Store X; each store’s bus transaction is snooped by the other caches, which invalidate their copies] • Invalidate the data copies for the sharing processor nodes • Reduced traffic when a processor node keeps updating the same memory location
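
The traffic trade-off between the two writeback policies can be put into a back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement: one writer and one other sharer both start with a valid copy, the writer performs a run of stores, then the sharer performs some loads. The function name `bus_transactions` is hypothetical.

```python
def bus_transactions(protocol, writes, reads_after):
    """Count bus transactions for a run of stores followed by another node's loads."""
    if protocol == "update":
        # Every store broadcasts an update; the sharer's loads then all hit.
        return writes
    if protocol == "invalidate":
        # Only the first store goes on the bus (to invalidate the sharer);
        # later stores hit locally. The sharer's first load misses once.
        invalidation = 1 if writes > 0 else 0
        refetch = 1 if writes > 0 and reads_after > 0 else 0
        return invalidation + refetch
    raise ValueError(f"unknown protocol: {protocol}")


for n in (1, 10, 100):
    print(n, bus_transactions("update", n, 1), bus_transactions("invalidate", n, 1))
```

Under these assumptions the update protocol's traffic grows with the number of stores, while the invalidation protocol pays a fixed cost plus one miss when the sharer reads again.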

  26. MSI Writeback Invalidation Protocol • Modified • Dirty • Only this cache has a valid copy • Shared • Memory is consistent • One or more caches have a valid copy • Invalid • Writeback protocol: A cache line can be written multiple times before the memory is updated.

  27. MSI Writeback Invalidation Protocol • Two types of request from the processor • PrRd • PrWr • Three types of bus transactions posted by the cache controller • BusRd • PrRd misses the cache • Memory or another cache supplies the line • BusRd eXclusive (Read-to-own) • PrWr is issued to a line which is not in the Modified state • BusWB • Writeback due to replacement • The processor is not directly involved in initiating this operation

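  28. MSI Writeback Invalidation Protocol (Processor Request) • Processor-initiated transitions • Invalid: PrRd / BusRd (go to Shared) • Invalid: PrWr / BusRdX (go to Modified) • Shared: PrRd / --- (stay Shared) • Shared: PrWr / BusRdX (go to Modified) • Modified: PrRd / --- (stay Modified) • Modified: PrWr / --- (stay Modified)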

  29. MSI Writeback Invalidation Protocol (Bus Transaction) • Bus-snooper-initiated transitions • Modified: BusRd / Flush (go to Shared) • Modified: BusRdX / Flush (go to Invalid) • Shared: BusRd / --- (stay Shared) • Shared: BusRdX / --- (go to Invalid) • Flush data on the bus • Both memory and requestor will grab the copy • The requestor gets data by • Cache-to-cache transfer; or • Memory

  30. MSI Writeback Invalidation Protocol (Bus Transaction): Another possible implementation • Bus-snooper-initiated transitions • Modified: BusRd / Flush (go to Shared, or alternatively go to Invalid) • Modified: BusRdX / Flush (go to Invalid) • Shared: BusRd / --- (stay Shared) • Shared: BusRdX / --- (go to Invalid) • Another possible, valid implementation • Anticipate no more reads from this processor • A performance concern • Save the “invalidation” trip if the requesting cache writes the shared line later

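  31. MSI Writeback Invalidation Protocol • Processor-initiated transitions • Invalid: PrRd / BusRd (go to Shared) • Invalid: PrWr / BusRdX (go to Modified) • Shared: PrRd / --- (stay Shared) • Shared: PrWr / BusRdX (go to Modified) • Modified: PrRd / --- (stay Modified) • Modified: PrWr / --- (stay Modified) • Bus-snooper-initiated transitions • Modified: BusRd / Flush (go to Shared) • Modified: BusRdX / Flush (go to Invalid) • Shared: BusRd / --- (stay Shared) • Shared: BusRdX / --- (go to Invalid)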
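
The complete MSI diagram can be written down as a lookup table keyed by (current state, observed event). This is a minimal sketch: the names `MSI` and `step` are hypothetical, and events that the diagram does not draw (for example a bus transaction seen by an Invalid line) are assumed to leave the state unchanged with no action.

```python
# (state, observed event) -> (next state, generated bus action or None)
MSI = {
    # Processor-initiated
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    # Bus-snooper-initiated
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Apply one event to one cache line; undrawn transitions are no-ops."""
    return MSI.get((state, event), (state, None))


state = "I"
for event in ("PrRd", "PrWr", "BusRd", "BusRdX"):
    state, action = step(state, event)
    print(f"{event:7s} -> {state}  action={action}")
```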

  32. MSI Example • [Figure: P1, P2, and P3, each with a private cache, share a bus to memory; memory holds X=10] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | S | --- | --- | BusRd | Memory

  33. MSI Example • [Figure: P1 and P3 both cache X=10 in state S; memory holds X=10] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | S | --- | --- | BusRd | Memory • P3 reads X | S | --- | S | BusRd | Memory

  34. MSI Example • [Figure: P3 writes X=-25 and moves from S to M; P1’s copy of X=10 goes from S to I; memory still holds X=10] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | S | --- | --- | BusRd | Memory • P3 reads X | S | --- | S | BusRd | Memory • P3 writes X | I | --- | M | BusRdX | Memory (does not come from memory if having “BusUpgrade”)

  35. MSI Example • [Figure: P1 issues BusRd; P3 flushes X=-25 and moves from M to S; P1 goes from I to S with X=-25; memory is updated from X=10 to X=-25] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | S | --- | --- | BusRd | Memory • P3 reads X | S | --- | S | BusRd | Memory • P3 writes X | I | --- | M | BusRdX | Memory • P1 reads X | S | --- | S | BusRd | P3 Cache

  36. MSI Example • [Figure: P2 issues BusRd and caches X=-25 in state S; P1 and P3 stay in S; memory holds X=-25] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | S | --- | --- | BusRd | Memory • P3 reads X | S | --- | S | BusRd | Memory • P3 writes X | I | --- | M | BusRdX | Memory • P1 reads X | S | --- | S | BusRd | P3 Cache • P2 reads X | S | S | S | BusRd | P3 Cache
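
The five-step MSI example can be replayed with a small simulator for a single address X. This is a sketch with stated simplifications: the function `run_msi` and the `ACTIONS` list are hypothetical names, a line that was never cached is shown as `I` rather than `---`, and the data supplier is taken to be memory unless some cache holds the line in Modified state, so the supplier column is only an approximation of the slide's table.

```python
def run_msi(actions, n_procs=3, memory_value=10):
    """Replay (processor index, 'read' or 'write', new value) actions on one line."""
    state = ["I"] * n_procs
    value = [None] * n_procs
    memory = memory_value
    rows = []
    for proc, op, new_value in actions:
        bus, supplier = None, None
        if op == "read" and state[proc] == "I":
            bus = "BusRd"                       # read miss
        elif op == "write" and state[proc] != "M":
            bus = "BusRdX"                      # read-to-own before writing
        if bus:
            supplier = "Memory"
            for other in range(n_procs):
                if other == proc or state[other] == "I":
                    continue
                if state[other] == "M":         # owner flushes; memory grabs the copy
                    memory = value[other]
                    supplier = f"P{other + 1} Cache"
                state[other] = "S" if bus == "BusRd" else "I"
            value[proc] = memory
            state[proc] = "S" if bus == "BusRd" else "M"
        if op == "write":
            value[proc] = new_value
        rows.append((f"P{proc + 1} {op}s X", tuple(state), bus, supplier))
    return rows, memory


ACTIONS = [(0, "read", None), (2, "read", None), (2, "write", -25),
           (0, "read", None), (1, "read", None)]
rows, final_memory = run_msi(ACTIONS)
for row in rows:
    print(row)
print("memory X =", final_memory)
```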

  37. What’s not good about MSI?

  38. MESI Writeback Invalidation Protocol • To reduce two types of unnecessary bus transactions • BusRdX that snoops and converts the block from S to M when you are the sole owner of the block • BusRd that gets the line in S state when there are no sharers (which leads to the overhead above) • Introduce the Exclusive state • One can write to the copy without generating BusRdX • Illinois Protocol: Proposed by Papamarcos and Patel in 1984 • Employed in Intel, PowerPC, MIPS

  39. MESI Example • [Figure: P1, P2, and P3, each with a private cache, share a bus to memory; memory holds X=10] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | E | --- | --- | BusRd(noS) | Memory

  40. MESI Example • [Figure: P1 and P3 both cache X=10 in state S; memory holds X=10] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | E | --- | --- | BusRd(noS) | Memory • P3 reads X | S | --- | S | BusRd | Memory • Does not come from memory if having “BusUpgrade”

  41. MESI Example • [Figure: P3 writes X=-25 and moves from S to M; P1’s copy of X=10 goes from S to I; memory still holds X=10] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | E | --- | --- | BusRd(noS) | Memory • P3 reads X | S | --- | S | BusRd | Memory • P3 writes X | I | --- | M | BusRdX | Memory

  42. MESI Example • [Figure: P1 issues BusRd; P3 flushes X=-25 and moves from M to S; P1 goes from I to S with X=-25; memory is updated from X=10 to X=-25] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | E | --- | --- | BusRd(noS) | Memory • P3 reads X | S | --- | S | BusRd | Memory • P3 writes X | I | --- | M | BusRdX | Memory • P1 reads X | S | --- | S | BusRd | P3 Cache

  43. MESI Example • [Figure: P2 issues BusRd and caches X=-25 in state S; P1 and P3 stay in S; memory holds X=-25] • Columns: Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier • P1 reads X | E | --- | --- | BusRd(noS) | Memory • P3 reads X | S | --- | S | BusRd | Memory • P3 writes X | I | --- | M | BusRdX | Memory • P1 reads X | S | --- | S | BusRd | P3 Cache • P2 reads X | S | S | S | BusRd | P3 Cache

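  44. MESI Writeback Invalidation Protocol: Processor Request (Illinois Protocol) • S: Shared Signal • Processor-initiated transitions • Invalid: PrRd / BusRd (S) (go to Shared) • Invalid: PrRd / BusRd (not-S) (go to Exclusive) • Invalid: PrWr / BusRdX (go to Modified) • Shared: PrRd / --- (stay Shared) • Shared: PrWr / BusRdX (go to Modified) • Exclusive: PrRd / --- (stay Exclusive) • Exclusive: PrWr / --- (go to Modified) • Modified: PrRd, PrWr / --- (stay Modified)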

  45. MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol) • Bus-snooper-initiated transitions • Modified: BusRd / Flush (go to Shared) • Modified: BusRdX / Flush (go to Invalid) • Exclusive: BusRd / Flush (or ---) (go to Shared) • Exclusive: BusRdX / Flush (go to Invalid) • Shared: BusRd / Flush* (stay Shared) • Shared: BusRdX / Flush* (go to Invalid) • Flush*: Flush for data supplier; no action for other sharers • Whenever possible, the Illinois protocol performs $-to-$ transfer rather than having memory supply the data • Use a selection algorithm if there are multiple suppliers (Alternative: add an O state or force update memory) • Most of the MESI implementations simply write to memory

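  46. MESI Writeback Invalidation Protocol (Illinois Protocol) • S: Shared Signal • Processor-initiated transitions • Invalid: PrRd / BusRd (S) (go to Shared) • Invalid: PrRd / BusRd (not-S) (go to Exclusive) • Invalid: PrWr / BusRdX (go to Modified) • Shared: PrRd / --- (stay Shared) • Shared: PrWr / BusRdX (go to Modified) • Exclusive: PrRd / --- (stay Exclusive) • Exclusive: PrWr / --- (go to Modified) • Modified: PrRd, PrWr / --- (stay Modified) • Bus-snooper-initiated transitions • Modified: BusRd / Flush (go to Shared) • Modified: BusRdX / Flush (go to Invalid) • Exclusive: BusRd / Flush (or ---) (go to Shared) • Exclusive: BusRdX / Flush (go to Invalid) • Shared: BusRd / Flush* (stay Shared) • Shared: BusRdX / Flush* (go to Invalid) • Flush*: Flush for data supplier; no action for other sharers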
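
The benefit of the Exclusive state shows up when a processor reads a line nobody else holds and then writes it. The sketch below encodes the MESI transitions as a table and counts the bus transactions one cache generates; the names `MESI`, `step`, and `bus_traffic` are hypothetical, and the shared signal is modeled as a simple boolean argument.

```python
# (state, observed event) -> (next state, generated action or None).
# The Invalid + PrRd case depends on the shared signal and is handled in step().
MESI = {
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("E", "PrRd"): ("E", None),
    ("E", "PrWr"): ("M", None),       # silent upgrade: no bus transaction
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("S", "BusRd"): ("S", "Flush*"),
    ("S", "BusRdX"): ("I", "Flush*"),
    ("E", "BusRd"): ("S", "Flush"),
    ("E", "BusRdX"): ("I", "Flush"),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event, shared=False):
    """Apply one event; `shared` is the bus shared signal sampled on a read miss."""
    if state == "I" and event == "PrRd":
        return ("S" if shared else "E"), "BusRd"
    return MESI.get((state, event), (state, None))


def bus_traffic(events, shared=False):
    """Return the final state and the bus requests issued, starting from Invalid."""
    state, traffic = "I", []
    for event in events:
        state, action = step(state, event, shared)
        if action in ("BusRd", "BusRdX"):
            traffic.append(action)
    return state, traffic


print(bus_traffic(["PrRd", "PrWr"], shared=False))  # no other sharer
print(bus_traffic(["PrRd", "PrWr"], shared=True))   # another cache holds the line
```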

  47. MOESI Protocol • Add one additional state: the Owner (O) state • Similar to the Shared state • The processor holding the line in O state is responsible for supplying data (the copy in memory may be stale) • Employed by Sun UltraSparc, AMD Opteron • In dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed • [Figure: CPU0 and CPU1, each with its own L2, connect to the System Request Interface, followed by the Crossbar, the Mem Controller, and HyperTransport]

  48. Implication on Multi-Level Caches • How to guarantee coherence in a multi-level cache hierarchy? • Snoop all cache levels? • Intel’s 8870 chipset has a “snoop filter” for quad-core • Maintain the inclusion property • Ensure data in the inner level is also present in the outer level • Only snoop the outermost level (e.g. L2) • L2 needs to know L1 has write hits • Use Write-Through cache • Use Write-back but maintain another “modified-but-stale” bit in L2

  49. Inclusion Property • Not so easy … • Replacement: Different cache levels observe different access activities, e.g. L2 may replace a line frequently accessed in L1 • Split L1 caches: Imagine all caches are direct-mapped. • Different cache line sizes

  50. Inclusion Property • Use specific cache configurations • E.g., DM L1 + bigger DM or set-associative L2 with the same cache line size • Explicitly propagate L2 action to L1 • L2 replacement will flush the corresponding L1 line • Observed BusRdX bus transaction will invalidate the corresponding L1 line • To avoid excess traffic, L2 maintains an Inclusion bit for filtering (to indicate in L1 or not)
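
The explicit-propagation approach with an inclusion bit can be sketched as follows. This is an illustration only: the class `TwoLevelCache` and its method names are hypothetical, caches are modeled as sets of addresses with no capacity limit, and L1 is assumed to notify L2 when it evicts a line so that the inclusion bit stays accurate.

```python
class TwoLevelCache:
    """L1 plus an inclusive L2 that keeps one inclusion bit per L2 line."""

    def __init__(self):
        self.l1 = set()        # addresses currently in L1
        self.l2 = {}           # address -> inclusion bit (True if also in L1)
        self.l1_probes = 0     # invalidations actually forwarded to L1

    def fill(self, addr, into_l1=True):
        self.l2[addr] = into_l1            # L2 always gets the line (inclusion)
        if into_l1:
            self.l1.add(addr)

    def l1_evict(self, addr):
        self.l1.discard(addr)
        if addr in self.l2:
            self.l2[addr] = False          # clear the inclusion bit

    def _invalidate_l1_if_present(self, addr):
        if self.l2.get(addr):              # filter: probe L1 only if the bit is set
            self.l1_probes += 1
            self.l1.discard(addr)

    def snoop_bus_rdx(self, addr):
        """An observed BusRdX invalidates the line in L2 and, if needed, in L1."""
        self._invalidate_l1_if_present(addr)
        self.l2.pop(addr, None)

    def l2_replace(self, addr):
        """Replacing an L2 line flushes the corresponding L1 line."""
        self._invalidate_l1_if_present(addr)
        self.l2.pop(addr, None)

    def inclusion_holds(self):
        return self.l1 <= set(self.l2)     # every L1 line is also in L2


cache = TwoLevelCache()
cache.fill("A")
cache.fill("B", into_l1=False)
cache.snoop_bus_rdx("B")                   # filtered: L1 is not disturbed
cache.snoop_bus_rdx("A")                   # forwarded: L1 copy is invalidated
print(cache.l1_probes, cache.inclusion_holds())
```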
