Explore Flynn's Taxonomy of Parallel Machines, MIMD and multiprocessor architectures, shared vs. distributed memory, cache hierarchies, and coherence protocols in this comprehensive guide.
Multiprocessing • Flynn’s Taxonomy of Parallel Machines • How many Instruction streams? • How many Data streams? • SISD: Single I Stream, Single D Stream • A uniprocessor • SIMD: Single I, Multiple D Streams • Each “processor” works on its own data • But all execute the same instructions in lockstep • E.g., a vector processor or MMX
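To make the SISD/SIMD distinction concrete, here is a minimal sketch (assuming NumPy is available; the variable names are mine): the loop processes one data element per step, while the vectorized add applies one conceptual instruction to many data elements in lockstep.

```python
import numpy as np

a = np.arange(8, dtype=np.int32)
b = np.arange(8, dtype=np.int32)

# SISD-style: one instruction stream, one data element processed per step
sisd_result = np.empty(8, dtype=np.int32)
for i in range(8):
    sisd_result[i] = a[i] + b[i]

# SIMD-style: one (conceptual) instruction applied to all elements in lockstep
simd_result = a + b

assert (sisd_result == simd_result).all()
```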
Flynn’s Taxonomy • MISD: Multiple I, Single D Stream • Not used much • Stream processors are closest to MISD • MIMD: Multiple I, Multiple D Streams • Each processor executes its own instructions and operates on its own data • This is your typical off-the-shelf multiprocessor (made using a bunch of “normal” processors) • Includes multi-core processors
Multiprocessors • Why do we need multiprocessors? • Uniprocessor speed keeps improving (though ILP limits may have been reached) • But there are things that need even more speed • Wait a few years for Moore’s law to catch up? • Or use multiple processors and do it now? • Multiprocessor software problem • Most code is sequential (written for uniprocessors) • MUCH easier to write and debug • Correct parallel code is very, very difficult to write • Efficient and correct is even harder • Debugging is even more difficult (Heisenbugs)
MIMD Multiprocessors • Figure: two organizations, Centralized Shared Memory and Distributed Memory
Centralized-Memory Machines • Also “Symmetric Multiprocessors” (SMP) • “Uniform Memory Access” (UMA) • All memory locations have similar latencies • Data sharing through memory reads/writes • P1 can write data to a physical address A, P2 can then read physical address A to get that data • Problem: Memory Contention • All processors share the one memory • Memory bandwidth becomes the bottleneck • Used only for smaller machines • Most often 2, 4, or 8 processors
Distributed-Memory Machines • Two kinds • Distributed Shared-Memory (DSM) • All processors can address all memory locations • Data sharing like in SMP • Also called NUMA (non-uniform memory access) • Latencies of different memory locations can differ (local access faster than remote access) • Message-Passing • A processor can directly address only local memory • To communicate with other processors, it must explicitly send/receive messages • Also called multicomputers or clusters • Most accesses are local, so less memory contention (can scale to well over 1000 processors)
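As a rough software analogy for the two models (a toy sketch, not real hardware; the worker and variable names such as p1_shared and channel are made up), the shared-memory workers communicate by writing and reading the same location, while the message-passing workers communicate only through explicit send/receive:

```python
import threading
import queue

# --- Shared-memory style: P1 writes address A, P2 reads the same address ---
shared = {"A": None}
flag = threading.Event()

def p1_shared():
    shared["A"] = 42            # write to "physical address" A
    flag.set()                  # make the write visible

def p2_shared(out):
    flag.wait()
    out.append(shared["A"])     # read the same address

# --- Message-passing style: data moves only via explicit send/receive ---
channel = queue.Queue()

def p1_msg():
    channel.put(42)             # explicit send

def p2_msg(out):
    out.append(channel.get())   # explicit receive

for writer, reader in [(p1_shared, p2_shared), (p1_msg, p2_msg)]:
    result = []
    t1 = threading.Thread(target=writer)
    t2 = threading.Thread(target=reader, args=(result,))
    t2.start(); t1.start()
    t1.join(); t2.join()
    print(result)               # [42] in both cases
```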
Memory Hierarchy in a Multiprocessor • Figure: four organizations • Shared cache: processors (P) share a single cache and memory • Bus-based shared memory: each processor has a private cache ($) on a shared bus to memory • Fully-connected shared memory: private caches reach multiple memory banks through an interconnection network • Distributed shared memory: each node has a processor, cache, and local memory, joined by an interconnection network
Cache Coherency • The closest cache level is private • Multiple copies of a cache line can be present across different processor nodes • Local updates lead to an incoherent state • The problem shows up with both write-through and writeback caches • On a bus, writes are globally visible; on a point-to-point interconnect, they are visible only to the communicating processor nodes
Example (Writeback Cache) • Figure: P1 holds X = 505 in its cache while memory still holds X = -100; when P2 and P3 read X, they fetch the stale value -100 from memory
Example (Write-through Cache) • Figure: P1 writes X = 505 and the write goes through to memory, but P2’s cached copy still holds X = -100, so P2’s read returns stale data
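The stale-copy problem in these two examples can be reproduced with a toy model (class and variable names are mine; no coherence protocol is modeled): once P2 has cached X, P1’s write-through update reaches memory but never reaches P2’s private copy.

```python
memory = {"X": -100}

class ToyCache:
    """A private cache with no coherence mechanism at all."""
    def __init__(self):
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write_through(self, addr, value):
        self.lines[addr] = value
        memory[addr] = value             # memory is updated, peer caches are not

p1, p2 = ToyCache(), ToyCache()
p1.read("X"); p2.read("X")               # both now cache X = -100
p1.write_through("X", 505)               # P1 updates X
print(memory["X"])                       # 505
print(p2.read("X"))                      # -100  <- stale copy: incoherent
```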
Defining Coherence • An MP is coherent if the results of any execution of a program can be reconstructed from some hypothetical serial order • Implicit definition of coherence: • Write propagation • Writes are visible to other processes • Write serialization • All writes to the same location are seen in the same order by all processes (when extended to all locations, this is called write atomicity) • E.g., if reads on P1 observe w1 followed by w2, reads on every other processor Pi will see those writes in the same order
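As a small, simplified illustration of write serialization (the helper name is mine, and it assumes each write to the location stores a distinct value), the check below asks whether one processor’s observed values are consistent with a single global order of the writes to that location:

```python
def consistent_with(write_order, observed_reads):
    """True if one processor's reads of a location (in program order) could
    arise from the given global order of writes to that location."""
    order = list(write_order)
    pos = 0
    for v in observed_reads:
        if v not in order[pos:]:
            return False                 # would require seeing an older value again
        pos = order.index(v, pos)        # reads may repeat a value or move forward
    return True

write_order = [-100, 505]                               # w1 then w2 to location X
print(consistent_with(write_order, [-100, -100, 505]))  # True: same order
print(consistent_with(write_order, [505, -100]))        # False: order violated
```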
Sounds Easy? • Figure: starting from A = 0 and B = 0, P0 writes A = 1 and P1 writes B = 2; depending on the timing (T1, T2, T3), P2 may see A’s update before B’s while P3 sees B’s update before A’s
Bus Snooping based on Write-Through Cache • Every write appears as a transaction on the shared bus to memory • Two protocols • Update-based Protocol • Invalidation-based Protocol
Bus Snooping (Update-based Protocol on Write-Through Cache) • Each processor’s cache controller constantly snoops on the bus • Update local copies upon a snoop hit • Figure: P1’s write of X = 505 appears as a bus transaction; the snooping caches update their copies of X from -100 to 505, and memory is updated as well
Bus Snooping (Invalidation-based Protocol on Write-Through Cache) • Each processor’s cache controller constantly snoops on the bus • Invalidate local copies upon a snoop hit • Figure: P1’s write of X = 505 appears as a bus transaction; the snooping caches invalidate their copies of X = -100, so a later Load X misses and fetches 505
A Simple Invalidate-Based Snoopy Coherence Protocol for a WT, No-Write-Allocate Cache • Two states per line: Valid and Invalid; edges are labeled Observed event / Bus transaction • Processor-initiated: Invalid → Valid on PrRd / BusRd; PrRd / --- on a hit in Valid; PrWr / BusWr from either state (write-through, no allocate) • Bus-snooper-initiated: Valid → Invalid on an observed BusWr / ---
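The two-state protocol above maps directly onto a tiny state machine. The sketch below (class and method names are mine) returns the bus transaction, if any, that the cache controller would issue for a write-through, no-write-allocate cache:

```python
from enum import Enum

class VI(Enum):
    VALID = "V"
    INVALID = "I"

class WTCacheLine:
    """One line in a write-through, no-write-allocate cache."""
    def __init__(self):
        self.state = VI.INVALID

    # Processor-initiated events
    def pr_rd(self):
        if self.state is VI.INVALID:
            self.state = VI.VALID
            return "BusRd"               # read miss: fetch the line
        return None                      # read hit: no bus transaction

    def pr_wr(self):
        return "BusWr"                   # write-through: every write goes on the
                                         # bus; no allocate, so the state is kept

    # Bus-snooper-initiated event
    def snoop(self, transaction):
        if transaction == "BusWr" and self.state is VI.VALID:
            self.state = VI.INVALID      # invalidate on a remote write

line = WTCacheLine()
print(line.pr_rd())   # 'BusRd'  Invalid -> Valid
print(line.pr_wr())   # 'BusWr'
line.snoop("BusWr")   # a remote write invalidates our copy
print(line.state)     # VI.INVALID
```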
How about Writeback Cache? • WB cache to reduce bandwidth requirement • The majority of local writes are hidden behind the processor nodes • How to snoop? • Write Ordering
Cache Coherence Protocols for WB Caches • A cache has an exclusive copy of a line if it is the only cache holding a valid copy • Memory may or may not be up to date • Modified (dirty) cache line • The cache holding the modified line is the owner of the line, because it must supply the block
Cache Coherence Protocol (Update-based Protocol on Writeback Cache) • Update the data in all processor nodes that share it • If a processor node keeps updating the same memory location, a lot of traffic is incurred • Figure: a Store X = 505 is broadcast on the bus and the sharing caches update their copies from -100 to 505; a second Store X = 333 is broadcast the same way, and a later Load X in another node hits on the updated copy
Cache Coherence Protocol (Invalidation-based Protocol on Writeback Cache) • Invalidate the data copies in the sharing processor nodes • Reduced traffic when a processor node keeps updating the same memory location • Figure: one processor’s Store X = 505 invalidates the other copies; a later Load X by another processor misses, the snoop hits in the owner’s cache, and the owner supplies the data; the owner’s subsequent stores (X = 333, 444, 987) hit locally and generate no further bus transactions
MSI Writeback Invalidation Protocol • Modified • Dirty • Only this cache has a valid copy • Shared • Memory is consistent • One or more caches have a valid copy • Invalid • Writeback protocol: A cache line can be written multiple times before the memory is updated.
MSI Writeback Invalidation Protocol • Two types of requests from the processor • PrRd • PrWr • Three types of bus transactions posted by the cache controller • BusRd • PrRd misses the cache • Memory or another cache supplies the line • BusRdX, i.e., BusRd eXclusive (read-to-own) • PrWr is issued to a line which is not in the Modified state • BusWB • Writeback due to replacement • The processor is not directly involved in initiating this operation
MSI Writeback Invalidation Protocol (Processor Requests) • Processor-initiated transitions • Invalid → Shared on PrRd / BusRd; Invalid → Modified on PrWr / BusRdX • Shared: PrRd / --- on a hit; Shared → Modified on PrWr / BusRdX • Modified: PrRd / --- and PrWr / --- (hits, no bus transaction)
MSI Writeback Invalidation Protocol (Bus Transactions) • Bus-snooper-initiated transitions • Modified → Shared on a snooped BusRd / Flush; Modified → Invalid on BusRdX / Flush • Shared: BusRd / ---; Shared → Invalid on BusRdX / --- • Flush puts the data on the bus • Both memory and the requestor will grab the copy • The requestor gets the data by cache-to-cache transfer, or from memory
MSI Writeback Invalidation Protocol (Bus Transactions) • Another possible, valid implementation: on a snooped BusRd, a Modified line flushes and goes directly to Invalid instead of Shared • Anticipates no more reads from this processor • A performance concern: it saves the “invalidation” trip if the requesting cache writes the shared line later
MSI Writeback Invalidation Protocol (complete state diagram) • Processor-initiated and bus-snooper-initiated transitions combined • Invalid: PrRd / BusRd → Shared; PrWr / BusRdX → Modified • Shared: PrRd / ---; PrWr / BusRdX → Modified; snooped BusRd / ---; snooped BusRdX / --- → Invalid • Modified: PrRd / --- and PrWr / ---; snooped BusRd / Flush → Shared; snooped BusRdX / Flush → Invalid
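The combined MSI diagram can likewise be written as a per-line state machine. This is a hedged sketch with my own names (MSILine, pr_rd, pr_wr, snoop); the returned strings stand for the bus transactions the controller would issue, and "Flush" means this cache supplies the dirty line:

```python
from enum import Enum

class MSI(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"

class MSILine:
    """State of one cache line in one cache under the MSI protocol."""
    def __init__(self):
        self.state = MSI.INVALID

    # Processor-initiated events: return the bus transaction to issue, if any.
    def pr_rd(self):
        if self.state is MSI.INVALID:
            self.state = MSI.SHARED
            return "BusRd"
        return None                      # read hit in S or M

    def pr_wr(self):
        if self.state is MSI.MODIFIED:
            return None                  # write hit, already exclusive and dirty
        self.state = MSI.MODIFIED
        return "BusRdX"                  # read-to-own (I -> M or S -> M)

    # Bus-snooper-initiated events: return "Flush" if this cache supplies data.
    def snoop(self, transaction):
        if self.state is MSI.MODIFIED:
            if transaction == "BusRd":
                self.state = MSI.SHARED
                return "Flush"           # supply the dirty line
            if transaction == "BusRdX":
                self.state = MSI.INVALID
                return "Flush"
        elif self.state is MSI.SHARED and transaction == "BusRdX":
            self.state = MSI.INVALID
        return None

line = MSILine()
print(line.pr_rd())         # 'BusRd'   I -> S
print(line.pr_wr())         # 'BusRdX'  S -> M
print(line.snoop("BusRd"))  # 'Flush'   M -> S (another cache reads the line)
```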
MSI Example • P1, P2, and P3 each have a private cache connected to a shared bus and to memory, which initially holds X = 10
Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X | S | --- | --- | BusRd | Memory
P3 reads X | S | --- | S | BusRd | Memory
P3 writes X (X = -25) | I | --- | M | BusRdX | Memory (the data would not come from memory if a “BusUpgrade” were available)
P1 reads X | S | --- | S | BusRd | P3’s cache (which also writes X = -25 back to memory)
P2 reads X | S | S | S | BusRd | Memory
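Continuing the MSILine sketch above (it assumes that class is in scope), a toy bus that broadcasts each transaction to the other snoopers reproduces the state and data-supplier columns of this example:

```python
# Assumes MSILine from the sketch above is in scope.
def access(caches, who, op):
    """Apply a read or write by processor `who`, broadcast the resulting bus
    transaction to the other snoopers, and return (transaction, data supplier)."""
    txn = caches[who].pr_rd() if op == "rd" else caches[who].pr_wr()
    supplier = "Memory"
    if txn is not None:
        for i, c in enumerate(caches):
            if i != who and c.snoop(txn) == "Flush":
                supplier = f"P{i + 1} cache"      # cache-to-cache transfer
    return txn, supplier

caches = [MSILine(), MSILine(), MSILine()]        # P1, P2, P3
trace = [(0, "rd"), (2, "rd"), (2, "wr"), (0, "rd"), (1, "rd")]
for who, op in trace:
    txn, supplier = access(caches, who, op)
    states = "/".join(c.state.value for c in caches)
    print(f"P{who + 1} {op} X: bus={txn}, data from {supplier}, states={states}")
# P1 rd X: bus=BusRd, data from Memory, states=S/I/I
# P3 rd X: bus=BusRd, data from Memory, states=S/I/S
# P3 wr X: bus=BusRdX, data from Memory, states=I/I/M
# P1 rd X: bus=BusRd, data from P3 cache, states=S/I/S
# P2 rd X: bus=BusRd, data from Memory, states=S/S/S
```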
MESI Writeback Invalidation Protocol • To reduce two types of unnecessary bus transactions • BusRdX that snoops and converts the block from S to M even when you are the sole owner of the block • BusRd that fetches the line in the S state when there are no other sharers (which leads to the overhead above) • Introduce the Exclusive state • One can write to an Exclusive copy without generating a BusRdX • Illinois Protocol: proposed by Papamarcos and Patel in 1984 • Employed in Intel, PowerPC, and MIPS processors
MESI Example • Same setup as the MSI example, with memory initially holding X = 10; BusRd(noS) denotes a BusRd for which no other cache asserts the shared signal
Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X | E | --- | --- | BusRd(noS) | Memory
P3 reads X | S | --- | S | BusRd | Memory
P3 writes X (X = -25) | I | --- | M | BusRdX | Memory (the data would not come from memory if a “BusUpgrade” were available)
P1 reads X | S | --- | S | BusRd | P3’s cache
P2 reads X | S | S | S | BusRd | Memory
MESI Writeback Invalidation Protocol, Processor Requests (Illinois Protocol) • S: shared signal • Invalid: PrRd / BusRd(not-S) → Exclusive; PrRd / BusRd(S) → Shared; PrWr / BusRdX → Modified • Exclusive: PrRd / ---; PrWr / --- → Modified (no bus transaction) • Shared: PrRd / ---; PrWr / BusRdX → Modified • Modified: PrRd, PrWr / ---
MESI Writeback Invalidation Protocol, Bus Transactions (Illinois Protocol) • Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid • Exclusive: BusRd / Flush (or ---) → Shared; BusRdX / Flush → Invalid • Shared: BusRd / Flush*; BusRdX / Flush* → Invalid (Flush*: flush only for the data supplier; no action for the other sharers) • Whenever possible, the Illinois protocol performs $-to-$ transfer rather than having memory supply the data • Use a selection algorithm if there are multiple suppliers (alternative: add an O state, or force an update of memory) • Most MESI implementations simply write to memory
MESI Writeback Invalidation Protocol (Illinois Protocol) • Complete state diagram combining the processor-initiated and bus-snooper-initiated transitions above • S: shared signal • Flush*: flush only for the data supplier; no action for the other sharers
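The MESI diagram extends the MSI sketch with the Exclusive state and the shared signal. Below is a simplified, hedged version (names are mine); for brevity only a Modified copy supplies data on a snoop, whereas the Illinois protocol would also do cache-to-cache transfers from clean copies:

```python
from enum import Enum

class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

class MESILine:
    def __init__(self):
        self.state = MESI.INVALID

    def pr_rd(self, shared_signal):
        """shared_signal: True if another cache asserts S for our BusRd."""
        if self.state is MESI.INVALID:
            self.state = MESI.SHARED if shared_signal else MESI.EXCLUSIVE
            return "BusRd"
        return None                      # read hit in E, S, or M

    def pr_wr(self):
        if self.state in (MESI.MODIFIED, MESI.EXCLUSIVE):
            self.state = MESI.MODIFIED   # E -> M silently: no BusRdX needed
            return None
        self.state = MESI.MODIFIED
        return "BusRdX"                  # from S or I

    def snoop(self, transaction):
        if transaction == "BusRd":
            if self.state is MESI.MODIFIED:
                self.state = MESI.SHARED
                return "Flush"           # supply the dirty line
            if self.state is MESI.EXCLUSIVE:
                self.state = MESI.SHARED # clean copy; Illinois could also Flush
        elif transaction == "BusRdX" and self.state is not MESI.INVALID:
            was_dirty = self.state is MESI.MODIFIED
            self.state = MESI.INVALID
            return "Flush" if was_dirty else None
        return None

line = MESILine()
print(line.pr_rd(shared_signal=False))  # 'BusRd'  I -> E (no other sharers)
print(line.pr_wr())                     # None     E -> M, saves a BusRdX
```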
MOESI Protocol • Adds one additional state, the Owner (O) state • Similar to the Shared state, but the O-state processor is responsible for supplying the data (the copy in memory may be stale) • Employed by Sun UltraSparc and AMD Opteron • In the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed • Figure: CPU0 and CPU1, each with a private L2, connect through the System Request Interface and crossbar to the memory controller and HyperTransport
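To isolate what the Owner state changes, here is a compact sketch (enum and function names are mine) of the snoop-side response to a BusRd: an M or O copy supplies the data itself, and memory is left stale until an eventual writeback:

```python
from enum import Enum

class MOESI(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def snoop_bus_rd(state):
    """Return (new_state, action) for a snooped BusRd in the given state."""
    if state in (MOESI.MODIFIED, MOESI.OWNED):
        # The owner supplies the data; memory is NOT updated and may stay stale.
        return MOESI.OWNED, "supply data (memory stays stale)"
    if state is MOESI.EXCLUSIVE:
        return MOESI.SHARED, None        # clean copy: just drop exclusivity
    return state, None                   # S stays S, I stays I

print(snoop_bus_rd(MOESI.MODIFIED))      # (<MOESI.OWNED: 'O'>, 'supply data ...')
```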
Implications for Multi-Level Caches • How do we guarantee coherence in a multi-level cache hierarchy? • Snoop all cache levels? • Intel’s 8870 chipset has a “snoop filter” for quad-core • Maintain the inclusion property • Ensure that any data present in the inner level (e.g., L1) is also present in the outer level (e.g., L2) • Then only the outermost level (e.g., L2) needs to be snooped • L2 needs to know when L1 has write hits • Use a Write-Through L1 cache; or • Use Write-Back L1 but maintain an extra “modified-but-stale” bit in L2
Inclusion Property • Not so easy … • Replacement: different levels observe different access activity, e.g., L2 may replace a line that is still frequently accessed in L1 • Split L1 caches: imagine all caches are direct-mapped; an instruction line and a data line that coexist in the split L1s can map to the same L2 set and evict each other • Different cache line sizes between levels
Inclusion Property • Use specific cache configurations • E.g., a DM L1 plus a bigger DM or set-associative L2 with the same cache line size • Explicitly propagate L2 actions to L1 • An L2 replacement will flush the corresponding L1 line • An observed BusRdX bus transaction will invalidate the corresponding L1 line • To avoid excess traffic, L2 maintains an Inclusion bit per line for filtering (to indicate whether the line is also in L1)
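A sketch of the back-invalidation mechanism just described (all class and method names are mine): an inclusive L2 keeps an inclusion bit per line and pushes invalidations down to L1 on its own replacements and on observed BusRdX transactions:

```python
class L1Cache:
    def __init__(self):
        self.lines = set()                        # addresses currently cached
    def invalidate(self, addr):
        self.lines.discard(addr)

class InclusiveL2:
    def __init__(self, l1):
        self.l1 = l1
        self.lines = {}                           # addr -> inclusion bit ("in L1?")

    def fill(self, addr, in_l1):
        self.lines[addr] = in_l1

    def replace(self, addr):
        # L2 replacement must flush the corresponding L1 line to keep inclusion.
        if self.lines.pop(addr, False):
            self.l1.invalidate(addr)

    def snoop_bus_rdx(self, addr):
        # An observed BusRdX invalidates L2 and, if the inclusion bit is set, L1.
        if addr in self.lines:
            if self.lines[addr]:
                self.l1.invalidate(addr)
            del self.lines[addr]

l1 = L1Cache()
l2 = InclusiveL2(l1)
l1.lines.add(0x40)
l2.fill(0x40, in_l1=True)
l2.replace(0x40)
print(0x40 in l1.lines)                           # False: back-invalidated
```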