Chapter 5: Thread Level Parallelism and Cache Coherence
Yanmin Zhu
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Outline • Introduction • Cache Coherence • Snooping-based Protocol
Thread-Level Parallelism
• Uses the MIMD model
• Has multiple program counters
• Targeted at tightly-coupled shared-memory multiprocessors
• Grain size: the amount of computation assigned to each thread
• Threads can be used for data-level parallelism, but the overheads may outweigh the benefit
Two Classes of Multiprocessors Depending on the memory organization
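Memory Hierarchy in a Multiprocessor
[Figure: three memory organizations. Bus-based shared memory: processors (P), each with a private cache ($), share one memory over a bus. Distributed shared memory: each processor-cache pair has its own local memory, and the nodes are connected by an interconnection network. Fully-connected shared memory (dancehall): processors and caches sit on one side of the interconnection network, memories on the other.]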
Symmetric Multiprocessors (SMP)
• Small number of cores
• Share a single memory with uniform access latency for every core
Distributed Shared Memory (DSM)
• Memory is distributed among the processors
• Non-uniform memory access/latency (NUMA)
• Processors are connected via interconnection networks
Multiprocessors vs. Clusters (WSC) Both multiprocessors and clusters follow the MIMD model, but are quite different
Comparisons of Communication
• Multiprocessors, both SMP and DSM
  • Communication among threads occurs through a shared address space
  • A memory reference can be made by any processor to any memory location
  • Shared memory means the address space is shared
• Clusters and WSCs
  • Look like individual computers connected by a network
  • Message-passing protocols are used to communicate data among processors
Challenges of Parallel Processing
Two important hurdles make parallel processing challenging
• (1) The limited parallelism available in programs
  • Limited parallelism makes it difficult to achieve good speedups on any parallel processor
  • Solved by good algorithms!
• (2) The relatively high cost of communication
  • Remote accesses in a parallel processor have large latency
  • Addressed by the architecture (caches, again) and by the programmer
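The first hurdle can be made concrete with Amdahl's law. The slide does not name the law, so treat the sketch below as an illustration under that assumption; the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Overall speedup when only `parallel_fraction` of the work scales
    with the number of processors (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def required_parallel_fraction(target_speedup: float, processors: int) -> float:
    """Parallel fraction f needed to reach `target_speedup` on `processors`.

    Solves 1 / ((1 - f) + f / n) = s for f, giving f = (1 - 1/s) / (1 - 1/n).
    """
    if processors < 2:
        raise ValueError("need at least 2 processors")
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


if __name__ == "__main__":
    # Even with 100 processors, a program that is 90% parallel stays below 10x.
    print(round(amdahl_speedup(0.90, 100), 2))
    # Fraction that must be parallel to reach 80x on 100 processors.
    print(round(required_parallel_fraction(80, 100), 4))
```

Under this model, reaching a speedup of 80 on 100 processors leaves room for only about 0.25% sequential work, which is why the slide points to algorithms rather than hardware for this hurdle.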
Outline • Introduction • Cache Coherence • Snooping-based Protocol
Why Cache Coherency?
• In multicores, the cache level closest to each core is private
• Multiple copies of a cache line can be present across different processor nodes
• Local updates (writes) lead to an incoherent state
• The problem arises in both write-through and writeback caches
Slide from Prof. H.H. Lee in Georgia Tech
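Writeback Cache w/o Coherence
[Figure: three processors (P), each with a private cache, above a shared memory holding X = 100. One processor writes X = 505 into its own cache only; memory still holds X = 100, so reads of X by the other processors return 100.]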
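Writethrough Cache w/o Coherence
[Figure: three processors (P), each with a private cache, above a shared memory. One processor writes X = 505; the write goes through to memory, which changes from X = 100 to X = 505, but another cache still holds X = 100, so a read that hits in that cache returns the stale value.]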
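The two figures above can be reproduced with a toy model. This is a minimal sketch, assuming P1 is the writer and that P3 already cached X before the write; the class and function names are hypothetical, and nothing here keeps the caches coherent, which is the point.

```python
class PrivateCache:
    """A private cache with no coherence mechanism at all."""

    def __init__(self, memory: dict, write_through: bool):
        self.memory = memory            # shared backing store
        self.write_through = write_through
        self.lines = {}                 # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]         # hit: return the local copy, stale or not

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:          # write-through also updates memory
            self.memory[addr] = value


def run(write_through: bool) -> dict:
    memory = {"X": 100}
    p1, p2, p3 = (PrivateCache(memory, write_through) for _ in range(3))
    p1.read("X")
    p3.read("X")                        # P1 and P3 both cache X = 100
    p1.write("X", 505)                  # local update by P1
    return {
        "P2 (miss)": p2.read("X"),
        "P3 (hit)": p3.read("X"),
        "memory": memory["X"],
    }


if __name__ == "__main__":
    print("writeback:   ", run(write_through=False))
    print("writethrough:", run(write_through=True))
```

With writeback caches both other processors read 100. With write-through caches P2 misses and reads 505 from memory, but P3 still hits on its stale copy of 100, so write-through alone does not solve the problem.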
Definition of Coherence
• Caching shared data introduces a new problem, the cache coherence problem: two processors could end up seeing two different values for the same location
  • Because the view of memory held by two different processors is through their individual caches
  • Or because there is a global state defined by the main memory and a local state defined by the individual caches
• Loose definition: a memory system is coherent if any read of a data item returns the most recently written value of that data item
Precise Definition of Coherence
• 1) A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
• 2) A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
• 3) Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors
Implications
• Write propagation
  • Writes are visible to other processes
• Write serialization
  • All writes to the same location are seen in the same order by all processes
  • For example, if read operations by P1 to a location see the value produced by write w1 (say, from P2) before the value produced by write w2 (say, from P3), then reads by another process P4 (or P2 or P3) also should not be able to see w2 before w1
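Write serialization can be checked mechanically on a trace of reads. The sketch below is only an illustration: it assumes every write to the location stores a distinct value, so the values a processor reads identify the writes it has seen, and the function name is hypothetical. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def writes_are_serialized(observations: dict) -> bool:
    """Return True if one total order of writes explains every observer.

    `observations` maps a processor name to the sequence of write labels
    it saw, in order, when repeatedly reading a single location.
    """
    predecessors = {}                   # write -> writes that must come before it
    for seen in observations.values():
        for earlier, later in zip(seen, seen[1:]):
            if earlier != later:
                predecessors.setdefault(earlier, set())
                predecessors.setdefault(later, set()).add(earlier)
    try:
        # A cycle means two observers disagree about the order of some writes.
        tuple(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True


if __name__ == "__main__":
    print(writes_are_serialized({"P1": ["w1", "w2"], "P4": ["w1", "w2"]}))
    print(writes_are_serialized({"P1": ["w1", "w2"], "P4": ["w2", "w1"]}))
```

The second call corresponds to the forbidden case on the slide: P1 sees w1 before w2 while P4 sees w2 before w1.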
Sounds Easy?
[Figure: four processors P0 to P3, with A = 0 and B = 0 initially. Writes A = 1 and B = 2 are issued at time T1 and reach the processors at different times over T2 and T3, so some processors see A's update before B's while others see B's update before A's.]
Outline • Introduction • Cache Coherence • Snooping-based Protocol
Bus Snooping based on Write-Through Cache
• Every write appears as a transaction on the shared bus to memory
• Two protocols
  • Update-based protocol
  • Invalidation-based protocol
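Bus Snooping
• Update-based protocol on a write-through cache
[Figure: three processors (P) with private caches on a bus to memory. One processor writes X = 505; the write appears as a bus transaction that changes memory from X = 100 to X = 505, and the other caches snoop the bus and update their copies from X = 100 to X = 505.]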
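Bus Snooping
• Invalidation-based protocol on a write-through cache
[Figure: three processors (P) with private caches on a bus to memory. One processor writes X = 505; the write appears as a bus transaction that changes memory from X = 100 to X = 505, and the other caches snoop the bus and invalidate their copies of X = 100. A later Load X on one of those processors then obtains X = 505.]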
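A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache
[State diagram: two states, Valid and Invalid; edges are labeled Observed / Transaction.]
Processor-initiated transactions
• Valid: PrRd / --- (stay in Valid)
• Valid: PrWr / BusWr (stay in Valid)
• Invalid: PrRd / BusRd (go to Valid)
• Invalid: PrWr / BusWr (stay in Invalid)
Bus-snooper-initiated transactions
• Valid: BusWr / --- (go to Invalid)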
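The two-state protocol above fits in a small lookup table. The following is a sketch under the assumption that the diagram follows the usual Valid/Invalid write-through protocol; the Python names are hypothetical, and a snooped BusRd is assumed to leave the state unchanged because the diagram shows no edge for it.

```python
VALID, INVALID = "Valid", "Invalid"

# (current state, processor request) -> (next state, bus transaction or None)
PROCESSOR_SIDE = {
    (VALID, "PrRd"): (VALID, None),         # read hit, no bus traffic
    (VALID, "PrWr"): (VALID, "BusWr"),      # write-through: every write goes on the bus
    (INVALID, "PrRd"): (VALID, "BusRd"),    # read miss: fetch the line
    (INVALID, "PrWr"): (INVALID, "BusWr"),  # no write-allocate: line stays Invalid
}


def processor_access(state: str, request: str):
    """Apply a request from the local processor."""
    return PROCESSOR_SIDE[(state, request)]


def snoop(state: str, bus_transaction: str) -> str:
    """Apply a bus transaction observed from another cache."""
    if state == VALID and bus_transaction == "BusWr":
        return INVALID                      # another cache wrote the line
    return state                            # assumption: everything else is ignored


if __name__ == "__main__":
    p1 = p2 = INVALID
    p1, bus = processor_access(p1, "PrRd")  # P1 loads the line
    p2, bus = processor_access(p2, "PrRd")  # P2 loads the line
    p1, bus = processor_access(p1, "PrWr")  # P1 writes, BusWr appears on the bus
    p2 = snoop(p2, bus)                     # P2 invalidates its copy
    print(p1, p2)
```

After the write, P1 stays Valid and P2 drops to Invalid, so P2's next read misses and fetches the new value from memory.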
How about Writeback Cache?
• A WB cache reduces the bandwidth requirement
• Most local writes stay hidden inside the processor node and never appear on the bus
• So how can the other caches snoop them?
Cache Coherence Protocols for WB Caches
• A cache has an exclusive copy of a line if
  • It is the only cache having a valid copy
  • Memory may or may not have an up-to-date copy
• Modified (dirty) cache line
  • The cache having the line is the owner of the line, because it must supply the block
Update-based Protocol on WB Cache
• Update the data for all processor nodes that share the same data
• When a processor node keeps updating the same memory location, every update goes on the bus, so a lot of traffic is incurred
[Figure, step 1: a processor executes Store X, changing its cached copy from X = 100 to X = 505; update messages on the bus change the copies in the sharing caches to X = 505 as well.]
[Figure, step 2: a further Store X changes the copies from X = 505 to X = 333 in the same way, so a Load X on a sharing processor hits in its own cache.]
Invalidation-based Protocol on WB Cache
• Invalidate the data copies held by the sharing processor nodes
• Reduces traffic when a processor node keeps updating the same memory location
[Figure, step 1: a processor executes Store X, changing its cached copy from X = 100 to X = 505; invalidate messages on the bus remove the X = 100 copies from the sharing caches, while memory still holds X = 100.]
[Figure, step 2: a Load X on another processor misses; the bus transaction gets a snoop hit in the cache holding X = 505, which supplies the data.]
[Figure, step 3: repeated Store X operations (X = 444, X = 333, X = 987) update the writer's cached copy.]
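The traffic argument for the two writeback protocols can be sketched by counting bus transactions for a single line X. This is a deliberately simplified model, assuming every processor in the trace starts with a valid shared copy, every store under the update protocol costs one bus transaction, and writebacks are ignored; the function name and trace format are hypothetical.

```python
def count_bus_transactions(protocol: str, trace: list) -> int:
    """Count bus transactions for a trace of (processor, op) pairs on line X,
    where op is "load" or "store"."""
    if protocol not in ("update", "invalidate"):
        raise ValueError("protocol must be 'update' or 'invalidate'")
    valid = {proc for proc, _ in trace}  # assumption: everyone starts with a copy
    owner = None                         # cache holding the only valid copy
    transactions = 0
    for proc, op in trace:
        if protocol == "update":
            if op == "store":            # every store is broadcast to the sharers
                transactions += 1
        elif op == "store" and owner != proc:
            transactions += 1            # invalidate the other copies once
            valid, owner = {proc}, proc
        elif op == "load" and proc not in valid:
            transactions += 1            # miss: fetch the line again
            valid.add(proc)
            owner = None
    return transactions


if __name__ == "__main__":
    burst = [("P1", "store")] * 10 + [("P2", "load")]
    print(count_bus_transactions("update", burst))      # 10
    print(count_bus_transactions("invalidate", burst))  # 2
```

For ten consecutive stores by P1 followed by one load by P2, the update protocol pays for every store, while the invalidation protocol pays once for the invalidation and once for P2's miss. In this model the comparison reverses when stores and remote loads strictly alternate, since every load then misses.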
MSI Writeback Invalidation Protocol
• Modified
  • Dirty
  • Only this cache has a valid copy
• Shared
  • Memory is consistent
  • One or more caches have a valid copy
• Invalid
• Writeback protocol: a cache line can be written multiple times before the memory is updated
MSI Writeback Invalidation Protocol
• Two types of request from the processor
  • PrRd
  • PrWr
• Three types of bus transactions posted by the cache controller
  • BusRd
    • Issued when a PrRd misses in the cache
    • Memory or another cache supplies the line
  • BusRdX (bus read exclusive, or read-to-own; serves as the invalidation message)
    • Issued when a PrWr targets a line that is not in the Modified state
  • BusWB
    • Writeback due to replacement
    • The processor is not directly involved in initiating this operation
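MSI Writeback Invalidation Protocol (Processor Request)
[State diagram, processor-initiated edges only; each edge is labeled request / bus transaction.]
• Modified: PrRd / --- (stay in Modified)
• Modified: PrWr / --- (stay in Modified)
• Shared: PrRd / --- (stay in Shared)
• Shared: PrWr / BusRdX (go to Modified)
• Invalid: PrRd / BusRd (go to Shared)
• Invalid: PrWr / BusRdX (go to Modified)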
MSI Writeback Invalidation Protocol (Bus Transaction)
[State diagram, bus-snooper-initiated edges only; each edge is labeled observed transaction / action.]
• Modified: BusRd / Flush (go to Shared)
• Modified: BusRdX / Flush (go to Invalid)
• Shared: BusRd / --- (stay in Shared)
• Shared: BusRdX / --- (go to Invalid)
• Flush puts the data on the bus
  • Both memory and the requestor grab the copy
• The requestor gets the data by
  • Cache-to-cache transfer; or
  • Memory
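MSI Writeback Invalidation Protocol
[State diagram combining processor-initiated and bus-snooper-initiated edges.]
• Modified
  • Processor-initiated: PrRd / --- and PrWr / --- (stay in Modified)
  • Bus-snooper-initiated: BusRd / Flush (go to Shared); BusRdX / Flush (go to Invalid)
• Shared
  • Processor-initiated: PrRd / --- (stay in Shared); PrWr / BusRdX (go to Modified)
  • Bus-snooper-initiated: BusRd / --- (stay in Shared); BusRdX / --- (go to Invalid)
• Invalid
  • Processor-initiated: PrRd / BusRd (go to Shared); PrWr / BusRdX (go to Modified)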
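The complete MSI diagram can be written down as two lookup tables, one per edge type. This is a sketch of the transitions on the slide, not a cache controller; the Python names are hypothetical, and the snoop entries for the Invalid state are an assumption (the diagram shows no edges there, so they are modeled as no-ops).

```python
M, S, I = "Modified", "Shared", "Invalid"

# (state, processor request) -> (next state, bus transaction or None)
PROCESSOR_EDGES = {
    (M, "PrRd"): (M, None),
    (M, "PrWr"): (M, None),
    (S, "PrRd"): (S, None),
    (S, "PrWr"): (M, "BusRdX"),
    (I, "PrRd"): (S, "BusRd"),
    (I, "PrWr"): (M, "BusRdX"),
}

# (state, observed bus transaction) -> (next state, action or None)
SNOOP_EDGES = {
    (M, "BusRd"): (S, "Flush"),
    (M, "BusRdX"): (I, "Flush"),
    (S, "BusRd"): (S, None),
    (S, "BusRdX"): (I, None),
    (I, "BusRd"): (I, None),    # assumption: not drawn on the slide
    (I, "BusRdX"): (I, None),   # assumption: not drawn on the slide
}


def on_processor_request(state: str, request: str):
    return PROCESSOR_EDGES[(state, request)]


def on_bus_transaction(state: str, transaction: str):
    return SNOOP_EDGES[(state, transaction)]


if __name__ == "__main__":
    writer, reader = S, S                                   # two caches share a line
    writer, bus = on_processor_request(writer, "PrWr")      # S -> M, posts BusRdX
    reader, action = on_bus_transaction(reader, bus)        # S -> I
    print(writer, reader)
```

Keeping the processor side and the snoop side in separate tables mirrors the two colors of edges in the diagram and makes it easy to check that every state handles every event.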
MSI Example
[Figure: P1, P2, and P3, each with a private cache, share a bus to MEMORY, which initially holds X=10.]
Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory
P3 reads X         S             ---           S             BusRd             Memory
P3 writes X        I             ---           M             BusRdX
P1 reads X         S             ---           S             BusRd             P3 Cache
P2 reads X         S             S             S             BusRd             Memory
Values along the way: P1 and P3 first cache X=10. P3's write sets its copy to X=-25 while MEMORY still holds X=10. When P1 reads again, P3's cache supplies the line and MEMORY is updated to X=-25, so P2 later reads X=-25 from MEMORY.
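The example trace can be replayed with a small simulation. This is a minimal sketch, assuming three caches, a single line X, and no replacements; the names are hypothetical. A line that was never cached is modeled as state I, where the slide shows "---", and the data supplier is left as None when the writer already holds the line.

```python
def run_trace(trace, memory_value=10):
    """Replay (processor, op, value) accesses to X under MSI.

    Returns (rows, final memory value); each row is
    (action, P1 state, P2 state, P3 state, bus transaction, data supplier).
    """
    caches = {name: {"state": "I", "value": None} for name in ("P1", "P2", "P3")}
    rows = []
    for proc, op, value in trace:
        me = caches[proc]
        bus = supplier = None
        read_miss = op == "read" and me["state"] == "I"
        write_needs_bus = op == "write" and me["state"] != "M"
        if read_miss or write_needs_bus:
            bus = "BusRd" if op == "read" else "BusRdX"
            if me["state"] == "I":
                supplier = "Memory"             # default when no cache owns the line
            for name, other in caches.items():
                if name == proc:
                    continue
                if other["state"] == "M":       # owner flushes: memory and requestor get it
                    memory_value = other["value"]
                    supplier = f"{name} Cache"
                    other["state"] = "S"
                if bus == "BusRdX":             # read-exclusive invalidates other copies
                    other["state"] = "I"
            if me["state"] == "I":
                me["value"] = memory_value
        if op == "write":
            me["state"], me["value"] = "M", value
        elif me["state"] == "I":
            me["state"] = "S"
        rows.append((f"{proc} {op}s X", caches["P1"]["state"],
                     caches["P2"]["state"], caches["P3"]["state"], bus, supplier))
    return rows, memory_value


TRACE = [("P1", "read", None), ("P3", "read", None), ("P3", "write", -25),
         ("P1", "read", None), ("P2", "read", None)]

if __name__ == "__main__":
    rows, final_memory = run_trace(TRACE)
    for row in rows:
        print(row)
    print("MEMORY X =", final_memory)
```

Apart from printing I where the slide shows "---", the rows match the table: the only cache-to-cache transfer is P1's second read, which is also the step that brings memory up to date.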