Chapter 5: Thread Level Parallelism and Cache Coherence

Yanmin Zhu, Department of Computer Science and Engineering, Shanghai Jiao Tong University

Outline • Introduction • Cache Coherence • Snooping-based Protocol

Thread-Level Parallelism • Uses MIMD model • Has multiple program counters • Targeted for tightly-coupled shared-memory multiprocessors • Amount of computation assigned to each thread = grain size • Threads can be used for data-level parallelism, but the overheads may outweigh the benefit

Two Classes of Multiprocessors Depending on the memory organization

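Memory Hierarchy in a Multiprocessor [Figure: three memory organizations. Bus-based shared memory: processors (P) with private caches ($) in front of a single shared memory. Fully-connected shared memory (dancehall): processors with private caches on one side of an interconnection network, memory modules on the other. Distributed shared memory: each processor has its own cache and memory, and the nodes are joined by an interconnection network.]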

Symmetric Multiprocessors (SMP) • Small number of cores • Share single memory with uniform memory latency

Distributed Shared Memory (DSM) • Memory distributed among processors • Non-uniform memory access/latency (NUMA) • Processors connected via interconnection networks

Multiprocessors vs. Clusters (WSC) Both multiprocessors and clusters follow the MIMD model, but are quite different

Comparisons of Communication • Multiprocessors, both SMP and DSM • Communication among threads occurs through a shared address space • A memory reference can be made by any processor to any memory location • Shared memory means the address space is shared • Clusters and WSCs • Look like individual computers connected by a network • Message-passing protocols are used to communicate data among processors

Challenges of Parallel Processing Two important hurdles make parallel processing challenging • (1) The limited parallelism available in programs • Limited parallelism makes it difficult to achieve good speedups on any parallel processor • Addressed by good algorithms! • (2) The relatively high cost of communication • Remote accesses in a parallel processor have large latency • Addressed by the architecture (caching, again) and by the programmer
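The first hurdle is commonly quantified with Amdahl's law, which the slide does not name. The sketch below is an illustration under that assumption; the function name `amdahl_speedup` and the sample fractions are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a fixed-size problem."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at the original speed; only the parallel part scales.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Illustrative parallel fractions, all on 100 processors.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 100), 2))
```

Even with 90% of the work parallelizable, 100 processors give a speedup below 10 in this model, which is why the available parallelism matters so much.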

Why Cache Coherency? • In multicores, the closest cache level is private • Multiple copies of a cache line can be present across different processor nodes • Local updates (writes) lead to an incoherent state • The problem arises in both write-through and writeback caches Slide from Prof. H.H. Lee in Georgia Tech

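Writeback Cache w/o Coherence [Figure: three processors (P), each with a private cache, above a shared memory. One processor writes X, changing its cached copy from 100 to 505; memory still holds X = 100, so reads of X by the other processors still see X = 100.] Slide from Prof. H.H. Lee in Georgia Tech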

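Writethrough Cache w/o Coherence [Figure: three processors (P), each with a private cache, above a shared memory. One processor writes X = 505, and the write goes through to memory (X changes from 100 to 505); another cache still holds the stale copy X = 100, so a read there does not see the new value.] Slide from Prof. H.H. Lee in Georgia Tech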
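The two situations above can be reproduced with a toy model. This is a minimal sketch, not a real cache: the class `PrivateCache` and the function `stale_read_demo` are hypothetical names, eviction is not modeled, and the values 100 and 505 are taken from the slides.

```python
class PrivateCache:
    """A private cache with no coherence support (illustrative model only)."""

    def __init__(self, memory, write_through):
        self.memory = memory              # shared dict: address -> value
        self.write_through = write_through
        self.lines = {}                   # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]           # hit: the copy may be stale

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value     # memory is updated immediately
        # Writeback: memory would be updated only on eviction (not modeled).


def stale_read_demo(write_through):
    """Return (value read by P2, value read by P3, value in memory)."""
    memory = {"X": 100}
    p1 = PrivateCache(memory, write_through)
    p2 = PrivateCache(memory, write_through)
    p3 = PrivateCache(memory, write_through)
    p1.read("X")
    p2.read("X")                          # P1 and P2 both cache X = 100
    p1.write("X", 505)                    # P1 updates only what it can reach
    return p2.read("X"), p3.read("X"), memory["X"]
```

With `write_through=False` every other reader still sees 100. With `write_through=True` memory and a fresh reader see 505, but the cache that already held X keeps returning 100.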

Definition of Coherence • Caching shared data introduces a new problem (the cache coherence problem): two processors could end up seeing two different values for the same location • Because the view of memory held by two different processors is through their individual caches • Or because there is a global state defined by the main memory and a local state defined by the individual caches • Loose definition: a memory system is coherent if any read of a data item returns the most recently written value of that data item

Precise Definition of Coherence • 1) A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P • 2) A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses • 3) Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. Slide from Prof. H.H. Lee in Georgia Tech

Implications • Write propagation • Writes are visible to other processes • Write serialization • All writes to the same location are seen in the same order by all processes • For example, if read operations by P1 to a location see the value produced by write w1 (say, from P2) before the value produced by write w2 (say, from P3), then reads by another process P4 (or P2 or P3) also should not be able to see w2 before w1 Slide from Prof. H.H. Lee in Georgia Tech
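Write serialization can be checked mechanically on a log of observations. The sketch below is one possible check, assuming each write to a single location has a distinct label (such as "w1") and that each observer reports the labels in the order it saw them; the function name `writes_are_serialized` is hypothetical. It uses `graphlib` from the Python standard library (Python 3.9 or later) to test whether one total order is compatible with every observer.

```python
from graphlib import CycleError, TopologicalSorter


def writes_are_serialized(observations):
    """Return True if all observers' views fit a single order of the writes.

    observations maps an observer name to the list of write labels it saw,
    in order, for ONE memory location.
    """
    sorter = TopologicalSorter()
    for order in observations.values():
        for earlier, later in zip(order, order[1:]):
            if earlier != later:          # repeated reads of one write are fine
                sorter.add(later, earlier)
    try:
        sorter.prepare()                  # raises CycleError on conflicting orders
    except CycleError:
        return False
    return True
```

For the slide's example, `{"P1": ["w1", "w2"], "P4": ["w2", "w1"]}` is rejected because P4 sees w2 before w1.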

Sounds Easy? [Figure: processors P0 to P3, initially with A = 0 and B = 0. The updates A = 1 and B = 2 propagate over time steps T1 to T3 and reach different processors at different times: one processor sees A's update before B's, another sees B's update before A's.]

Cache Coherence Protocols According to Caching Policies

Bus Snooping based on Write-Through Cache • Every write appears as a transaction on the shared bus to memory • Two protocols • Update-based Protocol • Invalidation-based Protocol Slide from Prof. H.H. Lee in Georgia Tech

Bus Snooping • Update-based Protocol on Write-Through cache [Figure: one processor writes X = 505. The write appears as a bus transaction that updates memory from X = 100 to X = 505, and a cache that snoops the bus updates its own copy of X from 100 to 505.] Slide from Prof. H.H. Lee in Georgia Tech

Bus Snooping • Invalidation-based Protocol on Write-Through cache [Figure: one processor writes X = 505. The bus transaction updates memory from X = 100 to X = 505, and a cache that snoops the write invalidates its copy X = 100. A later Load X on that processor obtains X = 505.] Slide from Prof. H.H. Lee in Georgia Tech

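A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache
States: Valid, Invalid. Each transition is labeled Observed / Transaction.
Processor-initiated transitions:
• Valid: PrRd / --- (stay in Valid)
• Valid: PrWr / BusWr (stay in Valid)
• Invalid: PrRd / BusRd (go to Valid)
• Invalid: PrWr / BusWr (stay in Invalid)
Bus-snooper-initiated transition:
• Valid: BusWr / --- (go to Invalid)
Slide from Prof. H.H. Lee in Georgia Tech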
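The two-state protocol is small enough to simulate directly. The following is a minimal sketch, assuming a single shared bus that delivers every BusWr to all other caches; the class names `WTCache` and `Bus` and the function `demo` are hypothetical, and a line is treated as Valid exactly when it is present in `lines`.

```python
class Bus:
    """Shared bus in front of memory; it records every transaction."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []
        self.log = []

    def bus_rd(self, addr):
        self.log.append("BusRd")
        return self.memory[addr]

    def bus_wr(self, writer, addr, value):
        self.log.append("BusWr")
        self.memory[addr] = value                  # write-through to memory
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_bus_wr(addr)           # every other cache snoops


class WTCache:
    """Write-through, no-write-allocate cache with Valid/Invalid lines."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                            # present means Valid
        bus.caches.append(self)

    def pr_rd(self, addr):
        if addr not in self.lines:                 # Invalid: PrRd / BusRd
            self.lines[addr] = self.bus.bus_rd(addr)
        return self.lines[addr]                    # Valid: PrRd / ---

    def pr_wr(self, addr, value):
        if addr in self.lines:                     # Valid: update the local copy
            self.lines[addr] = value
        self.bus.bus_wr(self, addr, value)         # PrWr / BusWr in both states

    def snoop_bus_wr(self, addr):
        self.lines.pop(addr, None)                 # BusWr / ---: Valid -> Invalid


def demo():
    bus = Bus({"X": 100})
    p1, p2 = WTCache(bus), WTCache(bus)
    p1.pr_rd("X")
    p2.pr_rd("X")
    p1.pr_wr("X", 505)                             # invalidates P2's copy
    return p2.pr_rd("X"), bus.log
```

In `demo`, P2's second read misses because its copy was invalidated, so it fetches the new value from memory instead of returning the stale one.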

How about Writeback Cache? • WB cache to reduce bandwidth requirement • The majority of local writes are hidden behind the processor nodes • How to snoop? Slide from Prof. H.H. Lee in Georgia Tech

Cache Coherence Protocols for WB Caches • A cache has an exclusive copy of a line if • It is the only cache having a valid copy • Memory may or may not have it • Modified (dirty) cache line • The cache having the line is the owner of the line, because it must supply the block Slide from Prof. H.H. Lee in Georgia Tech

Update-based Protocol on WB Cache • Update the data for all processor nodes that share the same data • If a processor node keeps updating the same memory location, a lot of traffic is incurred [Figure: one processor issues Store X with X = 505; update messages on the bus change the other copies of X from 100 to 505.] Slide from Prof. H.H. Lee in Georgia Tech

Update-based Protocol on WB Cache (continued) [Figure: a further Store X changes X from 505 to 333, and the update is again sent over the bus to the other copies. A Load X on a sharing processor then hits in its own cache.] Slide from Prof. H.H. Lee in Georgia Tech

Invalidation-based Protocol on WB Cache • Invalidate the data copies for the sharing processor nodes • Reduced traffic when a processor node keeps updating the same memory location [Figure: one processor issues Store X and its cached copy becomes X = 505; invalidate messages on the bus invalidate the other cached copies of X (value 100), while memory still shows X = 100.] Slide from Prof. H.H. Lee in Georgia Tech

Invalidation-based Protocol on WB Cache (continued) [Figure: another processor issues Load X and misses. The resulting bus transaction gets a snoop hit in the cache holding X = 505, and the requesting cache obtains X = 505.] Slide from Prof. H.H. Lee in Georgia Tech

Invalidation-based Protocol on WB Cache (continued) [Figure: three Store X operations are issued; the values X = 333, X = 444, and X = 987 appear in the caches alongside X = 505.] Slide from Prof. H.H. Lee in Georgia Tech
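The traffic trade-off between the two writeback schemes can be estimated by counting bus transactions for an access trace on a single line. This is a simplified sketch under stated assumptions: the function name `count_bus_transactions` is hypothetical, the invalidation scheme behaves like MSI, the update scheme broadcasts every store made while another cache holds the line, and writebacks due to replacement are ignored.

```python
def count_bus_transactions(trace, protocol, processors=3):
    """Count bus transactions for a trace of (processor_id, "load" | "store")."""
    if protocol not in ("invalidate", "update"):
        raise ValueError(f"unknown protocol: {protocol}")
    # Per-processor line state. invalidate: "M", "S", "I"; update: "V", "I".
    state = ["I"] * processors
    count = 0
    for proc, op in trace:
        if protocol == "invalidate":
            if op == "load" and state[proc] == "I":
                count += 1                          # read miss goes to the bus
                state = ["S" if s == "M" else s for s in state]
                state[proc] = "S"
            elif op == "store" and state[proc] != "M":
                count += 1                          # invalidate all other copies
                state = ["I"] * processors
                state[proc] = "M"
            # Loads that hit and stores in state M cause no bus traffic.
        else:
            if state[proc] == "I":
                count += 1                          # fetch the line on a miss
                state[proc] = "V"
            if op == "store" and state.count("V") > 1:
                count += 1                          # broadcast the new value
    return count
```

Under these assumptions, ten consecutive stores by one processor to a line held by two caches cost one transaction with invalidation and ten with update. When every store is followed by a load on another processor, update needs fewer transactions because the loads hit.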

MSI Writeback Invalidation Protocol • Modified: dirty; only this cache has a valid copy • Shared: memory is consistent; one or more caches have a valid copy • Invalid • Writeback protocol: A cache line can be written multiple times before the memory is updated Slide from Prof. H.H. Lee in Georgia Tech

MSI Writeback Invalidation Protocol • Two types of request from the processor: PrRd and PrWr • Three types of bus transactions posted by the cache controller • BusRd: issued when a PrRd misses the cache; memory or another cache supplies the line • BusRdX (BusRd eXclusive, read-to-own; i.e., an invalidation message): issued when a PrWr targets a line that is not in the Modified state • BusWB: writeback due to replacement; the processor is not directly involved in initiating this operation Slide from Prof. H.H. Lee in Georgia Tech

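MSI Writeback Invalidation Protocol (Processor Request)
Processor-initiated transitions:
• Invalid: PrRd / BusRd (go to Shared)
• Invalid: PrWr / BusRdX (go to Modified)
• Shared: PrRd / --- (stay in Shared)
• Shared: PrWr / BusRdX (go to Modified)
• Modified: PrRd / --- (stay in Modified)
• Modified: PrWr / --- (stay in Modified)
Slide from Prof. H.H. Lee in Georgia Tech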

MSI Writeback Invalidation Protocol (Bus Transaction)
Bus-snooper-initiated transitions:
• Modified: BusRd / Flush (go to Shared)
• Modified: BusRdX / Flush (go to Invalid)
• Shared: BusRd / --- (stay in Shared)
• Shared: BusRdX / --- (go to Invalid)
Flush:
• Flush data on the bus
• Both memory and the requestor grab the copy
• The requestor gets the data by cache-to-cache transfer or from memory
Slide from Prof. H.H. Lee in Georgia Tech

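MSI Writeback Invalidation Protocol (processor-initiated and bus-snooper-initiated transitions combined)
• Modified: PrRd / --- and PrWr / --- (stay in Modified); BusRd / Flush (go to Shared); BusRdX / Flush (go to Invalid)
• Shared: PrRd / --- and BusRd / --- (stay in Shared); PrWr / BusRdX (go to Modified); BusRdX / --- (go to Invalid)
• Invalid: PrRd / BusRd (go to Shared); PrWr / BusRdX (go to Modified)
Slide from Prof. H.H. Lee in Georgia Tech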
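The MSI state machine can be written down as two lookup tables, one per kind of event. This is a minimal sketch: the names `PROCESSOR`, `SNOOP`, and `step` are hypothetical, and the snoop entries for the Invalid state (which the slides leave implicit) are an assumption that an invalid line ignores bus traffic.

```python
# (current state, observed event) -> (next state, action placed on the bus)
# "---" means no bus action, as in the slides.
PROCESSOR = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", "---"),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("M", "PrRd"): ("M", "---"),
    ("M", "PrWr"): ("M", "---"),
}

SNOOP = {
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"): ("S", "---"),
    ("S", "BusRdX"): ("I", "---"),
    ("I", "BusRd"): ("I", "---"),      # assumption: Invalid ignores the bus
    ("I", "BusRdX"): ("I", "---"),     # assumption: Invalid ignores the bus
}


def step(state, event):
    """Return (next_state, bus_action) for one cache line and one event."""
    table = PROCESSOR if event.startswith("Pr") else SNOOP
    return table[(state, event)]
```

For example, `step("S", "PrWr")` returns `("M", "BusRdX")`: a write to a Shared line must first invalidate the other copies.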

MSI Example
Three processors (P1, P2, P3) with private caches share a bus; memory initially holds X = 10.

Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory
P3 reads X         S             ---           S             BusRd             Memory
P3 writes X        I             ---           M             BusRdX            ---
P1 reads X         S             ---           S             BusRd             P3 Cache
P2 reads X         S             S             S             BusRd             Memory

P3's write sets X = -25 in its own cache only. Memory changes from X = 10 to X = -25 when P3 flushes the line in response to P1's read.
Slide from Prof. H.H. Lee in Georgia Tech
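The example can be replayed with a few lines of simulation. This is a sketch under stated assumptions, not the slide author's code: `TRACE` and `run_msi_example` are hypothetical names, "---" marks a cache that has never held X, and a write to a line already held in Shared is assumed to need no data supplier.

```python
TRACE = [
    ("P1", "read", None),
    ("P3", "read", None),
    ("P3", "write", -25),
    ("P1", "read", None),
    ("P2", "read", None),
]


def run_msi_example(trace=TRACE):
    """Return one row per access:
    (action, P1 state, P2 state, P3 state, bus transaction, supplier, memory X).
    """
    memory = 10                                   # initial value of X
    state = {"P1": "---", "P2": "---", "P3": "---"}
    cached = {}                                   # processor -> cached X
    rows = []
    for proc, op, new_value in trace:
        bus, supplier = "---", "---"
        has_copy = state[proc] in ("S", "M")
        owner = next(
            (p for p, s in state.items() if s == "M" and p != proc), None
        )
        if op == "read" and not has_copy:
            bus = "BusRd"
        elif op == "write" and state[proc] != "M":
            bus = "BusRdX"
        if bus != "---":
            if owner is not None:                 # owner flushes the dirty line
                memory = cached[owner]
                supplier = f"{owner} Cache"
            elif not has_copy:
                supplier = "Memory"
            if bus == "BusRd":
                if owner is not None:
                    state[owner] = "S"            # BusRd / Flush: M -> S
                state[proc] = "S"
            else:
                for other in state:               # BusRdX invalidates copies
                    if other != proc and state[other] in ("S", "M"):
                        state[other] = "I"
                state[proc] = "M"
            if not has_copy:
                cached[proc] = memory
        if op == "write":
            cached[proc] = new_value
        rows.append((f"{proc} {op}s X", state["P1"], state["P2"],
                     state["P3"], bus, supplier, memory))
    return rows


if __name__ == "__main__":
    for row in run_msi_example():
        print(*row, sep="\t")
```

The last element of each row is the value of X in memory after the access, which shows that memory is only brought up to date when the Modified line is flushed.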
