
Chapter 5: Thread Level Parallelism and Cache Coherence



  1. Chapter 5: Thread Level Parallelism and Cache Coherence Yanmin Zhu Department of Computer Science and Engineering Shanghai Jiao Tong University

  2. Outline • Introduction • Cache Coherence • Snooping-based Protocol

  3. Thread-Level Parallelism • Uses MIMD model • Has multiple program counters • Targeted for tightly-coupled shared-memory multiprocessors • Amount of computation assigned to each thread = grain size • Threads can be used for data-level parallelism, but the overheads may outweigh the benefit
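A minimal sketch of these ideas in code (my own illustration, not from the slides): the C++ below splits a reduction across threads; the contiguous chunk each thread gets is its grain size, and if that chunk is too small the thread creation and join overhead outweighs the benefit, which is the caveat in the last bullet.

```cpp
// Minimal sketch: thread-level parallelism over a shared array.
// The per-thread chunk ("grain") must be large enough to amortize
// the cost of creating and joining the threads.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> data(n, 1.0);
    std::vector<double> partial(num_threads, 0.0);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread owns one contiguous chunk = its grain size.
            std::size_t begin = t * n / num_threads;
            std::size_t end   = (t + 1) * n / num_threads;
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();

    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::cout << "sum = " << total << '\n';
    return 0;
}
```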

  4. Two Classes of Multiprocessors Depending on the memory organization

  5. Memory Hierarchy in a Multiprocessor [Figure: three shared-memory organizations: bus-based shared memory (processors with private caches on a common bus to memory), fully-connected "dancehall" shared memory (processors with caches and memory modules joined by an interconnection network), and distributed shared memory (a memory module attached to each processor node, with the nodes joined by an interconnection network)]

  6. Symmetric Multiprocessors (SMP) • Small number of cores • Share single memory with uniform memory latency

  7. Distributed Shared Memory (DSM) • Memory distributed among processors • Non-uniform memory access/latency (NUMA) • Processors connected via interconnection networks

  8. Multiprocessors vs. Clusters (WSC) Both multiprocessors and clusters follow the MIMD model, but are quite different

  9. Comparisons of Communication • Multiprocessors, both SMP and DSM • Communication among threads occurs through a shared address space • A memory reference can be made by any processor to any memory location • Shared memory means the address space is shared • Clusters and WSCs • Look like individual computers connected by a network • Message-passing protocols are used to communicate data among processors

  10. Challenges of Parallel Processing Two important hurdles make parallel processing challenging • (1) The limited parallelism available in programs • Limitations in available parallelism make it difficult to achieve good speedups in any parallel processor • Solved by good algorithms! • (2) The relatively high cost of communications • Large latency of remote accesses in a parallel processor • Solved by the architecture (the use of caches, again) and by the programmer
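A quick worked illustration of hurdle (1), using Amdahl's Law (my addition; the slide does not state the formula): with parallel fraction f and N processors, Speedup = 1 / ((1 - f) + f / N). For N = 100 and f = 0.99, Speedup = 1 / (0.01 + 0.0099) ≈ 50, so leaving just 1% of the work sequential already halves the ideal 100x speedup.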

  11. Outline • Introduction • Cache Coherence • Snooping-based Protocol

  12. Why Cache Coherency? • In multicores, the cache level closest to each core is private • Multiple copies of a cache line can be present across different processor nodes • Local updates (writes) lead to an incoherent state • The problem shows up in both write-through and writeback caches Slide from Prof. H.H. Lee in Georgia Tech

  13. Writeback Cache w/o Coherence [Figure: one processor writes X=505 into its own cache; the other caches and memory still hold X=100, so subsequent reads by the other processors return the stale value] Slide from Prof. H.H. Lee in Georgia Tech

  14. Writethrough Cache w/o Coherence [Figure: the writing processor updates its cache and memory to X=505, but another cache still holds the stale copy X=100 and returns it on a read] Slide from Prof. H.H. Lee in Georgia Tech

  15. Definition of Coherence • Caching shared data introduces a new problem (the cache coherence problem): two processors could end up seeing two different values for the same memory location • Because the view of memory held by two different processors is through their individual caches • Or because there is a global state defined by the main memory and a local state defined by the individual caches • Loose definition: a memory system is coherent if any read of a data item returns the most recently written value of that data item

  16. Precise Definition of Coherence • 1) A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P • 2) A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses • 3) Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. Slide from Prof. H.H. Lee in Georgia Tech

  17. Implications • Write propagation • Writes are visible to other processes • Write serialization • All writes to the same location are seen in the same order by all processes • For example, if read operations by P1 to a location see the value produced by write w1 (say, from P2) before the value produced by write w2 (say, from P3), then reads by another process P4 (or P2 or P3) also should not be able to see w2 before w1 Slide from Prof. H.H. Lee in Georgia Tech
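The write-serialization requirement can be checked mechanically. The following C++ sketch is my own illustration, not from the slides (the function name writes_serialized is invented); it flags exactly the situation described above: two processors that disagree on the relative order of the same pair of writes to one location.

```cpp
// Minimal sketch: check write serialization for a single location.
// Each processor reports the order in which it observed write values.
// Coherence requires that no two processors disagree on the relative
// order of the same pair of writes.
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

bool writes_serialized(const std::vector<std::vector<std::string>>& observed) {
    // Record every ordered pair (earlier, later) seen by some processor.
    std::set<std::pair<std::string, std::string>> before;
    for (const auto& order : observed) {
        for (std::size_t i = 0; i < order.size(); ++i) {
            for (std::size_t j = i + 1; j < order.size(); ++j) {
                before.insert({order[i], order[j]});
                // Conflict: another processor saw the reverse order.
                if (before.count({order[j], order[i]})) return false;
            }
        }
    }
    return true;
}

int main() {
    // One processor sees w1 then w2, another sees w2 then w1: violation.
    std::vector<std::vector<std::string>> obs = {{"w1", "w2"}, {"w2", "w1"}};
    std::cout << (writes_serialized(obs) ? "serialized" : "violation") << '\n';
    return 0;
}
```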

  18. Sounds Easy? [Figure: starting from A=0, B=0, processor P0 writes A=1 and P1 writes B=2; as the updates propagate over time steps T1, T2, T3, one processor (P2) sees A's update before B's while another (P3) sees B's update before A's]

  19. Cache Coherence Protocols According to Caching Policies

  20. Outline • Introduction • Cache Coherence • Snooping-based Protocol

  21. Bus Snooping based on Write-Through Cache • Every write appears as a transaction on the shared bus to memory • Two protocols • Update-based Protocol • Invalidation-based Protocol Slide from Prof. H.H. Lee in Georgia Tech

  22. Bus Snooping • Update-based Protocol on Write-Through cache [Figure: the writing processor puts X=505 on the bus; the snooping cache that holds X updates its copy from 100 to 505, and memory is updated as well] Slide from Prof. H.H. Lee in Georgia Tech

  23. Bus Snooping • Invalidation-based Protocol on Write-Through cache [Figure: the writing processor puts X=505 on the bus; snooping caches invalidate their copies of X, memory is updated, and a later Load X by another processor misses and fetches X=505] Slide from Prof. H.H. Lee in Georgia Tech

  24. A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache [State diagram with two states, Valid and Invalid; edges are labeled Observed / Transaction and are either processor-initiated or bus-snooper-initiated • Valid: PrRd / --- and PrWr / BusWr keep the line Valid; a snooped BusWr / --- moves it to Invalid • Invalid: PrRd / BusRd moves the line to Valid; PrWr / BusWr leaves it Invalid (no write-allocate)] Slide from Prof. H.H. Lee in Georgia Tech
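A minimal C++ sketch of this two-state machine (my own rendering of the diagram, with invented type names; not course code):

```cpp
// Minimal sketch: snoopy coherence for a write-through,
// no-write-allocate cache. Each line is either Valid or Invalid.
#include <iostream>

enum class State { Invalid, Valid };
enum class Bus   { None, BusRd, BusWr };

struct Line {
    State state = State::Invalid;

    // Processor-initiated events: return the bus transaction to issue.
    Bus pr_read() {
        if (state == State::Invalid) {   // miss: fetch the block
            state = State::Valid;
            return Bus::BusRd;
        }
        return Bus::None;                // hit: no bus traffic
    }
    Bus pr_write() {
        // Write-through: every write goes on the bus.
        // No write-allocate: a write miss does not change the state.
        return Bus::BusWr;
    }

    // Bus-snooper-initiated event: another cache's write invalidates us.
    void snoop_bus_write() { state = State::Invalid; }
};

int main() {
    Line l;
    l.pr_read();            // Invalid -> Valid, issues BusRd
    l.pr_write();           // stays Valid, issues BusWr
    l.snoop_bus_write();    // another processor wrote: Valid -> Invalid
    std::cout << (l.state == State::Invalid ? "Invalid\n" : "Valid\n");
    return 0;
}
```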

  25. How about Writeback Cache? • WB cache to reduce bandwidth requirement • The majority of local writes are hidden behind the processor nodes • How to snoop? Slide from Prof. H.H. Lee in Georgia Tech

  26. Cache Coherence Protocols for WB Caches • A cache has an exclusive copy of a line if • It is the only cache having a valid copy • Memory may or may not have it • Modified (dirty) cache line • The cache having the line is the owner of the line, because it must supply the block Slide from Prof. H.H. Lee in Georgia Tech

  27. Update-based Protocol on WB Cache • Update data for all processor nodes who share the same data • Because a processor node keeps updating the memory location, a lot of traffic will be incurred [Figure: a Store X by one processor broadcasts the new value X=505 on the bus; the other caches holding X update their copies] Slide from Prof. H.H. Lee in Georgia Tech

  28. Update-based Protocol on WB Cache • Update data for all processor nodes who share the same data • Because a processor node keeps updating the memory location, a lot of traffic will be incurred [Figure: after a further Store X to 333, the update is again broadcast; a subsequent Load X by another processor hits in its own cache with the fresh value] Slide from Prof. H.H. Lee in Georgia Tech

  29. Invalidation-based Protocol on WB Cache • Invalidate the data copies for the sharing processor nodes • Reduced traffic when a processor node keeps updating the same memory location [Figure: a Store X by one processor sends an invalidation on the bus; the other caches invalidate their copies of X and only the writer holds the new value X=505] Slide from Prof. H.H. Lee in Georgia Tech

  30. Invalidation-based Protocol on WB Cache • Invalidate the data copies for the sharing processor nodes • Reduced traffic when a processor node keeps updating the same memory location [Figure: a later Load X by an invalidated processor misses; the owning cache snoop-hits and supplies X=505 over the bus] Slide from Prof. H.H. Lee in Georgia Tech

  31. Invalidation-based Protocol on WB Cache • Invalidate the data copies for the sharing processor nodes • Reduced traffic when a processor node keeps updating the same memory location [Figure: the owning processor keeps storing to X (505, 333, 444, 987, ...) without any further bus transactions, since the other copies are already invalid] Slide from Prof. H.H. Lee in Georgia Tech

  32. MSI Writeback Invalidation Protocol • Modified • Dirty • Only this cache has a valid copy • Shared • Memory is consistent • One or more caches have a valid copy • Invalid • Writeback protocol: A cache line can be written multiple times before the memory is updated Slide from Prof. H.H. Lee in Georgia Tech

  33. MSI Writeback Invalidation Protocol • Two types of request from the processor • PrRd • PrWr • Three types of bus transactions posted by the cache controller • BusRd • PrRd misses the cache • Memory or another cache supplies the line • BusRdX (read-exclusive, i.e., read-to-own; this serves as the invalidation message) • PrWr is issued to a line which is not in the Modified state • BusWB • Writeback due to replacement • The processor is not directly involved in initiating this operation Slide from Prof. H.H. Lee in Georgia Tech

  34. MSI Writeback Invalidation Protocol (Processor Request) [State diagram, processor-initiated transitions • Invalid: PrRd / BusRd → Shared; PrWr / BusRdX → Modified • Shared: PrRd / --- stays Shared; PrWr / BusRdX → Modified • Modified: PrRd / --- and PrWr / --- stay Modified] Slide from Prof. H.H. Lee in Georgia Tech

  35. MSI Writeback Invalidation Protocol (Bus Transaction) [State diagram, bus-snooper-initiated transitions • Modified: BusRd / Flush → Shared; BusRdX / Flush → Invalid • Shared: BusRd / --- stays Shared; BusRdX / --- → Invalid] • Flush: the data is put on the bus • Both memory and the requestor will grab the copy • The requestor gets the data by • Cache-to-cache transfer; or • Memory Slide from Prof. H.H. Lee in Georgia Tech

  36. MSI Writeback Invalidation Protocol [Complete state diagram, combining the processor-initiated and bus-snooper-initiated transitions • Invalid: PrRd / BusRd → Shared; PrWr / BusRdX → Modified • Shared: PrRd / ---; PrWr / BusRdX → Modified; BusRd / ---; BusRdX / --- → Invalid • Modified: PrRd / ---; PrWr / ---; BusRd / Flush → Shared; BusRdX / Flush → Invalid] Slide from Prof. H.H. Lee in Georgia Tech
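As a minimal sketch, the diagram above can be coded as a per-line state machine. The C++ below is my own rendering of slides 34-36 (type and function names are invented, not from the course); it returns the bus transaction a cache issues for each processor request and reports whether a snooped transaction forces a flush of a dirty line.

```cpp
// Minimal sketch of the MSI writeback-invalidation protocol for one
// cache line: processor requests (PrRd/PrWr) and snooped bus
// transactions (BusRd/BusRdX) drive the state transitions.
#include <iostream>

enum class State { Modified, Shared, Invalid };
enum class Bus   { None, BusRd, BusRdX };

struct MsiLine {
    State state = State::Invalid;

    // Processor-initiated: returns the bus transaction this cache issues.
    Bus pr_read() {
        if (state == State::Invalid) {      // read miss
            state = State::Shared;
            return Bus::BusRd;
        }
        return Bus::None;                   // hit in S or M
    }
    Bus pr_write() {
        if (state != State::Modified) {     // need exclusive ownership
            state = State::Modified;
            return Bus::BusRdX;             // read-to-own / invalidation
        }
        return Bus::None;                   // already Modified: silent write
    }

    // Bus-snooper-initiated: returns true if the dirty line must be
    // flushed on the bus (memory and the requestor grab the copy).
    bool snoop(Bus observed) {
        bool flush = (state == State::Modified) && (observed != Bus::None);
        if (observed == Bus::BusRd) {
            if (state == State::Modified) state = State::Shared;
        } else if (observed == Bus::BusRdX) {
            state = State::Invalid;
        }
        return flush;
    }
};

int main() {
    // Replay P3's copy of X from the example that follows.
    MsiLine p3;
    p3.pr_read();                         // I -> S via BusRd
    p3.pr_write();                        // S -> M via BusRdX (others invalidate)
    bool flushed = p3.snoop(Bus::BusRd);  // P1 reads X: M -> S, flush
    std::cout << "flush=" << flushed << '\n';
    return 0;
}
```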

  37.–41. MSI Example [Figure: processors P1, P2, P3, each with a private cache, on a shared bus; memory initially holds X=10, and P3 later writes X=-25. Slides 37 through 41 build the table below one row at a time.]
  Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
  P1 reads X       | S           | ---         | ---         | BusRd           | Memory
  P3 reads X       | S           | ---         | S           | BusRd           | Memory
  P3 writes X      | I           | ---         | M           | BusRdX          | ---
  P1 reads X       | S           | ---         | S           | BusRd           | P3 Cache
  P2 reads X       | S           | S           | S           | BusRd           | Memory
  (On P1's read after P3's write, P3 flushes the dirty line; both P1 and memory grab the copy, so memory holds X=-25 and can supply P2's later read.) Slide from Prof. H.H. Lee in Georgia Tech
