Explore coherence decoupling in cache systems to balance speed and correctness, adapting protocol states for optimal performance in multi-processor environments.
CS 7810 Lecture 19 • Coherence Decoupling: Making Use of Incoherence • J. Huh, J. Chang, D. Burger, G. Sohi • Proceedings of ASPLOS-XI, October 2004
Coherence / Consistency • Coherence guarantees (i) that a write will eventually be seen by other processors, and (ii) write serialization (all processors see writes to the same location in the same order) • The consistency model defines the ordering of writes and reads to different memory locations – the hardware guarantees a certain consistency model and the programmer attempts to write correct programs with those assumptions
Consistency Examples
• Example 1. Initially, A = B = 0
    P1: A = 1; if (B == 0) critical section
    P2: B = 1; if (A == 0) critical section
• Example 2. Initially, A = B = 0
    P1: A = 1
    P2: if (A == 1) B = 1
    P3: if (B == 1) register = A
• Example 3.
    P1: Data = 2000; Head = 1
    P2: while (Head == 0) { }; … = Data
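Example 1 can be checked by brute force. The sketch below enumerates every interleaving of the two programs that preserves each processor's program order (the set of executions sequential consistency allows) and records what each processor reads. The operation encoding and the names `interleavings`, `run`, and `both_enter` are illustrative assumptions, not part of the lecture.

```python
from itertools import combinations

# Each processor's program, in program order: ("w", var, value) or ("r", var).
P1 = [("w", "A", 1), ("r", "B")]
P2 = [("w", "B", 1), ("r", "A")]

def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program's own order."""
    n = len(p1) + len(p2)
    for slots in combinations(range(n), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        # Positions in `slots` take the next P1 operation, the rest take P2's.
        yield [(1, next(it1)) if i in slots else (2, next(it2)) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"A": 0, "B": 0}
    reads = {}
    for proc, op in schedule:
        if op[0] == "w":
            mem[op[1]] = op[2]
        else:
            reads[proc] = mem[op[1]]
    return reads[1], reads[2]  # (value P1 read from B, value P2 read from A)

outcomes = {run(s) for s in interleavings(P1, P2)}
both_enter = (0, 0) in outcomes  # would mean both processors enter the critical section
```

Under these assumptions `both_enter` is `False`: no program-order-preserving interleaving lets both processors read 0. A weaker model that lets a read complete before the preceding write is visible would add the (0, 0) outcome, which is why the programmer needs to know which consistency model the hardware guarantees.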
Snooping-Based Cache Coherence • Caches share a bus; every cache sees each transaction in the same cycle; every cache manages itself • When one cache writes to a block, every other cache invalidates its copy of that block • When a cache has a read miss, the block is provided by memory or the last writer • Protocols are defined by states: MSI, MESI, MOESI • [Figure: four processors, each with its own caches, connected to memory over the shared bus]
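The write-invalidate and read-miss rules above can be sketched as a toy MSI model. This is a minimal sketch under simplifying assumptions: a block holds a single value, there are no evictions, and the bus is modeled as a loop over the other caches. The class and method names are hypothetical.

```python
class SnoopingBus:
    """Toy MSI protocol: every cache observes every bus transaction."""

    def __init__(self, n_caches):
        self.memory = {}                                 # block -> value
        self.state = [dict() for _ in range(n_caches)]   # block -> "M" or "S"; absent means "I"
        self.data = [dict() for _ in range(n_caches)]    # block -> cached value

    def read(self, cid, block):
        if self.state[cid].get(block, "I") != "I":
            return self.data[cid][block]                 # hit in M or S
        # Read miss: the last writer (state M) supplies the block, else memory does.
        value = self.memory.get(block, 0)
        for other in range(len(self.state)):
            if other != cid and self.state[other].get(block) == "M":
                value = self.data[other][block]
                self.memory[block] = value               # writer's copy is written back
                self.state[other][block] = "S"           # and downgraded to shared
        self.state[cid][block] = "S"
        self.data[cid][block] = value
        return value

    def write(self, cid, block, value):
        # Every other cache snoops the write and invalidates its copy.
        # The stale value is left in self.data, as in a real data array.
        for other in range(len(self.state)):
            if other != cid:
                self.state[other].pop(block, None)
        self.state[cid][block] = "M"
        self.data[cid][block] = value
```

For example, after `bus = SnoopingBus(3)`, `bus.write(0, "X", 5)` and `bus.read(1, "X")`, caches 0 and 1 both hold X in state S; a later `bus.write(2, "X", 7)` invalidates both copies.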
Directory-Based Cache Coherence • A directory keeps track of the sharing status of each block • Every request goes to the directory and the directory then sends directives to each cache – the directory is the point of serialization (just as the bus is, in a snooping protocol) • For example, on a write, the request reaches the directory, the directory sends invalidates to other sharers, and permissions are granted to the writer • [Figure: four processors, each with its own caches, connected through a network to memory and the directory]
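The write example above can be sketched as a directory that returns the directives it would send. This is an illustrative model only: the message names (`invalidate`, `downgrade`, `grant_read`, `grant_write`) and the `Directory` class are assumptions, and acknowledgements, data transfer, and network delays are left out.

```python
class Directory:
    """Toy directory: tracks sharers per block and serializes requests."""

    def __init__(self):
        self.sharers = {}  # block -> set of cache ids holding a copy
        self.owner = {}    # block -> cache id holding write permission, if any

    def read(self, cid, block):
        """Return the directives sent in response to a read request."""
        msgs = []
        owner = self.owner.get(block)
        if owner is not None and owner != cid:
            msgs.append(("downgrade", owner))   # the writer keeps a shared copy
            del self.owner[block]
        self.sharers.setdefault(block, set()).add(cid)
        msgs.append(("grant_read", cid))
        return msgs

    def write(self, cid, block):
        """Return the directives sent in response to a write request."""
        others = sorted(self.sharers.get(block, set()) - {cid})
        msgs = [("invalidate", c) for c in others]  # invalidate every other sharer
        msgs.append(("grant_write", cid))           # then grant permission to the writer
        self.sharers[block] = {cid}
        self.owner[block] = cid
        return msgs
```

Because every request passes through `read` or `write` on the one directory object, the order in which those calls happen is the serialization order, which mirrors the role the bus plays in a snooping protocol.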
TLDS • A certain ordering of reads and writes is assumed – if that ordering is violated, the thread is re-executed • The coherence protocol is used to propagate writes • [Figure: four threads, each with its own caches, sharing memory]
The Traditional Model • No thread is speculative – a parallel application has synchronization points and parallel regions and is guaranteed to execute correctly with no need for re-execution • Threads wait at synchronization points and wait for the correct permissions for every block of data • [Figure: four threads, each with its own caches, sharing memory]
Coherence Decoupling • A simple coherence protocol is often a slow protocol – for example, a simple protocol may not allow multiple outstanding requests • Coherence decoupling: maintain a fast but incorrect protocol, backed by a slow but correct protocol; this incurs fewer stalls in the common case, at the cost of occasional recoveries
Coherence Decoupling • A coherence operation is broken into two components: (i) acquiring and using the value, (ii) receiving the correct set of permissions
SCL Protocol • Why does speculative cache look-up (SCL) work? • False sharing: a line was invalidated, but the write was to a different word from the one being read • Silent stores or value locality: the write left the value unchanged, or changed it to a predictable value • If there is spare bandwidth, updated values can be pushed out to sharers
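The first two reasons can be illustrated by comparing an invalidated line with the up-to-date line word by word. The helper name and the four-word line contents below are made up for illustration; they are not data from the paper.

```python
def stale_words_still_correct(stale_line, fresh_line):
    """Return the word offsets where an invalidated copy still matches the up-to-date line."""
    return [i for i, (old, new) in enumerate(zip(stale_line, fresh_line)) if old == new]

stale = [10, 20, 30, 40]          # local copy, marked invalid after a remote write

false_sharing = [10, 20, 99, 40]  # the remote writer changed only word 2
silent_store = [10, 20, 30, 40]   # the remote writer stored the value already there

ok_false_sharing = stale_words_still_correct(stale, false_sharing)  # words 0, 1 and 3
ok_silent_store = stale_words_still_correct(stale, silent_store)    # every word
```

A speculative read of any offset in the returned list would later be confirmed by the coherent response; only a read of word 2 in the false-sharing case would need recovery.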
Implementation • The Miss Status Holding Register (MSHR) keeps track of outstanding requests – it can buffer the speculative value and ensure it matches the correct value – on a mis-speculation, that instruction is treated like a branch mis-predict • Speculation on a coherence operation is no different from traditional forms of speculation
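The buffering and checking role of the MSHR can be sketched as follows. This is a minimal model assuming one outstanding speculative load per block; the `MSHR` class, its method names, and the instruction ids are hypothetical, and a real design would track multiple dependents per entry.

```python
class MSHR:
    """Toy miss status holding register with a buffered speculative value."""

    def __init__(self):
        self.entries = {}  # block -> (speculative value, id of the consuming instruction)

    def speculate(self, block, stale_value, instr_id):
        """Record the outstanding miss and hand the stale value to the instruction."""
        self.entries[block] = (stale_value, instr_id)
        return stale_value

    def resolve(self, block, correct_value):
        """Called when the coherent response arrives.

        Returns None if the speculation was correct, otherwise the id of the
        instruction to squash and replay, as for a branch mis-prediction.
        """
        spec_value, instr_id = self.entries.pop(block)
        return None if spec_value == correct_value else instr_id
```

For example, `mshr.speculate("X", 30, instr_id=7)` followed by `mshr.resolve("X", 30)` returns `None` (speculation confirmed), while a response carrying a different value returns `7`, the instruction to recover from.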
Summary • Arguments for coherence decoupling: • Reduces protocol complexity • Reduces programming complexity • Marginal hardware overhead • Coherence misses will emerge as greater bottlenecks? • What is the expected trend for CMPs?