Coherence Decoupling: Making Use of Incoherence J. Huh, J. Chang, D. Burger, G. Sohi ASPLOS 2004
Motivation • Multi-threading and multi-processing have become common • When a cache line is marked invalid, very often not all of the data in the line is incorrect • If the data in invalid lines can be used speculatively, there is great potential for performance improvement
Background: Cache Coherence Protocol • Used in shared-memory multiprocessors for managing correct data sharing • Vital to the design of multiprocessors since it contributes the most to inter-processor communication latency
Proposed Idea • Separate the traditional cache coherence protocol into two parts • Speculative cache lookup (SCL) – uses a speculative value from an invalid cache line thus allowing the processor to work continuously • Safe coherence protocol – obtains the correct value which is then compared with the value provided by SCL
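The split described on this slide can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware mechanism from the paper: the `CacheLine` class, the `decoupled_load` function, and the `fetch_coherent` and `compute` callbacks are hypothetical names introduced here. The sketch assumes the stale data of an invalid line is still readable and that a mis-speculation is handled by simply redoing the dependent work.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    valid: bool
    data: int  # last value seen locally; may be stale when valid is False


def decoupled_load(line, fetch_coherent, compute):
    """Model one load under coherence decoupling (hypothetical helper).

    Returns (result, misspeculated).
    """
    if line.valid:
        # Ordinary hit: nothing to speculate on.
        return compute(line.data), False

    # Speculative cache lookup (SCL): keep working with the stale value.
    speculative = line.data
    result = compute(speculative)

    # Safe coherence protocol: obtain the correct value and refill the line.
    correct = fetch_coherent()
    line.data, line.valid = correct, True

    # Compare the two values; on a mismatch, redo the dependent work.
    if correct == speculative:
        return result, False
    return compute(correct), True
```

For example, a load from an invalid line holding 7 succeeds speculatively when the coherent value is still 7, and is recomputed when the coherent value turns out to be 9.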
Related Work • Customized coherence protocols • Speculative coherence operations: dynamic self-invalidation, coherence message predictor, token coherence, etc. • Speculation on the outcome of events in multi-processor execution
Coherence Decoupling Architecture Must support the following: • Split – means to split a memory operation into a speculative load and a coherence operation • Compute – mechanisms to support execution with speculative values • Recover – means to recover and roll back upon misprediction
SCL Protocols for Coherence Decoupling • Use a simple safe coherence protocol and rely on an aggressive SCL protocol to increase performance • Two components of an SCL protocol • Read component – obtains the speculative value • Update component – updates an invalid cache line so subsequent speculative reads can use it (can be left out in some SCL protocols)
Read vs Update components • An SCL protocol with only a read component can be used if the word in an invalid block has: • Not changed remotely (false sharing) • Changed remotely to the same value (silent stores) • Changed remotely to a different value and then back to the original value (temporally silent stores) • For truly shared data an update component needs to be added • Speculatively sends data around the system by writing it into invalid cache lines
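The cases on this slide can be told apart with a small classifier. The sketch below is a hypothetical illustration, not code from the paper: `classify_remote_change` takes the word value held in the local invalid line and the ordered list of values written remotely to that same word since the invalidation, and `read_only_scl_succeeds` reports whether a read-only SCL protocol would return the correct value.

```python
def classify_remote_change(old_value, remote_writes):
    """Classify what happened to one word of an invalidated block.

    old_value     -- value of the word in the local invalid line
    remote_writes -- values written remotely to that word, in order
                     (empty if only other words of the block changed)
    """
    if not remote_writes:
        return "false sharing"
    if all(v == old_value for v in remote_writes):
        return "silent store"
    if remote_writes[-1] == old_value:
        return "temporally silent store"
    return "true sharing"


def read_only_scl_succeeds(old_value, remote_writes):
    # The stale word is still correct in every case except true sharing.
    return classify_remote_change(old_value, remote_writes) != "true sharing"
```

Only the "true sharing" outcome needs an update component, because only then does the stale word differ from the coherent one.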
SCL protocol Read component • CD - Use the locally cached incoherent value for every L2 miss. Simple, but since it is triggered on every load operation it could produce many mis-speculations • CD-F - Add a PC-indexed confidence predictor to filter speculations. Reduces the number of (mis)speculative reads, thus improving the average accuracy
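A PC-indexed confidence filter of the kind CD-F describes could look roughly like the following. The slides do not give the predictor's organization, so everything concrete here is an assumption: the table size of 256 entries, the 2-bit saturating counters, the threshold of 2, the initial counter value, and the class name `ConfidencePredictor` are all illustrative choices.

```python
class ConfidencePredictor:
    """Table of saturating counters indexed by load PC (illustrative sizes)."""

    def __init__(self, entries=256, max_count=3, threshold=2, initial=2):
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        self.table = [initial] * entries

    def _index(self, pc):
        # Simple modulo indexing; real hardware would pick specific PC bits.
        return pc % self.entries

    def should_speculate(self, pc):
        # Speculate only when this load has been accurate recently.
        return self.table[self._index(pc)] >= self.threshold

    def train(self, pc, speculation_correct):
        # Strengthen on a correct speculation, weaken on a wrong one.
        i = self._index(pc)
        if speculation_correct:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

In this sketch the counter can still be trained when a speculation was filtered out, by comparing the stale value with the coherent value once it arrives, so a load that becomes predictable again can regain confidence.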
SCL protocol Update component • CD-IA - Use invalidation piggyback to update all invalid blocks • CD-C - Use invalidation piggyback if the value is compressed
SCL protocol Update component (Ctd.) • CD-N - Update all sharers after N writes to a block. Increases the number of messages (bandwidth) • CD-W - Update on every write if any sharers exist • The CD read component is assumed wherever a write-update component is used
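The trigger condition for CD-N and CD-W can be modeled as a per-block write counter. This is a rough sketch under stated assumptions rather than the paper's design: the `UpdatePolicy` class and its `on_write` method are hypothetical, "sharers" is taken to mean the nodes that hold an invalid copy of the block, and CD-W is modeled as the special case N = 1.

```python
class UpdatePolicy:
    """Decide when a writer pushes a speculative update to sharers."""

    def __init__(self, n):
        # n = number of writes between updates; n == 1 behaves like CD-W.
        self.n = n
        self.writes = {}  # block address -> writes since the last update

    def on_write(self, block, sharers):
        """Return the set of nodes to send a speculative update to."""
        if not sharers:
            # Nobody holds a copy of the block, so there is nothing to update.
            return set()
        count = self.writes.get(block, 0) + 1
        if count >= self.n:
            self.writes[block] = 0
            return set(sharers)
        self.writes[block] = count
        return set()
```

With `UpdatePolicy(3)`, the first two writes to a shared block send nothing and the third sends one update to every sharer, which illustrates how a larger N trades fresher speculative values for fewer messages.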
Methodology • Simulator: MP-Sauce and SimpleScalar • 16-node SMP systems simulated • Coherence protocol used: simple invalidation-based snooping-bus protocol • 3 commercial applications and 5 scientific shared-memory benchmarks from the SPLASH-2 suite simulated
Results - Microbenchmarks • Simple-fs – loads falsely shared data and then executes (in)dependent instructions • Critical-fs – forces a data dependence between two loads by placing consecutive false-sharing misses on the critical path
Coherence Decoupling Accuracy Results (figure comparing CD, CD-F, CD-IA, CD-C, CD-N, and CD-W)
Latency Tolerance Profiles • Executed instructions during coherence decoupling • The number of control dependent instructions will grow in future processors
Conclusions • Coherence misses are a significant fraction of L2 misses, ranging from 10% to 80% • Coherence decoupling has the potential to hide the miss latency for 40% to 90% of coherence misses • Mis-speculation occurs 20% of the time