The Locality-Aware Adaptive Cache Coherence Protocol
George Kurian (1), Omer Khan (2), Srini Devadas (1)
(1) Massachusetts Institute of Technology, (2) University of Connecticut, Storrs
Cache Hierarchy Organization: Directory-Based Coherence
• Private caches: 1 or 2 levels
• Shared cache: last-level
• Concurrent reads lead to replication in private caches
• Directory maintains coherence for replicated lines
[Figure: a write miss in a private cache. Numbered steps 1 to 4 connect the writing core ("Write word"), the shared cache + directory, and two sharers.]
Private Caching: Advantages & Drawbacks
Advantages
• Exploits spatio-temporal locality
• Efficient low-latency local access to private + shared data (cache line replication)
Drawbacks
• Inefficiently handles data with LOW spatio-temporal locality
• Working set > private cache size
  • Inefficient cache utilization (cache thrashing)
  • Unnecessary fetch of entire cache line
  • Shared data replication increases working set
• Shared data with frequent writes
  • Wasteful invalidations, synchronous writebacks, cache line ping-ponging
Result: increased on-chip communication and time spent waiting for expensive events
On-Chip Communication Problem
• Wires relative to gates are getting worse every generation
• Bit movement is much more expensive than computation
• Sources: Bill Dally (Stanford); Shekhar Borkar (Intel)
Must architect efficient coherence protocols
Locality of Benchmarks: Evaluating Reuse before Evictions
• Utilization: number of private L1 cache accesses before a cache line is evicted
• 40% of evicted lines have a utilization < 4
[Figure: utilization of evicted lines; chart annotations: 80%, 20%]
Locality of Benchmarks: Evaluating Reuse before Invalidations
• Utilization: number of private L1 cache accesses before a cache line is invalidated (intervening write)
[Figure: utilization of invalidated lines; chart annotations: 80%, 10%]
Remote-Word Access (RA)
• Assign each memory address to a unique “home” core
• The cache line is present only in the shared cache at the “home” core (single location)
• To access a word that is not cached locally, request the “remote” shared cache on the “home” core to perform the read/write access
• NUCA-based protocol [Fensch et al HPCA’08] [Hoffmann et al HiPEAC’10]
[Figure: a core sends "Write word" to the home core; steps 1 and 2]
Remote-Word Access: Advantages & Drawbacks
Advantages
• Energy efficient for low-locality data: a word access (~200 bits) is cheaper than a cache line fetch (~640 bits)
• NO data replication: efficient private cache utilization
• NO invalidations / synchronous writebacks
Drawbacks
• Round-trip network request for every remote-WORD access
• Expensive for high-locality data
• Data placement dictates distance & frequency of remote accesses
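As a rough illustration of the traffic argument above, the sketch below compares the bits moved by a number of remote-word accesses against a single cache line fetch. It uses only the approximate figures quoted on the slide (~200 bits and ~640 bits), ignores invalidation and writeback traffic, and is not the authors' energy model. The function and constant names are hypothetical.

```python
# Approximate figures taken from the slide; everything else is an assumption.
WORD_ACCESS_BITS = 200   # ~bits moved per remote-word access
LINE_FETCH_BITS = 640    # ~bits moved per cache line fetch


def remote_traffic_bits(num_accesses: int) -> int:
    """Bits moved if every access goes to the remote shared cache."""
    return num_accesses * WORD_ACCESS_BITS


def remote_is_cheaper(num_accesses: int) -> bool:
    """True if remote-word access moves fewer bits than one line fetch."""
    return remote_traffic_bits(num_accesses) < LINE_FETCH_BITS


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, remote_traffic_bits(n), remote_is_cheaper(n))
```

Under these simplified numbers, remote access moves fewer bits only while a line is touched a handful of times, which is the intuition behind treating low-locality and high-locality lines differently.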
Locality-Aware Cache Coherence
• Combine advantages of private caching and remote access
• Privately cache high-locality lines
  • Optimize hit latency and energy
• Remotely cache low-locality lines
  • Prevent data replication & costly data movement
• Private Caching Threshold (PCT)
  • Utilization >= PCT: mark as private
  • Utilization < PCT: mark as remote
Locality-Aware Cache Coherence
• Private Caching Threshold (PCT) = 4
[Figure: Invalidations vs Utilization, with lines divided into Private and Remote regions]
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Baseline System
• Compute pipeline
• Private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache with integrated directory
• L2 cache managed by Reactive-NUCA [Hardavellas et al ISCA’09]
• ACKwise limited-directory protocol [Kurian et al PACT’10]
[Figure: multicore layout with some tiles marked "M"; each core contains a compute pipeline, L1 D-cache, L1 I-cache, L2 shared cache, directory, and router]
Locality-Aware Coherence: Important Features
• Intelligent allocation of cache lines
  • In the private L1 cache
  • Allocation decision made per core at cache line level
• Efficient locality tracking hardware
  • Decoupled from traditional coherence tracking structures
• Protocol complexity low
  • NO additional networks for deadlock avoidance
Implementation Details: Private Cache Line Tag
[Tag layout: State | LRU | Tag | Private Utilization]
• Private Utilization bits track cache line usage in the L1 cache
• Communicated back to the directory on eviction or invalidation
• Storage overhead is only 0.4%
Implementation Details: Directory Entry
[Entry layout: State | ACKwise Pointers 1 … p | Tag | P/R_1 … P/R_n | Remote Utilization_1 … Remote Utilization_n]
• P/R_i: Private/Remote mode of Core i
• Remote Utilization_i: line usage by Core i at the shared L2 cache
• Complete Locality Classifier: tracks mode and remote utilization for all cores
• Storage overhead reduced later
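The directory entry layout above can be pictured as a small record with one mode bit and one remote-utilization counter per core. The sketch below is only a software illustration of that layout, not the hardware encoding: field widths are not modeled, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Mode(Enum):
    PRIVATE = "P"
    REMOTE = "R"


@dataclass
class CoreLocality:
    """Per-core locality state kept in the directory entry."""
    mode: Mode = Mode.PRIVATE      # P/R_i; cores start out in private mode
    remote_utilization: int = 0    # accesses by core i served at the shared L2


@dataclass
class DirectoryEntry:
    """One directory entry with a complete locality classifier."""
    tag: int
    state: str = "Uncached"
    ackwise_pointers: List[int] = field(default_factory=list)   # up to p sharer IDs
    locality: List[CoreLocality] = field(default_factory=list)  # one slot per core


def make_entry(tag: int, num_cores: int) -> DirectoryEntry:
    """Build an entry that tracks mode and remote utilization for every core."""
    return DirectoryEntry(
        tag=tag,
        locality=[CoreLocality() for _ in range(num_cores)],
    )


if __name__ == "__main__":
    entry = make_entry(tag=0x40, num_cores=4)
    print(entry.state, [c.mode.value for c in entry.locality])
```

Keeping one slot per core is what makes this the "complete" classifier; its size grows with the core count, which is why the slide notes that the storage overhead is reduced later.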
Mode Transitions Summary
• Classification based on previous behavior
• Initial → Private
• Private → Private: Private Utilization >= PCT
• Private → Remote: Private Utilization < PCT
• Remote → Remote: Remote Utilization < PCT
• Remote → Private: Remote Utilization >= PCT
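The transitions above reduce to one comparison against PCT, where the counter consulted depends on the current mode. The following is a minimal sketch of that rule as a pure function; the names `Mode`, `INITIAL_MODE`, and `next_mode` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Mode(Enum):
    PRIVATE = "private"
    REMOTE = "remote"


INITIAL_MODE = Mode.PRIVATE  # every core starts out in private mode


def next_mode(mode: Mode, private_utilization: int,
              remote_utilization: int, pct: int) -> Mode:
    """Return the mode a core moves to for a given cache line.

    In private mode the private utilization (reported on eviction or
    invalidation) is compared with PCT; in remote mode the remote
    utilization kept at the directory is compared instead.
    """
    observed = private_utilization if mode is Mode.PRIVATE else remote_utilization
    return Mode.PRIVATE if observed >= pct else Mode.REMOTE


if __name__ == "__main__":
    print(next_mode(INITIAL_MODE, private_utilization=1, remote_utilization=0, pct=2))
```

With PCT = 2, a private-mode core that used a line once before losing it is demoted to remote mode, and a remote-mode core is promoted back once it has made two remote accesses.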
Walk Through Example
Setup
• Private Caching Threshold PCT = 2
• All cores start out in private mode
• Cores A, B, and C each have a pipeline + L1 cache; Core D holds the L2 cache + directory; all are connected by the network
• Initial directory entry for line X: Core-A Private U, Core-B Private U, Core-C Private U; state Uncached
• Notation: an L1 copy is shown as state plus private utilization (e.g. Shared 2). In the directory, a private-mode core is marked U before it holds the line and C while it holds a copy; a remote-mode core is shown with its remote utilization (e.g. Remote 1)
Steps
1. Core A issues Read[X].
2. The directory replies with Cache Line [X]. Directory: Core-A Private C, Core-B Private U, Core-C Private U; state Shared, L2 copy Clean.
3. Core A caches the line: Shared 1.
4. Core C issues Read[X].
5. The directory replies with Cache Line [X]. Directory: Core-A Private C, Core-B Private U, Core-C Private C.
6. Core C caches the line: Shared 1.
7. Core C issues Read[X] again and hits in its L1: Shared 2.
8. Core B issues Write[X].
9. The directory sends Inv [X] to the sharers.
10. Core A invalidates its copy (Invalid 0) and sends Inv-Reply [X] (1), reporting a private utilization of 1.
11. The directory receives Core A's reply. Since 1 < PCT, Core A is demoted. Directory: Core-A Remote 0.
12. Core C invalidates its copy (Invalid 0) and sends Inv-Reply [X] (2).
13. The directory receives Core C's reply. Since 2 >= PCT, Core C stays in private mode. Directory: Core-A Remote 0, Core-B Private U, Core-C Private U; state Uncached.
14. The directory sends Cache Line [X] to Core B. Directory: Core-B Private C; state Modified.
15. Core B caches the line: Modified 1.
16. Core A issues Read[X].
17. The directory sends WB [X] to Core B.
18. Core B downgrades its copy to Shared 1 and sends WB-Reply [X].
19. The directory receives the WB-Reply. State Shared, L2 copy Dirty.
20. Core A is in remote mode, so the directory replies with Word [X] instead of the cache line. Directory: Core-A Remote 1.
21. Core B issues Write [X].
22. The directory sends Upgrade-Reply [X]. State Modified; Core A's remote utilization is reset. Directory: Core-A Remote 0.
23. Core B's copy becomes Modified 2.
24. Core A issues Read [X]. Core B's copy is downgraded to Shared 2, the directory state returns to Shared with the L2 copy Dirty, and the directory replies with Word [X]. Directory: Core-A Remote 1.
25. Core A issues Read [X] again. Directory: Core-A Remote 2.
26. Since 2 >= PCT, Core A is promoted to private mode and the directory sends Cache Line [X] (2). Directory: Core-A Private C, Core-B Private C, Core-C Private U.
27. Core A caches the line: Shared 2.
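The walk-through can be replayed with a small model that follows a single core's mode and counters for one cache line. The sketch below is a simplification under stated assumptions: it collapses the L1 tag and the directory slot into one hypothetical `CoreTracker` object, models only reads by the tracked core and writes by other cores, and clears the remote counter after promotion (the slides do not show that detail). It is not the authors' implementation.

```python
class CoreTracker:
    """Mode and utilization counters of one core for one cache line.

    Hypothetical helper: merges the L1 'Private Utilization' bits and the
    directory's per-core mode / 'Remote Utilization' into a single object.
    """

    def __init__(self, pct: int):
        self.pct = pct
        self.mode = "Private"   # all cores start out in private mode
        self.cached = False     # True while the L1 holds a copy
        self.private_util = 0
        self.remote_util = 0

    def read(self) -> None:
        """The tracked core reads the line."""
        if self.mode == "Private":
            self.cached = True          # line fill on a miss, L1 hit otherwise
            self.private_util += 1
        else:
            self.remote_util += 1       # word served by the shared L2
            if self.remote_util >= self.pct:
                # Promotion: the line is handed to the L1 and the private
                # utilization starts from the remote count, as in "Shared 2".
                self.mode = "Private"
                self.cached = True
                self.private_util = self.remote_util
                self.remote_util = 0    # assumption, not shown on the slides

    def other_core_write(self) -> None:
        """Another core writes the line."""
        if self.mode == "Remote":
            self.remote_util = 0        # Remote 1 -> Remote 0 in the walk-through
        elif self.cached:
            # Invalidation: the private utilization decides the next mode.
            if self.private_util < self.pct:
                self.mode = "Remote"
            self.cached = False
            self.private_util = 0


if __name__ == "__main__":
    core_a = CoreTracker(pct=2)
    for event in ["read", "write", "read", "write", "read", "read"]:
        core_a.read() if event == "read" else core_a.other_core_write()
        print(event, core_a.mode, core_a.private_util, core_a.remote_util)
```

Replaying Core A's events (read, write by B, read, write by B, read, read) takes it from private mode to remote mode and back to private mode, while Core C, which read the line twice before the first write, never leaves private mode.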
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion