Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture
Outline • Introduction • Reactive-Associative Caches • Non-Uniform Cache Architectures • Conclusion / References • Questions
Cache Problem Domains • Average memory access time = Hit Time + Miss Rate × Miss Penalty • Hit Time • Time to search the cache and return data • Miss Rate • Fraction of accesses for which the needed data is not in the cache and must be fetched from main memory • Cache Latency • Physical delay to move data from the cache to the registers
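A quick worked example of the formula above, with hypothetical numbers chosen only for illustration:

```c
#include <stdio.h>

/* Average memory access time (AMAT) = hit time + miss rate * miss penalty.
   All numbers below are hypothetical, for illustration only. */
int main(void)
{
    double hit_time     = 1.0;    /* cycles to search the cache and return data */
    double miss_rate    = 0.05;   /* 5% of accesses must go to main memory      */
    double miss_penalty = 100.0;  /* cycles to fetch a block from main memory   */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);  /* 1.0 + 0.05 * 100 = 6.0 */
    return 0;
}
```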
Hit Time / Miss Rate • Searching for cache hits • Set-associative caches increase hit times greatly • Multiple ways must be checked for a hit, and then the data in the matching way must be accessed • Miss Rate • Direct-mapped caches have high miss rates • Very small changes in miss rate can affect performance greatly
Latency / Mapping • Latency • Cache latency is a primary reason for complex, multi-level cache architectures • Very difficult to improve due to physical limitations • Mapping • How data is mapped into the cache (associativity, physical location) • Better mapping heuristics can reduce the average search time and latency
Effects of Cache Changes • Power • More complex cache architectures use more power to complete tasks • Time • The more complex or the larger a cache, the slower it will be • Real Estate • Complexity is directly proportional to the number and length of wire traces • Hits / Misses • Each change to a cache impacts the hit time and miss rate in some way
Reactive-Associative Caches Joseph Schmigel
Reactive-Associative Caches • Attempts to combine direct-mapped and set-associative caches • Goal is to decrease the miss rate while keeping hit times similar to direct-mapped • Avoids the disadvantages of each • Direct-mapped has a high miss rate • Set-associative has a high hit time • Several major parts: • Data array • Tag array • Probes • Way prediction • Feedback
Data Array & Tag Array • The data array is the actual cache that stores the data • The data array has two address mappings: one direct-mapped and one set-associative (usually 2, 4, or 8 ways) • The tag array has n tag banks, where n is the number of ways • The tag array stores the tags for each set-associative index • All tag banks are searched in parallel
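A minimal C sketch of this organization, assuming a hypothetical geometry (1024 blocks, 64 B lines, 4 ways); the names and sizes are illustrative, not taken from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024                 /* hypothetical data-array size  */
#define NUM_WAYS   4                    /* hypothetical associativity    */
#define NUM_SETS   (NUM_BLOCKS / NUM_WAYS)
#define BLOCK_SIZE 64                   /* hypothetical 64 B cache lines */

/* One tag bank per way; the n banks are searched in parallel in hardware. */
typedef struct {
    uint32_t tag[NUM_SETS];
    bool     valid[NUM_SETS];
} TagBank;

typedef struct {
    uint8_t data[NUM_BLOCKS][BLOCK_SIZE];  /* the single data array      */
    TagBank bank[NUM_WAYS];                /* n tag banks for the n ways */
} RACache;

/* Direct-mapped view: the address selects one unique block. */
static inline uint32_t dm_index(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}

/* Set-associative view: the address selects a set; the way comes from
   way prediction or from a tag-array lookup. */
static inline uint32_t sa_index(uint32_t addr, uint32_t way)
{
    uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
    return way * NUM_SETS + set;
}
```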
Probes • Two probes (Probe0 & Probe1) are used to signal a hit • Probe0 performs three steps in parallel: • Looks for a direct-mapped hit • Uses way prediction to find a hit • Finds a hit in the tag array • Probe0 tries to keep the hit time equal to a direct-mapped hit time; it only fails if it has to use the tag array • Probe1 is used only if Probe0 does not find a direct-mapped or way-predicted hit. It then returns a hit if there is a match in the set-associative cache
Probes continued • This means that the following possibilities exist (see the sketch below): • Probe0 hits on the direct-mapped location and Probe1 is ignored • Probe0 hits on the way prediction and Probe1 is ignored • Probe0 hits using the tag array and Probe1 hits using the way found from the tag array • Probe0 misses and Probe1 is ignored
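Continuing the sketch above, the two-probe flow might look like this; dm_hit(), predicted_way(), tag_match(), and tag_lookup() are assumed helpers, not the paper's interface:

```c
/* Assumed helpers (hypothetical): */
bool dm_hit(RACache *c, uint32_t addr);          /* direct-mapped match?    */
int  predicted_way(uint32_t addr);               /* -1 if no prediction     */
bool tag_match(RACache *c, uint32_t addr, int way);
int  tag_lookup(RACache *c, uint32_t addr);      /* search banks, -1 = miss */

typedef enum { HIT_DM, HIT_PREDICTED, HIT_PROBE1, MISS } ProbeResult;

ProbeResult ra_lookup(RACache *c, uint32_t addr)
{
    /* Probe0: these three checks happen in parallel in hardware. */
    if (dm_hit(c, addr))
        return HIT_DM;                  /* Probe1 is ignored */

    int way = predicted_way(addr);
    if (way >= 0 && tag_match(c, addr, way))
        return HIT_PREDICTED;           /* Probe1 is ignored */

    way = tag_lookup(c, addr);          /* Probe0 falls back to tag array */
    if (way < 0)
        return MISS;                    /* Probe1 is ignored */

    /* Probe1 accesses the data array with the way the tag lookup found,
       costing extra latency relative to a direct-mapped hit. */
    return HIT_PROBE1;
}
```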
Way Prediction • Allows the block to be accessed without performing a tag lookup to obtain the way • Keeps hit times comparable to direct-mapped • Must be performed early enough that the data can be ready in time for the pipeline stage that needs it • The prediction can only use information that is available in the pipeline at that point • Two types of way prediction were used: XOR and program counter
XOR Way Prediction • Approximates the data address by XOR'ing the base register value with the instruction's offset • Works because the small memory offsets that are common can be XOR'ed (instead of added) to get a reliable approximate block address to use as a prediction • Cannot be done until late in the pipeline, because the register must be loaded before performing the calculation • More accurate than program-counter way prediction
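A minimal sketch of the idea, continuing the earlier example: XOR is a cheap, carry-free stand-in for the base + offset addition, good enough to index a prediction table. The table size and indexing scheme here are hypothetical:

```c
#define PRED_ENTRIES 256                      /* hypothetical table size */

static uint8_t way_pred_table[PRED_ENTRIES];  /* last way used per entry */

/* Approximate the effective address as base ^ offset instead of
   base + offset, then use it to pick a predicted way. */
static inline int xor_predict_way(uint32_t base_reg, int32_t offset)
{
    uint32_t approx = base_reg ^ (uint32_t)offset;
    return way_pred_table[(approx / BLOCK_SIZE) % PRED_ENTRIES];
}
```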
Program Counter Way Prediction • Associates parts of the cache with the program counter • Not as accurate as XOR, since the same program counter does not always access the same memory location • The program counter is available early in the pipeline, so it is easier to make the prediction in time
Feedback • 3 types of feedback: • Reactive displacement • Eviction of unpredictable blocks • Eviction of hard-to-predict blocks • Feedback tries to maximize bandwidth and minimize hit latency • Highly predictable blocks are kept in the set-associative positions • Blocks that cannot be predicted reliably are kept in the direct-mapped positions
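One plausible reading of this mechanism, sketched with a per-block saturating counter (continuing the earlier sketch); the counter width and threshold are invented for illustration and are not from the paper:

```c
#define MISPRED_MAX 3      /* hypothetical saturation threshold */

static uint8_t mispredicts[NUM_BLOCKS];   /* per-block counters */

/* Track way mispredictions; once a block proves hard to predict,
   report that it should be displaced back to (and kept in) its
   direct-mapped position. */
bool keep_direct_mapped(uint32_t block, bool mispredicted)
{
    if (mispredicted && mispredicts[block] < MISPRED_MAX)
        mispredicts[block]++;
    else if (!mispredicted && mispredicts[block] > 0)
        mispredicts[block]--;
    return mispredicts[block] == MISPRED_MAX;   /* unpredictable block */
}
```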
Non-Uniform Cache Architectures James Brock
Cache Organization • Multiple-Level Cache • Hierarchical organization designed for faster accesses to the layers of cache closer to the core • Replacement policies are static • i.e., a replacement causes one insertion and one eviction at the same location in the cache • Uniform Cache • The cache is physically laid out in uniformly distributed banks and sub-banks
Problem Domain • CPUs are becoming wire-delay dominated • As core speeds increase, the latency of transmission delays has a greater effect on overall performance • 2 possible paths: • Reduce the latency of wire traces (physical limitations) • Embrace latency in the design, and optimize around it
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA) • All designs were modeled for L2 cache, but can be scaled to work at any level • A uniform cache is only as fast as its slowest bank • A non-uniform cache exposes each (sub)bank's actual latency, exploiting the faster banks nearer the decoder for better performance • S-NUCA • Static means that data in main memory is mapped to 1 … n fixed locations in the cache, where n = associativity
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA) • [Figure: S-NUCA1 and S-NUCA2 bank layouts]
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA) • S-NUCA 1 • Individual data and address channels for each bank • Multiple banks can be accessed in parallel • HUGE real estate cost to add channels for each bank • S-NUCA 2 • Mesh grid of data and address channels • Switches at each intersection access multiple sub-banks in parallel and arbitrate data flow
Solution 2: Dynamic NUCA (D-NUCA) • Dynamic refers to the ranking and movement of cache lines within the banks and sub-banks • The replacement policy is not a simple insert & evict • Insertion, demotion, and eviction are based on the replacement heuristic (e.g., least recently used) • With D-NUCA, the mapping, searching, and line-movement problems expand
D-NUCA Mapping & Searching • Uses spread sets of banks • # of banks in a set = associativity of the cache • Simple Mapping (see the sketch below) • Search by set, then bank, then tags within the set • Some bank sets are farther away than others, and the number of bank rows may not match the desired number of ways • Fair Mapping • Fixes the problems of simple mapping, but is more complex • Equal average access times to all bank sets
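A sketch of simple mapping under a hypothetical geometry (32 bank sets of 8 banks each); the constants are illustrative:

```c
#include <stdint.h>

#define BANK_SETS     32   /* hypothetical: columns of banks    */
#define BANKS_PER_SET 8    /* = associativity; one way per bank */
#define LINE_SIZE     64   /* hypothetical 64 B lines           */

/* Simple mapping: low address bits pick the bank set (a column);
   the line may then live in any bank of that set. */
static inline int bank_set_of(uint32_t addr)
{
    return (int)((addr / LINE_SIZE) % BANK_SETS);
}
```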
D-NUCA Mapping & Searching • Shared Mapping • The closest banks are shared with the farthest sets • If n sets share a bank, then all banks in the cache are n-way associative • The slightly higher bank associativity is offset by the lower average access latency • Cache lines from farther bank sets can be located right next to the cache controller
D-NUCA Mapping & Searching • Locating cache lines (see the sketch below): • Incremental Search – one bank at a time • Low power, fewer messages on the cache network • Low performance • Multicast Search – some/all banks at the same time • More power, more network contention • Faster hits to farther banks
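A sketch of the two basic policies, continuing the mapping example above; probe_bank() is an assumed helper that checks one bank's tags, and banks are indexed nearest-first:

```c
#include <stdbool.h>

bool probe_bank(int set, int bank, uint32_t addr);  /* assumed helper */

/* Incremental: visit one bank at a time, nearest first. Few network
   messages and low power, but slow hits to distant banks. */
int incremental_search(uint32_t addr)
{
    int set = bank_set_of(addr);
    for (int b = 0; b < BANKS_PER_SET; b++)
        if (probe_bank(set, b, addr))
            return b;        /* hit in bank b */
    return -1;               /* miss          */
}

/* Multicast: in hardware all banks are probed at once; the loop here
   only models which banks receive a probe message. */
int multicast_search(uint32_t addr)
{
    int set = bank_set_of(addr), hit = -1;
    for (int b = 0; b < BANKS_PER_SET; b++)
        if (probe_bank(set, b, addr) && hit < 0)
            hit = b;
    return hit;
}
```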
D-NUCA Mapping & Searching • Hybrid Searches – combinations of the two • Limited Multicast • Multicast to M of the N banks in each bank set in parallel, where M < N • Partitioned Multicast • Similar to multi-level set-associative caches • Each bank set is broken up into subsets • Multicast searches are performed on each subset, starting with the closest subset
D-NUCA Line Movement • The goal of D-NUCA is to maximize hits in the closest banks • An LRU policy orders the lines within a bank set • MRU lines are closest to the cache controller • Replacement Policy – Generational Promotion • A cache hit causes that line to be moved one bank closer to the cache controller
D-NUCA Line Movement • Generational Promotion (cont'd) • More heavily used lines thus migrate toward the cache controller • The eviction/insertion policy shouldn't simply eject the LRU line and insert the new line in that spot • New lines are inserted toward the middle of the bank set, and allowed to progress forward or back • The victim line can be evicted or simply demoted, with a less important line being evicted
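A sketch of generational promotion, continuing the example above; swap_lines() is an assumed helper that exchanges two lines (and their tags) between banks:

```c
void swap_lines(int set, int a, int b);   /* assumed helper */

/* On a hit in bank b, promote the line one bank closer to the
   controller; the displaced neighbor is thereby demoted one bank. */
void promote_on_hit(int set, int b)
{
    if (b > 0)
        swap_lines(set, b, b - 1);
}

/* On a miss, insert the new line toward the middle of the bank set
   (not at the tail), letting it earn promotion or drift back. The
   exact insertion point here is a hypothetical policy choice. */
int insertion_bank(void)
{
    return BANKS_PER_SET / 2;
}
```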
Conclusion • Cache improvements are often more work than the benefits they offer • Complexity causes a speed decrease, which limits usefulness • Implementing a complex caching structure does not usually provide a good cost/benefit ratio for companies • Research is still being done, and these techniques remain useful in the theoretical world
References [1] Changkyu Kim, Doug Burger, and Stephen Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. Computer Architecture and Technology Laboratory, The University of Texas at Austin. [2] http://en.wikipedia.org/wiki/CPUcache [3] B. Batson and T. N. Vijaykumar. Reactive-associative caches. In Int. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001. [4] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003. Third Edition, Chapter Five.
References [5] B. Calder, D. Grunwald, and J. Emer. Predictive sequential associative cache. In Proceedings of the Second IEEE Symposium on High-Performance Computer Architecture, Feb. 1996. [6] B. Calder and D. Grunwald. Next cache line and set prediction. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995. [7] A. Agarwal and S. Pudar. Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In Proceedings of the 20th International Symposium on Computer Architecture, May 1993.