Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture
Outline • Introduction • Reactive-Associative Caches • Non-Uniform Cache Architectures • Conclusion / References • Questions
Cache Problem Domains • Average memory access time = Hit Time + Miss Rate × Miss Penalty • Hit Time • Time to search the cache and return data • Miss Rate • Fraction of accesses for which the needed data is not in the cache and must be fetched from main memory • Cache Latency • Physical delay to move data from the cache to the registers
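A quick worked example of the formula above, with hypothetical numbers chosen only for illustration:

```c
#include <stdio.h>

/* Average memory access time (AMAT) = hit time + miss rate * miss penalty.
   All numbers below are hypothetical, for illustration only. */
int main(void)
{
    double hit_time     = 1.0;    /* cycles to search the cache and return data */
    double miss_rate    = 0.05;   /* 5% of accesses must go to main memory      */
    double miss_penalty = 100.0;  /* cycles to fetch a block from main memory   */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);  /* 1.0 + 0.05 * 100 = 6.0 */
    return 0;
}
```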
Hit Time / Miss Rate • Searching for cache hits • Set-associative caches increase hit times greatly • Multiple ways must be checked for a hit, and then the data in the matching way must be accessed • Miss Rate • Direct-mapped caches have high miss rates • Very small changes in miss rate can affect performance greatly
Latency / Mapping • Latency • Cache latency is a primary reason for complex, multi-level cache architectures • Very difficult to improve due to physical limitations • Mapping • How data is mapped into the cache (associativity, physical location) • Better mapping heuristics can reduce the average search time and latency
Effects of Cache Changes • Power • More complex cache architectures use more power to complete tasks • Time • The more complex or the larger a cache, the slower it will be • Real Estate • Complexity is directly proportional to the number and length of wire traces • Hits / Misses • Each change to a cache impacts the hit time and miss rate in some way
Reactive-Associative Caches Joseph Schmigel
Reactive-Associative Caches • Attempts to combine direct-mapped and set-associative caches • Goal is to decrease the miss rate while keeping hit times similar to direct-mapped • Avoids the disadvantages of each • Direct-mapped has a high miss rate • Set-associative has a high hit time • Several major parts: • Data array • Tag array • Probes • Way prediction • Feedback
Data Array & Tag Array • The data array is the actual cache that stores the data • The data array has two address mappings: one direct-mapped and one set-associative (usually 2, 4, or 8 ways) • The tag array has n tag banks, where n is the number of ways • The tag array stores the tags for each set-associative index • All tag banks are searched in parallel
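A minimal C sketch of this organization, assuming a hypothetical geometry (1024 blocks, 64 B lines, 4 ways); the names and sizes are illustrative, not taken from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024                 /* hypothetical data-array size  */
#define NUM_WAYS   4                    /* hypothetical associativity    */
#define NUM_SETS   (NUM_BLOCKS / NUM_WAYS)
#define BLOCK_SIZE 64                   /* hypothetical 64 B cache lines */

/* One tag bank per way; the n banks are searched in parallel in hardware. */
typedef struct {
    uint32_t tag[NUM_SETS];
    bool     valid[NUM_SETS];
} TagBank;

typedef struct {
    uint8_t data[NUM_BLOCKS][BLOCK_SIZE];  /* the single data array      */
    TagBank bank[NUM_WAYS];                /* n tag banks for the n ways */
} RACache;

/* Direct-mapped view: the address selects one unique block. */
static inline uint32_t dm_index(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}

/* Set-associative view: the address selects a set; the way comes from
   way prediction or from a tag-array lookup. */
static inline uint32_t sa_index(uint32_t addr, uint32_t way)
{
    uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
    return way * NUM_SETS + set;
}
```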
Probes • Two probes (Probe0 & Probe1) are used to signal a hit • Probe0 performs three steps in parallel: • Looks for a direct-mapped hit • Uses way prediction to find a hit • Finds a hit in the tag array • Probe0 tries to keep the hit time equal to a direct-mapped hit time; it only fails if it has to use the tag array • Probe1 is used only if Probe0 does not find a direct-mapped or way-predicted hit. It then returns a hit if there is a match in the set-associative cache
Probes continued • This means that the following possibilities exist (see the sketch below): • Probe0 hits on the direct-mapped location and Probe1 is ignored • Probe0 hits on the way prediction and Probe1 is ignored • Probe0 hits using the tag array and Probe1 hits using the way found from the tag array • Probe0 misses and Probe1 is ignored
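Continuing the sketch above, the two-probe flow might look like this; dm_hit(), predicted_way(), tag_match(), and tag_lookup() are assumed helpers, not the paper's interface:

```c
/* Assumed helpers (hypothetical): */
bool dm_hit(RACache *c, uint32_t addr);          /* direct-mapped match?    */
int  predicted_way(uint32_t addr);               /* -1 if no prediction     */
bool tag_match(RACache *c, uint32_t addr, int way);
int  tag_lookup(RACache *c, uint32_t addr);      /* search banks, -1 = miss */

typedef enum { HIT_DM, HIT_PREDICTED, HIT_PROBE1, MISS } ProbeResult;

ProbeResult ra_lookup(RACache *c, uint32_t addr)
{
    /* Probe0: these three checks happen in parallel in hardware. */
    if (dm_hit(c, addr))
        return HIT_DM;                  /* Probe1 is ignored */

    int way = predicted_way(addr);
    if (way >= 0 && tag_match(c, addr, way))
        return HIT_PREDICTED;           /* Probe1 is ignored */

    way = tag_lookup(c, addr);          /* Probe0 falls back to tag array */
    if (way < 0)
        return MISS;                    /* Probe1 is ignored */

    /* Probe1 accesses the data array with the way the tag lookup found,
       costing extra latency relative to a direct-mapped hit. */
    return HIT_PROBE1;
}
```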
Way Prediction • Allows the block to be accessed without performing a tag lookup to obtain the way • Keeps hit times comparable to direct-mapped • Must be performed early enough that the data can be ready in time for the pipeline stage that needs it • The prediction can only use information that is available in the pipeline at that point • Two types of way prediction were used: XOR and program counter
XOR Way Prediction • Approximates the data address by XOR'ing the base register value with the instruction's offset • Works because the small memory offsets that are common can be XOR'ed (instead of added) to get a reliable approximate block address to use as a prediction • Cannot be done until late in the pipeline, because the register must be loaded before performing the calculation • More accurate than program-counter way prediction
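A minimal sketch of the idea, continuing the earlier example: XOR is a cheap, carry-free stand-in for the base + offset addition, good enough to index a prediction table. The table size and indexing scheme here are hypothetical:

```c
#define PRED_ENTRIES 256                      /* hypothetical table size */

static uint8_t way_pred_table[PRED_ENTRIES];  /* last way used per entry */

/* Approximate the effective address as base ^ offset instead of
   base + offset, then use it to pick a predicted way. */
static inline int xor_predict_way(uint32_t base_reg, int32_t offset)
{
    uint32_t approx = base_reg ^ (uint32_t)offset;
    return way_pred_table[(approx / BLOCK_SIZE) % PRED_ENTRIES];
}
```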
Program Counter Way Prediction • Associates parts of the cache with the program counter • Not as accurate as XOR, since the same program counter does not always access the same memory location • The program counter is available early in the pipeline, so it is easier to make the prediction in time
Feedback • 3 types of feedback: • Reactive displacement • Eviction of unpredictable blocks • Eviction of hard-to-predict blocks • Feedback tries to maximize bandwidth and minimize hit latency • Highly predictable blocks are kept in the set-associative positions • Blocks that cannot be predicted reliably are kept in the direct-mapped positions
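One plausible reading of this mechanism, sketched with a per-block saturating counter (continuing the earlier sketch); the counter width and threshold are invented for illustration and are not from the paper:

```c
#define MISPRED_MAX 3      /* hypothetical saturation threshold */

static uint8_t mispredicts[NUM_BLOCKS];   /* per-block counters */

/* Track way mispredictions; once a block proves hard to predict,
   report that it should be displaced back to (and kept in) its
   direct-mapped position. */
bool keep_direct_mapped(uint32_t block, bool mispredicted)
{
    if (mispredicted && mispredicts[block] < MISPRED_MAX)
        mispredicts[block]++;
    else if (!mispredicted && mispredicts[block] > 0)
        mispredicts[block]--;
    return mispredicts[block] == MISPRED_MAX;   /* unpredictable block */
}
```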
Non-Uniform Cache Architectures James Brock
Cache Organization • Multiple-Level Cache • Hierarchical organization designed for faster accesses to the layers of cache closer to the core • Replacement policies are static • i.e., a replacement causes one insertion and one eviction at the same location in the cache • Uniform Cache • The cache is physically laid out in uniformly distributed banks and sub-banks
Problem Domain • CPUs are becoming wire-delay dominated • As core speeds increase, the latency of transmission delays has a greater effect on overall performance • 2 possible paths: • Reduce the latency of wire traces (physical limitations) • Embrace latency in the design, and optimize around it
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA) • All designs were modeled for L2 cache, but can be scaled to work at any level • A uniform cache is only as fast as its slowest bank • A non-uniform cache exposes each (sub)bank's actual latency, exploiting the faster banks nearer the decoder for better performance • S-NUCA • Static means that data in main memory is mapped to 1 … n fixed locations in the cache, where n = associativity
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA) • [Figure: S-NUCA1 and S-NUCA2 bank layouts]
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA) • S-NUCA 1 • Individual data and address channels for each bank • Multiple banks can be accessed in parallel • HUGE real estate cost to add channels for each bank • S-NUCA 2 • Mesh grid of data and address channels • Switches at each intersection access multiple sub-banks in parallel and arbitrate data flow
Solution 2: Dynamic NUCA (D-NUCA) • Dynamic refers to the ranking and movement of cache lines within the banks and sub-banks • The replacement policy is not a simple insert & evict • Insertion, demotion, and eviction are based on the replacement heuristic (e.g., least recently used) • With D-NUCA, the mapping, searching, and line-movement problems expand
D-NUCA Mapping & Searching • Uses spread sets of banks • # of banks in a set = associativity of the cache • Simple Mapping (see the sketch below) • Search by set, then bank, then tags within the set • Some bank sets are farther away than others, and the number of bank rows may not match the desired number of ways • Fair Mapping • Fixes the problems of simple mapping, but is more complex • Equal average access times to all bank sets
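A sketch of simple mapping under a hypothetical geometry (32 bank sets of 8 banks each); the constants are illustrative:

```c
#include <stdint.h>

#define BANK_SETS     32   /* hypothetical: columns of banks    */
#define BANKS_PER_SET 8    /* = associativity; one way per bank */
#define LINE_SIZE     64   /* hypothetical 64 B lines           */

/* Simple mapping: low address bits pick the bank set (a column);
   the line may then live in any bank of that set. */
static inline int bank_set_of(uint32_t addr)
{
    return (int)((addr / LINE_SIZE) % BANK_SETS);
}
```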
D-NUCA Mapping & Searching • Shared Mapping • The closest banks are shared with the farthest sets • If n sets share a bank, then all banks in the cache are n-way associative • The slightly higher bank associativity is offset by the lower average access latency • Cache lines from farther bank sets can be located right next to the cache controller
D-NUCA Mapping & Searching • Locating cache lines (see the sketch below): • Incremental Search – one bank at a time • Low power, fewer messages on the cache network • Low performance • Multicast Search – some/all banks at the same time • More power, more network contention • Faster hits to farther banks
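A sketch of the two basic policies, continuing the mapping example above; probe_bank() is an assumed helper that checks one bank's tags, and banks are indexed nearest-first:

```c
#include <stdbool.h>

bool probe_bank(int set, int bank, uint32_t addr);  /* assumed helper */

/* Incremental: visit one bank at a time, nearest first. Few network
   messages and low power, but slow hits to distant banks. */
int incremental_search(uint32_t addr)
{
    int set = bank_set_of(addr);
    for (int b = 0; b < BANKS_PER_SET; b++)
        if (probe_bank(set, b, addr))
            return b;        /* hit in bank b */
    return -1;               /* miss          */
}

/* Multicast: in hardware all banks are probed at once; the loop here
   only models which banks receive a probe message. */
int multicast_search(uint32_t addr)
{
    int set = bank_set_of(addr), hit = -1;
    for (int b = 0; b < BANKS_PER_SET; b++)
        if (probe_bank(set, b, addr) && hit < 0)
            hit = b;
    return hit;
}
```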
D-NUCA Mapping & Searching • Hybrid Searches – combinations of the two • Limited Multicast • Multicast to M of the N banks in each bank set in parallel, where M < N • Partitioned Multicast • Similar to multi-level set-associative caches • Each bank set is broken up into subsets • Multicast searches are performed on each subset, starting with the closest subset
D-NUCA Line Movement • The goal of D-NUCA is to maximize hits in the closest banks • An LRU policy orders the lines within a bank set • MRU lines are closest to the cache controller • Replacement Policy – Generational Promotion • A cache hit causes that line to be moved one bank closer to the cache controller
D-NUCA Line Movement • Generational Promotion (cont'd) • More heavily used lines thus migrate toward the cache controller • The eviction/insertion policy shouldn't simply eject the LRU line and insert the new line in that spot • New lines are inserted toward the middle of the bank set, and allowed to progress forward or back • The victim line can be evicted or simply demoted, with a less important line being evicted
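A sketch of generational promotion, continuing the example above; swap_lines() is an assumed helper that exchanges two lines (and their tags) between banks:

```c
void swap_lines(int set, int a, int b);   /* assumed helper */

/* On a hit in bank b, promote the line one bank closer to the
   controller; the displaced neighbor is thereby demoted one bank. */
void promote_on_hit(int set, int b)
{
    if (b > 0)
        swap_lines(set, b, b - 1);
}

/* On a miss, insert the new line toward the middle of the bank set
   (not at the tail), letting it earn promotion or drift back. The
   exact insertion point here is a hypothetical policy choice. */
int insertion_bank(void)
{
    return BANKS_PER_SET / 2;
}
```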
Conclusion • Cache improvements are often more work than the benefits they offer • Complexity causes a speed decrease, which limits usefulness • Implementing a complex caching structure does not usually provide a good cost/benefit ratio for companies • Research is still being done, and these techniques remain useful in the theoretical world
References [1] Changkyu Kim, Doug Burger, and Stephen Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. Computer Architecture and Technology Laboratory, The University of Texas at Austin. [2] http://en.wikipedia.org/wiki/CPUcache [3] B. Batson and T. N. Vijaykumar. Reactive-associative caches. In Int. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001. [4] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003. Third Edition, Chapter Five.
References [5] B. Calder, D. Grunwald, and J. Emer. Predictive sequential associative cache. In Proceedings of the Second IEEE Symposium on High-Performance Computer Architecture, Feb. 1996. [6] B. Calder and D. Grunwald. Next cache line and set prediction. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995. [7] A. Agarwal and S. Pudar. Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In Proceedings of the 20th International Symposium on Computer Architecture, May 1993.