CS 7960-4 Lecture 10
Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers
N.P. Jouppi, Proceedings of ISCA-17, 1990
Cache Basics
[Figure: two-way set-associative cache -- a decoder selects a set in the tag and data arrays (Way 1, Way 2); a comparator checks the stored tags against the address tag, and a mux selects the matching way]
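A minimal Python sketch of the look-up the figure depicts. The figure shows a two-way array; the sketch models the simpler direct-mapped case the paper targets. The 4 KB / 16 B geometry is an illustrative assumption, not from the paper.

```python
# Hypothetical geometry for illustration (not from the paper).
CACHE_SIZE = 4 * 1024                   # 4 KB cache
LINE_SIZE  = 16                         # 16 B lines
NUM_SETS   = CACHE_SIZE // LINE_SIZE    # direct-mapped: one line per set

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # log2(16)  = 4
INDEX_BITS  = NUM_SETS.bit_length() - 1    # log2(256) = 8

tags = [None] * NUM_SETS   # tag array (None = invalid entry)

def split_address(addr):
    """Decode an address into (tag, set index, line offset)."""
    offset = addr & (LINE_SIZE - 1)
    index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(addr):
    """The decoder selects the set; the comparator checks the tag."""
    tag, index, _ = split_address(addr)
    return tags[index] == tag   # hit if the stored tag matches
```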
Banking
[Figure: a cache array split into banks -- words/ways get distributed across banks in one organization, sets in another; wordlines and bitlines are shorter per bank]
• Banking reduces access time per bank and overall power
• Allows multiple accesses without true multiporting (see the sketch below)
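A sketch of why banking permits concurrent accesses: if the bank-select bits of two addresses differ, the accesses touch different banks and can proceed in parallel. The bank count and bit positions here are assumptions for illustration.

```python
NUM_BANKS   = 4    # assumed bank count
OFFSET_BITS = 4    # 16 B lines, as in the earlier sketch

def bank_of(addr):
    # Low-order line-address bits pick the bank, so consecutive
    # lines land in different banks (set interleaving).
    return (addr >> OFFSET_BITS) % NUM_BANKS

def conflict(addr_a, addr_b):
    # Two accesses need true multiporting only if they
    # map to the same bank.
    return bank_of(addr_a) == bank_of(addr_b)
```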
Virtual Memory
• A single physical address (A) can map to multiple virtual addresses (X, Y)
• The CPU provides addresses X and Y, and the cache must make sure that both map to the same cache location
• Naive solution: perform virtual-to-physical translation (TLB look-up) before accessing the cache
Page Coloring
• To identify potential cache locations and initiate the RAM look-up, only the index bits are needed
• If the OS ensures that the virtual index bits always match the physical index bits, the RAM look-up can start before the TLB look-up completes
• When both finish, use the newly obtained physical address for the tag comparison (note: the virtual address can't be used for tag comparison)
• This is a virtually-indexed, physically-tagged organization (the constraint is sketched below)
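A small sketch of the VIPT constraint: the early RAM look-up is safe without OS help only when every index bit falls inside the page offset; otherwise the OS must "color" pages so the spilled index bits agree. The page size and cache geometry are assumptions carried over from the earlier sketch.

```python
PAGE_SIZE        = 4096   # assumed
PAGE_OFFSET_BITS = 12     # log2(4096)
OFFSET_BITS      = 4      # 16 B lines
INDEX_BITS       = 8      # 256 sets

def index_is_physical():
    """True if the index bits lie entirely within the page offset,
    so virtual and physical addresses share the same index bits."""
    return OFFSET_BITS + INDEX_BITS <= PAGE_OFFSET_BITS

def colors_needed():
    """If the index spills past the page offset, the OS must keep
    this many page 'colors' consistent between virtual and
    physical pages."""
    spill = OFFSET_BITS + INDEX_BITS - PAGE_OFFSET_BITS
    return 1 << max(spill, 0)
```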
Memory Wall

  Year   Clock speed   Memory latency   In cycles
  1997   0.75 GHz      50+20 ns         53 cycles
  2011   10 GHz        16 ns            160 cycles

• Memory latency improves by only 10%/year
• Clock speed has traditionally improved by 50%/year, but will improve by only ~20%/year in the future
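The cycle counts in the table follow from simple arithmetic: latency in cycles is latency in nanoseconds times the clock rate in GHz.

```python
def latency_cycles(latency_ns, clock_ghz):
    # 1 GHz clock => 1 ns per cycle, so ns * GHz gives cycles.
    return latency_ns * clock_ghz

print(latency_cycles(50 + 20, 0.75))   # 1997: 52.5 -> ~53 cycles
print(latency_cycles(16, 10))          # 2011: 160 cycles
```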
Conflict Misses
• Direct-mapped caches have lower access times, but suffer from conflict misses
• Most conflict misses are localized to a few sets -- suggesting that a small effective associativity (~1.2) would suffice
Victim Caches
• Every eviction from L1 gets put in the victim cache (VC and L1 are exclusive)
• The victim cache's associative look-up can happen in parallel with the L1 look-up -- a VC hit results in a swap (see the sketch below)
[Figure: L1 backed by a small fully-associative victim cache]
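A minimal model of the protocol described above: an L1 miss that hits in the fully-associative victim cache swaps the two lines, and every L1 eviction goes into the VC. Lines are identified by address only, the VC uses FIFO replacement via insertion order, and the entry count is an assumption.

```python
from collections import OrderedDict

VC_ENTRIES = 4
l1 = {}                       # set index -> resident line address
victim_cache = OrderedDict()  # fully associative; FIFO via insertion order

def evict_to_vc(line):
    if line is None:
        return
    if len(victim_cache) >= VC_ENTRIES:
        victim_cache.popitem(last=False)   # drop the oldest entry
    victim_cache[line] = True

def access(addr, index_of):
    index = index_of(addr)
    if l1.get(index) == addr:
        return "L1 hit"
    if addr in victim_cache:          # looked up in parallel with L1
        del victim_cache[addr]
        evict_to_vc(l1.get(index))    # displaced L1 line goes to the VC
        l1[index] = addr              # VC line swaps into L1
        return "VC hit (swap)"
    evict_to_vc(l1.get(index))        # true miss: fetch below, evict to VC
    l1[index] = addr
    return "miss"
```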
Results
• The cache size and line size influence the percentage of misses attributable to conflicts
• A 15-entry victim cache eliminates half the conflict misses -- the reduction in total cache misses is less than 20%
Prefetch Techniques
• Prefetch-on-miss fetches multiple lines on every cache miss
• Tagged prefetch waits until a prefetched line is touched before bringing in more lines (sketched below)
• Prefetch deals with capacity and compulsory misses, but causes cache pollution
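An illustrative model of tagged prefetch, not the paper's exact mechanism: a demand miss fetches line i and prefetches i+1 with a tag bit set; the first touch of a tagged line clears the bit and triggers one more prefetch.

```python
cache = {}   # line address -> tag bit (True = prefetched, not yet touched)

def prefetch(line):
    if line not in cache:
        cache[line] = True               # bring in the line, mark it tagged

def reference(line):
    if line not in cache:
        cache[line] = False              # demand fetch, untagged
        prefetch(line + 1)
    elif cache[line]:                    # first touch of a prefetched line
        cache[line] = False              # clear the tag
        prefetch(line + 1)               # and bring in one more line
```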
Stream Buffers
• On a cache miss, fill the stream buffer with contiguous cache lines
• When the line at the top of the queue is read, bring in the next sequential line
• If the top of the queue does not service a miss, the stream buffer flushes and starts from scratch (see the model below)
[Figure: L1 backed by a stream buffer holding sequential lines]
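A minimal model of a single stream buffer with the behavior described above: a FIFO of sequential lines consulted only at its head; a head hit advances the stream, a head miss flushes and re-fills. The depth is an assumption (the paper's results below use eight entries).

```python
from collections import deque

DEPTH = 8

class StreamBuffer:
    def __init__(self):
        self.q = deque()

    def fill(self, miss_line):
        """On a cache miss, (re)fill with the next sequential lines."""
        self.q = deque(miss_line + i for i in range(1, DEPTH + 1))

    def probe(self, line):
        """Only the head of the queue can service a miss."""
        if self.q and self.q[0] == line:
            self.q.popleft()
            self.q.append(line + DEPTH)   # fetch the next line in sequence
            return True
        self.fill(line)                   # head miss: flush, start over
        return False
```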
Results
• Eight entries are enough to eliminate most capacity and compulsory misses
• 72% of I-cache misses and 25% of D-cache misses are eliminated
• Multiple stream buffers help eliminate 43% of D-cache misses
• Large cache lines minimize the stream buffer's impact (the stream buffer removes 10% of D-cache misses for a 128 B cache line size)
Potential Improvements
• Relax the top-of-queue constraint for the stream buffer
• Maintain a stride value to detect non-sequential accesses (sketched below)
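A hedged sketch of the stride idea suggested above, under my own assumptions about the mechanism: remember the last miss address, infer the stride from consecutive misses, and prefetch along that stride instead of +1.

```python
class StridedStream:
    def __init__(self, depth=4):
        self.last = None     # previous miss address
        self.stride = 1      # default to sequential prefetch
        self.depth = depth

    def on_miss(self, addr):
        """Learn the stride from consecutive miss addresses and
        return the lines to prefetch."""
        if self.last is not None:
            self.stride = (addr - self.last) or 1   # avoid a zero stride
        self.last = addr
        return [addr + i * self.stride for i in range(1, self.depth + 1)]
```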
Bottlenecks Again
[Figure: miss breakdown for 4 KB caches with 16 B lines]
Harmonic and Arithmetic Means
• HM of IPC = N / (1/IPC_a + 1/IPC_b + 1/IPC_c)
            = N / (CPI_a + CPI_b + CPI_c)
            = 1 / (AM of CPI)
• This weights each benchmark as if they all execute one instruction
• If you instead want to assume each benchmark executes for the same time, the HM of CPI (or equivalently the AM of IPC) is appropriate
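A numeric check of the identity above: the harmonic mean of IPCs equals the reciprocal of the arithmetic mean of the corresponding CPIs. The benchmark IPC values are made up for illustration.

```python
ipc = [2.0, 1.0, 0.5]            # illustrative per-benchmark IPCs
cpi = [1 / x for x in ipc]

hm_ipc = len(ipc) / sum(1 / x for x in ipc)   # N / sum(CPI)
am_cpi = sum(cpi) / len(cpi)

assert abs(hm_ipc - 1 / am_cpi) < 1e-12
print(hm_ipc, 1 / am_cpi)        # both ~0.857
```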
Next Week’s Paper
• “Memory Dependence Prediction Using Store Sets”, Chrysos and Emer, ISCA-25, 1998