CS 7960-4 Lecture 10
Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers
N.P. Jouppi, Proceedings of ISCA-17, 1990
Cache Basics
[Figure: two-way set-associative cache -- a decoder selects a set in the tag and data arrays (Way 1, Way 2); a comparator checks the stored tags against the address tag, and a mux selects the matching way]
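A minimal Python sketch of the look-up the figure depicts. The figure shows a two-way array; the sketch models the simpler direct-mapped case the paper targets. The 4 KB / 16 B geometry is an illustrative assumption, not from the paper.

```python
# Hypothetical geometry for illustration (not from the paper).
CACHE_SIZE = 4 * 1024                   # 4 KB cache
LINE_SIZE  = 16                         # 16 B lines
NUM_SETS   = CACHE_SIZE // LINE_SIZE    # direct-mapped: one line per set

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # log2(16)  = 4
INDEX_BITS  = NUM_SETS.bit_length() - 1    # log2(256) = 8

tags = [None] * NUM_SETS   # tag array (None = invalid entry)

def split_address(addr):
    """Decode an address into (tag, set index, line offset)."""
    offset = addr & (LINE_SIZE - 1)
    index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(addr):
    """The decoder selects the set; the comparator checks the tag."""
    tag, index, _ = split_address(addr)
    return tags[index] == tag   # hit if the stored tag matches
```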
Banking
[Figure: a cache array split into banks -- words/ways get distributed across banks in one organization, sets in another; wordlines and bitlines are shorter per bank]
• Banking reduces access time per bank and overall power
• Allows multiple accesses without true multiporting (see the sketch below)
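A sketch of why banking permits concurrent accesses: if the bank-select bits of two addresses differ, the accesses touch different banks and can proceed in parallel. The bank count and bit positions here are assumptions for illustration.

```python
NUM_BANKS   = 4    # assumed bank count
OFFSET_BITS = 4    # 16 B lines, as in the earlier sketch

def bank_of(addr):
    # Low-order line-address bits pick the bank, so consecutive
    # lines land in different banks (set interleaving).
    return (addr >> OFFSET_BITS) % NUM_BANKS

def conflict(addr_a, addr_b):
    # Two accesses need true multiporting only if they
    # map to the same bank.
    return bank_of(addr_a) == bank_of(addr_b)
```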
Virtual Memory
• A single physical address (A) can map to multiple virtual addresses (X, Y)
• The CPU provides addresses X and Y, and the cache must make sure that both map to the same cache location
• Naive solution: perform virtual-to-physical translation (TLB look-up) before accessing the cache
Page Coloring
• To identify potential cache locations and initiate the RAM look-up, only the index bits are needed
• If the OS ensures that the virtual index bits always match the physical index bits, the RAM look-up can start before the TLB look-up completes
• When both finish, use the newly obtained physical address for the tag comparison (note: the virtual address can't be used for tag comparison)
• This is a virtually-indexed, physically-tagged organization (the constraint is sketched below)
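A small sketch of the VIPT constraint: the early RAM look-up is safe without OS help only when every index bit falls inside the page offset; otherwise the OS must "color" pages so the spilled index bits agree. The page size and cache geometry are assumptions carried over from the earlier sketch.

```python
PAGE_SIZE        = 4096   # assumed
PAGE_OFFSET_BITS = 12     # log2(4096)
OFFSET_BITS      = 4      # 16 B lines
INDEX_BITS       = 8      # 256 sets

def index_is_physical():
    """True if the index bits lie entirely within the page offset,
    so virtual and physical addresses share the same index bits."""
    return OFFSET_BITS + INDEX_BITS <= PAGE_OFFSET_BITS

def colors_needed():
    """If the index spills past the page offset, the OS must keep
    this many page 'colors' consistent between virtual and
    physical pages."""
    spill = OFFSET_BITS + INDEX_BITS - PAGE_OFFSET_BITS
    return 1 << max(spill, 0)
```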
Memory Wall

  Year   Clock speed   Memory latency   In cycles
  1997   0.75 GHz      50+20 ns         53 cycles
  2011   10 GHz        16 ns            160 cycles

• Memory latency improves by only 10%/year
• Clock speed has traditionally improved by 50%/year, but will improve by only ~20%/year in the future
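The cycle counts in the table follow from simple arithmetic: latency in cycles is latency in nanoseconds times the clock rate in GHz.

```python
def latency_cycles(latency_ns, clock_ghz):
    # 1 GHz clock => 1 ns per cycle, so ns * GHz gives cycles.
    return latency_ns * clock_ghz

print(latency_cycles(50 + 20, 0.75))   # 1997: 52.5 -> ~53 cycles
print(latency_cycles(16, 10))          # 2011: 160 cycles
```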
Conflict Misses
• Direct-mapped caches have lower access times, but suffer from conflict misses
• Most conflict misses are localized to a few sets -- suggesting that a small effective associativity (~1.2) would suffice
Victim Caches
• Every eviction from L1 gets put in the victim cache (VC and L1 are exclusive)
• The victim cache's associative look-up can happen in parallel with the L1 look-up -- a VC hit results in a swap (see the sketch below)
[Figure: L1 backed by a small fully-associative victim cache]
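A minimal model of the protocol described above: an L1 miss that hits in the fully-associative victim cache swaps the two lines, and every L1 eviction goes into the VC. Lines are identified by address only, the VC uses FIFO replacement via insertion order, and the entry count is an assumption.

```python
from collections import OrderedDict

VC_ENTRIES = 4
l1 = {}                       # set index -> resident line address
victim_cache = OrderedDict()  # fully associative; FIFO via insertion order

def evict_to_vc(line):
    if line is None:
        return
    if len(victim_cache) >= VC_ENTRIES:
        victim_cache.popitem(last=False)   # drop the oldest entry
    victim_cache[line] = True

def access(addr, index_of):
    index = index_of(addr)
    if l1.get(index) == addr:
        return "L1 hit"
    if addr in victim_cache:          # looked up in parallel with L1
        del victim_cache[addr]
        evict_to_vc(l1.get(index))    # displaced L1 line goes to the VC
        l1[index] = addr              # VC line swaps into L1
        return "VC hit (swap)"
    evict_to_vc(l1.get(index))        # true miss: fetch below, evict to VC
    l1[index] = addr
    return "miss"
```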
Results
• The cache size and line size influence the percentage of misses attributable to conflicts
• A 15-entry victim cache eliminates half the conflict misses -- the reduction in total cache misses is less than 20%
Prefetch Techniques
• Prefetch-on-miss fetches multiple lines on every cache miss
• Tagged prefetch waits until a prefetched line is touched before bringing in more lines (sketched below)
• Prefetch deals with capacity and compulsory misses, but causes cache pollution
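An illustrative model of tagged prefetch, not the paper's exact mechanism: a demand miss fetches line i and prefetches i+1 with a tag bit set; the first touch of a tagged line clears the bit and triggers one more prefetch.

```python
cache = {}   # line address -> tag bit (True = prefetched, not yet touched)

def prefetch(line):
    if line not in cache:
        cache[line] = True               # bring in the line, mark it tagged

def reference(line):
    if line not in cache:
        cache[line] = False              # demand fetch, untagged
        prefetch(line + 1)
    elif cache[line]:                    # first touch of a prefetched line
        cache[line] = False              # clear the tag
        prefetch(line + 1)               # and bring in one more line
```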
Stream Buffers
• On a cache miss, fill the stream buffer with contiguous cache lines
• When the line at the top of the queue is read, bring in the next sequential line
• If the top of the queue does not service a miss, the stream buffer flushes and starts from scratch (see the model below)
[Figure: L1 backed by a stream buffer holding sequential lines]
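A minimal model of a single stream buffer with the behavior described above: a FIFO of sequential lines consulted only at its head; a head hit advances the stream, a head miss flushes and re-fills. The depth is an assumption (the paper's results below use eight entries).

```python
from collections import deque

DEPTH = 8

class StreamBuffer:
    def __init__(self):
        self.q = deque()

    def fill(self, miss_line):
        """On a cache miss, (re)fill with the next sequential lines."""
        self.q = deque(miss_line + i for i in range(1, DEPTH + 1))

    def probe(self, line):
        """Only the head of the queue can service a miss."""
        if self.q and self.q[0] == line:
            self.q.popleft()
            self.q.append(line + DEPTH)   # fetch the next line in sequence
            return True
        self.fill(line)                   # head miss: flush, start over
        return False
```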
Results
• Eight entries are enough to eliminate most capacity and compulsory misses
• 72% of I-cache misses and 25% of D-cache misses are eliminated
• Multiple stream buffers help eliminate 43% of D-cache misses
• Large cache lines minimize the stream buffer's impact (the stream buffer removes 10% of D-cache misses for a 128 B cache line size)
Potential Improvements
• Relax the top-of-queue constraint for the stream buffer
• Maintain a stride value to detect non-sequential accesses (sketched below)
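A hedged sketch of the stride idea suggested above, under my own assumptions about the mechanism: remember the last miss address, infer the stride from consecutive misses, and prefetch along that stride instead of +1.

```python
class StridedStream:
    def __init__(self, depth=4):
        self.last = None     # previous miss address
        self.stride = 1      # default to sequential prefetch
        self.depth = depth

    def on_miss(self, addr):
        """Learn the stride from consecutive miss addresses and
        return the lines to prefetch."""
        if self.last is not None:
            self.stride = (addr - self.last) or 1   # avoid a zero stride
        self.last = addr
        return [addr + i * self.stride for i in range(1, self.depth + 1)]
```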
Bottlenecks Again
[Figure: miss breakdown for 4 KB caches with 16 B lines]
Harmonic and Arithmetic Means
• HM of IPC = N / (1/IPC_a + 1/IPC_b + 1/IPC_c)
            = N / (CPI_a + CPI_b + CPI_c)
            = 1 / (AM of CPI)
• This weights each benchmark as if they all execute one instruction
• If you instead want to assume each benchmark executes for the same time, the HM of CPI (or equivalently the AM of IPC) is appropriate
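A numeric check of the identity above: the harmonic mean of IPCs equals the reciprocal of the arithmetic mean of the corresponding CPIs. The benchmark IPC values are made up for illustration.

```python
ipc = [2.0, 1.0, 0.5]            # illustrative per-benchmark IPCs
cpi = [1 / x for x in ipc]

hm_ipc = len(ipc) / sum(1 / x for x in ipc)   # N / sum(CPI)
am_cpi = sum(cpi) / len(cpi)

assert abs(hm_ipc - 1 / am_cpi) < 1e-12
print(hm_ipc, 1 / am_cpi)        # both ~0.857
```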
Next Week’s Paper
• “Memory Dependence Prediction Using Store Sets”, Chrysos and Emer, ISCA-25, 1998