10/18: Lecture Topics • Using spatial locality • memory blocks • Write-back vs. write-through • Types of cache misses • Cache performance • Cache tradeoffs • Cache summary
Locality • Temporal locality: the principle that data being accessed now will probably be accessed again soon • Useful data tends to continue to be useful • Spatial locality: the principle that data near the data being accessed now will probably be needed soon • If data item n is useful now, then it’s likely that data item n+1 will be useful soon
Memory Access Patterns • Memory accesses don't look like random accesses scattered across the address space • They do cluster around hot variables and step sequentially through arrays
Locality • Last time, we improved memory performance by taking advantage of temporal locality • When a word in memory was accessed, we loaded it into the cache • This does nothing for spatial locality
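As an illustration (not from the original slides), a simple loop shows both patterns at once: the accumulator is a hot variable, and the walk through the array is the sequential stepping pattern.

```c
#include <stddef.h>

/* Illustrative only: sum is reused on every iteration (temporal locality),
 * while a[0], a[1], a[2], ... are touched in order (spatial locality). */
int sum_array(const int *a, size_t n) {
    int sum = 0;                    /* hot variable: accessed every iteration */
    for (size_t i = 0; i < n; i++) {
        sum += a[i];                /* steps through consecutive words */
    }
    return sum;
}
```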
Possible Spatial Locality Solution • Store one word per cache line • When memory word N is accessed, load word N, word N+1 and N+2 … into the cache • This is called prefetching • What’s a drawback? Example: What if we access the word at address 1000100?
Memory Blocks • Divide memory into blocks • If any word in a block is accessed, then load the entire block into the cache • With a 16-word (64-byte) block size: Block 0 = 0x00000000–0x0000003F, Block 1 = 0x00000040–0x0000007F, Block 2 = 0x00000080–0x000000BF, and so on • Each cache line then holds one 16-word block
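For instance (a quick worked example, not on the original slide): with 64-byte blocks, the word at address 0x00000044 lies in block 0x44 / 64 = 1, so accessing it brings the entire range 0x00000040–0x0000007F into the cache.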
Address Tags Revisited • A cache block size > 1 word requires the address to be divided differently • Instead of a byte offset into a word, we need a byte offset into the block • Assuming we had 10-bit addresses, and 4 words in a block…
Cache Diagram • Cache with a block size of 4 words • What does the cache look like after accesses to these 10-bit addresses? 1000010010, 1100110011
Cache Lookup • 32-bit addresses, 64 KB direct-mapped cache, 4 words/block • Example reference address: 10000110 11101010 10000101 11101010 • Lookup: index into the cache; if the valid bit is on and the tags match, it is a cache hit: select the word within the block and return the data; otherwise it is a cache miss: access memory
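A minimal sketch of that lookup in C, using the slide's parameters (64 KB direct-mapped, 4-word blocks, 32-bit addresses). The struct layout and field names are illustrative, not taken from the lecture.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_WORDS 4                         /* 4 words = 16 bytes per block */
#define NUM_LINES   4096                      /* 64 KB / 16 B per block       */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data[BLOCK_WORDS];
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Returns true on a hit and fills *word; on a miss the caller goes to memory. */
bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t offset = (addr >> 2) & 0x3;      /* word within the block (bits 3:2) */
    uint32_t index  = (addr >> 4) & 0xFFF;    /* 12 index bits (bits 15:4)        */
    uint32_t tag    =  addr >> 16;            /* remaining 16 tag bits            */

    cache_line_t *line = &cache[index];
    if (line->valid && line->tag == tag) {    /* valid bit on and tags match?     */
        *word = line->data[offset];           /* hit: select word, return data    */
        return true;
    }
    return false;                             /* miss: access memory              */
}
```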
Cache Example • Suppose the L1 cache is 32KB, 2-way set associative and has 8 words per block, how do we partition the 32-bit address? • How many bits for the block offset? • How many bits for the index? • How many bits for the tag?
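Working through the arithmetic from the slide's own numbers: 8 words/block × 4 bytes/word = 32 bytes per block, so the block (byte) offset is 5 bits; the cache has 32 KB / (2 ways × 32 bytes) = 512 sets, so the index is 9 bits; the tag is the remaining 32 - 5 - 9 = 18 bits.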
The Effects of Block Size • Big blocks are good • Fewer first time misses • Exploits spatial locality • Small blocks are good • Don’t evict so much other data when bringing in a new entry • More likely that all items in the block will turn out to be useful • How do you choose a block size?
Reads vs. Writes • Caching is essentially making a copy of the data • When you read, the copies still match when you’re done • When you write, the results must eventually propagate to both copies • Especially at the lowest level, which is in some sense the permanent copy
Write-Through Caches • Write the update to the cache and the memory immediately • Advantages: • The cache and the memory are always consistent • Evicting a cache line is cheap because no data needs to be written back • Easier to implement • Disadvantages?
Write-Back Caches • Write the update to the cache only. Write to the memory only when the cache block is evicted. • Advantages: • Writes go at cache speed rather than memory speed. • Some writes never need to be written to the memory. • When a whole block is written back, can use high bandwidth transfer. • Disadvantages?
Dirty bit • When evicting a block from a write-back cache, we could • always write the block back to memory • write it back only if we changed it • Caches use a “dirty bit” to mark if a line was changed • the dirty bit is 0 when the block is loaded • it is set to 1 if the block is modified • when the line is evicted, it is written back only if the dirty bit is 1
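A hedged sketch of how a write-back cache might use the dirty bit on stores and on evictions. The helper memory_write_block and the function names are placeholders for illustration, not from the lecture.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_WORDS 4

typedef struct {
    bool     valid;
    bool     dirty;                 /* set when the block has been modified in the cache */
    uint32_t tag;
    uint32_t data[BLOCK_WORDS];
} cache_line_t;

/* Stub standing in for the path to main memory (assumed, not from the lecture). */
static void memory_write_block(uint32_t tag, uint32_t index, const uint32_t *data) {
    (void)tag; (void)index; (void)data;   /* a real system would transfer the block to DRAM */
}

/* Store hit in a write-back cache: update only the cache and mark the line dirty. */
static void write_back_store(cache_line_t *line, uint32_t word_offset, uint32_t value) {
    line->data[word_offset] = value;
    line->dirty = true;                   /* memory is now stale */
}

/* On eviction, write the old block back only if the dirty bit is set. */
static void evict_and_refill(cache_line_t *line, uint32_t index,
                             uint32_t new_tag, const uint32_t *new_block) {
    if (line->valid && line->dirty)
        memory_write_block(line->tag, index, line->data);   /* write-back */
    for (int i = 0; i < BLOCK_WORDS; i++)
        line->data[i] = new_block[i];
    line->tag   = new_tag;
    line->valid = true;
    line->dirty = false;                  /* the freshly loaded block matches memory */
}
```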
Dirty Bit Example • Use the dirty bit to determine when evicted cache lines need to be written back to memory • Access sequence: $r2 = Mem[10010000]; Mem[10010100] = 10; $r3 = Mem[11010100]; $r4 = Mem[11011000]; Mem[01010000] = 10 • Assume 8-bit addresses • Assume all memory words are initialized to 7
i-Cache and d-Cache • There usually are two separate caches for instructions and data. Why? • Avoids structural hazards in pipelining • Together the two caches hold twice as much as one, yet each still has the access time of a small cache • Allows both caches to operate in parallel, for twice the bandwidth
Handling i-Cache Misses 1. Stall the pipeline and send the address of the missed instruction to the memory 2. Instruct memory to perform a read; wait for the access to complete 3. Update the cache 4. Restart the instruction, this time fetching it successfully from the cache • d-Cache misses are even easier, but still require a pipeline stall
Cache Replacement • How do you decide which cache block to replace? • If the cache is direct-mapped, it’s easy • Otherwise, common strategies: • Random • Least Recently Used (LRU) • Other strategies are used at lower levels of the hierarchy. More on those later.
LRU Replacement • Replace the block that hasn’t been used for the longest time. Reference stream: A B C D B D E B A C B C E D C B
LRU Implementations • LRU is very difficult to implement for high degrees of associativity • 4-way approximation: • 1 bit to indicate least recently used pair • 1 bit per pair to indicate least recently used item in this pair • Much more complex approximations at lower levels of the hierarchy
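A possible sketch of that 3-bit approximation for one 4-way set (tree pseudo-LRU). The bit encoding below is one common convention, assumed here rather than taken from the lecture.

```c
#include <stdint.h>

/* 3 pseudo-LRU bits per 4-way set:
 *   bits[0] : which pair (ways 0-1 vs. ways 2-3) was least recently used
 *   bits[1] : least recently used way within pair 0 (ways 0,1)
 *   bits[2] : least recently used way within pair 1 (ways 2,3)
 */
typedef struct { uint8_t bits[3]; } plru_t;

/* On an access to `way`, point every bit on its path away from that way. */
void plru_touch(plru_t *p, int way) {
    int pair   = way >> 1;          /* 0 for ways 0-1, 1 for ways 2-3        */
    int within = way & 1;
    p->bits[0]        = pair ^ 1;   /* the other pair is now the LRU pair    */
    p->bits[1 + pair] = within ^ 1; /* the other way in this pair is LRU-er  */
}

/* Choose a victim by following the bits toward the least recently used way. */
int plru_victim(const plru_t *p) {
    int pair = p->bits[0];
    return pair * 2 + p->bits[1 + pair];
}
```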
The Three C’s of Caches • Three reasons for cache misses: • Compulsory miss: item has never been in the cache • Capacity miss: item has been in the cache, but space was tight and it was forced out (occurs even with fully associative caches) • Conflict miss: item was in the cache, but the cache was not associative enough, so it was forced out (never occurs with fully associative caches)
Eliminating Cache Misses • What cache parameters (cache size, block size, associativity) can you change to eliminate the following kinds of misses? • compulsory • capacity • conflict
Multi-Level Caches • Use each level of the memory hierarchy as a cache over the next lowest level • Inserting level 2 between levels 1 and 3 allows: • level 1 to have a higher miss rate (so can be smaller and cheaper) • level 3 to have a larger access time (so can be slower and cheaper) • The new effective access time equation:
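The standard two-level form of that equation (the original slide's exact notation isn't reproduced here) is: effective access time = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty).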
Which cache system is better? • A 32 KB unified data and instruction cache with a hit rate of 97% • Or a 16 KB data cache with a hit rate of 92% plus a 16 KB instruction cache with a hit rate of 98% • Assume 20% of instructions are loads or stores
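One way to compare them (an illustrative calculation, assuming each instruction makes one instruction fetch plus 0.2 data accesses): the unified cache sees 1.2 accesses per instruction and misses on 3% of them, giving 1.2 × 0.03 = 0.036 misses per instruction; the split caches give 1 × 0.02 + 0.2 × 0.08 = 0.036 misses per instruction. The miss counts come out the same, but the split caches can serve an instruction fetch and a data access in parallel, while the unified cache creates a structural hazard.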
Cache Parameters and Tradeoffs • If you are designing a cache, what choices do you have and what are their tradeoffs?
Cache Comparisons • (table: L1 i-Cache vs. L1 d-Cache vs. unified L2 cache)
Summary: Classifying Caches • Where can a block be placed? • Direct mapped, Set/Fully associative • How is a block found? • Direct mapped: by index • Set associative: by index and search • Fully associative: by search • What happens on a write access? • Write-back or Write-through • Which block should be replaced? • Random • LRU (Least Recently Used)