This deck covers the fundamentals of cache memory: how caches exploit spatial and temporal locality, how direct-mapped caches locate data, cache write strategies, DRAM configurations, performance analysis, and associative and multilevel caches in the memory hierarchy.
Caching – Chapter 7 • Basics (7.1, 7.2) • Cache Writes (7.2, pp. 483-485) • DRAM configurations (7.2, pp. 487-491) • Performance (7.3) • Associative caches (7.3, pp. 496-504) • Multilevel caches (7.3, pp. 505-510)
Memory Hierarchy: CPU → L1 → L2 Cache → DRAM • Size: smallest (L1) → largest (DRAM) • Cost/bit: highest → lowest • Speed: fastest → slowest • Tech: L1 and L2 are SRAM (logic); main memory is DRAM (capacitors)
Program Characteristics • Temporal Locality - an item accessed recently is likely to be accessed again soon • Spatial Locality - items near a recently accessed item are likely to be accessed soon
Locality • Programs tend to exhibit spatial & temporal locality. Just a fact of life. • How can we use this knowledge of program behavior to design a cache?
What does that mean?!? • 1. Architect: Design a cache that exploits spatial & temporal locality • 2. Programmer: When you program, write code that exhibits good spatial & temporal locality • Java - difficult to do • C - more control over data placement • Note: Caches exploit locality. Programs have varying degrees of locality. Caches do not have locality!
Exploiting Spatial & Temporal Locality • 1. Cache: on a miss, fetch a whole multi-word block (spatial) and keep recently used blocks resident (temporal) • 2. Program: access data sequentially and reuse data while it is still cached (see the C sketch below)
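As a concrete illustration, here is a minimal C sketch of programming for locality (the array size N and the use of doubles are arbitrary choices, not from the slides). Both functions compute the same sum, but the row-major version walks memory sequentially, so it uses every byte of each cache block it fetches:

```c
#include <stdio.h>

#define N 1024
static double a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
   addresses, so each fetched cache block is fully used
   (spatial locality). */
double sum_row_major(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: consecutive iterations jump N*8 bytes,
   so each access may land in a different block and most of every
   fetched block goes unused. */
double sum_col_major(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```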
What do we put in cache? • Temporal Locality - the item that was just accessed • Spatial Locality - the item's neighbors: the rest of its block
Where do we put data in cache? • Searching the whole cache takes time & power • Direct-mapped: limit each piece of data to one possible position • Search is quick and simple
Direct-Mapped • One block (line); block size (line size) = 2 words, word size = 4 bytes • [Figure: memory addresses 000000, 000100, 010000, 010100, 100000, 100100, 110000, 110100 map into a 4-entry cache with indices 00-11; each memory block maps to exactly one cache index]
How do we find data in cache? Direct-mapped, block (line) size = 2 words (8 bytes) • Byte address 0b100100100 • [Figure: cache with indices 00-11 and a data column] • Where do we look in the cache? The index bits say: entry 00 • How do we know if it is there? Compare the stored tag with the address tag (1001) and check the valid bit
Example 2 - Block size = 2 words • [Figure: direct-mapped cache, indices 00-11, all valid bits 0; address 0b1010001 split into tag 10, index 10, block offset 0, byte offset 01]
Definitions • Byte Offset: Which byte within a word? • Block Offset: Which word within a block? • Set: Group of blocks checked each access • Index: Which set within the cache? • Tag: Is this the right one? (identifies which address is actually resident)
Definitions • Block (Line): the unit of data transferred between levels • Hit: the requested data is found at this level • Miss: the requested data is not at this level • Hit time / Access time: time to check this level and return data on a hit • Miss Penalty: additional time to fetch the block from the level below on a miss
Example 1 – Direct-Mapped, Block size = 2 words • Cache starts empty (all valid bits 0) • Reference Stream: Hit/Miss • 0b1001000 Miss • 0b0010100 Miss • 0b0111000 Miss • 0b0010000 Hit (same block as 0b0010100) • 0b0010100 Hit • 0b0100100 Miss • Miss Rate: 4/6 ≈ 67%
Implementation • Byte address 0b100100100 is split into tag, index, block offset, and byte offset • [Figure: the index selects one cache entry (00-11); a comparator (=) checks the stored tag against the address tag; Hit? = valid AND tags equal; a MUX uses the block offset to select the requested word from the block's data]
Example 2 • You are implementing a 64-Kbyte cache • The block size (line size) is 16 bytes • Each word is 4 bytes • Block offset: 16 bytes = 4 words, so 2 bits • Index: 64 Kbytes / 16 bytes = 4096 blocks, so 12 bits • Tag: with 32-bit byte addresses, 32 - 12 - 2 - 2 (byte offset) = 16 bits (see the sketch below)
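A small C sketch of the bit arithmetic, assuming 32-bit byte addresses (an assumption; the slide does not state the address size):

```c
#include <stdio.h>

/* log2 for exact powers of two */
static unsigned lg(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    unsigned cache_bytes = 64 * 1024;  /* 64-Kbyte cache                */
    unsigned block_bytes = 16;         /* 16-byte blocks                */
    unsigned word_bytes  = 4;          /* 4-byte words                  */
    unsigned addr_bits   = 32;         /* assumed 32-bit byte addresses */

    unsigned byte_offset  = lg(word_bytes);                /*  2 bits */
    unsigned block_offset = lg(block_bytes / word_bytes);  /*  2 bits */
    unsigned index_bits   = lg(cache_bytes / block_bytes); /* 12 bits */
    unsigned tag_bits = addr_bits - index_bits - block_offset - byte_offset; /* 16 */

    printf("byte offset %u, block offset %u, index %u, tag %u\n",
           byte_offset, block_offset, index_bits, tag_bits);
    return 0;
}
```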
How caches work • Classic abstraction • Each level of the hierarchy has no knowledge of the configuration of the levels below it • [Figure: from L1's perspective, everything below it (L2 Cache + DRAM) is simply "Memory"; from L2's perspective, DRAM is simply "Memory"]
Memory operation at any level • 1. Cache receives request (address) • 2. Look for the item in the cache • 3. Hit: return the data • 4. Miss: request the block from memory, receive the data, update the cache, and return the data
• 1. Cache receives request • 2. Look for the item in the cache • Hit: return data (total time = Access Time) • Miss: request memory, receive the block, update the cache, return data (extra time = Miss Penalty)
Performance • Hit: latency = access time • Miss: latency = access time + miss penalty • Goal: minimize misses!!! (see the AMAT sketch below)
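The hit/miss latencies above combine into the usual average memory access time (AMAT) formula; a tiny C sketch, with illustrative numbers that are not from the slides:

```c
#include <stdio.h>

/* Average latency per access under the model above:
   every access pays the access time; misses also pay the penalty. */
static double amat(double access_time, double miss_rate, double miss_penalty) {
    return access_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Illustrative numbers: 1-cycle hit, 5% miss rate,
       40-cycle miss penalty -> 3 cycles on average. */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 40.0));
    return 0;
}
```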
Cache Writes • There are multiple copies of the data lying around • L1 cache, L2 cache, DRAM • Do we write to all of them? • Do we wait for the write to complete before the processor can proceed?
Do we write to all of them? • Write-through: write to this cache AND the level below on every store • Write-back: write only to this cache; update the level below when the block is evicted • Write-back creates inconsistent data - different values for the same item in cache and DRAM • Such a modified block is referred to as dirty
Write-Through vs. Write-Back • sw $3, 0($5) • [Figure: write-through propagates the store through L1, L2, and DRAM; write-back updates only L1 and marks the block dirty]
Write-through vs Write-back • Which performs the write faster? Write-back (only the L1 copy is updated) • Which has faster evictions from a cache? Write-through (blocks are never dirty, so nothing must be written on eviction) • Which causes more bus traffic? Write-through (every store goes to the level below)
Does processor wait for write? • Write buffer: the store is queued and the processor proceeds while the buffer drains to memory • Any loads must check the write buffer in parallel with the cache access • Buffer values are more recent than cache values (see the sketch below)
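To make the two write policies concrete, here is a toy C model of a single cache line (a sketch only; real caches track tags per set, write whole blocks, and check the valid bit on stores). It shows where the traffic goes under each policy and why write-back needs a dirty bit:

```c
#include <stdbool.h>
#include <stdint.h>

struct line {
    bool     valid;
    bool     dirty;   /* used only by write-back */
    uint32_t tag;
    uint32_t data;
};

/* Write-through: update the line AND the next level on every store. */
void store_write_through(struct line *l, uint32_t v, uint32_t *next_level) {
    l->data = v;
    *next_level = v;           /* traffic on every store */
}

/* Write-back: update only the line and mark it dirty; the next
   level sees the value only when the line is evicted. */
void store_write_back(struct line *l, uint32_t v) {
    l->data = v;
    l->dirty = true;           /* line and memory now disagree */
}

void evict_write_back(struct line *l, uint32_t *next_level) {
    if (l->valid && l->dirty)
        *next_level = l->data; /* the slow part of a WB eviction */
    l->valid = false;
    l->dirty = false;
}

int main(void) {
    struct line l = { true, false, 0, 0 };
    uint32_t mem = 0;
    store_write_back(&l, 42);
    evict_write_back(&l, &mem);      /* mem becomes 42 on eviction */
    store_write_through(&l, 7, &mem);
    return mem == 7 ? 0 : 1;
}
```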
Challenge • DRAM is designed for density (capacity), not speed • DRAM is slower than the bus • We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow • Widening anything increases the cost by quite a bit
Narrow Configuration • [Figure: CPU - Cache - Bus - one 1-word-wide DRAM] • Given: 1 clock cycle request, 15 cycles/word DRAM latency, 1 cycle/word bus latency • For an 8-word cache block, miss penalty = 1 + 8×15 + 8×1 = 129 cycles
Wide Configuration • [Figure: CPU - Cache - Bus - one 2-word-wide DRAM] • Given: 1 clock cycle request, 15 cycles/2 words DRAM latency, 1 cycle/2 words bus latency • For an 8-word cache block, miss penalty = 1 + 4×15 + 4×1 = 65 cycles
Interleaved Configuration • [Figure: CPU - Cache - Bus - two interleaved 1-word DRAM banks] • Given: 1 clock cycle request, 15 cycles/word DRAM latency, 1 cycle/word bus latency • For an 8-word cache block, the two banks access in parallel: with one common accounting, miss penalty = 1 + 4×15 + 8×1 = 69 cycles (see the sketch below)
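The three miss-penalty calculations, collected in one C sketch. The interleaved figure uses one common accounting (both banks start each 15-cycle access together, and every word still takes one bus cycle); other overlap assumptions give slightly different totals:

```c
#include <stdio.h>

int main(void) {
    int request = 1;  /* cycles to send the request */
    int words   = 8;  /* words per cache block      */

    int narrow      = request + words * 15 + words * 1;           /* 129 */
    int wide        = request + (words/2) * 15 + (words/2) * 1;   /*  65 */
    int interleaved = request + (words/2) * 15 + words * 1;       /*  69 */

    printf("narrow %d, wide %d, interleaved %d cycles\n",
           narrow, wide, interleaved);
    return 0;
}
```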
Recent DRAM trends • Fewer, bigger DRAMs • New bus protocols (RAMBUS) • Small DRAM caches (page mode) • SDRAM (synchronous DRAM): one request plus a burst length yields several consecutive responses
Performance • Execution Time = (CPU cycles + Memory-stall cycles) × clock cycle time • Memory-stall cycles = (accesses/program) × (misses/access) × (cycles/miss) = (memory accesses/program) × miss rate × miss penalty • Equivalently: (instructions/program) × (misses/instruction) × miss penalty
Example 1 • instruction cache miss rate: 2% • data cache miss rate: 3% • miss penalty: 50 cycles • ld/st instructions are 25% of instructions • CPI with perfect cache is 2.3 • How much faster is the computer with a perfect cache?
Example 2 • Double the clock rate from Example 1. What is the ideal speedup when taking into account the memory system? • How long is the miss penalty now? 100 cycles (same absolute time, twice as many cycles) • So CPI = 2.3 + 0.0275 × 100 = 5.05, and the speedup is 2 × 3.675 / 5.05 ≈ 1.46, well short of the ideal 2×
Example 1 Work • misses/instruction = (I-accesses/instr × Imr) + (D-accesses/instr × Dmr) = 1 × 0.02 + 0.25 × 0.03 = 0.0275 • Memory-stall cycles/instruction = 0.0275 × 50 = 1.375 • CPI = 2.3 + 1.375 = 3.675, so the perfect-cache machine is 3.675 / 2.3 ≈ 1.6 times faster (worked below)
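The same arithmetic in a short C sketch:

```c
#include <stdio.h>

/* Example 1 worked out with the stall-cycle formula above. */
int main(void) {
    double imr = 0.02, dmr = 0.03;  /* I- and D-cache miss rates   */
    double penalty = 50.0;          /* cycles per miss             */
    double ldst = 0.25;             /* data accesses / instruction */
    double base_cpi = 2.3;

    /* One I-fetch plus 0.25 D-accesses per instruction. */
    double mpi = 1.0 * imr + ldst * dmr;  /* 0.0275 misses/instr */
    double stall_cpi = mpi * penalty;     /* 1.375 cycles/instr  */
    double cpi = base_cpi + stall_cpi;    /* 3.675               */

    printf("CPI = %.3f, perfect-cache speedup = %.2f\n",
           cpi, cpi / base_cpi);          /* ~1.60 */
    return 0;
}
```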
Direct-mapped Cache, Block size = 2 words, word size = 4 bytes • Address split: tag (3 bits), index (2 bits), block offset (1 bit), byte offset (2 bits) • [Initial contents: indices 00-10 valid with tags 101, 010, 000; index 11 invalid] • Reference Stream: Hit/Miss • 0b00111000 Miss • 0b00011100 Miss (same index 11, different tag - evicts the previous block) • 0b00111000 Miss (the block was just evicted!) • 0b00011000 Hit
Direct-Mapped Caches • PROBLEM: Conflicting addresses cause high miss rates - size was NOT the problem!!! • SOLUTION: Relax the direct mapping
Cache Configurations • Direct-Mapped: each address maps to exactly one block (indices 00-11) • 2-way Associative: each set has two blocks (sets 0 and 1) • Fully Associative: all addresses map to the same set - a block can go anywhere
2-way Set Associative Cache, Block size = 2 words, word size = 4 bytes • Address split: tag (4 bits), index (1 bit), block offset, byte offset • Reference Stream: Hit/Miss • 0b00111000 Miss • 0b00011100 Miss (placed in the other way of the same set) • 0b00111000 Hit (both blocks now coexist!) • 0b00011000 Hit
Implementation • Byte address 0b100100100 is split into tag, index, block offset, and byte offset • [Figure: the index selects one set (0 or 1); two comparators check both ways' tags in parallel; Hit? if either valid way matches; a final MUX selects the data from the matching way, and block-offset MUXes select the word]
Performance Implications • Increasing associativity increases hit rate (fewer conflict misses) • Increasing associativity increases access time (more comparators and muxing) • Increasing associativity leaves the miss penalty essentially unchanged
Example 2-way associative, Block size = 2 words • 2 sets, 2 ways, cache initially empty • Reference Stream: Hit/Miss • 0b1001000 Miss • 0b0011100 Miss • 0b1001000 Hit • 0b0111000 Miss (the set is full - a block must be replaced) • Miss Rate: 3/4 = 75%
Which block to replace? • 0b1001000 (tag 100) • 0b0011100 (tag 001) • Both live in the same set; one must be evicted to make room
Replacement Algorithms • LRU & FIFO are conceptually simple, but hard to implement at high associativity • LRU & FIFO must be approximated when associativity is high • Random is sometimes better than approximated LRU/FIFO • Tradeoff between accuracy and implementation cost (see the LRU sketch below)
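A minimal C sketch of a 2-way set-associative lookup with true LRU, using the geometry of the running example (2 sets, 8-byte blocks; tags only, no data stored). Replaying the reference stream from the 2-way example prints miss, miss, hit, miss:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS 2
#define WAYS 2

struct way { bool valid; uint32_t tag; };
struct set { struct way w[WAYS]; int lru; };  /* lru = way to evict next */

static struct set cache[SETS];

bool access_addr(uint32_t addr) {
    uint32_t index = (addr >> 3) & (SETS - 1); /* offset = 3 bits, index = 1 bit */
    uint32_t tag   = addr >> 4;                /* bits above offset + index      */
    struct set *s = &cache[index];

    for (int i = 0; i < WAYS; i++)
        if (s->w[i].valid && s->w[i].tag == tag) {
            s->lru = 1 - i;                    /* the other way is now LRU */
            return true;                       /* hit */
        }

    /* Miss: fill the LRU way, then mark the other way as the next victim. */
    int victim = s->lru;
    s->w[victim].valid = true;
    s->w[victim].tag   = tag;
    s->lru = 1 - victim;
    return false;
}

int main(void) {
    /* 0b1001000, 0b0011100, 0b1001000, 0b0111000 */
    uint32_t refs[] = { 0x48, 0x1C, 0x48, 0x38 };
    for (int i = 0; i < 4; i++)
        printf("ref %d: %s\n", i, access_addr(refs[i]) ? "hit" : "miss");
    return 0;
}
```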
L1 cache's perspective • [Figure: L1 sees everything below it (L2 Cache + DRAM) as "Memory"] • L1's miss penalty contains the access of L2, and possibly the access of DRAM!!!
Multi-level Caches • Base CPI 1.0, 500 MHz clock • Main memory: 100 cycles; L2: 10 cycles • L1 miss rate per instruction: 5% • With L2, only 2% of instructions go to DRAM • What is the speedup with the L2 cache? (There is a typo in the book for this example!)
Multi-level Caches • CPI = 1 + memory stalls / instruction • Without L2: CPI = 1 + 0.05 × 100 = 6.0 • With L2: CPI = 1 + 0.05 × 10 + 0.02 × 100 = 3.5 (every L1 miss pays the L2 access; the 2% that continue pay DRAM as well) • Speedup = 6.0 / 3.5 ≈ 1.7 (worked below)
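The multilevel arithmetic as a short C sketch (one common accounting, as above: every L1 miss pays the L2 access, and the misses that continue also pay the DRAM latency):

```c
#include <stdio.h>

int main(void) {
    double base_cpi = 1.0;
    double l1_miss = 0.05, l2_miss = 0.02;  /* misses per instruction */
    double l2_lat = 10.0, mem_lat = 100.0;  /* cycles                 */

    double cpi_no_l2 = base_cpi + l1_miss * mem_lat;                 /* 6.0 */
    double cpi_l2 = base_cpi + l1_miss * l2_lat + l2_miss * mem_lat; /* 3.5 */

    printf("CPI without L2 = %.1f, with L2 = %.1f, speedup = %.2f\n",
           cpi_no_l2, cpi_l2, cpi_no_l2 / cpi_l2);                   /* 1.71 */
    return 0;
}
```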
Summary • Direct-mapped • simple • fast access time • lower hit rate • Variable block size • still simple • nearly the same access time • better exploits spatial locality
Summary • Associative caches • increase the access time • increase the hit rate • associativity above 8 has little to no gain • Multi-level caches • increase the potential (worst-case) miss penalty • decrease the average miss penalty