Understand the different cache configurations and their impact on performance. Learn about write-through and write-back strategies, and their effects on data consistency and bus traffic.
Outline • Cache writes • DRAM configurations • Performance • Associative caches • Multi-level caches
Direct-mapped Cache • Block size = 4 words, word size = 4 bytes • Address fields, high to low: Tag | Index | Block Offset | Byte Offset
Initial cache contents:
Index  Valid  Tag  Data
00     1      01   M[64-79]
01     1      11   M[208-223]
10     1      00   M[32-47]
11     0      00   Not valid
Reference stream and Hit/Miss:
0b01001000  H  (index 00 holds tag 01)
0b00010100  M  (index 01 holds tag 11; the line is replaced by M[16-31], tag 00)
0b00111000  M  (index 11 is not valid; the line is filled with M[48-63], tag 00)
0b00010000  H  (index 01 now holds tag 00)
Final cache contents:
Index  Valid  Tag  Data
00     1      01   M[64-79]
01     1      00   M[16-31]
10     1      00   M[32-47]
11     1      00   M[48-63]
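The walkthrough above can be replayed in a few lines of Python. This is only a sketch under assumptions read off the slide: 8-bit addresses, 16-byte blocks (4 offset bits), and 4 cache lines (2 index bits). The names `split_address` and `simulate` are made up for illustration.

```python
# Direct-mapped cache sketch: 4 lines, 16-byte blocks, 8-bit addresses.
OFFSET_BITS = 4   # 4 words * 4 bytes = 16 bytes per block
INDEX_BITS = 2    # 4 cache lines

def split_address(addr):
    """Return (tag, index, offset) for an 8-bit address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def simulate(lines, references):
    """lines is a list of [valid, tag]; returns 'H' or 'M' per reference."""
    results = []
    for addr in references:
        tag, index, _ = split_address(addr)
        valid, stored_tag = lines[index]
        if valid and stored_tag == tag:
            results.append("H")
        else:
            results.append("M")
            lines[index] = [1, tag]  # fetch the block and replace the line
    return results

# Initial state from the slide: [valid, tag] for indices 00, 01, 10, 11.
initial = [[1, 0b01], [1, 0b11], [1, 0b00], [0, 0b00]]
stream = [0b01001000, 0b00010100, 0b00111000, 0b00010000]
outcome = simulate(initial, stream)
print(outcome)  # ['H', 'M', 'M', 'H']
```

The second reference misses because index 01 holds a different tag, and that miss is what turns the fourth reference into a hit.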
Cache Writes • There are multiple copies of the data lying around • L1 cache, L2 cache, DRAM • Do we write to all of them? • Do we wait for the write to complete before the processor can proceed?
Do we write to all of them? • Write-through - write to all levels of the hierarchy • Write-back - write to the lower level only when the cache line gets evicted from the cache • Write-back creates inconsistent data: different values for the same item in the cache and in DRAM • The newer copy, held in the highest level of cache, is referred to as dirty • If all copies match, they are clean • The old copy in the lower level is stale
Write-Through • sw $3, 0($5) • [Diagram: CPU, L1, L2 Cache, DRAM]
Write-Back • sw $3, 0($5) • [Diagram: CPU, L1, L2 Cache, DRAM]
Write-through vs Write-back • Which performs the write faster? Write-back: it only writes the L1 cache • Which has faster evictions from a cache? Write-through: no write is involved, just overwrite the tag • Which causes more bus traffic? Write-through: DRAM is written on every store, while write-back only writes on eviction
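To make the bus-traffic comparison concrete, here is a rough sketch that counts DRAM write events under each policy. `OneLineCache` and `count_writes` are hypothetical names, and the model is deliberately crude: a single cache line, repeated stores to it, then an eviction. It counts write events only, not bytes, so it ignores that a write-back transfers a whole block.

```python
class OneLineCache:
    """Hypothetical single-line cache used only to count DRAM writes."""

    def __init__(self, policy):
        self.policy = policy      # "write-through" or "write-back"
        self.dirty = False
        self.dram_writes = 0

    def store(self):
        if self.policy == "write-through":
            self.dram_writes += 1  # every store also goes to DRAM
        else:
            self.dirty = True      # only the cached copy is updated

    def evict(self):
        if self.policy == "write-back" and self.dirty:
            self.dram_writes += 1  # dirty line is written back once
        self.dirty = False

def count_writes(policy, stores_per_line, lines):
    """Fill, store to, and evict `lines` blocks in turn."""
    cache = OneLineCache(policy)
    for _ in range(lines):
        for _ in range(stores_per_line):
            cache.store()
        cache.evict()
    return cache.dram_writes

wt = count_writes("write-through", 10, 3)  # 30 DRAM writes
wb = count_writes("write-back", 10, 3)     # 3 DRAM writes
print(wt, wb)
```

The gap grows with the number of stores per block, which is why write-back pays off when stores have good locality.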
Does the processor wait for the write? • Write buffer - intermediate queue for pending writes • Any load must check the write buffer in parallel with the cache access • Buffer values are more recent than cache values
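A minimal sketch of why loads have to consult the write buffer. The `WriteBuffer` class and `load` function are illustrative assumptions, not a description of real hardware: the buffer is modeled as a FIFO of (address, value) pairs, the cache as a plain dictionary, and the "parallel" check as a simple priority rule.

```python
from collections import deque

class WriteBuffer:
    """Hypothetical FIFO of pending (address, value) writes."""

    def __init__(self):
        self.pending = deque()

    def push(self, addr, value):
        self.pending.append((addr, value))

    def lookup(self, addr):
        # Scan newest to oldest so the most recent pending write wins.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return None

    def drain_one(self, target):
        # Retire the oldest pending write into the level behind the buffer.
        if self.pending:
            addr, value = self.pending.popleft()
            target[addr] = value

def load(addr, buffer, cache):
    """Prefer a buffered value, since it is newer than the cached one."""
    buffered = buffer.lookup(addr)
    return buffered if buffered is not None else cache.get(addr)

cache = {0x40: 1}
buf = WriteBuffer()
buf.push(0x40, 7)                  # store is queued, cache still holds 1
before = load(0x40, buf, cache)    # 7, served from the buffer
buf.drain_one(cache)               # the write finally reaches the cache
after = load(0x40, buf, cache)     # 7, now served from the cache
print(before, after)
```

Without the buffer lookup, the first load would return the old value 1.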
Challenge • DRAM is designed for density, not speed • DRAM is slower than the bus • We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow. • Widening anything increases the cost by quite a bit.
Narrow Configuration • Given: • 1 clock cycle request • 15 cycles / word DRAM latency • 1 cycle / word bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • 1 cycle + 15 cycles/word * 8 words + 1 cycle/word * 8 words = 129 cycles • [Diagram: CPU, Cache, Bus, DRAM]
Wide Configuration • Given: • 1 clock cycle request • 15 cycles / 2 words DRAM latency • 1 cycle / 2 words bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • 1 cycle + 15 cycles/2 words * 8 words + 1 cycle/2 words * 8 words = 65 cycles • [Diagram: CPU, Cache, Bus, DRAM]
Interleaved Configuration • Given: • 1 clock cycle request • 15 cycles / word DRAM latency • 1 cycle / word bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • 1 cycle + 15 cycles/2 words * 8 words + 1 cycle/word * 8 words = 69 cycles • [Diagram: CPU, Cache, Bus, two DRAMs]
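The three miss-penalty calculations follow one pattern, sketched below. The function `miss_penalty` and its parameter names are assumptions made for illustration. It follows the slides' simplified model: DRAM accesses and bus transfers do not overlap each other, and interleaved banks are accessed in parallel.

```python
import math

def miss_penalty(block_words, request=1, dram_cycles=15, dram_width=1,
                 bus_cycles=1, bus_width=1, banks=1):
    """Cycles to fetch one block; widths are in words per access or transfer."""
    # Banks work in parallel, so each round of accesses yields
    # dram_width * banks words.
    dram_accesses = math.ceil(block_words / (dram_width * banks))
    bus_transfers = math.ceil(block_words / bus_width)
    return request + dram_accesses * dram_cycles + bus_transfers * bus_cycles

narrow = miss_penalty(8)                            # 1 + 8*15 + 8*1 = 129
wide = miss_penalty(8, dram_width=2, bus_width=2)   # 1 + 4*15 + 4*1 = 65
interleaved = miss_penalty(8, banks=2)              # 1 + 4*15 + 8*1 = 69
print(narrow, wide, interleaved)
```

Interleaving recovers most of the benefit of the wide configuration while keeping the narrow bus, which is the cheaper part to leave alone.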
Recent DRAM trends • Fewer, bigger DRAMs • New bus protocols (RAMBUS) • Small DRAM caches (page mode) • SDRAM (synchronous DRAM) • One request plus a length yields several consecutive responses
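Performance
\[ \text{Execution time} = (\text{CPU cycles} + \text{Memory-stall cycles}) \times \text{Clock cycle time} \]
\[ \text{Memory-stall cycles} = \frac{\text{accesses}}{\text{program}} \times \frac{\text{misses}}{\text{access}} \times \frac{\text{cycles}}{\text{miss}} = \frac{\text{memory accesses}}{\text{program}} \times \text{Miss rate} \times \text{Miss penalty} \]
\[ \text{Memory-stall cycles} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{misses}}{\text{instruction}} \times \frac{\text{cycles}}{\text{miss}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{misses}}{\text{instruction}} \times \text{Miss penalty} \]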
Example 1 • instruction cache miss rate: 2% • data cache miss rate: 3% • miss penalty: 50 cycles • ld/st instructions are 25% of instructions • CPI with perfect cache is 2.3 • How much faster is the computer with a perfect cache?
Example 1 • misses/instr = Iacc/instr * Imr + Dacc/instr * Dmr = 1 * .02 + .25 * .03 = .02 + .0075 = .0275 • Memory cycles = I * .0275 * 50 = I * 1.375 • ExecT = (CPU CPI * I + MemCycles) * Clk = (2.3 * I + 1.375 * I) * Clk = 3.675 * I * Clk • Speedup = (3.675 * I * Clk) / (2.3 * I * Clk) ≈ 1.6
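The arithmetic in Example 1 can be checked with a short script. The function name `cpi_with_stalls` is hypothetical. As in the example, it assumes one instruction fetch per instruction, so the instruction miss rate applies to every instruction and the data miss rate only to the load/store fraction.

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, ldst_fraction,
                    miss_penalty):
    """Effective CPI once memory-stall cycles are added to the base CPI."""
    # One instruction access per instruction, plus a data access
    # for the load/store fraction.
    misses_per_instr = 1.0 * i_miss_rate + ldst_fraction * d_miss_rate
    return base_cpi + misses_per_instr * miss_penalty

perfect_cpi = 2.3
real_cpi = cpi_with_stalls(perfect_cpi, 0.02, 0.03, 0.25, 50)
speedup = real_cpi / perfect_cpi  # instruction count and clock cancel out

print(round(real_cpi, 3))  # 3.675
print(round(speedup, 2))   # 1.6
```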
Example 2 • Double the clock rate from Example 1. What is the ideal speedup when taking into account the memory system? • How long is the miss penalty now? 100 cycles (the memory latency is unchanged in time, so it spans twice as many of the shorter cycles) • Memory cycles = I * .0275 * 100 = I * 2.75
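One way to carry the question through to a number is sketched below. It rests on assumptions not spelled out on the slide: the base CPI stays at 2.3, the miss rates are unchanged, and the memory latency is fixed in absolute time so the penalty doubles to 100 cycles. The function name `time_per_instr` is made up, and time is measured in units of the original clock period.

```python
def time_per_instr(base_cpi, misses_per_instr, miss_penalty_cycles,
                   clock_period):
    """Average time per instruction, including memory stalls."""
    return (base_cpi + misses_per_instr * miss_penalty_cycles) * clock_period

# Original machine: 50-cycle penalty, clock period of 1 unit.
slow = time_per_instr(2.3, 0.0275, 50, 1.0)    # 3.675 units
# Doubled clock rate: 100-cycle penalty, clock period of 0.5 units.
fast = time_per_instr(2.3, 0.0275, 100, 0.5)   # (2.3 + 2.75) * 0.5 = 2.525 units

speedup = slow / fast
print(round(speedup, 2))  # 1.46
```

Under these assumptions, doubling the clock rate gives a speedup of about 1.46 rather than 2, because the memory-stall time does not shrink with the clock.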