Cache Configurations and Performance: Write-Through vs Write-Back

Outline • Cache writes • DRAM configurations • Performance • Associative caches • Multi-level caches

Direct-mapped Cache • Block size = 4 words, word size = 4 bytes • Address fields, high to low bits: Tag (2 bits), Index (2 bits), Block Offset (2 bits), Byte Offset (2 bits)

Initial cache contents:
Index  Valid  Tag  Data
00     1      01   M[64-79]
01     1      11   M[208-223]
10     1      00   M[32-47]
11     0      00   Not Valid

Reference Stream:
Address      Hit/Miss
0b01001000   H
0b00010100   M
0b00111000   M
0b00010000   H

Cache contents after the reference stream:
Index  Valid  Tag  Data
00     1      01   M[64-79]
01     1      00   M[16-31]
10     1      00   M[32-47]
11     1      00   M[48-63]
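The hit/miss sequence above can be checked with a small simulation. This is a minimal sketch, assuming 8-bit addresses split as 2 tag bits, 2 index bits, and 4 offset bits (which follows from 16-byte blocks and 4 cache lines); the names `split_address`, `simulate`, and `block_range` are made up for illustration.

```python
BLOCK_BYTES = 16  # 4 words * 4 bytes per word
NUM_LINES = 4     # 2 index bits


def split_address(addr):
    """Split an 8-bit address into (tag, index, block_offset, byte_offset)."""
    byte_offset = addr & 0b11
    block_offset = (addr >> 2) & 0b11
    index = (addr >> 4) & 0b11
    tag = addr >> 6
    return tag, index, block_offset, byte_offset


def simulate(cache, addresses):
    """Run the reference stream; cache is a list of [valid, tag], updated in place."""
    results = []
    for addr in addresses:
        tag, index, _, _ = split_address(addr)
        valid, stored_tag = cache[index]
        if valid and stored_tag == tag:
            results.append("H")
        else:
            results.append("M")
            cache[index] = [1, tag]  # fetch the block and replace the line
    return results


def block_range(tag, index):
    """First and last byte address of the block identified by (tag, index)."""
    base = (tag * NUM_LINES + index) * BLOCK_BYTES
    return base, base + BLOCK_BYTES - 1


# Initial state from the slide: one [valid, tag] pair per index 00..11
cache = [[1, 0b01], [1, 0b11], [1, 0b00], [0, 0b00]]
stream = [0b01001000, 0b00010100, 0b00111000, 0b00010000]
outcome = simulate(cache, stream)
print(outcome)  # ['H', 'M', 'M', 'H']
print([block_range(tag, i) for i, (valid, tag) in enumerate(cache) if valid])
```

The second access misses because index 01 holds tag 11 (M[208-223]); the last access hits only because that miss brought in M[16-31].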

Cache Writes • There are multiple copies of the data lying around • L1 cache, L2 cache, DRAM • Do we write to all of them? • Do we wait for the write to complete before the processor can proceed?

Do we write to all of them? • Write-through - write to all levels of the hierarchy • Write-back - write to the lower level only when the cache line is evicted from the cache • Write-back creates inconsistent data: different values for the same item in the cache and in DRAM • A line in the highest cache level that holds newer data than the lower levels is referred to as dirty • If all copies match, the line is clean • The old data in the lower levels is stale

Write-Through • sw $3, 0($5) • Diagram: CPU, L1, L2 Cache, DRAM

Write-Back • sw $3, 0($5) • Diagram: CPU, L1, L2 Cache, DRAM

Write-through vs Write-back • Which performs the write faster? Write-back: it only writes the L1 cache • Which has faster evictions from a cache? Write-through: no write is involved, just overwrite the tag • Which causes more bus traffic? Write-through: DRAM is written on every store, while write-back only writes on eviction
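The bus-traffic difference can be made concrete by counting lower-level writes for a stream of stores. This is a rough sketch, assuming a single direct-mapped cache with write-allocate and block-granularity stores (neither is stated on the slides); `lower_level_writes` is a hypothetical helper, not part of any library.

```python
def lower_level_writes(store_blocks, num_lines, policy):
    """Count writes sent to the next level of the hierarchy.

    store_blocks: block numbers written by the CPU, in order.
    policy: "write-through" or "write-back".
    Assumes a direct-mapped, write-allocate cache (an assumption of this sketch).
    """
    cache = {}  # index -> (block number, dirty flag)
    writes = 0
    for block in store_blocks:
        index = block % num_lines
        resident = cache.get(index)
        # Write-back pays only when a dirty line is pushed out by another block.
        if resident is not None and resident[0] != block and resident[1]:
            writes += 1
        if policy == "write-through":
            writes += 1                    # every store also goes to the lower level
            cache[index] = (block, False)  # line never becomes dirty
        else:
            cache[index] = (block, True)   # only the cache is updated
    return writes


stores = [0, 0, 0, 0, 4, 4, 1, 1]
print(lower_level_writes(stores, 4, "write-through"))  # 8
print(lower_level_writes(stores, 4, "write-back"))     # 1
```

In the write-back run, the single write is the eviction of dirty block 0 by block 4; the dirty lines still resident at the end have not been written yet.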

Does the processor wait for the write? • Write buffer - intermediate queue for pending writes • Any load must check the write buffer in parallel with the cache access • Buffer values are more recent than cache values
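The write-buffer rule can be sketched in software as follows. This is an illustrative model only: hardware performs the buffer check in parallel with the cache lookup, while the sketch does it sequentially, and the names `WriteBuffer`, `lookup`, `drain_one`, and `load` are assumptions made for the example.

```python
from collections import deque

_MISSING = object()  # sentinel so that a buffered value of None is still a hit


class WriteBuffer:
    """Queue of pending writes, oldest first."""

    def __init__(self):
        self.pending = deque()

    def store(self, address, value):
        # The processor continues as soon as the write is queued.
        self.pending.append((address, value))

    def lookup(self, address):
        # Search newest to oldest: the latest pending write wins.
        for addr, value in reversed(self.pending):
            if addr == address:
                return value
        return _MISSING

    def drain_one(self, cache):
        # Retire the oldest pending write into the cache.
        if self.pending:
            addr, value = self.pending.popleft()
            cache[addr] = value


def load(address, buffer, cache):
    """Buffer values are more recent than cache values, so they take priority."""
    buffered = buffer.lookup(address)
    if buffered is not _MISSING:
        return buffered
    return cache.get(address)


cache = {0x10: 3}
buf = WriteBuffer()
buf.store(0x10, 7)
print(load(0x10, buf, cache))  # 7, the cache still holds the older value 3
```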

Outline • Cache writes • DRAM configurations • Performance • Associative caches

Challenge • DRAM is designed for density, not speed • DRAM is slower than the bus • We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow. • Widening anything increases the cost by quite a bit.

Narrow Configuration • Diagram: CPU, Cache, Bus, DRAM • Given: • 1 clock cycle request • 15 cycles / word DRAM latency • 1 cycle / word bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • 1 cycle + 15 cycles/word * 8 words + 1 cycle/word * 8 words = 129 cycles

Wide Configuration • Diagram: CPU, Cache, Bus, DRAM • Given: • 1 clock cycle request • 15 cycles / 2 words DRAM latency • 1 cycle / 2 words bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • 1 cycle + 15 cycles/2 words * 8 words + 1 cycle/2 words * 8 words = 65 cycles

Interleaved Configuration • Diagram: CPU, Cache, Bus, two DRAMs • Given: • 1 clock cycle request • 15 cycles / word DRAM latency • 1 cycle / word bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • The two DRAMs are accessed in parallel, so each 15-cycle access delivers 2 words; the bus still moves 1 word per cycle • 1 cycle + 15 cycles/2 words * 8 words + 1 cycle/word * 8 words = 69 cycles
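The three miss-penalty calculations follow one pattern, sketched below. The function `miss_penalty` and its parameter names are made up for illustration; the interleaved case is modeled, as an assumption, as two words per DRAM access with a one-word bus.

```python
import math


def miss_penalty(block_words, dram_words_per_access, bus_words_per_transfer,
                 request_cycles=1, dram_cycles=15, bus_cycles=1):
    """Cycles to fetch one block: request + DRAM accesses + bus transfers."""
    dram_accesses = math.ceil(block_words / dram_words_per_access)
    bus_transfers = math.ceil(block_words / bus_words_per_transfer)
    return request_cycles + dram_cycles * dram_accesses + bus_cycles * bus_transfers


narrow = miss_penalty(8, dram_words_per_access=1, bus_words_per_transfer=1)
wide = miss_penalty(8, dram_words_per_access=2, bus_words_per_transfer=2)
interleaved = miss_penalty(8, dram_words_per_access=2, bus_words_per_transfer=1)
print(narrow, wide, interleaved)  # 129 65 69
```

Interleaving recovers most of the benefit of the wide configuration (69 vs 65 cycles) without widening the bus.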

Recent DRAM trends • Fewer, bigger DRAMs • New bus protocols (RAMBUS) • Small DRAM caches (page mode) • SDRAM (synchronous DRAM) • One request with a length nets several consecutive responses

Outline • Cache writes • DRAM configurations • Performance • Associative caches

Performance • Execution time = (CPU cycles + Memory-stall cycles) * clock cycle time • Memory-stall cycles = (accesses / program) * (misses / access) * (cycles / miss) = (memory accesses / program) * miss rate * miss penalty • Equivalently, Memory-stall cycles = (instructions / program) * (misses / instruction) * (cycles / miss) = (instructions / program) * (misses / instruction) * miss penalty

Example 1 • instruction cache miss rate: 2% • data cache miss rate: 3% • miss penalty: 50 cycles • ld/st instructions are 25% of instructions • CPI with perfect cache is 2.3 • How much faster is the computer with a perfect cache?

Example 1 • misses/instr = Iacc/instr * Imr + Dacc/instr * Dmr = 1 * .02 + .25 * .03 = .02 + .0075 = .0275 • Memory cycles = I * .0275 * 50 = I * 1.375 • ExecT = (CPU CPI * I + MemCycles) * clk = (2.3 * I + 1.375 * I) * clk = 3.675 * I * clk • speedup = (3.675 * I * clk) / (2.3 * I * clk) = 1.6
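The arithmetic of Example 1 can be reproduced in a few lines. This is a sketch using the numbers from the example; `cpi_with_stalls` is a hypothetical helper name, and it assumes one instruction fetch per instruction.

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, mem_fraction, miss_penalty):
    """Effective CPI = base CPI + misses per instruction * miss penalty."""
    # One instruction fetch per instruction, plus mem_fraction data accesses.
    misses_per_instr = 1 * i_miss_rate + mem_fraction * d_miss_rate
    return base_cpi + misses_per_instr * miss_penalty


real_cpi = cpi_with_stalls(base_cpi=2.3, i_miss_rate=0.02, d_miss_rate=0.03,
                           mem_fraction=0.25, miss_penalty=50)
perfect_cpi = 2.3
# Instruction count and clock period cancel out in the ratio.
speedup_perfect = real_cpi / perfect_cpi
print(f"CPI with stalls = {real_cpi:.3f}, speedup = {speedup_perfect:.2f}")
```

Here the stall cycles (1.375 per instruction) are more than half of the base CPI, which is why the perfect cache is about 1.6 times faster.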

Example 2 • Double the clock rate from Example 1. What is the ideal speedup when taking into account the memory system? • How long is the miss penalty now? 100 cycles • Memory cycles = I * .0275 * 100 = I * 2.75
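The transcript stops at the memory-cycle count, so the remaining step is sketched here under stated assumptions: the base CPI stays at 2.3, the miss rates are unchanged, and the memory latency is fixed in absolute time, so the 50-cycle penalty becomes 100 of the shorter cycles. `time_per_instr` is a made-up helper, and the clock period is in arbitrary units.

```python
def time_per_instr(base_cpi, misses_per_instr, miss_penalty_cycles, clock_period):
    """Execution time per instruction = (CPU CPI + stall CPI) * clock period."""
    return (base_cpi + misses_per_instr * miss_penalty_cycles) * clock_period


MISSES_PER_INSTR = 0.0275  # from Example 1

# Original machine: 50-cycle miss penalty, clock period of 1 unit.
slow = time_per_instr(2.3, MISSES_PER_INSTR, 50, clock_period=1.0)
# Doubled clock rate: period halves, miss penalty doubles in cycles.
fast = time_per_instr(2.3, MISSES_PER_INSTR, 100, clock_period=0.5)

speedup_clock = slow / fast
print(f"speedup from doubling the clock = {speedup_clock:.2f}")
```

Under these assumptions the speedup is well below 2, because the memory-stall time per instruction does not shrink when the clock gets faster.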
