Understanding Cache Basics: Spatial & Temporal Locality Exploitation

Explore the fundamentals of cache memory, including cache writes, DRAM configurations, and performance optimization through spatial and temporal locality exploitation. Learn about associative and multilevel caches in the memory hierarchy, and how to design caches based on program behavior. Discover the impact of locality on cache design and the benefits of direct-mapped caching. Dive into cache search mechanisms and the definitions of cache-related terms. Gain insights into cache operations, memory hierarchy levels, and cache write strategies for efficient data handling.

Presentation Transcript


  1. Caching – Chapter 7 • Basics (7.1,7.2) • Cache Writes (7.2 - p 483-485) • DRAM configurations (7.2 – p 487-491) • Performance (7.3) • Associative caches (7.3 – p 496-504) • Multilevel caches (7.3 – p 505-510)

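  2. Memory Hierarchy • Levels, from the CPU outward: L1, L2 Cache, DRAM • Size: smallest (L1) to largest (DRAM) • Cost/bit: highest (L1) to lowest (DRAM) • Tech: SRAM (logic) for L1 and L2 Cache, DRAM (capacitors) for main memory • Speed: fastest (L1) to slowest (DRAM)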

  3. Program Characteristics • Temporal Locality • Spatial Locality

  4. Locality • Programs tend to exhibit spatial & temporal locality. Just a fact of life. • How can we use this knowledge of program behavior to design a cache?

  5. What does that mean?!? • 1. Architect: Design cache that ______________ spatial & temporal locality • 2. Programmer: When you program, __________ ________________________________ locality • Java - difficult to do • C - more control over data placement • Note: Caches exploit locality. Programs have varying degrees of locality. Caches do not have locality!

  6. Exploiting Spatial & Temporal Locality • 1. Cache: • 2. Program:

  7. What do we put in cache? • Temporal Locality • Spatial Locality

  8. Where do we put data in cache? • Searching whole cache takes time & power • Direct-mapped • Limit each piece of data to one possible position • Search is quick and simple

  9. Direct-Mapped • One block (line) • blocksize (linesize) = 2 words • wordsize = 4 bytes • [Figure: memory locations labelled 000000, 000100, 010000, 010100, 100000, 100100, 110000, 110100 and a cache with indices 00, 01, 10, 11]

  10. How do we find data in cache? • Direct-mapped • Block (line) size = 2 words (8 bytes) • Byte address: 0b100100100 • [Figure: cache with Index and Data columns, indices 00, 01, 10, 11] • Where do we look in the cache? • How do we know if it is there?

  11. Example 2 • Direct-mapped cache, block size = 2 words • Address: 0b1010001 • Address fields: Tag, Index, Block Offset, Byte Offset • Cache (Valid, Tag, Data): indices 00, 01, 10, 11, all with Valid = 0
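A minimal sketch of how the address in this example could be split into its fields. The field widths are assumptions read off the slide: 4-byte words (2 byte-offset bits), 2-word blocks (1 block-offset bit), and 4 cache lines (2 index bits). The function name `split_address` is hypothetical.

```python
def split_address(addr, index_bits=2, block_offset_bits=1, byte_offset_bits=2):
    """Split a byte address into (tag, index, block_offset, byte_offset)."""
    # Peel fields off the low end of the address, one at a time.
    byte_offset = addr & ((1 << byte_offset_bits) - 1)
    addr >>= byte_offset_bits
    block_offset = addr & ((1 << block_offset_bits) - 1)
    addr >>= block_offset_bits
    index = addr & ((1 << index_bits) - 1)
    tag = addr >> index_bits  # whatever is left is the tag
    return tag, index, block_offset, byte_offset


tag, index, block_offset, byte_offset = split_address(0b1010001)
print(f"tag={tag:02b} index={index:02b} "
      f"block_offset={block_offset:b} byte_offset={byte_offset:02b}")
```

Under these assumptions the address 0b1010001 lands in the line at index 10 with tag 10.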

  12. Definitions • Byte Offset: Which _____ within _____? • Block Offset: Which _____ within ______? • Set: Group of ______ checked each access • Index: Which ______ within cache? • Tag: Is this the right one?

  13. Definitions • Block (Line) • Hit • Miss • Hit time / Access time • Miss Penalty

  14. Example 1 • Direct-mapped cache, block size = 2 words • Reference stream (mark each Hit/Miss): 0b1001000, 0b0010100, 0b0111000, 0b0010000, 0b0010100, 0b0100100 • Address fields: Tag, Index, Block Offset, Byte Offset • Cache (Valid, Tag, Data): indices 00, 01, 10, 11, all initially Valid = 0 • Miss Rate:
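The reference stream in Example 1 can be checked with a small simulation. This is a sketch under the slide's stated setup (4 lines, 8-byte blocks, cache initially empty); `simulate_direct_mapped` is a hypothetical helper, not something from the slides.

```python
def simulate_direct_mapped(addresses, num_lines=4, block_bytes=8):
    """Return a list of 'H'/'M' outcomes for a direct-mapped cache."""
    lines = [None] * num_lines          # stored tag per line; None means invalid
    results = []
    for addr in addresses:
        block = addr // block_bytes     # drop byte offset and block offset
        index = block % num_lines       # which line to look in
        tag = block // num_lines        # identifies which block is stored there
        if lines[index] == tag:
            results.append("H")
        else:
            results.append("M")
            lines[index] = tag          # fetch the block, replacing the old one
    return results


stream = [0b1001000, 0b0010100, 0b0111000, 0b0010000, 0b0010100, 0b0100100]
outcome = simulate_direct_mapped(stream)
miss_rate = outcome.count("M") / len(outcome)
print(outcome, f"miss rate = {miss_rate:.2f}")
```

The two hits come from spatial locality (0b0010000 shares a block with 0b0010100) and temporal locality (0b0010100 is reused).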

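  15. Implementation • Byte address: 0b100100100 • Address fields: Tag, Index, Block offset, Byte Offset • [Figure: cache array (Valid, Tag, Data) with indices 00, 01, 10, 11, a tag comparator (=) driving Hit?, and a MUX driving Data]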

  16. Example 2 • You are implementing a 64-Kbyte cache • The block size (line size) is 16 bytes. • Each word is 4 bytes • How many bits is the block offset? • How many bits is the index? • How many bits is the tag?
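One way to work this example, as a sketch. The slide does not give the address width or the associativity, so the code assumes 32-bit byte addresses and a direct-mapped cache; both are assumptions, and `field_widths` is a hypothetical helper. Following the definitions in this deck, the block offset selects a word within a block and the byte offset selects a byte within a word.

```python
def log2_exact(n):
    """Integer log2 for a power of two."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


def field_widths(cache_bytes, block_bytes, word_bytes=4, address_bits=32):
    """Bit widths of each address field for a direct-mapped cache."""
    byte_offset = log2_exact(word_bytes)                  # byte within a word
    block_offset = log2_exact(block_bytes // word_bytes)  # word within a block
    index = log2_exact(cache_bytes // block_bytes)        # one line per block
    tag = address_bits - index - block_offset - byte_offset
    return {"byte_offset": byte_offset, "block_offset": block_offset,
            "index": index, "tag": tag}


widths = field_widths(cache_bytes=64 * 1024, block_bytes=16)
print(widths)
```

With a different address width only the tag changes; the offset and index widths depend on the cache geometry alone.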

  17. How caches work • Classic abstraction • Each level of hierarchy has no knowledge of the configuration of lower level • [Figure: L1 cache's perspective: Me (L1), Memory (L2 Cache and DRAM). L2 cache's perspective: Me (L2 Cache), Memory (DRAM)]

  18. Memory operation at any level • 1. Cache receives request • 2. Look for item in cache • Hit - return data • Miss - request memory, receive data, update cache, return data • [Figure: numbered steps 1 to 5 between Me (L1) and Memory, with Address and Data paths]

  19. 1. Cache receives request 2. Look for item in cache • Hit - return data • Miss - request memory, receive block, update cache, return data • [Figure: Address and Data paths between Me (L1) and Memory, annotated with Access Time and Miss Penalty]

  20. Performance • Hit: latency = ____________ • Miss: latency = _____________________ • Goal: minimize misses!!!

  21. Cache Writes • There are multiple copies of the data lying around • L1 cache, L2 cache, DRAM • Do we write to all of them? • Do we wait for the write to complete before the processor can proceed?

  22. Do we write to all of them? • Write-through • Write-back • creates inconsistent data - different values for same item in cache and DRAM. • This data is referred to as dirty

  23. Write-Through vs. Write-Back • Example store: sw $3, 0($5) • [Figure: two copies of the hierarchy (CPU, L1, L2 Cache, DRAM), one per write policy]

  24. Write-through vs Write-back • Which performs the write faster? • Which has faster evictions from a cache? • Which causes more bus traffic?
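The bus-traffic question can be explored with a rough sketch that counts writes sent to the next level under each policy. Everything here is an assumption made for illustration: a single direct-mapped cache with 4 lines of 8 bytes, a store-only stream, write-allocate on a miss, and a hypothetical helper named `count_memory_writes`. Note that the units differ: write-through sends individual words, write-back sends whole blocks.

```python
def count_memory_writes(stores, num_lines=4, block_bytes=8):
    """Return (write_through_writes, write_back_writes) for a store-only stream."""
    write_through = len(stores)   # every store is also sent to memory
    write_back = 0
    lines = {}                    # index -> (tag, dirty)
    for addr in stores:
        block = addr // block_bytes
        index, tag = block % num_lines, block // num_lines
        resident = lines.get(index)
        if resident is not None and resident[0] != tag and resident[1]:
            write_back += 1       # evicting a dirty block: write it back now
        lines[index] = (tag, True)  # the stored-to block is now dirty
    # Dirty blocks still resident at the end have not been written back yet.
    return write_through, write_back


stores = [0, 4, 0, 4, 32, 0]      # addresses 0 and 32 conflict at index 0
print(count_memory_writes(stores))
```

Repeated stores to the same block cost nothing extra under write-back until the block is evicted, which is also why its evictions are slower.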

  25. Does processor wait for write? • Write buffer • Any loads must check write buffer in parallel with cache access. • Buffer values are more recent than cache values.

  26. Challenge • DRAM is designed for _______, not _____ • DRAM is _______ than the bus • We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow. • Widening anything increases the cost by quite a bit.

  27. Narrow Configuration • Given: • 1 clock cycle request • 15 cycles / word DRAM latency • 1 cycle / word bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • [Figure: CPU, Cache, Bus, DRAM]

  28. Wide Configuration • Given: • 1 clock cycle request • 15 cycles / 2 words DRAM latency • 1 cycle / 2 words bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • [Figure: CPU, Cache, Bus, DRAM]

  29. Interleaved Configuration • Given: • 1 clock cycle request • 15 cycles / word DRAM latency • 1 cycle / word bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • [Figure: CPU, Cache, Bus, DRAM, DRAM]
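The three configurations can be compared with one small formula, sketched below. The model is an assumption: the request, the DRAM accesses, and the bus transfers happen one after another with no overlap, and interleaved banks are all accessed in parallel. The slides do not state the number of banks, so `banks` is a free parameter, and `miss_penalty` is a hypothetical helper.

```python
from math import ceil


def miss_penalty(block_words, request=1, dram_cycles=15, bus_cycles=1,
                 width_words=1, banks=1):
    """Cycles to fetch one block, assuming no overlap between phases."""
    # Each DRAM access round delivers width_words from each of `banks` banks.
    dram_rounds = ceil(block_words / (width_words * banks))
    # The bus moves width_words per transfer, regardless of the bank count.
    bus_transfers = ceil(block_words / width_words)
    return request + dram_rounds * dram_cycles + bus_transfers * bus_cycles


narrow = miss_penalty(8)                        # 1 + 8*15 + 8*1
wide = miss_penalty(8, width_words=2)           # 1 + 4*15 + 4*1
interleaved_2 = miss_penalty(8, banks=2)        # 1 + 4*15 + 8*1
interleaved_8 = miss_penalty(8, banks=8)        # 1 + 1*15 + 8*1
print(narrow, wide, interleaved_2, interleaved_8)
```

Interleaving keeps the narrow bus, so it saves DRAM access rounds but not bus transfers; widening saves both at a higher hardware cost.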

  30. Recent DRAM trends • Fewer, Bigger DRAMs • New bus protocols (RAMBUS) • small DRAM caches (page mode) • SDRAM (synchronous DRAM) • one request & length nets several continuous responses.

  31. Performance • Execute Time = (CPU cycles + Memory-stall cycles) * clock cycle time • Memory-stall cycles = (accesses / program) * (misses / access) * (cycles / miss) = (memory accesses / program) * Miss rate * Miss penalty • Memory-stall cycles = (instructions / program) * (misses / instruction) * (cycles / miss) = (instructions / program) * (misses / instruction) * Miss penalty

  32. Example 1 • instruction cache miss rate: 2% • data cache miss rate: 3% • miss penalty: 50 cycles • ld/st instructions are 25% of instructions • CPI with perfect cache is 2.3 • How much faster is the computer with a perfect cache?

  33. Example 2 • Double the clock rate from Example1. What is the ideal speedup when taking into account the memory system? • How long is the miss penalty now?

  34. Example 1 Work • misses / instr = (Iacc / instr) * Imr + (Dacc / instr) * Dmr
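The arithmetic for Example 1 and Example 2 can be sketched as follows. Two assumptions are made that the slides do not state outright: every instruction makes exactly one instruction-cache access, and doubling the clock rate leaves memory latency unchanged in absolute time, so the miss penalty doubles to 100 cycles. The helper name `cpi_with_stalls` is hypothetical.

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, ldst_fraction, miss_penalty):
    """CPI including memory-stall cycles per instruction."""
    # misses/instr = (Iacc/instr) * Imr + (Dacc/instr) * Dmr, with Iacc/instr = 1
    misses_per_instr = 1.0 * i_miss_rate + ldst_fraction * d_miss_rate
    return base_cpi + misses_per_instr * miss_penalty


BASE_CPI = 2.3

# Example 1: real cache versus perfect cache at the same clock rate.
cpi_real = cpi_with_stalls(BASE_CPI, 0.02, 0.03, 0.25, miss_penalty=50)
speedup_perfect = cpi_real / BASE_CPI

# Example 2: clock rate doubled, so the cycle time halves and the
# miss penalty (assumed fixed in nanoseconds) becomes 100 cycles.
cpi_fast = cpi_with_stalls(BASE_CPI, 0.02, 0.03, 0.25, miss_penalty=100)
speedup_fast_clock = (cpi_real * 1.0) / (cpi_fast * 0.5)

print(f"CPI with real cache: {cpi_real:.3f}")
print(f"Perfect cache speedup: {speedup_perfect:.2f}x")
print(f"Speedup from doubling the clock: {speedup_fast_clock:.2f}x")
```

Under these assumptions the doubled clock delivers well under a 2x speedup, because the stall cycles grow with the clock rate.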

  35. Direct-mapped Cache • Blocksize = 2 words, wordsize = 4 bytes • Address fields: Tag, Index, Block Offset, Byte Offset • Cache contents (Index: Valid, Tag): 00: 1, 101 • 01: 1, 010 • 10: 1, 000 • 11: 0, 000 • Reference stream (mark each Hit/Miss): 0b00111000, 0b00011100, 0b00111000, 0b00011000

  36. Direct-Mapped Caches • PROBLEM: Conflicting addresses cause high miss rates. Size was NOT the problem!!! • SOLUTION: Relax direct-mapping

  37. Cache Configurations • Direct-Mapped - one block per index (00, 01, 10, 11) • 2-way Associative - each set has two blocks (sets 0, 1) • Fully Associative - all addresses map to the same set • [Figure: Valid, Tag, Data arrays for each configuration]

  38. 2-way Set Associative Cache • Blocksize = 2 words, wordsize = 4 bytes • Address fields: Tag, Index, Block Offset, Byte Offset • Cache contents (both ways Valid = 1): Index 0: tags 1001, 0000 • Index 1: tags 0010, 0001 • Reference stream (mark each Hit/Miss): 0b00111000, 0b00011100, 0b00111000, 0b00011000
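To see why relaxing direct mapping helps, the sketch below runs this reference stream through two caches of equal capacity: direct-mapped (4 sets, 1 way) and 2-way set associative (2 sets, 2 ways). It departs from the slide in two labelled ways: both caches start empty rather than with the preloaded contents shown, and replacement is LRU, which the slide does not specify. `simulate_cache` is a hypothetical helper.

```python
def simulate_cache(addresses, num_sets, ways, block_bytes=8):
    """Return 'H'/'M' outcomes for a set-associative cache with LRU replacement."""
    # Each set is a list of tags ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    results = []
    for addr in addresses:
        block = addr // block_bytes
        index, tag = block % num_sets, block // num_sets
        tags = sets[index]
        if tag in tags:
            results.append("H")
            tags.remove(tag)          # will be re-appended as most recent
        else:
            results.append("M")
            if len(tags) == ways:
                tags.pop(0)           # evict the least recently used block
        tags.append(tag)
    return results


stream = [0b00111000, 0b00011100, 0b00111000, 0b00011000]
direct = simulate_cache(stream, num_sets=4, ways=1)
two_way = simulate_cache(stream, num_sets=2, ways=2)
print("direct-mapped:", direct)
print("2-way:        ", two_way)
```

In the direct-mapped run the two blocks keep evicting each other from the same line; with two ways they coexist in one set, even though the total capacity is unchanged.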

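  39. Implementation • Byte address: 0b100100100 • Address fields: Tag, Index, Block offset, Byte Offset • [Figure: two ways (Valid, Tag, Data) with sets 0 and 1; each way has a tag comparator (=) and a MUX, and a final MUX selects between the ways to produce Hit? and Data]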

  40. Performance Implications • Increasing associativity increases/decreases hit rate • Increasing associativity increases/decreases access time • Increasing associativity increases/decreases miss penalty

  41. Example: 2-way associative • Reference stream (mark each Hit/Miss): 0b1001000 (M), 0b0011100, 0b1001000, 0b0111000 • Address fields: Tag, Index, Block Offset, Byte Offset • Cache (Valid, Tag, Data per way): sets 0 and 1, all initially Valid = 0 • Miss Rate:

  42. Which block to replace? • 0b1001000 • 0b001100

  43. Replacement Algorithms • LRU & FIFO simple conceptually, but implementation difficult for high assoc. • LRU & FIFO must be approximated with high associativity • Random sometimes better than approximated LRU/FIFO • Tradeoff between accuracy, implementation cost

  44. L1 cache's perspective • L1's miss penalty contains the access of L2, and possibly the access of DRAM!!! • [Figure: Me (L1), Memory (L2 Cache and DRAM)]

  45. Multi-level Caches • Base CPI 1.0, 500MHz clock • main memory-100 cycles, L2 - 10 cycles • L1 miss rate per instruction - 5% • w/L2 - 2% of instructions go to DRAM • What is the speedup with the L2 cache? There is a typo in the book for this example!

  46. Multi-level Caches • CPI = 1 + memory stalls / instruction
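A sketch of the speedup calculation using the numbers from the multi-level example. It assumes that every L1 miss pays the 10-cycle L2 access, and that the 2% of instructions that also miss in L2 pay the 100-cycle DRAM access on top of that. The constant names are hypothetical.

```python
BASE_CPI = 1.0
L1_MISS_RATE = 0.05        # L1 misses per instruction
L2_CYCLES = 10             # L2 access time in cycles
DRAM_CYCLES = 100          # main memory access time in cycles
DRAM_RATE_WITH_L2 = 0.02   # fraction of instructions that still go to DRAM

# Without L2, every L1 miss goes straight to main memory.
cpi_without_l2 = BASE_CPI + L1_MISS_RATE * DRAM_CYCLES

# With L2, every L1 miss costs an L2 access; some also cost a DRAM access.
cpi_with_l2 = (BASE_CPI
               + L1_MISS_RATE * L2_CYCLES
               + DRAM_RATE_WITH_L2 * DRAM_CYCLES)

speedup = cpi_without_l2 / cpi_with_l2
print(f"CPI without L2: {cpi_without_l2:.1f}")
print(f"CPI with L2:    {cpi_with_l2:.1f}")
print(f"Speedup:        {speedup:.2f}x")
```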

  47. Summary • Direct-mapped • simple • _____ access time • _______ hit rate • Variable block size • still simple • _______ access time

  48. Summary • Associative caches • ________ the access time • ________ the hit rate • associativity above ___ has little to no gain • Multi-level caches • __________ potential miss penalty • __________ average miss penalty
