
Cache Part 2

This article explains the importance of cache performance in computer systems and the hierarchy of cache levels. It also covers different cache mapping techniques such as direct mapping, fully associative, and set associative.

altamirano




Presentation Transcript


  1. Cache Part 2

  2. Timing • Total execution cycles = program execution cycles + memory stall cycles • Program execution cycles include cache hit time • Memory stall cycles come mainly from cache misses

  3. Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles / miss • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • What is the CPI cost of cache misses? • How many times faster would a perfect cache be?

  4. Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • Instruction misses • All instructions are fetched, and 2% of fetches miss at a cost of 100 cycles per miss: 1 × 0.02 × 100 = 2 CPI

  5. Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • Data misses • 36% of instructions access data, and 4% of those accesses miss at a cost of 100 cycles per miss: 0.36 × 0.04 × 100 = 1.44 CPI

  6. Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • CPI cost of misses = 2 + 1.44 = 3.44 CPI • Total CPI = 2 + 3.44 = 5.44 • Speedup of perfect cache = 5.44 / 2 = 2.72 times
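The arithmetic in this example can be checked with a short Python sketch. It is illustrative only and not part of the original slides; the function name `cpi_with_misses` and its parameter names are assumptions made for this example.

```python
def cpi_with_misses(base_cpi, i_miss_rate, d_miss_rate, mem_frac, miss_penalty):
    """Return (instruction stall CPI, data stall CPI, total CPI)."""
    # Every instruction is fetched, so the I-cache miss rate applies to all of them.
    i_stall = i_miss_rate * miss_penalty
    # Only loads and stores (mem_frac of instructions) access the D-cache.
    d_stall = mem_frac * d_miss_rate * miss_penalty
    return i_stall, d_stall, base_cpi + i_stall + d_stall


BASE_CPI = 2
i_stall, d_stall, total = cpi_with_misses(BASE_CPI, 0.02, 0.04, 0.36, 100)
# A perfect cache never stalls, so its CPI is just the base CPI.
speedup = total / BASE_CPI
print(round(i_stall, 2), round(d_stall, 2), round(total, 2), round(speedup, 2))
```

With the slide's numbers, the values work out to 2, 1.44, 5.44 and 2.72 respectively.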

  7. Speedup vs Hit Rate • Need high hit rate for large speedup • k = 1 / cycles per miss • https://ggbm.at/CXfa6mCP

  8. Hierarchy • Often multiple levels of cache • Bigger # usually means • Larger cache • Slower

  9. Process • I need memory location 0x000E • Is it in L1 cache? • Yes : Hit – use it • No : Miss – go search next level • Is it in L2? • Yes : Hit – use it • No : Miss – go search next level • Is it in L3… • Is it in memory…
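The search order described on this slide can be sketched as a loop over cache levels. This is a minimal sketch under stated assumptions: each level is modelled as an unbounded Python dict (no capacity or eviction), and copying the value into the levels that missed is an assumption added here, not something the slide specifies. The name `lookup` is hypothetical.

```python
def lookup(address, levels, memory):
    """Search each cache level in order, then fall back to main memory.

    `levels` is a list of (name, dict) pairs ordered L1, L2, ...
    Returns (value, name of the place where it was found).
    """
    missed = []
    for name, cache in levels:
        if address in cache:              # hit: use it
            value, source = cache[address], name
            break
        missed.append(cache)              # miss: go search the next level
    else:
        value, source = memory[address], "memory"
    # Assumption: fill every level that missed so later accesses hit sooner.
    for cache in missed:
        cache[address] = value
    return value, source


l1, l2 = {}, {0x000E: 42}
memory = {0x000E: 42, 0x0010: 7}
levels = [("L1", l1), ("L2", l2)]

first = lookup(0x000E, levels, memory)    # misses L1, hits L2
second = lookup(0x000E, levels, memory)   # now hits L1
third = lookup(0x0010, levels, memory)    # misses every cache level
print(first, second, third)
```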

  10. Cache • L2 & L3 • May be on chip or board • May be shared by cores • ~ 1 MB (L2) ~5-10 MB (L3)

  11. Differences • No hard rules about • What cache you have • Where it lives

  12. Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100ns • What is effective CPI?

  13. Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100ns / miss • Effective CPI = base + miss penalty • Miss penalty = 0.02 × (100 ns / 0.25 ns per cycle) = 0.02 × 400 cycles = 8 CPI • Effective CPI = 1 + 8 = 9 CPI

  14. Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100ns • Add an L2 Cache • 5ns per access • 0.5% global miss rate to memory • i.e. 25% of the misses from L1 also miss in L2 (0.5 / 2), a 75% L2 hit rate • How many times faster is it than just L1?

  15. Multi Level Example • Total CPI = Base CPI + L1 miss and L2 hit + L2 miss • L1 miss and L2 hit: 0.02 × (5 ns / 0.25 ns per cycle) = 0.02 × 20 cycles = 0.4 CPI • L2 miss: 0.005 × 400 cycles = 2 CPI

  16. Multi Level Example • Total CPI = Base CPI + L1 miss and L2 hit + L2 miss • L1 miss and L2 hit: 0.02 × 20 cycles = 0.4 CPI • Or, for the L2 miss: 0.02 × 0.25 × 400 cycles = 2 CPI

  17. Multi Level Example • Total CPI = Base CPI + L1 miss and L2 hit + L2 miss = 1 + 0.4 + 2 • Total CPI with L2 = 3.4 • Speedup over just L1 = 9 / 3.4 ≈ 2.6 times
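The multi-level example can be reproduced with a few lines of Python. This is an illustrative sketch using the figures given on the slides (4 GHz clock, 2% L1 miss rate, 5 ns L2 access, 0.5% global miss rate, 100 ns memory access); the variable names are assumptions made for this example.

```python
CYCLE_NS = 0.25                 # 4 GHz clock
BASE_CPI = 1
L1_MISS_RATE = 0.02             # misses per instruction
GLOBAL_MISS_RATE = 0.005        # accesses that go all the way to memory

mem_cycles = 100 / CYCLE_NS     # main memory access time in cycles
l2_cycles = 5 / CYCLE_NS        # L2 access time in cycles

# L1 only: every L1 miss pays the full memory access time.
cpi_l1_only = BASE_CPI + L1_MISS_RATE * mem_cycles

# With L2: every L1 miss pays the L2 access time,
# and the accesses that also miss in L2 pay the memory access time.
cpi_with_l2 = BASE_CPI + L1_MISS_RATE * l2_cycles + GLOBAL_MISS_RATE * mem_cycles

speedup = cpi_l1_only / cpi_with_l2
print(round(cpi_l1_only, 2), round(cpi_with_l2, 2), round(speedup, 2))
```

The exact ratio is 9 / 3.4, which the slide rounds to 2.6.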

  18. Associativity • Associativity • What chunks of memory can go in which cache lines

  19. Direct Mapping • Direct mapping : every memory line has one cache entry it can use

  20. Direct Mapped Cache • Issue : Thrashing • Reusing same line rapidly for different sets

  21. Direct Mapped Cache • Adding arrays in a loop: • 0x0040 = 0x0000 + 0x0010 • Read Line 0 / Set 0 (miss) • Read Line 1 / Set 0 (miss) • Write Line 0 / Set 1 (miss – kills Set 0) • 0x0044 = 0x0004 + 0x0014 • Read Line 0 / Set 0 (miss – kills Set 1) • Read Line 1 / Set 0 (hit) • Write Line 0 / Set 1 (miss)
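The thrashing pattern on this slide can be reproduced with a small direct-mapped simulation. The cache geometry is an assumption inferred from the slide's line numbers (16-byte lines, 4 cache lines), and writes are assumed to allocate a line just like reads; neither is stated in the original. The names `split` and `run_direct_mapped` are hypothetical.

```python
LINE_BYTES = 16   # assumed line size
NUM_LINES = 4     # assumed number of cache lines


def split(addr):
    """Return (line index, tag) for a byte address."""
    block = addr // LINE_BYTES
    return block % NUM_LINES, block // NUM_LINES


def run_direct_mapped(addresses):
    lines = [None] * NUM_LINES        # tag currently stored in each line
    results = []
    for addr in addresses:
        index, tag = split(addr)
        results.append("hit" if lines[index] == tag else "miss")
        lines[index] = tag            # the only place this block can live
    return results


# Two loop iterations: read, read, write (assumed write-allocate).
trace = [0x0000, 0x0010, 0x0040, 0x0004, 0x0014, 0x0044]
results = run_direct_mapped(trace)
print(results)
```

Addresses 0x0000 and 0x0040 map to the same line with different tags, so each access to one evicts the other; only the second read of line 1 hits.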

  22. Fully Associative • Fully associative cache • Any memory line can go in any cache entry

  23. Fully Associative Cache • Minimal thrashing • Put wherever you want • When space is needed, evict oldest

  24. Fully Associative Cache • Minimal thrashing • Put wherever you want • When space is needed, evict oldest • [Figure: cache entries holding lines from Set 0, Set 4 and Set 2]

  25. Fully Associative • Issues: • Must check all tags in parallel for a match • Large amounts of hardware • Most practical for very small caches

  26. Set Associative • n-way Set Associative : every memory block has n slots it can be in • Example: 2-way

  27. Set Associative • n-way Set Associative : every memory block has n slots it can be in • Example: 4-way

  28. Set Associative Address • Example – 2-way • 2 cache entries • Each holds two items (A & B)

  29. Set Associative Address • 2-way set associative: • 0x0040 = 0x0000 + 0x0010 • Read Line 0 / Set 0 (miss) • Read Line 1 / Set 0 (miss) • Write Line 0 / Set 2 (miss) • [Figure: cache contents: Set 0]

  30. Set Associative Address • 2-way set associative: • 0x0040 = 0x0000 + 0x0010 • Read Line 0 / Set 0 (miss) • Read Line 1 / Set 0 (miss) • Write Line 0 / Set 2 (miss) • [Figure: cache contents: Set 0, Set 0]

  31. Set Associative Address • 2-way set associative: • 0x0040 = 0x0000 + 0x0010 • Read Line 0 / Set 0 (miss) • Read Line 1 / Set 0 (miss) • Write Line 0 / Set 2 (miss) • [Figure: cache contents: Set 2, Set 0, Set 0]

  32. Set Associative Address • 2-way set associative: • 0x0040 = 0x0000 + 0x0010 • Read Line 0 / Set 0 (miss) • Read Line 1 / Set 0 (miss) • Write Line 0 / Set 2 (miss) • 0x0044 = 0x0004 + 0x0014 • Read Line 0 / Set 0 (hit) • Read Line 1 / Set 0 (hit) • Write Line 0 / Set 2 (hit) • [Figure: cache contents: Set 2, Set 0, Set 0]
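The same access pattern can be run through a small 2-way set associative simulation. The geometry is an assumption inferred from the slides (2 cache entries of 2 items each, 16-byte lines), and least-recently-used eviction with write-allocate is assumed because the slides do not name a policy at this point. The function name `run_set_associative` is hypothetical.

```python
def run_set_associative(addresses, num_sets=2, ways=2, line_bytes=16):
    """Simulate an n-way set associative cache with LRU eviction (assumed)."""
    # Each set is a list of tags ordered least recently used -> most recently used.
    sets = [[] for _ in range(num_sets)]
    results = []
    for addr in addresses:
        block = addr // line_bytes
        index, tag = block % num_sets, block // num_sets
        entry = sets[index]
        if tag in entry:
            entry.remove(tag)         # re-appended below as most recently used
            results.append("hit")
        else:
            if len(entry) == ways:
                entry.pop(0)          # evict the least recently used tag
            results.append("miss")
        entry.append(tag)
    return results


# Two loop iterations: read, read, write (assumed write-allocate).
trace = [0x0000, 0x0010, 0x0040, 0x0004, 0x0014, 0x0044]
results = run_set_associative(trace)
print(results)
```

Because 0x0000 and 0x0040 can now share one cache entry, the second iteration hits on all three accesses instead of thrashing.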

  33. Associativity Compared • Cache with space for 8 entries could be • Direct mapped: 8 blocks, 1 line capacity each • 2-way: 4 sets, 2 lines capacity each • 4-way: 2 sets, 4 lines capacity each • Fully associative: 1 set, 8 lines capacity
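The trade-off between number of sets and lines per set follows from sets = entries / ways. A minimal sketch, assuming the entry count is a power of two; the function name `organizations` is made up for this example.

```python
def organizations(entries):
    """List (ways, sets) pairs for a cache with `entries` lines (a power of two)."""
    result = []
    ways = 1
    while ways <= entries:
        result.append((ways, entries // ways))
        ways *= 2
    return result


layouts = organizations(8)
for ways, sets in layouts:
    print(f"{ways}-way: {sets} sets of {ways} lines")
```

Here 1-way corresponds to direct mapping and 8-way to a fully associative cache.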

  34. 4-Way Implementation • 32 bit addresses, 256 cache lines, each line holds 1 word of memory (4 bytes) • Address broken up: • 2 bits for offset in line (4 addresses/line) • 8 bits for line (256 lines) • 22 bits for tag (everything else)

  35. 4-Way Implementation • Address: 0101 1101 0010 0010 0011 0010 0010 1100 • 2 bits for offset in line: 00 (Byte 0 in line) • 8 bits for line: 10 0010 11 (Index 139 in cache) • 22 bits for tag: 0101 1101 0010 …

  36. 4-Way Implementation • Address: 0101 1101 0010 0010 0011 0010 0010 1100 • Cache index is 139 • Tag is 0101 1101 0010 … • Check all 4 tags in index 139 looking for a match
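The address split used in this example can be expressed with shifts and masks. This is an illustrative sketch; the constant and function names (`OFFSET_BITS`, `INDEX_BITS`, `split_address`) are assumptions, and only the bit widths and the address come from the slides.

```python
OFFSET_BITS = 2   # 4 byte addresses per line
INDEX_BITS = 8    # 256 index values


def split_address(addr):
    """Split a 32-bit address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # the remaining 22 bits
    return tag, index, offset


addr = 0b0101_1101_0010_0010_0011_0010_0010_1100
tag, index, offset = split_address(addr)
print(f"tag={tag:022b} index={index} offset={offset}")
```

In a 4-way cache the hardware would then compare `tag` against the four tags stored at that index.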

  37. Set Associative Performance • Larger caches = higher hit rate • Smaller caches benefit more from associativity

  38. Replacement Strategies • How do we choose what block to kick out? • FIFO : Track age • Least Used : Track accesses • Very susceptible to thrashing • Least Recently Used : Track age of accesses • Very complex for larger caches • Random
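The difference between FIFO and Least Recently Used can be seen on a tiny access trace. This is a software sketch under stated assumptions (a single fully associative set with 2 entries, block names chosen for illustration), not a description of how hardware tracks age. The function name `count_hits` is hypothetical.

```python
from collections import OrderedDict


def count_hits(accesses, capacity, policy):
    """Count hits for policy "FIFO" or "LRU" on a fully associative cache."""
    cache = OrderedDict()             # oldest entry first
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            if policy == "LRU":
                cache.move_to_end(block)   # a hit refreshes the block's age
        else:
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the entry at the front
            cache[block] = True
    return hits


accesses = ["A", "B", "A", "C", "A", "B"]
fifo_hits = count_hits(accesses, 2, "FIFO")
lru_hits = count_hits(accesses, 2, "LRU")
print(fifo_hits, lru_hits)
```

On this trace FIFO evicts A when C arrives even though A was just used, so it gets 1 hit where LRU gets 2.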

  39. Update Strategies • When we store to memory, what gets updated? • Write Through : Update all levels of cache and memory • - Have to stall and wait for slowest component to finish update • + Consistency between levels • + Simple • Write Back : Just update cache. Update memory when the line leaves the cache • - Complex • - Lack of consistency between levels • + Faster – only stall for current level
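The two update strategies can be contrasted with a pair of toy classes. This is a minimal sketch, assuming a single cache level in front of memory, both modelled as dicts; the class names, the `dirty` set and the explicit `evict` method are assumptions made for this example.

```python
class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value     # memory is updated on every store


class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()            # addresses modified since they were loaded

    def store(self, addr, value):
        self.lines[addr] = value      # only the cache is updated
        self.dirty.add(addr)

    def evict(self, addr):
        value = self.lines.pop(addr)
        if addr in self.dirty:        # write back only if the line changed
            self.dirty.discard(addr)
            self.memory[addr] = value


wt_memory = {0x10: 0}
wt = WriteThroughCache(wt_memory)
wt.store(0x10, 5)
wt_after_store = wt_memory[0x10]      # already consistent with the cache

wb_memory = {0x10: 0}
wb = WriteBackCache(wb_memory)
wb.store(0x10, 5)
wb_after_store = wb_memory[0x10]      # stale until the line is evicted
wb.evict(0x10)
wb_after_evict = wb_memory[0x10]
print(wt_after_store, wb_after_store, wb_after_evict)
```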

  40. What do they use? • Intel Nehalem & ARM Cortex-A8 (ARMv7)
