This article explains the importance of cache performance in computer systems and the hierarchy of cache levels. It also covers different cache mapping techniques such as direct mapping, fully associative, and set associative.
Timing • Total execution cycles = program execution cycles + memory stall cycles • Program execution cycles • Include cache hit time • Memory stall cycles • Mainly from cache misses
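The breakdown above can be sketched as a small function. This is a minimal model, and the variable names and example numbers are illustrative, not from the slides:

```python
# Sketch of the timing model: total cycles = execution cycles + stall cycles.
def total_execution_cycles(program_cycles, memory_accesses, miss_rate, miss_penalty):
    """Stall cycles come mainly from cache misses: accesses * miss rate * penalty."""
    memory_stall_cycles = memory_accesses * miss_rate * miss_penalty
    return program_cycles + memory_stall_cycles

# Illustrative values: 1000 base cycles, 300 accesses, 2% miss rate, 100-cycle penalty.
print(total_execution_cycles(1000, 300, 0.02, 100))  # 1600.0
```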
Cache Performance Example • Given: • Level 1 only, separate Instruction and Data Caches • I-cache miss rate = 2% • D-cache miss rate = 4% • Miss penalty = 100 cycles / miss • Base CPI (including accessing ideal cache) = 2 • Loads & stores are 36% of instructions • What is the CPI cost of cache misses? How many times faster would a perfect cache be?
Cache Performance Example • Instruction misses • All instructions miss 2% of the time at a cost of 100 cycles per miss: 0.02 × 100 = 2 CPI • Data misses • 36% of instructions miss 4% of the time: 0.36 × 0.04 × 100 = 1.44 CPI • Miss penalty = 2 + 1.44 = 3.44 CPI • Total CPI = 2 + 3.44 = 5.44 • Speedup of a perfect cache: 5.44 / 2 = 2.72 times
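The worked example can be reproduced directly; all values below come from the slides above:

```python
# Reproduces the worked example: separate I- and D-caches, L1 only.
i_miss_rate, d_miss_rate = 0.02, 0.04
miss_penalty = 100           # cycles per miss
base_cpi = 2                 # CPI with an ideal (always-hit) cache
load_store_frac = 0.36       # fraction of instructions that access data

instr_miss_cpi = i_miss_rate * miss_penalty                    # 2 CPI
data_miss_cpi = load_store_frac * d_miss_rate * miss_penalty   # ≈ 1.44 CPI
total_cpi = base_cpi + instr_miss_cpi + data_miss_cpi          # ≈ 5.44
speedup = total_cpi / base_cpi                                 # ≈ 2.72
print(total_cpi, speedup)
```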
Speedup vs Hit Rate • Need a high hit rate for a large speedup • k = 1 / (cycles per miss) https://ggbm.at/CXfa6mCP
Hierarchy • Often multiple levels of cache • A higher level number usually means • A larger cache • Slower access
Process • I need memory location 0x000E • Is it in L1 cache? • Yes : Hit – use it • No : Miss – go search next level • Is it in L2? • Yes : Hit – use it • No : Miss – go search next level • Is it in L3… • Is it in memory…
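The lookup cascade above can be sketched as a loop over levels. This is a minimal model, with each level as a plain dict and illustrative contents:

```python
# Minimal sketch of the cache lookup cascade: try each level in order.
def lookup(address, levels):
    """Return which level hit; a miss falls through to the next level."""
    for name, cache in levels:
        if address in cache:
            return f"hit in {name}"
    return "miss everywhere: fetch from main memory"

# Illustrative contents: only L1 holds anything.
levels = [("L1", {0x000E: 42}), ("L2", {}), ("L3", {})]
print(lookup(0x000E, levels))  # hit in L1
print(lookup(0x0040, levels))  # miss everywhere: fetch from main memory
```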
Cache • L2 & L3 • May be on chip or on the board • May be shared by cores • ~1 MB (L2), ~5-10 MB (L3)
Differences • No hard rules about • What cache you have • Where it lives
Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100 ns • What is the effective CPI?
Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100 ns / miss • Effective CPI = base CPI + miss penalty • Miss penalty = 2% × (100 ns / 0.25 ns per cycle) = 0.02 × 400 cycles = 8 CPI • Effective CPI = 1 + 8 = 9 CPI
Multi Level Example • Given • CPU base CPI = 1 • Clock rate = 4GHz (or 0.25 ns / cycle) • L1 miss rate/instruction = 2% • Main memory access time = 100 ns • Add an L2 cache • 5 ns per access • 0.5% global miss rate to memory • i.e. 25% of L1 misses also miss in L2 (0.5 / 2), so L2 catches the other 75% • How many times faster is it than just L1?
Multi Level Example • Total CPI = Base CPI + L1 miss and L2 hit + L2 miss • L1 miss and L2 hit: 2% × (5 ns / 0.25 ns per cycle) = 0.02 × 20 cycles = 0.4 CPI • L2 miss: 0.5% × 400 cycles = 2 CPI
Multi Level Example • Total CPI = Base CPI + L1 miss and L2 hit + L2 miss = 1 + 0.4 + 2 • Total CPI with L2 = 3.4 • Speedup over just L1: 9 / 3.4 ≈ 2.6 times
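The whole multi-level example can be checked in a few lines; every number below comes from the slides:

```python
# Multi-level example: 4 GHz clock, L1-only vs. L1 + L2.
clock_ns = 0.25                      # 4 GHz -> 0.25 ns per cycle
base_cpi = 1
l1_miss_rate = 0.02
l2_global_miss_rate = 0.005          # fraction of all instructions reaching memory
l2_access_ns, mem_access_ns = 5, 100

l2_access_cycles = l2_access_ns / clock_ns     # 20 cycles
mem_access_cycles = mem_access_ns / clock_ns   # 400 cycles

cpi_l1_only = base_cpi + l1_miss_rate * mem_access_cycles      # 9 CPI
cpi_with_l2 = (base_cpi
               + l1_miss_rate * l2_access_cycles               # 0.4 CPI
               + l2_global_miss_rate * mem_access_cycles)      # 2 CPI
print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)    # 9, 3.4, ≈ 2.6
```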
Associativity • Associativity • What chunks of memory can go in which cache lines
Direct Mapping • Direct mapping : every memory line has one cache entry it can use
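In a direct-mapped cache the candidate line is computed from the address. A minimal sketch, where the line size and line count are chosen for illustration (the slides do not fix a geometry):

```python
# Sketch of direct mapping: each address has exactly one candidate line.
LINE_SIZE = 16   # bytes per cache line (illustrative)
NUM_LINES = 4    # lines in the cache (illustrative)

def direct_mapped_line(address):
    block_number = address // LINE_SIZE   # which memory block
    return block_number % NUM_LINES       # its one possible cache line

print(direct_mapped_line(0x0000))  # 0
print(direct_mapped_line(0x0040))  # 0 -> collides with 0x0000
```

Note how 0x0000 and 0x0040 land on the same line, which sets up the thrashing example on the next slides.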
Direct Mapped Cache • Issue: Thrashing • Rapidly reusing the same cache line for different memory blocks
Direct Mapped Cache • Adding arrays in a loop: • 0x0040 = 0x0000 + 0x0010 • Read Line 0 / Set 0 (miss) • Read Line 1 / Set 0 (miss) • Write Line 0 / Set 1 (miss – kills Set 0) • 0x0044 = 0x0004 + 0x0014 • Read Line 0 / Set 0 (miss – kills Set 1) • Read Line 1 / Set 0 (hit) • Write Line 0 / Set 1 (miss)
Fully Associative • Fully associative cache • Any memory line can go in any cache entry
Fully Associative Cache • Minimal thrashing • Put blocks wherever you want • When space is needed, evict the oldest
Fully Associative • Issues: • Must check all tags in parallel for a match • Requires large amounts of hardware • Most practical for very small caches
Set Associative • n-way set associative: every memory block has n slots it can be in • Shown as 2-way and 4-way examples
Set Associative Address • Example – 2-way • 2 cache entries • Each holds two items (A & B)
Set Associative Address • 2-way set associative: • 0x0040 = 0x0000 + 0x0010 • Read Line 0 / Set 0 (miss) • Read Line 1 / Set 0 (miss) • Write Line 0 / Set 2 (miss) • 0x0044 = 0x0004 + 0x0014 • Read Line 0 / Set 0 (hit) • Read Line 1 / Set 0 (hit) • Write Line 0 / Set 2 (hit)
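A tiny simulation of this access pattern shows the difference. The geometry (16-byte lines, four lines total, LRU eviction) is chosen for illustration; with it, direct-mapped gets one hit across the two iterations while 2-way gets three, matching the slides:

```python
# Simulate the array-addition pattern with a 4-line cache, organized
# either direct-mapped (4 sets x 1 way) or 2-way (2 sets x 2 ways).
LINE = 16  # bytes per line (illustrative)

def simulate(accesses, num_sets, ways):
    sets = [[] for _ in range(num_sets)]  # each set: blocks in LRU order
    hits = 0
    for addr in accesses:
        block = addr // LINE
        s = sets[block % num_sets]
        if block in s:
            hits += 1
            s.remove(block)   # refresh: re-append as most recently used
        elif len(s) == ways:
            s.pop(0)          # set full: evict the least recently used block
        s.append(block)
    return hits

# C[i] = A[i] + B[i]: read A, read B, write C, for two iterations.
pattern = [0x0000, 0x0010, 0x0040, 0x0004, 0x0014, 0x0044]
print(simulate(pattern, num_sets=4, ways=1))  # 1 (direct-mapped thrashes)
print(simulate(pattern, num_sets=2, ways=2))  # 3 (2-way: all hits, round 2)
```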
Associativity Compared • A cache with space for 8 entries could be • 8 blocks, direct mapped • 4 sets × 2 lines (2-way) • 2 sets × 4 lines (4-way) • Fully associative, 8 lines
4-Way Implementation • 32-bit addresses, 256 cache lines, each line holds 1 word of memory (4 bytes) • Address broken up: • 2 bits for offset in line (4 addresses/line) • 8 bits for line (256 lines) • 22 bits for tag (everything else)
4-Way Implementation • Address: 0101 1101 0010 0010 0011 0010 0010 1100 • 2 bits for offset in line: 00 → byte 0 in line • 8 bits for line: 10 0010 11 → index 139 in cache • 22 bits for tag: 0101 1101 0010 …
4-Way Implementation • Address: 0101 1101 0010 0010 0011 0010 0010 1100 • Cache index is 139 • Tag is 0101 1101 0010 … • Check all 4 tags at index 139 looking for a match
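The field extraction from the slide can be checked with a few bit operations (the helper name is illustrative):

```python
# Split a 32-bit address into the slide's fields:
# 2-bit byte offset, 8-bit line index, 22-bit tag.
def split_address(addr):
    offset = addr & 0x3          # low 2 bits: byte within the line
    index = (addr >> 2) & 0xFF   # next 8 bits: which cache line
    tag = addr >> 10             # remaining 22 bits
    return tag, index, offset

addr = 0b0101_1101_0010_0010_0011_0010_0010_1100
tag, index, offset = split_address(addr)
print(index, offset)  # 139 0
```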
Set Associative Performance • Larger caches = higher hit rate • Smaller caches benefit more from associativity
Replacement Strategies • How do we decide which block to kick out? • FIFO : Track age • Least Used : Track access counts • Very susceptible to thrashing • Least Recently Used : Track age of accesses • Very complex for larger caches • Random
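Least Recently Used can be sketched for a single cache set; this is a software model of the policy (real hardware tracks recency differently), and the class name and capacity are illustrative:

```python
from collections import OrderedDict

# Sketch of LRU replacement for one n-way cache set.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # least recently used first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # refresh: now most recently used
            return "hit"
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = True
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# ['miss', 'miss', 'hit', 'miss', 'miss'] -- C evicts B, so B misses again
```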
Update Strategies • When we store to memory, what gets updated? • Write Through : Update all levels of cache and memory • - Have to stall and wait for the slowest component to finish the update • + Consistency between levels • + Simple • Write Back : Just update the cache; update memory when the block leaves the cache • - Complex • - Lack of consistency between levels • + Faster – only stall for the current level
What do they use? • Intell Nehalem& ARM Cortex A-8(ARM v7)