Embedded Computer Architecture
Memory Hierarchy: Cache Recap
Course 5KK73
Henk Corporaal
November 2013
h.corporaal@tue.nl
Memory Hierarchy, why?
[Figure: pyramid of memory levels, CPU at the top, then Level 1, Level 2, … Level n; speed increases toward the CPU, size increases away from it]
• Users want large and fast memories!
  • SRAM access times are 1 – 10 ns
  • DRAM access times are 20 – 120 ns
  • Disk access times are 5 to 10 million ns, but its bits are very cheap
• Get the best of both worlds, fast and large memories:
  • build a memory hierarchy
Exploiting Locality
• Locality = the principle that makes having a memory hierarchy a good idea
• If an item is referenced:
  • temporal locality: it will tend to be referenced again soon
  • spatial locality: nearby items will tend to be referenced soon
• Why does code have locality?
• Our initial focus: two levels (upper, lower)
  • block: minimum unit of data
  • hit: data requested is in the upper level
  • miss: data requested is not in the upper level
[Figure: upper level (cache, holding blocks) above the lower level (memory)]
Cache operation
[Figure: cache (higher level) holding tagged blocks/lines of data, backed by memory (lower level)]
Direct Mapped Cache • Mapping: cache block index = memory block address modulo the number of blocks in the cache
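The mapping rule above can be sketched in a few lines; the cache and block sizes below are illustrative assumptions, not from the slides:

```python
# Direct-mapped placement: each block maps to exactly one cache slot.
# Assumed example geometry: 8 cache blocks of 4 bytes each.
N_BLOCKS = 8
BLOCK_SIZE = 4  # bytes

def cache_index(byte_address):
    block_address = byte_address // BLOCK_SIZE
    # "memory block address modulo the number of blocks in the cache"
    return block_address % N_BLOCKS

# Byte addresses 0 and 32 collide: both map to cache index 0.
print(cache_index(0), cache_index(32), cache_index(12))  # → 0 0 3
```

The collision between addresses 0 and 32 is exactly the conflict that higher associativity (later slides) is meant to avoid.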
Direct Mapped Cache
[Figure: 32-bit address split into tag (bits 31–12), index (bits 11–2), and byte offset (bits 1–0); the 10-bit index selects one of 1024 entries, each holding a valid bit, a 20-bit tag, and a 32-bit data word; comparing the stored tag with the address tag produces the hit signal]
• Q: What kind of locality are we taking advantage of in this example?
Direct Mapped Cache
• This example (also) exploits spatial locality by having larger, multi-word blocks:
[Figure: address split with an additional block-offset field selecting a word within the block]
Hits vs. Misses • Read hits • this is what we want! • Read misses • stall the CPU, fetch block from memory, deliver to cache, restart the load instruction • Write hits: • can replace data in cache and memory (write-through) • write the data only into the cache (write-back the cache later) • Write misses: • read the entire block into the cache, then write the word (allocate on write miss) • do not read the cache line; just write to memory (no allocate on write miss)
Splitting first level cache
• Use split Instruction and Data caches
  • the two caches can be tuned differently
  • avoids a dual-ported cache
[Figure: CPU connected to split L1 I$ and D$, backed by a unified L2 (I&D $) cache and Main Memory]
Let’s look at cache & memory performance

Texec = Ncycles × Tcycle = Ninst × CPI × Tcycle

with CPI = CPIideal + CPIstall
CPIstall = %reads × missrateread × misspenaltyread + %writes × missratewrite × misspenaltywrite

or:

Texec = (Nnormal-cycles + Nstall-cycles) × Tcycle

with Nstall-cycles = Nreads × missrateread × misspenaltyread + Nwrites × missratewrite × misspenaltywrite (+ write-buffer stalls)
Performance example (1)
• Assume an application with:
  • I-cache miss rate 2%
  • D-cache miss rate 4%
  • fraction of ld-st instructions = 36%
  • CPI ideal (i.e. without cache misses) is 2.0
  • miss penalty 40 cycles
• Calculate CPI taking misses into account:
CPI = 2.0 + CPIstall
CPIstall = instruction-miss cycles + data-miss cycles
Instruction-miss cycles = Ninstr × 0.02 × 40 = 0.80 Ninstr
Data-miss cycles = Ninstr × 0.36 × 0.04 × 40 = 0.576 Ninstr
CPI = 2.0 + 0.80 + 0.576 ≈ 3.38
Slowdown: 3.38 / 2.0 ≈ 1.69 !!
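The calculation can be checked mechanically; this sketch just plugs in the slide's numbers and prints the unrounded result:

```python
# Stall-CPI calculation with the parameters from the example.
cpi_ideal = 2.0
i_missrate, d_missrate = 0.02, 0.04
ldst_fraction = 0.36
penalty = 40  # cycles

instr_miss = i_missrate * penalty                  # 0.80 stall cycles per instruction
data_miss = ldst_fraction * d_missrate * penalty   # 0.576 stall cycles per instruction
cpi = cpi_ideal + instr_miss + data_miss           # 3.376
print(round(cpi, 3), round(cpi / cpi_ideal, 2))    # CPI and slowdown
```

Note that the memory-stall cycles (1.376 per instruction) are of the same order as the ideal CPI itself.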
Performance example (2)
1. What if the ideal processor had CPI = 1.0 (instead of 2.0)?
  • CPI = 1.0 + 1.376 ≈ 2.38, so the slowdown would be 2.38 !
2. What if the processor is clocked twice as fast?
  • the miss penalty becomes 80 cycles
  • CPI = 2.0 + 0.02 × 80 + 0.36 × 0.04 × 80 = 4.75
  • Speedup = N·CPIa·Tclock / (N·CPIb·Tclock/2) = 3.38 / (4.75/2)
  • Speedup is not 2, but only 1.42 !!
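The clock-doubling case can be verified the same way; this sketch reuses the parameters from example (1):

```python
# Doubling the clock doubles the miss penalty measured in cycles.
cpi_ideal = 2.0
# I-miss + D-miss stall cycles per instruction, per cycle of penalty:
stalls_per_penalty_cycle = 0.02 + 0.36 * 0.04  # 0.0344

cpi_a = cpi_ideal + stalls_per_penalty_cycle * 40  # original clock
cpi_b = cpi_ideal + stalls_per_penalty_cycle * 80  # doubled clock, penalty 80 cycles

# Speedup = (N * CPI_a * Tclock) / (N * CPI_b * Tclock / 2)
speedup = cpi_a / (cpi_b / 2)
print(round(cpi_b, 3), round(speedup, 2))
```

The speedup stays well below 2 because the absolute memory latency is unchanged; only the compute part of execution time shrinks.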
Improving cache / memory performance • Ways of improving performance: • decreasing the miss ratio (avoiding conflicts): associativity • decreasing the miss penalty: multilevel caches • Adapting block size: see earlier slides • Note: there are many more ways to improve memory performance (see e.g. master course 5MD00)
How to reduce CPIstall?
CPIstall = %reads × missrateread × misspenaltyread + %writes × missratewrite × misspenaltywrite
Reduce the miss rate:
• Larger cache
  • avoids capacity misses
  • however, a large cache may increase Tcycle
• Larger block (line) size
  • exploits spatial locality: see previous lecture
• Associative cache
  • avoids conflict misses
Reduce the miss penalty:
• Add a 2nd level of cache
Decreasing miss ratio with associativity
[Figure: the same cache organized with 2 blocks/set, 4 blocks/set, and 8 blocks/set]
Performance of Associative Caches
[Figure: miss rate versus associativity for cache sizes of 1 KB, 2 KB, and 8 KB]
Further Cache Basics
• cache_size = Nsets × associativity × block_size
• block_address = byte_address DIV block_size_in_bytes
• index = block_address MOD Nsets
• Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently
[Address layout, bit 31 … 0: | tag | index | block offset |]
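With power-of-two sizes, DIV and MOD become shifts and masks; a minimal sketch of the address split, with assumed example sizes:

```python
# Split a byte address into (tag, index, block offset),
# assuming block_size and n_sets are powers of two.
def split_address(addr, block_size, n_sets):
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = n_sets.bit_length() - 1        # log2(n_sets)
    offset = addr & (block_size - 1)            # addr MOD block_size
    index = (addr >> offset_bits) & (n_sets - 1)  # block_address MOD n_sets
    tag = addr >> (offset_bits + index_bits)    # the remaining high bits
    return tag, index, offset

# Assumed example: 16-byte blocks, 256 sets -> 4 offset bits, 8 index bits.
print(split_address(0x12345678, 16, 256))  # → (0x12345, 0x67, 0x8)
```

The same mask-and-shift logic is what the hardware implements with wires rather than arithmetic.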
Comparing different (1-level) caches (1)
• Assume:
  • cache of 4K blocks
  • 4-word (16-byte) block size
  • 32-bit address
• Direct mapped (associativity = 1):
  • 16 bytes per block = 2^4, so 32 − 4 = 28 bits for index and tag
  • #sets = #blocks / associativity = 4K, so log2(4K) = 12 bits for the index
  • total number of tag bits: (28 − 12) × 4K = 64 Kbits
• 2-way set associative:
  • #sets = #blocks / associativity = 2K sets
  • 1 bit less for indexing, 1 bit more for the tag
  • tag bits: (28 − 11) × 2 × 2K = 68 Kbits
• 4-way set associative:
  • #sets = #blocks / associativity = 1K sets
  • 1 bit less for indexing, 1 bit more for the tag
  • tag bits: (28 − 10) × 4 × 1K = 72 Kbits
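The three tag-storage totals above can be reproduced with one small function (a sketch of the slide's arithmetic, not of any real hardware):

```python
# Total tag storage for a cache of n_blocks blocks, 16-byte blocks, 32-bit addresses.
def tag_bits_total(n_blocks, assoc, block_bytes=16, addr_bits=32):
    offset_bits = (block_bytes - 1).bit_length()   # 4 for 16-byte blocks
    n_sets = n_blocks // assoc
    index_bits = (n_sets - 1).bit_length()         # log2(n_sets)
    tag_bits = addr_bits - offset_bits - index_bits
    return tag_bits * n_blocks                     # one tag per block

for assoc in (1, 2, 4):
    print(assoc, tag_bits_total(4096, assoc) // 1024, "Kbits")
```

Each halving of the set count moves one bit from the index to the tag, and that extra bit is stored once per block, which is why the total grows with associativity.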
Comparing different (1-level) caches (2)
• 3 caches, each consisting of 4 one-word blocks:
  • Cache 1: fully associative
  • Cache 2: two-way set associative
  • Cache 3: direct mapped
• Suppose the following sequence of block addresses: 0, 8, 0, 6, 8
Direct Mapped
[Figure: cache contents after each access; coloured = new entry = miss. Block addresses 0 and 8 both map to set 0, so they keep evicting each other]
2-way Set Associative: 2 sets
[Figure: cache contents after each access; block addresses 0, 8 and 6 are all even, so they all map to set 0. On a miss in a full set, the LEAST RECENTLY USED block is replaced]
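All three organizations can be run through the reference sequence with one small simulator (a sketch assuming LRU replacement within each set, as in the slides):

```python
# Count misses for the block-address sequence on a cache of 4 one-word blocks.
def count_misses(n_sets, assoc, refs):
    sets = [[] for _ in range(n_sets)]   # each set: blocks ordered LRU-first
    misses = 0
    for block in refs:
        s = sets[block % n_sets]
        if block in s:
            s.remove(block)              # hit: refresh its LRU position
        else:
            misses += 1
            if len(s) == assoc:
                s.pop(0)                 # set full: evict least recently used
        s.append(block)                  # most recently used goes last
    return misses

refs = [0, 8, 0, 6, 8]
print(count_misses(4, 1, refs),   # direct mapped
      count_misses(2, 2, refs),   # 2-way set associative
      count_misses(1, 4, refs))   # fully associative
```

This prints 5, 4 and 3 misses respectively: for this access pattern every step up in associativity removes one conflict miss, while the three compulsory misses (first touches of 0, 8 and 6) remain in all cases.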
Review: Four Questions for Memory Hierarchy Designers • Q1: Where can a block be placed in the upper level? (Block placement) • Fully Associative, Set Associative, Direct Mapped • Q2: How is a block found if it is in the upper level? (Block identification) • Tag/Block • Q3: Which block should be replaced on a miss? (Block replacement) • Random, FIFO, LRU • Q4: What happens on a write? (Write strategy) • Write Back or Write Through (with Write Buffer)
Classifying Misses: the 3 Cs
• Compulsory: the first access to a block is always a miss; also called cold-start misses
  • = the misses in an infinite cache
• Capacity: misses resulting from the finite capacity of the cache
  • = the misses in a fully associative cache with an optimal replacement strategy
• Conflict: misses occurring because several blocks map to the same set; also called collision misses
  • = the remaining misses
3 Cs: Compulsory, Capacity, Conflict
In all cases, assume the total cache size is not changed. What happens if we:
1) Change the block size: which of the 3 Cs is obviously affected? compulsory misses
2) Change the cache size: which of the 3 Cs is obviously affected? capacity misses
3) Introduce higher associativity: which of the 3 Cs is obviously affected? conflict misses
3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate versus cache size, broken down per miss type (compulsory, capacity, conflict) for SPEC92]
Second Level Cache (L2)
• Most CPUs:
  • have an L1 cache small enough to match the cycle time (reduce the time to hit the cache)
  • have an L2 cache large enough and with sufficient associativity to capture most memory accesses (reduce the miss rate)
• L2 equations, Average Memory Access Time (AMAT):
AMAT = Hit TimeL1 + Miss RateL1 × Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 × Miss PenaltyL2
AMAT = Hit TimeL1 + Miss RateL1 × (Hit TimeL2 + Miss RateL2 × Miss PenaltyL2)
• Definitions:
  • Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss RateL2)
  • Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss RateL1 × Miss RateL2)
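The AMAT equations above can be exercised with a quick sketch; the hit times and miss rates below are assumed example values, not from the slides:

```python
# Two-level AMAT, in clock cycles. Assumed numbers: 1-cycle L1 hit,
# 10-cycle L2 hit, 100-cycle penalty on an L2 miss.
hit_l1, hit_l2, penalty_l2 = 1, 10, 100
missrate_l1 = 0.05   # local L1 miss rate
missrate_l2 = 0.40   # LOCAL L2 miss rate (of accesses reaching L2)

miss_penalty_l1 = hit_l2 + missrate_l2 * penalty_l2        # 50 cycles
amat = hit_l1 + missrate_l1 * miss_penalty_l1              # 3.5 cycles
global_missrate_l2 = missrate_l1 * missrate_l2             # 0.02
print(round(amat, 2), round(global_missrate_l2, 3))
```

Note how a large local L2 miss rate (40%) still corresponds to a small global miss rate (2%), which is why the global rate is the meaningful figure for L2.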
Second Level Cache (L2)
• Suppose a processor with a base CPI of 1.0
  • clock rate of 500 MHz, so 2 ns per cycle
  • main memory access time: 200 ns
  • miss rate per instruction, primary cache: 5%
• What improvement does a second-level cache with a 20 ns access time give, reducing the miss rate to memory to 2%?
  • miss penalty to memory: 200 ns / 2 ns per cycle = 100 clock cycles
  • effective CPI = base CPI + memory stalls per instruction
  • 1-level cache: total CPI = 1 + 5% × 100 = 6
  • 2-level cache: a miss in the first-level cache is satisfied by the second-level cache or by memory
    • access to the second-level cache: 20 ns / 2 ns per cycle = 10 clock cycles
    • a miss in the second-level cache means accessing memory: in 2% of the cases
    • total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
    • total CPI = 1 + 5% × 10 + 2% × 100 = 3.5
• Machine with L2 cache: 6 / 3.5 ≈ 1.7 times faster
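The same numbers, computed from the slide's parameters:

```python
# Two-level CPI example: 500 MHz clock, 200 ns memory, 20 ns L2.
base_cpi = 1.0
clock_ns = 2.0                           # 500 MHz -> 2 ns per cycle
mem_access_ns, l2_access_ns = 200, 20
l1_missrate, mem_missrate = 0.05, 0.02   # misses per instruction

mem_penalty = mem_access_ns / clock_ns   # 100 cycles
l2_penalty = l2_access_ns / clock_ns     # 10 cycles

cpi_l1_only = base_cpi + l1_missrate * mem_penalty                       # 6.0
cpi_with_l2 = base_cpi + l1_missrate * l2_penalty + mem_missrate * mem_penalty  # 3.5
print(cpi_l1_only, cpi_with_l2, round(cpi_l1_only / cpi_with_l2, 2))
```

Note the model charges the full memory penalty only for the 2% of instructions whose misses go all the way to memory, while every L1 miss pays the 10-cycle L2 access.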
Second Level Cache
• The global miss rate approaches the miss rate of the second-level cache alone, provided the L2 cache is much bigger than the L1.
• The local miss rate is NOT a good measure for secondary caches, as it is a function of the L1 cache.
• Therefore the global miss rate should be used.
How to connect the cache to next level? • Make reading multiple words easier by using banks of memory • It can get a lot more complicated...