EENG 449bG/CPSC 439bG Computer Systems
Lecture 17: Memory Hierarchy Design, Part I
April 7, 2005
Prof. Andreas Savvides, Spring 2005
http://www.eng.yale.edu/courses/2005s/eeng449b
Who Cares About the Memory Hierarchy? The CPU-DRAM Gap
[Figure: processor vs. DRAM performance, 1980-2000, log scale. Processor performance ("Moore's Law") improves ~60%/year while DRAM improves only ~7%/year ("Less' Law?"), so the processor-memory performance gap grows ~50%/year.]
• 1980: no cache in microprocessors; 1995: 2-level cache on chip (1989: first Intel microprocessor with an on-chip cache)
Review of Caches
• Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU
• Cache hit / cache miss: the requested data is found / not found in the cache
• Block – a fixed-size collection of data containing the requested word
• Spatial / temporal locality
• Latency – determines the time to retrieve the first word of the block
• Bandwidth – determines the time to retrieve the rest of the block
• The address space is broken into fixed-size blocks called pages
• Page fault – the CPU references something that is in neither the cache nor main memory (it must be fetched from disk)
Generations of Microprocessors
• Time of a full cache miss, measured in instructions executed:
  1st Alpha: 340 ns / 5.0 ns = 68 clks x 2 instr/clk = 136 instructions
  2nd Alpha: 266 ns / 3.3 ns = 80 clks x 4 instr/clk = 320 instructions
  3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6 instr/clk = 648 instructions
• 1/2x latency x 3x clock rate x 3x instr/clock => roughly 5x more instructions lost per miss
Processor-Memory Performance Gap "Tax"

  Processor          % Area (cost)   % Transistors (power)
  Alpha 21164        37%             77%
  StrongARM SA-110   61%             94%
  Pentium Pro        64%             88%

• Pentium Pro: 2 dies per package: Proc/I$/D$ + L2$
• Caches have no "inherent value"; they only try to close the performance gap
What is a cache?
• Small, fast storage used to improve average access time to slow memory
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  • Registers: "a cache" on variables – software managed
  • First-level cache: a cache on the second-level cache
  • Second-level cache: a cache on memory
  • Memory: a cache on disk (virtual memory)
  • TLB: a cache on the page table
  • Branch prediction: a cache on prediction information?
[Figure: the hierarchy Proc/Regs -> L1 cache -> L2 cache -> Memory -> Disk/Tape; each level is bigger and slower than the one above it]
Review: Cache Performance
• Miss-oriented approach to memory access:
  CPU time = IC x (CPI_Execution + Mem accesses/instr x Miss rate x Miss penalty) x Clock cycle time
  • CPI_Execution includes ALU and memory instructions
• Separating out the memory component entirely:
  CPU time = IC x (ALU ops/instr x CPI_ALUOps + Mem accesses/instr x AMAT) x Clock cycle time
  • AMAT = Average Memory Access Time = Hit time + Miss rate x Miss penalty
  • CPI_ALUOps does not include memory instructions
Impact on Performance
• Suppose a processor executes at:
  • Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  • 50% arith/logic, 30% ld/st, 20% control
• Suppose 10% of memory operations incur a 50-cycle miss penalty
• Suppose 1% of instructions incur the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/instr)
      + [0.30 (data mops/instr) x 0.10 (miss/data mop) x 50 (cycles/miss)]
      + [1 (inst mop/instr) x 0.01 (miss/inst mop) x 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/instr = 3.1
• About 65% of the time (2.0 / 3.1) the processor is stalled waiting for memory!
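The same arithmetic in compilable form: a minimal C sketch (not part of the lecture; the variable names are illustrative and the inputs are just the example's assumptions):

```c
#include <stdio.h>

/* Reproduces the CPI-with-stalls calculation above. */
int main(void) {
    double ideal_cpi    = 1.1;   /* CPI with a perfect cache          */
    double data_ops     = 0.30;  /* data memory ops per instruction   */
    double data_miss    = 0.10;  /* miss rate for data operations     */
    double inst_miss    = 0.01;  /* miss rate for instruction fetches */
    double miss_penalty = 50.0;  /* cycles per miss                   */

    double stalls = data_ops * data_miss * miss_penalty   /* 1.5 */
                  + 1.0 * inst_miss * miss_penalty;       /* 0.5 */
    double cpi = ideal_cpi + stalls;                      /* 3.1 */

    printf("CPI = %.1f, stalled fraction = %.0f%%\n",
           cpi, 100.0 * stalls / cpi);                    /* ~65% */
    return 0;
}
```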
Traditional Four Questions for Memory Hierarchy Designers • Q1: Where can a block be placed in the upper level? (Block placement) • Fully Associative, Set Associative, Direct Mapped • Q2: How is a block found if it is in the upper level? (Block identification) • Tag/Block • Q3: Which block should be replaced on a miss? (Block replacement) • Random, LRU • Q4: What happens on a write? (Write strategy) • Write Back or Write Through (with Write Buffer)
Set Associativity
• Direct mapped = one-way set associative
  • (Block address) MOD (Number of blocks in cache)
• Fully associative = set associative with one set
  • A block can be placed anywhere in the cache
• Set associative – a block can be placed in a restricted set of places
  • The block is first mapped to a set and can then be placed anywhere within that set
  • (Block address) MOD (Number of sets in cache)
  • If there are n blocks in a set, the cache is n-way set associative
• Most popular cache configurations in today's processors: direct mapped, 2-way set associative, 4-way set associative (see the sketch below)
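The placement rule reduces to a modulus; this small C sketch (function and parameter names are illustrative, not from the lecture) shows all three organizations as special cases:

```c
/* Map a block address to a set; num_sets determines the organization. */
unsigned set_index(unsigned long block_addr, unsigned num_sets) {
    /* (Block address) MOD (Number of sets in cache) */
    return (unsigned)(block_addr % num_sets);
}
/* Direct mapped:      num_sets == num_blocks (one block per set)   */
/* n-way associative:  num_sets == num_blocks / n                   */
/* Fully associative:  num_sets == 1, so every block maps to set 0  */
```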
Q2: How is a block found if it is in the cache?
[Figure: the address divided into tag, index, and block offset fields. The index selects the set, the block offset selects the desired data from the block, and the tag is compared against for a hit.]
• If the cache size remains the same, increasing associativity increases the number of blocks per set => the index field shrinks and the tag field grows
Q3: Which Block Should be Replaced on a Cache Miss?
• Direct-mapped cache
  • No choice – a single block is checked for a hit; if there is a miss, data is fetched into that block
• Fully associative and set associative
  • Random
  • Least Recently Used (LRU) – exploits locality principles
  • First In, First Out (FIFO) – approximates LRU
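A hedged sketch of how LRU victim selection might be bookkept, assuming a per-way timestamp (all names are hypothetical; real hardware uses cheaper approximations):

```c
/* One set of a 4-way cache; each way records the "time" of its last
 * use, and the victim is the valid way with the oldest timestamp. */
#define WAYS 4
typedef struct { unsigned long tag, last_used; int valid; } Way;

int pick_victim(const Way set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid) return i;   /* an empty way is always best  */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                /* older than current candidate */
    }
    return victim;
}
```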
Q4: What Happens on a Write?
• Cache accesses are dominated by reads
  • E.g., on MIPS, 10% stores and 37% loads => writes are about 21% of cache traffic
• Writes are much slower than reads
  • A block can be read from the cache at the same time its tag is read and compared; if the read is a hit, the data is passed to the CPU immediately
  • Writing cannot begin until the tag check confirms the address is a hit
• Write through – information is written to both the cache and the lower-level memory
• Write back – information is written only to the cache, and to memory only when the block is replaced
  • Dirty bit – indicates whether a block has been changed while in the cache
Write Through vs. Write Back
• Write back – individual writes occur at cache speed, and multiple writes to a block require only one write to lower-level memory when the block is written back
• Write through – slower, BUT the cache is always clean
  • Cache read misses never result in writes to the lower level
  • The next lower level of the hierarchy always has the most current copy of the data
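To make the two policies concrete, here is a minimal C sketch, assuming a 64-byte line and a placeholder memory_write helper (both hypothetical, not from the lecture):

```c
typedef struct {
    unsigned long tag;
    int valid, dirty;
    unsigned char data[64];
} Line;

/* Assumed helper: writes one whole block to the next memory level. */
void memory_write(unsigned long addr, const unsigned char *block);

/* Write through: update the cache AND the lower level on every write. */
void write_through(Line *l, unsigned long addr, unsigned off, unsigned char b) {
    l->data[off] = b;
    memory_write(addr, l->data);       /* lower level stays current */
}

/* Write back: update the cache only and mark the line dirty. */
void write_back(Line *l, unsigned off, unsigned char b) {
    l->data[off] = b;
    l->dirty = 1;                      /* memory is stale until eviction */
}

void evict(Line *l, unsigned long addr) {
    if (l->valid && l->dirty)
        memory_write(addr, l->data);   /* one write covers many stores */
    l->valid = l->dirty = 0;
}
```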
Example: Alpha 21264 Data Cache
• 2-way set associative
• Write back
• Each block has 64 bytes of data
  • The block offset selects the data we want
• Total cache size: 65,536 bytes
  • The 9-bit index (2^9 = 512) selects the set
• Tag comparison determines whether we have a hit
• A victim buffer helps with write back
Address Breakdown
• The physical address is 44 bits wide: a 38-bit block address and a 6-bit offset (2^6 = 64)
• Calculating the cache index field:
  • Blocks are 64 bytes, so the offset needs 6 bits
  • Index = log2(cache size / (block size x associativity)) = log2(65,536 / (64 x 2)) = 9 bits
  • Tag size = 38 - 9 = 29 bits
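The same breakdown expressed as bit manipulation, a small C sketch using only the numbers above (the sample address is arbitrary):

```c
#include <stdio.h>

/* 21264 D-cache address split: 44-bit physical address, 64-byte
 * blocks (6 offset bits), 512 sets (9 index bits), 29-bit tag. */
int main(void) {
    unsigned long long addr = 0x2B3C4D5E6F7ULL;      /* any 44-bit address */

    unsigned offset        = addr & 0x3F;            /* low 6 bits  */
    unsigned index         = (addr >> 6) & 0x1FF;    /* next 9 bits */
    unsigned long long tag = addr >> 15;             /* top 29 bits */

    printf("tag = 0x%llx, set = %u, offset = %u\n", tag, index, offset);
    return 0;
}
```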
Writing to the Cache
• If the word to be written is in the cache, the first 3 steps are the same as for a read
• The 21264 uses write back, so a block cannot simply be discarded on a miss
  • If the "victim" block was modified, its data and address are sent to the victim buffer
• The data cache alone cannot satisfy all of the processor's memory needs
  • Separate instruction and data caches may be needed
Unified vs. Split Caches
[Figure: two organizations: a processor with a unified L1 cache, vs. a processor with split L1 instruction (I-Cache-1) and data (D-Cache-1) caches; both back onto a unified L2 cache.]
• Unified vs. separate instruction and data caches
• Example:
  • 16KB I-cache + 16KB D-cache: inst miss rate = 0.64%, data miss rate = 6.47%
  • 32KB unified: aggregate miss rate = 1.99%
  • Using miss rate alone in the evaluation may be misleading!
• Which is better (ignoring the L2 cache)?
  • Assume 25% data ops, 75% of accesses from instructions (1.0/1.33)
  • Hit time = 1, miss time = 50
  • Note that a data hit incurs 1 extra stall in the unified cache (only one port)
  AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
  AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
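The AMAT arithmetic above, reproduced in a short C program (the inputs are the example's assumed miss rates and 50-cycle miss time):

```c
#include <stdio.h>

/* Split (Harvard) vs. unified L1: the extra unified-cache stall on
 * data hits models its single port. */
int main(void) {
    double f_inst = 0.75, f_data = 0.25;   /* access mix     */
    double miss_time = 50.0;               /* cycles         */

    double amat_split = f_inst * (1 + 0.0064 * miss_time)
                      + f_data * (1 + 0.0647 * miss_time);        /* ~2.05 */

    double amat_unified = f_inst * (1 + 0.0199 * miss_time)
                        + f_data * (1 + 1 + 0.0199 * miss_time);  /* ~2.24 */

    printf("AMAT split = %.2f, unified = %.2f\n", amat_split, amat_unified);
    return 0;
}
```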
Impact of Caches on Performance
• Consider an in-order execution computer
  • Cache miss penalty: 100 clock cycles; CPI = 1 (ignoring memory stalls)
  • Average miss rate 2%, and an average of 1.5 memory references per instruction
  • Average number of cache misses: 30 per 1000 instructions
• Performance with cache misses, using misses per instruction:
  CPU time = IC x (CPI + Misses/instr x Miss penalty) x Clock cycle time
           = IC x (1 + (30/1000) x 100) x Clock cycle time = 4 x IC x Clock cycle time
Impact of Caches on Performance
• Calculating the same performance using miss rate:
  CPU time = IC x (CPI + Mem refs/instr x Miss rate x Miss penalty) x Clock cycle time
           = IC x (1 + 1.5 x 2% x 100) x Clock cycle time = 4 x IC x Clock cycle time
  • A 4x increase in CPU time relative to a "perfect cache"
• No cache at all: 1.0 + 1.5 x 100 = 151 cycles per instruction – a factor of ~40 compared to a system with a cache
• Minimizing memory accesses does not always imply a reduction in CPU time
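The same calculation in compilable form (a sketch; all inputs are the example's assumptions, not measured values):

```c
#include <stdio.h>

/* CPU-time impact of misses: perfect cache vs. real cache vs. no cache. */
int main(void) {
    double base_cpi  = 1.0;    /* CPI with a perfect cache         */
    double refs      = 1.5;    /* memory references per instruction */
    double miss_rate = 0.02;
    double penalty   = 100.0;  /* cycles per miss                   */

    double cpi_cache   = base_cpi + refs * miss_rate * penalty;  /* 1 + 3 = 4 */
    double cpi_nocache = base_cpi + refs * penalty;              /* 151       */

    printf("perfect = %.0f, with misses = %.0f (%.0fx), no cache = %.0f\n",
           base_cpi, cpi_cache, cpi_cache / base_cpi, cpi_nocache);
    return 0;
}
```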
How to Improve Cache Performance?
Four main categories of optimizations:
1. Reduce the miss penalty – multilevel caches, critical word first, read miss before write miss, merging write buffers, victim caches
2. Reduce the miss rate – larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, compiler optimizations
3. Reduce the miss penalty or miss rate via parallelism – non-blocking caches, hardware prefetching, compiler prefetching
4. Reduce the time to hit in the cache – small and simple caches, avoiding address translation, pipelined cache access
Where Do Misses Come From?
• Classifying misses: the 3 Cs
• Compulsory – the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache)
• Capacity – if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
• Conflict – if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate vs. cache size, broken into compulsory, capacity, and conflict components across associativities; the conflict component shrinks as associativity grows.]
Cache Organization?
• Assume the total cache size is not changed. What happens if we:
  1. Change the block size?
  2. Change the associativity?
  3. Change the compiler?
• Which of the 3 Cs is obviously affected in each case?
Increasing Block Size • Larger block sizes reduce compulsory misses • Takes advantage of spatial locality • Larger blocks increase miss penalty • Our goal: reduce miss rate and miss penalty! • Block size selection depends on latency and bandwidth of lower level memory • High latency & high BW => large block sizes • Low latency & low BW => smaller block sizes
[Figure: miss rate vs. block size for fixed cache size and associativity: larger blocks reduce compulsory misses but eventually increase conflict misses.]
• What else drives up block size?
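One way to see the tension: with the total cache size fixed, bigger blocks mean fewer of them. A tiny C sketch (the 64 KB cache size is an arbitrary assumption):

```c
#include <stdio.h>

/* For a fixed-size cache, doubling the block size halves the number
 * of blocks: fewer distinct addresses can be resident at once, even
 * though each miss fetches more data. */
int main(void) {
    unsigned cache_bytes = 64 * 1024;               /* assumed 64 KB */
    for (unsigned block = 16; block <= 256; block *= 2)
        printf("block = %3u B -> %4u blocks\n", block, cache_bytes / block);
    return 0;
}
```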
Cache Size
• Old rule of thumb: doubling the cache size cuts the miss rate by about 25%
• Which of the 3 Cs does a larger cache reduce?
Increasing Associativity
• Higher set associativity improves miss rates
• 2:1 cache rule of thumb
  • A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
  • Holds for caches smaller than 128 KB
• Disadvantages
  • The reduced miss rate can come with an increased miss penalty
  • Greater associativity can come at the cost of increased hit time
[Figure: conflict misses as a function of cache size and associativity; higher associativity shrinks the conflict component.]
3Cs Relative Miss Rate
[Figure: the same 3Cs data normalized, showing compulsory, capacity, and conflict misses as fractions of the total miss rate for each configuration.]
• Flaw: the breakdown holds only for a fixed block size
• Good: insight => invention
Associativity vs. Cycle Time
• Beware: execution time is the only final measure!
• Why is cycle time tied to hit time?
• Will the clock cycle time increase with associativity?
Next Time • Cache Tradeoffs for Performance • Reducing Hit Times