This chapter explores various techniques to reduce cache miss rates, including understanding the causes of misses, the impact of block size, cache associativity, way prediction, and compiler optimizations.
Chapter 5 Memory Hierarchy Design
5.5 Reducing Miss Rate
• Causes of misses
  • Compulsory (also called cold-start or first-reference): the very first access to a block
  • Capacity: the cache cannot contain all the blocks the program needs
  • Conflict (also called collision or interference): multiple blocks in memory map onto the same cache block
• Breakdown of cache misses → Fig. 5.14, Fig. 5.15
  • Infinite-size cache: no capacity misses and no conflict misses → only compulsory misses
    • Compulsory misses are relatively small
  • Fully associative cache: no conflict misses → only compulsory and capacity misses
  • Infinite size → small size: capacity misses increase
  • Full associativity → lower associativity: conflict misses increase
• Thrashing: data movement to/from the lower-level memory dominates → most of the time is spent moving data
Larger Block Size
• Same cache size, different block size
• Larger block size
  • Reduces the miss rate due to spatial locality
  • May increase conflict and capacity misses
  • May increase the miss penalty
• Miss rate vs. block size → Fig. 5.16 & Fig. 5.17
Larger Block Size & Larger Caches
• Larger block size (continued)
  • Ex) Which block size has the smallest average memory access time?
    • Assumption: a memory access takes 80 cycles of latency, then delivers 16 bytes every 2 clock cycles
  • Ans) Average memory access time = Hit_time + Miss_rate × Miss_penalty (the two cases below are reproduced in the sketch after this slide)
    • 1 KB cache w/ 16-byte block: miss penalty = 80 + 2 = 82 cycles
      average memory access time = 1 + (15.05% × 82) = 13.341 cycles
    • 256 KB cache w/ 256-byte block: miss penalty = 80 + (256/16) × 2 = 112 cycles
      average memory access time = 1 + (0.49% × 112) = 1.549 cycles
    • Full results → Fig. 5.18
  • Popular block sizes: 4 KB cache → 32 bytes, larger caches → 64 bytes
  • The choice of block size depends on the lower-level memory's performance:
    • High latency and high bandwidth → larger block size
    • Low latency and low bandwidth → smaller block size
• Larger caches
  • Reduce capacity misses
  • Increase hit time and cost
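The two worked numbers above can be checked mechanically. Below is a minimal C sketch assuming the slide's parameters (1-cycle hit time, 80-cycle latency, 16 bytes delivered every 2 cycles); the helper name amat is hypothetical, and only the two miss rates quoted on the slide are used — other block sizes would take their miss rates from Fig. 5.16.

    /* Minimal sketch: average memory access time for the two cases on the slide. */
    #include <stdio.h>

    static double amat(double hit_time, double miss_rate, int block_bytes) {
        /* miss penalty = 80-cycle latency + 2 cycles per 16 bytes transferred */
        double miss_penalty = 80.0 + (block_bytes / 16.0) * 2.0;
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* 1 KB cache, 16-byte block, 15.05% miss rate -> 13.341 cycles */
        printf("1 KB,   16-byte block: %.3f cycles\n", amat(1.0, 0.1505, 16));
        /* 256 KB cache, 256-byte block, 0.49% miss rate -> 1.549 cycles */
        printf("256 KB, 256-byte block: %.3f cycles\n", amat(1.0, 0.0049, 256));
        return 0;
    }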
Associativity
• 1-way set associative (direct-mapped): a block from main memory can be stored in only one place in the cache
  [diagram: CPU, 1st-level cache (blocks 1–4), 2nd-level cache (blocks 1–4), main memory]
Associativity
• 2-way set associative: a block from main memory can be stored in either of two places in the cache (the mapping is sketched below)
  [diagram: CPU, 1st-level cache (two ways of blocks 1–4), 2nd-level cache (blocks 1–4), main memory]
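Numerically, the difference between the two diagrams is only how many block frames share a set: with S sets, main-memory block B maps to set B mod S, and the associativity decides how many frames inside that set may hold it. A minimal C sketch, using a hypothetical 4-frame cache like the one in the figures:

    /* Minimal sketch (illustrative constants, not from the text):
       where a main-memory block can be placed in a small 4-frame cache. */
    #include <stdio.h>

    #define NUM_FRAMES 4   /* assumption: a tiny cache with 4 block frames */

    static unsigned set_index(unsigned block_addr, unsigned assoc) {
        unsigned num_sets = NUM_FRAMES / assoc;   /* direct-mapped: 4 sets; 2-way: 2 sets */
        return block_addr % num_sets;
    }

    int main(void) {
        unsigned block_addr = 3;   /* an arbitrary main-memory block number */
        printf("direct-mapped: set %u (exactly one possible frame)\n", set_index(block_addr, 1));
        printf("2-way:         set %u (either of that set's two frames)\n", set_index(block_addr, 2));
        return 0;
    }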
Higher Associativity
• Miss rate improves with higher associativity
• Two general rules of thumb
  • Eight-way associativity is almost as effective as full associativity
  • 2:1 cache rule: a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
• Drawback: hit time increases with associativity
• Example) Average memory access time
  • Clock cycle time(2-way) = 1.36 × clock cycle time(1-way)
  • Clock cycle time(4-way) = 1.44 × clock cycle time(1-way)
  • Clock cycle time(8-way) = 1.52 × clock cycle time(1-way)
  • Miss penalty for the direct-mapped case is 25 clock cycles to an L2 cache that never misses
  • Miss rates → Fig. 5.14
Higher Associativity
• Example) Average memory access time
• Ans)
  • Average memory access time(8-way) = 1.52 + Miss_rate(8-way) × 25
    • If 512 KB, then 1.52 + 0.006 × 25 ≈ 1.66
  • Average memory access time(4-way) = 1.44 + Miss_rate(4-way) × 25
  • Average memory access time(2-way) = 1.36 + Miss_rate(2-way) × 25
  • Average memory access time(1-way) = 1.00 + Miss_rate(1-way) × 25
    • If 4 KB, then 1.00 + 0.098 × 25 ≈ 3.44
• Results → Fig. 5.19 (the calculation is sketched after this slide)
  • Four-way is better up to 8 KB
  • Direct-mapped is better from 16 KB upward
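A minimal C sketch of the same calculation, assuming the slide's relative hit times (1.00, 1.36, 1.44, 1.52) and the 25-cycle miss penalty; the miss-rate values in the array are illustrative placeholders standing in for one column of Fig. 5.14, not the book's data.

    /* Minimal sketch: AMAT vs. associativity for one cache size.
       Miss rates below are placeholders; real values come from Fig. 5.14. */
    #include <stdio.h>

    int main(void) {
        int    ways[]       = { 1, 2, 4, 8 };
        double hit_time[]   = { 1.00, 1.36, 1.44, 1.52 };    /* relative clock cycle times */
        double miss_rate[]  = { 0.098, 0.080, 0.075, 0.070 }; /* illustrative placeholders */
        double miss_penalty = 25.0;

        for (int i = 0; i < 4; i++)
            printf("%d-way: AMAT = %.2f cycles\n",
                   ways[i], hit_time[i] + miss_rate[i] * miss_penalty);
        return 0;
    }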
Way Prediction and Pseudoassociative Caches
• Way prediction
  • Extra bits predict the way (the block within the set) that the next access will use in a set-associative cache
  • The multiplexer is set early to select the desired block
  • Hit: only a single tag comparison → the same speed as a direct-mapped cache
  • Miss: check the other blocks in subsequent clock cycles
  • Saves power by comparing a single block
  • Alpha 21264
    • 2-way set associative
    • A single prediction bit per block
    • Correct prediction: 1-cycle hit time
    • Incorrect prediction: 3-cycle hit time
    • Prediction accuracy: 85%
• Pseudoassociative (or column-associative) cache
  • Hit time → Fig. 5.20
  • On a hit, behaves the same as a direct-mapped cache
  • On a miss, checks a second cache entry in the "pseudoset"; a hit there is called a "pseudohit"
    • Eg) access the entry whose index has the most-significant bit inverted (sketched after this slide)
  • Drawbacks
    • A slow pseudohit may replace a fast hit
    • Slightly slower miss penalty
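A minimal C sketch of how the pseudoset index could be formed; the index and block-offset widths are illustrative assumptions, not values from the text. On a miss in the primary entry, the cache probes the entry whose index has its most-significant bit inverted.

    /* Minimal sketch (illustrative widths): primary and pseudoset indices
       for a direct-mapped cache with 2^INDEX_BITS sets. */
    #include <stdio.h>
    #include <stdint.h>

    #define INDEX_BITS  8                         /* assumption: 256 sets */
    #define BLOCK_BITS  5                         /* assumption: 32-byte blocks */
    #define INDEX_MASK  ((1u << INDEX_BITS) - 1)

    static uint32_t primary_index(uint32_t addr) {
        return (addr >> BLOCK_BITS) & INDEX_MASK;
    }

    static uint32_t pseudo_index(uint32_t index) {
        /* invert the MSB of the index to find the "pseudoset" entry */
        return index ^ (1u << (INDEX_BITS - 1));
    }

    int main(void) {
        uint32_t addr = 0x12345678u;
        uint32_t idx  = primary_index(addr);
        printf("primary index: 0x%02x, pseudoset index: 0x%02x\n", idx, pseudo_index(idx));
        return 0;
    }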
Compiler Optimizations
• Code optimization
  • Reordering procedures (or instructions) may reduce conflict misses
  • Aligning basic blocks so that the entry point is at the beginning of a cache block
    • Reduces cache misses for sequential code
• Data optimization: improve the spatial and temporal locality of data by reordering code
• Loop interchange
  • Original code: memory accesses with a stride of 100 words
      for (j = 0; j < 100; j++)
          for (i = 0; i < 5000; i++)
              x[i][j] = 2 * x[i][j];
  • Execution order
      x[0][0] = 2 * x[0][0]
      x[1][0] = 2 * x[1][0]
      x[2][0] = 2 * x[2][0]
      ……
      x[4999][0] = 2 * x[4999][0]
      x[0][1] = 2 * x[0][1]
      ……
  • [diagram: row-major layout of x in memory — x[0][0] … x[0][99], x[1][0] … x[1][99], x[2][0] … x[2][99], …]
Compiler Optimizations
• Data optimization (continued)
• Loop interchange (continued)
  • New code: memory accesses with a stride of 1 word → improved spatial locality
      for (i = 0; i < 5000; i++)
          for (j = 0; j < 100; j++)
              x[i][j] = 2 * x[i][j];
  • Execution order
      x[0][0] = 2 * x[0][0]
      x[0][1] = 2 * x[0][1]
      x[0][2] = 2 * x[0][2]
      ……
      x[0][99] = 2 * x[0][99]
      x[1][0] = 2 * x[1][0]
      ……
  • [diagram: row-major layout of x in memory — x[0][0] … x[0][99], x[1][0] … x[1][99], x[2][0] … x[2][99], …]
Compiler Optimizations
• Blocking: maximizes the number of accesses to the data loaded into the cache before that data is replaced
• Ex) matrix multiplication
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++) {
              r = 0;
              for (k = 0; k < N; k++)
                  r = r + y[i][k] * z[k][j];
              x[i][j] = r;
          }
  • Data access pattern: Fig. 5.21
• After the blocking transform (B is the blocking factor; x must start at zero, since each (jj, kk) block now adds only a partial sum)
      for (jj = 0; jj < N; jj = jj + B)
          for (kk = 0; kk < N; kk = kk + B)
              for (i = 0; i < N; i++)
                  for (j = jj; j < min(jj + B, N); j++) {
                      r = 0;
                      for (k = kk; k < min(kk + B, N); k++)
                          r = r + y[i][k] * z[k][j];
                      x[i][j] = x[i][j] + r;
                  }
  • Data access pattern: Fig. 5.22