
Improving Cache Performance and Reducing Miss Rate

This chapter explores various techniques to reduce cache miss rates, including understanding the causes of misses, the impact of block size, cache associativity, way prediction, and compiler optimizations.


Presentation Transcript


  1. Chapter 5 Memory Hierarchy Design

  2. 5.5 Reducing Miss Rate
  • Causes of misses
    • Compulsory (also called cold-start or first-reference): the very first access to a block
    • Capacity: the cache cannot contain all the blocks the program needs
    • Conflict (also called collision or interference): multiple memory blocks map to the same cache block (a conflict-miss sketch follows this slide)
  • Breakdown of cache misses → Fig. 5.14, Fig. 5.15
    • Infinite-size cache: no capacity misses and no conflict misses → only compulsory misses
    • Compulsory misses are relatively small
    • Fully associative cache: no conflict misses → only compulsory and capacity misses
    • Infinite size → small size: capacity misses increase
    • Full associativity → lower associativity: conflict misses increase
  • Thrashing: data moves back and forth to/from the lower-level memory, so most of the time is spent in data movement
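A minimal sketch of a conflict-miss pattern, under assumed cache parameters (a 32 KB direct-mapped cache with 64-byte blocks; neither number comes from the slides): two arrays placed exactly one cache size apart map to the same cache blocks, so they evict each other on every iteration even though only two words are live at a time.

    /* Hypothetical conflict-miss demo for an assumed 32 KB direct-mapped
     * cache with 64-byte blocks.  a[i] and b[i] sit CACHE_SIZE bytes
     * apart, so they map to the same cache block and evict each other on
     * every access, even though only two words are live per iteration. */
    #include <stdio.h>

    #define CACHE_SIZE (32 * 1024)              /* assumed cache size in bytes */
    #define N (CACHE_SIZE / sizeof(double))     /* doubles per cache-sized span */

    static double buf[2 * N];                   /* a and b: one cache size apart */

    int main(void) {
        double *a = &buf[0];
        double *b = &buf[N];                    /* same index bits as a[i] */
        double sum = 0.0;

        for (size_t i = 0; i < N; i++)
            sum += a[i] + b[i];                 /* the two accesses conflict */

        printf("sum = %f\n", sum);
        return 0;
    }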

  3. Larger Block Size
  • Same cache size, different block size
  • Larger block size
    • Reduces miss rate by exploiting spatial locality
    • May increase conflict and capacity misses (fewer blocks fit in the cache)
    • May increase miss penalty (more data transferred per miss)
  • Miss rate vs. block size → Fig. 5.16 & Fig. 5.17
  [diagram: two equal-size caches with different block sizes]

  4. Larger Block Size & Larger Caches
  • Larger block size (continued)
    • Ex) Which block size has the smallest average memory access time?
    • Assumption: memory access costs 80 cycles of latency plus 2 cycles per 16 bytes transferred
    • Ans) Average memory access time = Hit_time + Miss_rate × Miss_penalty
      • 1 KB cache with 16-byte blocks: 1 + (15.05% × 82) = 13.341 cycles
      • 256 KB cache with 256-byte blocks: 1 + (0.49% × 112) = 1.549 cycles
      • Full results → Fig. 5.18 (a small calculation sketch follows this slide)
    • Popular block sizes: 4 KB caches → 32 bytes; larger caches → 64 bytes
    • The best block size depends on the lower-level memory's performance:
      • High latency and high bandwidth → larger block size
      • Low latency and low bandwidth → smaller block size
  • Larger caches
    • Reduce capacity misses
    • Increase hit time and cost
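A minimal sketch of the arithmetic above; the 80-cycle latency, the 2 cycles per 16 bytes, and the two miss rates are the numbers quoted on the slide, and everything else is plain arithmetic.

    /* Average memory access time (AMAT) for the block-size example:
     * miss penalty = 80-cycle latency + 2 cycles per 16 bytes transferred. */
    #include <stdio.h>

    static double amat(double hit_time, double miss_rate, int block_bytes) {
        double miss_penalty = 80.0 + (block_bytes / 16.0) * 2.0;
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* 1 KB cache, 16-byte blocks, 15.05% miss rate -> 13.341 cycles */
        printf("1 KB,   16-byte blocks:  %.3f cycles\n", amat(1.0, 0.1505, 16));
        /* 256 KB cache, 256-byte blocks, 0.49% miss rate -> 1.549 cycles */
        printf("256 KB, 256-byte blocks: %.3f cycles\n", amat(1.0, 0.0049, 256));
        return 0;
    }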

  5. Associativity
  • 1-way set associative (direct-mapped): a block can be stored in only one place in the cache
  [diagram: CPU, caches with blocks 1–4, and main-memory block 3 mapping to a single cache location]

  6. Associativity
  • 2-way set associative: a block can be stored in either of two places in the cache (the set-index arithmetic is sketched after this slide)
  [diagram: CPU, caches with blocks 1–4, and main-memory block 3 mapping to two candidate cache locations]
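The mapping that the two diagrams illustrated can be written down directly. A small sketch, with an assumed 8 KB cache and 64-byte blocks (the slides do not give sizes): a direct-mapped cache gives a block exactly one candidate location, while an n-way set-associative cache gives it one set with n candidate ways.

    /* Set selection: index = (block address) mod (number of sets), where
     * number of sets = cache size / (block size * associativity).
     * The cache and block sizes here are assumptions for illustration. */
    #include <stdio.h>

    #define BLOCK_BYTES 64
    #define CACHE_BYTES (8 * 1024)

    static unsigned set_index(unsigned long addr, unsigned ways) {
        unsigned long block_addr = addr / BLOCK_BYTES;
        unsigned long num_sets   = CACHE_BYTES / (BLOCK_BYTES * ways);
        return (unsigned)(block_addr % num_sets);
    }

    int main(void) {
        unsigned long addr = 0x12345;
        printf("direct-mapped: set %u (1 candidate block)\n", set_index(addr, 1));
        printf("2-way:         set %u (2 candidate blocks)\n", set_index(addr, 2));
        return 0;
    }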

  7. Higher Associativity
  • Miss rate improves with higher associativity
  • Two general rules of thumb
    • Eight-way associativity is almost as effective as full associativity
    • 2:1 cache rule: a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
  • Drawback: hit time increases with associativity
  • Example) Average memory access time
    • Clock cycle time(2-way) = 1.36 × clock cycle time(1-way)
    • Clock cycle time(4-way) = 1.44 × clock cycle time(1-way)
    • Clock cycle time(8-way) = 1.52 × clock cycle time(1-way)
    • Miss penalty for the direct-mapped case is 25 clock cycles to an L2 cache that never misses
    • Miss rates → Fig. 5.14

  8. Higher Associativity
  • Example) Average memory access time
  • Ans)
    • Average memory access time(8-way) = 1.52 + Miss_rate(8-way) × 25
      • For 512 KB: 1.52 + 0.006 × 25 ≈ 1.66
    • Average memory access time(4-way) = 1.44 + Miss_rate(4-way) × 25
    • Average memory access time(2-way) = 1.36 + Miss_rate(2-way) × 25
    • Average memory access time(1-way) = 1.00 + Miss_rate(1-way) × 25
      • For 4 KB: 1.00 + 0.098 × 25 ≈ 3.44
  • Results → Fig. 5.19 (a calculation sketch follows this slide)
    • Four-way is better up to 8 KB
    • Direct-mapped is better from 16 KB upward
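The same kind of calculation as before, now with the hit time scaled by the clock-cycle factors from the previous slide and a fixed 25-cycle miss penalty; miss rates other than the two used here would be read from Fig. 5.14.

    /* AMAT with higher associativity: hit time = scaled clock cycle,
     * miss penalty = 25 cycles to an L2 cache that never misses. */
    #include <stdio.h>

    static double amat(double clock_scale, double miss_rate) {
        return clock_scale + miss_rate * 25.0;
    }

    int main(void) {
        /* the two data points worked out on the slide */
        printf("512 KB, 8-way: %.2f cycles\n", amat(1.52, 0.006));
        printf("4 KB,   1-way: %.2f cycles\n", amat(1.00, 0.098));
        return 0;
    }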

  9. Way Prediction and Pseudoassociative Caches
  • Way prediction:
    • Extra bits predict the way (the block within the set) of the next access in a set-associative cache
    • The multiplexer is set early to select the predicted block
    • Hit: only a single tag comparison → the same speed as a direct-mapped cache
    • Miss: check the other blocks in subsequent clock cycles
    • Saves power by comparing a single block
    • Alpha 21264
      • 2-way set associative
      • A single prediction bit per block
      • Correct prediction: 1-cycle hit time
      • Incorrect prediction: 3-cycle hit time
      • Prediction accuracy: 85%
  • Pseudoassociative (or column associative) caches
    • Hit time → Fig. 5.20
    • On a hit, behaves the same as a direct-mapped cache
    • On a miss, check a second cache entry in the "pseudoset" → a hit there is called a "pseudohit"
    • E.g.) access the entry with the most significant bit of the index inverted (sketched after this slide)
    • Drawbacks
      • A slow pseudohit may replace a fast hit
      • Slightly higher miss penalty
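A sketch of the pseudoset probe mentioned above; the "invert the most significant bit of the index" rule is from the slide, while the cache geometry (64-byte blocks, 1024 entries) and the helper names are made-up assumptions.

    /* Pseudoassociative lookup sketch: on a miss in the primary entry,
     * probe a second entry whose index has its most significant bit
     * inverted.  OFFSET_BITS and INDEX_BITS are assumed, not from the slide. */
    #include <stdio.h>

    #define OFFSET_BITS 6                          /* assumed 64-byte blocks     */
    #define INDEX_BITS 10                          /* assumed 1024 cache entries */
    #define INDEX_MASK ((1u << INDEX_BITS) - 1)

    static unsigned primary_index(unsigned long addr) {
        return (unsigned)((addr >> OFFSET_BITS) & INDEX_MASK);
    }

    static unsigned pseudo_index(unsigned idx) {
        return idx ^ (1u << (INDEX_BITS - 1));     /* invert MSB of the index */
    }

    int main(void) {
        unsigned idx = primary_index(0x12345);
        printf("primary entry %u, pseudoset entry %u\n", idx, pseudo_index(idx));
        return 0;
    }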

  10. Compiler Optimizations
  • Code optimizations
    • Reordering procedures (or instructions) may reduce conflict misses
    • Aligning basic blocks so that the entry point is at the beginning of a cache block
      • Reduces cache misses for sequential code
  • Data optimizations: improve the spatial and temporal locality of data accesses by reordering code
  • Loop interchange
    • Original code: memory accesses with a stride of 100 words
        for (j = 0; j < 100; j++)
            for (i = 0; i < 5000; i++)
                x[i][j] = 2 * x[i][j];
    • Execution order: x[0][0], x[1][0], x[2][0], ..., x[4999][0], x[0][1], ...
  [diagram: x stored in row-major order, x[0][0] ... x[0][99], x[1][0] ... x[1][99], x[2][0] ...]

  11. Compiler Optimizations
  • Data optimizations (continued)
  • Loop interchange (continued)
    • New code: memory accesses with a stride of 1 word → improved spatial locality
        for (i = 0; i < 5000; i++)
            for (j = 0; j < 100; j++)
                x[i][j] = 2 * x[i][j];
    • Execution order: x[0][0], x[0][1], x[0][2], ..., x[0][99], x[1][0], ...
  [diagram: the same row-major layout, now traversed sequentially]

  12. Compiler Optimizations
  • Blocking: maximizes reuse of the data loaded into the cache before the data are replaced
  • Ex) matrix multiplication
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                r = 0;
                for (k = 0; k < N; k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = r;
            }
    • Data access pattern → Fig. 5.21
  • After the blocking transform (note that x must now accumulate across the kk blocks; a runnable sketch follows this slide)
        for (jj = 0; jj < N; jj = jj + B)
            for (kk = 0; kk < N; kk = kk + B)
                for (i = 0; i < N; i++)
                    for (j = jj; j < min(jj + B, N); j++) {
                        r = 0;
                        for (k = kk; k < min(kk + B, N); k++)
                            r = r + y[i][k] * z[k][j];
                        x[i][j] = x[i][j] + r;
                    }
    • Data access pattern → Fig. 5.22
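A self-contained version of the blocked code above that compiles and runs as written; N, B, and the all-ones test data are assumptions for illustration. B would normally be chosen so that the B×B tile of z plus the strips of x and y it touches fit in the cache together.

    /* Blocked (tiled) matrix multiply following the transform on the slide.
     * The static arrays start zeroed, so x accumulates correctly across
     * the kk blocks.  N and B are illustrative values, not from the slide. */
    #include <stdio.h>

    #define N 256
    #define B 32                                   /* assumed blocking factor */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    static double x[N][N], y[N][N], z[N][N];

    static void matmul_blocked(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < MIN(jj + B, N); j++) {
                        double r = 0.0;
                        for (int k = kk; k < MIN(kk + B, N); k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] += r;              /* accumulate across kk blocks */
                    }
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { y[i][j] = 1.0; z[i][j] = 1.0; }
        matmul_blocked();
        printf("x[0][0] = %.1f (expected %d)\n", x[0][0], N);
        return 0;
    }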
