Chapter 5 Memory III
CSE 820
Miss Rate Reduction (cont’d) Michigan State University Computer Science and Engineering
Larger Block Size
• Reduces compulsory misses through spatial locality
• But,
  • miss penalty increases: higher bandwidth helps
  • miss rate can increase: fixed cache size + larger blocks means fewer blocks in the cache
Notice the “U” shape of miss rate versus block size: some is good, too much is bad.
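The block-size trade-off can be sketched with the standard formula AMAT = hit time + miss rate × miss penalty. This is a minimal sketch, not taken from the slides: the function names and the memory parameters (an 80-cycle latency and 16 bytes per cycle in the note below) are illustrative assumptions.

```c
#include <assert.h>

/* Miss penalty in cycles: fixed memory latency plus the time to transfer
   one block. latency_cycles and bytes_per_cycle are hypothetical values. */
double miss_penalty(double latency_cycles, double bytes_per_cycle, int block_bytes)
{
    assert(bytes_per_cycle > 0.0 && block_bytes > 0);
    return latency_cycles + block_bytes / bytes_per_cycle;
}

/* Average memory access time = hit time + miss rate * miss penalty. */
double amat(double hit_cycles, double miss_rate, double penalty_cycles)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_cycles + miss_rate * penalty_cycles;
}
```

With these assumed numbers, growing the block from 64 to 256 bytes raises the penalty from 84 to 96 cycles, so the larger block pays off only if the miss rate falls enough to compensate.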
Larger Caches
• Reduces capacity misses
• But
  • Increased hit time
  • Increased cost ($)
• Over time, L2 and higher-level cache sizes have increased
Higher Associativity
• Reduces miss rates with fewer conflicts
• But
  • Increased hit time (tag check)
• Note
  • An 8-way set-associative cache has close to the same miss rate as a fully associative cache
Way Prediction
Predict which way of an L1 cache will be accessed next
• Alpha 21264
  • correct prediction is 1 cycle
  • incorrect prediction is 3 cycles
• SPEC95 prediction is 85% correct
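Combining the two Alpha 21264 latencies with the 85% SPEC95 accuracy gives an average access time, assuming every access is either a correct or an incorrect prediction:

```latex
t_{\text{avg}} = 0.85 \times 1 + 0.15 \times 3 = 1.30 \ \text{cycles}
```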
Compiler Techniques
• Reduce conflicts in I-cache: a 1989 study showed misses reduced by 50% for a 2 KB cache and by 75% for an 8 KB cache
• D-cache performs differently
Compiler data optimizations
Loop Interchange
• Before
    for (j = …
      for (i = …
        x[i][j] = 2 * x[i][j];
• After
    for (i = …
      for (j = …
        x[i][j] = 2 * x[i][j];
• Improved Spatial Locality
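A runnable sketch of the interchange follows. The slide elides the loop bounds, so ROWS, COLS, and the function names here are hypothetical. C stores arrays row by row, which is why making j the inner loop turns a strided walk into a sequential one.

```c
#include <assert.h>

#define ROWS 4   /* hypothetical array dimensions */
#define COLS 8

/* Before: the inner loop walks down a column, so successive accesses
   are COLS elements apart in memory. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After: the inner loop walks along a row, so successive accesses
   are adjacent in memory. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Both orders compute the same values; only the access order differs. */
void check_same_result(void)
{
    int a[ROWS][COLS], b[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = b[i][j] = i * COLS + j;
    scale_column_order(a);
    scale_row_order(b);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            assert(a[i][j] == b[i][j]);
}
```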
Blocking: Improve Temporal and Spatial Locality
[Figure: array accesses before and after blocking]
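Blocking restructures a loop nest to work on sub-blocks small enough to stay in the cache, so data is reused before it is evicted. The slide's figure is not available, so the kernel below (matrix multiply) is an assumed example, and DIM, BLOCK, and the function names are hypothetical.

```c
#include <assert.h>

#define DIM   8   /* hypothetical matrix dimension */
#define BLOCK 4   /* hypothetical blocking factor; must divide DIM */

/* Unblocked: computing each x[i][j] traverses a whole column of z. */
void matmul_naive(double x[DIM][DIM], double y[DIM][DIM], double z[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double r = 0.0;
            for (int k = 0; k < DIM; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked: only a BLOCK x BLOCK tile of z is touched per (jj, kk) pass,
   so the tile is reused for every row i while it is still cached. */
void matmul_blocked(double x[DIM][DIM], double y[DIM][DIM], double z[DIM][DIM])
{
    assert(DIM % BLOCK == 0);
    /* Partial sums accumulate into x, so it must start at zero. */
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            x[i][j] = 0.0;
    for (int jj = 0; jj < DIM; jj += BLOCK)
        for (int kk = 0; kk < DIM; kk += BLOCK)
            for (int i = 0; i < DIM; i++)
                for (int j = jj; j < jj + BLOCK; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + BLOCK; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

The blocking factor would normally be chosen so that the tiles in use fit in the cache together; the value 4 here is only for illustration.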
Miss Rate and Miss Penalty Reduction via Parallelism
Nonblocking Caches
• Reduces stalls on cache miss
• A blocking cache refuses all requests while waiting for data
• A nonblocking cache continues to handle other requests while waiting for data on another request
• Increases cache controller complexity
Nonblocking Cache (8 KB direct-mapped L1; 32-byte blocks)
Hardware Prefetch
• Fetch two blocks: desired + next
• “Next” goes into “stream buffer”; on fetch, check stream buffer first
• Performance
  • Single-instruction stream buffer caught 15% to 25% of L1 misses
  • 4-instruction stream buffer caught 50%
  • 16-instruction stream buffer caught 72%
Hardware Prefetch
• Data prefetch
  • Single-data stream buffer caught 25% of L1 misses
  • 4-data stream buffer caught 43%
  • 8-data stream buffers caught 50% to 70%
• Prefetch from multiple addresses
  • UltraSPARC III handles 8 prefetches and calculates “stride” for next prediction
Software Prefetch
• Many processors such as Itanium have prefetch instructions
• Remember that they are nonfaulting: a prefetch of an invalid address does not raise an exception
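As a sketch of software prefetching from C, GCC and Clang expose a `__builtin_prefetch` hint that compiles to the target's prefetch instruction where one exists. The function name and the prefetch distance below are hypothetical; a useful distance depends on the memory latency and the work done per element.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_AHEAD 16   /* hypothetical prefetch distance, in elements */

/* Sums an array while hinting that a later element will be needed soon. */
long sum_with_prefetch(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint only: second argument 0 = read, third argument 3 = high
           temporal locality. The bounds check keeps the pointer arithmetic
           inside the array; the prefetch itself would not fault. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        total += a[i];
    }
    return total;
}
```

The result is identical with or without the hint; only the timing of cache misses can change.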
Hit Time Reduction
Small, Simple Caches
• Time
  • Indexing
  • Comparing tag
• Small: indexing is fast
• Simple: direct-mapped allows tag comparison in parallel with data load
• L2 with tag on chip and data off chip
Time vs cache size & organization
Perspective on previous graph
Same:
• 1 ns clock is 10^-9 sec/clockCycle
• 1 GHz is 10^9 clockCycles/sec
Therefore,
• 2 ns clock is 500 MHz
• 4 ns clock is 250 MHz
Conclude that small differences in ns represent large differences in MHz
Virtual vs Physical Address in L1
• Translating from virtual address to physical address as part of cache access takes time on the critical path
• Translation is needed for both index and tag
• Making the common case fast suggests avoiding translation for hits (misses must be translated)
Why are L1 caches physical? (almost all)
• Security (Protection): page-level protection must be checked on access (protection data can be copied into cache)
• Process switch can change virtual mapping, requiring cache flush (or Process ID) [see next slide]
• Synonyms: two virtual addresses for same (shared) physical address
Virtually-addressed cache context-switch cost
Hybrid: virtually indexed, physically tagged
• Index with the part of the page offset that is identical in virtual and physical addresses, i.e. the index bits are a subset of the page-offset bits
• In parallel with indexing, translate the virtual address to check the physical tag
• Limitation: direct-mapped cache ≤ page size (determined by address bits)
• Set-associative caches can be bigger since fewer bits are needed for index
Example
• Pentium III
  • 8 KB pages with 16 KB 2-way set-associative cache
• IBM 3033
  • 4 KB pages with 64 KB 16-way set-associative cache (note that 8-way is sufficient for miss rate, but 16-way is needed to keep index bits sufficiently small)
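The size limit behind both examples can be checked with a short sketch: when the index uses only page-offset bits, each way spans at most one page, so the cache can be at most page size × associativity. The function names are hypothetical.

```c
#include <assert.h>

/* Largest cache whose index (plus block offset) fits in the page offset:
   each way covers at most one page. */
unsigned long max_vipt_cache_bytes(unsigned long page_bytes, unsigned ways)
{
    assert(page_bytes > 0 && ways > 0);
    return page_bytes * ways;
}

/* Returns 1 if the cache can be indexed with untranslated bits only. */
int index_fits_in_page_offset(unsigned long cache_bytes,
                              unsigned long page_bytes, unsigned ways)
{
    return cache_bytes <= max_vipt_cache_bytes(page_bytes, ways);
}
```

With the slide's figures, 8 KB pages × 2 ways covers the 16 KB Pentium III cache, and 4 KB pages × 16 ways covers the 64 KB IBM 3033 cache, whereas 8 ways would allow only 32 KB.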
Trace Cache
• Pentium 4 NetBurst architecture
• I-cache blocks are organized to contain instruction traces, including predicted taken branches, instead of being organized around memory addresses
• Advantage over regular large cache blocks, which contain branches and, hence, many unused instructions; e.g. AMD Athlon 64-byte blocks contain 16-24 x86 instructions with 1-in-5 being branches
• Disadvantage: complex addressing
Trace Cache
• P4 trace cache (I-cache) is placed after decode and branch prediction, so it contains
  • μops
  • only desired instructions
• Trace cache contains 12K μops
• Branch-prediction BTB is 4K (33% improvement over PIII)
Summary (so far)
• Figure 5.26 summarizes all
Main-memory
Main-memory modifications can reduce cache miss penalty by bringing words faster from memory
• Wider path to memory brings in more words at a time, e.g. one address request brings in 4 words (reduces overhead)
• Interleaved memory can allow memory to respond faster
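The effect of the two organizations on miss penalty can be sketched with a simple timing model. The struct, the function names, and the cycle counts used in the note below are illustrative assumptions, and the interleaved case assumes one bank per word of the block so that all accesses overlap.

```c
#include <assert.h>

/* Hypothetical timing parameters, in clock cycles. */
struct mem_timing {
    int send_addr;      /* send the address to memory */
    int access;         /* access time per request */
    int transfer_word;  /* move one bus-width of data back */
};

/* One-word-wide memory: every word pays the full cost. */
int penalty_narrow(struct mem_timing t, int words)
{
    assert(words > 0);
    return words * (t.send_addr + t.access + t.transfer_word);
}

/* Memory and bus width_words wide: one full-cost request per group. */
int penalty_wide(struct mem_timing t, int words, int width_words)
{
    assert(width_words > 0 && words % width_words == 0);
    return (words / width_words) * (t.send_addr + t.access + t.transfer_word);
}

/* Interleaved banks: accesses overlap, then words cross the
   one-word bus one at a time. */
int penalty_interleaved(struct mem_timing t, int words)
{
    assert(words > 0);
    return t.send_addr + t.access + words * t.transfer_word;
}
```

Assuming 4 cycles to send an address, 56 to access, and 4 per transfer, a 4-word block costs 256 cycles from narrow memory, 64 from 4-word-wide memory, and 76 from interleaved memory.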