Chapter 5: Memory III
CSE 820
Miss Rate Reduction (cont'd)
Larger Block Size
• Reduces compulsory misses through spatial locality
• But:
• miss penalty increases: higher bandwidth helps
• miss rate can increase: a fixed cache size with larger blocks means fewer blocks in the cache
Miss rate vs. block size (graph not reproduced): notice the "U" shape; some increase in block size is good, too much is bad.
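The shape follows from the standard average memory access time formula; the numbers below are illustrative assumptions, not from the slide:

    AMAT = hit time + miss rate × miss penalty

For example, assume a 1-cycle hit time. If doubling the block size cuts the miss rate from 4% to 3% but raises the miss penalty from 40 to 60 cycles, AMAT goes from 1 + 0.04 × 40 = 2.6 cycles to 1 + 0.03 × 60 = 2.8 cycles: the larger block has the lower miss rate, yet overall performance is worse.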
Larger Caches
• Reduces capacity misses
• But:
• increased hit time
• increased cost ($)
• Over time, L2 and higher-level cache sizes have kept increasing
Higher Associativity
• Reduces miss rate by reducing conflict misses
• But:
• increased hit time (tag check)
• Note: an 8-way set-associative cache has close to the same miss rate as a fully associative one
Way Prediction
Predict which way of a set-associative L1 cache will be accessed next.
• Alpha 21264: a correct prediction costs 1 cycle; an incorrect prediction costs 3 cycles
• On SPEC95, the prediction is correct 85% of the time
Compiler Techniques
• Reduce conflicts in the I-cache: a 1989 study showed code reordering reduced misses by 50% for a 2KB cache and by 75% for an 8KB cache
• The D-cache behaves differently, so data is handled with the optimizations on the next slides
Compiler Data Optimizations: Loop Interchange
• Before:
    for (j = …)
        for (i = …)
            x[i][j] = 2 * x[i][j]
• After:
    for (i = …)
        for (j = …)
            x[i][j] = 2 * x[i][j]
• Improved spatial locality (a runnable sketch follows)
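A runnable version of the interchange, as a minimal sketch; the array dimensions and loop bounds are illustrative assumptions, since the slide elides them:

    /* Loop interchange sketch: dimensions are assumed, not from the slide. */
    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024
    static double x[ROWS][COLS];

    int main(void) {
        /* Before: column-order traversal of a row-major C array.
           Consecutive iterations touch x[0][j], x[1][j], ..., which are
           COLS * sizeof(double) bytes apart, so nearly every access can miss. */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                x[i][j] = 2 * x[i][j];

        /* After: loops interchanged. Consecutive iterations touch adjacent
           words, so each cache block is fully used (spatial locality)
           before the traversal moves on. */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                x[i][j] = 2 * x[i][j];

        printf("%f\n", x[0][0]);  /* keep the loops from being optimized away */
        return 0;
    }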
Blocking: Improve Spatial Locality
(The slide's before/after access-pattern figures are not reproduced; a code sketch follows.)
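The canonical illustration of blocking is tiled matrix multiply. A minimal sketch, assuming N and BLOCK are tuning parameters (and that BLOCK divides N, and that the caller has zeroed C):

    /* Blocking (tiling): compute on BLOCK x BLOCK tiles so the data touched
       by the inner loops stays resident in the cache and is reused before
       eviction, instead of streaming whole rows/columns on every iteration. */
    #define N     512
    #define BLOCK 64   /* assumed tile size; pick so the tiles fit in cache */

    void matmul_blocked(double C[N][N], const double A[N][N], const double B[N][N]) {
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + BLOCK; j++) {
                        double sum = C[i][j];     /* accumulate into C tile */
                        for (int k = kk; k < kk + BLOCK; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
    }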
Miss Rate and Miss Penalty Reduction via Parallelism
Nonblocking Caches
• Reduce stalls on a cache miss
• A blocking cache refuses all requests while waiting for data
• A nonblocking cache continues to handle other requests while waiting for data on an outstanding miss
• Increases cache-controller complexity
Nonblocking cache performance (8KB direct-mapped L1; 32-byte blocks) (figure not reproduced)
Hardware Prefetch
• Fetch two blocks on a miss: the desired block plus the next one
• The "next" block goes into a stream buffer; on a fetch, check the stream buffer first
• Performance (instruction prefetch):
• a single-block stream buffer caught 15% to 25% of L1 misses
• a 4-block stream buffer caught 50%
• a 16-block stream buffer caught 72%
Hardware Prefetch
• Data prefetch:
• a single data stream buffer caught 25% of L1 misses
• 4 data stream buffers caught 43%
• 8 data stream buffers caught 50% to 70%
• Prefetch from multiple addresses: the UltraSPARC III handles 8 prefetches and calculates a "stride" for the next prediction
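A minimal software sketch of the single-block stream-buffer policy described above (hypothetical simulation code, not real hardware; the 32-byte block size is an assumption):

    /* One-entry stream buffer sketch. Called when a fetch misses in the
       L1: the stream buffer is checked before going to memory. */
    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_BITS 5          /* assumed 32-byte blocks */

    typedef struct {
        uint64_t block;           /* block address held in the buffer */
        bool     valid;
    } StreamBuffer;

    static StreamBuffer sb;

    bool stream_buffer_service(uint64_t addr) {
        uint64_t blk = addr >> BLOCK_BITS;
        /* Hit: the block moves into the cache with no memory stall. */
        bool hit = sb.valid && sb.block == blk;
        /* Hit or miss, (re)start prefetching the next sequential block. */
        sb.block = blk + 1;
        sb.valid = true;
        return hit;               /* true: no main-memory access needed */
    }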
Software Prefetch
• Many processors, such as Itanium, have prefetch instructions
• Remember that they are non-faulting: a prefetch of an address that would fault is simply dropped rather than raising an exception
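A minimal sketch using the GCC/Clang __builtin_prefetch intrinsic, which compiles to the target's non-faulting prefetch instruction (the prefetch distance is an assumed tuning parameter; Itanium's own lfetch instruction is not shown):

    #define DIST 16   /* assumed prefetch distance, in elements */

    double sum_with_prefetch(const double *a, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            /* Ask for a[i + DIST] while working on a[i]. Because the
               prefetch is non-faulting, running past the end of the
               array near the loop's tail is harmless. */
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low temporal locality */);
            sum += a[i];
        }
        return sum;
    }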
Hit Time Reduction
Small, Simple Caches
• Hit time is spent in two steps:
• indexing
• comparing the tag
• A small cache indexes quickly
• A simple direct-mapped cache allows the tag comparison to proceed in parallel with the data load
• A common compromise: keep the L2 tags on chip even when the L2 data is off chip
Access time vs. cache size and organization (graph not reproduced)
Perspective on the previous graph
Equivalences:
• a 1ns clock is 10⁻⁹ sec per clock cycle
• 1 GHz is 10⁹ clock cycles per sec
Therefore:
• a 2ns clock is 500 MHz
• a 4ns clock is 250 MHz
Conclude that small differences in ns represent large differences in MHz.
Virtual vs Physical Address in L1
• Translating a virtual address to a physical address as part of cache access takes time on the critical path
• Translation is needed for both index and tag
• Making the common case fast suggests avoiding translation for hits (misses must be translated anyway)
Why are (almost all) L1 caches physical?
• Security (protection): page-level protection must be checked on every access (protection data can be copied into the cache)
• A process switch can change the virtual mapping, requiring a cache flush (or a process ID in the tag) [see next slide]
• Synonyms: two virtual addresses can map to the same (shared) physical address, so one datum could occupy two cache locations
Virtually-addressed cache: context-switch cost (graph not reproduced)
Hybrid: Virtually Indexed, Physically Tagged
• Index with the part of the page offset that is identical in virtual and physical addresses, i.e. the index bits are a subset of the page-offset bits
• In parallel with indexing, translate the virtual address and check the physical tag
• Limitation: a direct-mapped cache must be ≤ the page size (determined by the available address bits); set-associative caches can be bigger, since fewer bits are needed for the index
Example
• Pentium III: 8KB pages with a 16KB 2-way set-associative cache
• IBM 3033: 4KB pages with a 64KB 16-way set-associative cache (note that 8-way would be sufficient for miss rate, but 16-way is needed to keep the index bits small enough)
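Both examples can be checked against the constraint that the index plus block-offset bits must fit within the page offset, i.e. cache size / associativity ≤ page size:
• Pentium III: 16KB / 2 ways = 8KB per way = the 8KB page size
• IBM 3033: 64KB / 16 ways = 4KB per way = the 4KB page size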
Trace Cache
• Pentium 4 NetBurst architecture
• I-cache blocks are organized to contain instruction traces, including predicted-taken branches, instead of being organized around memory addresses
• Advantage over conventional large cache blocks, which contain branches and hence many unused instructions: e.g. the AMD Athlon's 64-byte blocks contain 16-24 x86 instructions, with 1 in 5 being a branch
• Disadvantage: complex addressing
Trace Cache
• The P4 trace cache (I-cache) is placed after decode and branch prediction, so it contains:
• μops
• only the desired (predicted-path) instructions
• The trace cache holds 12K μops
• The branch predictor's BTB is 4K entries (a 33% improvement over the PIII)
Summary (so far)
• Figure 5.26 in the text summarizes all of these techniques
Main Memory
Main-memory modifications can reduce the cache miss penalty by bringing words from memory faster:
• A wider path to memory brings in more words at a time, e.g. one address request brings in 4 words (reducing overhead)
• Interleaved memory overlaps accesses across banks, allowing memory to respond faster (see the worked example below)
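A worked example with assumed timings (not from the slide): suppose a miss must fetch a 4-word block, sending an address takes 1 cycle, a DRAM access takes 6 cycles, and transferring one word over the bus takes 1 cycle.
• One-word-wide, non-interleaved memory: 4 × (1 + 6 + 1) = 32 cycles
• 4-word-wide path: 1 + 6 + 1 = 8 cycles (all four words move at once)
• 4-way interleaved, one-word-wide bus: 1 + 6 + 4 × 1 = 11 cycles (word i lives in bank i mod 4, so the four bank accesses overlap and only the word transfers serialize)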