EECS 470: Cache Systems
Lecture 13
Coverage: Chapter 5
Cache Design 101: The Memory Pyramid
• Registers (100s of bytes): 1-cycle access (early in pipeline)
• L1 cache (several KB): 1-3 cycle access
• L2 cache (½-32 MB): 6-15 cycle access
• Memory (128 MB to a few GB): 50-300 cycle access
• Disk (many GB): millions of cycles per access!
Direct-Mapped Cache
[Figure: a 16-block memory feeding a 4-line direct-mapped cache; each line holds valid (V), dirty (d), tag, and data fields. The example 5-bit address 01101 splits into a 2-bit tag, a 2-bit line index, and a 1-bit block offset.]
• Compulsory miss: first reference to a memory block
• Capacity miss: working set doesn't fit in the cache
• Conflict miss: working set maps to the same cache line
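A minimal C sketch of that address split, using the field widths from the figure (the helper name split_address is mine, not from the slides):

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 1   /* 1-bit block offset, per the figure */
#define INDEX_BITS  2   /* 2-bit line index */

static void split_address(uint32_t addr)
{
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("addr %u -> tag %u, index %u, offset %u\n",
           addr, tag, index, offset);
}

int main(void)
{
    split_address(13);  /* 01101 -> tag 01, index 10, offset 1 */
    return 0;
}
```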
2-Way Set-Associative Cache
[Figure: the same memory with a 2-way set-associative cache; the example address now splits into a larger 3-bit tag, a 1-bit set index, and an unchanged block offset.]
• Rule of thumb: increasing associativity decreases conflict misses. A 2-way set-associative cache has about the same hit rate as a direct-mapped cache of twice the size.
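A sketch of the lookup, again with the figure's field widths (1-bit offset, 1-bit set index, 3-bit tag); the structure and names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS     2
#define NUM_SETS 2    /* 1-bit set index, as in the figure */

struct line { bool valid; uint32_t tag; };
static struct line cache[NUM_SETS][WAYS];

/* Probe both ways of the selected set in parallel (a loop here). */
static bool lookup(uint32_t addr)
{
    uint32_t set = (addr >> 1) & (NUM_SETS - 1);  /* skip 1-bit offset */
    uint32_t tag = addr >> 2;                     /* 3-bit tag remains */
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;   /* hit */
    return false;          /* miss: pick a victim way and refill */
}
```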
Effects of Varying Cache Parameters
• Total cache size = block size × # sets × associativity (e.g., 64 B blocks × 128 sets × 4 ways = 32 KB)
• Positives:
  • Should decrease miss rate
• Negatives:
  • May increase hit time
  • Increased area requirements
Effects of Varying Cache Parameters
• Bigger block size
• Positives:
  • Exploit spatial locality; reduce compulsory misses
  • Reduce tag overhead (fewer tag bits per byte of data)
  • Reduce transfer overhead (one address, burst data mode)
• Negatives:
  • Fewer blocks for a given size; increase conflict misses
  • Increase miss transfer time (multi-cycle transfers)
  • Wasted bandwidth for non-spatial data
Effects of Varying Cache Parameters
• Increasing associativity
• Positives:
  • Reduces conflict misses
  • Low-associativity caches can show pathological behavior (very high miss rates)
• Negatives:
  • Increased hit time
  • More hardware required (comparators, muxes, bigger tags)
  • Diminishing improvement past 4- or 8-way
Effects of Varying Cache Parameters
• Replacement strategy (for associative caches):
  • LRU: intuitive; difficult to implement at high associativity; worst-case performance can occur (e.g., looping over an N+1-element array in an N-way cache)
  • Random: pseudo-random is easy to implement; performance close to LRU at high associativity
  • Optimal: replace the block whose next reference is farthest in the future; requires knowing the future, so it serves only as a bound
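For a 2-way cache, true LRU is cheap: one bit per set suffices. A minimal sketch (array sizes and names are mine, for illustration):

```c
#include <stdint.h>

#define NUM_SETS 64                 /* illustrative size */

static uint8_t mru_way[NUM_SETS];   /* which way (0 or 1) was used last */

static int victim_way(uint32_t set)
{
    return 1 - mru_way[set];        /* evict the not-recently-used way */
}

static void touch(uint32_t set, int way)
{
    mru_way[set] = (uint8_t)way;    /* call on every hit and every fill */
}
```

At higher associativity this generalizes to pseudo-LRU trees, which is why the slide says exact LRU gets hard to implement.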
Other Cache Design Decisions
• Write policy: how to deal with write misses?
  • Write-through / no-allocate
    • Total traffic: read misses × block size + writes
    • Common for L1 caches backed by an L2 (esp. on-chip)
  • Write-back / write-allocate
    • Needs a dirty bit to track whether cache data differs from memory
    • Total traffic: (read misses + write misses) × block size + dirty-block evictions × block size
    • Common for L2 caches (memory-bandwidth limited)
  • Variation: write-validate
    • Write-allocate without fetch-on-write
    • Needs a sub-blocked cache with valid bits for each word/byte
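A back-of-envelope comparison using the two traffic formulas above; the workload counts are made up for illustration:

```c
#include <stdio.h>

int main(void)
{
    /* hypothetical workload counts */
    double read_misses = 1e6, write_misses = 2e5;
    double writes = 3e6, dirty_evictions = 1.5e5;
    double block = 64, word = 4;   /* bytes */

    /* write-through / no-allocate: read misses fetch a block;
       every write goes through to the next level */
    double wt = read_misses * block + writes * word;

    /* write-back / write-allocate: all misses fetch a block;
       dirty evictions write a block back */
    double wb = (read_misses + write_misses) * block
              + dirty_evictions * block;

    printf("write-through traffic: %.0f bytes\n", wt);
    printf("write-back    traffic: %.0f bytes\n", wb);
    return 0;
}
```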
Other Cache Design Decisions
• Write buffering
  • Delay writes until bandwidth is available; put them in a FIFO buffer
  • Only stall on a write if the buffer is full
  • Use bandwidth for reads first (since reads have the latency problem)
  • Important for write-through caches, since write traffic is frequent
• Write-back buffer
  • Holds evicted (dirty) lines for write-back caches
  • Also allows reads priority on the L2 or memory bus
  • Usually only needs to be a small buffer
  • Ref: Eager Writeback Caches
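A minimal ring-buffer sketch of the write-buffer idea: accept writes until full, drain when the bus is free. Sizes and names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define WBUF_ENTRIES 8
struct wbuf_entry { uint32_t addr; uint32_t data; };

static struct wbuf_entry wbuf[WBUF_ENTRIES];
static int head, tail, count;

static bool wbuf_push(uint32_t addr, uint32_t data)
{
    if (count == WBUF_ENTRIES)
        return false;                 /* full: CPU must stall this write */
    wbuf[tail] = (struct wbuf_entry){ addr, data };
    tail = (tail + 1) % WBUF_ENTRIES;
    count++;
    return true;
}

static bool wbuf_drain_one(void)      /* call when the bus is idle */
{
    if (count == 0)
        return false;
    /* issue wbuf[head] to L2/memory here */
    head = (head + 1) % WBUF_ENTRIES;
    count--;
    return true;
}
```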
Adding a Victim Cache
[Figure: a direct-mapped L1 (V, d, tag, data) alongside a small 4-line fully associative victim cache; conflicting references 11010011 and 01010011 are shown co-existing via the victim cache.]
• A small victim cache adds associativity to "hot" lines
• Blocks evicted from the direct-mapped cache go to the victim cache
• Tag compares are made against both the direct-mapped cache and the victim cache
• A victim-cache hit causes lines to swap between L1 and the victim cache
• Not very useful for associative L1 caches
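A sketch of the victim-cache probe on an L1 miss; the structures are illustrative, and the full-address tag stands in for the fully associative compare:

```c
#include <stdbool.h>
#include <stdint.h>

#define VC_LINES 4   /* small and fully associative, as in the figure */

struct vline { bool valid; uint32_t line_addr; };  /* full-address tag */
static struct vline victim[VC_LINES];

/* Called only after the direct-mapped L1 probe misses. */
static bool victim_probe(uint32_t line_addr)
{
    for (int i = 0; i < VC_LINES; i++)     /* compare all entries */
        if (victim[i].valid && victim[i].line_addr == line_addr) {
            /* hit: swap this entry with the conflicting L1 line */
            return true;
        }
    return false;  /* true miss: go to L2; evicted L1 line enters victim */
}
```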
Hash-Rehash Cache
[Figure sequence: a direct-mapped cache stepped through references 11010011, 01010011, 01000011, and 11000011. Each reference first probes its primary (hash) index; on a miss, a second (rehash) index is probed. The sequence shows a primary miss followed by a rehash miss (allocate from memory) and later a primary miss followed by a rehash hit, with an R bit marking lines stored at their rehash location.]
Hash-Rehash Cache
• Calculating performance:
  • Primary hit time (same as a normal direct-mapped cache)
  • Rehash hit time (sequential tag lookups)
  • Block swap time?
• Hit rate comparable to a 2-way set-associative cache
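One way to combine those components into an average access time; the symbols (h1 for the primary-hit fraction, h2 for the rehash-hit fraction, and the t terms) are my notation, not from the slides:

```latex
t_{\text{avg}} = h_1 \, t_{\text{primary}}
               + h_2 \left( t_{\text{rehash}} + t_{\text{swap}} \right)
               + \left( 1 - h_1 - h_2 \right) t_{\text{miss}}
```

The rehash-hit term pays for the sequential second lookup plus the swap back to the primary location, which is why hash-rehash trades a 2-way-like hit rate for a variable hit time.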
Compiler support for caching
• Array merging (array of structs vs. two separate arrays)
• Loop interchange (row- vs. column-order access)
• Structure padding and alignment (malloc)
• Cache-conscious data placement
  • Pack the working set into the same line
  • Map to non-conflicting addresses if packing is impossible
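Two of these transformations sketched in C (function and type names are mine):

```c
#include <stddef.h>

#define N 1024

/* Loop interchange: C arrays are row-major, so the j-inner order below
   walks memory with stride 1 and exploits spatial locality; swapping
   the loops would stride by N doubles per access and thrash the cache. */
static double sum_rows(double a[N][N])
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];          /* stride-1 access pattern */
    return s;
}

/* Array merging: if key[i] and val[i] are always used together, one
   array of structs keeps each pair in the same cache line. */
struct pair { int key; int val; };
static struct pair merged[N];      /* vs. int key[N]; int val[N]; */
```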
Prefetching
• Already done: bringing in an entire line assumes spatial locality
• Extend this: next-line prefetch
  • Bring in the next block in memory as well as the miss line (very good for the I-cache)
• Software prefetch
  • Loads to R0 have no data dependency (the data is fetched but discarded)
• Aggressive/speculative prefetch is useful for L2
• Speculative prefetch is problematic for L1
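A software-prefetch sketch using GCC/Clang's __builtin_prefetch intrinsic, the modern equivalent of the "load to R0" trick; the lookahead distance of 16 elements is a tuning guess, not from the slides:

```c
static long sum_with_prefetch(const long *a, long n)
{
    long s = 0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)                        /* prefetch ahead */
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/1);
        s += a[i];                             /* demand access */
    }
    return s;
}
```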
Calculating the Effects of Latency
• Does a cache miss reduce performance?
• It depends on whether there are critical instructions waiting for the result
Calculating the Effects of Latency
• It also depends on whether critical resources are held up
  • Blocking: when a miss occurs, all later references to the cache must wait. This is a resource conflict.
  • Non-blocking: allows later references to access the cache while a miss is being processed
  • Generally there is some limit on how many outstanding misses can be bypassed
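The bookkeeping behind that limit is typically a small table of outstanding misses (miss status holding registers, MSHRs). A minimal sketch, with illustrative names and a hypothetical limit of 4:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4    /* the limit on outstanding misses */

struct mshr { bool busy; uint32_t line_addr; };
static struct mshr mshrs[NUM_MSHRS];

/* Returns true if the miss can proceed (new, or merged with a miss
   already in flight); false means all MSHRs are busy and the
   access must stall. */
static bool allocate_mshr(uint32_t line_addr)
{
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].busy && mshrs[i].line_addr == line_addr)
            return true;    /* secondary miss: merge, no new request */
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshrs[i].busy) {
            mshrs[i].busy = true;
            mshrs[i].line_addr = line_addr;
            return true;    /* primary miss: request sent to next level */
        }
    return false;           /* structural stall */
}
```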
P4 Overview (Todd's slides)
• Latest IA-32 processor from Intel
• Equipped with the full set of IA-32 SIMD operations
• First flagship microarchitecture since the P6
• Pentium 4 ISA = Pentium III ISA + SSE2
• SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations, plus prefetch instructions
Trace Cache
• Primary instruction cache in the P4 architecture
• Stores 12K decoded µops
• On a miss, instructions are fetched from L2
• A trace predictor connects traces
• The trace cache removes:
  • Decode latency after mispredictions
  • Decode power for all pre-decoded instructions
Store and Load Scheduling
• Out-of-order store and load operations; stores are always in program order
• Up to 48 loads and 24 stores can be in flight
• Store/load buffers are allocated at the allocation stage
  • 24 store buffers and 48 load buffers in total
On-chip Caches
• L1 instruction cache (the trace cache)
• L1 data cache
• L2 unified cache
• All caches use a pseudo-LRU replacement algorithm
• [Parameter table from the original slide not reproduced here]
L1 Data Cache
• Non-blocking: supports up to 4 outstanding load misses
• Load latency: 2 cycles for integer, 6 cycles for floating point
• 1 load and 1 store per clock
• Load speculation
  • Assume the access will hit the cache
  • "Replay" the dependent instructions when a miss is detected
L2 Cache
• Non-blocking
• Load latency: net load access latency of 7 cycles
• Bandwidth:
  • 1 load and 1 store in one cycle; new cache operations may begin every 2 cycles
  • 256-bit wide bus between L1 and L2
  • 48 GB/s @ 1.5 GHz
L2 Cache Data Prefetcher
• A hardware prefetcher monitors reference patterns and brings cache lines in automatically
• Attempts to fetch 256 bytes ahead of the current access
• Prefetches for up to 8 simultaneous independent streams