CS 7810 Lecture 17: Managing Wire Delay in Large CMP Caches
B. Beckmann and D. Wood, Proceedings of MICRO-37, December 2004
Cache Design
A cache read proceeds through a fixed pipeline: the address is split into index and tag; decoders select one row in both the tag array and the data array; sense amps read the arrays out; a comparator matches the stored tag against the address tag; and a mux+driver steers the selected data to the output.
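The read path above can be sketched in software. This is a minimal, illustrative model (block size, set count, and associativity are assumptions, not values from the lecture): the set index plays the role of the decoder input, the per-way tag check plays the comparator, and returning the matching way's data stands in for the mux+driver.

```python
# Software analogue of the cache read path. Sizes are illustrative.

OFFSET_BITS = 6    # assume 64-byte blocks
NUM_SETS    = 64   # so 6 index bits
WAYS        = 4

class Cache:
    def __init__(self):
        # tags[set][way] and data[set][way]; None = invalid entry
        self.tags = [[None] * WAYS for _ in range(NUM_SETS)]
        self.data = [[None] * WAYS for _ in range(NUM_SETS)]

    def split(self, addr):
        set_idx = (addr >> OFFSET_BITS) % NUM_SETS   # "decoder" input
        tag = addr >> (OFFSET_BITS + 6)              # bits above offset+index
        return set_idx, tag

    def read(self, addr):
        set_idx, tag = self.split(addr)
        for way in range(WAYS):                      # comparator per way
            if self.tags[set_idx][way] == tag:
                return self.data[set_idx][way]       # mux+driver selects data
        return None                                  # miss

    def fill(self, addr, value, way=0):
        set_idx, tag = self.split(addr)
        self.tags[set_idx][way] = tag
        self.data[set_idx][way] = value

c = Cache()
c.fill(0x12345, "block A")
print(c.read(0x12345))  # "block A"
print(c.read(0x99999))  # None (no matching tag -> miss)
```

In real hardware the tag and data arrays are read in parallel and the comparator output gates the way mux; the sequential loop here is only a functional stand-in.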
Capacity vs. Latency
• 8 KB – 1 cycle
• 32 KB – 2 cycles
• 128 KB – 3 cycles
Large L2 Caches
• Issues to be addressed for Non-Uniform Cache Access (NUCA):
• Mapping
• Searching
• Movement
Dynamic NUCA
• Frequently accessed blocks are moved closer to the CPU – reduces average latency
• Partial (6-bit) tags are maintained close to the CPU – tag look-up can identify the potential location of a block or quickly signal a miss
• Without partial tags, every possible location would have to be searched serially or in parallel
• What if you optimize for power?
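The partial-tag idea above can be sketched as follows. This is a simplified model, not the paper's exact structure (the directory layout and bank count are assumptions): each bank's cached blocks are summarized near the CPU by their low 6 tag bits, so a lookup can narrow the search to candidate banks, and an empty candidate list is a guaranteed miss with no bank probed at all.

```python
# Minimal sketch of partial-tag filtering for a D-NUCA lookup.
# Bank count and directory layout are illustrative assumptions.

PARTIAL_BITS = 6

def partial_tag(full_tag):
    """Keep only the low 6 bits of the full tag."""
    return full_tag & ((1 << PARTIAL_BITS) - 1)

class PartialTagDirectory:
    def __init__(self, num_banks):
        # Per bank: set index -> set of 6-bit partial tags cached there.
        self.banks = [dict() for _ in range(num_banks)]

    def insert(self, bank, set_idx, full_tag):
        self.banks[bank].setdefault(set_idx, set()).add(partial_tag(full_tag))

    def candidate_banks(self, set_idx, full_tag):
        """Banks whose partial tags match; empty list => guaranteed miss."""
        p = partial_tag(full_tag)
        return [b for b, bank in enumerate(self.banks)
                if p in bank.get(set_idx, set())]

d = PartialTagDirectory(num_banks=16)
d.insert(bank=3, set_idx=5, full_tag=0x1A7)
print(d.candidate_banks(5, 0x1A7))   # [3] -> probe only bank 3
print(d.candidate_banks(5, 0x200))   # []  -> definite miss, no search needed
```

Because only 6 bits are compared, a candidate bank can still be a false positive (two tags sharing the same low bits), so the full tag must still be checked at the bank itself; the partial tags only filter where to look.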
DNUCA – CMP
• Allocation: static, based on the block's address
• Migration: blocks move between bank clusters (r.l → r.i → r.c → m.c → m.i → m.l) toward the requesting CPU
• Search: multicast to the closest 6 banks; then multicast to the remaining 10
• Problem: false misses while a block is in transit
• Latency: 13–17 cycles for the closest banks, up to 65 cycles for the farthest
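The two-phase search can be sketched as below; this is an illustrative simplification (the bank grouping and the `has_block` predicate are assumptions), showing why a hit in the close cluster avoids probing the far banks at all.

```python
# Sketch of the two-phase DNUCA search: multicast to the 6 banks closest to
# the requesting core; only on a miss there, multicast to the remaining 10.
# Bank numbering and grouping are illustrative.

def two_phase_search(has_block, close_banks, far_banks):
    """has_block: bank -> bool. Returns (hit_bank_or_None, banks_probed)."""
    probed = list(close_banks)
    for b in close_banks:            # phase 1: closest banks, in parallel in HW
        if has_block(b):
            return b, probed
    probed += list(far_banks)        # phase 2: remaining banks
    for b in far_banks:
        if has_block(b):
            return b, probed
    return None, probed              # miss in all banks

close = list(range(6))
far = list(range(6, 16))
hit, probed = two_phase_search(lambda b: b == 9, close, far)
print(hit, len(probed))  # 9 16
```

A block sitting in a close bank is found after probing only 6 banks; a block in a far bank (as here) costs probes to all 16, which is where the false-miss and latency issues come from.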
Alternative Layout From Huh et al., ICS’05
Block Migration Results
While block migration reduces avg. distance, it complicates search.
CMP-TLC (Transmission Line Caches)
• Pros: fast transmission-line wires enable uniform low-latency access
• Cons: low-bandwidth interconnect, high implementation cost, and more latency/complexity at the L2 interface
Stride Prefetching
• Prefetching algorithm: detect at least 4 uniform-stride accesses, then allocate an entry in the stream buffer
• The stream buffer has 8 entries, and each stream stays 6 (L1) or 25 (L2) accesses ahead
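The detection policy above can be sketched as follows. The thresholds (4 uniform strides, 8 entries, depth 6 for L1) come from the slide; the class structure, eviction policy (oldest stream evicted), and prefetch-issue details are illustrative assumptions.

```python
# Sketch of stride detection + stream-buffer allocation: after 4 accesses with
# the same stride, allocate a stream in an 8-entry buffer and run 6 ahead.

from collections import deque

DETECT_THRESHOLD = 4   # uniform-stride accesses before allocating a stream
BUFFER_ENTRIES   = 8   # stream buffer size
DEPTH            = 6   # how far ahead each stream stays (6 for L1, 25 for L2)

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None
        self.run = 0
        # Oldest stream is evicted when full (illustrative policy).
        self.streams = deque(maxlen=BUFFER_ENTRIES)  # (next_addr, stride)

    def access(self, addr):
        """Observe one access; return the list of prefetches issued."""
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                self.run += 1
            else:
                self.run = 1
            self.last_stride = stride
        self.last_addr = addr
        if self.run >= DETECT_THRESHOLD:
            self.run = 0
            s = self.last_stride
            self.streams.append((addr + s * (DEPTH + 1), s))
            # Issue prefetches to stay DEPTH accesses ahead of the stream.
            return [addr + s * i for i in range(1, DEPTH + 1)]
        return []

pf = StridePrefetcher()
for a in [0, 64, 128, 192, 256]:
    issued = pf.access(a)
# The 4th uniform stride of 64 triggers allocation and prefetching.
print(issued)  # [320, 384, 448, 512, 576, 640]
```

A real prefetcher would track multiple concurrent streams by PC or region rather than a single global last-address register; that bookkeeping is omitted here to keep the allocation rule visible.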