An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches
ASPLOS’02
Presented by Kim, Sun-Hee

Introduction
• Technology trends
  • The rate of frequency scaling is slowing down
  • Performance must come from exploiting concurrency
  • Global on-chip wire delay is a growing problem
  • Architectures must be partitioned
• NUCA (Non-Uniform Cache Architecture)
  • Composable on-chip memories
  • Addresses the increasing wire-delay problem in future large caches
  • An array of fine-grained memory banks connected by a switched network

Level-2 Cache Architectures (1/5)
• UCA (Uniform Cache Architecture)
  • Traditional cache
  • Poor performance
    • Internal wire delays
    • Restricted number of ports

Level-2 Cache Architectures (2/5)
• ML-UCA (Multi-Level UCA)
  • L2 and L3
  • Aggressively banked
  • Multiple parallel accesses
  • Inclusion is enforced, so data is replicated across levels

Level-2 Cache Architectures (3/5)
• S-NUCA-1 (Static NUCA)
  • Non-uniform access without inclusion
  • Mapping is predetermined
    • Based on the block index
    • Each block maps to only one bank of the cache
  • Each bank has a private, two-way, pipelined transmission channel

Level-2 Cache Architectures (4/5)
• S-NUCA-2
  • 2-D switched network
  • Permits a larger number of smaller, faster banks
  • Circumvents the wire and decoder area overhead

Level-2 Cache Architectures (5/5)
• D-NUCA (Dynamic NUCA)
  • Migrates cache lines
    • Data may be mapped to many banks
    • Most requests are serviced by the fastest banks
  • Fewer misses
    • By adapting to the working set

UCA
• Experimental methodology
  • Cacti to derive the cache access times
  • sim-alpha to simulate cache performance
• UCA evaluation

S-NUCA
• Mappings of data to banks are static
  • Low-order bits of the index determine the bank
  • Four-way set associative
• Advantages
  • Access time differs per bank, in proportion to the distance of the bank
  • Accesses to different banks may proceed in parallel
    • Reduces contention
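
The static mapping above can be sketched in a few lines. This is only an illustration: the 64-byte block size and the 32-bank count are assumed values, and `snuca_bank` is a hypothetical helper name, not something taken from the slides.

```python
def snuca_bank(address: int, block_bytes: int = 64, num_banks: int = 32) -> int:
    """Return the bank a byte address maps to under a static mapping.

    Assumes num_banks is a power of two, so the modulo below is the same
    as taking the low-order bits of the block index.
    """
    block_index = address // block_bytes   # drop the block-offset bits
    return block_index % num_banks         # low-order index bits pick the bank


# Four consecutive blocks fall into four different banks,
# so they could be accessed in parallel.
banks = [snuca_bank(addr) for addr in range(0, 4 * 64, 64)]
```

Because the bank is a pure function of the address, no search is needed: the controller always knows which single bank to query.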

S-NUCA-1 (Private Channel)
• Two private, per-bank 128-bit channels
• Each bank is accessed independently at its maximum speed
• Trade-off: small-bank speed advantages vs. channel area overheads
• Bank conflict contention model (b = bank access time, d = one-way channel transmission time)
  • Conservative policy: b + 2d + 3 cycles
  • Aggressive pipelining policy: b + 3 cycles
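
A small worked example of the two contention policies. It assumes b is the bank access time and d is the one-way channel transmission time, both in cycles; the function name and the sample values are illustrative only.

```python
def bank_busy_cycles(b: int, d: int, aggressive: bool = False) -> int:
    """Cycles a bank is occupied before it can accept the next request.

    Conservative policy: b + 2d + 3 (wait for the full round trip).
    Aggressive pipelining policy: b + 3 (channel latency is overlapped).
    """
    return b + 3 if aggressive else b + 2 * d + 3


# Illustrative values only: a 3-cycle bank behind a 10-cycle channel.
conservative = bank_busy_cycles(b=3, d=10)                  # 3 + 20 + 3 = 26
aggressive = bank_busy_cycles(b=3, d=10, aggressive=True)   # 3 + 3 = 6
```

The gap between the two numbers grows with d, so the choice of policy matters most for the banks farthest from the controller.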

S-NUCA-2 (Switched Channel)
• Lightweight, wormhole-routed 2-D mesh
• Tags are kept in a centralized tag store or broadcast to all of the banks
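
D-NUCA: Mapping
• Spread sets
  • The multibanked cache is treated as a set-associative structure
  • The banks that hold the ways of one set form a bank set
• Figure annotations (bank-set mapping alternatives): bank set; 4-way bank set; the number of rows may not match the number of ways; different latencies vs. equal latencies; complex paths within a set; potentially longer latencies; more contention; fastest-bank access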
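
One possible reading of the spread-set idea, sketched under stated assumptions: the banks form a grid, each column is one bank set, and each row holds one way, with row 0 closest to the cache controller. The grid size and the name `bank_set_members` are hypothetical.

```python
def bank_set_members(block_index: int, rows: int = 4, cols: int = 8):
    """Return the (row, column) banks that may hold a block.

    Assumed layout: the column is chosen by the block index, and every
    row in that column is one way of the bank set.
    """
    col = block_index % cols
    return [(row, col) for row in range(rows)]


# Block 10 may live in any of the four banks of column 2.
members = bank_set_members(10)
```

With this layout the associativity is fixed by the number of rows, which is one reason the row count may not match the desired number of ways.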

D-NUCA: Locating
• Incremental search
  • Starts from the closest bank
  • Minimizes messages: low energy, but lower performance
• Multicast search
  • Multicasts the address to all banks in a set
  • Higher performance, at the cost of more energy and contention
• Limited multicast
  • Searches the first M banks in parallel, then the rest incrementally
• Partitioned multicast
  • The bank set is split into subsets, which are searched iteratively
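
The search policies above trade bank probes against latency. The sketch below counts probes only; it models a bank set as a list of Python sets ordered closest bank first, which is an assumption made for illustration, and the function names are hypothetical.

```python
def incremental_search(bank_set, tag):
    """Probe banks one at a time, closest first. Returns (hit_position, probes)."""
    for probes, bank in enumerate(bank_set, start=1):
        if tag in bank:
            return probes - 1, probes
    return None, len(bank_set)


def multicast_search(bank_set, tag):
    """Probe every bank of the set at once. Returns (hit_position, probes)."""
    hit = next((i for i, bank in enumerate(bank_set) if tag in bank), None)
    return hit, len(bank_set)


def limited_multicast_search(bank_set, tag, m):
    """Probe the first m banks in parallel, then search the rest incrementally."""
    hit, probes = multicast_search(bank_set[:m], tag)
    if hit is not None:
        return hit, probes
    tail_hit, tail_probes = incremental_search(bank_set[m:], tag)
    position = None if tail_hit is None else m + tail_hit
    return position, probes + tail_probes


banks = [{"a"}, {"b"}, {"c"}, {"d"}]  # index 0 is the closest bank
```

For a line in the closest bank, incremental search needs one probe where multicast needs four; multicast pays that energy to avoid serializing the lookups when the line is farther away.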

D-NUCA: Searching
• Challenges in a distributed cache array
  • Many banks may need to be searched
  • Miss resolution time grows as the number of ways increases
• Partial tag comparison
  • Reduces bank lookups and miss resolution time
• Smart search
  • Stores the partial tag bits in the cache controller
  • ss-performance: enough tag bits to reduce false hits
  • ss-energy: serialized search from the closest bank
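
A minimal sketch of partial tag comparison at the cache controller. The 6-bit partial tag width is an assumed value, and `partial` and `candidate_banks` are hypothetical names; the point is only that banks whose partial tag does not match never need to be searched.

```python
PARTIAL_BITS = 6  # assumed width of the partial tag kept in the controller


def partial(tag: int) -> int:
    """Keep only the low-order PARTIAL_BITS bits of a full tag."""
    return tag & ((1 << PARTIAL_BITS) - 1)


def candidate_banks(partial_tags, tag):
    """Return the banks whose stored partial tag matches the request.

    partial_tags[i] is the partial tag recorded for bank i of the set,
    or None if that way is empty.
    """
    want = partial(tag)
    return [i for i, p in enumerate(partial_tags) if p == want]


# Banks 0 and 1 hold different lines that share the same low six tag bits.
stored = [partial(0x1A3), partial(0x2A3), partial(0x0F0), None]
hits = candidate_banks(stored, 0x1A3)     # bank 1 is a false hit
misses = candidate_banks(stored, 0x111)   # no match: the miss resolves early
```

An empty candidate list lets the controller start the miss immediately, and more partial tag bits make false hits such as bank 1 less likely.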

D-NUCA: Movement
• Goal: maximize the hit ratio in the closest bank
  • The MRU line should be in the closest bank
• Generational promotion
  • Approximates an LRU mapping
  • Reduces the amount of copying that pure LRU would require
  • On a hit, the line is swapped with the line in the next-closest bank
• Zero-copy policy, one-copy policy
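
Generational promotion can be illustrated with a list in which index 0 is the closest bank and each bank holds one line of the set. This is a sketch under that simplifying assumption; `access` is a hypothetical name.

```python
def access(bank_set, line):
    """Look up a line; on a hit, swap it one bank closer to the controller.

    bank_set is ordered closest bank first. Returns True on a hit.
    """
    if line not in bank_set:
        return False
    pos = bank_set.index(line)
    if pos > 0:
        # Swap with the line in the next-closest bank (one step per hit).
        bank_set[pos - 1], bank_set[pos] = bank_set[pos], bank_set[pos - 1]
    return True


lines = ["a", "b", "c", "d"]
access(lines, "c")   # lines is now ["a", "c", "b", "d"]
access(lines, "c")   # lines is now ["c", "a", "b", "d"]
```

Each hit moves only two lines, whereas a pure LRU ordering would shift every line between the hit position and the closest bank.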

D-NUCA: Policies
• Mapping
  • Simple or shared
• Search
  • Multicast, incremental, or a combination
• Promotion
  • Promotion distance (1 bank), promotion trigger (1 hit)
• Insertion
  • Location (slowest bank) and replacement (zero copy)
• Compared to pure LRU
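
The policy choices listed above can be collected into one configuration record. The field names, the default picks for mapping and search, and the reading of zero copy as "the victim is evicted rather than moved to another bank" are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class DNucaPolicy:
    mapping: str = "simple"          # "simple" or "shared" (default is assumed)
    search: str = "multicast"        # or "incremental" (default is assumed)
    promotion_distance: int = 1      # banks moved per promotion
    promotion_trigger: int = 1       # hits needed before promoting
    insertion_location: str = "slowest"
    replacement: str = "zero-copy"


def insert_on_miss(bank_set, line, policy):
    """Place a new line in the slowest bank and return the evicted line.

    bank_set is ordered closest bank first. Only the slowest-bank,
    zero-copy combination is modeled here.
    """
    if policy.insertion_location != "slowest" or policy.replacement != "zero-copy":
        raise NotImplementedError("only slowest-bank, zero-copy insertion is sketched")
    victim = bank_set[-1]
    bank_set[-1] = line   # the victim leaves the cache; nothing else is copied
    return victim


policy = DNucaPolicy()
lines = ["a", "b", "c", "d"]
evicted = insert_on_miss(lines, "e", policy)
```

Inserting at the slowest bank means a new line has to earn its way toward the controller through hits, which protects the lines already promoted to the fast banks.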

Evaluations (1/2)
• UCA: 67.7, ML-UCA: 22.3, S-NUCA: 30.4
• UCA: 0.41, S-NUCA: 0.65

Evaluations (2/2)
• Comparison to ML-UCA
  • Similar to D-NUCA in that frequently used data is kept closer
  • Working set > 2 MB

Summary and Conclusions
• Low-latency access
• Technology scalability
• Performance stability
• Flattening the memory hierarchy

Evaluations (3/3)
• Cache Design Comparison