An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches

ASPLOS’02 • Presented by Kim, Sun-Hee

Introduction • Technology trends • The rate of frequency scaling is slowing down • Performance must come from exploiting concurrency • Global on-chip wire delay is an increasing problem • Architectures must be partitioned • NUCA (Non-Uniform Cache Architecture) • Composable on-chip memories • Addresses the increasing wire-delay problem in future large caches • Array of fine-grained memory banks connected by a switched network

Level-2 Cache Architectures (1/5) • UCA (Uniform Cache Access) • Traditional cache • Poor performance • Internal wire delays • Restricted number of ports

Level-2 Cache Architectures (2/5) • ML-UCA (Multi-level Cache) • L2 and L3 • Aggressively banked • Multiple parallel accesses • Inclusion enforced, replicating data across levels

Level-2 Cache Architectures (3/5) • S-NUCA-1 (Static Non-Uniform Cache) • Non-uniform access without inclusion • Mapping is predetermined • Based on the block index • A block can reside in only one bank of the cache • Each bank has a private, two-way, pipelined transmission channel

Level-2 Cache Architectures (4/5) • S-NUCA-2 • 2D switched network • Permits a larger number of smaller, faster banks • Circumvents wire and decoder area overhead

Level-2 Cache Architectures (5/5) • D-NUCA (Dynamic NUCA) • Migrates cache lines • By allowing data to be mapped to many banks • Most requests are serviced by the fastest banks • Fewer misses • By adapting to the working set

UCA • Experimental Methodology • Cacti to derive cache access times • sim-alpha to simulate cache performance • UCA Evaluation

S-NUCA • Mappings of data to banks are static • Low-order bits of the index determine the bank • Four-way set associative • Advantages • Access time differs in proportion to the distance of the bank • Accesses to different banks may proceed in parallel • Reduces contention
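The static mapping above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the block size, the bank count, and the helper name `snuca_bank` are assumptions chosen for the example.

```python
BLOCK_BYTES = 64   # assumed block size (hypothetical)
NUM_BANKS = 32     # assumed bank count, a power of two (hypothetical)


def snuca_bank(address: int) -> int:
    """Return the bank that statically holds `address`.

    The block offset is discarded, then the low-order bits of the
    remaining block index select the bank.
    """
    block_index = address // BLOCK_BYTES
    return block_index % NUM_BANKS


# Consecutive blocks land in consecutive banks, so accesses to
# neighbouring blocks can proceed in parallel.
banks = [snuca_bank(a) for a in range(0, 4 * BLOCK_BYTES, BLOCK_BYTES)]
```

Because the bank is a pure function of the address, no search is needed: a request is sent to exactly one bank.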

S-NUCA-1 (Private Channel) • 2 private, per-bank 128-bit channels • Each bank is accessed independently at maximum speed • Small-bank advantages vs. area overheads • Bank conflict contention model • Conservative policy: b+2d+3 cycles • Aggressive pipelining policy: b+3 cycles
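The two contention policies differ only in how long a bank's channel stays occupied per access. The slide does not define the symbols, so the sketch below assumes that b is the bank access time and d is the one-way channel delay, both in cycles. The function name and the example values are hypothetical.

```python
def channel_occupancy(b: int, d: int, pipelined: bool) -> int:
    """Cycles a bank's channel is busy for one access.

    b: bank access time in cycles (assumed meaning of the slide's b)
    d: one-way channel delay in cycles (assumed meaning of the slide's d)
    """
    if pipelined:
        # Aggressive pipelining policy: occupancy does not depend on d.
        return b + 3
    # Conservative policy: the round trip (2d) is also counted.
    return b + 2 * d + 3


b, d = 3, 8  # hypothetical values for a distant bank
conservative = channel_occupancy(b, d, pipelined=False)  # 3 + 16 + 3 = 22
aggressive = channel_occupancy(b, d, pipelined=True)     # 3 + 3 = 6
```

Under these assumed values the conservative policy penalizes distant banks (large d), while the pipelined cost is the same for every bank.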

S-NUCA-2 (Switched Channel) • Lightweight, wormhole-routed 2-D mesh • Centralized tag store or broadcasting the tags to all of the banks

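D-NUCA : Mapping • Spread sets • The multibanked cache is treated as a set-associative structure • Each set is spread across a bank set • [Figure: bank-set mapping alternatives, 4-way bank sets. Annotations: number of rows may not equal the number of ways; different latencies; equal latencies; complex path within a set; potentially longer latencies; more contention; fastest bank access]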
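One way to picture spread sets is the simple mapping named later under D-NUCA policies. The sketch below assumes that each column of the bank array forms one bank set and each row holds one way, with row 0 closest to the cache controller. The array dimensions, block size, and function names are hypothetical.

```python
ROWS, COLS = 4, 8   # hypothetical array: 4 ways per bank set, 8 bank sets
BLOCK_BYTES = 64    # assumed block size (hypothetical)


def bank_set(address: int) -> int:
    """Select the bank set (column) from the low-order block-index bits."""
    return (address // BLOCK_BYTES) % COLS


def candidate_banks(address: int) -> list:
    """All (row, column) banks that may hold `address`, closest first."""
    col = bank_set(address)
    return [(row, col) for row in range(ROWS)]


# A block in bank set 1 may live in any of the four banks of column 1.
candidates = candidate_banks(64)
```

Unlike the static case, the address only narrows the location down to a bank set, so the banks in that set still have to be searched.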

D-NUCA : Locating • Incremental search • Starts from the closest bank • Minimizes messages: low energy, but lower performance • Multicast search • Multicasts the address to all banks in a set • Higher performance at the cost of more energy and contention • Limited multicast • Searches the first M banks in parallel, then searches incrementally • Partitioned multicast • Subsets of the bank set are searched iteratively
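The trade-off between these search policies can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the paper's simulator: each bank has a fixed lookup latency, sequential probes add their latencies, parallel probes cost the latency of the bank that answers (or the slowest bank on a miss), and the message count stands in for energy. All function names and latency values are hypothetical, and `m` must satisfy 1 <= m <= len(lat).

```python
from typing import List, Optional, Tuple


def incremental_search(lat: List[int], hit_way: Optional[int]) -> Tuple[int, int]:
    """Probe banks one at a time, closest first. Returns (messages, latency)."""
    last = hit_way if hit_way is not None else len(lat) - 1
    return last + 1, sum(lat[: last + 1])


def multicast_search(lat: List[int], hit_way: Optional[int]) -> Tuple[int, int]:
    """Probe every bank in the set at once. Returns (messages, latency)."""
    latency = lat[hit_way] if hit_way is not None else max(lat)
    return len(lat), latency


def limited_multicast_search(
    lat: List[int], hit_way: Optional[int], m: int
) -> Tuple[int, int]:
    """Probe the first m banks in parallel, then continue incrementally."""
    if hit_way is not None and hit_way < m:
        return m, lat[hit_way]
    last = hit_way if hit_way is not None else len(lat) - 1
    first_phase = max(lat[:m])  # wait for the whole parallel phase to miss
    return last + 1, first_phase + sum(lat[m : last + 1])


lat = [4, 7, 10, 13]  # hypothetical latencies, closest bank first
inc = incremental_search(lat, hit_way=2)
mc = multicast_search(lat, hit_way=2)
lim = limited_multicast_search(lat, hit_way=2, m=2)
```

In this model a hit in the third bank costs fewer messages but more cycles under incremental search than under multicast, with limited multicast in between on latency.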

D-NUCA : Searching • Challenges in a distributed cache array • Many banks may need to be searched • Miss resolution time grows as associativity increases • Partial tag comparison • Reduces bank lookups and miss resolution time • Smart search • Stores the partial tag bits in the cache controller • ss-performance: enough tag bits to reduce false hits • ss-energy: serialized search from the closest bank
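Partial tag comparison can be sketched as a filter that runs in the cache controller before any bank is contacted. The example assumes the controller keeps the low-order bits of each way's tag; the number of partial bits, the tag values, and the helper names are hypothetical.

```python
from typing import List, Optional

PARTIAL_BITS = 6  # assumed number of tag bits kept in the controller


def partial(tag: int) -> int:
    """Keep only the low-order PARTIAL_BITS bits of a tag."""
    return tag & ((1 << PARTIAL_BITS) - 1)


def banks_to_probe(partials: List[Optional[int]], request_tag: int) -> List[int]:
    """Ways whose stored partial tag matches the request, closest first.

    An empty result means a guaranteed miss, resolved without any bank
    lookup. A non-empty result may still contain false hits, because
    different full tags can share the same partial bits.
    """
    want = partial(request_tag)
    return [way for way, p in enumerate(partials) if p is not None and p == want]


# Full tags held in the four banks of one bank set (hypothetical values).
full_tags = [0x1A3, 0x2A3, 0x0F0, 0x111]
controller_partials = [partial(t) for t in full_tags]

probe = banks_to_probe(controller_partials, 0x2A3)       # ways 0 and 1 match
early_miss = banks_to_probe(controller_partials, 0x3C5)  # no way matches
```

Here way 0 is a false hit for tag 0x2A3: its partial bits match, but the full tag differs. Storing more partial bits reduces such false hits at the cost of controller storage.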

D-NUCA : Movement • Maximize the hit ratio in the closest bank • Ideally, the MRU line is in the closest bank • Generational promotion • Approximates an LRU mapping • Reduces the number of copies that pure LRU would require • On a hit, the line is swapped with the line in the next closest bank • Zero-copy policy, one-copy policy

D-NUCA : Policies • Mapping • Simple or shared • Search • Multicast, incremental, or a combination • Promotion • Promotion distance (1 bank), promotion trigger (1 hit) • Insertion • Location (slowest bank) and replacement (zero copy) • Compared against pure LRU
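The promotion and insertion policies listed above can be combined into a small model of one bank set. This is a minimal sketch, assuming one line per bank per set and treating zero copy as "the victim simply leaves the cache". The class name `BankSet`, its fields, and the tags are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BankSet:
    ways: int = 4                 # hypothetical associativity
    promotion_distance: int = 1   # banks moved toward the controller
    promotion_trigger: int = 1    # hits required before a promotion
    lines: List[Optional[str]] = field(init=False, default_factory=list)
    hits: Dict[str, int] = field(init=False, default_factory=dict)

    def __post_init__(self) -> None:
        # Index 0 is the closest (fastest) bank, the last index the slowest.
        self.lines = [None] * self.ways

    def access(self, tag: str) -> bool:
        """Return True on a hit, False on a miss."""
        if tag in self.lines:
            way = self.lines.index(tag)
            self.hits[tag] = self.hits.get(tag, 0) + 1
            if self.hits[tag] >= self.promotion_trigger:
                self.hits[tag] = 0
                if way > 0:
                    # Swap with the line promotion_distance banks closer.
                    target = max(0, way - self.promotion_distance)
                    self.lines[way], self.lines[target] = (
                        self.lines[target],
                        self.lines[way],
                    )
            return True
        # Miss: insert into the slowest bank; the victim is evicted
        # without being copied elsewhere (zero copy).
        victim = self.lines[-1]
        if victim is not None:
            self.hits.pop(victim, None)
        self.lines[-1] = tag
        self.hits[tag] = 0
        return False


bs = BankSet()
results = [bs.access(t) for t in ["A", "A", "B", "A", "C"]]
```

In this trace, line A moves one bank closer on each hit, while B is inserted in the slowest bank and evicted by C before it is ever reused. Frequently used lines therefore drift toward the fast banks without the full reshuffle that pure LRU ordering would need.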

Evaluations (1/2) • [Figure labels] UCA: 67.7 • ML-UCA: 22.3 • S-NUCA: 30.4 • [Figure labels] UCA: 0.41 • S-NUCA: 0.65

Evaluations (2/2) • Comparison to ML-UCA • Similar to D-NUCA in that frequently used data is kept closer • Working set > 2MB

Summary and Conclusions • Low latency access • Technology scalability • Performance stability • Flattening the memory hierarchy


Evaluations (3/3) • Cache Design Comparison
