Optimizing Replication, Communication, and Capacity Allocation in CMPs
Z. Chishti, M. D. Powell, and T. N. Vijaykumar
Presented by: Siddhesh Mhambrey
Published in Proceedings of the 32nd International Symposium on Computer Architecture, pages 357-368, June 2005.
Motivation
• Emerging trend toward CMPs raises new challenges for cache design policies
• Increased capacity pressure on on-chip memory: multiple cores need a large on-chip capacity
• Increased cache latencies in large caches: wire delays
Need for a cache design that tackles these challenges
Cache Organization
• Goals:
• Utilize capacity effectively: reduce capacity misses
• Mitigate increased latencies: keep wire delays small
• Shared cache: high capacity but increased latency
• Private caches: low latency but limited capacity
Neither private nor shared caches achieve both goals
Latency-Capacity Tradeoff
• SMPs and DSMs have the same goals in terms of cache design
• Capacity: CMPs have limited on-chip memories, while SMPs have large off-chip memories
• Latency of accesses: SMPs have slow off-chip access, while CMPs have fast on-chip access
CMPs change the Latency-Capacity tradeoff in two ways
Novel Mechanisms
• Controlled Replication: avoid copies for some read-only shared data
• In-Situ Communication: use fast on-chip communication to avoid coherence misses on read-write shared data
• Capacity Stealing: allow a core to steal another core's unused capacity
• Hybrid cache: private tag arrays and a shared data array
• CMP-NuRAPID (Non-Uniform access with Replacement And Placement usIng Distance associativity)
• Performance: CMP-NuRAPID improves performance by 13% over a shared cache and 8% over a private cache for three commercial multithreaded workloads
Three novel mechanisms exploit the changes in the Latency-Capacity tradeoff
CMP-NuRAPID
• Non-uniform access with distance associativity
• Caches divided into d-groups (distance groups)
• Each core prefers its closest d-group
[Figure: 4-core CMP with CMP-NuRAPID]
CMP-NuRAPID Organization
[Figure: CMP-NuRAPID tag and data arrays]
CMP-NuRAPID Organization
• Private tag array per core
• Shared data array
• Leverages forward and reverse pointers between tags and data frames
• A single copy of a block can be shared by multiple tags
• Data for one core can reside in different d-groups
The extra level of indirection enables the novel mechanisms (see the sketch below)
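To make the pointer structure concrete, here is a minimal C++ sketch of the tag/data indirection, assuming 64-byte blocks and a 4-core CMP; all names and field layouts are illustrative, not taken from the paper.

```cpp
#include <cstdint>

// Per-core private tag entry: holds a forward pointer into the shared data array.
struct TagEntry {
    uint64_t tag;       // address tag, checked on every access
    bool     valid;
    uint8_t  state;     // coherence state (M/E/S/I plus C, described later)
    uint32_t frameIdx;  // forward pointer: index of the data frame holding the block
};

// Shared data frame: holds the block plus reverse pointers back to its tags.
struct DataFrame {
    uint8_t  block[64];  // one cache block
    uint8_t  dGroup;     // distance group (d-group) this frame belongs to
    uint32_t owners[4];  // reverse pointers: tag entries sharing this frame
    uint8_t  numOwners;  // a single copy may be shared by multiple tags
};
```

Because the reverse pointers identify every tag that references a frame, the controller can relocate a block between d-groups without invalidating it, which is what the three mechanisms below rely on.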
Mechanisms
• Controlled Replication
• In-Situ Communication
• Capacity Stealing
Controlled Replication
• On a read miss: update the tag's forward pointer to reference the block already on chip, without copying the data
• On a subsequent read: make a data copy in the reader's closest d-group to avoid slow accesses in the future (see the sketch below)
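A hedged sketch of this two-step policy, reusing the TagEntry structure above; the helper functions (findOnChipCopy, fetchFromMemory, allocateFrame, copyBlock, closestDGroup) are hypothetical stand-ins for the cache controller's internals.

```cpp
uint32_t findOnChipCopy(uint64_t addr);            // hypothetical: frame of an existing copy, or NO_FRAME
uint32_t fetchFromMemory(int core, uint64_t addr); // hypothetical: ordinary miss handling
uint32_t allocateFrame(uint8_t dGroup);            // hypothetical: free frame in a d-group
void     copyBlock(uint32_t dst, uint32_t src);    // hypothetical
uint8_t  closestDGroup(int core);                  // hypothetical
constexpr uint32_t NO_FRAME = ~0u;

// First read miss to a block another core already holds: point, don't copy.
void onReadMiss(int core, uint64_t addr, TagEntry& t) {
    uint32_t existing = findOnChipCopy(addr);
    t.frameIdx = (existing != NO_FRAME) ? existing  // reuse the on-chip copy
                                        : fetchFromMemory(core, addr);
    t.valid = true;
}

// Subsequent read through a far pointer: now replicate into the closest d-group.
void onSecondRead(int core, TagEntry& t) {
    uint32_t local = allocateFrame(closestDGroup(core));
    copyBlock(local, t.frameIdx);
    t.frameIdx = local;  // future reads hit the nearby replica
}
```

Delaying replication to the second read is the "controlled" part: blocks read only once never consume a second frame, so capacity is spent only on data with demonstrated reuse.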
In-Situ Communication
• Enforce a single copy of a read-write shared block in L2 and keep the block in the communication (C) state
• Replace the M-to-S transition with an M-to-C transition (see the sketch below)
Fast communication with capacity savings
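A minimal sketch of the state change, under the same assumed structures: where a conventional MESI-style protocol would downgrade M to S and replicate the block for the reader, here the single on-chip copy stays in place and both tags move to C.

```cpp
// Assumed MESI-style state encoding, extended with the communication state.
enum CoherenceState : uint8_t { I, S, E, M, C };

// Remote read to a block another core holds in M: M -> C, not M -> S.
void onRemoteRead(TagEntry& writerTag, TagEntry& readerTag) {
    if (writerTag.state == M) {
        writerTag.state = C;                      // single copy stays where it is
        readerTag.frameIdx = writerTag.frameIdx;  // reader points at the same copy
        readerTag.state = C;
        readerTag.valid = true;                   // no second data copy is made
    }
}
```

Because no replica is created, the next write by the producer does not have to invalidate reader copies, avoiding the coherence misses that S-state sharing would incur.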
Capacity Stealing
• Demotion: demote less frequently used data to unused frames in d-groups closer to a core with lower capacity demands
• Promotion: if a tag hit occurs on a block in a farther d-group, promote the block back toward the core's closest d-group (see the sketch below)
• Data for one core can reside in different d-groups
Uses the unused capacity of a neighboring core
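A hedged sketch of demotion and promotion; as before, findUnusedFrameNear, lightlyLoadedNeighbor, and dGroupOf are hypothetical helpers, and allocateFrame, copyBlock, and closestDGroup are the ones declared earlier.

```cpp
uint32_t findUnusedFrameNear(int neighborCore);  // hypothetical: frame stolen from a neighbor
int      lightlyLoadedNeighbor(int core);        // hypothetical: core with spare capacity
uint8_t  dGroupOf(uint32_t frameIdx);            // hypothetical

// Demotion: move a cold block into a neighbor's unused, farther d-group.
void demote(int core, TagEntry& t) {
    uint32_t stolen = findUnusedFrameNear(lightlyLoadedNeighbor(core));
    copyBlock(stolen, t.frameIdx);
    t.frameIdx = stolen;  // tag still hits; only the data moved farther away
}

// Promotion: on a tag hit to a far d-group, pull the block closer.
void promoteOnHit(int core, TagEntry& t) {
    if (dGroupOf(t.frameIdx) != closestDGroup(core)) {
        uint32_t nearFrame = allocateFrame(closestDGroup(core));
        copyBlock(nearFrame, t.frameIdx);
        t.frameIdx = nearFrame;
    }
}
```

Note that only the data moves: the block stays in the owning core's private tag array throughout, so stealing a neighbor's capacity never requires a coherence transaction.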
Methodology
• Full-system simulation of a 4-core CMP using Simics
• CMP-NuRAPID: 8 MB, 8-way
• 4 d-groups, 1 port for each tag array and data d-group
• Compared to:
• Private: 2 MB, 8-way, 1 port per core
• CMP-SNUCA: shared cache with non-uniform access, no replication
Results
[Figures: speedups for multi-threaded and multi-programmed workloads]
Conclusions
• CMPs change the Latency-Capacity tradeoff
• Controlled Replication, In-Situ Communication, and Capacity Stealing are novel mechanisms that exploit the change in the Latency-Capacity tradeoff
• CMP-NuRAPID is a hybrid cache that incorporates the novel mechanisms
• For commercial multi-threaded workloads: 13% better than shared, 8% better than private
• For multi-programmed workloads: 28% better than shared, 8% better than private
Thank you
Questions?