An overview of the history, challenges, and benefits of shared caches in multi-core systems: false sharing, maintaining private copies, delaying reads until data is shared, hybrid SMP/CMP models, and an experimental framework for studying performance.
Thoughts on Shared Caches
Jeff Odom, University of Maryland
A Brief History of Time
• First there was the single CPU
  • Memory tuning was a new field
  • Large improvements were possible
  • Life is good
• Then came multiple CPUs
  • Rethink memory interactions
  • Life is good (again)
• Now there's multi-core on multi-CPU
  • Rethink memory interactions (again)
  • Life will be good (we hope)
SMP vs. CMP
• Symmetric Multiprocessing (SMP)
  • Single CPU core per chip
  • All caches private to each CPU
  • Communication via main memory
• Chip Multiprocessing (CMP)
  • Multiple CPU cores on one integrated circuit
  • Private L1 cache
  • Shared second-level and higher caches
CMP Features
• Thread-level parallelism
  • One thread per core
  • Same as SMP
• Shared higher-level caches
  • Reduced latency
  • Improved memory bandwidth
• Non-homogeneous data decomposition
  • Not all cores are created equal
CMP Challenges
• New optimizations
  • False sharing/private data copies
  • Delaying reads until shared
• Fewer locations to cache data
  • More chance of data eviction in high-throughput computations
• Hybrid SMP/CMP systems
  • Connect multiple multi-core nodes
  • Composite cache sharing scheme
  • Cray XT4: 2 cores/chip, 2 chips/node
False Sharing
• Occurs when two CPUs access different data structures that reside on the same cache line
False Sharing (SMP vs. CMP)
• With private L2 caches (SMP), modification of co-resident data structures results in trips to main memory
• In CMP, the impact of false sharing is limited by the shared L2
  • Latency from L1 to L2 is much less than from L2 to main memory
Maintaining Private Copies
• Two threads modifying the same cache line will each want to move the data into their own L1
• Simultaneous reading and modification causes thrashing between the L1s and the L2
• Keeping a copy of the data on a separate cache line keeps it local to the processor (sketched in the code below)
• Updates to shared data occur less often
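The following minimal C/OpenMP sketch (my own, not from the slides) illustrates both points: two counters that land on the same cache line exhibit false sharing, while padding each counter to its own line keeps updates private to each core. The 64-byte line size and the loop counts are assumptions for illustration.

```c
/* Minimal sketch: false sharing vs. padded private copies.
 * Compile with: cc -O2 -fopenmp falseshare.c */
#include <stdio.h>
#include <omp.h>

#define LINE 64                 /* assumed cache-line size in bytes */

struct shared { long a, b; };                         /* a and b share one line */
struct padded { long v; char pad[LINE - sizeof(long)]; }; /* one counter per line */

int main(void)
{
    struct shared s = {0, 0};
    struct padded p[2] = {{0}, {0}};

    /* False sharing: both threads update fields on the same cache line,
     * so the line ping-pongs between the cores' private L1 caches. */
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++) {
            if (id == 0) s.a++; else s.b++;
        }
    }

    /* Private copies: each thread's counter lives on its own line, so
     * updates stay local and the shared level is touched far less often. */
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++)
            p[id].v++;
    }

    printf("%ld %ld %ld %ld\n", s.a, s.b, p[0].v, p[1].v);
    return 0;
}
```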
Delaying Reads Until Shared
• Often the results of one thread are pipelined to another
• Typical signal-based sharing:
  • Thread 1 (T1) accesses the data, which is pulled into T1's L1
  • T1 modifies the data
  • T1 signals T2 that the data is ready
  • T2 requests the data, forcing eviction from T1's L1 into the shared L2
  • The data is now shared
  • The L1 line is not refilled with useful data, wasting space
Delaying Reads Until Shared
• Optimized sharing (sketched in the code below):
  • T1 pulls the data into its L1 as before
  • T1 modifies the data
  • T1 waits until it has other data to fill the line with, then uses that to push the modified data into the shared L2
  • T1 signals T2 that the data is ready
  • T1 and T2 now share the data in the shared L2
  • Eviction is a side effect of loading the line
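Below is a schematic C sketch (not from the slides) of the signal-based hand-off described above. Portable C cannot directly force a line out of T1's L1 into the shared L2, so a comment only marks where the delayed-sharing optimization would take effect; the flag, array, and sizes are illustrative assumptions.

```c
/* Producer/consumer hand-off between two threads on one chip.
 * Compile with: cc -O2 -std=c11 -fopenmp handoff.c */
#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>

#define N 16
static long data[N];
static atomic_int ready = 0;

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            /* T1: modify the data (lines are pulled into T1's L1). */
            for (int i = 0; i < N; i++) data[i] = (long)i * i;

            /* Slides' optimization would go here: before signaling, T1
             * keeps working on other data so the modified lines drift
             * into the shared L2 instead of being evicted on demand
             * when T2 reads them. */

            atomic_store_explicit(&ready, 1, memory_order_release);
        } else {
            /* T2: wait for the signal, then read the now-shared data. */
            while (!atomic_load_explicit(&ready, memory_order_acquire))
                ;  /* spin */
            long sum = 0;
            for (int i = 0; i < N; i++) sum += data[i];
            printf("sum = %ld\n", sum);
        }
    }
    return 0;
}
```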
Hybrid Models
• Most CMP systems will include SMP as well
  • Large core density is not feasible
  • Want to balance processing power with cache sizes
• Different access patterns
  • Co-resident cores behave differently than cores on different nodes
  • Results may differ depending on which processor pairs you get
Experimental Framework
• Simics simulator
  • Full-system simulation
  • Hot-swappable components
  • Configurable memory system
  • Reconfigurable cache hierarchy
  • Roll-your-own coherency protocol
• Simulated environment
  • SunFire 6800, Solaris 10
  • Single CPU board, 4 UltraSPARC IIi
  • Uniform main memory access
  • Similar to actual hardware on hand
Experimental Workload
• NAS Parallel Benchmarks
  • Well-known, standard applications
  • Various data access patterns (conjugate gradient, multi-grid, etc.)
• OpenMP-optimized
  • Already converted from the original serial versions
  • MPI-based versions also available
• Small (W) workloads
  • Simulation framework slows down execution
  • Will examine larger (A-C) versions to verify tool correctness
Workload Results
• Some show marked improvement (CG)…
• …others show marginal improvement (FT)…
• …still others show asymmetrical loads (BT)…
• …and asymmetrical improvement (EP)
The Next Step
• How do we get data and tools to programmers so they can deal with this?
  • Hardware
  • Languages
  • Analysis tools
• Specialized hardware counters
  • Which CPU forced an eviction
  • Whether cores or nodes are contending for data
  • Coherency protocol diagnostics
The Next Step
• CMP-aware parallel languages
  • A language-based framework makes automatic optimization easier
  • OpenMP and UPC are likely candidates
• Specialized partitioning may be needed to leverage shared caches
  • Implicit data partitioning (see the OpenMP sketch below)
  • Current languages distribute data uniformly
  • May require extensions (hints) in the form of language directives
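As a rough illustration (my own, not from the slides) of how existing OpenMP clauses already act as coarse partitioning hints, the sketch below uses schedule(static, CHUNK) to pin contiguous blocks of iterations, and hence of the arrays, to particular threads. A CMP-aware extension of the kind suggested above might additionally indicate which threads should share a chunk so co-accessed data lands in one shared L2. CHUNK and the array size are assumptions.

```c
/* Implicit data partitioning via OpenMP loop scheduling.
 * Compile with: cc -O2 -fopenmp partition.c */
#include <stdio.h>
#include <omp.h>

#define N 1024
#define CHUNK 128   /* assumed chunk size; tune to cache-line/L2 geometry */

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (double)i;

    /* schedule(static, CHUNK) assigns contiguous blocks of iterations
     * (and thus contiguous pieces of a[] and b[]) to particular threads:
     * an implicit, uniform data partitioning. */
    #pragma omp parallel for schedule(static, CHUNK)
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```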
The Next Step
• Post-execution analysis tools
  • Identify memory hotspots
  • Provide hints on restructuring
    • Blocking (see the sketch below)
    • Execution interleaving
  • Convert SMP-optimized code for use in CMP
  • Dynamic instrumentation opportunities
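As an example of the blocking restructuring such a tool might suggest, here is a small C sketch (not from the slides) that tiles a matrix transpose so each B x B tile stays cache-resident before the loop moves on; N and B are illustrative assumptions.

```c
/* Loop blocking (tiling) example: blocked matrix transpose.
 * Compile with: cc -O2 blocking.c */
#include <stdio.h>

#define N 512
#define B 32                     /* assumed tile size */

static double src[N][N], dst[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            src[i][j] = (double)(i * N + j);

    /* Blocked transpose: both the src and dst tiles fit in cache, so
     * each line is reused before it is evicted. */
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];

    printf("dst[1][0] = %f\n", dst[1][0]);
    return 0;
}
```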