1. Thoughts on Shared Caches
Jeff Odom, University of Maryland
One of the common areas of performance tuning for supercomputing applications is the interaction of applications with the memory subsystems. Understanding the cache behavior of new algorithms has become an important part of the development lifecycle.
2. A Brief History of Time
First there was the single CPU
Memory tuning new field
Large improvements possible
Life is good
Then came multiple CPUs
Rethink memory interactions
Life is good (again)
Now there’s multi-core on multi-CPU
Rethink memory interactions (again)
Life will be good (we hope)
3. SMP vs. CMP
Symmetric Multiprocessing (SMP)
Single CPU core per chip
All caches private to each CPU
Communication via main memory
Chip Multiprocessing (CMP)
Multiple CPU cores on one integrated circuit
Private L1 cache
Shared second-level and higher caches
4. CMP Features
Thread-level parallelism
One thread per core
Same as SMP
Shared higher-level caches
Reduced latency
Improved memory bandwidth
Non-homogeneous data decomposition
Not all cores are created equal
5. CMP Challenges
New optimizations
False sharing/private data copies
Delaying reads until shared
Fewer locations to cache data
More chance of data eviction in high-throughput computations
Hybrid SMP/CMP systems
Connect multiple multi-core nodes
Composite cache sharing scheme
Cray XT4
2 cores/chip
2 chips/node
6. False Sharing
Occurs when two CPUs access different data structures on the same cache line
7–14. False Sharing (SMP) [diagram sequence: the contested cache line bounces between the CPUs' private caches through main memory]
15–22. False Sharing (CMP) [diagram sequence: the contested cache line is resolved in the shared L2]
23. False Sharing (SMP vs. CMP)
With private L2 caches (SMP), modification of co-resident data structures results in trips to main memory
In CMP, the impact of false sharing is limited by the shared L2
Latency from L1 to L2 is much lower than from L2 to main memory (see the sketch below)
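A minimal C sketch of the effect described above (not from the original slides): two threads repeatedly update adjacent counters. With the alignment and padding shown, each counter owns its own cache line; removing them places both counters on one line and recreates the false sharing just described. The 64-byte line size, iteration count, and all identifiers are assumptions.

```c
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64            /* assumed cache-line size */
#define ITERS 10000000L          /* arbitrary iteration count */

/* With the alignment and padding, each counter owns a full cache line.
 * Remove them and both counters share one line: classic false sharing. */
struct padded_counter {
    _Alignas(CACHE_LINE) volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[2];

static void *worker(void *arg)
{
    struct padded_counter *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->value++;              /* each thread writes only its own line */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```

On an SMP, the unpadded version forces the contested line back and forth through main memory; on a CMP the thrashing is confined to the L1s and the shared L2, which is the difference the slide above highlights.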
24. Maintaining Private Copies
Two threads modifying the same cache line will each want to move the data into their own L1
Simultaneous reading and modification causes thrashing between the L1s and the L2
Keeping a copy of the data in a separate cache line keeps it local to the processor
Updates to the shared data occur less often (see the sketch below)
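A hedged sketch of the private-copy idea, assuming a simple sum as the workload and pthreads for threading (all names are illustrative): each thread accumulates into a stack-local variable that stays in its own cache and touches the shared, lock-protected total only once.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define CHUNK    1000000L        /* arbitrary amount of per-thread work */

static long shared_total = 0;                              /* shared data */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    long local = 0;              /* private copy, stays in this core's cache */
    for (long i = 0; i < CHUNK; i++)
        local += i;              /* hot loop generates no coherence traffic */

    pthread_mutex_lock(&total_lock);
    shared_total += local;       /* the shared line is touched only once */
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld\n", shared_total);
    return 0;
}
```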
25. Delaying Reads Until Shared
Often the results from one thread are pipelined to another
Typical signal-based sharing (sketched in code after this slide):
Thread 1 (T1) accesses data, pulling it into T1's L1
T1 modifies the data
T1 signals T2 that the data is ready
T2 requests the data, forcing its eviction from T1's L1 into the shared L2
The data is now shared
The L1 line is not filled in, wasting space
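A minimal C11 sketch of this signal-based pattern, with illustrative names (`result`, `ready`): T1 produces a value and raises a flag; T2 spins on the flag and then reads the value, which pulls the line out of T1's L1.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static long result;              /* data produced by T1, consumed by T2 */
static atomic_int ready = 0;     /* the "data is ready" signal */

static void *producer(void *arg) /* plays the role of T1 */
{
    (void)arg;
    result = 42;                 /* data is pulled into T1's L1 and modified */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* signal T2 */
    return NULL;
}

static void *consumer(void *arg) /* plays the role of T2 */
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* wait for the signal */
    /* This read forces the line holding `result` out of T1's L1;
     * on a CMP it can then be serviced from the shared L2. */
    printf("got %ld\n", result);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```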
26. Delaying Reads Until Shared
Optimized sharing:
T1 pulls data into its L1 as before
T1 modifies the data
T1 waits until it has other data to fill the line with, then uses that to push the modified data into the shared L2
T1 signals T2 that the data is ready
T1 and T2 now share the data in the shared L2
Eviction is a side effect of loading the line
27. Hybrid Models
Most CMP systems will include SMP as well
Large core density is not feasible
Want to balance processing power with cache sizes
Different access patterns
Co-resident cores behave differently than cores on different nodes
Results may differ depending on which processor pairs you get (see the affinity sketch below)
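One way an application can control which processor pair it gets is explicit thread affinity. The sketch below is Linux-specific (the deck's environment is Solaris on UltraSPARC, so this is purely illustrative) and assumes that logical CPUs 0 and 1 are co-resident cores on one chip, which depends on the machine's numbering.

```c
#define _GNU_SOURCE              /* for pthread_setaffinity_np and CPU_SET */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one logical CPU (Linux-specific). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    int cpu = *(int *)arg;
    if (pin_to_cpu(cpu) != 0)
        fprintf(stderr, "could not pin to CPU %d\n", cpu);
    /* ... work here; co-resident cores share an L2, separate chips do not ... */
    return NULL;
}

int main(void)
{
    /* Assumption: logical CPUs 0 and 1 are two cores on the same chip.
     * The actual numbering is machine-dependent. */
    int cpus[2] = { 0, 1 };
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```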
28. Experimental Framework
Simics simulator
Full system simulation
Hot-swappable components
Configurable memory system
Reconfigurable cache hierarchy
Roll-your-own coherency protocol
Simulated environment
SunFire 6800, Solaris 10
Single CPU board, 4 UltraSPARC IIi
Uniform main memory access
Similar to actual hardware on hand
29. Experimental Workload
NAS Parallel Benchmarks
Well known, standard applications
Various data access patterns (conjugate gradient, multi-grid, etc.)
OpenMP-optimized
Already converted from original serial versions
MPI-based versions also available
Small (W) workloads
Simulation framework slows down execution
Will examine larger (A-C) versions to verify tool correctness
30. Workload Results
Some show marked improvement (CG)…
…others show marginal improvement (FT)…
…still others show asymmetrical loads (BT)…
…and asymmetrical improvement (EP)
31. The Next Step
How do we get the data and tools programmers need to deal with this?
Hardware
Languages
Analysis tools
Specialized hardware counters
Which CPU forced eviction
Are cores or nodes contending for data
Coherency protocol diagnostics
32. The Next Step
CMP-aware parallel languages
A language-based framework makes it easier to perform automatic optimizations
OpenMP, UPC likely candidates
Specialized partitioning may be needed to leverage shared caches
Implicit data partitioning
Current languages distribute data uniformly
May require extensions (hints) in the form of language directives (a hypothetical sketch follows this slide)
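To make the point concrete, here is a hedged sketch: the loop below uses standard OpenMP, which spreads iterations uniformly across threads; the commented-out `shared_cache` clause is not a real directive, only an illustration of the kind of hint such an extension might provide.

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static double a[N], b[N];

    /* Today: a static schedule distributes iterations (and the data they
     * touch) uniformly across threads, with no notion of which cores
     * share a cache. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    /* Hypothetical hint, NOT real OpenMP: a clause telling the runtime
     * that threads placed on co-resident cores should work on the same
     * arrays, e.g. something like
     *     #pragma omp parallel for shared_cache(a, b)
     * No such clause exists today; it only illustrates the kind of
     * directive extension the slide has in mind. */

    printf("%f\n", a[0]);
    return 0;
}
```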
33. The Next Step
Post-execution analysis tools
Identify memory hotspots
Provide hints on restructuring
Blocking
Execution interleaving
Convert SMP-optimized code for use in CMP
Dynamic instrumentation opportunities
34. Questions?