1. Thoughts on Shared Caches
Jeff Odom, University of Maryland
One of the common areas of performance tuning for supercomputing applications is the interaction of applications with the memory subsystems. Understanding the cache behavior of new algorithms has become an important part of the development lifecycle.
2. A Brief History of Time
First there was the single CPU
Memory tuning new field
Large improvements possible
Life is good
Then came multiple CPUs
Rethink memory interactions
Life is good (again)
Now there’s multi-core on multi-CPU
Rethink memory interactions (again)
Life will be good (we hope)
3. SMP vs. CMP
Symmetric Multiprocessing (SMP)
Single CPU core per chip
All caches private to each CPU
Communication via main memory
Chip Multiprocessing (CMP)
Multiple CPU cores on one integrated circuit
Private L1 cache
Shared second-level and higher caches
4. CMP Features
Thread-level parallelism
One thread per core
Same as SMP
Shared higher-level caches
Reduced latency
Improved memory bandwidth
Non-homogeneous data decomposition
Not all cores are created equal
5. CMP Challenges
New optimizations
False sharing/private data copies
Delaying reads until shared
Fewer locations to cache data
More chance of data eviction in high-throughput computations
Hybrid SMP/CMP systems
Connect multiple multi-core nodes
Composite cache sharing scheme
Cray XT4
2 cores/chip
2 chips/node
6. False Sharing
Occurs when two CPUs access different data structures on the same cache line
7–14. False Sharing (SMP) [diagram sequence: the contested cache line bounces between the CPUs' private caches through main memory]
15–22. False Sharing (CMP) [diagram sequence: the contested cache line is resolved in the shared L2]
23. False Sharing (SMP vs. CMP)
With private L2 caches (SMP), modification of co-resident data structures results in trips to main memory
In CMP, the impact of false sharing is limited by the shared L2
Latency from L1 to L2 is much lower than from L2 to main memory (see the sketch below)
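A minimal C sketch of the effect described above (not from the original slides): two threads repeatedly update adjacent counters. With the alignment and padding shown, each counter owns its own cache line; removing them places both counters on one line and recreates the false sharing just described. The 64-byte line size, iteration count, and all identifiers are assumptions.

```c
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64            /* assumed cache-line size */
#define ITERS 10000000L          /* arbitrary iteration count */

/* With the alignment and padding, each counter owns a full cache line.
 * Remove them and both counters share one line: classic false sharing. */
struct padded_counter {
    _Alignas(CACHE_LINE) volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[2];

static void *worker(void *arg)
{
    struct padded_counter *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->value++;              /* each thread writes only its own line */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```

On an SMP, the unpadded version forces the contested line back and forth through main memory; on a CMP the thrashing is confined to the L1s and the shared L2, which is the difference the slide above highlights.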
24. Maintaining Private Copies
Two threads modifying the same cache line will each want to move the data into their own L1
Simultaneous reading and modification causes thrashing between the L1s and the L2
Keeping a copy of the data in a separate cache line keeps it local to the processor
Updates to the shared data occur less often (see the sketch below)
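A hedged sketch of the private-copy idea, assuming a simple sum as the workload and pthreads for threading (all names are illustrative): each thread accumulates into a stack-local variable that stays in its own cache and touches the shared, lock-protected total only once.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define CHUNK    1000000L        /* arbitrary amount of per-thread work */

static long shared_total = 0;                              /* shared data */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    long local = 0;              /* private copy, stays in this core's cache */
    for (long i = 0; i < CHUNK; i++)
        local += i;              /* hot loop generates no coherence traffic */

    pthread_mutex_lock(&total_lock);
    shared_total += local;       /* the shared line is touched only once */
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld\n", shared_total);
    return 0;
}
```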
25. Delaying Reads Until Shared
Often the results from one thread are pipelined to another
Typical signal-based sharing (sketched in code after this slide):
Thread 1 (T1) accesses data, pulling it into T1's L1
T1 modifies the data
T1 signals T2 that the data is ready
T2 requests the data, forcing its eviction from T1's L1 into the shared L2
The data is now shared
The L1 line is not filled in, wasting space
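A minimal C11 sketch of this signal-based pattern, with illustrative names (`result`, `ready`): T1 produces a value and raises a flag; T2 spins on the flag and then reads the value, which pulls the line out of T1's L1.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static long result;              /* data produced by T1, consumed by T2 */
static atomic_int ready = 0;     /* the "data is ready" signal */

static void *producer(void *arg) /* plays the role of T1 */
{
    (void)arg;
    result = 42;                 /* data is pulled into T1's L1 and modified */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* signal T2 */
    return NULL;
}

static void *consumer(void *arg) /* plays the role of T2 */
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* wait for the signal */
    /* This read forces the line holding `result` out of T1's L1;
     * on a CMP it can then be serviced from the shared L2. */
    printf("got %ld\n", result);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```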
26. Delaying Reads Until Shared
Optimized sharing:
T1 pulls data into its L1 as before
T1 modifies the data
T1 waits until it has other data to fill the line with, then uses that to push the modified data into the shared L2
T1 signals T2 that the data is ready
T1 and T2 now share the data in the shared L2
Eviction is a side effect of loading the line
27. Hybrid Models
Most CMP systems will include SMP as well
Large core density is not feasible
Want to balance processing power with cache sizes
Different access patterns
Co-resident cores behave differently than cores on different nodes
Results may differ depending on which processor pairs you get (see the affinity sketch below)
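One way an application can control which processor pair it gets is explicit thread affinity. The sketch below is Linux-specific (the deck's environment is Solaris on UltraSPARC, so this is purely illustrative) and assumes that logical CPUs 0 and 1 are co-resident cores on one chip, which depends on the machine's numbering.

```c
#define _GNU_SOURCE              /* for pthread_setaffinity_np and CPU_SET */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one logical CPU (Linux-specific). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    int cpu = *(int *)arg;
    if (pin_to_cpu(cpu) != 0)
        fprintf(stderr, "could not pin to CPU %d\n", cpu);
    /* ... work here; co-resident cores share an L2, separate chips do not ... */
    return NULL;
}

int main(void)
{
    /* Assumption: logical CPUs 0 and 1 are two cores on the same chip.
     * The actual numbering is machine-dependent. */
    int cpus[2] = { 0, 1 };
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &cpus[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```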
28. Experimental Framework
Simics simulator
Full system simulation
Hot-swappable components
Configurable memory system
Reconfigurable cache hierarchy
Roll-your-own coherency protocol
Simulated environment
SunFire 6800, Solaris 10
Single CPU board, 4 UltraSPARC IIi
Uniform main memory access
Similar to actual hardware on hand
29. Experimental Workload
NAS Parallel Benchmarks
Well known, standard applications
Various data access patterns (conjugate gradient, multi-grid, etc.)
OpenMP-optimized
Already converted from original serial versions
MPI-based versions also available
Small (W) workloads
Simulation framework slows down execution
Will examine larger (A-C) versions to verify tool correctness
30. Workload Results
Some show marked improvement (CG)…
…others show marginal improvement (FT)…
…still others show asymmetrical loads (BT)…
…and asymmetrical improvement (EP)
31. The Next Step
How do we get the data and tools programmers need to deal with this?
Hardware
Languages
Analysis tools
Specialized hardware counters
Which CPU forced eviction
Are cores or nodes contending for data
Coherency protocol diagnostics
32. The Next Step
CMP-aware parallel languages
A language-based framework makes it easier to perform automatic optimizations
OpenMP, UPC likely candidates
Specialized partitioning may be needed to leverage shared caches
Implicit data partitioning
Current languages distribute data uniformly
May require extensions (hints) in the form of language directives (a hypothetical sketch follows this slide)
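To make the point concrete, here is a hedged sketch: the loop below uses standard OpenMP, which spreads iterations uniformly across threads; the commented-out `shared_cache` clause is not a real directive, only an illustration of the kind of hint such an extension might provide.

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static double a[N], b[N];

    /* Today: a static schedule distributes iterations (and the data they
     * touch) uniformly across threads, with no notion of which cores
     * share a cache. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    /* Hypothetical hint, NOT real OpenMP: a clause telling the runtime
     * that threads placed on co-resident cores should work on the same
     * arrays, e.g. something like
     *     #pragma omp parallel for shared_cache(a, b)
     * No such clause exists today; it only illustrates the kind of
     * directive extension the slide has in mind. */

    printf("%f\n", a[0]);
    return 0;
}
```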
33. The Next Step
Post-execution analysis tools
Identify memory hotspots
Provide hints on restructuring
Blocking
Execution interleaving
Convert SMP-optimized code for use in CMP
Dynamic instrumentation opportunities
34. Questions?