Caching in multiprocessor systems

Tiina Niklander. In AMICT 2009, Petrozavodsk, 19.5.2009

Background
• More transistors on one chip
  • Multiple cores
  • Larger cache
  • Multiple on-chip caches
  • More functionality (more functional units, dedicated multimedia / deciphering cell, integrated GPU)
• Multiple cores introduce:
  • Cache organization
  • Private vs shared caches
  • Cache coherence

Cache organization
• Common organization:
  • L1 is private
  • Last-level cache is shared
• With three levels:
  • L1: private
  • L2: private or shared?
  • L3: shared

Private vs shared cache
• Fully private, fully shared, or partially shared
Figure panels: Private L2 (a pair of processors shares it); Shared L2 (all can access all L2)
F. Sibai: On the performance benefits of sharing and privatizing second and third-level cache memories in homogeneous multi-core architectures. Microprocessors and Microsystems 32 (2008), pp. 405-412

Shared cache
• Coherence is simple (just one copy of each line)
• Latency varies with the distance between the CPU and the cache location
• Cache access competition (a core may have to wait for another core)
M. Kandemir, F. Li, M.J. Irwin, S.W. Son: A Novel Migration-Based NUCA Design for Chip Multiprocessors. In SC2008. IEEE, 2008, pp.
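
The latency point can be illustrated with a small sketch. It assumes a hypothetical mesh of shared cache banks in which access time grows with the hop count between the requesting core and the bank. The cycle counts and the `best_bank` heuristic are illustrative assumptions, not the design from the cited paper.

```python
# Illustrative model only: cycle counts and mesh layout are assumptions.
BASE_CYCLES = 4   # assumed time to access a bank once the request arrives
HOP_CYCLES = 2    # assumed cost of one hop on the on-chip network


def hops(core, bank):
    """Manhattan distance between a core tile and a bank tile, both (x, y)."""
    return abs(core[0] - bank[0]) + abs(core[1] - bank[1])


def access_latency(core, bank):
    """Latency seen by `core` when the line lives in `bank`."""
    return BASE_CYCLES + HOP_CYCLES * hops(core, bank)


def best_bank(access_counts, banks):
    """Bank that minimises total latency, weighted by accesses per core.

    access_counts maps a core position to how often it touches the line.
    A migration scheme could use such a target to move the line closer
    to the cores that use it most.
    """
    return min(
        banks,
        key=lambda b: sum(n * access_latency(c, b) for c, n in access_counts.items()),
    )


if __name__ == "__main__":
    banks = [(x, y) for x in range(4) for y in range(4)]
    print(access_latency((0, 0), (3, 3)))               # far corner of the mesh
    print(best_bank({(0, 0): 10, (3, 3): 1}, banks))    # heavy user pulls the line
```

The same line costs a different number of cycles depending on which core asks for it, which is what distinguishes a non-uniform shared cache from a private one.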

Private cache
• No access competition, smaller latencies
• But coherence becomes an issue!
  • Same data in multiple caches -> invalidate on write
• Cache partitioning
  • Design time: fixed partitioning
  • Run time:
    • Fixed partitioning (configuration issue)
    • Dynamic (based on current need)
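
Dynamic partitioning "based on current need" can be sketched as follows. The sketch assumes that need is measured by recent miss counts per core and that the cache is divided by ways. Both the metric and the function name `partition_ways` are assumptions made for illustration.

```python
def partition_ways(total_ways, misses):
    """Split cache ways among cores in proportion to recent miss counts.

    misses maps a core id to its miss count in the last interval.
    Every core keeps at least one way. Leftover ways go to the cores
    with the largest fractional share (largest-remainder rounding).
    """
    cores = sorted(misses)
    if total_ways < len(cores):
        raise ValueError("need at least one way per core")

    alloc = {c: 1 for c in cores}
    spare = total_ways - len(cores)
    total = sum(misses.values())

    if total == 0:
        # No demand information: share the spare ways evenly.
        shares = {c: spare / len(cores) for c in cores}
    else:
        shares = {c: spare * misses[c] / total for c in cores}

    for c in cores:
        alloc[c] += int(shares[c])

    leftover = total_ways - sum(alloc.values())
    by_remainder = sorted(cores, key=lambda c: shares[c] - int(shares[c]), reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


if __name__ == "__main__":
    # Core 0 missed three times as often as core 1 in the last interval.
    print(partition_ways(9, {0: 300, 1: 100}))
```

Calling the function once at configuration time gives fixed run-time partitioning; calling it every interval with fresh counters gives the dynamic variant.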

Cache coherence
• Protocols: MESI, MSI, MOSI, MOESI
• Invalidation message: RFO (Read for ownership)
• Each cache snoops the bus to monitor memory operations
States: M – Modified, (O – Owned), E – Exclusive, S – Shared, I – Invalid
Table (source: Wikipedia): permitted state pairs for a line held in two caches; Y – allowed state, N – not allowed state
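
A minimal sketch of MESI for a single cache line on a snooping bus is given here. It tracks states only, with no data or write-back traffic, and the class name `SnoopingBus` is an assumption for illustration. `ALLOWED_PAIRS` encodes the standard MESI rule for which states two caches may hold for the same line at the same time.

```python
from enum import Enum
from itertools import combinations


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


# Unordered pairs of states that two caches may hold for the same line.
ALLOWED_PAIRS = {
    frozenset({State.M, State.I}),
    frozenset({State.E, State.I}),
    frozenset({State.S}),            # S together with S
    frozenset({State.S, State.I}),
    frozenset({State.I}),            # I together with I
}


class SnoopingBus:
    """State of one cache line in each of n private caches."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches

    def _others(self, cpu):
        return [i for i in range(len(self.states)) if i != cpu]

    def read(self, cpu):
        if self.states[cpu] is not State.I:
            return  # read hit, no bus traffic
        holders = [i for i in self._others(cpu) if self.states[i] is not State.I]
        for i in holders:
            # Snooping holders downgrade; an M holder would also write back.
            self.states[i] = State.S
        self.states[cpu] = State.S if holders else State.E

    def write(self, cpu):
        if self.states[cpu] in (State.S, State.I):
            # Broadcast an invalidation (an RFO when the line is not held):
            # every other cache snoops it and drops its copy.
            for i in self._others(cpu):
                self.states[i] = State.I
        # From E the upgrade to M is silent: nobody else has a copy.
        self.states[cpu] = State.M

    def is_coherent(self):
        return all(
            frozenset({a, b}) in ALLOWED_PAIRS
            for a, b in combinations(self.states, 2)
        )


if __name__ == "__main__":
    bus = SnoopingBus(3)
    for op, cpu in [("read", 0), ("read", 1), ("write", 1), ("read", 2)]:
        getattr(bus, op)(cpu)
        print(op, cpu, [s.name for s in bus.states], bus.is_coherent())
```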

(Distributed) cooperative caches
• Add a directory structure
  • Knows the data locations in local caches
• Cache-to-cache copying
  • When the data is in another cache (the directory locates it)
  • On eviction (store the line temporarily in another cache)
E. Herrero, J. González, R. Canal: Distributed Cooperative Caching. In PACT’08. ACM, 2008, pp. 134-142
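
The two uses of cache-to-cache copying can be sketched with a toy model. It is read-only (no writes, no coherence), uses one central directory, LRU replacement, and spills to the next cache in a ring. All of these are simplifying assumptions, and the class `CooperativeCaches` is hypothetical rather than the design of the cited paper.

```python
from collections import OrderedDict


class CooperativeCaches:
    """Private caches plus a directory of which cache holds which address."""

    def __init__(self, n_caches, capacity):
        if n_caches < 2 or capacity < 1:
            raise ValueError("need at least two caches with capacity >= 1")
        self.caches = [OrderedDict() for _ in range(n_caches)]  # LRU order
        self.capacity = capacity
        self.directory = {}  # address -> set of cache ids holding it

    def _insert(self, cid, addr, value, allow_spill):
        cache = self.caches[cid]
        cache[addr] = value
        cache.move_to_end(addr)
        self.directory.setdefault(addr, set()).add(cid)
        if len(cache) <= self.capacity:
            return
        victim, victim_value = cache.popitem(last=False)  # evict LRU line
        self.directory[victim].discard(cid)
        if allow_spill and not self.directory[victim]:
            # Last on-chip copy: store it temporarily in a neighbour.
            host = (cid + 1) % len(self.caches)
            self._insert(host, victim, victim_value, allow_spill=False)
        if not self.directory.get(victim):
            self.directory.pop(victim, None)  # line has left the chip

    def read(self, cid, addr, memory):
        """Return (value, source) for a read by cache `cid`."""
        cache = self.caches[cid]
        if addr in cache:
            cache.move_to_end(addr)
            return cache[addr], "local hit"
        holders = self.directory.get(addr)
        if holders:
            value = self.caches[min(holders)][addr]  # directory locates a copy
            source = "cache-to-cache"
        else:
            value = memory[addr]
            source = "memory"
        self._insert(cid, addr, value, allow_spill=True)
        return value, source


if __name__ == "__main__":
    memory = {a: a * 10 for a in range(4)}
    coop = CooperativeCaches(n_caches=2, capacity=1)
    for cid, addr in [(0, 0), (1, 0), (0, 1), (0, 2), (0, 1)]:
        print(cid, addr, coop.read(cid, addr, memory))
```

In the final read of the usage example, address 1 was evicted from cache 0 earlier, yet it is served from cache 1 because the eviction spilled it there instead of dropping it.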

New improvement ideas for cache performance 1/2
• Split the cache for different tasks
  • Dynamically allocate cache areas
• Software-controlled eviction
  • GOAL: a thread moves data that it no longer needs, but that is strongly shared, to the shared cache to improve the performance of other threads
  • A new instruction, evict, tells the processor to move some data from private L1 or L2 to shared L3
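
The effect of the evict hint can be sketched with a toy producer/consumer model. Dictionaries stand in for the private caches and the shared L3. The `Core` class and its method names are hypothetical, and coherence and capacity limits are deliberately left out.

```python
class Core:
    """One core with a private cache (standing in for L1/L2) and a shared L3."""

    def __init__(self, shared_l3):
        self.private = {}
        self.l3 = shared_l3  # the same dict object is shared by all cores

    def write(self, addr, value):
        self.private[addr] = value

    def evict(self, addr):
        """Model of the evict hint: push a line from private cache to shared L3."""
        if addr in self.private:
            self.l3[addr] = self.private.pop(addr)

    def locate(self, addr, other_cores):
        """Report where a read of `addr` by this core would be served from."""
        if addr in self.private:
            return "private hit"
        if addr in self.l3:
            return "shared L3 hit"
        if any(addr in other.private for other in other_cores):
            return "remote private cache"
        return "memory"


if __name__ == "__main__":
    l3 = {}
    producer, consumer = Core(l3), Core(l3)
    producer.write(0x40, "result")
    print(consumer.locate(0x40, [producer]))  # line is still in producer's private cache
    producer.evict(0x40)                      # producer is done with the data
    print(consumer.locate(0x40, [producer]))  # now found directly in the shared L3
```

The producer gives up a line it no longer needs so that the consumer finds it in the shared cache instead of having to fetch it from another core's private cache.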

New improvement ideas for cache performance 2/2
• Helper threads
  • GOAL: an additional thread executes parts of the code ahead of the actual thread to ‘prefetch’ data into the cache
• Generate memory traces for the programmer
  • Tuning the software performance
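
The helper-thread idea can be sketched as a deterministic simulation rather than with real threads. The helper is modelled as a second cursor that runs a fixed number of accesses ahead of the main thread over the same address trace. The unbounded cache, the fixed `lead`, and the function name `count_main_misses` are assumptions made for illustration.

```python
def count_main_misses(trace, lead, use_helper):
    """Count cache misses seen by the main thread over an address trace.

    With use_helper=True, a helper runs `lead` accesses ahead and brings
    each address into the cache before the main thread reaches it.
    The cache is modelled as an unbounded set (no evictions).
    """
    cache = set()
    misses = 0
    for i, addr in enumerate(trace):
        if use_helper and i + lead < len(trace):
            cache.add(trace[i + lead])  # helper touches a future address
        if addr not in cache:
            misses += 1                 # main thread stalls on this access
            cache.add(addr)
    return misses


if __name__ == "__main__":
    trace = list(range(8))  # eight distinct addresses, each touched once
    print(count_main_misses(trace, lead=2, use_helper=False))
    print(count_main_misses(trace, lead=2, use_helper=True))
```

In this model the main thread only misses on the first `lead` accesses, before the helper has got ahead of it; the helper still pays for the misses, but off the critical path.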

Conclusion
• Focus on fine-tuning the cache performance
  • Cache coherence itself was solved earlier
    • Not always used (if non-coherent usage is allowed)
• L2 and L3 caches
  • Shared or private
  • Cache partitioning
• Support for software-based improvements
  • Eviction hints
  • Traces
  • Prefetching (like helper threads)

References
• S. Fide, S. Jenks: Proactive use of shared L3 caches to enhance cache communications in multi-core processors. IEEE Comp. Arch. L. vol 7 (2008), pp. 57-60
• E. Herrero, J. González, R. Canal: Distributed Cooperative Caching. In Conf. on Parallel architectures and compilation techniques, PACT’08. ACM, 2008, pp. 134-142
• M. Kandemir, F. Li, M.J. Irwin, S.W. Son: A Novel Migration-Based NUCA Design for Chip Multiprocessors. In Proc. of the 2008 ACM/IEEE Conf. on Supercomputing. IEEE, 2008, pp. 1-12
• L. Peng et al.: Memory hierarchy performance measurement of commercial dual-core desktop processors. Journal of Systems Architecture 54 (2008), pp. 816-828
• F. Sibai: On the performance benefits of sharing and privatizing second and third-level cache memories in homogeneous multi-core architectures. Microprocessors and Microsystems 32 (2008), pp. 405-412
• J. Zhang, X. Fan, S.H. Liu: A Pollution Alleviate L2 Cache Replacement Policy for Chip Multiprocessor Architecture. In Int. Conf. on Networking, Architecture and Storage, IEEE, 2008, pp. 310-316
