Explore the features, cache misses, and performance expectations of Cache-Coherent Non-Uniform-Memory-Access (CC-NUMA) and Cache-Only Memory Architectures (COMA). Simulation results show how these architectures perform in various applications. Discover how cache size, network latency, and data directory structure variations impact their efficiency.
Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures • Per Stenstrom, Truman Joe, and Anoop Gupta • Presented by Colleen Lewis
Overview • Common Features • CC-NUMA • COMA • Cache Misses • Performance Expectations • Simulation & Results • COMA-F
Common Features • Large-scale multiprocessors • Single address space • Distributed main memory • Directory-based cache coherence • Scalable interconnection network • Examples: Stanford DASH (CC-NUMA); DDM and KSR-1 (COMA)
CC-NUMA Cache-Coherent Non-Uniform-Memory-Access Machines • Independent of network topology • Write-invalidate cache coherence protocol • 2-hop miss (clean copy in the home node's memory) • 3-hop miss (dirty copy in a third node's cache)
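To make the 2-hop/3-hop distinction concrete, here is a minimal sketch of how a read miss under a directory-based, write-invalidate protocol maps to network hops. The struct and function names are illustrative, not taken from the paper.

```c
/* Sketch: hops on the interconnect to satisfy a CC-NUMA read miss.
 * All names here are illustrative, not the paper's. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool dirty;   /* true if a remote cache holds the only up-to-date copy */
    int  owner;   /* node holding that copy; meaningful only when dirty */
} dir_entry;

/* 2 hops: requester -> home (clean copy in home memory) -> requester
 * 3 hops: requester -> home -> dirty owner -> requester */
int read_miss_hops(int requester, int home, const dir_entry *e)
{
    if (requester == home)
        return e->dirty ? 2 : 0;   /* local memory, or fetch from the owner */
    return e->dirty ? 3 : 2;
}

int main(void)
{
    dir_entry clean = { false, -1 }, modified = { true, 5 };
    printf("clean remote miss: %d hops\n", read_miss_hops(0, 3, &clean));
    printf("dirty remote miss: %d hops\n", read_miss_hops(0, 3, &modified));
    return 0;
}
```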
COMA Cache-Only Memory Architectures • Attraction memory – per-node memory acts as a secondary/tertiary cache • Data is distributed and mobile • Directory is dynamically distributed in a hierarchy • Combining – multiple reads to the same block can be merged in the hierarchy (LU – 47%, Barnes-Hut – 6%, remaining applications < 1%) • Reduces average miss latency • Increased overhead for the directory structure
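The attraction-memory idea can be sketched as a per-node cache of memory blocks that data migrates into on use. This toy model is an assumption-laden illustration (the direct-mapped organization, sizes, and names are mine, not the paper's); it just shows how a second access to a migrated block becomes a local hit.

```c
/* Toy model of a COMA attraction memory: each node's memory behaves like a
 * large cache of blocks, so data migrates to the node that uses it.
 * Organization and sizes are illustrative only. */
#include <stdio.h>

#define AM_SETS 1024                 /* sets per node's attraction memory */

typedef struct {
    long tag[AM_SETS];               /* which block currently occupies each set */
    int  valid[AM_SETS];
} attraction_mem;

/* Returns 1 on a local hit; on a miss the block is fetched (via the
 * directory hierarchy in a real machine) and replicated locally,
 * possibly displacing the block that occupied the set before. */
int am_access(attraction_mem *am, long block)
{
    int set = (int)(block % AM_SETS);
    if (am->valid[set] && am->tag[set] == block)
        return 1;                    /* hit: served from local memory */
    am->valid[set] = 1;
    am->tag[set]   = block;          /* miss: migrate the block in */
    return 0;
}

int main(void)
{
    attraction_mem node0 = {{0}};
    printf("first access:  %s\n", am_access(&node0, 42) ? "hit" : "miss");
    printf("second access: %s\n", am_access(&node0, 42) ? "hit" : "miss");
    return 0;
}
```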
Cache Misses – CC-NUMA vs. COMA • Which architecture has lower miss latency?
Performance Expectations – CC-NUMA vs. COMA • COMA is expected to do better when capacity (and cold) misses dominate, since data migrates into the local attraction memory • CC-NUMA is expected to do better when coherence misses dominate, since COMA's hierarchical directory adds latency to each such miss
Simulation • 16 processors • Cache line size: 16 bytes • Cache size: 4 Kbytes per processor (deliberately small, to force capacity misses)
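As a quick back-of-the-envelope check (using only the parameters on this slide, nothing further from the paper), each simulated cache holds very few lines, which is why capacity misses dominate.

```c
/* Derive how many 16-byte lines fit in a 4 KB per-processor cache. */
#include <stdio.h>

int main(void)
{
    const int processors      = 16;
    const int line_bytes      = 16;
    const int cache_bytes     = 4 * 1024;
    const int lines_per_cache = cache_bytes / line_bytes;   /* 256 lines */

    printf("%d processors, %d lines of %d bytes per cache\n",
           processors, lines_per_cache, line_bytes);
    return 0;
}
```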
Results – CC-NUMA vs. COMA • MP3D – Particle-based wind-tunnel simulation • PTHOR – Distributed-time logic simulation • LocusRoute – VLSI standard-cell router • Water – Molecular dynamics code • Cholesky – Cholesky factorization of a sparse matrix • LU – LU decomposition of a dense matrix • Barnes-Hut – O(N log N) N-body solver • Ocean – Ocean basin simulation
Page Migration – Page Size • Migration introduces additional overhead • Node hit rate increases as page size decreases • Smaller pages reduce false sharing: fewer pages are accessed by multiple processors • Unlikely to help when each processor's data chunks are much smaller than a page (example: LU) • NUMA-M (CC-NUMA with page migration) performs better for Cholesky
Initial Placement • Implemented as page migration with a maximum of one migration per page • LU does significantly better • Ocean performs the same with single or multiple migrations • Requires more work from the compiler and programmer
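A rough sketch of both policies, assuming a simple migrate-on-remote-miss heuristic (the cap parameter and names are illustrative, not the paper's mechanism): page migration moves a page to the faulting node, and initial placement is the same code with the migration count capped at one.

```c
/* Sketch of page migration vs. initial placement under a
 * migrate-on-remote-miss policy.  Names are illustrative. */
#include <stdio.h>

typedef struct {
    int home_node;        /* node whose memory currently holds the page */
    int times_migrated;   /* migrations performed so far */
} page_info;

/* Page migration: move the page to the faulting node on a remote miss.
 * With max_migrations = 1 the same code implements initial placement:
 * the page moves once, to the first node that touches it, then stays. */
void on_remote_miss(page_info *p, int faulting_node, int max_migrations)
{
    if (p->times_migrated >= max_migrations)
        return;                       /* page is pinned from now on */
    p->home_node = faulting_node;     /* migrate the page */
    p->times_migrated++;
}

int main(void)
{
    page_info p = { 0, 0 };
    on_remote_miss(&p, 3, 1);         /* initial placement: moves to node 3 */
    on_remote_miss(&p, 7, 1);         /* ignored: already placed once */
    printf("page lives on node %d\n", p.home_node);
    return 0;
}
```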
Cache Size / Network Variations • Cache size variations: increasing the cache size causes coherence misses to dominate • With a 64 KB cache, CC-NUMA (without migration) is better for everything except Ocean • Network latency variations: even with an aggressive implementation of the directory structure, COMA cannot compensate in applications with a significant coherence miss rate
COMA-F • Directory information for each block has a fixed home node (as in CC-NUMA) • Data blocks can still replicate and migrate (as in COMA-H) • Attempts to reduce the coherence miss penalty
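A rough sketch of the COMA-F (flat COMA) idea, with illustrative names and a toy flat directory: the directory entry for a block lives at a fixed home node computed from the address, so a miss goes straight to the home and is forwarded to the current holder, with no hierarchy walk as in COMA-H.

```c
/* Sketch of COMA-F: fixed home node for directory state, mobile data.
 * The flat array below stands in for per-node directory storage;
 * in a real machine the entry for block b would reside at home_of(b). */
#include <stdio.h>

#define NODES  16
#define BLOCKS 64

typedef struct {
    int current_holder;   /* node whose attraction memory has the block now */
} flat_dir_entry;

/* The home node is a fixed function of the address. */
int home_of(long block) { return (int)(block % NODES); }

/* One lookup at the home directory yields the holder directly. */
int locate_block(const flat_dir_entry dir[], long block)
{
    return dir[block].current_holder;
}

int main(void)
{
    flat_dir_entry dir[BLOCKS] = {{0}};
    dir[42].current_holder = 7;        /* block 42 has migrated to node 7 */
    printf("block 42: directory at home node %d, data at node %d\n",
           home_of(42), locate_block(dir, 42));
    return 0;
}
```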
Conclusion • CC-NUMA and COMA each perform well under different application characteristics: COMA when capacity misses dominate, CC-NUMA when coherence misses dominate