Explore the features, cache misses, and performance expectations of Cache-Coherent Non-Uniform-Memory-Access (CC-NUMA) and Cache-Only Memory Architectures (COMA). Simulation results show how these architectures perform in various applications. Discover how cache size, network latency, and data directory structure variations impact their efficiency.
Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures • Per Stenstrom, Truman Joe, and Anoop Gupta • Presented by Colleen Lewis
Overview • Common Features • CC-NUMA • COMA • Cache Misses • Performance Expectations • Simulation & Results • COMA-F
Common Features • Large-scale multiprocessors • Single address space • Distributed main memory • Directory-based cache coherence • Scalable interconnection network • Examples: Stanford DASH (CC-NUMA); DDM and KSR-1 (COMA)
CC-NUMA Cache-Coherent Non-Uniform-Memory-Access Machines • Independent of network topology • Write-invalidate cache coherence protocol • 2-hop miss (clean copy in the home node's memory) • 3-hop miss (dirty copy in a third node's cache)
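To make the 2-hop/3-hop distinction concrete, here is a minimal sketch of how a read miss under a directory-based, write-invalidate protocol maps to network hops. The struct and function names are illustrative, not taken from the paper.

```c
/* Sketch: hops on the interconnect to satisfy a CC-NUMA read miss.
 * All names here are illustrative, not the paper's. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool dirty;   /* true if a remote cache holds the only up-to-date copy */
    int  owner;   /* node holding that copy; meaningful only when dirty */
} dir_entry;

/* 2 hops: requester -> home (clean copy in home memory) -> requester
 * 3 hops: requester -> home -> dirty owner -> requester */
int read_miss_hops(int requester, int home, const dir_entry *e)
{
    if (requester == home)
        return e->dirty ? 2 : 0;   /* local memory, or fetch from the owner */
    return e->dirty ? 3 : 2;
}

int main(void)
{
    dir_entry clean = { false, -1 }, modified = { true, 5 };
    printf("clean remote miss: %d hops\n", read_miss_hops(0, 3, &clean));
    printf("dirty remote miss: %d hops\n", read_miss_hops(0, 3, &modified));
    return 0;
}
```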
COMA Cache-Only Memory Architectures • Attraction memory – per-node memory acts as a secondary/tertiary cache • Data is distributed and mobile • Directory is dynamically distributed in a hierarchy • Combining – multiple reads to the same block can be merged in the hierarchy (LU – 47%, Barnes-Hut – 6%, remaining applications < 1%) • Reduces average miss latency • Increased overhead for the directory structure
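The attraction-memory idea can be sketched as a per-node cache of memory blocks that data migrates into on use. This toy model is an assumption-laden illustration (the direct-mapped organization, sizes, and names are mine, not the paper's); it just shows how a second access to a migrated block becomes a local hit.

```c
/* Toy model of a COMA attraction memory: each node's memory behaves like a
 * large cache of blocks, so data migrates to the node that uses it.
 * Organization and sizes are illustrative only. */
#include <stdio.h>

#define AM_SETS 1024                 /* sets per node's attraction memory */

typedef struct {
    long tag[AM_SETS];               /* which block currently occupies each set */
    int  valid[AM_SETS];
} attraction_mem;

/* Returns 1 on a local hit; on a miss the block is fetched (via the
 * directory hierarchy in a real machine) and replicated locally,
 * possibly displacing the block that occupied the set before. */
int am_access(attraction_mem *am, long block)
{
    int set = (int)(block % AM_SETS);
    if (am->valid[set] && am->tag[set] == block)
        return 1;                    /* hit: served from local memory */
    am->valid[set] = 1;
    am->tag[set]   = block;          /* miss: migrate the block in */
    return 0;
}

int main(void)
{
    attraction_mem node0 = {{0}};
    printf("first access:  %s\n", am_access(&node0, 42) ? "hit" : "miss");
    printf("second access: %s\n", am_access(&node0, 42) ? "hit" : "miss");
    return 0;
}
```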
Cache Misses – CC-NUMA vs. COMA • Which architecture has lower miss latency?
Performance Expectations – CC-NUMA vs. COMA • COMA is expected to do better when capacity (and cold) misses dominate, since data migrates into the local attraction memory • CC-NUMA is expected to do better when coherence misses dominate, since COMA's hierarchical directory adds latency to each such miss
Simulation • 16 processors • Cache line size: 16 bytes • Cache size: 4 Kbytes per processor (deliberately small, to force capacity misses)
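As a quick back-of-the-envelope check (using only the parameters on this slide, nothing further from the paper), each simulated cache holds very few lines, which is why capacity misses dominate.

```c
/* Derive how many 16-byte lines fit in a 4 KB per-processor cache. */
#include <stdio.h>

int main(void)
{
    const int processors      = 16;
    const int line_bytes      = 16;
    const int cache_bytes     = 4 * 1024;
    const int lines_per_cache = cache_bytes / line_bytes;   /* 256 lines */

    printf("%d processors, %d lines of %d bytes per cache\n",
           processors, lines_per_cache, line_bytes);
    return 0;
}
```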
Results – CC-NUMA vs. COMA • MP3D – Particle-based wind-tunnel simulation • PTHOR – Distributed-time logic simulation • LocusRoute – VLSI standard-cell router • Water – Molecular dynamics code • Cholesky – Cholesky factorization of a sparse matrix • LU – LU decomposition of a dense matrix • Barnes-Hut – O(N log N) N-body solver • Ocean – Ocean basin simulation
Page Migration – Page Size • Migration introduces additional overhead • Node hit rate increases as page size decreases • Smaller pages reduce false sharing: fewer pages are accessed by multiple processors • Unlikely to help when each processor's data chunks are much smaller than a page (example: LU) • NUMA-M (CC-NUMA with page migration) performs better for Cholesky
Initial Placement • Implemented as page migration with a maximum of one migration per page • LU does significantly better • Ocean performs the same with single or multiple migrations • Requires more work from the compiler and programmer
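A rough sketch of both policies, assuming a simple migrate-on-remote-miss heuristic (the cap parameter and names are illustrative, not the paper's mechanism): page migration moves a page to the faulting node, and initial placement is the same code with the migration count capped at one.

```c
/* Sketch of page migration vs. initial placement under a
 * migrate-on-remote-miss policy.  Names are illustrative. */
#include <stdio.h>

typedef struct {
    int home_node;        /* node whose memory currently holds the page */
    int times_migrated;   /* migrations performed so far */
} page_info;

/* Page migration: move the page to the faulting node on a remote miss.
 * With max_migrations = 1 the same code implements initial placement:
 * the page moves once, to the first node that touches it, then stays. */
void on_remote_miss(page_info *p, int faulting_node, int max_migrations)
{
    if (p->times_migrated >= max_migrations)
        return;                       /* page is pinned from now on */
    p->home_node = faulting_node;     /* migrate the page */
    p->times_migrated++;
}

int main(void)
{
    page_info p = { 0, 0 };
    on_remote_miss(&p, 3, 1);         /* initial placement: moves to node 3 */
    on_remote_miss(&p, 7, 1);         /* ignored: already placed once */
    printf("page lives on node %d\n", p.home_node);
    return 0;
}
```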
Cache Size / Network Variations • Cache size variations: increasing the cache size causes coherence misses to dominate • With a 64 KB cache, CC-NUMA (without migration) is better for everything except Ocean • Network latency variations: even with an aggressive implementation of the directory structure, COMA cannot compensate in applications with a significant coherence miss rate
COMA-F • Directory information for each block has a fixed home node (as in CC-NUMA) • Data blocks can still replicate and migrate (as in COMA-H) • Attempts to reduce the coherence miss penalty
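A rough sketch of the COMA-F (flat COMA) idea, with illustrative names and a toy flat directory: the directory entry for a block lives at a fixed home node computed from the address, so a miss goes straight to the home and is forwarded to the current holder, with no hierarchy walk as in COMA-H.

```c
/* Sketch of COMA-F: fixed home node for directory state, mobile data.
 * The flat array below stands in for per-node directory storage;
 * in a real machine the entry for block b would reside at home_of(b). */
#include <stdio.h>

#define NODES  16
#define BLOCKS 64

typedef struct {
    int current_holder;   /* node whose attraction memory has the block now */
} flat_dir_entry;

/* The home node is a fixed function of the address. */
int home_of(long block) { return (int)(block % NODES); }

/* One lookup at the home directory yields the holder directly. */
int locate_block(const flat_dir_entry dir[], long block)
{
    return dir[block].current_holder;
}

int main(void)
{
    flat_dir_entry dir[BLOCKS] = {{0}};
    dir[42].current_holder = 7;        /* block 42 has migrated to node 7 */
    printf("block 42: directory at home node %d, data at node %d\n",
           home_of(42), locate_block(dir, 42));
    return 0;
}
```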
Conclusion • CC-NUMA and COMA each perform well under different application characteristics: COMA when capacity misses dominate, CC-NUMA when coherence misses dominate