
Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures

Explore the features, cache misses, and performance expectations of Cache-Coherent Non-Uniform-Memory-Access (CC-NUMA) and Cache-Only Memory Architectures (COMA). Simulation results show how these architectures perform in various applications. Discover how cache size, network latency, and data directory structure variations impact their efficiency.


Presentation Transcript


  1. Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures • Per Stenström, Truman Joe and Anoop Gupta • Presented by Colleen Lewis

  2. Overview • Common Features • CC-NUMA • COMA • Cache Misses • Performance Expectations • Simulation & Results • COMA-F

  3. Common Features • Large-scale multiprocessors • Single address space • Distributed main memory • Directory-based cache coherence • Scalable interconnection network • Examples: DASH (CC-NUMA), DDM (COMA)

  4. CC-NUMA Cache-Coherent Non-Uniform-Memory-Access Machines • Network independent • Write-invalidate cache coherence protocol • 2-hop miss (home node supplies a clean copy) • 3-hop miss (a dirty copy must be fetched from a third node)
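The cost difference between 2-hop and 3-hop misses can be made concrete with a small latency model. This is an illustrative sketch, not the paper's simulator: the cycle counts (`NETWORK_HOP`, `MEM_ACCESS`) and the function name are assumptions.

```python
# Hypothetical latency model for CC-NUMA remote read misses.
# The cycle values below are assumed for illustration only.
NETWORK_HOP = 100   # cycles per network traversal (assumption)
MEM_ACCESS = 30     # cycles for a memory/directory lookup (assumption)

def miss_latency(dirty_in_third_node: bool) -> int:
    """Return the cycle cost of a remote read miss.

    2-hop miss: requester -> home node (clean copy) -> requester.
    3-hop miss: requester -> home -> owner with dirty copy -> requester.
    """
    hops = 3 if dirty_in_third_node else 2
    return hops * NETWORK_HOP + MEM_ACCESS

print(miss_latency(False))  # 2-hop miss: 230 cycles under these assumptions
print(miss_latency(True))   # 3-hop miss: 330 cycles under these assumptions
```

Under any parameter choice, the 3-hop case pays exactly one extra network traversal, which is why the distinction matters for coherence-heavy workloads.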

  5. COMA Cache-Only Memory Architectures • Attraction memory – per-node memory acts as secondary/tertiary cache • Data is distributed and mobile • Directory is dynamically distributed in a hierarchy • Combining – can optimize multiple reads (LU – 47%, Barnes-Hut – 6%, remaining < 1%) • Reduces the average cache-miss latency • Increased overhead for the directory structure
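The attraction-memory idea above can be sketched in a few lines: on a miss, the block is located remotely and replicated into the requester's local memory, so later accesses hit locally. The flat search below is an assumption for brevity; a real COMA locates blocks through its hierarchical directory.

```python
# Minimal sketch of a COMA "attraction memory": each node's main memory
# acts as a large cache, so data migrates toward the nodes that use it.
# Class names and the flat lookup are illustrative assumptions.
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.attraction_memory = {}  # block address -> data

def access(nodes, requester, addr):
    """Return (data, was_local); on a miss, replicate the block here."""
    local = nodes[requester].attraction_memory
    if addr in local:
        return local[addr], True
    for node in nodes:                    # locate the block remotely
        if addr in node.attraction_memory:
            local[addr] = node.attraction_memory[addr]  # attract locally
            return local[addr], False
    raise KeyError(addr)

nodes = [Node(i) for i in range(4)]
nodes[0].attraction_memory[0x10] = "block"
print(access(nodes, 3, 0x10))  # first access: remote, block is attracted
print(access(nodes, 3, 0x10))  # second access: satisfied locally
```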

  6. CC-NUMA COMA Cache Misses Which architecture has lower latency?

  7. Figure 1

  8. CC-NUMA COMA Performance Expectations

  9. Simulation • 16 processors • Cache lines = 16 bytes • Cache size of 4 Kbytes • (Small – to force capacity misses)
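A toy direct-mapped cache with these parameters (4 KB cache, 16-byte lines) shows why the deliberately small size forces capacity misses: a working set just twice the cache size misses on every access. The access trace and direct-mapped organization are assumptions for illustration, not the paper's simulator.

```python
# Toy direct-mapped cache using the simulation's parameters:
# 4 KB cache, 16-byte lines -> 256 lines.
LINE_SIZE = 16
CACHE_SIZE = 4 * 1024
NUM_LINES = CACHE_SIZE // LINE_SIZE  # 256

def simulate(addresses):
    """Return (hits, misses) for a direct-mapped cache over a trace."""
    tags = [None] * NUM_LINES
    hits = misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        idx, tag = line % NUM_LINES, line // NUM_LINES
        if tags[idx] == tag:
            hits += 1
        else:
            misses += 1
            tags[idx] = tag
    return hits, misses

# Sweep an 8 KB array twice: every access misses, because the working
# set (512 lines) is twice the cache capacity (256 lines).
trace = list(range(0, 8 * 1024, LINE_SIZE)) * 2
print(simulate(trace))  # (0, 1024)
```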

  10. Results

  11. CC-NUMA COMA Results • MP3D – Particle-based wind tunnel simulation • PTHOR – Distributed-time logic simulation • LocusRoute – VLSI standard cell router • Water – Molecular dynamics simulation of water molecules • Cholesky – Cholesky factorization of a sparse matrix • LU – LU decomposition of a dense matrix • Barnes-Hut – O(N log N) N-body problem solver • Ocean – Ocean basin simulation

  12. Page Migration – Page Size • Introduces additional overhead • Node hit rate increases as page size decreases • Reduces false sharing • Fewer pages accessed by multiple processors • Likely won’t work if data chunks are much smaller than pages (example - LU) • NUMA-M performs better for Cholesky

  13. Initial Placement • Implemented as page migration, capped at one migration per page • LU does significantly better • Ocean performs the same with single and multiple migrations • Requires increased work from the compiler and programmer
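Initial placement as capped migration can be sketched as follows: the first remote access pulls the page to the accessing node, and after that the page stays put. The `PageTable` class and its fields are illustrative assumptions, not the paper's mechanism.

```python
# Sketch of "initial placement" implemented as page migration with a
# cap of one migration per page. Data structures are assumptions.
class PageTable:
    def __init__(self):
        self.home = {}        # page -> node currently holding it
        self.migrations = {}  # page -> number of times migrated

    def place(self, page, node):
        self.home[page] = node
        self.migrations[page] = 0

    def access(self, page, node, max_migrations=1):
        """Migrate the page on a remote access until the cap is reached."""
        if self.home[page] != node and self.migrations[page] < max_migrations:
            self.home[page] = node          # migrate on first remote touch
            self.migrations[page] += 1
        return self.home[page]

pt = PageTable()
pt.place("p0", 0)
print(pt.access("p0", 2))  # first remote access: page migrates to node 2
print(pt.access("p0", 3))  # cap reached: page stays at node 2
```

With `max_migrations=1` the page ends up wherever it is first touched, which is exactly the "initial placement" policy the slide describes.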

  14. Cache Size/Network Variations • Cache Size Variations • Increasing the cache size causes coherence misses to dominate • With 64KB cache, CC-NUMA (without migration) is better for everything except Ocean. • Network Latency Variations • Even with aggressive implementations of directory structure, COMA can’t compensate in applications with significant coherence miss rate

  15. COMA-F • Data directory information has a home node (CC-NUMA) • Supports replication and migration of data blocks (COMA-H) • Attempts to reduce the coherence miss penalty

  16. CC-NUMA COMA Conclusion • CC-NUMA and COMA perform well for different application characteristics
