Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA • Babak Falsafi and David A. Wood, University of Wisconsin
Some Terminology • NUMA: Non-Uniform Memory Access • CC-NUMA: Cache-Coherent NUMA • COMA: Cache-Only Memory Architecture • S-COMA: Simple COMA
SMP Clusters • Approach for large-scale shared-memory parallel machines • Directory-based cache coherence between clusters • RAD (Remote Access Device) handles remote memory accesses
CC-NUMA • First access to a remote page causes a page fault • OS maps the virtual address to a global physical address • RAD snoops the memory bus • Hits in the block cache are serviced locally • Misses trigger a remote request
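The miss path above can be sketched in a few lines of Python. Everything here is illustrative, not from the paper: the class name `CCNumaRAD`, the `fetch_remote` callback standing in for a directory-based remote request, the 64-byte block size, and the FIFO eviction standing in for whatever replacement policy real hardware would use.

```python
class CCNumaRAD:
    """Illustrative sketch of a RAD snooping global addresses in CC-NUMA."""

    BLOCK_SIZE = 64  # assumed coherence-block size, bytes

    def __init__(self, block_cache_size):
        self.block_cache = {}            # small cache of remote blocks
        self.capacity = block_cache_size

    def access(self, global_addr, fetch_remote):
        block = global_addr // self.BLOCK_SIZE
        if block in self.block_cache:    # hit: serviced from the block cache
            return self.block_cache[block]
        data = fetch_remote(block)       # miss: remote request to home node
        if len(self.block_cache) >= self.capacity:
            # evict the oldest entry (dicts preserve insertion order)
            self.block_cache.pop(next(iter(self.block_cache)))
        self.block_cache[block] = data
        return data
```

Because the block cache is small, a working set larger than `block_cache_size` blocks keeps evicting and refetching — exactly the behavior R-NUMA later exploits to detect reuse pages.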
CC-NUMA • References global addresses directly • Remote cluster cache holds only remote data and forms another level in the cache hierarchy • Block cache is small, so performance is sensitive to data allocation and placement • Works well for scientific workloads
S-COMA • First access causes a page fault • OS initializes the page table, the RAD translation table, and per-block access-control tags • Hits are serviced by local memory • Misses are detected by the RAD, which inhibits memory and requests the data remotely
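A minimal sketch of the per-block access-control tags described above. The names (`SComaPage`, `fetch_remote`), the tag encoding, and the 64-blocks-per-page figure are assumptions for illustration, not details from the paper.

```python
# Hypothetical per-block access-control states
INVALID, READ_ONLY, READ_WRITE = 0, 1, 2
BLOCKS_PER_PAGE = 64  # assumed page/block geometry

class SComaPage:
    """Illustrative S-COMA page: local memory plus per-block state tags."""

    def __init__(self):
        # On the page fault, the OS maps the page locally and initializes
        # every block's access tag to INVALID.
        self.tags = [INVALID] * BLOCKS_PER_PAGE
        self.data = [None] * BLOCKS_PER_PAGE

    def read(self, block, fetch_remote):
        if self.tags[block] == INVALID:
            # RAD detects the miss, inhibits memory, requests the block
            self.data[block] = fetch_remote(block)
            self.tags[block] = READ_ONLY
        return self.data[block]  # hit: serviced entirely by local memory
```

After the first miss per block, subsequent reads never leave the node — the page behaves like a (very large) cache line allocated at page granularity.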
S-COMA • Remote data cached in local memory or processor caches • Allocated and mapped at page granularity • OS handles allocation and migration • Acts as a large, fully associative cache • Large page size requires coarse-grained spatial locality • Possible thrashing
R-NUMA • Combines S-COMA and CC-NUMA • CC-NUMA pages map to global physical addresses • S-COMA pages map to local physical addresses • Often requires no additional hardware • Distinguishes two types of pages • Reuse pages: data used frequently by the same node • Communication pages: data exchanged between nodes
Switching Mechanism • Reuse pages: dominated by capacity and conflict misses, best served by S-COMA • Communication pages: dominated by coherence misses, best served by CC-NUMA • Detect refetches of previously evicted blocks • Trivial for read-only blocks in a non-notifying protocol (they remain in shared state) • Additional hardware required for read-write blocks • Count refetches on a per-node, per-page basis
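The per-node, per-page counting scheme can be sketched as follows. The class and method names, and the particular threshold value, are illustrative assumptions — the real mechanism lives in hardware and OS page-remapping code, not in Python.

```python
from collections import defaultdict

RELOCATION_THRESHOLD = 64  # assumed: refetches before a page turns S-COMA

class RNumaPolicy:
    """Illustrative R-NUMA switching policy: start every remote page as
    CC-NUMA, relocate it to S-COMA once refetches reveal it is a reuse page."""

    def __init__(self):
        self.refetch_count = defaultdict(int)  # (node, page) -> refetches
        self.scoma_pages = set()               # pages remapped to local PA

    def on_refetch(self, node, page):
        """Called when a previously evicted block of `page` is fetched again
        by `node` (i.e., a capacity or conflict miss, not a coherence miss)."""
        if page in self.scoma_pages:
            return  # already relocated; served from local memory
        self.refetch_count[(node, page)] += 1
        if self.refetch_count[(node, page)] > RELOCATION_THRESHOLD:
            # OS remaps the page from a global to a local physical address
            self.scoma_pages.add(page)
```

Communication pages rarely trigger `on_refetch` (their misses are coherence misses), so they stay CC-NUMA; reuse pages cross the threshold and migrate to S-COMA.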
Qualitative Performance • Analysis of worst-case behavior • Performance depends on the respective S-COMA and CC-NUMA overheads • Realistically, R-NUMA is no more than 3 times worse than vanilla CC-NUMA or S-COMA • In practice the "bound" is much smaller
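An illustrative back-of-envelope for where a small constant bound comes from. The cost model below is an assumption for intuition only, not the paper's analysis: suppose each refetch costs `r`, the relocation threshold is `t` refetches, and relocating a page costs about as much as `t` refetches. Then a page that is relocated and never touched again pays roughly three times what it would have under plain CC-NUMA.

```python
r = 1.0   # assumed cost of one remote refetch
t = 64    # assumed relocation threshold, in refetches

# Worst case for a misjudged page (all terms are the illustrative model above):
#   t*r  refetches paid as CC-NUMA before crossing the threshold
# + t*r  one-time relocation overhead (assumed comparable to t refetches)
# + t*r  cold misses repopulating the page under S-COMA
ccnuma_cost = t * r
rnuma_worst = t * r + t * r + t * r
ratio = rnuma_worst / ccnuma_cost  # -> 3.0 under these assumptions
```

The constant stays small because the threshold caps how long R-NUMA can keep paying CC-NUMA refetch costs before switching; in practice most pages are classified correctly and the observed gap is far below the bound.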
Conclusions • Dynamically reacts to program behavior • Exploits the best caching strategy on a per-page basis • Worst-case performance is bounded • Quantitative results indicate • R-NUMA is usually no worse than the better of CC-NUMA and S-COMA • When worse, it is still far better than the worst-case bound • Never worse than both • Less sensitive to relocation threshold or overhead than S-COMA • Less sensitive to cache size than CC-NUMA
Questions • Sounds like a free lunch • Does R-NUMA really require no additional hardware? • Dynamic switching always looks good in research papers • How does it fare in practice?