Explore the integration of Cache-Coherent NUMA (CC-NUMA) and Simple Cache-Only Memory Architecture (S-COMA) into a hybrid system that dynamically picks the better protocol for each page. Study the advantages, disadvantages, and qualitative performance of the R-NUMA approach.
Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA
Babak Falsafi and David A. Wood
Computer Science Department, University of Wisconsin, Madison
Presented by Anita Lungu, February 17, 2006
Context and Motivation
• Large-scale distributed shared-memory (DSM) parallel machines
  • Directory coherence between SMP nodes
  • Local access is fast; remote access is slow
• Problem: hide remote memory access latency
• Solutions:
  • Cache-Coherent NUMA (CC-NUMA): best when coherence misses dominate
  • Simple Cache-Only Memory Architecture (S-COMA): best when capacity misses dominate
• Opportunity: a hybrid, R-NUMA = CC-NUMA + S-COMA
  • Support both and dynamically select a protocol for each page
  • Better performance than either alone => best of both worlds
CC-NUMA
• Remote cluster cache (see the lookup sketch below)
  • Holds only remote data, at block granularity
  • Small and fast (SRAM), or larger and slower (DRAM)
• Data elements are allocated at their home node
• Advantage when:
  • The remote working set fits in the small block cache
  • Misses are mostly coherence misses
• Disadvantage when:
  • Many data accesses are remote
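To make the remote-access path concrete, here is a minimal sketch in C of a direct-mapped remote block cache. It is an illustration only, not the paper's hardware or code; the block size, cache size, and function name are assumptions, and the function is meant to be driven from a simple test harness or simulator.

    /* Illustrative CC-NUMA remote-access path (a sketch, not the paper's design).
       Remote data lives only in a small, block-granularity cluster cache;
       any miss there costs a full round trip to the home node. */
    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_SHIFT   6        /* assumed 64-byte coherence blocks   */
    #define CACHE_BLOCKS  1024     /* assumed remote-cache size (blocks) */

    static uint64_t tags[CACHE_BLOCKS];
    static bool     valid[CACHE_BLOCKS];

    /* Returns true on a remote-cache hit; on a miss, models the fill that
       would follow a block fetch from the home node. */
    bool remote_block_access(uint64_t paddr)
    {
        uint64_t blk = paddr >> BLOCK_SHIFT;
        unsigned idx = blk % CACHE_BLOCKS;   /* direct-mapped for brevity */

        if (valid[idx] && tags[idx] == blk)
            return true;                     /* hit: no remote latency    */

        valid[idx] = true;                   /* miss: fetch from home and */
        tags[idx]  = blk;                    /* replace the victim block  */
        return false;
    }

The key point the sketch captures is that when the remote working set exceeds this small cache, every capacity or conflict miss turns into a remote access, which is exactly the case where CC-NUMA suffers.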
S-COMA
• Distributed main memory acts as a second-level cache for remote data
• Data elements: no home-node allocation
• Allocation and mapping
  • Page granularity (software), using standard virtual-address-translation hardware
• Coherence
  • Block granularity (hardware)
• Extra hardware (see the access-check sketch below):
  • Access-control tags: 2 bits per block, used to inhibit memory on a disallowed access
  • Auxiliary SRAM translation table: converts local physical pages <-> global physical pages (home)
• Advantage when:
  • Misses are mostly capacity/cold misses
  • Remote data is reused often
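The interplay of the 2-bit access tags and the translation table can be pictured with the following C sketch. It is a simplified illustration under assumed page and block sizes, not the paper's hardware; the function and array names are placeholders.

    /* Illustrative S-COMA access check (a sketch, not the paper's hardware).
       Each block of a locally cached remote page carries a 2-bit access tag;
       a disallowed access inhibits local memory and the block is fetched from
       the home copy, located through the local->global page translation. */
    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT      12
    #define BLOCK_SHIFT     6
    #define BLOCKS_PER_PAGE (1u << (PAGE_SHIFT - BLOCK_SHIFT))
    #define LOCAL_PAGES     4096         /* assumed local-memory size (pages) */

    enum tag { TAG_INVALID, TAG_READ_ONLY, TAG_READ_WRITE };   /* 2 bits */

    static uint8_t  access_tag[LOCAL_PAGES][BLOCKS_PER_PAGE];
    static uint64_t global_page[LOCAL_PAGES];    /* SRAM translation table */

    /* Returns true if local memory may satisfy the access; otherwise the
       block must be fetched (at block granularity) from the home node. */
    bool scoma_access(uint64_t local_paddr, bool is_write, uint64_t *home_page)
    {
        uint64_t page = (local_paddr >> PAGE_SHIFT) % LOCAL_PAGES;
        uint64_t blk  = (local_paddr >> BLOCK_SHIFT) & (BLOCKS_PER_PAGE - 1);
        uint8_t  tag  = access_tag[page][blk];

        if (tag == TAG_READ_WRITE || (tag == TAG_READ_ONLY && !is_write))
            return true;                          /* local memory serves it */

        *home_page = global_page[page];           /* look up home for fetch */
        return false;                             /* inhibit local memory   */
    }

Note the division of labor the slide describes: software allocates and maps whole pages through the normal virtual-memory machinery, while this per-block check keeps coherence at block granularity in hardware.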
R-NUMA
• Classify remote pages:
  • Reuse pages: accessed many times by a node
  • Communication pages: used to communicate data between nodes
• Default all pages to CC-NUMA
• Dynamically change a page to S-COMA (a counter sketch follows below)
  • Threshold: number of remote capacity/conflict misses (refetches) per page in the block cache
  • The decision is made per node
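A minimal sketch of the per-node, per-page refetch counter is shown below in C. The threshold value, table size, and function name are assumptions chosen for illustration; the paper's mechanism is approximated, not reproduced.

    /* Illustrative R-NUMA page classification (a sketch under assumptions).
       Every remote page starts in CC-NUMA mode; once this node has refetched
       blocks of the page too many times (capacity/conflict misses in the
       block cache), the page is relocated to S-COMA mode on this node. */
    #include <stdbool.h>
    #include <stdint.h>

    #define REFETCH_THRESHOLD 64      /* assumed per-page relocation trigger */
    #define MAX_REMOTE_PAGES  65536   /* assumed table size                  */

    enum page_mode { MODE_CCNUMA, MODE_SCOMA };

    static uint32_t refetch_count[MAX_REMOTE_PAGES];
    static uint8_t  page_mode[MAX_REMOTE_PAGES];    /* all-zero: CC-NUMA */

    /* Called on each capacity/conflict miss for a remote page in the block
       cache. Returns true when the page should be relocated to S-COMA,
       i.e., remapped into local memory by software on this node. */
    bool rnuma_count_refetch(uint32_t remote_page)
    {
        if (page_mode[remote_page] == MODE_SCOMA)
            return false;                           /* already relocated */

        if (++refetch_count[remote_page] > REFETCH_THRESHOLD) {
            page_mode[remote_page] = MODE_SCOMA;    /* per-node decision  */
            return true;                            /* trigger relocation */
        }
        return false;
    }

Because each node keeps its own counters, the same page can be a reuse page (S-COMA) on one node and a communication page (CC-NUMA) on another.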
Qualitative Performance
• Worst-case scenario
  • A page is relocated from the block cache (CC-NUMA) to memory (S-COMA) and never referenced again
• Worst-case performance (a back-of-the-envelope version follows below)
  • Depends on the cost of relocation (changing a page from CC-NUMA to S-COMA) relative to the cost of page allocation
  • R-NUMA can be 3x worse than either CC-NUMA or S-COMA
• But...
  • The threshold for optimal worst-case performance differs from the threshold for optimal average performance
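One simplified way to see where a factor of about 3 can come from, using the CC-NUMA comparison: in the worst case the page accrues the full refetch threshold and is then relocated for nothing, so R-NUMA pays the relocation on top of what CC-NUMA would have paid. The symbols below (T, c_refetch, c_reloc) are illustrative, not the paper's exact cost model.

    \text{worst-case slowdown vs. CC-NUMA}
      \;\approx\; \frac{T\,c_{\mathrm{refetch}} + c_{\mathrm{reloc}}}{T\,c_{\mathrm{refetch}}}
      \;=\; 1 + \frac{c_{\mathrm{reloc}}}{T\,c_{\mathrm{refetch}}}

Here T is the refetch threshold, c_refetch the cost of one remote block refetch, and c_reloc the cost of relocating the page. In this simplified model the slide's 3x figure corresponds to relocation costing roughly twice the threshold's worth of refetches; the analogous comparison against S-COMA weighs relocation against the cost of the initial page allocation.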
Base System Results
• Best case: R-NUMA reduces execution time by 37%
• Worst case: R-NUMA increases execution time by 57%
• CC-NUMA can be 179% worse than S-COMA
• S-COMA can be 315% worse than CC-NUMA
Sensitivity Results
1. S-COMA and R-NUMA sensitivity to page-fault and TLB-invalidation overhead
2. R-NUMA sensitivity to the relocation threshold value
3. CC-NUMA and R-NUMA sensitivity to cache size