Explore the integration of Cache-Coherent NUMA (CC-NUMA) and Simple Cache-Only Memory Architecture (S-COMA) into a hybrid system that dynamically picks the better protocol for each page. Study the advantages, disadvantages, and qualitative performance of the R-NUMA approach.
Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA
Babak Falsafi and David A. Wood
Computer Science Department, University of Wisconsin, Madison
Presented by Anita Lungu, February 17, 2006
Context and Motivation
• Large-scale distributed shared-memory (DSM) parallel machines
  • Directory coherence between SMP nodes
  • Local access is fast; remote access is slow
• Problem: hide remote memory access latency
• Solutions:
  • Cache-Coherent NUMA (CC-NUMA): best when coherence misses dominate
  • Simple Cache-Only Memory Architecture (S-COMA): best when capacity misses dominate
• Opportunity: a hybrid, R-NUMA = CC-NUMA + S-COMA
  • Support both and dynamically select a protocol for each page
  • Better performance than either alone => best of both worlds
CC-NUMA
• Remote cluster cache (see the lookup sketch below)
  • Holds only remote data, at block granularity
  • Small and fast (SRAM), or larger and slower (DRAM)
• Data elements are allocated at their home node
• Advantage when:
  • The remote working set fits in the small block cache
  • Misses are mostly coherence misses
• Disadvantage when:
  • Many data accesses are remote
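To make the remote-access path concrete, here is a minimal sketch in C of a direct-mapped remote block cache. It is an illustration only, not the paper's hardware or code; the block size, cache size, and function name are assumptions, and the function is meant to be driven from a simple test harness or simulator.

    /* Illustrative CC-NUMA remote-access path (a sketch, not the paper's design).
       Remote data lives only in a small, block-granularity cluster cache;
       any miss there costs a full round trip to the home node. */
    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_SHIFT   6        /* assumed 64-byte coherence blocks   */
    #define CACHE_BLOCKS  1024     /* assumed remote-cache size (blocks) */

    static uint64_t tags[CACHE_BLOCKS];
    static bool     valid[CACHE_BLOCKS];

    /* Returns true on a remote-cache hit; on a miss, models the fill that
       would follow a block fetch from the home node. */
    bool remote_block_access(uint64_t paddr)
    {
        uint64_t blk = paddr >> BLOCK_SHIFT;
        unsigned idx = blk % CACHE_BLOCKS;   /* direct-mapped for brevity */

        if (valid[idx] && tags[idx] == blk)
            return true;                     /* hit: no remote latency    */

        valid[idx] = true;                   /* miss: fetch from home and */
        tags[idx]  = blk;                    /* replace the victim block  */
        return false;
    }

The key point the sketch captures is that when the remote working set exceeds this small cache, every capacity or conflict miss turns into a remote access, which is exactly the case where CC-NUMA suffers.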
S-COMA
• Distributed main memory acts as a second-level cache for remote data
• Data elements: no home-node allocation
• Allocation and mapping
  • Page granularity (software), using standard virtual-address-translation hardware
• Coherence
  • Block granularity (hardware)
• Extra hardware (see the access-check sketch below):
  • Access-control tags: 2 bits per block, used to inhibit memory on a disallowed access
  • Auxiliary SRAM translation table: converts local physical pages <-> global physical pages (home)
• Advantage when:
  • Misses are mostly capacity/cold misses
  • Remote data is reused often
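The interplay of the 2-bit access tags and the translation table can be pictured with the following C sketch. It is a simplified illustration under assumed page and block sizes, not the paper's hardware; the function and array names are placeholders.

    /* Illustrative S-COMA access check (a sketch, not the paper's hardware).
       Each block of a locally cached remote page carries a 2-bit access tag;
       a disallowed access inhibits local memory and the block is fetched from
       the home copy, located through the local->global page translation. */
    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT      12
    #define BLOCK_SHIFT     6
    #define BLOCKS_PER_PAGE (1u << (PAGE_SHIFT - BLOCK_SHIFT))
    #define LOCAL_PAGES     4096         /* assumed local-memory size (pages) */

    enum tag { TAG_INVALID, TAG_READ_ONLY, TAG_READ_WRITE };   /* 2 bits */

    static uint8_t  access_tag[LOCAL_PAGES][BLOCKS_PER_PAGE];
    static uint64_t global_page[LOCAL_PAGES];    /* SRAM translation table */

    /* Returns true if local memory may satisfy the access; otherwise the
       block must be fetched (at block granularity) from the home node. */
    bool scoma_access(uint64_t local_paddr, bool is_write, uint64_t *home_page)
    {
        uint64_t page = (local_paddr >> PAGE_SHIFT) % LOCAL_PAGES;
        uint64_t blk  = (local_paddr >> BLOCK_SHIFT) & (BLOCKS_PER_PAGE - 1);
        uint8_t  tag  = access_tag[page][blk];

        if (tag == TAG_READ_WRITE || (tag == TAG_READ_ONLY && !is_write))
            return true;                          /* local memory serves it */

        *home_page = global_page[page];           /* look up home for fetch */
        return false;                             /* inhibit local memory   */
    }

Note the division of labor the slide describes: software allocates and maps whole pages through the normal virtual-memory machinery, while this per-block check keeps coherence at block granularity in hardware.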
R-NUMA
• Classify remote pages:
  • Reuse pages: accessed many times by a node
  • Communication pages: used to communicate data between nodes
• Default all pages to CC-NUMA
• Dynamically change a page to S-COMA (a counter sketch follows below)
  • Threshold: number of remote capacity/conflict misses (refetches) per page in the block cache
  • The decision is made per node
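A minimal sketch of the per-node, per-page refetch counter is shown below in C. The threshold value, table size, and function name are assumptions chosen for illustration; the paper's mechanism is approximated, not reproduced.

    /* Illustrative R-NUMA page classification (a sketch under assumptions).
       Every remote page starts in CC-NUMA mode; once this node has refetched
       blocks of the page too many times (capacity/conflict misses in the
       block cache), the page is relocated to S-COMA mode on this node. */
    #include <stdbool.h>
    #include <stdint.h>

    #define REFETCH_THRESHOLD 64      /* assumed per-page relocation trigger */
    #define MAX_REMOTE_PAGES  65536   /* assumed table size                  */

    enum page_mode { MODE_CCNUMA, MODE_SCOMA };

    static uint32_t refetch_count[MAX_REMOTE_PAGES];
    static uint8_t  page_mode[MAX_REMOTE_PAGES];    /* all-zero: CC-NUMA */

    /* Called on each capacity/conflict miss for a remote page in the block
       cache. Returns true when the page should be relocated to S-COMA,
       i.e., remapped into local memory by software on this node. */
    bool rnuma_count_refetch(uint32_t remote_page)
    {
        if (page_mode[remote_page] == MODE_SCOMA)
            return false;                           /* already relocated */

        if (++refetch_count[remote_page] > REFETCH_THRESHOLD) {
            page_mode[remote_page] = MODE_SCOMA;    /* per-node decision  */
            return true;                            /* trigger relocation */
        }
        return false;
    }

Because each node keeps its own counters, the same page can be a reuse page (S-COMA) on one node and a communication page (CC-NUMA) on another.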
Qualitative Performance
• Worst-case scenario
  • A page is relocated from the block cache (CC-NUMA) to memory (S-COMA) and never referenced again
• Worst-case performance (a back-of-the-envelope version follows below)
  • Depends on the cost of relocation (changing a page from CC-NUMA to S-COMA) relative to the cost of page allocation
  • R-NUMA can be 3x worse than either CC-NUMA or S-COMA
• But...
  • The threshold for optimal worst-case performance differs from the threshold for optimal average performance
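One simplified way to see where a factor of about 3 can come from, using the CC-NUMA comparison: in the worst case the page accrues the full refetch threshold and is then relocated for nothing, so R-NUMA pays the relocation on top of what CC-NUMA would have paid. The symbols below (T, c_refetch, c_reloc) are illustrative, not the paper's exact cost model.

    \text{worst-case slowdown vs. CC-NUMA}
      \;\approx\; \frac{T\,c_{\mathrm{refetch}} + c_{\mathrm{reloc}}}{T\,c_{\mathrm{refetch}}}
      \;=\; 1 + \frac{c_{\mathrm{reloc}}}{T\,c_{\mathrm{refetch}}}

Here T is the refetch threshold, c_refetch the cost of one remote block refetch, and c_reloc the cost of relocating the page. In this simplified model the slide's 3x figure corresponds to relocation costing roughly twice the threshold's worth of refetches; the analogous comparison against S-COMA weighs relocation against the cost of the initial page allocation.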
Base System Results
• Best case: R-NUMA reduces execution time by 37%
• Worst case: R-NUMA increases execution time by 57%
• CC-NUMA can be 179% worse than S-COMA
• S-COMA can be 315% worse than CC-NUMA
Sensitivity Results
1. S-COMA and R-NUMA sensitivity to page-fault and TLB-invalidation overhead
2. R-NUMA sensitivity to the relocation threshold value
3. CC-NUMA and R-NUMA sensitivity to cache size