Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
NUMA multicores
[Diagram: a NUMA multicore with two processors; each processor has several cores sharing a cache, memory controllers (MC) attached to local DRAM memories, and interconnect links (IC) to the other processor.]
NUMA multicores
Two problems:
• NUMA: interconnect overhead
[Diagram: processes A and B with their memory (MA, MB); accesses to memory on the other processor's DRAM must cross the interconnect (IC).]
NUMA multicores
Two problems:
• NUMA: interconnect overhead
• multicore: cache contention
[Diagram: processes A and B running on cores that share a cache; they contend for the shared cache in addition to any interconnect traffic.]
Outline
• NUMA: experimental evaluation
• Scheduling
• N-MASS
• N-MASS evaluation
Multi-clone experiments
• Intel Xeon E5520
• 4 clones of soplex (SPEC CPU2006)
  • local clone
  • remote clone
• Memory behavior of unrelated programs
[Diagram: two processors with cores 0-3 and 4-7, each with a shared cache and local DRAM; clones (C) run with their memory (M) placed either on the local or the remote node.]
[Diagram: five possible clone-to-core schedules (1-5), ranging from all clones local to all clones remote, with local bandwidth of 100%, 80%, 57%, 32%, and 0%, respectively.]
Performance of schedules
• Which is the best schedule?
• Baseline: single-program execution mode
[Diagram: baseline, a single clone (C) running alone with local memory (M).]
Execution time
[Chart: slowdown relative to the baseline for local clones, remote clones, and the average.]
Outline
• NUMA: experimental evaluation
• Scheduling
• N-MASS
• N-MASS evaluation
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
• Step 1: maximum-local mapping
• Step 2: cache-aware refinement
Step 1: Maximum-local mapping
[Diagram: processes A-D with their memory MA-MD spread over the two processors' DRAMs; each process is mapped to a core on the processor that holds its memory.]
Default OS scheduling
[Diagram: processes A-D placed on cores without regard to where their memory (MA-MD) resides, so some processes access remote DRAM.]
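A minimal sketch of Step 1 in C, under the core layout from the slides (cores 0-3 on processor 0, cores 4-7 on processor 1). The struct proc descriptor and the four-cores-per-node assumption are ours; only sched_setaffinity is standard Linux API:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical process descriptor: pid plus the NUMA node
 * that holds (all of) the process's memory. */
struct proc {
    pid_t pid;
    int   mem_node;   /* 0 or 1 on a two-processor system */
};

/* Pin a process to all cores of the given processor. Assumes
 * cores 0-3 belong to node 0 and cores 4-7 to node 1, as in
 * the Xeon E5520 setup from the slides. */
static int pin_to_node(pid_t pid, int node)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = node * 4; cpu < node * 4 + 4; cpu++)
        CPU_SET(cpu, &set);
    return sched_setaffinity(pid, sizeof(set), &set);
}

/* Step 1: maximum-local mapping -- run every process on the
 * processor where its memory is allocated. */
static void maximum_local(struct proc *procs, int n)
{
    for (int i = 0; i < n; i++)
        if (pin_to_node(procs[i].pid, procs[i].mem_node) != 0)
            perror("sched_setaffinity");
}
```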
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
• Step 1: maximum-local mapping
• Step 2: cache-aware refinement
Step 2: Cache-aware refinement
In an SMP:
[Diagram: processes A-D mapped across the two processors' cores; on an SMP, processes can be remapped freely to balance cache pressure, since all memory is equally distant.]
Step 2: Cache-aware refinement
In an SMP:
[Chart: per-process performance degradation (NUMA penalty) for processes A-D before and after the cache-aware swap.]
Step 2: Cache-aware refinement
In a NUMA:
[Diagram: processes A-D after maximum-local mapping; moving a process to the other processor now incurs remote memory accesses.]
Step 2: Cache-aware refinement
In a NUMA:
[Chart: per-process performance degradation for the refined mapping; the refinement trades off each process's NUMA penalty against its NUMA allowance when moving it away from its memory.]
Performance factors
Two factors cause performance degradation:
• NUMA penalty: slowdown due to remote memory access
• cache pressure:
  • local processes: misses per 1000 instructions (MPKI)
  • remote processes: MPKI x NUMA penalty
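The metric above, together with the refinement of Step 2, can be sketched as follows. This is an illustration of the idea, not the authors' implementation; the struct task fields and the "move the task with the smallest NUMA penalty while it reduces cache imbalance" heuristic are our assumptions:

```c
struct task {
    double mpki;          /* cache misses per 1000 instructions */
    double numa_penalty;  /* slowdown factor when run remotely  */
    int    mem_node;      /* node that holds the task's memory  */
    int    run_node;      /* node the task currently runs on    */
};

/* Pressure a task puts on the cache it runs under: local tasks
 * contribute their MPKI, remote tasks their MPKI weighted by
 * the NUMA penalty (slide "Performance factors"). */
static double pressure(const struct task *t)
{
    return t->run_node == t->mem_node ? t->mpki
                                      : t->mpki * t->numa_penalty;
}

static double node_pressure(const struct task *ts, int n, int node)
{
    double p = 0.0;
    for (int i = 0; i < n; i++)
        if (ts[i].run_node == node)
            p += pressure(&ts[i]);
    return p;
}

/* Step 2 (sketch): while it helps, move the local task with the
 * smallest NUMA penalty away from the more loaded cache. */
static void refine(struct task *ts, int n)
{
    for (int iter = 0; iter < n; iter++) {
        double p0 = node_pressure(ts, n, 0);
        double p1 = node_pressure(ts, n, 1);
        int hot = p0 > p1 ? 0 : 1;
        double imbalance = p0 > p1 ? p0 - p1 : p1 - p0;

        /* Cheapest local candidate to run remotely. */
        int best = -1;
        for (int i = 0; i < n; i++)
            if (ts[i].run_node == hot && ts[i].mem_node == hot &&
                (best < 0 || ts[i].numa_penalty < ts[best].numa_penalty))
                best = i;
        if (best < 0)
            return;

        /* Keep the move only if it reduces the imbalance. */
        ts[best].run_node = 1 - hot;
        double q0 = node_pressure(ts, n, 0);
        double q1 = node_pressure(ts, n, 1);
        if ((q0 > q1 ? q0 - q1 : q1 - q0) >= imbalance) {
            ts[best].run_node = hot;   /* undo */
            return;
        }
    }
}
```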
Implementation
• User-mode extension to the Linux scheduler
• Performance metrics:
  • hardware performance counter feedback
• NUMA penalty:
  • perfect information from program traces
  • estimate based on MPKI
• All memory of a process allocated on one processor
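The slides do not show the counter setup; one plausible way for a user-mode extension to obtain per-process MPKI on Linux is the perf_event_open syscall, reading retired instructions and cache misses (a sketch; the one-second sampling interval and counter choice are assumptions):

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Open a hardware counter for one process, counting on any CPU. */
static int open_counter(pid_t pid, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}

/* Read both counters and return misses per 1000 instructions. */
static double sample_mpki(int fd_miss, int fd_instr)
{
    uint64_t misses = 0, instrs = 0;
    if (read(fd_miss, &misses, sizeof(misses)) != sizeof(misses) ||
        read(fd_instr, &instrs, sizeof(instrs)) != sizeof(instrs))
        return -1.0;
    return instrs ? 1000.0 * (double)misses / (double)instrs : 0.0;
}

int main(int argc, char **argv)
{
    pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();
    int fd_miss  = open_counter(pid, PERF_COUNT_HW_CACHE_MISSES);
    int fd_instr = open_counter(pid, PERF_COUNT_HW_INSTRUCTIONS);
    if (fd_miss < 0 || fd_instr < 0) {
        perror("perf_event_open");
        return 1;
    }
    sleep(1);  /* let the target run for one sampling interval */
    printf("MPKI: %.2f\n", sample_mpki(fd_miss, fd_instr));
    return 0;
}
```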
Outline
• NUMA: experimental evaluation
• Scheduling
• N-MASS
• N-MASS evaluation
Workloads
• SPEC CPU2006 subset
• 11 multi-program workloads (WL1-WL11)
  • 4-program workloads (WL1-WL9)
  • 8-program workloads (WL10, WL11)
[Chart: NUMA penalty of the selected benchmarks, ranging from CPU-bound to memory-bound.]
Memory allocation setup
• Where the memory of each process is allocated influences performance
• Controlled setup: memory allocation maps
Memory allocation maps
[Diagram: processes A-D on the two processors with all memory (MA-MD) allocated on Processor 0's DRAM; allocation map 0000.]
Memory allocation maps
[Diagram: two allocation maps side by side. Map 0000 places the memory of all four processes on Processor 0 (unbalanced); map 0011 splits it between the two processors (balanced).]
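A controlled allocation map can be reproduced with libnuma by placing each process's memory on the node named by its digit in the map. The helper below is hypothetical, assuming digit i of the map gives the node for process i (link with -lnuma):

```c
#include <numa.h>     /* libnuma: link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

#define NPROC 4

/* Allocate 'size' bytes for process i according to an allocation
 * map string such as "0011", where character i names the NUMA node
 * that should hold process i's memory. */
static void *alloc_for_process(const char *map, int i, size_t size)
{
    int node = map[i] - '0';
    return numa_alloc_onnode(size, node);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    const char *map = "0011";   /* balanced: two processes per node */
    size_t size = 64 * 1024 * 1024;
    void *mem[NPROC];
    for (int i = 0; i < NPROC; i++)
        mem[i] = alloc_for_process(map, i, size);
    /* ... run the workload against mem[0..3] ... */
    for (int i = 0; i < NPROC; i++)
        numa_free(mem[i], size);
    return 0;
}
```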
Evaluation
• Baseline: Linux average
  • the Linux scheduler is non-deterministic
  • average performance degradation across all possible cases
• N-MASS with perfect NUMA penalty information
WL9: Linux average
[Chart: average slowdown relative to single-program mode.]
WL9: N-MASS
[Chart: average slowdown relative to single-program mode.]
WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode.]
N-MASS performance
• N-MASS reduces performance degradation by up to 22%
• Which factor is more important: interconnect overhead or cache contention?
• Compare:
  • maximum-local
  • N-MASS (maximum-local + cache refinement step)
Data locality vs. cache balancing (WL9)
[Chart: performance improvement relative to Linux average.]
Data locality vs. cache balancing (WL1)
[Chart: performance improvement relative to Linux average.]
Data locality vs. cache balancing
• Data locality is more important than cache balancing
• Cache balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
Estimating the NUMA penalty
• The NUMA penalty is not directly measurable
• Estimate: fit a linear regression onto MPKI data
[Chart: NUMA penalty as a function of MPKI with the fitted regression line.]
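The slides only state that a linear regression is fitted onto MPKI data; an ordinary least-squares version of that fit might look like this (the function names are ours):

```c
/* Fit penalty ~= a * mpki + b by ordinary least squares over n
 * (mpki, penalty) samples gathered from program traces. */
static void fit_penalty(const double *mpki, const double *penalty,
                        int n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += mpki[i];
        sy  += penalty[i];
        sxx += mpki[i] * mpki[i];
        sxy += mpki[i] * penalty[i];
    }
    double denom = n * sxx - sx * sx;
    *a = denom != 0.0 ? (n * sxy - sx * sy) / denom : 0.0;
    *b = (sy - *a * sx) / n;
}

/* Estimated NUMA penalty for a process with a given MPKI. */
static double estimate_penalty(double mpki_val, double a, double b)
{
    return a * mpki_val + b;
}
```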
Estimate-based N-MASS: performance
[Chart: performance improvement relative to Linux average.]
Conclusions
• N-MASS: a NUMA-multicore-aware scheduler
• Data locality optimizations are more beneficial than cache contention avoidance
• Better performance metrics are needed for scheduling