Memory System Performance in a NUMA Multicore Multiprocessor

Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich

Summary • NUMA multicore systems are unfair to local memory accesses • Local execution sometimes suboptimal

Outline • NUMA multicores: how it happened • Experimental evaluation: Intel Nehalem • Bandwidth sharing model • The next generation: Intel Westmere

NUMA multicores: how it happened First generation: SMP 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 BusC BusC BusC BusC BusC BusC BusC BusC Northbridge MC MC DRAM memory

NUMA multicores: how it happened Next generation: NUMA 0 1 2 3 4 5 6 7 BusC BusC BusC BusC IC IC Northbridge MC MC MC DRAM memory DRAM memory

NUMA multicores: how it happened Next generation: NUMA 0 0 1 2 3 4 5 6 7 1 2 3 4 5 6 7 MC MC IC IC DRAM memory DRAM memory

Bandwidth sharing • Frequent scenario:bandwidth sharedbetween cores • Sharing model for the Intel Nehalem 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 MC IC IC MC DRAM memory DRAM memory

Evaluation system Intel Nehalem E5520 2 x 4 cores 8 MB level 3 cache 12 GB DDR3 RAM 5.86 GT/s QPI Processor 0 Processor 1 0 1 2 3 4 5 6 7 Level 3 cache Level 3 cache Global Queue Global Queue Global Queue Global Queue MC QPI QPI MC QPI QPI DRAM memory DRAM memory

Bandwidth sharing: local accesses Processor 0 Processor 1 0 1 2 3 4 5 6 7 0 3 Level 3 cache Level 3 cache Global Queue Global Queue Global Queue MC QPI QPI MC DRAM memory DRAM memory DRAM memory

Bandwidth sharing: remote accesses Processor 0 Processor 1 0 1 2 3 4 5 6 7 0 4 5 3 Level 3 cache Level 3 cache Global Queue Global Queue Global Queue MC QPI QPI MC DRAM memory DRAM memory DRAM memory

Bandwidth sharing: combined accesses Processor 0 Processor 1 0 1 2 3 4 5 6 7 0 4 5 3 Level 3 cache Level 3 cache Global Queue Global Queue Global Queue Global Queue MC QPI QPI MC DRAM memory DRAM memory DRAM memory

Global Queue • Mechanism to arbitrate between different types of memory accesses • We look at fairness of the Global Queue: • local memory accesses • remote memory accesses • combined memory accesses

Benchmark program • STREAM triad for (i=0; i<SIZE; i++) { a[i]=b[i]+SCALAR*c[i]; } • Multiple co-executing triad clones

Multi-clone experiments • All memory allocated on Processor 0 • Local clones: Remote clones: • Example benchmark configurations: C C (0L, 3R) (2L, 3R) (2L, 0R) C C C C C C C C C C Processor 0 Processor 1 Processor 0 Processor 1

GQ fairness: local accesses Processor 0 Processor 1 Total bandwidth [GB/s] 0 1 2 3 4 5 6 7 C C C C Cache Cache GQ GQ IMC QPI QPI IMC DRAM memory DRAM DRAM

GQ fairness: remote accesses Processor 0 Processor 1 Total bandwidth [GB/s] 0 1 2 3 4 5 6 7 C C C C Cache Cache GQ GQ IMC QPI QPI IMC DRAM memory DRAM DRAM

Global Queue fairness • Global Queue fair when there areonly local/remote accesses in the system • What about combined accesses?

GQ fairness: combined accesses Execute clones in all possible configurations (2L, 3R)

GQ fairness: combined accesses Execute clones in all possible configurations

GQ fairness: combined accesses Total bandwidth [GB/s]

GQ fairness: combined accesses Execute clones in all possible configurations

Combined accesses Total bandwidth [GB/s]

Combined accesses • In configuration (4L, 1R)remote clone gets30% more bandwidth than a local clone • Remote execution can be better than local

Bandwidth sharing model 0 1 2 3 4 5 6 7 C C Level 3 cache Level 3 cache Global Queue Global Queue IMC QPI QPI IMC DRAM memory DRAM memory DRAM memory

Sharing factor () • Characterizes the fairness of the Global Queue • Dependence of sharing factor on contention?

Contention affects sharing factor Processor 0 Processor 0 C DRAM QPI C contenders C C C

Contention affects sharing factor Sharing factor ()

Combined accesses Total bandwidth [GB/s]

Contention affects sharing factor • Sharing factor decreases with contention • With local contention remote execution becomes more favorable

The next generation Intel Westmere X5680 2 x 6 cores 12 MB level 3 cache 144 GB DDR3 RAM 6.4 GT/s QPI 6 7 8 9 A B 0 1 2 3 4 5 Level 3 cache Level 3 cache Global Queue Global Queue IMC QPI QPI IMC DRAM memory DRAM memory

The next generation Total bandwidth [GB/s]

Conclusions • Optimizing for data locality can be suboptimal • Applications: • OS scheduling (see ISMM’11 paper) • data placement and computation scheduling

Thank you! Questions?

Memory System Performance in a NUMA Multicore Multiprocessor

Memory System Performance in a NUMA Multicore Multiprocessor

Presentation Transcript

Shared Memory: UMA and NUMA

Memory System Performance in a NUMA Multicore Multiprocessor

Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead

Non-Uniform Memory Access Computers (NUMA)

CCR Multicore Performance

Silo : Speedy Transactions in Multicore In-Memory Databases

NUMA aware heap memory manager

Tornado: Maximizing Locality And Concurrency In A Shared Memory Multiprocessor Operating System

Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System

Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System

(Mis) Understanding the NUMA Memory System Performance of Multithreaded Workloads

Memory model constraints limit multiprocessor performance.

Multiprocessor Kernel Performance Profiling

Memory System Performance

12.4 Memory Organization in Multiprocessor Systems

MULTIPROCESSOR OPERATING SYSTEM

Real-Time Performance and Middleware for Multiprocessor and Multicore Linux Platforms*

Performance Optimizations for NUMA-Multicore Systems

Benchmarking SMP Memory System Performance

Multiprocessor System Interconnects