Performance Optimizations for NUMA-Multicore Systems
Zoltán Majó, Department of Computer Science, ETH Zurich, Switzerland
About me • ETH Zurich: research assistant (research: performance optimizations; assistant: lectures) • TUCN: student; network engineer at the Communications Center; assistant at the Department of Computer Science
Computing • Unlimited need for performance
Performance optimizations • One goal: make programs run fast • Idea: pick good algorithm • Reduce number of operations executed • Example: sorting
Sorting [Chart: execution time (T) vs. number of operations for bubble sort and quicksort; quicksort is 11X faster than bubble sort]
Sorting • We picked a good algorithm. Are we really done? • We must also make sure the algorithm runs fast on the hardware • Operations take time: so far we assumed 1 operation = 1 time unit (T)
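To make the operation-count argument concrete: a minimal C sketch (not from the slides; the input size and use of clock() are assumptions) that times an O(n²) bubble sort against the C library's O(n log n) qsort on identical input:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* O(n^2) comparisons and swaps: far more operations than an O(n log n) sort */
static void bubble_sort(int *a, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        for (size_t j = 0; j + 1 < n - i; j++)
            if (a[j] > a[j + 1]) {
                int tmp = a[j]; a[j] = a[j + 1]; a[j + 1] = tmp;
            }
}

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

int main(void) {
    enum { N = 20000 };                      /* assumed input size */
    int *a = malloc(N * sizeof *a);
    int *b = malloc(N * sizeof *b);
    for (size_t i = 0; i < N; i++) a[i] = rand();
    memcpy(b, a, N * sizeof *a);             /* same input for both sorts */

    clock_t t0 = clock();
    bubble_sort(a, N);
    clock_t t1 = clock();
    qsort(b, N, sizeof *b, cmp_int);
    clock_t t2 = clock();

    printf("bubble sort: %.2f s, qsort: %.2f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a); free(b);
    return 0;
}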
Quicksort performance [Chart: quicksort execution time as operation latency grows from 1 T to 2 T, 4 T, and 8 T per operation; at 8 T per operation, bubble sort with 1 T operations is 32% faster than quicksort]
Latency of operations • Best algorithm is not enough • Operations are executed on hardware • Hardware must be used efficiently [Diagram: CPU pipeline; Stage 1: dispatch operation, Stage 2: execute operation, Stage 3: retire operation]
Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusion • ETH scholarship
Memory accesses [Diagram: CPU connected directly to RAM; 230 cycles access latency] • 16 memory accesses: total access latency = 16 × 230 cycles = 3680 cycles
Caching [Diagram: a cache is inserted between CPU and RAM; cache access latency 30 cycles, RAM access latency 200 cycles; data moves between RAM and cache in blocks, so one fetched block serves several subsequent accesses]
Hits and misses • Cache hit: data in cache → 30 cycles • Cache miss: data not in cache → 30 + 200 = 230 cycles
Total access latency • Same 16 accesses with the cache: 4 misses + 12 hits = 4 × 230 cycles + 12 × 30 cycles = 1280 cycles
Benefits of caching • Comparison • Architecture w/o cache: T = 230 cycles per access • Architecture w/ cache: Tavg = 1280 / 16 = 80 cycles per access → ~2.9X improvement • Do caches always help? • Can you think of an access pattern with bad cache usage?
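The improvement follows from a simple average-latency model; here is a sketch of that arithmetic in C, using the 30/230-cycle latencies and the 12-hits-out-of-16-accesses ratio from the example above:

#include <stdio.h>

int main(void) {
    double t_hit    = 30.0;         /* cache hit latency (cycles)         */
    double t_miss   = 230.0;        /* miss latency: cache + RAM (cycles) */
    double hit_rate = 12.0 / 16.0;  /* 12 hits out of 16 accesses         */

    double t_avg = hit_rate * t_hit + (1.0 - hit_rate) * t_miss;
    printf("T_avg = %.0f cycles, improvement vs. no cache = %.2fX\n",
           t_avg, 230.0 / t_avg);   /* prints 80 cycles and 2.88X */
    return 0;
}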
Caching [Diagram: same CPU/cache/RAM configuration, revisited for an access pattern with poor cache usage]
Cache-aware programming • Today’s example: matrix-matrix multiplication (MMM) • Number of operations: n³ • Compare naïve and optimized implementations • Same number of operations
MMM: naïve implementation (ijk), computes C = A × B:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }
MMM: cache behavior of ijk [Diagram: CPU, cache (30 cycles access latency), RAM (200 cycles access latency) holding A, B, C] • A[][]: 3 cache hits out of 4 accesses • B[][]: 0 cache hits out of 4 accesses
MMM: cache performance • Hit rate • Accesses to A[][]: 3/4 = 75% (row-wise traversal: consecutive elements share a cache block) • Accesses to B[][]: 0/4 = 0% (column-wise traversal of a row-major array: every access touches a new block) • All accesses: 38% • Can we do better?
Cache-friendly MMM

Cache-unfriendly MMM (ijk):

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }

Cache-friendly MMM (ikj):

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        r = A[i][k];
        for (j = 0; j < N; j++)
            C[i][j] += r * B[k][j];
    }

(The ikj version accumulates into C, so C must start zeroed.)
MMM: cache behavior of ikj [Diagram: same CPU/cache/RAM setup] • C[][]: 3 cache hits out of 4 accesses • B[][]: 3 cache hits out of 4 accesses
Cache-friendly MMM • Cache-unfriendly (ijk): A[][]: 3/4 = 75% hit rate; B[][]: 0/4 = 0% hit rate; all accesses: 38% hit rate • Cache-friendly (ikj): C[][]: 3/4 = 75% hit rate; B[][]: 3/4 = 75% hit rate; all accesses: 75% hit rate • Better performance due to cache-friendliness?
Performance of MMM [Chart: execution time in seconds; ikj is 20X faster than ijk]
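A harness along the following lines can reproduce the comparison. This is a sketch: the matrix size N, the random initialization, and timing via clock() are assumptions, not the setup behind the chart above:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1024 };  /* assumed size; large enough that B does not fit in cache */

static double A[N][N], B[N][N], C[N][N];

static void mmm_ijk(void) {            /* cache-unfriendly: walks B column-wise */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

static void mmm_ikj(void) {            /* cache-friendly: walks B and C row-wise */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double r = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += r * B[k][j];
        }
}

static double timed(void (*f)(void)) {
    clock_t t0 = clock();
    f();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)rand() / RAND_MAX;
            B[i][j] = (double)rand() / RAND_MAX;
        }
    printf("ijk: %.2f s\n", timed(mmm_ijk));
    for (int i = 0; i < N; i++)        /* ikj accumulates, so clear C first */
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    printf("ikj: %.2f s\n", timed(mmm_ikj));
    return 0;
}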
Cache-aware programming • Two versions of MMM: ijk and ikj • Same number of operations (~n³) • ikj is 20X faster than ijk • Good performance depends on two aspects: a good algorithm, and an implementation that takes the hardware into account • Hardware offers many possibilities for inefficiencies; we consider only the memory system in this lecture
Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusions • ETH scholarship
Cache-based architecture [Diagram: CPU → L1 cache (10 cycles access latency) → L2 cache (20 cycles access latency) → bus controller → memory controller → RAM (200 cycles access latency)]
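These latencies can be estimated on a real machine by chasing pointers through a randomly permuted cycle, which defeats the hardware prefetcher. A minimal sketch follows; the working-set sizes meant to target L1, L2, and RAM are guesses and must be adapted to the actual cache sizes:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Link n slots into one random cycle and chase through it; the average
   time per hop approximates the latency of whichever level of the
   memory hierarchy the working set fits in. */
static double chase(size_t n, long hops) {
    size_t *order = malloc(n * sizeof *order);
    size_t *next  = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++)            /* link the shuffled order */
        next[order[i]] = order[i + 1];            /* into a single cycle     */
    next[order[n - 1]] = order[0];

    volatile size_t p = 0;
    clock_t t0 = clock();
    for (long h = 0; h < hops; h++) p = next[p];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(order); free(next);
    return secs * 1e9 / (double)hops;             /* ns per access */
}

int main(void) {
    /* working-set sizes are guesses: 32 KB (~L1), 256 KB (~L2), 64 MB (RAM) */
    printf("~L1:  %.1f ns/access\n", chase(4 * 1024, 100000000L));
    printf("~L2:  %.1f ns/access\n", chase(32 * 1024, 100000000L));
    printf("~RAM: %.1f ns/access\n", chase(8 * 1024 * 1024, 10000000L));
    return 0;
}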
Multi-core multiprocessor [Diagram: two processor packages; each package has four cores with private L1 caches and shared L2 caches, and connects through a bus controller to the memory controller and RAM]
Experiment • Performance of a well-optimized program: soplex from SPEC CPU 2006 • Multicore-multiprocessor systems are parallel: multiple programs run on the system simultaneously • Contender program: milc from SPEC CPU 2006 • Examine 4 execution scenarios
Execution scenarios [Diagram: soplex and milc placed on cores of Processor 0 and Processor 1; depending on the scenario they share an L2 cache, a memory controller, both, or neither; the placements can be set up with core pinning, as sketched below]
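On Linux, such placements can be reproduced by pinning each program to a chosen core, either with the taskset utility or programmatically via sched_setaffinity. A minimal wrapper sketch; the ./pin name is hypothetical, and core numbering depends on the machine's topology:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Pin the calling process to one core, then exec the benchmark.
   Hypothetical usage: ./pin 0 ./soplex ...   and   ./pin 4 ./milc ... */
int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <core> <program> [args...]\n", argv[0]);
        return 1;
    }
    int core = atoi(argv[1]);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                      /* restrict to a single core */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    execvp(argv[2], &argv[2]);                /* replace this process with */
    perror("execvp");                         /* the benchmark program     */
    return 1;
}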
Performance with sharing: soplex [Chart: execution time relative to solo execution in each scenario]