Performance Optimizations for NUMA-Multicore Systems
Zoltán Majó, Department of Computer Science, ETH Zurich, Switzerland
About me • ETH Zurich: research assistant (research: performance optimizations; assistant: lectures) • TUCN: student; network engineer at the Communications Center; assistant at the Department of Computer Science
Computing • Unlimited need for performance
Performance optimizations • One goal: make programs run fast • Idea: pick good algorithm • Reduce number of operations executed • Example: sorting
Sorting [Chart: execution time (T) vs. number of operations for bubble sort and quicksort; quicksort is 11X faster than bubble sort]
Sorting • We picked a good algorithm. Are we really done? • We must also make sure the algorithm runs fast on the hardware • Operations take time: so far we assumed 1 operation = 1 time unit (T)
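To make the operation-count argument concrete: a minimal C sketch (not from the slides; the input size and use of clock() are assumptions) that times an O(n²) bubble sort against the C library's O(n log n) qsort on identical input:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* O(n^2) comparisons and swaps: far more operations than an O(n log n) sort */
static void bubble_sort(int *a, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        for (size_t j = 0; j + 1 < n - i; j++)
            if (a[j] > a[j + 1]) {
                int tmp = a[j]; a[j] = a[j + 1]; a[j + 1] = tmp;
            }
}

static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

int main(void) {
    enum { N = 20000 };                      /* assumed input size */
    int *a = malloc(N * sizeof *a);
    int *b = malloc(N * sizeof *b);
    for (size_t i = 0; i < N; i++) a[i] = rand();
    memcpy(b, a, N * sizeof *a);             /* same input for both sorts */

    clock_t t0 = clock();
    bubble_sort(a, N);
    clock_t t1 = clock();
    qsort(b, N, sizeof *b, cmp_int);
    clock_t t2 = clock();

    printf("bubble sort: %.2f s, qsort: %.2f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a); free(b);
    return 0;
}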
Quicksort performance [Chart: quicksort execution time as operation latency grows from 1 T to 2 T, 4 T, and 8 T per operation; at 8 T per operation, bubble sort with 1 T operations is 32% faster than quicksort]
Latency of operations • Best algorithm is not enough • Operations are executed on hardware • Hardware must be used efficiently [Diagram: CPU pipeline; Stage 1: dispatch operation, Stage 2: execute operation, Stage 3: retire operation]
Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusion • ETH scholarship
Memory accesses [Diagram: CPU connected directly to RAM; 230 cycles access latency] • 16 memory accesses: total access latency = 16 × 230 cycles = 3680 cycles
Caching [Diagram: a cache is inserted between CPU and RAM; cache access latency 30 cycles, RAM access latency 200 cycles; data moves between RAM and cache in blocks, so one fetched block serves several subsequent accesses]
Hits and misses • Cache hit: data in cache → 30 cycles • Cache miss: data not in cache → 30 + 200 = 230 cycles
Total access latency • Same 16 accesses with the cache: 4 misses + 12 hits = 4 × 230 cycles + 12 × 30 cycles = 1280 cycles
Benefits of caching • Comparison • Architecture w/o cache: T = 230 cycles per access • Architecture w/ cache: Tavg = 1280 / 16 = 80 cycles per access → ~2.9X improvement • Do caches always help? • Can you think of an access pattern with bad cache usage?
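The improvement follows from a simple average-latency model; here is a sketch of that arithmetic in C, using the 30/230-cycle latencies and the 12-hits-out-of-16-accesses ratio from the example above:

#include <stdio.h>

int main(void) {
    double t_hit    = 30.0;         /* cache hit latency (cycles)         */
    double t_miss   = 230.0;        /* miss latency: cache + RAM (cycles) */
    double hit_rate = 12.0 / 16.0;  /* 12 hits out of 16 accesses         */

    double t_avg = hit_rate * t_hit + (1.0 - hit_rate) * t_miss;
    printf("T_avg = %.0f cycles, improvement vs. no cache = %.2fX\n",
           t_avg, 230.0 / t_avg);   /* prints 80 cycles and 2.88X */
    return 0;
}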
Caching [Diagram: same CPU/cache/RAM configuration, revisited for an access pattern with poor cache usage]
Cache-aware programming • Today’s example: matrix-matrix multiplication (MMM) • Number of operations: n³ • Compare naïve and optimized implementations • Same number of operations
MMM: naïve implementation (ijk), computes C = A × B:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }
MMM: cache behavior of ijk [Diagram: CPU, cache (30 cycles access latency), RAM (200 cycles access latency) holding A, B, C] • A[][]: 3 cache hits out of 4 accesses • B[][]: 0 cache hits out of 4 accesses
MMM: cache performance • Hit rate • Accesses to A[][]: 3/4 = 75% (row-wise traversal: consecutive elements share a cache block) • Accesses to B[][]: 0/4 = 0% (column-wise traversal of a row-major array: every access touches a new block) • All accesses: 38% • Can we do better?
Cache-friendly MMM

Cache-unfriendly MMM (ijk):

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        sum = 0.0;
        for (k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }

Cache-friendly MMM (ikj):

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        r = A[i][k];
        for (j = 0; j < N; j++)
            C[i][j] += r * B[k][j];
    }

(The ikj version accumulates into C, so C must start zeroed.)
MMM: cache behavior of ikj [Diagram: same CPU/cache/RAM setup] • C[][]: 3 cache hits out of 4 accesses • B[][]: 3 cache hits out of 4 accesses
Cache-friendly MMM • Cache-unfriendly (ijk): A[][]: 3/4 = 75% hit rate; B[][]: 0/4 = 0% hit rate; all accesses: 38% hit rate • Cache-friendly (ikj): C[][]: 3/4 = 75% hit rate; B[][]: 3/4 = 75% hit rate; all accesses: 75% hit rate • Better performance due to cache-friendliness?
Performance of MMM [Chart: execution time in seconds; ikj is 20X faster than ijk]
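A harness along the following lines can reproduce the comparison. This is a sketch: the matrix size N, the random initialization, and timing via clock() are assumptions, not the setup behind the chart above:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1024 };  /* assumed size; large enough that B does not fit in cache */

static double A[N][N], B[N][N], C[N][N];

static void mmm_ijk(void) {            /* cache-unfriendly: walks B column-wise */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

static void mmm_ikj(void) {            /* cache-friendly: walks B and C row-wise */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double r = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += r * B[k][j];
        }
}

static double timed(void (*f)(void)) {
    clock_t t0 = clock();
    f();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)rand() / RAND_MAX;
            B[i][j] = (double)rand() / RAND_MAX;
        }
    printf("ijk: %.2f s\n", timed(mmm_ijk));
    for (int i = 0; i < N; i++)        /* ikj accumulates, so clear C first */
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    printf("ikj: %.2f s\n", timed(mmm_ikj));
    return 0;
}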
Cache-aware programming • Two versions of MMM: ijk and ikj • Same number of operations (~n³) • ikj is 20X faster than ijk • Good performance depends on two aspects: a good algorithm, and an implementation that takes the hardware into account • Hardware offers many possibilities for inefficiencies; we consider only the memory system in this lecture
Outline • Introduction: performance optimizations • Cache-aware programming • Scheduling on multicore processors • Using run-time feedback • Data locality optimizations on NUMA-multicores • Conclusions • ETH scholarship
Cache-based architecture [Diagram: CPU → L1 cache (10 cycles access latency) → L2 cache (20 cycles access latency) → bus controller → memory controller → RAM (200 cycles access latency)]
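These latencies can be estimated on a real machine by chasing pointers through a randomly permuted cycle, which defeats the hardware prefetcher. A minimal sketch follows; the working-set sizes meant to target L1, L2, and RAM are guesses and must be adapted to the actual cache sizes:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Link n slots into one random cycle and chase through it; the average
   time per hop approximates the latency of whichever level of the
   memory hierarchy the working set fits in. */
static double chase(size_t n, long hops) {
    size_t *order = malloc(n * sizeof *order);
    size_t *next  = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++)            /* link the shuffled order */
        next[order[i]] = order[i + 1];            /* into a single cycle     */
    next[order[n - 1]] = order[0];

    volatile size_t p = 0;
    clock_t t0 = clock();
    for (long h = 0; h < hops; h++) p = next[p];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(order); free(next);
    return secs * 1e9 / (double)hops;             /* ns per access */
}

int main(void) {
    /* working-set sizes are guesses: 32 KB (~L1), 256 KB (~L2), 64 MB (RAM) */
    printf("~L1:  %.1f ns/access\n", chase(4 * 1024, 100000000L));
    printf("~L2:  %.1f ns/access\n", chase(32 * 1024, 100000000L));
    printf("~RAM: %.1f ns/access\n", chase(8 * 1024 * 1024, 10000000L));
    return 0;
}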
Multi-core multiprocessor [Diagram: two processor packages; each package has four cores with private L1 caches and shared L2 caches, and connects through a bus controller to the memory controller and RAM]
Experiment • Performance of a well-optimized program: soplex from SPEC CPU 2006 • Multicore-multiprocessor systems are parallel: multiple programs run on the system simultaneously • Contender program: milc from SPEC CPU 2006 • Examine 4 execution scenarios
Execution scenarios [Diagram: soplex and milc placed on cores of Processor 0 and Processor 1; depending on the scenario they share an L2 cache, a memory controller, both, or neither; the placements can be set up with core pinning, as sketched below]
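On Linux, such placements can be reproduced by pinning each program to a chosen core, either with the taskset utility or programmatically via sched_setaffinity. A minimal wrapper sketch; the ./pin name is hypothetical, and core numbering depends on the machine's topology:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Pin the calling process to one core, then exec the benchmark.
   Hypothetical usage: ./pin 0 ./soplex ...   and   ./pin 4 ./milc ... */
int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <core> <program> [args...]\n", argv[0]);
        return 1;
    }
    int core = atoi(argv[1]);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                      /* restrict to a single core */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    execvp(argv[2], &argv[2]);                /* replace this process with */
    perror("execvp");                         /* the benchmark program     */
    return 1;
}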
Performance with sharing: soplex [Chart: execution time relative to solo execution in each scenario]