A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture Yun Zhang, Michael J. Voss (University of Toronto) Guansong Zhang, Raul Silvera (IBM Toronto Lab) 6-Nov-14
Agenda • Background • Motivation • Previous Work • Adaptive Schedulers • IBM Power 5 Architecture • A Multi-Level Hierarchical Scheduler • Evaluation • Future Work
Simultaneous Multi-Threading • Architecture • Several threads per physical processor • Threads share • Caches • Registers • Functional Units
OpenMP • OpenMP • A standard API for shared memory programming • Add directives for parallel regions • Standard Loop Schedulers • Static • Dynamic • Guided • Runtime
An example of a parallel loop in C (similar in Fortran). The schedule(runtime) clause defers the choice of loop scheduler until the program runs.

#pragma omp parallel for shared(a, b) private(i, j) schedule(runtime)
for ( i = 0; i < 100; i ++ ) {
    for ( j = 0; j < 100; j ++ ) {
        a[i][j] = a[i][j] + b[i][j];
    }
}

(Diagram: the i-j iteration space divided among OpenMP threads T0 … Tn.)
Motivation • OpenMP applications • Designed for SMP systems • Not aware of SMT (Hyper-Threading) technology • Understanding and controlling the performance of OpenMP applications on SMT processors is not trivial • Important performance issues on SMP systems with SMT nodes • Inter-thread data locality • Instruction mix • SMT-related load balance
Scaling (SPEC & NAS) (Chart: speedups on 4 Intel Xeon processors with Hyper-Threading, comparing 1 thread per processor against 1-2 threads per processor.)
Why do they scale poorly? • Inter-thread data locality: threads sharing a cache can evict each other's data, raising cache misses • Instruction mix: threads on one core share functional units, and the benefit of overlapping complementary instruction mixes may outweigh the cost of the extra cache misses • SMT-related load balance: we should balance workloads well both among physical processors and among the threads running on the same physical processor
Previous Work:Runtime Adaptive Scheduler • Hierarchical Scheduling • Upper level scheduler • Lower level scheduler • Select scheduler and the number of threads to run at runtime • One thread per physical processor • Two threads per physical processor
Traditional Scheduling (Diagram: the i-j iteration space under static scheduling, where each thread T0 … Tn receives one contiguous block of iterations up front, and under dynamic scheduling, where threads Ti … Tk grab chunks of iterations on demand.)
Hierarchical Scheduling (Diagram: static scheduling first divides the iteration space among physical processors P0 … Pi; dynamic scheduling then distributes each processor's share between its SMT threads, e.g. T00/T01 on P0 and Ti0/Ti1 on Pi.)
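The two-level scheme above can be sketched in C. This is a serial, illustrative sketch, not the runtime's actual code: the helper names are invented, and the dynamic counter would be advanced atomically in a real multi-threaded runtime.

```c
/* Upper level (static): processor p of P gets one contiguous block of [0, n).
   Helper names are hypothetical, not from the paper's runtime. */
void static_block(int n, int P, int p, int *lo, int *hi) {
    int base = n / P, rem = n % P;
    *lo = p * base + (p < rem ? p : rem);
    *hi = *lo + base + (p < rem ? 1 : 0);
}

/* Lower level (dynamic): the SMT threads on one processor pull fixed-size
   chunks from the processor's block through a shared counter. A real
   runtime would update *next with an atomic fetch-and-add. */
int dynamic_next(int *next, int hi, int chunk, int *start, int *end) {
    if (*next >= hi) return 0;  /* processor's block is exhausted */
    *start = *next;
    *end = (*next + chunk < hi) ? *next + chunk : hi;
    *next = *end;
    return 1;
}
```

Each physical processor first gets its static block; its two SMT threads then share that block dynamically, which keeps the data each core touches contiguous while still balancing load between the sibling threads.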
Why can we benefit from runtime scheduler selection? • Many parallel loops in OpenMP applications are executed again and again. • Example:

for (k = 1; k < 100; k++) {
    ……
    calculate();
    ……
}

void calculate() {
#pragma omp parallel for schedule(runtime)
    for (i = 1; i < 100; i++) {
        ……; // calculation
    }
}
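Because the same loop runs many times, the runtime can spend its first few invocations sampling candidate schedulers and then lock in the fastest one. A minimal sketch of that idea, assuming invented names (loop_adapt_t, pick_scheduler) and a simple try-each-once policy rather than the paper's exact algorithm:

```c
enum { NUM_SCHEDS = 3 };  /* e.g. static, dynamic, guided */

/* Per-loop adaptation state (hypothetical, one instance per parallel loop). */
typedef struct {
    double sample_time[NUM_SCHEDS];  /* measured time per candidate */
    int trials_done;                 /* invocations used for sampling */
    int chosen;                      /* -1 while still sampling */
} loop_adapt_t;

/* Called once per loop invocation with the elapsed time of the scheduler
   tried on that invocation; returns the scheduler id to use next time. */
int pick_scheduler(loop_adapt_t *st, double elapsed) {
    if (st->chosen >= 0) return st->chosen;   /* decision already locked in */
    st->sample_time[st->trials_done++] = elapsed;
    if (st->trials_done < NUM_SCHEDS)
        return st->trials_done;               /* try the next candidate */
    int best = 0;                             /* all sampled: pick fastest */
    for (int s = 1; s < NUM_SCHEDS; s++)
        if (st->sample_time[s] < st->sample_time[best]) best = s;
    st->chosen = best;
    return best;
}
```

After NUM_SCHEDS invocations the sampling overhead stops entirely, which is why repeated execution of the same loop makes runtime selection affordable.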
Adaptive Schedulers • Region Based Scheduler • Selects loop schedulers at runtime • Parallel loops in one parallel region have to use the same scheduler, which may not be the best choice for every loop • Loop Based Scheduler • Higher runtime overhead • More accurate loop scheduler for each parallel loop
Sample from NAS2004

!$omp parallel default(shared) private(i,j,k)
!$omp do schedule(runtime)
      do j=1,lastrow-firstrow+1
         do k=rowstr(j),rowstr(j+1)-1
            colidx(k) = colidx(k) - firstcol + 1
         enddo
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do i = 1, na+1
         x(i) = 1.0D0
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do j=1, lastcol-firstcol+1
         q(j) = 0.0d0
         z(j) = 0.0d0
         r(j) = 0.0d0
         p(j) = 0.0d0
      enddo
!$omp end do nowait
!$omp end parallel

(Callouts: the loop based scheduler picks a scheduler for each of the three loops separately; the region based scheduler picks one scheduler that applies to all three loops.)
Runtime Loop Scheduler Selection Phase 1: try upper level schedulers, run with 4 threads… (Diagram: Static Scheduler on machine M1, one thread T0-T3 on each processor P0-P3.)
Runtime Loop Scheduler Selection Phase 1: try upper level schedulers, run with 4 threads… (Diagram: Dynamic Scheduler on machine M1, one thread T0-T3 on each processor P0-P3.)
Runtime Loop Scheduler Selection Phase 1: try upper level schedulers, run with 4 threads… (Diagram: Affinity Scheduler on machine M1, one thread T0-T3 on each processor P0-P3.)
Runtime Loop Scheduler Selection Phase 2: a decision has been made on the upper level scheduler (Affinity); now try lower level schedulers, run with 8 threads… (Diagram: Affinity Scheduler across processors with Static within each processor, threads T0-T7, two per processor P0/P1 on M1.)
Sample from NAS2004

!$omp parallel default(shared) private(i,j,k)
!$omp do schedule(runtime)
      do j=1,lastrow-firstrow+1
         do k=rowstr(j),rowstr(j+1)-1
            colidx(k) = colidx(k) - firstcol + 1
         enddo
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do i = 1, na+1
         x(i) = 1.0D0
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do j=1, lastcol-firstcol+1
         q(j) = 0.0d0
         z(j) = 0.0d0
         r(j) = 0.0d0
         p(j) = 0.0d0
      enddo
!$omp end do nowait
!$omp end parallel

(Callouts: the schedulers selected at runtime: Static-Static with 8 threads for the first loop, TSS with 4 threads for the second loop, TSS with 4 threads for the third loop.)
Hardware Counter Scheduler • Motivation • The RBS and LBS have runtime overhead; they will work even better if we reduce that overhead as much as possible • Algorithm • Try different schedulers on the parallel loops of a subset of the benchmarks using training data • Use each loop's characteristics (cache misses, number of floating point operations, number of micro-ops, load imbalance) together with the best scheduler for that loop as input • Feed the above data to classification software (we use C4.5) to build a decision tree • Apply the decision tree to a loop at runtime: feed the runtime-collected hardware counter data as input, and get the result, a scheduler, as output
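Walking such a decision tree at runtime is cheap. A minimal sketch, assuming an invented node layout and illustrative feature indices and thresholds (the trained C4.5 tree in the paper is larger):

```c
#include <stddef.h>

/* One node of a C4.5-style decision tree over hardware-counter features.
   Feature indexes the counter vector (e.g. 0 = uops, 1 = cache misses). */
typedef struct node {
    int feature;                       /* -1 marks a leaf */
    double threshold;                  /* go left if value <= threshold */
    int scheduler;                     /* scheduler id, valid at a leaf */
    const struct node *left, *right;
} node_t;

/* Classify one loop: descend from the root using the counters collected
   during the loop's first run, and return the scheduler id at the leaf. */
int classify(const node_t *n, const double *counters) {
    while (n->feature >= 0)
        n = (counters[n->feature] <= n->threshold) ? n->left : n->right;
    return n->scheduler;
}
```

Because the tree is built offline, the per-loop runtime cost is just this short pointer walk, which is the overhead reduction the slide motivates.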
IBM Power 5 • Technology: 130nm • Dual processor core • 8-way superscalar • Simultaneous Multi-Threaded (SMT) core • Up to 2 virtual processors • 24% area growth per core for SMT • Natural extension to Power 4 design
Single Thread • A single thread has an advantage when executing execution-unit-limited applications • Floating point or fixed point intensive workloads • The extra resources necessary for SMT provide a higher performance benefit when dedicated to a single thread • Data locality on one SMT core is better with a single thread for some applications
Power 5 Multi-Chip Module (MCM) • Or Multi-Chipped Monster • 4 processor chips • 2 processors per chip • 4 L3 cache chips
Power5 64-way Plane Topology • Each MCM has 4 inter-connected processor chips • Each processor chip has two processors on chip • Each processor has SMT technology, so two threads can be executed on it simultaneously
Multi-Level Scheduler (Diagram: the 1st level scheduler divides the loop iterations among modules 1 … n; on each module, a 2nd level scheduler divides that module's iterations among its processors 1 … m; on each processor, a 3rd level scheduler divides the iterations among its SMT threads 1 … k.)
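The three levels of splitting can be sketched as three small functions. This is an illustrative sketch: the function names are invented, and the guided formula shown (half the remaining work divided evenly) is a common textbook variant, not necessarily the exact one used by the XL runtime.

```c
/* Level 1 (guided): the chunk handed to a module shrinks as fewer
   iterations remain, keeping modules balanced near the end of the loop. */
int guided_chunk(int remaining, int nmodules) {
    int c = remaining / (2 * nmodules);   /* illustrative guided formula */
    return c > 0 ? c : (remaining > 0 ? remaining : 0);
}

/* Level 2 (static): an even split of a module's chunk among its processors. */
int processor_share(int chunk, int nprocs, int p) {
    return chunk / nprocs + (p < chunk % nprocs ? 1 : 0);
}

/* Level 3 (static cyclic): thread t of the k SMT threads on a processor
   takes iterations t, t+k, t+2k, ... of that processor's share. */
int owns_iteration(int i, int t, int k) {
    return i % k == t;
}
```

Guided at the top balances modules cheaply, the static middle level avoids locking between processors, and the cyclic bottom level interleaves adjacent iterations across the SMT siblings that share a cache.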
OpenMP Implementation • Outlining Technique • New subroutines are created with the body of each parallel construct • Runtime routines receive as a parameter the address of the outlined procedure
Source Code:
#pragma omp parallel for shared(a,b) private(i)
for ( i = 0; i < 100; i ++ ) {
    a = a + b;
}

Outlined Functions:
int main() {
    _xlsmpParallelDoSetup_TPO(…);
}

void main@OL@1( … ) {
    do {
        loop body;
    } while (loop end condition is not yet met);
    return;
}

Runtime Library:
1. Initialize work items and work shares
2. Call _xlsmp_DynamicChunkCall(…)

while (still iterations left, go get some iterations for this thread) {
    ……
    call main@OL@1(…);
    ……
}
Source Code:
#pragma omp parallel for shared(a,b) private(i)
for ( i = 0; i < 100; i ++ ) {
    a = a + b;
}

Outlined Functions:
int main() {
    _xlsmpParallelDoSetup_TPO(…);
}

void main@OL@1( … ) {
    do {
        loop body;
    } while (loop end condition is not yet met);
    return;
}

Runtime Library:
1. Initialize work items and work shares
2. Call _xlsmp_DynamicChunkCall(…)

while (hier_sched(…)) {
    ……
    call main@OL@1(…);
    ……
}
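The outlining transformation above can be imitated by hand. A minimal sketch, assuming invented names (body stands in for the compiler-generated main@OL@1, run_parallel_loop for the runtime's dispatch loop); it runs serially here, whereas the real runtime executes the dispatch loop in every worker thread with a shared chunk counter.

```c
/* The outlined loop body takes an iteration range plus a pointer to the
   shared variables it needs, mirroring what the compiler emits. */
typedef void (*outlined_fn)(int lo, int hi, void *env);

struct env { const int *b; int *a; };   /* the loop's shared variables */

/* Stand-in for main@OL@1: the loop body over one chunk of iterations. */
void body(int lo, int hi, void *envp) {
    struct env *e = envp;
    for (int i = lo; i < hi; i++) e->a[i] += e->b[i];
}

/* Stand-in for the runtime's dispatch loop: while iterations remain,
   hand the outlined function another chunk. */
void run_parallel_loop(outlined_fn fn, int n, int chunk, void *env) {
    for (int lo = 0; lo < n; lo += chunk) {
        int hi = lo + chunk < n ? lo + chunk : n;
        fn(lo, hi, env);   /* "call main@OL@1(...)" for this chunk */
    }
}
```

Passing the outlined function's address to the runtime is what lets one generic scheduler drive any parallel loop in the program.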
(Diagram: the scheduler tree on modules M0/M1: Guided at the root, Static and Cyclic at the lower levels over processors P0/P1 and threads T0 … T7.) When a thread needs work: • Look up its parent's iteration list to see if any iterations are available; if yes, get some iterations from the 2nd level scheduler and return • Otherwise, look one level up, grab the lock for its group, and request more iterations from the upper level using the upper level's loop scheduler (a recursive function call) until it gets some iterations or the whole loop ends
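The recursive refill described above can be sketched as follows. This is a serial sketch with invented names (group_t, get_iterations): each group's range would be protected by that group's lock in the real runtime, and each level would apply its own scheduler rather than the fixed-size chunks used here.

```c
/* One scheduling group in the hierarchy: thread -> processor -> module
   -> root. The root's [next, end) holds the whole loop. */
typedef struct group {
    struct group *parent;   /* NULL at the root */
    int next, end;          /* this group's remaining local iterations */
    int refill;             /* chunk size handed out at this level */
} group_t;

/* Try the local iteration list first; if it is empty, recursively pull a
   fresh block from the parent level, then hand out a local chunk. */
int get_iterations(group_t *g, int *lo, int *hi) {
    if (g->next >= g->end) {             /* local list exhausted */
        if (!g->parent) return 0;        /* at the root: whole loop done */
        int plo, phi;
        if (!get_iterations(g->parent, &plo, &phi)) return 0;
        g->next = plo;                   /* refill local list */
        g->end = phi;
    }
    *lo = g->next;
    int n = g->end - g->next;
    *hi = g->next + (g->refill < n ? g->refill : n);
    g->next = *hi;
    return 1;
}
```

A thread only climbs the tree (and takes the coarser-grained locks) when its local list runs dry, so most requests are satisfied cheaply at the bottom level.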
Hierarchical Scheduler • Guided as the 1st level scheduler • Balances workload among processors • Reduces runtime overhead • Static Cyclic as the 2nd level scheduler • Improves cache locality • Reduces runtime overhead (Diagram: dividing the iteration space with standard static scheduling gives each thread one contiguous block; with static cyclic scheduling, consecutive iterations alternate T0, T1, T0, T1, … between the two threads.)
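The difference between the two divisions in the diagram comes down to which thread owns which iteration. A small sketch with invented helper names:

```c
/* Standard static: thread t of k gets one contiguous block of the n
   iterations, so iteration i belongs to block i / ceil(n/k). */
int block_owner(int i, int n, int k) {
    int chunk = (n + k - 1) / k;
    return i / chunk;
}

/* Static cyclic: iterations are dealt out round-robin, so adjacent
   iterations land on the SMT siblings that share a cache. */
int cyclic_owner(int i, int k) {
    return i % k;
}
```

With cyclic ownership, neighboring iterations (and hence neighboring array elements) are touched by the two threads on the same core, so the data one sibling pulls into the shared cache is immediately useful to the other.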
Evaluation • IBM Power 5 System • 4 Power 5 1904 MHz SMT processors • 31872 MB of memory • Operating System • AIX 5.3 • Compiler • IBM XL C/C++, XL Fortran compiler • Benchmark • SPEC OMP2001
Scalability of IBM Power 5 SMT Processors 1 through 8 threads
Evaluation on Power 5Execution Time Normalized to Default (Static) Scheduler
Conclusion • Standard schedulers are not aware of SMT technology • Adaptive hierarchical schedulers take SMT-specific characteristics into account, which could make the OpenMP API (software) and SMT technology (hardware) work better together • OpenMP parallel applications running on the Power 5 architecture with SMT have the same problem • The multi-level hierarchical scheduler designed for IBM Power 5 achieves an average improvement over the default loop scheduler of 3% on SPEC OMP2001 • Large improvements of 7% and 11% on some benchmarks • Improves on average over all other standard OpenMP loop schedulers by at least 2%
Future Work • Evaluate multi-level hierarchical scheduler on a larger system with 32 SMT processors (with MCM) • Explore performance on auto-parallelized benchmarks (SPEC CPU FP) • Examine mechanisms for determining best scheduler configuration at compile-time • Explore the use of helper threads on Power 5 • Cache prefetching
(A cache miss comparison chart will be shown here) • If we find a way to calculate the overall L2 load/store misses generally, show that comparison. • If not, show the overhead of this optimization from the tprof data.
Decision Tree • Only one decision tree is built offline, before executing the program • Apply that decision tree to loops at runtime without changing the tree • Make a decision on which scheduler to use after only one run of each loop, which greatly reduces runtime scheduling overhead

uops <= 3.62885e+08 :
|   cachemiss <= 111979 :
|   |   uops > 748339 : static-4
|   |   uops <= 748339 :
|   |   |   l/s <= 167693 : static-4
|   |   |   l/s > 167693 : static-static
|   cachemiss > 111979 :
|   |   floatpoint <= 1.52397e+07 :
|   |   |   cachemiss <= 384690 :
|   |   |   |   uops <= 2.06431e+07 : static-static
|   |   |   |   uops > 2.06431e+07 :
|   |   |   |   |   imbalance <= 1330 : afs-static
|   |   |   |   |   imbalance > 1330 :
|   |   |   |   |   |   cachemiss <= 301582 : afs-4
|   |   |   |   |   |   cachemiss > 301582 : guided-static
…………
uops > 3.62885e+08 :
|   l/s > 7.22489e+08 : static-4
|   l/s <= 7.22489e+08 :
|   |   imbalance <= 32236 : static-4
|   |   imbalance > 32236 :
|   |   |   floatpoint <= 5.34465e+07 : static-4
|   |   |   floatpoint > 5.34465e+07 :
|   |   |   |   floatpoint <= 1.20539e+08 : tss-4
|   |   |   |   floatpoint > 1.20539e+08 :
|   |   |   |   |   floatpoint <= 1.45588e+08 : static-4
|   |   |   |   |   floatpoint > 1.45588e+08 : tss-4
(Load imbalance comparison chart will be shown here) • Generating…