
A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture


Presentation Transcript


  1. A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture Yun Zhang, Michael J. Voss University of Toronto Guansong Zhang, Raul Silvera IBM Toronto Lab 6-Nov-14

  2. Agenda • Background • Motivation • Previous Work • Adaptive Schedulers • IBM Power 5 Architecture • A Multi-Level Hierarchical Scheduler • Evaluation • Future Work

  3. Simultaneous Multi-Threading • Architecture • Several threads per physical processor • Threads share • Caches • Registers • Functional Units

  4. Power 5 SMT

  5. OpenMP • OpenMP • A standard API for shared memory programming • Add directives for parallel regions • Standard Loop Schedulers • Static • Dynamic • Guided • Runtime

  6. An example of a parallel loop in C code (similar in Fortran):

  #pragma omp parallel for shared(a, b) private(i, j) schedule(runtime)
  for (i = 0; i < 100; i++) {
      for (j = 0; j < 100; j++) {
          a[i][j] = a[i][j] + b[i][j];
      }
  }

  (Diagram: the i/j iteration space is divided among threads T0..Tn through the OpenMP API.)

  7. Motivation • OpenMP applications • Designed for SMP systems • Not aware of SMT (Hyper-Threading) technology • Understanding and controlling the performance of OpenMP applications on SMT processors is not trivial • Important performance issues on SMP systems with SMT nodes: • Inter-thread data locality • Instruction mix • SMT-related load balance

  8. Scaling (Spec & NAS) 4 Intel Xeon Processors with Hyperthreading 1 Thread per Processor 1-2 Threads per Processor

  9. Why do they scale poorly? • Inter-thread data locality: poor locality causes extra cache misses • Instruction mix: co-scheduled threads share functional units, and a good mix can yield a benefit that outweighs the extra cache misses • SMT-related load balance: work must be balanced well both among processors and among threads running on the same physical processor

  10. Previous Work: Runtime Adaptive Scheduler • Hierarchical scheduling • Upper-level scheduler • Lower-level scheduler • Select the scheduler and the number of threads to run at runtime • One thread per physical processor • Two threads per physical processor

  11. Two-Level Hierarchical Scheduler

  12. Traditional Scheduling (Diagrams: the i/j iteration space under static scheduling, where each thread T0..Tn gets one contiguous block, and under dynamic scheduling, where chunks are handed to threads Ti..Tk on demand.)

  13. Hierarchical Scheduling (Diagram: the i/j iteration space is first divided among physical processors P0..Pi using dynamic scheduling; each processor's share is then divided between its two threads, T00/T01 .. Ti0/Ti1, using static scheduling.)

  14. Why can we benefit from runtime scheduler selection? • Many parallel loops in OpenMP applications are executed again and again. • Example:

  for (k = 1; k < 100; k++) {
      ...
      calculate();
      ...
  }

  void calculate() {
  #pragma omp parallel for schedule(runtime)
      for (i = 1; i < 100; i++) {
          ...;  // calculation
      }
  }

  15. Adaptive Schedulers • Region Based Scheduler (RBS) • Selects loop schedulers at runtime • All parallel loops in one parallel region have to use the same scheduler, which may not be the best choice for every loop • Loop Based Scheduler (LBS) • Higher runtime overhead • A more accurate loop scheduler for each parallel loop

  16. Sample from NAS2004: the region-based scheduler picks one scheduler that applies to all three loops, while the loop-based scheduler picks a scheduler for each loop individually.

  !$omp parallel default(shared) private(i,j,k)
  !$omp do schedule(runtime)
        do j=1,lastrow-firstrow+1
           do k=rowstr(j),rowstr(j+1)-1
              colidx(k) = colidx(k) - firstcol + 1
           enddo
        enddo
  !$omp end do nowait
  !$omp do schedule(runtime)
        do i = 1, na+1
           x(i) = 1.0D0
        enddo
  !$omp end do nowait
  !$omp do schedule(runtime)
        do j=1, lastcol-firstcol+1
           q(j) = 0.0d0
           z(j) = 0.0d0
           r(j) = 0.0d0
           p(j) = 0.0d0
        enddo
  !$omp end do nowait
  !$omp end parallel

  17. Runtime Loop Scheduler Selection. Phase 1: try an upper-level scheduler, running with 4 threads. (Diagram: M1 with processors P0-P3, one thread T0-T3 each, using the static scheduler.)

  18. Runtime Loop Scheduler Selection. Phase 1: try an upper-level scheduler, running with 4 threads. (Diagram: M1 with processors P0-P3, one thread T0-T3 each, using the dynamic scheduler.)

  19. Runtime Loop Scheduler Selection. Phase 1: try an upper-level scheduler, running with 4 threads. (Diagram: M1 with processors P0-P3, one thread T0-T3 each, using the affinity scheduler.)

  20. Runtime Loop Scheduler Selection. Phase 1 is complete: a decision has been made on the upper-level scheduler (affinity); now try a lower-level scheduler, running with 8 threads. (Diagram: M1 with processors P0 and P1, two threads each (T0-T7), the affinity scheduler at the upper level and static at the lower level.) A sketch of this trial-based selection follows below.
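The trial process can be pictured as a small per-loop state machine: on each invocation of the loop the runtime times one candidate configuration, and once all candidates at a level have been tried it fixes the winner and moves on to the next level. The sketch below is illustrative only; the names (loop_descriptor_t, run_with_scheduler) and the fixed candidate set are assumptions, not the XL runtime's actual interface, and the real selection also compares different thread counts.

  #include <float.h>
  #include <omp.h>
  #include <stdio.h>

  enum { SCHED_STATIC, SCHED_DYNAMIC, SCHED_AFFINITY, NUM_SCHED };

  typedef struct {
      int phase;            /* 0: trying upper level, 1: trying lower level, 2: decided */
      int next_trial;       /* next candidate scheduler to time */
      int best_upper, best_lower;
      double best_time;
  } loop_descriptor_t;      /* hypothetical per-loop bookkeeping */

  /* Stand-in for "run one invocation of the loop with this configuration";
   * here it just times a dummy OpenMP loop. */
  static double run_with_scheduler(int upper, int lower, int nthreads)
  {
      (void)upper; (void)lower;
      double t0 = omp_get_wtime(), sum = 0.0;
      #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
      for (int i = 0; i < 1000000; i++)
          sum += i * 1e-9;
      (void)sum;
      return omp_get_wtime() - t0;
  }

  /* Called once per execution of the parallel loop. */
  static void adaptive_invoke(loop_descriptor_t *d)
  {
      if (d->phase == 0) {                     /* Phase 1: choose upper-level scheduler, 4 threads */
          double t = run_with_scheduler(d->next_trial, SCHED_STATIC, 4);
          if (t < d->best_time) { d->best_time = t; d->best_upper = d->next_trial; }
          if (++d->next_trial == NUM_SCHED) {  /* all candidates timed: freeze the winner */
              d->phase = 1; d->next_trial = 0; d->best_time = DBL_MAX;
          }
      } else if (d->phase == 1) {              /* Phase 2: choose lower-level scheduler, 8 threads */
          double t = run_with_scheduler(d->best_upper, d->next_trial, 8);
          if (t < d->best_time) { d->best_time = t; d->best_lower = d->next_trial; }
          if (++d->next_trial == NUM_SCHED) d->phase = 2;
      } else {                                 /* Decided: keep reusing the best configuration */
          run_with_scheduler(d->best_upper, d->best_lower, 8);
      }
  }

  int main(void)
  {
      loop_descriptor_t d = { 0, 0, 0, 0, DBL_MAX };
      for (int k = 0; k < 20; k++)             /* the same loop is executed again and again */
          adaptive_invoke(&d);
      printf("chosen: upper=%d lower=%d\n", d.best_upper, d.best_lower);
      return 0;
  }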

  21. Sample from NAS2004, with the schedulers selected at runtime for its three loops: Static-Static with 8 threads, TSS with 4 threads, and TSS with 4 threads.

  !$omp parallel default(shared) private(i,j,k)
  !$omp do schedule(runtime)
        do j=1,lastrow-firstrow+1
           do k=rowstr(j),rowstr(j+1)-1
              colidx(k) = colidx(k) - firstcol + 1
           enddo
        enddo
  !$omp end do nowait
  !$omp do schedule(runtime)
        do i = 1, na+1
           x(i) = 1.0D0
        enddo
  !$omp end do nowait
  !$omp do schedule(runtime)
        do j=1, lastcol-firstcol+1
           q(j) = 0.0d0
           z(j) = 0.0d0
           r(j) = 0.0d0
           p(j) = 0.0d0
        enddo
  !$omp end do nowait
  !$omp end parallel

  22. Hardware Counter Scheduler • Motivation • The region-based (RBS) and loop-based (LBS) schedulers have runtime overhead; they would work even better if that overhead could be reduced as much as possible • Algorithm (see the sketch below) • Try different schedulers on parallel loops from a subset of the benchmarks using training data • Use the loop characteristics (cache misses, number of floating point operations, number of micro-ops, load imbalance) together with the best scheduler for that loop as input • Feed this data to classification software (we use C4.5) to build a decision tree • Apply the decision tree to each loop at runtime: feed the hardware counter data collected at runtime as input and take the resulting scheduler as output
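As a sketch of how such a tree might be consulted at runtime, the fragment below hard-codes the first few branches of the tree shown on slide 44 as nested comparisons on the collected counter values. The struct and function names (loop_counters_t, pick_scheduler) are hypothetical; only the counter names and thresholds come from that slide, and most branches are elided.

  typedef struct {
      double uops;        /* retired micro-ops */
      double cachemiss;   /* cache misses */
      double loadstore;   /* load/store operations ("l/s" in the tree) */
      double floatpoint;  /* floating point operations */
      double imbalance;   /* measured load imbalance */
  } loop_counters_t;

  typedef enum { STATIC_4, STATIC_STATIC, AFS_STATIC, AFS_4, GUIDED_STATIC, TSS_4 } sched_choice_t;

  /* First branches of the offline-built tree, encoded as nested comparisons;
   * the full tree continues in the same way. */
  sched_choice_t pick_scheduler(const loop_counters_t *c)
  {
      if (c->uops <= 3.62885e+08) {
          if (c->cachemiss <= 111979) {
              if (c->uops > 748339) return STATIC_4;
              return (c->loadstore <= 167693) ? STATIC_4 : STATIC_STATIC;
          }
          /* ... remaining branches of this subtree elided in this sketch ... */
          return STATIC_STATIC;
      }
      if (c->loadstore > 7.22489e+08) return STATIC_4;
      if (c->imbalance <= 32236) return STATIC_4;
      /* ... remaining branches elided in this sketch ... */
      return TSS_4;
  }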

  23. 4 Intel Xeon Processors with Hyperthreading

  24. 4 Intel Xeon Processors with Hyperthreading

  25. IBM Power 5 • Technology: 130nm • Dual-core processor chip • 8-way superscalar • Simultaneous Multi-Threaded (SMT) core • Up to 2 virtual processors per core • 24% area growth per core for SMT • Natural extension of the Power 4 design

  26. Single Thread • Single-thread mode has an advantage for execution-unit-limited applications • Floating point or fixed point intensive workloads • The extra resources needed for SMT provide a higher performance benefit when dedicated to a single thread • Data locality within one SMT core is better with a single thread for some applications

  27. Power 5 Multi-Chip Module (MCM) • Or Multi-Chipped Monster • 4 processor chips • 2 processors per chip • 4 L3 cache chips

  28. Power5 64-way Plane Topology • Each MCM has 4 interconnected processor chips • Each processor chip has two processors • Each processor supports SMT, so two threads can execute on it simultaneously

  29. Multi-Level Scheduler (a data-structure sketch follows below) • 1st-level scheduler: divides the loop iterations among modules 1..n • 2nd-level scheduler: divides each module's iterations among its processors 1..m • 3rd-level scheduler: divides each processor's iterations among its threads 1..k
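One way to picture the implementation is a tree of per-level work descriptors: one node for the whole loop, one per module, and one per processor, each holding its current share of the iteration space, the scheduler it uses to hand iterations to its children, and a lock. This is only a sketch of plausible data structures; work_node_t and its fields are hypothetical, not the XL runtime's actual types.

  #include <pthread.h>

  typedef struct work_node {
      struct work_node *parent;   /* NULL at the root (the whole loop) */
      int sched_kind;             /* scheduler used to hand iterations to children */
      long lower, upper;          /* iteration range currently held at this level */
      pthread_mutex_t lock;       /* protects lower/upper while children fetch work */
  } work_node_t;

  /* One root node for the loop, one node per module, one per processor;
   * threads fetch chunks directly from their processor's node. */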

  30. OpenMP Implementation • Outlining technique • A new subroutine is created from the body of each parallel construct • The runtime routines receive the address of the outlined procedure as a parameter

  31. Source code:

  #pragma omp parallel for shared(a,b) private(i)
  for (i = 0; i < 100; i++) {
      a = a + b;
  }

  long main {
      _xlsmpParallelDoSetup_TPO(...)
  }

  Runtime library: 1. Initialize work items and work shares; 2. Call _xlsmp_DynamicChunkCall(...)

  while (still iterations left, go get some iterations for this thread) {
      ...
      call main@OL@1(...);
      ...
  }

  Outlined function:

  void main@OL@1(...) {
      do {
          loop body;
      } while (loop end condition is met);
      return;
  }

  32. Source code (same example, with the hierarchical scheduler):

  #pragma omp parallel for shared(a,b) private(i)
  for (i = 0; i < 100; i++) {
      a = a + b;
  }

  long main {
      _xlsmpParallelDoSetup_TPO(...)
  }

  Runtime library: 1. Initialize work items and work shares; 2. Call _xlsmp_DynamicChunkCall(...)

  while (hier_sched(...)) {
      ...
      call main@OL@1(...);
      ...
  }

  Outlined function:

  void main@OL@1(...) {
      do {
          loop body;
      } while (loop end condition is met);
      return;
  }

  33. (Diagram: the root uses the guided scheduler to distribute iterations to modules M0 and M1; each module uses the static cyclic scheduler to distribute iterations to its processors P0 and P1; threads T0-T7 fetch work from their processors.) When a thread needs work (a sketch follows below): • Look up its parent's iteration list to see whether any iterations are available; if yes, get some iterations from the 2nd-level scheduler and return • Otherwise, look one level up, grab the lock for its group, and request more iterations from the upper level using the upper-level loop scheduler (a recursive function call) until it gets some iterations or the whole loop ends
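Building on the hypothetical work_node_t sketch above, the recursive fetch described on this slide might look roughly as follows. level_get_chunk stands in for "hand out a chunk according to this level's scheduler" and is assumed to be thread-safe; hier_fetch plays the role of hier_sched() shown on slide 32. This is a sketch under those assumptions, not the runtime's actual code.

  #include <pthread.h>
  #include <stdbool.h>

  /* Assumed thread-safe: hands out a chunk [lo, hi] from this node according
   * to the node's scheduler; returns false if the node has no iterations left. */
  bool level_get_chunk(work_node_t *node, long *lo, long *hi);

  /* Called by a worker thread on its processor's node whenever it needs work. */
  bool hier_fetch(work_node_t *node, long *lo, long *hi)
  {
      /* 1. See whether this level's iteration list still has work. */
      if (level_get_chunk(node, lo, hi))
          return true;

      if (node->parent == NULL)
          return false;                  /* the whole loop has finished */

      /* 2. Grab the lock for this group and recursively pull a fresh batch of
       *    iterations down from the level above, using that level's scheduler. */
      pthread_mutex_lock(&node->lock);
      bool refilled = hier_fetch(node->parent, &node->lower, &node->upper);
      pthread_mutex_unlock(&node->lock);

      /* 3. If the upper level supplied iterations, hand out a chunk from them. */
      return refilled && level_get_chunk(node, lo, hi);
  }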

  34. Hierarchical Scheduler • Guided as the 1st-level scheduler • Balances work loads among processors • Reduces runtime overhead • Static cyclic as the 2nd-level scheduler (see the sketch below) • Improves cache locality • Reduces runtime overhead (Diagram: dividing the iteration space with standard static scheduling gives each thread one contiguous block, T0 then T1; static cyclic scheduling interleaves the chunks, T0, T1, T0, T1, ...)
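A minimal sketch of the static cyclic idea within one processor: with two SMT threads, thread t takes chunks t, t+2, t+4, ... of the processor's iteration range, the same interleaving that OpenMP's schedule(static, chunksize) produces. The function and parameter names below are illustrative, not the runtime's actual interface.

  /* Thread t of nthreads executes chunks t, t + nthreads, t + 2*nthreads, ...
   * of the iteration range [lower, upper]; body() runs one chunk of the loop. */
  void static_cyclic(long lower, long upper, int t, int nthreads,
                     long chunk, void (*body)(long, long))
  {
      for (long start = lower + (long)t * chunk; start <= upper;
           start += (long)nthreads * chunk) {
          long end = start + chunk - 1;
          if (end > upper)
              end = upper;
          body(start, end);   /* run iterations [start, end] */
      }
  }

Interleaving small chunks keeps the two sibling threads working on nearby parts of the data at the same time, which is the cache-locality benefit the slide refers to.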

  35. Evaluation • IBM Power 5 system • 4 Power 5 1904 MHz SMT processors • 31872 MB of memory • Operating system: AIX 5.3 • Compiler: IBM XL C/C++ and XL Fortran • Benchmark: SPEC OMP2001

  36. Scalability of IBM Power 5 SMT Processors 1 through 8 threads

  37. Evaluation on Power 5: Execution Time Normalized to the Default (Static) Scheduler

  38. Conclusion • Standard schedulers are not aware of SMT technology • Adaptive hierarchical schedulers take SMT-specific characteristics into account, which can make the OpenMP API (software) and SMT technology (hardware) work better together • OpenMP parallel applications running on the Power 5 architecture with SMT have the same problem • The multi-level hierarchical scheduler designed for the IBM Power 5 achieves an average improvement of 3% over the default loop scheduler on SPEC OMP2001 • Large improvements of 7% and 11% on some benchmarks • On average it improves over all other standard OpenMP loop schedulers by at least 2%

  39. Future Work • Evaluate multi-level hierarchical scheduler on a larger system with 32 SMT processors (with MCM) • Explore performance on auto-parallelized benchmarks (SPEC CPU FP) • Examine mechanisms for determining best scheduler configuration at compile-time • Explore the use of helper threads on Power 5 • Cache prefetching

  40. Thank You~

  41. (A cache miss comparison chart will be shown here) • If we find a way to calculate the overall L2 load/store misses, they will be reported directly • If not, we will show the overhead of this optimization from the tprof data

  42. Schedulers’ Speedup on 4 threads

  43. Scheduler’s Speedup on 8 Threads

  44. Decision Tree • Only one decision tree is built offline, before executing the program • Apply that decision tree to loops at runtime without changing the tree • Make a decision on which scheduler to use with only one run of each loop, which greatly reduces runtime scheduling overhead

  uops <= 3.62885e+08 :
  |   cachemiss <= 111979 :
  |   |   uops > 748339 : static-4
  |   |   uops <= 748339 :
  |   |   |   l/s <= 167693 : static-4
  |   |   |   l/s > 167693 : static-static
  |   cachemiss > 111979 :
  |   |   floatpoint <= 1.52397e+07 :
  |   |   |   cachemiss <= 384690 :
  |   |   |   |   uops <= 2.06431e+07 : static-static
  |   |   |   |   uops > 2.06431e+07 :
  |   |   |   |   |   imbalance <= 1330 : afs-static
  |   |   |   |   |   imbalance > 1330 :
  |   |   |   |   |   |   cachemiss <= 301582 : afs-4
  |   |   |   |   |   |   cachemiss > 301582 : guided-static
  ...
  uops > 3.62885e+08 :
  |   l/s > 7.22489e+08 : static-4
  |   l/s <= 7.22489e+08 :
  |   |   imbalance <= 32236 : static-4
  |   |   imbalance > 32236 :
  |   |   |   floatpoint <= 5.34465e+07 : static-4
  |   |   |   floatpoint > 5.34465e+07 :
  |   |   |   |   floatpoint <= 1.20539e+08 : tss-4
  |   |   |   |   floatpoint > 1.20539e+08 :
  |   |   |   |   |   floatpoint <= 1.45588e+08 : static-4
  |   |   |   |   |   floatpoint > 1.45588e+08 : tss-4

  45. (Load imbalance comparison chart will be shown here) • Generating……..
