
On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon

Presentation Transcript


  1. On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon. D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser, San Diego Supercomputer Center

  2. Overview • Blue Horizon Hardware • Motivation for this work • Two methods of hybrid programming • Fine grain results • A word on coarse grain techniques • Coarse grain results • Time variability • Effects of thread binding • Final Conclusions

  3. Blue Horizon Hardware • 144 IBM SP High Nodes • Each node: • 8-way SMP • 4 GB memory • crossbar • Each processor: • Power3 222 MHz • 4 Flop/cycle • Aggregate peak 1.002 Tflop/s • Compilers: • IBM mpxlf_r, version 7.0.1 • KAI guidef90, version 3.9

  4. Blue Horizon Hardware • Interconnect (between nodes): • Currently: • 115 MB/s • 4 MPI tasks/node, so OpenMP must be used to utilize all 8 processors • Soon: • 500 MB/s • 8 MPI tasks/node, so OpenMP can be used to supplement MPI (if it's worth it)

  5. Hybrid Programming: why use it? • Non-performance-related reasons: • Avoid replication of data on the node • Performance-related reasons: • Avoid the latency of MPI on the node • Avoid unnecessary data copies inside the node • Reduce the latency of MPI calls between the nodes • Reduce the cost of global MPI operations (reduction, all-to-all) by using fewer MPI tasks • The price to pay: • OpenMP overheads • False sharing. Is it really worth trying?

  6. Hybrid Programming • Two methods of combining MPI and OpenMP in parallel programs:

  Fine grain:
      main program
      ! MPI initialization
      ....
      ! cpu intensive loop
      !$OMP PARALLEL DO
      do i=1,n
         !work
      end do
      ....
      end

  Coarse grain:
      main program
      ! MPI initialization
      !$OMP PARALLEL
      ....
      do i=1,n
         !work
      end do
      ....
      !$OMP END PARALLEL
      end
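For concreteness, a minimal compilable version of the fine grain skeleton might look as follows. This is an illustrative sketch only, not code from the NPB suites: the array a, the placeholder loop body, and the final MPI_Reduce are invented for the example, and an MPI installation providing the Fortran mpi module is assumed (compiled with something like mpxlf_r -qsmp=omp).

      program fine_grain
         use mpi
         implicit none
         integer, parameter :: n = 1000000
         integer :: ierr, rank, i
         real(8), allocatable :: a(:)
         real(8) :: local_sum, global_sum

         call MPI_Init(ierr)                               ! MPI initialization
         call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
         allocate(a(n))
         local_sum = 0.0d0

         ! CPU-intensive loop: threads are created and joined each time this
         ! parallel region is entered (the source of fine grain overhead).
      !$OMP PARALLEL DO REDUCTION(+:local_sum)
         do i = 1, n
            a(i) = dble(i + rank)              ! placeholder "work"
            local_sum = local_sum + a(i)
         end do
      !$OMP END PARALLEL DO

         ! Communication stays pure MPI, outside any OpenMP region.
         call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, 0, MPI_COMM_WORLD, ierr)
         if (rank == 0) print *, 'global sum =', global_sum

         call MPI_Finalize(ierr)
      end program fine_grain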

  7. Hybrid programming Fine grain approach • Easy to implement • Performance: low due to overheads of OpenMP directives (OMP PARALLEL DO) Coarse grain approach • Time-consuming implementation • Performance: less overhead for thread creation

  8. Hybrid NPB using fine grain parallelism • CG, MG, and FT suites of the NAS Parallel Benchmarks (NPB):

  Suite name                # loops parallelized
  CG - Conjugate Gradient   18
  MG - Multi-Grid           50
  FT - Fourier Transform     8

  • Results shown are the best of 5-10 runs • Complete results at http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html

  9. Fine grain results - CG (A&B class)

  10. Fine grain results - MG (A&B class)

  11. Fine grain results - MG (C class)

  12. Fine grain results - FT (A&B class)

  13. Fine grain results - FT (C class)

  14. Hybrid NPB using coarse grain parallelism: MG suite. Overview of the method [diagram: MPI Tasks 1-4, each spawning OpenMP Threads 1 and 2]

  15. Coarse grain programming methodology • Start with the MPI code • Each MPI task spawns its threads once, at the beginning • Serial work (initialization etc.) and MPI calls are done inside a MASTER or SINGLE region • Main arrays are global • Work distribution: each thread gets a chunk of the array based on its thread number (omp_get_thread_num()); in this work, one-dimensional blocking is used • Avoid using OMP DO • Be careful with scoping and synchronization (see the sketch below)
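A minimal sketch of this coarse grain pattern, following the bullets above, is given below. It is illustrative only: the array u, the chunk arithmetic, and the placeholder work loop are invented for the example; MPI_Init_thread with MPI_THREAD_FUNNELED is used to make the threading requirement explicit (the original codes may simply call MPI_Init); and the real MG communication is reduced to an MPI_Barrier placeholder.

      program coarse_grain
         use mpi
         use omp_lib
         implicit none
         integer, parameter :: n = 1024
         real(8) :: u(n)                    ! main array is global (shared)
         integer :: ierr, rank, provided
         integer :: tid, nthr, chunk, ilo, ihi, i, iter

         ! Only the master thread will make MPI calls, so FUNNELED suffices.
         call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
         call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
         u = 0.0d0

      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tid, nthr, chunk, ilo, ihi, i, iter)
         ! Threads are spawned once here and live for the whole computation.
         tid  = omp_get_thread_num()
         nthr = omp_get_num_threads()

         ! One-dimensional blocking: each thread owns a fixed chunk of u,
         ! computed from its thread number (no OMP DO is used).
         chunk = (n + nthr - 1) / nthr
         ilo = tid*chunk + 1
         ihi = min((tid + 1)*chunk, n)

         do iter = 1, 10
            do i = ilo, ihi
               u(i) = u(i) + dble(rank + iter)    ! placeholder "work"
            end do

      !$OMP BARRIER
      !$OMP MASTER
            ! Serial work and MPI calls are confined to the master thread;
            ! a real code would do its halo exchange / reductions here.
            call MPI_Barrier(MPI_COMM_WORLD, ierr)
      !$OMP END MASTER
      !$OMP BARRIER
         end do
      !$OMP END PARALLEL

         if (rank == 0) print *, 'u(1) =', u(1)
         call MPI_Finalize(ierr)
      end program coarse_grain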

  16. Coarse grain results - MG (A class)

  17. Coarse grain results - MG (C class)

  18. Coarse grain results - MG (C class) • Full node results

  # of SMP Nodes   MPI Tasks x OpenMP Threads   Max MOPS/CPU   Min MOPS/CPU
   8               4x2                          75.7           19.1
   8               2x4                          92.6           14.9
   8               1x8                          84.2           13.6
  64               4x2                          49.5           18.6
  64               4x2                          15.6            3.7
  64               2x4                          21.2            5.3
  64               1x8                          56.8           42.3
  64               1x8                          15.4            5.6
  64               2x4                           8.2            2.2

  19. Variability • Run times vary by a factor of 2 -- 5 (on 64 nodes) • Seen mostly when the full node is used • Seen both in fine grain and coarse grain runs • Seen both with the IBM and the KAI compiler • Seen in runs on the same set of nodes as well as between different sets • On a large number of nodes, the average performance suffers significantly • Confirmed in a micro-study of OpenMP on 1 node

  20. OpenMP on 1 node microbenchmark results: http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html

  21. Thread binding Question: is the variability related to thread migration? • A study on 1 node (structure sketched below): • Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds • Monitor the processor id and run time of each thread • Repeat 100 times • Run both with and without thread binding
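The structure of such a study might look like the sketch below. This is an assumed outline, not the actual benchmark: the per-thread matrix inversion is replaced by a dummy kernel, and the processor-id query and thread-binding controls (which are AIX/compiler specific) are omitted.

      program thread_timing
         use omp_lib
         implicit none
         integer :: rep, tid, i
         real(8) :: t0, t1, x

         do rep = 1, 100                          ! repeat the measurement 100 times
      !$OMP PARALLEL PRIVATE(tid, t0, t1, i, x)
            tid = omp_get_thread_num()
            t0  = omp_get_wtime()
            x = 0.0d0
            do i = 1, 20000000                    ! independent per-thread work
               x = x + sin(dble(i))               ! (stands in for the matrix inversion)
            end do
            t1 = omp_get_wtime()
      !$OMP CRITICAL
            print *, 'rep', rep, 'thread', tid, 'time', t1 - t0, x
      !$OMP END CRITICAL
      !$OMP END PARALLEL
         end do
      end program thread_timing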

  22. Thread binding Results for OMP_NUM_THREADS=8 • Without binding, threads migrated in about 15% of the runs • With thread binding turned on there was no migration • In 3% of iterations some thread had a run time > 2.0 sec, a 25% slowdown over the nominal 1.6 sec • This slowdown occurs with and without binding • Effect of a single slow thread: the probability that the complete calculation is slowed is P = 1 - (1 - c)^M; with c = 3% and M = 144 nodes of Blue Horizon, P = 0.9876, so the overall result is almost certainly slowed by 25%
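The quoted probability follows directly from the slide's numbers; the small check below is ours, added for illustration, and is not part of the original study.

      program slowdown_prob
         implicit none
         real(8) :: c, p
         integer :: m
         c = 0.03d0          ! fraction of iterations with a slow thread (3%)
         m = 144             ! number of Blue Horizon nodes, as on the slide
         p = 1.0d0 - (1.0d0 - c)**m
         print *, 'P =', p   ! prints approximately 0.9876
      end program slowdown_prob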

  23. Thread binding • Calculation was rerun • OMP_NUM_THREADS = 7 • 12.5% reduction in computational power • No threads showed a slowdown, all ran in about 1.6 seconds • Summary • OMP_NUM_THREADS = 7 • yields 12.5% reduction in computational power • OMP_NUM_THREADS = 8 • 0.9876 probability overall results slowed by 25% independent of thread binding

  24. Overall Conclusions Based on our study of NPB on Blue Horizon: • The fine grain hybrid approach is generally worse than pure MPI • The coarse grain approach for MG is comparable with pure MPI or slightly better • The coarse grain approach is time- and effort-consuming to implement • Coarse grain techniques are given above (slide 15) • There is big run-to-run variability when using the full node; until this is fixed, we recommend using fewer than 8 threads per node • Thread binding does not influence performance
