On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon
D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser
San Diego Supercomputer Center

Overview
• Blue Horizon Hardware
• Motivation for this work
• Two methods of hybrid programming
• Fine grain results
• A word on coarse grain techniques
• Coarse grain results
• Time variability
• Effects of thread binding
• Final Conclusions

Blue Horizon Hardware
• 144 IBM SP High Nodes
• Each node:
  • 8-way SMP
  • 4 GB memory
  • crossbar
• Each processor:
  • Power3, 222 MHz
  • 4 Flop/cycle
• Aggregate peak: 1.023 Tflop/s (144 nodes × 8 processors × 222 MHz × 4 Flop/cycle)
• Compilers:
  • IBM mpxlf_r, version 7.0.1
  • KAI guidef90, version 3.9
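
The aggregate peak follows from multiplying the per-processor figures on this slide. A minimal sketch of that arithmetic, written in Python for brevity; the function name `aggregate_peak_flops` is illustrative and not part of the talk:

```python
def aggregate_peak_flops(nodes, cpus_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak in flop/s: every processor retires
    flops_per_cycle floating-point operations on every clock cycle."""
    return nodes * cpus_per_node * clock_hz * flops_per_cycle


if __name__ == "__main__":
    # Figures taken from the slide: 144 nodes, 8-way SMP, 222 MHz, 4 Flop/cycle.
    peak = aggregate_peak_flops(144, 8, 222_000_000, 4)
    print(f"{peak / 1e12:.3f} Tflop/s")
```

With the slide's figures the product comes to roughly 1.02 Tflop/s. This is a theoretical upper bound, not a measured rate.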

Blue Horizon Hardware
• Interconnect (between nodes):
  • Currently:
    • 115 MB/s
    • 4 MPI tasks/node: must use OpenMP to utilize all processors
  • Soon:
    • 500 MB/s
    • 8 MPI tasks/node: can use OpenMP to supplement MPI (if it’s worth it)

Hybrid Programming: why use it?
• Non-performance-related reasons:
  • Avoid replication of data on the node
• Performance-related reasons:
  • Avoid latency of MPI on the node
  • Avoid unnecessary data copies inside the node
  • Reduce latency of MPI calls between the nodes
  • Decrease global MPI operations (reduction, all-to-all)
• The price to pay:
  • OpenMP overheads
  • False sharing
Is it really worth trying?

Hybrid Programming
• Two methods of combining MPI and OpenMP in parallel programs

Fine grain:

    main program
    ! MPI initialization
    ....
    ! cpu intensive loop
    !$OMP PARALLEL DO
    do i=1,n
      !work
    end do
    ....
    end

Coarse grain:

    main program
    !MPI initialization
    !$OMP PARALLEL
    ....
    do i=1,n
      !work
    end do
    ....
    !$OMP END PARALLEL
    end

Hybrid programming
Fine grain approach
• Easy to implement
• Performance: low due to overheads of OpenMP directives (OMP PARALLEL DO)
Coarse grain approach
• Time-consuming implementation
• Performance: less overhead for thread creation

Hybrid NPB using fine grain parallelism
• CG, MG, and FT suites of NAS Parallel Benchmarks (NPB)

  Suite name                 # loops parallelized
  CG - Conjugate Gradient    18
  MG - Multi-Grid            50
  FT - Fourier Transform     8

• Results shown are the best of 5-10 runs
• Complete results at http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html

Hybrid NPB using coarse grain parallelism: MG suite
Overview of the method
[Figure: diagram with boxes labeled Task 1, Task 2, Task 3, Task 4 and Thread 1, Thread 2]

Coarse grain programming methodology
• Start with MPI code
• Each MPI task spawns threads once in the beginning
• Serial work (initialization etc.) and MPI calls are done inside a MASTER or SINGLE region
• Main arrays are global
• Work distribution: each thread gets a chunk of the array based on its number (omp_get_thread_num()). In this work, one-dimensional blocking is used.
• Avoid using OMP DO
• Be careful with scoping and synchronization
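
The work-distribution step can be illustrated with a small sketch of one-dimensional blocking. It is written in Python for brevity and is not the authors' code; the helper name `block_bounds` is hypothetical, and `thread_id` stands in for the value that `omp_get_thread_num()` would return inside the parallel region. The sketch uses 0-based, half-open index ranges, so a Fortran version would need the usual shift to 1-based bounds.

```python
def block_bounds(n, num_threads, thread_id):
    """Return the half-open index range [start, stop) owned by thread_id
    when n array elements are split into contiguous blocks.

    If n is not divisible by num_threads, the first (n % num_threads)
    threads each receive one extra element.
    """
    base, extra = divmod(n, num_threads)
    start = thread_id * base + min(thread_id, extra)
    stop = start + base + (1 if thread_id < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Example: 10 elements shared by 4 threads.
    for tid in range(4):
        print(tid, block_bounds(10, 4, tid))
```

Because each thread computes its own bounds from its thread number, no work-sharing directive such as OMP DO is needed, which is the point of the coarse grain approach.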

Coarse grain results - MG (C class)
• Full node results

  # of SMP Nodes   MPI Tasks x OpenMP Threads   Max MOPS/CPU   Min MOPS/CPU
  8                4x2                          75.7           19.1
  8                2x4                          92.6           14.9
  8                1x8                          84.2           13.6
  64               4x2                          49.5           18.6
  64               4x2                          15.6           3.7

  Further 64-node entries (configurations 2x4, 1x8, 1x8, 2x4; row pairing not recoverable), Max/Min MOPS/CPU: 21.2/5.3, 56.8/42.3, 15.4/5.6, 8.2/2.2

Variability
• A factor of 2 to 5 (on 64 nodes)
• Seen mostly when the full node is used
• Seen in both fine grain and coarse grain runs
• Seen with both the IBM and KAI compilers
• Seen in runs on the same set of nodes as well as between different sets
• On a large number of nodes, the average performance suffers a lot
• Confirmed in a micro-study of OpenMP on 1 node

OpenMP on 1 node microbenchmark results
http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html

Thread binding
Question: is variability related to thread migration?
• A study on 1 node:
  • Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds
  • Monitor processor id and run time for each thread
  • Repeat 100 times
  • Threads bound OR not bound

Thread binding
Results for OMP_NUM_THREADS=8
• Without binding, threads migrate in about 15% of the runs
• With thread binding turned on there was no migration
• 3% of iterations had threads with runtimes > 2.0 sec., a 25% slowdown
• Slowdown occurs with/without binding
• Effect of single thread slowdown
  • Probability that the complete calculation will be slowed: P = 1 - (1 - c)^M, with c = 3% and M = 144 nodes of Blue Horizon
  • P = 0.9876: probability that overall results are slowed by 25%
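
The probability quoted on this slide can be checked numerically. The sketch below, in Python for brevity, evaluates P = 1 - (1 - c)^M; the function name `slowdown_probability` is illustrative. It relies on the assumption built into the formula, namely that each of the M nodes is slowed independently with probability c and that one slow node is enough to delay the whole calculation.

```python
def slowdown_probability(c, m):
    """Probability that at least one of m nodes is slowed, when each node
    is slowed independently with probability c.

    (1 - c) ** m is the chance that no node is slowed; the complement
    is the chance that the complete calculation is delayed.
    """
    return 1.0 - (1.0 - c) ** m


if __name__ == "__main__":
    # Values from the slide: c = 3%, M = 144 nodes of Blue Horizon.
    print(f"{slowdown_probability(0.03, 144):.4f}")
```

With c = 0.03 and M = 144 this reproduces the value of about 0.9876 given on the slide. On a single node the same formula returns just c, which shows how quickly a rare per-node event becomes near-certain at full machine scale.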

Thread binding
• Calculation was rerun
  • OMP_NUM_THREADS = 7
  • 12.5% reduction in computational power
  • No threads showed a slowdown, all ran in about 1.6 seconds
• Summary
  • OMP_NUM_THREADS = 7
    • yields 12.5% reduction in computational power
  • OMP_NUM_THREADS = 8
    • 0.9876 probability that overall results are slowed by 25%, independent of thread binding

Overall Conclusions
Based on our study of NPB on Blue Horizon:
• The fine grain hybrid approach is generally worse than pure MPI
• The coarse grain approach for MG is comparable with pure MPI or slightly better
• The coarse grain approach is time- and effort-consuming
• Coarse grain techniques are given
• Big variability when using the full node. Until this is fixed, we recommend using fewer than 8 threads per node
• Thread binding does not influence performance