On pearls and perils of hybrid OpenMP/MPI programming on the Blue Horizon
D. Pekurovsky, L. Nett-Carrington, D. Holland, T. Kaiser
San Diego Supercomputing Center
Overview
• Blue Horizon Hardware
• Motivation for this work
• Two methods of hybrid programming
• Fine grain results
• A word on coarse grain techniques
• Coarse grain results
• Time variability
• Effects of thread binding
• Final Conclusions
Blue Horizon Hardware
• 144 IBM SP High Nodes
• Each node:
  • 8-way SMP
  • 4 GB memory
  • crossbar
• Each processor:
  • Power3, 222 MHz
  • 4 Flop/cycle
• Aggregate peak: 1.002 Tflop/s
• Compilers:
  • IBM mpxlf_r, version 7.0.1
  • KAI guidef90, version 3.9
Blue Horizon Hardware
• Interconnect (between nodes):
  • Currently: 115 MB/s, 4 MPI tasks/node
    → Must use OpenMP to utilize all processors
  • Soon: 500 MB/s, 8 MPI tasks/node
    → Can use OpenMP to supplement MPI (if it's worth it)
Hybrid Programming: why use it?
• Non-performance-related reasons:
  • Avoid replication of data on the node
• Performance-related reasons:
  • Avoid latency of MPI on the node
  • Avoid unnecessary data copies inside the node
  • Reduce latency of MPI calls between the nodes
  • Decrease global MPI operations (reduction, all-to-all)
• The price to pay:
  • OpenMP overheads
  • False sharing
Is it really worth trying?
Hybrid Programming
• Two methods of combining MPI and OpenMP in parallel programs

Fine grain:
   main program
   ! MPI initialization
   ....
   ! cpu intensive loop
   !$OMP PARALLEL DO
   do i=1,n
      ! work
   end do
   ....
   end

Coarse grain:
   main program
   ! MPI initialization
   !$OMP PARALLEL
   ....
   do i=1,n
      ! work
   end do
   ....
   !$OMP END PARALLEL
   end
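For concreteness, a compilable version of the fine grain skeleton might look like the sketch below. The array size n, the loop body, and the final reduction are placeholders invented for illustration, not code from any benchmark; on Blue Horizon such a program would be built with the thread-safe MPI compiler wrapper (mpxlf_r) with OpenMP enabled.

   program fine_grain_demo
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000          ! placeholder problem size
      integer :: ierr, rank, i
      double precision :: a(n), local_sum, global_sum

      ! MPI initialization (serial part of each task)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      ! cpu intensive loop: a thread team is created here and joined at the end
      local_sum = 0.0d0
   !$OMP PARALLEL DO REDUCTION(+:local_sum)
      do i = 1, n
         a(i) = dble(i + rank)                  ! stand-in for real work
         local_sum = local_sum + a(i)
      end do
   !$OMP END PARALLEL DO

      ! MPI communication stays outside the threaded region
      call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)

      if (rank == 0) print *, 'global sum = ', global_sum
      call MPI_FINALIZE(ierr)
   end program fine_grain_demo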
Hybrid programming
Fine grain approach:
• Easy to implement
• Performance: low, due to overheads of OpenMP directives (OMP PARALLEL DO)
Coarse grain approach:
• Time-consuming implementation
• Performance: less overhead for thread creation
Hybrid NPB using fine grain parallelism
• CG, MG, and FT suites of the NAS Parallel Benchmarks (NPB)

  Suite name                  # loops parallelized
  CG - Conjugate Gradient     18
  MG - Multi-Grid             50
  FT - Fourier Transform       8

• Results shown are the best of 5-10 runs
• Complete results at http://www.sdsc.edu/SciComp/PAA/Benchmarks/Hybrid/hybrid.html
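The parallelized loops are typically of the kind sketched below. This is a toy, self-contained example patterned on the sparse matrix-vector kernel that dominates CG; the tiny CSR matrix is made up for illustration and is not taken from the benchmark source.

   program csr_matvec_demo
      implicit none
      ! A tiny matrix in compressed sparse row (CSR) storage; the CG
      ! benchmark's main loop has the same shape, only much larger.
      integer, parameter :: nrows = 4
      integer :: rowstr(nrows+1), colidx(6)
      double precision :: a(6), p(nrows), q(nrows), s
      integer :: j, k

      rowstr = (/ 1, 3, 4, 5, 7 /)                      ! row pointers
      colidx = (/ 1, 2, 2, 3, 1, 4 /)                   ! column indices
      a      = (/ 2.d0, 1.d0, 3.d0, 4.d0, 1.d0, 5.d0 /) ! nonzero values
      p      = 1.d0

      ! Fine grain: a single directive around the existing loop
   !$OMP PARALLEL DO PRIVATE(k, s)
      do j = 1, nrows
         s = 0.d0
         do k = rowstr(j), rowstr(j+1) - 1
            s = s + a(k)*p(colidx(k))
         end do
         q(j) = s
      end do
   !$OMP END PARALLEL DO

      print *, 'q = ', q
   end program csr_matvec_demo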
Hybrid NPB using coarse grain parallelism: MG suite
Overview of the method
[diagram: MPI Tasks 1-4, each containing OpenMP Threads 1-2]
Coarse grain programming methodology
• Start with MPI code
• Each MPI task spawns threads once, in the beginning
• Serial work (initialization etc.) and MPI calls are done inside MASTER or SINGLE regions
• Main arrays are global
• Work distribution: each thread gets a chunk of the array based on its number (omp_get_thread_num()). In this work, one-dimensional blocking.
• Avoid using OMP DO
• Careful with scoping and synchronization
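A minimal sketch of this recipe is given below. The array u, its size, and the trivial update are placeholders standing in for the MG arrays and stencil work; the point is the structure: one parallel region, explicit 1-D blocking by thread number, no OMP DO, and MPI calls funneled through the master thread.

   program coarse_grain_demo
      use omp_lib
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000            ! placeholder global array size
      double precision :: u(n), local_sum, global_sum
      integer :: ierr, rank, nthreads, tid, chunk, ilo, ihi, i

      call MPI_INIT(ierr)                       ! serial work: MPI init etc.
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      u = 1.0d0
      local_sum = 0.0d0

   !$OMP PARALLEL PRIVATE(tid, nthreads, chunk, ilo, ihi, i)
      ! threads are spawned once; each thread owns a 1-D block of u
      tid      = omp_get_thread_num()
      nthreads = omp_get_num_threads()
      chunk    = (n + nthreads - 1)/nthreads
      ilo      = tid*chunk + 1
      ihi      = min(n, ilo + chunk - 1)

      do i = ilo, ihi                           ! no OMP DO: explicit distribution
         u(i) = u(i) + dble(rank)
      end do

   !$OMP CRITICAL
      local_sum = local_sum + sum(u(ilo:ihi))   ! combine thread contributions
   !$OMP END CRITICAL
   !$OMP BARRIER

   !$OMP MASTER
      ! MPI calls are made by the master thread only
      call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
   !$OMP END MASTER
   !$OMP BARRIER                                ! result visible to all threads
   !$OMP END PARALLEL

      if (rank == 0) print *, 'global sum = ', global_sum
      call MPI_FINALIZE(ierr)
   end program coarse_grain_demo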
Coarse grain results - MG (C class)
• Full node results

  # of SMP Nodes   MPI Tasks x OpenMP Threads   Max MOPS/CPU   Min MOPS/CPU
   8               4x2                          75.7           19.1
   8               2x4                          92.6           14.9
   8               1x8                          84.2           13.6
  64               4x2                          49.5           18.6
  64               4x2                          15.6            3.7
  64               2x4                          21.2            5.3
  64               1x8                          56.8           42.3
  64               1x8                          15.4            5.6
  64               2x4                           8.2            2.2
Variability
• 2-5 times (on 64 nodes)
• Seen mostly when the full node is used
• Seen both in fine grain and coarse grain runs
• Seen both with the IBM and the KAI compiler
• Seen in runs on the same set of nodes as well as between different sets
• On a large number of nodes, the average performance suffers a lot
• Confirmed in a micro-study of OpenMP on 1 node
OpenMP on 1 node: microbenchmark results
http://www.sdsc.edu/SciComp/PAA/Benchmarks/Open/open.html
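The full results are at the URL above. As a rough illustration of what such a microbenchmark measures, the cost of the directive itself can be estimated by timing a small loop with and without OMP PARALLEL DO; the loop body, sizes, and repetition count below are arbitrary choices for the sketch.

   program omp_overhead_sketch
      use omp_lib
      implicit none
      integer, parameter :: reps = 10000, n = 128
      double precision :: a(n), t0, t1, t2
      integer :: r, i

      a = 0.0d0

      t0 = omp_get_wtime()
      do r = 1, reps                      ! reference: plain serial loop
         do i = 1, n
            a(i) = a(i) + 1.0d0
         end do
      end do
      t1 = omp_get_wtime()

      do r = 1, reps                      ! same loop behind a directive:
   !$OMP PARALLEL DO                      ! each rep pays the fork/join cost
         do i = 1, n
            a(i) = a(i) + 1.0d0
         end do
   !$OMP END PARALLEL DO
      end do
      t2 = omp_get_wtime()

      print *, 'per-directive overhead (microseconds): ', &
               ((t2 - t1) - (t1 - t0))*1.0d6/dble(reps)
   end program omp_overhead_sketch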
Thread binding
Question: is variability related to thread migration?
• A study on 1 node:
  • Each OpenMP thread performs an independent matrix inversion taking about 1.6 seconds
  • Monitor processor id and run time for each thread
  • Repeat 100 times
  • Threads bound OR not bound
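A sketch of the measurement loop is below. A simple floating-point kernel stands in for the matrix inversion actually used in the study, and the processor-id query and the binding controls (handled through the IBM SMP runtime environment, outside the code) are left out.

   program thread_timing_sketch
      use omp_lib
      implicit none
      integer, parameter :: trials = 100
      integer, parameter :: m = 200000000    ! tune so one pass takes a second or two
      double precision :: t0, t1, x
      integer :: trial, tid, k

   !$OMP PARALLEL PRIVATE(t0, t1, x, trial, tid, k)
      tid = omp_get_thread_num()
      do trial = 1, trials
         t0 = omp_get_wtime()
         x = 0.0d0
         do k = 1, m                        ! independent work on every thread
            x = x + 1.0d0/dble(k)
         end do
         t1 = omp_get_wtime()
   !$OMP CRITICAL
         ! the real study also recorded the processor id of each thread here
         print '(a,i2,a,i4,a,f8.3,a,es12.4)', 'thread ', tid, ' trial ', &
               trial, ' time ', t1 - t0, ' s   x=', x
   !$OMP END CRITICAL
      end do
   !$OMP END PARALLEL
   end program thread_timing_sketch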
Thread binding
Results for OMP_NUM_THREADS=8:
• Without binding, threads migrate in about 15% of the runs
• With thread binding turned on there was no migration
• 3% of iterations had threads with runtimes > 2.0 sec., a 25% slowdown
• The slowdown occurs with and without binding
• Effect of a single thread slowdown: if each node independently has a chance c of hitting a slowed thread, the probability that the complete calculation is slowed is
    P = 1 - (1 - c)^M
  With c = 3% and M = 144 nodes of Blue Horizon, P = 0.9876: a 98.8% probability that the overall result is slowed by 25%
Thread binding
• The calculation was rerun with OMP_NUM_THREADS = 7
  • 12.5% reduction in computational power (7 of 8 CPUs used)
  • No threads showed a slowdown; all ran in about 1.6 seconds
• Summary:
  • OMP_NUM_THREADS = 7 yields a 12.5% reduction in computational power
  • OMP_NUM_THREADS = 8 gives a 0.9876 probability that overall results are slowed by 25%, independent of thread binding
Overall Conclusions
Based on our study of NPB on Blue Horizon:
• The fine grain hybrid approach is generally worse than pure MPI
• The coarse grain approach for MG is comparable with pure MPI, or slightly better
• The coarse grain approach is time- and effort-consuming; the techniques used are given above
• There is big variability when using the full node. Until this is fixed, we recommend using fewer than 8 threads
• Thread binding does not influence performance