Hybrid OpenMP and MPI Programming



  1. Hybrid OpenMP and MPI Programming

  2. MPI vs. OpenMP
  Pure MPI Pro:
  • Portable to distributed and shared memory machines
  • Scales beyond one node
  • No data placement problem
  Pure MPI Con:
  • Difficult to develop and debug
  • High latency, low bandwidth
  • Explicit communication
  • Large granularity
  • Difficult load balancing
  Pure OpenMP Pro:
  • Easy to implement parallelism
  • Low latency, high bandwidth
  • Implicit communication
  • Coarse and fine granularity
  • Dynamic load balancing
  Pure OpenMP Con:
  • Only runs on shared memory machines
  • Scales only within one node
  • Possible data placement problem
  • No specific thread order

  3. MPI vs. MPI+OpenMP
  [Figure comparing the per-node process/thread layout of pure MPI and MPI+OpenMP; animation: https://www.sharcnet.ca/~jemmyhu/CES706/mpi+smp.swf]
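To make the layout in the figure concrete, a tiny C sketch (not from the slides) is shown here: run with several MPI ranks and OMP_NUM_THREADS greater than one, it prints which host each (rank, thread) pair lands on.

  /* Minimal hybrid "hello" sketch: prints the host, MPI rank, and OpenMP
   * thread id for every thread of every process. */
  #include <stdio.h>
  #include <mpi.h>
  #include <omp.h>

  int main(int argc, char *argv[])
  {
      int rank, provided, namelen;
      char host[MPI_MAX_PROCESSOR_NAME];

      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &namelen);

      #pragma omp parallel
      printf("host %s  MPI rank %d  OpenMP thread %d of %d\n",
             host, rank, omp_get_thread_num(), omp_get_num_threads());

      MPI_Finalize();
      return 0;
  }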

  4. Why Hybrid
  • Elegant in concept and architecture: use MPI across nodes and OpenMP within a node. Makes good use of the shared memory system's resources (memory, latency, and bandwidth).
  • Avoids the extra MPI communication overhead within a node.
  • OpenMP adds fine granularity (and, with fewer MPI tasks, larger message sizes) and allows increased and/or dynamic load balancing.
  • Some problems have two-level parallelism naturally.
  • Some problems can only use a restricted number of MPI tasks.

  5. Why is Mixed OpenMP/MPI Code Sometimes Slower?
  • OpenMP has less scalability due to implicit parallelism, while MPI allows multi-dimensional blocking.
  • All threads except one are idle during MPI communication.
  • Need to overlap computation and communication for better performance (see the sketch after this list).
  • Critical sections are needed for shared variables.
  • Thread creation overhead.
  • Cache coherence and data placement issues.
  • Some problems naturally have only one level of parallelism.
  • Pure OpenMP code may perform worse than pure MPI within a node.
  • Lack of optimized OpenMP compilers/libraries.
  • Positive and negative experiences:
  • Positive: CAM, MM5, …
  • Negative: NAS, CG, PS, …
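A minimal sketch (not from the slides) of the overlap idea: the halo exchange is posted with non-blocking MPI calls, the interior rows that need no halo data are updated by all OpenMP threads while the messages are in flight, and the boundary rows are updated only after MPI_Waitall. The array layout, N, and update_row() are assumptions made for illustration.

  #include <mpi.h>
  #include <omp.h>

  #define N 1024                      /* local interior rows and row length */

  /* hypothetical stencil update of one row of u from uold */
  extern void update_row(double *u, const double *uold, int row);

  void step(double *u, double *uold, int rank, int nprocs)
  {
      MPI_Request req[4];
      int nreq = 0;
      int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
      int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

      /* post non-blocking halo exchange of the first and last interior rows
         (uold has N+2 rows of length N; rows 0 and N+1 are the halos) */
      MPI_Irecv(&uold[0],       N, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[nreq++]);
      MPI_Irecv(&uold[(N+1)*N], N, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[nreq++]);
      MPI_Isend(&uold[N],       N, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[nreq++]);
      MPI_Isend(&uold[N*N],     N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[nreq++]);

      /* interior rows do not touch the halos: compute them while messages move */
      #pragma omp parallel for
      for (int j = 2; j < N; j++)
          update_row(u, uold, j);

      /* boundary rows need the halos: wait, then update them */
      MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
      update_row(u, uold, 1);
      update_row(u, uold, N);
  }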

  6. A Pseudo Hybrid Code
      Program hybrid
        call MPI_INIT (ierr)
        call MPI_COMM_RANK (…)
        call MPI_COMM_SIZE (…)
        … some computation and MPI communication
        call OMP_SET_NUM_THREADS(4)
  !$OMP PARALLEL DO PRIVATE(i)
  !$OMP& SHARED(n)
        do i=1,n
          … computation
        enddo
  !$OMP END PARALLEL DO
        … some computation and MPI communication
        call MPI_FINALIZE (ierr)
      end
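A minimal C sketch of the same skeleton (not part of the slides), using MPI_Init_thread so the required thread-support level is requested explicitly; the loop body is a placeholder.

  #include <stdio.h>
  #include <mpi.h>
  #include <omp.h>

  int main(int argc, char *argv[])
  {
      int rank, size, provided;

      /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls,
         and it does so outside of parallel regions */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* ... some computation and MPI communication ... */

      int n = 1000;
      double a[1000];
      omp_set_num_threads(4);
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          a[i] = i * 0.5;             /* placeholder computation */

      /* ... some computation and MPI communication ... */

      printf("rank %d of %d done (thread support %d)\n", rank, size, provided);
      MPI_Finalize();
      return 0;
  }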

  7. Loop-based vs. SPMD
  SPMD:
  !$OMP PARALLEL PRIVATE(start, end, i, thrd_id, num_thrds)
  !$OMP& SHARED(a,b,n)
        num_thrds = omp_get_num_threads()
        thrd_id = omp_get_thread_num()
        start = n* thrd_id/num_thrds + 1
        end = n*(thrd_id+1)/num_thrds
        do i = start, end
          a(i) = a(i) + b(i)
        enddo
  !$OMP END PARALLEL
  Loop-based:
  !$OMP PARALLEL DO PRIVATE(i)
  !$OMP& SHARED(a,b,n)
        do i=1,n
          a(i) = a(i) + b(i)
        enddo
  !$OMP END PARALLEL DO
  • SPMD code normally gives better performance than loop-based code, but is more difficult to implement (a C sketch of the same contrast follows below):
  • Less thread synchronization.
  • Fewer cache misses.
  • More compiler optimizations.
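A C sketch (not from the slides) contrasting the two styles on the same vector update a[i] += b[i]; the array size N and names are hypothetical.

  #include <omp.h>

  #define N 1000000

  void loop_based(double *a, const double *b)
  {
      /* loop-based: the runtime splits the iterations among threads */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] += b[i];
  }

  void spmd(double *a, const double *b)
  {
      /* SPMD: each thread computes its own index range explicitly */
      #pragma omp parallel
      {
          int nthreads = omp_get_num_threads();
          int tid      = omp_get_thread_num();
          int start    = (int)((long long)N *  tid      / nthreads);
          int end      = (int)((long long)N * (tid + 1) / nthreads);
          for (int i = start; i < end; i++)
              a[i] += b[i];
      }
  }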

  8. Hybrid Parallelization Strategies
  • From sequential code, decompose with MPI first, then add OpenMP.
  • From OpenMP code, treat it as serial code and add MPI.
  • From MPI code, add OpenMP.
  • The simplest and least error-prone way is to use MPI outside the parallel regions and allow only the master thread to communicate between MPI tasks (see the sketch below).
  • MPI can also be used inside parallel regions, provided the MPI library is thread-safe.
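A minimal sketch (not from the slides) of the "master-only" strategy when the MPI call sits inside a parallel region: only the master thread talks to MPI, with barriers around the call so the other threads neither race ahead nor read stale data. This requires at least MPI_THREAD_FUNNELED from MPI_Init_thread; the buffer and its size are hypothetical.

  #include <mpi.h>
  #include <omp.h>

  void exchange_then_compute(double *buf, int count)
  {
      #pragma omp parallel
      {
          /* ... threaded computation that fills buf ... */

          #pragma omp barrier        /* make sure buf is complete */
          #pragma omp master
          {
              /* only the master thread makes the MPI call */
              MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE,
                            MPI_SUM, MPI_COMM_WORLD);
          }
          #pragma omp barrier        /* other threads wait for the result */

          /* ... threaded computation that uses the reduced buf ... */
      }
  }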

  9. Examples
  • pi calculation
  • Matrix-vector multiplication
  • 2D Laplace equation
  • 2D Helmholtz equation

  10. pi – MPI version
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>        /* MPI header file */

  #define NUM_STEPS 100000000

  int main(int argc, char *argv[])
  {
      int nprocs;
      int myid;
      double start_time, end_time;
      int i;
      double x, pi;
      double sum = 0.0;
      double step = 1.0/(double) NUM_STEPS;

      /* initialize for MPI */
      MPI_Init(&argc, &argv);                    /* starts MPI */
      /* get number of processes */
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      /* get this process's number (ranges from 0 to nprocs - 1) */
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  11. pi – MPI version (continued)
      /* record start time */
      start_time = MPI_Wtime();

      /* do computation */
      for (i=myid; i < NUM_STEPS; i += nprocs) {      /* changed */
          x = (i+0.5)*step;
          sum = sum + 4.0/(1.0+x*x);
      }
      sum = step * sum;                               /* changed */
      MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);   /* added */

      /* record end time */
      end_time = MPI_Wtime();

      /* print results */
      if (myid == 0) {
          printf("parallel program results with %d processes:\n", nprocs);
          printf("pi = %g  (%17.15f)\n", pi, pi);
          printf("time to compute = %g seconds\n", end_time - start_time);
      }

      /* clean up for MPI */
      MPI_Finalize();
      return 0;
  }
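For comparison, a minimal hybrid sketch of the same computation (not taken from the transcript): MPI distributes the iterations across processes exactly as in the pure-MPI version, and an OpenMP reduction(+:sum) accumulates each process's partial sum over its threads.

  #include <stdio.h>
  #include <mpi.h>
  #include <omp.h>

  #define NUM_STEPS 100000000

  int main(int argc, char *argv[])
  {
      int nprocs, myid, provided;
      double pi, sum = 0.0;
      double step = 1.0 / (double) NUM_STEPS;

      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);

      /* each process takes iterations myid, myid+nprocs, ...; its threads
         split those iterations and the reduction combines the thread sums */
      #pragma omp parallel for reduction(+:sum)
      for (long i = myid; i < NUM_STEPS; i += nprocs) {
          double x = (i + 0.5) * step;
          sum += 4.0 / (1.0 + x * x);
      }
      sum *= step;

      /* combine the per-process sums on rank 0 */
      MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (myid == 0)
          printf("pi = %17.15f\n", pi);

      MPI_Finalize();
      return 0;
  }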

  12. OMP does not support vector reduction
  Matrix-vector multiplication (thread = process):
      c = 0.0
      do j = 1, n_loc
  !$OMP PARALLEL DO
  !$OMP& SHARED(a,b), PRIVATE(i)
  !$OMP& REDUCTION(+:c)
        do i = 1, nrows
          c(i) = c(i) + a(i,j)*b(i)
        enddo
      enddo
      call MPI_REDUCE_SCATTER(c)
  • OMP does not support vector reduction
  • Wrong answer since c is shared! (A common workaround is sketched below.)
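A C sketch (not from the slides) of the usual workaround: each thread accumulates into its own private copy of c and the copies are merged in a critical section, so no element of the shared c is written concurrently. nrows, n_loc, the column-major layout of a, and the use of b(i) are carried over from the fragment above; newer OpenMP versions also accept array-section reductions such as reduction(+:c[:nrows]).

  #include <stdlib.h>
  #include <string.h>

  void matvec(int nrows, int n_loc, const double *a, const double *b, double *c)
  {
      memset(c, 0, nrows * sizeof(double));

      #pragma omp parallel
      {
          double *c_local = calloc(nrows, sizeof(double));   /* per-thread copy */

          #pragma omp for
          for (int j = 0; j < n_loc; j++)
              for (int i = 0; i < nrows; i++)
                  c_local[i] += a[i + j*nrows] * b[i];

          #pragma omp critical       /* merge the private copies one at a time */
          for (int i = 0; i < nrows; i++)
              c[i] += c_local[i];

          free(c_local);
      }
      /* an MPI_Reduce_scatter over c would follow, as in the slide */
  }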

  13. Jacobi Solver – OpenMP
      subroutine jacobi (n,m,dx,dy,alpha,omega,u,f,tol,maxit)
      use omp_lib
      ... (while loop)

  #pragma omp parallel private(resid, i)
  {
    #pragma omp for
    for (j=0; j<m; j++)
      for (i=0; i<n; i++)
        uold[i + m*j] = u[i + m*j];

    /* compute stencil, residual and update */
    #pragma omp for reduction(+:error)
    for (j=1; j<m-1; j++)
      for (i=1; i<n-1; i++) {
        resid = ( ax * (uold[i-1 + m*j] + uold[i+1 + m*j])
                + ay * (uold[i + m*(j-1)] + uold[i + m*(j+1)])
                + b * uold[i + m*j] - f[i + m*j]) / b;
        /* update solution */
        u[i + m*j] = uold[i + m*j] - omega * resid;
        /* accumulate residual error */
        error = error + resid*resid;
      }
  } /* end parallel */
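The "... (while loop)" above is not reproduced in the transcript; the following C sketch is an assumption showing what the outer convergence loop of a typical Jacobi driver looks like. jacobi_sweep() is a hypothetical function standing in for the parallel sweep shown above: it copies u into uold, updates u, and returns the accumulated sum of resid*resid.

  #include <math.h>
  #include <stdio.h>

  /* hypothetical: one full parallel sweep, returns sum of resid*resid */
  double jacobi_sweep(int n, int m, double *u, double *uold, const double *f,
                      double ax, double ay, double b, double omega);

  void jacobi_drive(int n, int m, double *u, double *uold, const double *f,
                    double ax, double ay, double b, double omega,
                    double tol, int maxit)
  {
      double error = 10.0 * tol;      /* force at least one sweep */
      int k = 0;

      while (k < maxit && error > tol) {
          error = jacobi_sweep(n, m, u, uold, f, ax, ay, b, omega);
          error = sqrt(error) / (double)(n * m);   /* normalized residual */
          k++;
      }
      printf("iterations: %d, residual: %g\n", k, error);
  }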

  14. Jacobi Solver – MPI
      subroutine jacobi (n,m,mlo,mhi,dx,dy,alpha,omega,u,f,tol,maxit)
      include "mpif.h"
      common /mpicom/ me, np
      … (communication: MPI_IRECV, MPI_ISEND, etc.)
      do j=mlo+1,mhi-1
        do i=1,n
          uold(i,j) = u(i,j)
        enddo
      enddo
      call MPI_WAITALL ( reqcnt, reqary, reqstat, ierr )
      if ( ierr .ne. 0 ) then
        do i = 1, reqcnt
          print *, me, ': ', i, (reqstat(j,i), j=1,MPI_STATUS_SIZE)
        end do
      end if
! Compute stencil, residual, & update
      do j = mlo+1,mhi-1
        do i = 2,n-1
! Evaluate residual
          resid = ( ax*(uold(i-1,j) + uold(i+1,j)) &
                  & + ay*(uold(i,j-1) + uold(i,j+1)) &
                  & + b * uold(i,j) - f(i,j) ) / b
! Update solution
          u(i,j) = uold(i,j) - omega * resid
! Accumulate residual error
          error = error + resid*resid
        end do
      enddo
      error_local = error
      call MPI_ALLREDUCE ( error_local, error, 1, &
                  & MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr )

  15. Jacobi Solver – MPI+OpenMP
      subroutine jacobi (n,m,mlo,mhi,dx,dy,alpha,omega,u,f,tol,maxit)
      use omp_lib
      include "mpif.h"
      common /mpicom/ me, np
      … (MPI communications, etc.)
!$omp parallel
!$omp do
      do j=mlo+1,mhi-1
        do i=1,n
          uold(i,j) = u(i,j)
        enddo
      enddo
!$omp end do
!$omp end parallel
      call MPI_WAITALL ( reqcnt, reqary, reqstat, ierr )
      if ( ierr .ne. 0 ) then
        do i = 1, reqcnt
          print *, me, ': ', i, (reqstat(j,i), j=1,MPI_STATUS_SIZE)
        end do
      end if
! Compute stencil, residual, & update
      …

  In the main code, define the number of threads:
      call getenv ( 'OMP_NUM_THREADS', cnthreads )
      read (cnthreads,'(i2)') nthreads
      call MPI_BCAST ( nthreads, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )
      call omp_set_num_threads ( nthreads )

  16. Jacobi Solver – MPI+OpenMP (continued)
!$omp parallel
!$omp do private(resid) reduction(+:error)
      do j = mlo+1,mhi-1
        do i = 2,n-1
! Evaluate residual
          resid = ( ax*(uold(i-1,j) + uold(i+1,j)) &
                  & + ay*(uold(i,j-1) + uold(i,j+1)) &
                  & + b * uold(i,j) - f(i,j) ) / b
! Update solution
          u(i,j) = uold(i,j) - omega * resid
! Accumulate residual error
          error = error + resid*resid
        end do
      enddo
!$omp end do nowait
!$omp end parallel
      error_local = error
      call MPI_ALLREDUCE ( error_local, error, 1, &
                  & MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr )
      …

  17. Summary
  • Many contemporary parallel computers consist of a collection of multiprocessor nodes.
  • OpenMP lets us take advantage of shared memory within a node to reduce communication overhead.
  • Pure OpenMP performing better than pure MPI within a node is a necessity for the hybrid code to beat pure MPI across nodes.
  • Whether the hybrid code performs better than the pure MPI code depends on whether the communication advantage outweighs the thread overhead and other costs.
  • There are more positive experiences of developing hybrid MPI/OpenMP parallel paradigms now, and it is encouraging to adopt the hybrid paradigm in your own applications.
