Hybrid OpenMP and MPI Programming
MPI vs. OpenMP

• Pure MPI Pro:
  • Portable to distributed and shared memory machines
  • Scales beyond one node
  • No data placement problem
• Pure MPI Con:
  • Difficult to develop and debug
  • High latency, low bandwidth
  • Explicit communication
  • Large granularity
  • Difficult load balancing
• Pure OpenMP Pro:
  • Easy to implement parallelism
  • Low latency, high bandwidth
  • Implicit communication
  • Coarse and fine granularity
  • Dynamic load balancing
• Pure OpenMP Con:
  • Only on shared memory machines
  • Scales within one node only
  • Possible data placement problem
  • No specific thread order
MPI vs. MPI+OpenMP

[Animation: per-node process/thread layout for pure MPI vs. MPI+OpenMP – https://www.sharcnet.ca/~jemmyhu/CES706/mpi+smp.swf]
Why Hybrid

• Elegant in concept and architecture: use MPI across nodes and OpenMP within a node. Makes good use of shared-memory system resources (memory, latency, and bandwidth).
• Avoids the extra communication overhead of MPI within a node.
• OpenMP adds fine granularity (with larger MPI message sizes between nodes) and allows increased and/or dynamic load balancing.
• Some problems have two-level parallelism naturally.
• Some problems can only use a restricted number of MPI tasks.
Why is Mixed OpenMP/MPI Code Sometimes Slower?

• OpenMP has less scalability due to implicit parallelism, while MPI allows multi-dimensional blocking.
• All threads except one are idle during MPI communication. Computation and communication need to be overlapped for better performance (see the sketch after this list).
• Critical sections for shared variables.
• Thread creation overhead.
• Cache coherence and data placement issues.
• Some problems naturally have only one level of parallelism.
• Pure OpenMP code performs worse than pure MPI within a node.
• Lack of optimized OpenMP compilers/libraries.
• Positive and negative experiences:
  • Positive: CAM, MM5, …
  • Negative: NAS, CG, PS, …
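To illustrate the overlap point above: a minimal, hypothetical C sketch (not from the original slides) in which one thread exchanges boundary data while the remaining threads update points that do not depend on it. The names interior and halo and the placeholder update are assumptions made for this example; it needs at least MPI_THREAD_FUNNELED support.

#include <mpi.h>
#include <omp.h>

#define N 1000000

static double interior[N];   /* local data owned by this rank (placeholder) */
static double halo[2];       /* values received from the two neighbours     */

int main(int argc, char *argv[])
{
    int provided, rank, nprocs, left, right;

    /* only one thread (the master) will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    left  = (rank - 1 + nprocs) % nprocs;
    right = (rank + 1) % nprocs;

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            /* master thread exchanges boundary values with the neighbours */
            MPI_Sendrecv(&interior[0],   1, MPI_DOUBLE, left,  0,
                         &halo[1],       1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&interior[N-1], 1, MPI_DOUBLE, right, 1,
                         &halo[0],       1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* the other threads update interior points that do not need
               the halo values, so communication and computation overlap */
            int nworkers = omp_get_num_threads() - 1;
            int w = omp_get_thread_num() - 1;        /* 0 .. nworkers-1 */
            for (int i = 1 + w; i < N - 1; i += nworkers)
                interior[i] *= 0.999;                /* placeholder work */
        }
        /* implicit barrier here: halo values are valid for the next phase */
    }

    MPI_Finalize();
    return 0;
}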
A Pseudo Hybrid Code

      Program hybrid
      call MPI_INIT (ierr)
      call MPI_COMM_RANK (…)
      call MPI_COMM_SIZE (…)
      … some computation and MPI communication
      call OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL DO PRIVATE(i)
!$OMP& SHARED(n)
      do i=1,n
         … computation
      enddo
!$OMP END PARALLEL DO
      … some computation and MPI communication
      call MPI_FINALIZE (ierr)
      end
Loop-based vs. SPMD

SPMD:
!$OMP PARALLEL PRIVATE(start, end, i, thrd_id, num_thrds)
!$OMP& SHARED(a,b,n)
      num_thrds = omp_get_num_threads()
      thrd_id = omp_get_thread_num()
      start = n* thrd_id/num_thrds + 1
      end = n*(thrd_id+1)/num_thrds
      do i = start, end
         a(i)=a(i)+b(i)
      enddo
!$OMP END PARALLEL

Loop-based:
!$OMP PARALLEL DO PRIVATE(i)
!$OMP& SHARED(a,b,n)
      do i=1,n
         a(i)=a(i)+b(i)
      enddo
!$OMP END PARALLEL DO

• SPMD code normally gives better performance than loop-based code (less thread synchronization, fewer cache misses, more compiler optimization opportunities), but it is more difficult to implement.
Hybrid Parallelization Strategies

• From sequential code: decompose with MPI first, then add OpenMP.
• From OpenMP code: treat it as serial code.
• From MPI code: add OpenMP.
• The simplest and least error-prone way is to use MPI outside parallel regions and allow only the master thread to communicate between MPI tasks.
• MPI can also be used inside parallel regions with a thread-safe MPI library (see the sketch below).
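A small illustrative sketch (not from the original slides) of how the thread-support level is usually negotiated before mixing MPI with OpenMP: MPI_THREAD_FUNNELED is enough for the master-thread-only strategy above, while MPI calls by arbitrary threads inside parallel regions would need MPI_THREAD_MULTIPLE.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* request FUNNELED: threads exist, but only the master thread calls MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* the library may provide less than what was requested */
    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("warning: MPI provides thread level %d only\n", provided);

    /* ... OpenMP parallel regions between (funneled) MPI calls go here ... */

    MPI_Finalize();
    return 0;
}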
Examples

• pi Calculation
• Matrix-Vector Multiplication
• 2D Laplace Equation
• 2D Helmholtz Equation
pi – MPI version

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>                    /* MPI header file */

#define NUM_STEPS 100000000

int main(int argc, char *argv[])
{
    int nprocs;
    int myid;
    double start_time, end_time;
    int i;
    double x, pi;
    double sum = 0.0;
    double step = 1.0/(double) NUM_STEPS;

    /* initialize for MPI */
    MPI_Init(&argc, &argv);                    /* starts MPI */
    /* get number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    /* get this process's number (ranges from 0 to nprocs - 1) */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* record start time */
    start_time = MPI_Wtime();

    /* do computation */
    for (i = myid; i < NUM_STEPS; i += nprocs) {    /* changed */
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    sum = step * sum;                               /* changed */
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);                     /* added */

    /* record end time */
    end_time = MPI_Wtime();

    /* print results */
    if (myid == 0) {
        printf("parallel program results with %d processes:\n", nprocs);
        printf("pi = %g  (%17.15f)\n", pi, pi);
        printf("time to compute = %g seconds\n", end_time - start_time);
    }

    /* clean up for MPI */
    MPI_Finalize();
    return 0;
}
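As a follow-up to the MPI version above, a hedged sketch of a hybrid variant of the same pi program (an illustration, not code from the original slides): the cyclic distribution of steps over MPI ranks is kept, OpenMP splits each rank's steps among its threads with a scalar reduction, and MPI is called only outside the parallel region.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define NUM_STEPS 100000000

int main(int argc, char *argv[])
{
    int nprocs, myid, provided, i;
    double x, pi, sum = 0.0;
    double step = 1.0/(double) NUM_STEPS;
    double start_time, end_time;

    /* FUNNELED is enough: MPI is called only outside the parallel region */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    start_time = MPI_Wtime();

    /* each rank takes every nprocs-th step; its threads share that subset */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = myid; i < NUM_STEPS; i += nprocs) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    sum = step * sum;

    /* combine the per-rank partial sums on rank 0 */
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    end_time = MPI_Wtime();

    if (myid == 0) {
        printf("hybrid result with %d processes x %d threads:\n",
               nprocs, omp_get_max_threads());
        printf("pi = %17.15f, time = %g seconds\n", pi, end_time - start_time);
    }

    MPI_Finalize();
    return 0;
}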
Matrix-Vector Multiplication

      c = 0.0
      do j = 1, n_loc
!$OMP PARALLEL DO
!$OMP& SHARED(a,b), PRIVATE(i)
!$OMP& REDUCTION(+:c)
         do i = 1, nrows
            c(i) = c(i) + a(i,j)*b(j)
         enddo
      enddo
      call MPI_REDUCE_SCATTER(c)

• OMP does not support vector reduction here
• Wrong answer since c is shared!
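One way around the missing array reduction, shown as a hedged C sketch with hypothetical names (matvec_local, nrows, n_loc are assumptions, not from the slides): parallelize over the rows of c instead of over the columns, so each c[i] is written by exactly one thread and no array reduction is needed; the partial vectors are still combined across MPI tasks afterwards, e.g. with MPI_Reduce_scatter as above.

/* local part of c = A*b: this rank owns n_loc columns of A (column-major,
   a[i + nrows*j]) and the matching n_loc entries of b                     */
void matvec_local(int nrows, int n_loc,
                  const double *a, const double *b, double *c)
{
    int i, j;

    #pragma omp parallel for private(j)
    for (i = 0; i < nrows; i++) {
        double tmp = 0.0;
        for (j = 0; j < n_loc; j++)
            tmp += a[i + nrows*j] * b[j];
        c[i] = tmp;        /* no race: row i is written by one thread only */
    }
    /* across MPI tasks the partial vectors are then summed and distributed,
       e.g. with MPI_Reduce_scatter, as in the Fortran fragment above       */
}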
Jacobi Solver – OpenMP

subroutine jacobi (n,m,dx,dy,alpha,omega,u,f,tol,maxit)
... (while loop)

#pragma omp parallel private(resid, i)
{
    #pragma omp for
    for (j=0; j<m; j++)
        for (i=0; i<n; i++)
            uold[i + m*j] = u[i + m*j];

    /* compute stencil, residual and update */
    #pragma omp for reduction(+:error)
    for (j=1; j<m-1; j++)
        for (i=1; i<n-1; i++) {
            resid = ( ax * (uold[i-1 + m*j] + uold[i+1 + m*j])
                    + ay * (uold[i + m*(j-1)] + uold[i + m*(j+1)])
                    + b * uold[i + m*j] - f[i + m*j]) / b;
            /* update solution */
            u[i + m*j] = uold[i + m*j] - omega * resid;
            /* accumulate residual error */
            error = error + resid*resid;
        }
}   /* end parallel */
Jacobi Solver – MPI

      subroutine jacobi (n,m,mlo,mhi,dx,dy,alpha,omega,u,f,tol,maxit)
      include "mpif.h"
      common / mpicom/ me, np
      …(communication, MPI_IRECV, MPI_ISEND, etc)
      do j=mlo+1,mhi-1
         do i=1,n
            uold(i,j) = u(i,j)
         enddo
      enddo
      call MPI_WAITALL ( reqcnt, reqary, reqstat, ierr)
      if ( ierr .ne. 0 ) then
         do i = 1, reqcnt
            print *, me, ': ', i, (reqstat(j,i),j=1,MPI_STATUS_SIZE)
         end do
      end if
! Compute stencil, residual, & update
      do j = mlo+1,mhi-1
         do i = 2,n-1
!           Evaluate residual
            resid = (ax*(uold(i-1,j) + uold(i+1,j))      &
     &             + ay*(uold(i,j-1) + uold(i,j+1))      &
     &             + b * uold(i,j) - f(i,j))/b
!           Update solution
            u(i,j) = uold(i,j) - omega * resid
!           Accumulate residual error
            error = error + resid*resid
         end do
      enddo
      error_local = error
      call MPI_ALLREDUCE ( error_local, error, 1,         &
     &     MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
Jacobi Solver – MPI+OpenMP

      subroutine jacobi (n,m,mlo,mhi,dx,dy,alpha,omega,u,f,tol,maxit)
      use omp_lib
      include "mpif.h"
      common / mpicom/ me, np
      … (MPI communications, etc)
!$omp parallel
!$omp do
      do j=mlo+1,mhi-1
         do i=1,n
            uold(i,j) = u(i,j)
         enddo
      enddo
!$omp end do
!$omp end parallel
      call MPI_WAITALL ( reqcnt, reqary, reqstat, ierr)
      if ( ierr .ne. 0 ) then
         do i = 1, reqcnt
            print *, me, ': ', i, (reqstat(j,i),j=1,MPI_STATUS_SIZE)
         end do
      end if
! Compute stencil, residual, & update
      …..

In the main code, define the number of threads:

      call getenv ( 'OMP_NUM_THREADS', cnthreads)
      read (cnthreads,'(i2)') nthreads
      call MPI_BCAST(nthreads,1,MPI_INTEGER,0, MPI_COMM_WORLD,ierr)
      call omp_set_num_threads(nthreads)
Jacobi Solver – MPI+OpenMP: continued

!$omp parallel
!$omp do private(resid) reduction(+:error)
      do j = mlo+1,mhi-1
         do i = 2,n-1
!           Evaluate residual
            resid = (ax*(uold(i-1,j) + uold(i+1,j))      &
     &             + ay*(uold(i,j-1) + uold(i,j+1))      &
     &             + b * uold(i,j) - f(i,j))/b
!           Update solution
            u(i,j) = uold(i,j) - omega * resid
!           Accumulate residual error
            error = error + resid*resid
         end do
      enddo
!$omp end do nowait
!$omp end parallel
      error_local = error
      call MPI_ALLREDUCE ( error_local, error, 1,         &
     &     MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
      ……
Summary

• Many contemporary parallel computers consist of a collection of multiprocessor nodes.
• OpenMP lets us take advantage of shared memory within a node to reduce communication overhead.
• Pure OpenMP performing better than pure MPI within a node is a necessary condition for the hybrid code to beat pure MPI across nodes.
• Whether the hybrid code performs better than the MPI code depends on whether the communication advantage outweighs the thread overhead and other costs.
• There are more positive experiences with hybrid MPI/OpenMP parallel paradigms now, which is encouraging for adopting the hybrid paradigm in your own applications.