

  1. MPI – An Introduction
     Kadin Tseng
     Scientific Computing and Visualization Group
     Boston University

  2. SCV Computing Facilities
  • SGI Origin2000 – 192 processors
    • Lego (32), Slinky (32), Jacks (64), Playdoh (64)
    • R10000, 195 MHz
  • IBM SP – four 16-processor nodes
    • Hal – login node
    • Power3, 375 MHz
  • IBM Regatta – three 32-processor nodes
    • Twister – login node (currently in friendly-user phase)
    • Power4, 1.3 GHz
  • Linux cluster – 38 2-processor nodes
    • Skate and Cootie – two login nodes
    • 1.3 GHz Intel Pentium 3 (Myrinet interconnect)

  3. Useful SCV Info
  • SCV home page: http://scv.bu.edu/
  • Batch queues: http://scv.bu.edu/SCV/scf-techsumm.html
  • Resource applications
    • MARINER: http://acct.bu.edu/FORMS/Mariner-Pappl.html
    • Alliance: http://www.ncsa.uiuc.edu/alliance/applying/
  • Help
    • Online FAQs
    • Web-based tutorials (MPI, OpenMP, F90, MATLAB, graphics tools)
    • HPC consultations by appointment
      • Doug Sondak (sondak@bu.edu)
      • Kadin Tseng (kadin@bu.edu)
    • help@tonka.bu.edu, bugs@tonka.bu.edu
  • Origin2000 repository: http://scv.bu.edu/SCV/Origin2000/
  • IBM SP and Regatta repository: http://scv.bu.edu/SCV/IBMSP/
  • Alliance web-based tutorials: http://foxtrot.ncsa.uiuc.edu:8900/

  4. Multiprocessor Architectures (Linux cluster)
  • Shared memory – between the two processors of each node.
  • Distributed memory – across the 38 nodes; both processors in each node can be used.

  5. Shared Memory Architecture
  [Diagram: processors P0, P1, P2, …, Pn all connected to a single shared memory.]

  6. Distributed Memory Architecture
  [Diagram: nodes N0, N1, N2, …, Nn, each with its own local memory M0, M1, M2, …, Mn, connected through an interconnect. The interconnect is Myrinet (or Ethernet).]

  7. Parallel Computing Paradigms
  • Message passing (MPI, PVM, …) – distributed and shared memory
  • Directives (OpenMP, …) – shared memory
  • Multi-level parallel programming (MPI + OpenMP) – distributed/shared memory

  8. MPI Topics to Cover
  • Fundamentals
  • Basic MPI functions
  • Nonblocking send/receive
  • Collective communications
  • Virtual topologies
  • Virtual topology example
  • Solution of the Laplace equation

  9. What is MPI?
  • MPI stands for Message Passing Interface.
  • It is a library of subroutines/functions, not a computer language.
  • These subroutines/functions are callable from Fortran or C programs.
  • The programmer writes Fortran/C code, inserts the appropriate MPI subroutine/function calls, compiles, and finally links with the MPI message-passing library.
  • In general, MPI codes run on shared-memory multiprocessors, distributed-memory multicomputers, clusters of workstations, or heterogeneous clusters of the above.
  • The current standard is MPI-1; MPI-2 will be available in the near future.

  10. Why MPI?
  • To enable more analyses in a prescribed amount of time.
  • To reduce the time required for one analysis.
  • To increase the fidelity of physical modeling.
  • To have access to more memory.
  • To provide efficient communication between nodes of networks of workstations, …
  • To enhance code portability; the same code works on both shared- and distributed-memory machines.
  • For “embarrassingly parallel” problems, such as some Monte Carlo applications, parallelizing with MPI can be trivial, with near-linear (or even superlinear) scaling.

  11. MPI Preliminaries
  • MPI’s pre-defined constants, function prototypes, etc., are included in a header file. This file must be included in your code wherever MPI function calls appear (in “main” and in user subroutines/functions):
    • #include "mpi.h" for C codes
    • #include "mpi++.h" for C++ codes
    • include "mpif.h" for Fortran 77 and Fortran 9x codes
  • MPI_Init must be the first MPI function called.
  • Terminate MPI by calling MPI_Finalize.
  • These two functions must be called once and only once in the user code. (A minimal skeleton follows below.)
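  A minimal sketch of the program structure these rules imply, assuming MPICH’s mpicc wrapper for compilation; the actual work would go between MPI_Init and MPI_Finalize:

      #include <stdio.h>
      #include "mpi.h"                        /* MPI constants and prototypes */

      int main(int argc, char *argv[])
      {
        int rank, size;

        MPI_Init(&argc, &argv);               /* must precede every other MPI call */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id, 0 .. size-1    */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes         */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                       /* no MPI calls allowed after this   */
        return 0;
      }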

  12. MPI Preliminaries (continued)
  • C is a case-sensitive language. MPI function names always begin with “MPI_”, followed by a specific name whose leading character is capitalized, e.g., MPI_Comm_rank. MPI pre-defined constants are expressed in upper-case characters, e.g., MPI_COMM_WORLD.
  • Fortran is not case-sensitive. No specific case rules apply.
  • MPI Fortran routines return the error status as the last argument of the subroutine call, e.g.,
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  • The error status is returned as the int function value of C MPI functions, e.g.,
        int ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  13. What is a Message?
  • A collection of data (an array) of MPI data types
    • Basic data types such as int/INTEGER, float/REAL
    • Derived data types
  • A message “envelope” – source, destination, tag, communicator (see the annotated send call below)
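  A sketch in C of where the data and the envelope appear in a send call; the buffer name, destination rank, and tag are illustrative assumptions, not values from the slides:

      #include "mpi.h"

      /* fragment: assumes MPI_Init has been called and rank 1 posts a matching receive */
      void send_example(void)
      {
        int work[10] = {0};        /* the data: ten ints                          */
        MPI_Send(work,             /* start of the send buffer                    */
                 10,               /* number of elements                          */
                 MPI_INT,          /* MPI data type of each element               */
                 1,                /* envelope: destination rank                  */
                 99,               /* envelope: message tag                       */
                 MPI_COMM_WORLD);  /* envelope: communicator                      */
        /* the source part of the envelope is implicit: the sender's own rank */
      }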

  14. Modes of Communication
  • Point-to-point communication (see the sketch after this list)
    • Blocking – returns from the call when the task is complete
      • Several send modes; one receive mode
    • Nonblocking – returns from the call without waiting for the task to complete
      • Several send modes; one receive mode
  • Collective communication
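  A minimal sketch, in C, contrasting a blocking and a nonblocking send; the destination rank, tags, and buffer are illustrative assumptions:

      #include "mpi.h"

      /* fragment: assumes MPI_Init has been called and rank 1 posts matching receives */
      void send_both_ways(float value)
      {
        MPI_Request req;
        MPI_Status  status;

        /* blocking send: when the call returns, "value" may safely be reused */
        MPI_Send(&value, 1, MPI_FLOAT, 1, 10, MPI_COMM_WORLD);

        /* nonblocking send: returns immediately; "value" must not be modified
           until MPI_Wait (or MPI_Test) reports that the send has completed */
        MPI_Isend(&value, 1, MPI_FLOAT, 1, 11, MPI_COMM_WORLD, &req);
        /* ... useful computation could overlap with the communication here ... */
        MPI_Wait(&req, &status);
      }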

  15. Essentials of Communication
  • The sender must specify a valid destination.
  • The sender’s and receiver’s data type, tag, and communicator must match.
  • The receiver can receive from a non-specific (but valid) source.
  • The receiver returns an extra (status) parameter that reports info about the message received.
  • The sender specifies the size of the send buffer; the receiver specifies an upper bound on the receive buffer. (A wildcard-receive sketch follows below.)
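  A sketch in C of a receive from a non-specific source; the status object and MPI_Get_count show who actually sent the message and how many elements arrived (buffer size and tag are illustrative assumptions):

      #include <stdio.h>
      #include "mpi.h"

      /* fragment: assumes MPI_Init has been called and some process sends floats with tag 123 */
      void receive_from_anyone(void)
      {
        float      buf[100];               /* upper bound on the incoming message size */
        int        count;
        MPI_Status status;

        MPI_Recv(buf, 100, MPI_FLOAT,      /* receive at most 100 floats               */
                 MPI_ANY_SOURCE,           /* accept a message from any sender         */
                 123, MPI_COMM_WORLD, &status);

        MPI_Get_count(&status, MPI_FLOAT, &count);   /* how many elements actually arrived */
        printf("Got %d floats from process %d (tag %d)\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
      }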

  16. MPI Data Types vs C Data Types
  • MPI types – C types
  • MPI_INT – signed int
  • MPI_UNSIGNED – unsigned int
  • MPI_FLOAT – float
  • MPI_DOUBLE – double
  • MPI_CHAR – char
  • . . .

  17. MPI vs Fortran Data Types
  • MPI_INTEGER – INTEGER
  • MPI_REAL – REAL
  • MPI_DOUBLE_PRECISION – DOUBLE PRECISION
  • MPI_CHARACTER – CHARACTER(1)
  • MPI_COMPLEX – COMPLEX
  • MPI_LOGICAL – LOGICAL
  • …

  18. MPI Data Types
  • MPI_PACKED (a packing sketch follows below)
  • MPI_BYTE
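  These are special types: MPI_PACKED is used with MPI_Pack/MPI_Unpack to bundle mixed data into one message, and MPI_BYTE transfers raw bytes. A minimal sketch of the packed case in C; the buffer size, destination rank, and tag are illustrative assumptions:

      #include "mpi.h"

      /* fragment: assumes MPI_Init has been called; pack one int and one float
         into a single buffer and send it as one MPI_PACKED message */
      void packed_send_example(int ival, float fval)
      {
        char buf[64];
        int  pos = 0;                        /* running position inside buf */

        MPI_Pack(&ival, 1, MPI_INT,   buf, (int)sizeof(buf), &pos, MPI_COMM_WORLD);
        MPI_Pack(&fval, 1, MPI_FLOAT, buf, (int)sizeof(buf), &pos, MPI_COMM_WORLD);
        MPI_Send(buf, pos, MPI_PACKED, 1, 42, MPI_COMM_WORLD);  /* to rank 1, tag 42 */
      }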

  19. Some MPI Implementations
  There are a number of implementations:
  • MPICH (ANL)
    • Version 1.2.1 is the latest.
    • There is a list of supported MPI-2 functionalities.
  • LAM (UND/OSC)
  • CHIMP (EPCC)
  • Vendor implementations (SGI, IBM, …)
    • Don’t worry – as long as the vendor supports the MPI Standard (IBM’s MPL does not).
  • Job execution procedures of these implementations may differ.

  20. Example 1 (Integration)
  We will introduce some fundamental MPI function calls through the computation of a simple integral by the mid-point rule. Here p is the number of partitions and n is the number of increments per partition.

  21. Integrate cos(x) by Mid-point Rule
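  The original slide shows the formula as an image; reconstructing it from the codes that follow, with p partitions of n increments each:

      \int_{0}^{\pi/2} \cos x \, dx \;\approx\;
      \sum_{i=0}^{p-1} \sum_{j=0}^{n-1} h \, \cos\!\left(a_{ij} + \tfrac{h}{2}\right),
      \qquad a_{ij} = a + (i\,n + j)\,h, \qquad h = \frac{b-a}{p\,n},

  with a = 0 and b = π/2; the exact value of the integral is 1 (compare the output on the last slide).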

  22. Example 1 – Serial Fortran code
      Program Example1
      implicit none
      integer n, p, i, j
      real h, result, a, b, integral, pi
      pi = acos(-1.0)    ! = 3.14159...
      a = 0.0            ! lower limit of integration
      b = pi/2.          ! upper limit of integration
      p = 4              ! number of partitions (processes)
      n = 500            ! # of increments in each partition
      h = (b-a)/p/n      ! length of increment
      result = 0.0       ! initialize solution to the integral
      do i=0,p-1         ! integral sum over all partitions
        result = result + integral(a,i,h,n)
      enddo
      print *,'The result =',result
      stop
      end

  23. …Serial Fortran code (cont’d)
      example1.f continues . . .
      real function integral(a, i, h, n)
      implicit none
      integer n, i, j
      real h, h2, aij, a, fct, x
      fct(x) = cos(x)    ! integral kernel (statement function)
      integral = 0.0     ! initialize integral
      h2 = h/2.
      do j=0,n-1         ! sum over all "j" increments
        aij = a + (i*n + j)*h              ! lower limit of increment j
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end

  24. Example 1 – Serial C code
      #include <math.h>
      #include <stdio.h>

      float integral(float a, int i, float h, int n);

      int main()
      {
        int n, p, i;
        float h, result, a, b, pi;

        pi = acos(-1.0);     /* = 3.14159... */
        a  = 0.;             /* lower limit of integration */
        b  = pi/2.;          /* upper limit of integration */
        p  = 4;              /* # of partitions */
        n  = 500;            /* increments in each partition */
        h  = (b-a)/n/p;      /* length of increment */

        result = 0.0;
        for (i=0; i<p; i++) {          /* integral sum over partitions */
          result += integral(a,i,h,n);
        }
        printf("The result =%f\n",result);
        return 0;
      }

  25. …Serial C code (cont’d)
      example1.c continues . . .
      float integral(float a, int i, float h, int n)
      {
        int j;
        float h2, aij, integ;

        integ = 0.0;                 /* initialize integral */
        h2 = h/2.;
        for (j=0; j<n; j++) {        /* integral within this partition */
          aij = a + (i*n + j)*h;     /* lower limit of increment j */
          integ += cos(aij+h2)*h;
        }
        return integ;
      }

  26. Example 1 – Parallel Fortran code
  • MPI functions used for this example:
    • MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
    • MPI_Send, MPI_Recv

      PROGRAM Example1_1
      implicit none
      integer n, p, i, j, ierr, master, Iam
      real h, result, a, b, integral, pi
      include "mpif.h"   ! pre-defined MPI constants, ...
      integer source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      data master/0/     ! 0 is the master processor responsible
                         ! for collecting integral sums ...

  27. …Parallel Fortran code (cont’d)
! Starts MPI processes ...
      call MPI_Init(ierr)
! Get current process id
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
! Get number of processes from command line
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
! Executable statements before MPI_Init are not advisable;
! their side effects are implementation-dependent
      pi = acos(-1.0)    ! = 3.14159...
      a = 0.0            ! lower limit of integration
      b = pi/2.          ! upper limit of integration
      n = 500            ! increments in each process
      h = (b - a)/p/n    ! length of increment
      dest = master      ! proc to compute final result
      tag = 123          ! set tag for job

  28. …Parallel Fortran code (cont’d)
      my_result = integral(a,Iam,h,n)    ! compute local sum
      write(*,"('Process ',i2,' has the partial result of',f10.6)")
     &      Iam, my_result
      if(Iam .eq. master) then
        result = my_result               ! init. final result
        do source=1,p-1                  ! loop to collect local sums (not parallel!)
          call MPI_Recv(my_result, 1, MPI_REAL, source, tag,
     &                  MPI_COMM_WORLD, status, ierr)   ! not safe
          result = result + my_result
        enddo
        print *,'The result =',result
      else
        call MPI_Send(my_result, 1, MPI_REAL, dest, tag,
     &                MPI_COMM_WORLD, ierr)             ! send my_result to dest.
      endif
      call MPI_Finalize(ierr)            ! let MPI finish up ...
      end

  29. Example 1 – Parallel C code
      #include <mpi.h>
      #include <math.h>
      #include <stdio.h>

      float integral(float a, int i, float h, int n);   /* prototype */

      int main(int argc, char *argv[])
      {
        int n, p, i;
        float h, result, a, b, pi, my_result;
        int myid, source, dest, tag;
        MPI_Status status;

        MPI_Init(&argc, &argv);                  /* start MPI processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* current proc. id    */
        MPI_Comm_size(MPI_COMM_WORLD, &p);       /* # of processes      */

  30. …Parallel C code (continued)
        pi = acos(-1.0);     /* = 3.14159... */
        a = 0.;              /* lower limit of integration */
        b = pi/2.;           /* upper limit of integration */
        n = 500;             /* number of increments within each process */
        dest = 0;            /* the process that computes the final result */
        tag = 123;           /* tag to identify this particular job */
        h = (b-a)/n/p;       /* length of increment */

        i = myid;            /* MPI process number range is [0,p-1] */
        my_result = integral(a,i,h,n);
        printf("Process %d has the partial result of %f\n",
               myid, my_result);

  31. …Parallel C code (continued)
        if(myid == 0) {
          result = my_result;
          for (i=1; i<p; i++) {
            source = i;      /* MPI process number range is [0,p-1] */
            MPI_Recv(&my_result, 1, MPI_FLOAT, source, tag,
                     MPI_COMM_WORLD, &status);   /* MPI_FLOAT is the C type for float
                                                    (MPI_REAL is its Fortran counterpart) */
            result += my_result;
          }
          printf("The result =%f\n",result);
        }
        else {
          MPI_Send(&my_result, 1, MPI_FLOAT, dest, tag,
                   MPI_COMM_WORLD);              /* send my_result to "dest" */
        }
        MPI_Finalize();                          /* let MPI finish up ... */
        return 0;
      }

  32. Example1_2 – Parallel Integration
  • MPI functions used for this example:
    • MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
    • MPI_Recv, MPI_Isend, MPI_Wait

      PROGRAM Example1_2
      implicit none
      integer n, p, i, j, k, ierr, master
      real h, a, b, integral, pi
      integer req(1)
      include "mpif.h"   ! brings in pre-defined MPI constants, ...
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result, Total_result, result
      data master/0/

  33. Example1_2 (continued)
c**Starts MPI processes ...
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
      pi = acos(-1.0)    ! = 3.14159...
      a = 0.0            ! lower limit of integration
      b = pi/2.          ! upper limit of integration
      n = 500            ! number of increments within each process
      dest = master      ! define the process that computes the final result
      tag = 123          ! set the tag to identify this particular job
      h = (b-a)/n/p      ! length of increment
      my_result = integral(a,Iam,h,n)    ! integral of process Iam
      write(*,*)'Iam=',Iam,', my_result=',my_result

  34. Example1_2 (continued)
      if(Iam .eq. master) then           ! the following is serial
        result = my_result
        do k=1,p-1
          call MPI_Recv(my_result, 1, MPI_REAL,
     &                  MPI_ANY_SOURCE, tag,            ! "wildcard" source
     &                  MPI_COMM_WORLD, status, ierr)   ! status carries info about the source ...
          result = result + my_result   ! Total = sum of integrals
        enddo
      else
        call MPI_Isend(my_result, 1, MPI_REAL, dest, tag,
     &                 MPI_COMM_WORLD, req, ierr)       ! send my_result to "dest"
        call MPI_Wait(req, status, ierr)                ! wait for nonblocking send ...
      endif
c**results from all procs have been collected and summed ...
      if(Iam .eq. 0) write(*,*)'Final Result =',result
      call MPI_Finalize(ierr)            ! let MPI finish up ...
      stop
      end

  35. Example1_3 – Parallel Integration
  • MPI functions used for this example:
    • MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
    • MPI_Bcast, MPI_Reduce, MPI_SUM

      PROGRAM Example1_3
      implicit none
      integer n, p, i, j, ierr, master
      real h, result, a, b, integral, pi
      include "mpif.h"   ! brings in pre-defined MPI constants, ...
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result
      data master/0/

  36. Example1_3 (continued)
c**Starts MPI processes ...
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
      pi = acos(-1.0)    ! = 3.14159...
      a = 0.0            ! lower limit of integration
      b = pi/2.          ! upper limit of integration
      dest = 0           ! define the process that computes the final result
      tag = 123          ! set the tag to identify this particular job
      if(Iam .eq. master) then
        print *,'The requested number of processors =',p
        print *,'enter number of increments within each process'
        read(*,*)n
      endif
c**Broadcast "n" to all processes
      call MPI_Bcast(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

  37. Example1_3 (continued)
      h = (b-a)/n/p      ! length of increment
      my_result = integral(a,Iam,h,n)
      write(*,"('Process ',i2,' has the partial result of',f10.6)")
     &      Iam, my_result
c**Compute the integral sum; the result goes to process "dest"
      call MPI_Reduce(my_result, result, 1, MPI_REAL, MPI_SUM, dest,
     &                MPI_COMM_WORLD, ierr)
      if(Iam .eq. master) then
        print *,'The result =',result
      endif
      call MPI_Finalize(ierr)            ! let MPI finish up ...
      stop
      end
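  For readers following the C version of Example 1, a hedged sketch of what the equivalent collective calls look like in C (the slides show only the Fortran form; the variable names mirror the Fortran code):

      #include <stdio.h>
      #include "mpi.h"

      /* fragment: assumes MPI_Init has been called; rank "dest" collects the sum */
      void collect_with_collectives(int myid, int dest, float my_result)
      {
        int   n = 0;
        float result = 0.0f;

        if (myid == 0) n = 500;    /* e.g. only the master reads/sets n */

        /* broadcast n from rank 0 to every process in the communicator */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* sum the partial results of all processes into "result" on rank dest */
        MPI_Reduce(&my_result, &result, 1, MPI_FLOAT, MPI_SUM,
                   dest, MPI_COMM_WORLD);

        if (myid == dest) printf("The result = %f\n", result);
      }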

  38. Compilations With MPICH
  • Creating a Makefile is a practical way to compile the programs:

      CC       = mpicc
      CCC      = mpiCC
      F77      = mpif77
      F90      = mpif90
      OPTFLAGS = -O3

      f-example: f-example.o
              $(F77) $(OPTFLAGS) -o f-example f-example.o

      c-example: c-example.o
              $(CC) $(OPTFLAGS) -o c-example c-example.o

  39. Compilations With MPICH (cont’d)
      Makefile continues . . .

      .c.o:
              $(CC) $(OPTFLAGS) -c $*.c
      .f.o:
              $(F77) $(OPTFLAGS) -c $*.f
      .f90.o:
              $(F90) $(OPTFLAGS) -c $*.f90

      .SUFFIXES: .f90

  40. Run Jobs Interactively
      On SCV’s Linux cluster, use mpirun.ch_gm to run MPI jobs:

      skate% mpirun.ch_gm -np 4 example

      (alias mpirun '/usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.7-10enterprise/smp/pgi/ssh/bin/mpirun.ch_gm')

  41. Run Jobs in Batch
      We use PBS (the Portable Batch System) to manage batch jobs:

      % qsub my_pbs_script

      Here qsub is the batch submission command, while my_pbs_script is a user-provided script containing user-specific PBS batch-job info.

  42. My_pbs_script
      #!/bin/bash
      # Set the default queue
      #PBS -q dque
      # Request 4 nodes with 1 processor per node
      #PBS -l nodes=4:ppn=1,walltime=00:10:00

      MYPROG="psor"
      GMCONF=~/.gmpi/conf.$PBS_JOBID
      /usr/local/xcat/bin/pbsnodefile2gmconf $PBS_NODEFILE > $GMCONF
      cd $PBS_O_WORKDIR
      NP=$(head -1 $GMCONF)
      mpirun.ch_gm --gm-f $GMCONF --gm-recv polling --gm-use-shmem \
                   --gm-kill 5 --gm-np $NP PBS_JOBID=$PBS_JOBID $MYPROG

  43. Output of Example1_1
      skate% make -f make.linux example1_1
      skate% mpirun.ch_gm -np 4 example1_1
      Process 1 has the partial result of 0.324423
      Process 2 has the partial result of 0.216773
      Process 0 has the partial result of 0.382683
      Process 3 has the partial result of 0.076120
      The result = 1.000000

      Note that the partial results are not printed in rank order!
