MPI and OpenMP By: Jesus Caban and Matt McKnight
What is MPI? • MPI: Message Passing Interface • It is not a new programming language; it is a library of functions that can be called from C/Fortran/Python • Successor to PVM (Parallel Virtual Machine) • Developed by an open, international forum with representation from industry, academia, and government laboratories.
What is it for? • Allows data to be passed between processes in a distributed memory environment • Provides source-code portability • Allows efficient implementations • Provides a great deal of functionality • Supports heterogeneous parallel architectures
MPI Communicator • Idea: a group of processes that are allowed to communicate with each other • The most often used communicator is MPI_COMM_WORLD, which contains all processes • Note the MPI naming convention: constants are written MPI_XXX, functions are written MPI_Xxx and are called as var = MPI_Xxx(parameters); or MPI_Xxx(parameters);
Getting Started • Include MPI header file • Initialize MPI environment • Work: make message passing calls (Send, Receive) • Terminate MPI environment
Include File • Include the MPI header file:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv){
  …
}
Initialize MPI • Initialize the MPI environment:
int main(int argc, char** argv){
  int numtasks, rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  ...
}
Initialize MPI (cont.) • MPI_Init(&argc, &argv): no MPI functions may be called before this call. • MPI_Comm_size(MPI_COMM_WORLD, &nump): a communicator is a collection of processes that can send messages to each other. MPI_COMM_WORLD is a predefined communicator that consists of all the processes running when program execution begins. • MPI_Comm_rank(MPI_COMM_WORLD, &myrank): lets a process find out its own rank.
Terminate MPI Environment • Terminate the MPI environment. No MPI functions may be called after this call:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv){
  …
  MPI_Finalize();
}
Let's Work with MPI • Work: make message passing calls (Send, Receive):
if(my_rank != 0){
  MPI_Send(data, strlen(data)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
} else {
  MPI_Recv(data, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
}
Work (cont.)
int MPI_Send(void* message, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

int MPI_Recv(void* message, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status* status)
Hello World!!
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
  int my_rank, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  if (my_rank != 0) {
    /* Create message */
    sprintf(message, "Hello from process %d!", my_rank);
    dest = 0;
    MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  } else {
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
  return 0;
}
Compile and Run MPI • Compile • gcc -o hello.exe mpi_hello.c -lmpi • mpicc -o hello.exe mpi_hello.c • Run • mpirun -np 5 hello.exe • Output
$ mpirun -np 5 hello.exe
Hello from process 1!
Hello from process 2!
Hello from process 3!
Hello from process 4!
More MPI Functions • MPI_Bcast(void *m, int s, MPI_Datatype dt, int root, MPI_Comm comm) • Sends a copy of the data in m on the process with rank root to every process in the communicator. • MPI_Reduce(void *operand, void *result, int count, MPI_Datatype datatype, MPI_Op operator, int root, MPI_Comm comm) • Combines the operands stored in the memory referenced by operand using operation operator and stores the result in result on process root. • double MPI_Wtime(void) • Returns a double precision value representing the number of seconds elapsed since some point in the past. • MPI_Barrier(MPI_Comm comm) • Each process in comm blocks until every process in comm has called it. • A small example combining these calls follows.
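The sketch below is not from the slides; it simply ties the four calls above together: rank 0 broadcasts a problem size, every process times its partial sum, and MPI_Reduce combines the results on the root (the problem size and the summed quantity are illustrative).
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
  int rank, np, i, n = 0;
  double local, total, t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  if (rank == 0) n = 1000;                /* root chooses the problem size */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

  MPI_Barrier(MPI_COMM_WORLD);            /* all processes start timing together */
  t0 = MPI_Wtime();
  local = 0.0;
  for (i = rank; i < n; i += np)          /* each process sums its share of 0..n-1 */
    local += (double)i;
  t1 = MPI_Wtime();

  MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("sum = %f, computed in %f secs on %d processes\n", total, t1 - t0, np);
  MPI_Finalize();
  return 0;
}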
More Examples • Trapezoidal Rule: the integral from a to b of a nonnegative function f(x) • Approach: estimate the area by partitioning the region into regular geometric shapes and then adding the areas of the shapes (a parallel sketch is shown below) • Compute Pi
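A sketch of the parallel trapezoidal rule, in the style of the hello-world example above; it is not from the slides, and the integrand f(x), the interval [a, b], and the number of trapezoids n are illustrative (n is assumed divisible by the number of processes).
#include <stdio.h>
#include "mpi.h"

#define f(x) ((x)*(x))                    /* example integrand, assumed */

int main(int argc, char* argv[]) {
  int rank, np, i, n = 1024;              /* n = total number of trapezoids */
  double a = 0.0, b = 1.0;                /* interval [a, b], assumed */
  double h, local_a, local_b, local_sum, total;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  h = (b - a) / n;                        /* width of one trapezoid */
  /* each process integrates a contiguous block of n/np trapezoids */
  local_a = a + rank * (n / np) * h;
  local_b = local_a + (n / np) * h;

  local_sum = (f(local_a) + f(local_b)) / 2.0;
  for (i = 1; i < n / np; i++)
    local_sum += f(local_a + i * h);
  local_sum *= h;

  /* combine the partial integrals on process 0 */
  MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("Integral from %f to %f = %f\n", a, b, total);
  MPI_Finalize();
  return 0;
}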
Compute PI
#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define PI 3.141592653589793238462643
#define PI_STR "3.141592653589793238462643"
#define MAXLEN 40
#define f(x) (4./(1.+ (x)*(x)))

int main(int argc, char *argv[]){
  int N=0, rank, nprocrs, i, answer=1;
  double mypi, pi, h, sum, x, starttime, endtime, runtime, runtime_max;
  char buff[MAXLEN];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("CPU %d saying hello\n", rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocrs);
  if(rank==0)
    printf("Using a total of %d CPUs\n", nprocrs);
Compute PI (cont.)
  while(answer){
    if(rank==0){
      printf("This program computes pi as "
             "4.*Integral{0->1}[1/(1+x^2)]\n");
      printf("(Using PI = %s)\n", PI_STR);
      printf("Input the Number of intervals: N = ");
      fgets(buff, MAXLEN, stdin);
      sscanf(buff, "%d", &N);
      printf("pi will be computed with %d intervals on %d processors.\n",
             N, nprocrs);
    }
    /* Procr 0 = P(0) gives N to all other processors */
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if(N<=0) goto end_program;
Compute PI (cont.)
    starttime = MPI_Wtime();
    sum = 0.0;
    h = 1./N;
    for(i = 1+rank; i <= N; i += nprocrs){
      x = h*(i-0.5);
      sum += f(x);
    }
    mypi = sum*h;
    endtime = MPI_Wtime();
    runtime = endtime - starttime;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&runtime, &runtime_max, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    printf("Procr %d: runtime = %f\n", rank, runtime);
    fflush(stdout);
    if(rank==0){
      printf("For %d intervals, pi = %.14lf, error = %g\n",
             N, pi, fabs(pi-PI));
Compute PI (cont.)
      printf("computed in = %f secs\n", runtime_max);
      fflush(stdout);
      printf("Do you wish to try another run? (y=1;n=0) ");
      fgets(buff, MAXLEN, stdin);
      sscanf(buff, "%d", &answer);
    }
    /* processors wait while P(0) gets new input from user */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Bcast(&answer, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if(!answer) break;
  }
end_program:
  printf("\nProcr %d: Saying good-bye!\n", rank);
  if(rank==0) printf("\nEND PROGRAM\n");
  MPI_Finalize();
  return 0;
}
Compile and Run Example 2 • Compile • gcc -o pi.exe pi.c -lmpi • Run
$ mpirun -np 2 pi.exe
Procr 1 saying hello.
Procr 0 saying hello
Using a total of 2 CPUs
This program computes pi as 4.*Integral{0->1}[1/(1+x^2)]
(Using PI = 3.141592653589793238462643)
Input the Number of intervals: N = 10
pi will be computed with 10 intervals on 2 processors
Procr 0: runtime = 0.000003
Procr 1: runtime = 0.000003
For 10 intervals, pi = 3.14242598500110, error = 0.000833331
computed in = 0.000003 secs
What is OpenMP? • Similar to MPI, but used for shared memory parallelism • Simple set of directives • Incremental parallelism • Unfortunately it only works with proprietary compilers…
Compilers and Platforms (taken from www.openmp.org)
• Fujitsu/Lahey Fortran, C and C++: Intel Linux Systems, Sun Solaris Systems
• HP Fortran, C, aC++: HP-UX PA-RISC/Itanium; HP Tru64 Unix (Fortran, C, C++)
• IBM XL Fortran and C from IBM: IBM AIX Systems
• Intel C++ and Fortran Compilers from Intel: Intel IA32 Linux Systems, Intel IA32 Windows Systems, Intel Itanium-based Linux Systems, Intel Itanium-based Windows Systems
• Guide Fortran and C/C++ from Intel's KAI Software Lab: Intel Linux Systems, Intel Windows Systems
• PGF77 and PGF90 Compilers from The Portland Group, Inc. (PGI): Intel Linux Systems, Intel Solaris Systems, Intel Windows/NT Systems
• SGI MIPSpro 7.4 Compilers: SGI IRIX Systems
• Sun Microsystems Sun ONE Studio 8, Compiler Collection, Fortran 95, C, and C++: Sun Solaris Platforms (Compiler Collection Portal)
• VAST from Veridian Pacific-Sierra Research: IBM AIX Systems, Intel IA32 Linux Systems, Intel Windows/NT Systems, SGI IRIX Systems, Sun Solaris Systems
How do you use OpenMP? • C/C++ API • Parallel Construct: when a 'region' of the program can be executed by multiple parallel threads, this fundamental construct starts the execution (a short example follows).
#pragma omp parallel [clause[ [, ]clause] …] new-line
structured-block
The clause is one of the following: if (scalar-expression), private (variable-list), firstprivate (variable-list), default (shared | none), shared (variable-list), copyin (variable-list), reduction (operator: variable-list), num_threads (integer-expression)
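A minimal sketch of the parallel construct using a few of these clauses; it is not from the slides, and the thread count and variable names are illustrative.
#include <stdio.h>
#include <omp.h>

int main(void) {
  int tid, counter = 0;

  /* the structured block below is executed once by each of 4 threads;
     tid is private to each thread, counter is shared by all of them */
  #pragma omp parallel num_threads(4) private(tid) shared(counter)
  {
    tid = omp_get_thread_num();
    printf("Hello from thread %d\n", tid);
    #pragma omp atomic
    counter++;
  }
  printf("%d threads said hello\n", counter);
  return 0;
}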
Fundamental Constructs • for Construct • Defines an iterative work-sharing construct in which the iterations of the associated loop are executed in parallel. • sections Construct • Identifies a noniterative work-sharing construct that specifies a set of constructs to be divided among the threads, each section being executed once by one of the threads.
single Construct • Associates a structured block's execution with only one thread • parallel for Construct • Shortcut for a parallel region containing only one for directive • parallel sections Construct • Shortcut for a parallel region containing only a single sections directive • A sketch of the work-sharing constructs follows.
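A small sketch, not from the slides, showing the parallel for and parallel sections constructs; the array size and loop bodies are illustrative.
#include <stdio.h>
#include <omp.h>
#define N 8                               /* illustrative array size */

int main(void) {
  int i, a[N], b[N];

  /* parallel for: the N iterations are divided among the threads */
  #pragma omp parallel for
  for (i = 0; i < N; i++)
    a[i] = i * i;

  /* parallel sections: each section is executed once, by one of the threads */
  #pragma omp parallel sections
  {
    #pragma omp section
    {
      int j;
      for (j = 0; j < N; j++) b[j] = 2 * a[j];
    }
    #pragma omp section
    {
      printf("other section ran on thread %d\n", omp_get_thread_num());
    }
  }

  for (i = 0; i < N; i++) printf("%d ", b[i]);
  printf("\n");
  return 0;
}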
Master and Synchronization Directives • master Construct • Specifies a structured block that is executed by the master thread of the team • critical Construct • Restricts execution of the associated structured block to a single thread at a time • barrier Directive • Synchronizes all threads in a team: when this construct is encountered, each thread waits until all the others have reached this point.
atomic Construct • Ensures that a specific memory location is updated 'atomically' (only one thread is allowed write access at a time) • flush Directive • Specifies a "cross-thread" sequence point at which all threads in a team are ensured a "clean" view of certain objects in memory • ordered Construct • The structured block following this directive is executed in the same order as the iterations of the corresponding sequential loop • A sketch of the synchronization constructs follows.
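A sketch, not from the slides, illustrating atomic, critical, barrier, and master together; the shared counters are illustrative.
#include <stdio.h>
#include <omp.h>

int main(void) {
  int hits = 0;
  double total = 0.0;

  #pragma omp parallel
  {
    int tid = omp_get_thread_num();

    /* atomic: the single update to hits is performed by one thread at a time */
    #pragma omp atomic
    hits++;

    /* critical: the whole block is executed by one thread at a time */
    #pragma omp critical
    {
      total += tid * 0.5;
      printf("thread %d entered the critical section\n", tid);
    }

    /* barrier: no thread proceeds until all have updated the counters */
    #pragma omp barrier

    /* master: only the master thread reports the final values */
    #pragma omp master
    printf("master thread sees hits = %d, total = %f\n", hits, total);
  }
  return 0;
}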
Data • How do we control the data in this SMP environment? • threadprivate Directive • Makes file-scope and namespace-scope variables private to each thread • Data-Sharing Attributes • private: private to each thread • firstprivate • lastprivate • shared: shared among all threads • default: lets the user set the default data-sharing attribute • reduction: perform a reduction on scalars • copyin: assign the same value to threadprivate variables • copyprivate: broadcast the value of a private variable from one member of a team to the others • See the reduction sketch below.
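A minimal sketch of the private and reduction clauses, not from the slides: a dot product of two illustrative arrays (the same kind of kernel timed in the scalability test on the next slide).
#include <stdio.h>
#include <omp.h>
#define N 1024                            /* illustrative vector length */

int main(void) {
  int i;
  double x[N], y[N], dot = 0.0;

  for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

  /* each thread keeps a private partial sum for dot; the partial sums
     are combined with + when the parallel loop finishes */
  #pragma omp parallel for private(i) reduction(+:dot)
  for (i = 0; i < N; i++)
    dot += x[i] * y[i];

  printf("dot = %f (expected %f)\n", dot, 2.0 * N);
  return 0;
}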
Scalability test on SGI Origin 2000 Timing results of the dot product test in milliseconds for n = 16 * 1024. www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html
Timing results of matrix times matrix test in milliseconds for n = 128 www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html
Architecture comparison From http://www.csm.ornl.gov/~dunigan/sgi/
References • Book: Parallel Programming with MPI, Peter Pacheco • www-unix.mcs.anl.gov/mpi • http://alliance.osc.edu/impi/ • http://rocs.acomp.usf.edu/tut/mpi.php • http://www.lam-mpi.org/tutorials/nd/ • www.openmp.org • www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html