An Introduction to Parallel Computing with the Message Passing Interface Justin T. Newcomer Math 627 – Introduction to Parallel Computing University of Maryland, Baltimore County (UMBC) December 19, 2006 Acknowledgments: Dr. Matthias K. Gobbert, UMBC
The Need for Parallel Computing • To increase computational speed • Programs need to run in a “reasonable” amount of time • Predicting tomorrow's weather can't take 2 days • To increase available memory • Problem sizes continue to increase • To improve the accuracy of results • If we can solve larger problems faster, then we can improve the accuracy of our results • Bottom line: We want to solve larger problems faster and more accurately
Classification of Parallel Systems • In 1966 Michael Flynn classified systems according to the number of instruction streams and data streams • This is known as Flynn's Taxonomy • The two extremes are SISD and MIMD systems • Single-Instruction Single-Data (SISD) • The classical von Neumann machine – CPU and main memory • Multiple-Instruction Multiple-Data (MIMD) • A collection of processors, each operating on its own data stream • Intermediate systems are SIMD and MISD systems • Parallel systems can be shared memory or distributed memory
Single-Program Multiple-Data (SPMD) • The most general form of a MIMD system is one where each process runs a completely different program • In practice this is usually not needed • The “appearance” of each process running a different program is accomplished through branching statements on the process IDs (see the sketch below) • This form of MIMD programming is known as Single-Program Multiple-Data (SPMD) • NOT the same as SIMD (Single-Instruction Multiple-Data) • Message passing is the most common method of programming MIMD systems • This talk will focus on SPMD programs
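A minimal sketch of the SPMD idea (a hypothetical program, not code from this talk): every process runs the same executable, and branching on the rank returned by MPI_Comm_rank makes process 0 behave differently from all other ranks.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, np;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* id of this process        */
    MPI_Comm_size(MPI_COMM_WORLD, &np);    /* total number of processes */

    if (rank == 0) {
        /* "master" branch: e.g., read input and coordinate the work */
        printf("Process 0 of %d: coordinating\n", np);
    } else {
        /* "worker" branch: all other ranks do the computation */
        printf("Process %d of %d: computing\n", rank, np);
    }

    MPI_Finalize();
    return 0;
}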
Parallel Resources at UMBC • math-cluster4 – 8-processor Beowulf cluster • Four dual-processor Linux PCs • 1000 MHz Intel Pentium III processors and 1 GB of memory • Nodes are connected by 100 Mbps Ethernet cables and a dedicated switch • The only machine with a connection to the outside network is pc51 • KALI – 64-processor Beowulf cluster • Purchased using funds from a SCREMS grant from the National Science Foundation with equal cost-sharing from UMBC • Used for research projects including the areas of microelectronics manufacturing, quantum chemistry, computational neurobiology, and constrained mechanical systems • The machine is managed jointly by system administrators from the department and UMBC's Office of Information Technology • Hercules – 8-processor P4 IBM X440 system (MPI not available)
Hardware Specifications of the Cluster KALI • Executive Summary: 64-processor Beowulf cluster with a high-performance Myrinet interconnect • Summary: • Each node: 2 Intel Xeon 2.0 GHz (512 kB L2 cache) with at least 1 GB of memory • 32 computational nodes (31 compute and 1 storage) • High-performance Myrinet network for computations • Ethernet for file serving from a 0.5 TB RAID (= redundant array of independent disks) array • 1 management and user node
The Network Schematic • Management network not shown
Back of the Computational Nodes (Rack H5) • Photos: bottom half and top half of the rack
How to Program a Beowulf Cluster • Memory is distributed across nodes and only accessible by the local CPUs • Total memory: 32 × 1 GB = 32 GB • But: 2 CPUs share the memory of a single node • Should one use both CPUs per node or only one? • Algorithm design: Divide the problem into pieces with as little dependence on each other as possible, then program the communications explicitly using MPI (Message Passing Interface) • Fully portable code • Typical problems: • Domain split: communication of solution values on interfaces (lower-dimensional regions) • Communication at every time step / in every iteration
What is MPI? • The Message Passing Interface (MPI) Forum has developed a standard for programming parallel systems • Rather than specifying a new language (and a new compiler), MPI has taken the form of a library of functions that can be called from a C, C++, or Fortran program • The foundation of this library is the set of functions that can be used to achieve parallelism by message passing • A message passing function is simply a function that explicitly transmits data from one process to another
The Message Passing Philosophy • Parallel programs consist of several processes, each with its own memory, working together to solve a single problem • Processes communicate with each other by passing messages (data) back and forth • Message passing is a powerful and very general method of expressing parallelism • MPI provides a way to create efficient, portable, and scalable parallel code
Message Passing Example – “Hello World”

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank;          /* Rank of process */
    int np;               /* Number of processes */
    int source;           /* Rank of sender */
    int dest;             /* Rank of receiver */
    int tag = 0;          /* Tag for messages */
    char message[100];    /* Storage for message */
    MPI_Status status;    /* Return status for receive */

    /* Initialize MPI */
    MPI_Init(&argc, &argv);

    /* Acquire current process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Acquire number of processes being used */
    MPI_Comm_size(MPI_COMM_WORLD, &np);
Message Passing Example (continued)

    if (my_rank != 0) {
        sprintf(message, "Greetings from process %d! I am one of %d processes.",
                my_rank, np);
        dest = 0;
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }
    else {
        for (source = 1; source < np; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }

    MPI_Finalize();
    return 0;
}
Example Output (process 0 receives and prints the greetings from the other 15 processes)

Greetings from process 1! I am one of 16 processes.
Greetings from process 2! I am one of 16 processes.
Greetings from process 3! I am one of 16 processes.
Greetings from process 4! I am one of 16 processes.
Greetings from process 5! I am one of 16 processes.
Greetings from process 6! I am one of 16 processes.
Greetings from process 7! I am one of 16 processes.
Greetings from process 8! I am one of 16 processes.
Greetings from process 9! I am one of 16 processes.
Greetings from process 10! I am one of 16 processes.
Greetings from process 11! I am one of 16 processes.
Greetings from process 12! I am one of 16 processes.
Greetings from process 13! I am one of 16 processes.
Greetings from process 14! I am one of 16 processes.
Greetings from process 15! I am one of 16 processes.
Available Compilers on KALI • Two suites of compilers are available on kali, one from Intel and one from the GNU project • The Intel compilers are icc for C/C++ and ifort for Fortran 90/95 • The GNU compilers are gcc for C/C++ and g77 for Fortran 77 • You can list all available MPI implementations with > switcher mpi --list This should list lam-7.0.6, mpich-ch_gm-icc-1.2.5.9, and mpich-ch_gm-gcc-1.2.5.9 • You can show the current MPI implementation with > switcher mpi --show • If you want to switch to another MPI implementation, for instance to use the Intel compiler suite, say > switcher mpi = mpich-ch_gm-icc-1.2.5.9
Compiling and Linking MPI Code on KALI • Let's assume that you have a C code sample.c that contains some MPI commands. The compilation and linking of your code should work just like on other Linux clusters using the mpich implementation of MPI. Hence, compile and link the code both in one step by > mpicc -o sample sample.c • If your code includes mathematical functions (like exp, cos, etc.), you need to link to the mathematics library libm.so. This is done, just like for serial compiling, by adding -lm to the end of your combined compile and link command, that is, > mpicc -o sample sample.c -lm In a similar fashion, other libraries can be linked • See the man page of mpicc for more information by saying > man mpicc • Finally, to be doubly sure which compiler is accessed by your MPI compile script, you can use the -show option as in > mpicc -o sample sample.c -show
Submitting a Job on KALI • KALI uses the TORQUE resource manager and the Maui scheduler, both of which are open source programs • A job, that is, an executable with its command-line arguments, is submitted to the scheduler with the qsub command • In the directory in which you want to run your code, you need to create a script file that tells the scheduler more details about how to start the code, what resources you need, where to send output, and some other items • This script is used as the command-line argument to the qsub command by saying > qsub qsub-script
Example qsub Script • Let's call this file qsub-script in this example. It should look like this:

#!/bin/bash
:
: The following is a template for job submission for the
: Scheduler on kali.math.umbc.edu
:
: This defines the name of your job
#PBS -N MPI_Sample
: This is the path
#PBS -o .
#PBS -e .
#PBS -q workq
#PBS -l nodes=8:myrinet:ppn=2
cd $PBS_O_WORKDIR
mpiexec -nostdout <-pernode> Sample
The qstat command • Once you have submitted your job to the scheduler, you will want to confirm that it has been entered into the queue. Use qstat at the command line to get output similar to this:

                                                          Req'd  Req'd   Elap
Job ID       Username Queue Jobname    SessID NDS TSK Memory   Time S   Time
------------ -------- ----- ---------- ------ --- --- ------ ------ - ------
635.mgtnode  gobbert  workq MPI_DG       2320   8   1     --  10000 R  716:0
636.mgtnode  gobbert  workq MPI_DG       2219   8   1     --  10000 R  716:1
665.mgtnode  gobbert  workq MPI_Nodesu     --  16   1     --  10000 Q     --
704.mgtnode  gobbert  workq MPI_Nodesu  12090  15   1     --  10000 E  00:00
705.mgtnode  kallen1  workq MPI_Aout       --   1   1     --  10000 Q     --
706.mgtnode  gobbert  workq MPI_Nodesu     --  15   1     --  10000 Q     --
707.mgtnode  gobbert  workq MPI_Nodesu     --  15   1     --  10000 Q     --

• The most interesting column is the one titled S for “status” • The letter Q indicates that your job has been queued, that is, it is waiting for resources to become available and will then be executed • The letter R indicates that your job is currently running • The letter E says that your job is exiting; this will appear during the shut-down phase, after the job has actually finished execution
An Application: The Poisson Problem • The Poisson problem is a partial differential equation that is discretized by the finite difference method using a five-point stencil • The Poisson problem can be expressed by the equations -Laplace(u) = f in the domain Ω with u = 0 on the boundary ∂Ω • Applying the finite difference discretization and the iterative Jacobi method gives the numerical solution u_{i,j}^(k+1) = ( u_{i-1,j}^(k) + u_{i+1,j}^(k) + u_{i,j-1}^(k) + u_{i,j+1}^(k) + h^2 f_{i,j} ) / 4 • Here we have partitioned the domain into a mesh grid of dimension N × N with mesh spacing h • This produces a sparse system matrix of dimension N² × N² • This does not present a problem, however, because applying the Jacobi method to this problem gives us a matrix-free solution method
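To illustrate the matrix-free idea, here is a minimal serial sketch of one Jacobi sweep on the N × N interior mesh (the function name, array layout, and mesh spacing h = 1/(N+1) are assumptions for this sketch, not the code used in the talk):

/* One matrix-free Jacobi sweep for -Laplace(u) = f on an N x N interior mesh.
   u, u_new, and f are stored row by row; points outside the interior are
   taken as 0 (homogeneous Dirichlet boundary conditions). */
void jacobi_sweep(const double *u, const double *f, double *u_new, int N)
{
    double h = 1.0 / (N + 1);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double left  = (j > 0)   ? u[i*N + (j-1)] : 0.0;
            double right = (j < N-1) ? u[i*N + (j+1)] : 0.0;
            double below = (i > 0)   ? u[(i-1)*N + j] : 0.0;
            double above = (i < N-1) ? u[(i+1)*N + j] : 0.0;
            /* five-point stencil update: average of the four neighbors plus h^2 * f */
            u_new[i*N + j] = 0.25 * (left + right + below + above
                                     + h*h * f[i*N + j]);
        }
    }
}

In the parallel version described next, each process applies this same update to its own block of rows of the mesh.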
Parallel Implementation • The mesh grid is distributed among processes by blocks of rows of the mesh • The boundary points then must be communicated to neighboring processes at each iteration to obtain the updates
The Parallel Algorithm • The generic algorithm involves four steps per iteration:
1) Post communication requests to exchange neighboring points
2) Calculate Jacobi iteration on all local interior points
3) Wait for the exchange of neighboring points to complete
4) Calculate Jacobi iteration on all local boundary points
• The exchange of boundary points is accomplished as follows:
process = 0:
    send top row of local mesh to process 1
    receive above points from process 1
process = i, 0 < i < np-1:
    send top row of local mesh to process i+1
    send bottom row of local mesh to process i-1
    receive above points from process i+1
    receive below points from process i-1
process = np-1:
    send bottom row of local mesh to process np-2
    receive below points from process np-2
Sample Code

MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime();

err = ((double) 1) + tol;
it = 0;
while( (err > tol) && (it < itmax) ) {
    it = it + 1;
    if (it > 1) { update(l_u, l_new_u, l_N, N); }
    local_jacobi_sweep(l_u, l_new_u, N, l_N, id);
    exchange1(l_u, below_points, above_points, N, l_N, np, id);
    bound_jacobi_sweep(l_u, below_points, above_points, l_new_u, N, l_N, id);
    l_err = vector_norm(l_u, l_new_u, N, l_N);
    MPI_Allreduce(&l_err, &err, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    err = sqrt(err);
}

MPI_Barrier(MPI_COMM_WORLD);
finish = MPI_Wtime();
The Exchange Function

void exchange1(double *l_u, double *below_points, double *above_points,
               int n, int l_n, int np, int id)
{
    int idbelow, idabove;
    MPI_Status status;

    get_neighbors(&idbelow, &idabove, id, np);

    if (id%2 == 0) {
        MPI_Send(&(l_u[(l_n-1)*n]), n, MPI_DOUBLE, idabove, 0, MPI_COMM_WORLD);
        MPI_Recv(below_points, n, MPI_DOUBLE, idbelow, 0, MPI_COMM_WORLD, &status);
        MPI_Send(l_u, n, MPI_DOUBLE, idbelow, 1, MPI_COMM_WORLD);
        MPI_Recv(above_points, n, MPI_DOUBLE, idabove, 1, MPI_COMM_WORLD, &status);
    }
    else {
        MPI_Recv(below_points, n, MPI_DOUBLE, idbelow, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&(l_u[(l_n-1)*n]), n, MPI_DOUBLE, idabove, 0, MPI_COMM_WORLD);
        MPI_Recv(above_points, n, MPI_DOUBLE, idabove, 1, MPI_COMM_WORLD, &status);
        MPI_Send(l_u, n, MPI_DOUBLE, idbelow, 1, MPI_COMM_WORLD);
    }
}
Collective Communications • MPI also includes specialized functions that allow collective communication. Collective communication is communication that involves all processes • In the Poisson problem example we need to calculate the Euclidean vector norm of the difference between two successive iterates at each iteration: ||u^(k+1) - u^(k)||_2 = sqrt( sum_i ( u_i^(k+1) - u_i^(k) )² ) • The partial sums of squares are computed locally on each process • Once computed, the norm must be available on every process
MPI_Allreduce • The MPI_Allreduce function is a collective communication function provided by MPI that accomplishes exactly this task • It reduces the local pieces of the norm using the MPI_SUM operation and then broadcasts the result to all processes • The MPI_Allreduce function can also use other operations such as maximum, minimum, product, etc. • Other collective communication functions: MPI_Bcast, MPI_Reduce, MPI_Gather, MPI_Scatter, MPI_Allgather

l_err = vector_norm(l_u, l_new_u, N, l_N);
MPI_Allreduce(&l_err, &err, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
err = sqrt(err);
Non-Blocking Communication • Non-blocking communications allow us to post the communications first and return immediately to perform local computations:

exchange2(l_u, below_points, above_points, N, l_N, np, id, requests);
local_jacobi_sweep(l_u, l_new_u, N, l_N, id);
MPI_Waitall(4, requests, status);
bound_jacobi_sweep(l_u, below_points, above_points, l_new_u, N, l_N, id);

• The exchange function is similar to the previous one, except that it uses the non-blocking functions MPI_Isend and MPI_Irecv • Note: The non-blocking functions also eliminate the need to order the send and receive calls

void exchange2(double *l_u, double *below_points, double *above_points,
               int n, int l_n, int np, int id, MPI_Request *requests)
{
    int idbelow, idabove;

    get_neighbors(&idbelow, &idabove, id, np);

    MPI_Irecv(below_points, n, MPI_DOUBLE, idbelow, 0, MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(above_points, n, MPI_DOUBLE, idabove, 1, MPI_COMM_WORLD, &requests[1]);
    MPI_Isend(&(l_u[(l_n-1)*n]), n, MPI_DOUBLE, idabove, 0, MPI_COMM_WORLD, &requests[2]);
    MPI_Isend(l_u, n, MPI_DOUBLE, idbelow, 1, MPI_COMM_WORLD, &requests[3]);
}
Performance Measures for Parallel Computing • Speedup Sp(N) = T1(N) / Tp(N): How much faster are p processors than 1 processor (for a problem of fixed size N)? Optimal value: Sp(N) = p • Efficiency Ep(N) = Sp(N) / p: How close to optimal is the speedup? Optimal value: Ep(N) = 1 = 100% • Here Tp(N) = time for problem size N on p processors • Speedup and efficiency for a fixed problem size are tough measures of parallel performance, because inevitably communication will eventually dominate (for truly parallel code)
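As a hypothetical worked example (made-up timings, not measurements from this talk): if T1(N) = 120 s on one processor and T8(N) = 20 s on p = 8 processors, then S8(N) = 120 / 20 = 6 and E8(N) = 6 / 8 = 0.75 = 75%, i.e., the run achieves 75% of the ideal 8-fold speedup.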
Issue: What is Tp(N)? • A parallel program spends time in calculations and in communications. Communication time is affected by latency (initialization of communications) and bandwidth (throughput capability) • Fundamental problem of parallel computing: Communications hurt but are unavoidable (for a truly parallel algorithm), hence we must include them in our timings; wall clock time is used, not CPU time • What wall clock time? Additional issues: OS delays, MPI/network startup, file access for input (1 file read by all processors) and output (all processors write to a file; to where? central or local), etc. • What is T1(N)? Parallel code on a single processor, or serial code with the same algorithm, or a different “best known” algorithm • Example: Jacobi vs. Gauss-Seidel for the linear solve • In summary, there are two ways to get good speedup: fast parallel code or slow serial timing (for any reason)
Speedup and Efficiency for the Poisson Problem Blocking Send and Receive
Speedup and Efficiency for the Poisson Problem Non-Blocking Send and Receive
Conclusions • MPI provides a way to write efficient and portable parallel programs • MPI provides many built-in functions that help make programming and collective communications easier • Advanced point-to-point communication functions are also available • The performance increase may be system dependent • For more information on KALI at UMBC go to: http://www.math.umbc.edu/~gobbert/kali/
References • P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, 1997. • M.K. Gobbert. Configuration and Performance of a Beowulf Cluster for Large-Scale Scientific Simulations. Computing in Science and Engineering, vol. 7, no. 2, pp. 14-26, March/April 2005.