High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division
High Performance Parallel Programming • Lecture 4: Message Passing Interface 3
So Far.. • Messages • source, dest, data, tag, communicator • Communicators • MPI_COMM_WORLD • Point-to-point communications • different modes - standard, synchronous, buffered, ready • blocking vs non-blocking • Derived datatypes • construct then commit
Ping-pong exercise: program

/**********************************************************************
 * This file has been written as a sample solution to an exercise in a
 * course given at the Edinburgh Parallel Computing Centre. It is made
 * freely available with the understanding that every copy of this file
 * must include this header and that EPCC takes no responsibility for
 * the use of the enclosed teaching material.
 *
 * Authors: Joel Malard, Alan Simpson
 *
 * Contact: epcc-tec@epcc.ed.ac.uk
 *
 * Purpose: A program to experiment with point-to-point communications.
 *
 * Contents: C source code.
 *
 ********************************************************************/
#include <stdio.h>
#include <mpi.h>

#define proc_A 0
#define proc_B 1
#define ping 101
#define pong 101

float buffer[100000];
MPI_Aint float_size;

void processor_A(void), processor_B(void);

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    /* size in bytes of one MPI_FLOAT (MPI_Type_extent is superseded by
       MPI_Type_get_extent in later MPI versions) */
    MPI_Type_extent(MPI_FLOAT, &float_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == proc_A)
        processor_A();
    else if (rank == proc_B)
        processor_B();

    MPI_Finalize();
    return 0;
}
void processor_A(void)
{
    int i, length;
    MPI_Status status;
    double start, finish, time;
    extern float buffer[100000];
    extern MPI_Aint float_size;

    printf("Length\tTotal Time\tTransfer Rate\n");

    for (length = 1; length <= 100000; length += 1000) {
        start = MPI_Wtime();
        for (i = 1; i <= 100; i++) {
            /* 100 synchronous round trips at each message length */
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_B, ping, MPI_COMM_WORLD);
            MPI_Recv(buffer, length, MPI_FLOAT, proc_B, pong, MPI_COMM_WORLD, &status);
        }
        finish = MPI_Wtime();
        time = finish - start;
        /* average time per message (200 messages) and bytes per second */
        printf("%d\t%f\t%f\n", length, time / 200.,
               (float)(2 * float_size * 100 * length) / time);
    }
}
void processor_B(void)
{
    int i, length;
    MPI_Status status;
    extern float buffer[100000];

    for (length = 1; length <= 100000; length += 1000) {
        for (i = 1; i <= 100; i++) {
            /* mirror image of processor_A: receive the ping, return the pong */
            MPI_Recv(buffer, length, MPI_FLOAT, proc_A, ping, MPI_COMM_WORLD, &status);
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_A, pong, MPI_COMM_WORLD);
        }
    }
}
Ping-pong exercise: results (plots: time per message and transfer rate versus message length)
Running ping-pong
compile: mpicc ping_pong.c -o ping_pong
submit: qsub ping_pong.sh
where ping_pong.sh is
#PBS -q exclusive
#PBS -l nodes=2
cd <your sub_directory>
mpirun ping_pong
Collective communication • Communications involving a group of processes • Called by all processes in a communicator • for sub-groups need to form a new communicator • Examples • Barrier synchronisation • Broadcast, Scatter, Gather • Global sum, Global maximum, etc.
Characteristics • Collective action over a communicator • All processes must communicate • Synchronisation may or may not occur • All collective operations are blocking • No tags • Receive buffers must be exactly the right size • Collective communications and point-to-point communications cannot interfere
MPI_Barrier • Blocks each calling process until all other members have also called it • Generally used to synchronise between phases of a program • Only one argument - no data is exchanged MPI_Barrier(comm)
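As an illustration (not from the original slides), a minimal sketch of a barrier separating two phases of a program:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Phase 1: e.g. every process sets up its local data */
    printf("rank %d finished phase 1\n", rank);

    /* No process leaves the barrier until all have entered it */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 2 starts only after every process has completed phase 1 */
    printf("rank %d starting phase 2\n", rank);

    MPI_Finalize();
    return 0;
}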
Broadcast • Copies data from a specified root process to all other processes in the communicator • all processes must specify the same root • other arguments same as for point-to-point • datatypes and sizes must match MPI_Bcast(buffer, count, datatype, root, comm) • Note: MPI does not support a multicast function
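A minimal broadcast sketch, assuming rank 0 is the root and that a single int parameter (here called nsteps, an invented name) is being distributed:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nsteps = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        nsteps = 1000;          /* only the root knows the value initially */

    /* every process calls MPI_Bcast with the same root and matching count/type */
    MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: nsteps = %d\n", rank, nsteps);
    MPI_Finalize();
    return 0;
}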
Scatter, Gather • Scatter and Gather are inverse operations • Note that all processes partake - even root • Scatter: (diagram: before, the root holds a b c d e; after, each of the five processes holds one element)
Gather: (diagram: before, each of the five processes holds one of a b c d e; after, the root holds all of a b c d e)
MPI_Scatter, MPI_Gather MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) • sendcount in scatter and recvcount in gather refer to the size of each individual message (sendtype = recvtype => sendcount = recvcount) • total type signatures must match
Example
MPI_Comm comm;
int gsize, sendarray[100];
int root, myrank, *rbuf;
MPI_Datatype rtype;
...
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &gsize);
MPI_Type_contiguous(100, MPI_INT, &rtype);
MPI_Type_commit(&rtype);
if (myrank == root) {
    rbuf = (int *)malloc(gsize*100*sizeof(int));
}
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);
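For the inverse operation, a hedged sketch of MPI_Scatter in which the root distributes one 100-int block to every process (buffer names are illustrative, not from the course material):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int gsize, myrank, root = 0, i;
    int *sendbuf = NULL, recvarray[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    if (myrank == root) {
        /* the root prepares gsize blocks of 100 ints */
        sendbuf = (int *)malloc(gsize * 100 * sizeof(int));
        for (i = 0; i < gsize * 100; i++)
            sendbuf[i] = i;
    }

    /* every process, root included, receives one 100-int block */
    MPI_Scatter(sendbuf, 100, MPI_INT, recvarray, 100, MPI_INT, root, MPI_COMM_WORLD);

    printf("rank %d received a block starting with %d\n", myrank, recvarray[0]);

    if (myrank == root) free(sendbuf);
    MPI_Finalize();
    return 0;
}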
More routines (diagram: Allgather leaves every process with the concatenation of all the send buffers; Alltoall gives process i the i-th block from every other process)
MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
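A small Allgather sketch, assuming each process contributes a single int and wants everybody's values back:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, mine;
    int *all;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mine = rank * rank;                      /* one value per process */
    all = (int *)malloc(size * sizeof(int)); /* room for one value from everybody */

    /* like MPI_Gather, but every process ends up with the full array */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: all[%d] = %d\n", rank, size - 1, all[size - 1]);

    free(all);
    MPI_Finalize();
    return 0;
}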
Vector routines
MPI_Scatterv(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm)
MPI_Gatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm)
MPI_Allgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm)
MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm)
• Allow send/recv to be from/to non-contiguous locations in an array, with a different count per process
• Useful if sending different counts at different times
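A possible MPI_Gatherv sketch, assuming rank i contributes i+1 elements (the counts and variable names are invented for illustration):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, root = 0, i, total = 0, mycount;
    int *mydata, *recvbuf = NULL, *recvcounts = NULL, *displs = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* process i contributes i+1 elements */
    mycount = rank + 1;
    mydata = (int *)malloc(mycount * sizeof(int));
    for (i = 0; i < mycount; i++)
        mydata[i] = rank;

    if (rank == root) {
        recvcounts = (int *)malloc(size * sizeof(int));
        displs     = (int *)malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            recvcounts[i] = i + 1;   /* how many elements to expect from rank i */
            displs[i]     = total;   /* where rank i's data starts in recvbuf */
            total        += i + 1;
        }
        recvbuf = (int *)malloc(total * sizeof(int));
    }

    MPI_Gatherv(mydata, mycount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, root, MPI_COMM_WORLD);

    if (rank == root)
        printf("root gathered %d elements in total\n", total);

    MPI_Finalize();
    return 0;
}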
Global reduction routines • Used to compute a result which depends on data distributed over a number of processes • Examples • global sum or product • global maximum or minimum • global user-defined operation • Operation should be associative • aside: floating-point operations are not strictly associative, which we usually ignore, but it can make parallel results differ slightly from serial ones
Global reduction (cont.)
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
• combines count elements from each sendbuf using op and leaves the results in recvbuf on process root
• e.g. MPI_Reduce(&s, &r, 2, MPI_INT, MPI_SUM, 1, comm)
(diagram: before, each process holds two values in s; after, process 1 holds the element-wise sums in r)
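Expanding the slide's one-line example into a runnable sketch (the local values are invented; run with at least two processes so that root 1 exists):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    int s[2], r[2] = { 0, 0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* two local values per process */
    s[0] = rank;
    s[1] = 2 * rank;

    /* element-wise sum across all processes; only root 1 gets the result,
       matching the slide's example of root = 1 */
    MPI_Reduce(s, r, 2, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);

    if (rank == 1)
        printf("sums: %d %d\n", r[0], r[1]);

    MPI_Finalize();
    return 0;
}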
Reduction operators
• MPI_MAX - Maximum
• MPI_MIN - Minimum
• MPI_SUM - Sum
• MPI_PROD - Product
• MPI_LAND - Logical AND
• MPI_BAND - Bitwise AND
• MPI_LOR - Logical OR
• MPI_BOR - Bitwise OR
• MPI_LXOR - Logical XOR
• MPI_BXOR - Bitwise XOR
• MPI_MAXLOC - Max value and location
• MPI_MINLOC - Min value and location
User-defined operators In C the operator is defined as a function of type typedef void MPI_User_function(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype); In Fortran you must write a function as function <user_function>(invec(*), inoutvec(*), len, type) where the function has the following schema for (i = 1 to len) inoutvec(i) = inoutvec(i) op invec(i) Then MPI_Op_create(user_function, commute, op) returns a handle op of type MPI_Op
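A sketch of a user-defined operator, here an element-wise maximum of absolute values (an invented operation, used only to show the mechanics):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* user function: inoutvec(i) = max(|inoutvec(i)|, |invec(i)|), element-wise */
void absmax(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i, a, b;
    int *in = (int *)invec, *inout = (int *)inoutvec;
    (void)datatype;   /* unused here: this operator only handles MPI_INT */
    for (i = 0; i < *len; i++) {
        a = abs(in[i]);
        b = abs(inout[i]);
        inout[i] = (a > b) ? a : b;
    }
}

int main(int argc, char *argv[])
{
    int rank, s, r;
    MPI_Op myop;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(absmax, 1, &myop);    /* 1 = operator is commutative */

    s = (rank % 2) ? -rank : rank;      /* mix of signs across ranks */
    MPI_Reduce(&s, &r, 1, MPI_INT, myop, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("largest |value| = %d\n", r);

    MPI_Op_free(&myop);
    MPI_Finalize();
    return 0;
}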
Variants MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm) • All processes involved receive identical results MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm) • Acts as if a reduce was performed and then each process receives recvcounts[myrank] elements of the result
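A minimal Allreduce sketch, using an invented per-process residual to stand in for a real convergence check:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double local_err, global_err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local_err = 1.0 / (rank + 1);   /* stand-in for a locally computed residual */

    /* like MPI_Reduce followed by a broadcast: every rank gets the same sum */
    MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d sees global error %f\n", rank, size, global_err);

    MPI_Finalize();
    return 0;
}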
Reduce-scatter (diagram: each process contributes a vector of ints; the element-wise sums are split back across the processes according to recvcounts)
int *s, *r, *rc;
int rank, gsize;
...
/* e.g. rc = {1, 2, 0, 1, 1}: rank 0 gets 1 element of the result, rank 1 gets 2, rank 2 gets none, ... */
MPI_Reduce_scatter(s, r, rc, MPI_INT, MPI_SUM, comm);
Scan (diagram: prefix sums of five-element vectors accumulated across the process ranks)
MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm)
• Performs a prefix reduction on data across the group: recvbuf on process myrank holds op applied to the sendbufs of processes 0..myrank
MPI_Scan(&s, &r, 5, MPI_INT, MPI_SUM, comm);
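A runnable prefix-sum sketch, assuming each process contributes the single value 1 so that rank i receives i+1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, s, r;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    s = 1;   /* every process contributes 1 */

    /* inclusive prefix sum: rank i receives the sum over ranks 0..i, i.e. i+1 here */
    MPI_Scan(&s, &r, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d\n", rank, r);

    MPI_Finalize();
    return 0;
}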
Further topics • Error-handling • Errors are handled by an error handler • MPI_ERRORS_ARE_FATAL - default for MPI_COMM_WORLD • MPI_ERRORS_RETURN - the error code is returned, but the state of MPI after an error is undefined • MPI_Error_string(errorcode, string, resultlen) • Message probing • Messages can be probed before being received (see the sketch below) • Note - after a wildcard probe, a subsequent wildcard receive may match a different message • blocking and non-blocking • Persistent communications
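A probing sketch, assuming rank 0 sends five ints to rank 1, which probes first to size its receive buffer (run with at least two processes):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data[5] = { 1, 2, 3, 4, 5 };
        MPI_Send(data, 5, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        int count;
        int *buf;

        /* block until a matching message is pending, without receiving it */
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);

        /* now the buffer can be sized to fit the incoming message exactly */
        buf = (int *)malloc(count * sizeof(int));
        MPI_Recv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d ints\n", count);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}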
Assignment 2 • Write a general procedure to multiply 2 matrices • Start with • http://www.hpc.unimelb.edu.au/cs/assignment2/ • This is a harness for last year's assignment • Last year I asked them to optimise first • This year just parallelise • Next Tuesday I will discuss strategies • That doesn't mean don't start now… • Ideas available in various places…
High Performance Parallel Programming • Tomorrow - matrix multiplication