High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics
Assignment 4 • MPI programming 1 • Will be on website tonight
Today’s Lecture Distributed Memory Computing II • Part 1: User defined (derived) datatypes, packing, communicators • Part 2: Abstract topologies, Cartesian primitives, performance and scaling • Part 3: General programming tips & mixed mode programming
MPI_Gatherv & MPI_Allgatherv • Similar to gather, but varying counts and strides are now allowed • MPI_Gatherv(sendbuf,sendcount,sendtype,recvbuf,recvcounts,displs,recvtype,root,comm,ierr) • recvcounts and displs are arrays • Result is equivalent to • All processes send a message to the root • MPI_Send(sendbuf,sendcount,sendtype,root,…) • Root executes n receives • MPI_Recv(recvbuf+displs[i]*extent(recvtype),recvcounts[i],recvtype,i,…) • Data from process j is placed in the jth portion of the receive buffer on the root process • The jth portion of the buffer begins displs[j] elements (in units of recvtype) into recvbuf • MPI_Allgatherv – all processes receive the result
Constant stride example
[Diagram: each of the gsize processes holds 100 ints; the root gathers them into rbuf with a constant STRIDE (> 100) between the start of each process's block.]

  int sendarray[100], gsize;
  …
  MPI_Comm_size(comm, &gsize);
  rbuf = (int *)malloc(gsize*stride*sizeof(int));
  displs = (int *)malloc(gsize*sizeof(int));
  rcounts = (int *)malloc(gsize*sizeof(int));
  for (i=0; i<gsize; ++i) {
      displs[i] = i*stride;
      rcounts[i] = 100;
  }
  MPI_Gatherv(sendarray, 100, MPI_INT, rbuf, rcounts, displs, MPI_INT, root, comm);
Part 1: User-defined (derived) datatypes • MPI comes with plenty of built-in datatypes
Making a trivial new datatype • MPI_Type_contiguous(n,oldtype,newtype,ierror) • Produces a new datatype that is n copies of the old • Each entry is displaced by the extent of oldtype • oldtype can be a derived datatype as well • Must “commit” the type before sending • Use MPI_Type_commit(newtype,ierr) • Free up with MPI_Type_free(newtype,ierr) – there are a finite number of datatypes
Equivalent calls:
  MPI_Send(buffer,count,datatype,dest,tag,comm,ierr)
and
  MPI_Type_contiguous(count,datatype,newtype,ierr)
  MPI_Type_commit(newtype,ierr)
  MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
  MPI_Type_free(newtype,ierr)
Arbitrary datatypes • To derive a datatype from another: • Specify a typemap of datatypes and BYTE displacements • The type signature is the list of basic datatypes within the new datatype • The type signature directly controls the interpretation of both sent and received data, while the displacements indicate where to find the information • It is the type signature that decodes the information contained in the buffers
Upper bound, lower bound and extent • Lb - lower bound of displacements, describes position of first element • Ub – position of last byte • Padding may be required to align on word boundaries • Extent – difference between the upper and lower boundaries, may or may not include padding • Describes the stride between instances of the datatype
Padding Example • Suppose we run on a machine that requires ints to be on 4 byte boundaries • Typemap={(int,0),(char,4)} • lb=0 • What about the upper bound? The end of the datatype is at 5 bytes, but the next int needs to be on a 4 byte boundary • Padding = 3 bytes => ub=8 • The extent of the datatype is thus 8
[Diagram: successive {int,char} elements laid out on 4 byte boundaries, each char followed by 3 bytes of padding; the extent spans int + char + pad.]
Extent is machine dependent • Since padding is machine dependent a subroutine is provided to determine the extent • MPI_Type_extent(datatype,extent,ierror) • Lower and upper bound calls are also provided: • MPI_Type_lb(datatype,displacement,ierror) • MPI_Type_ub(datatype,displacement,ierror) • The total number of bytes of data contained in the datatype is given by • MPI_Type_size(datatype,size,ierror) • Remember – the extent defines the stride in memory between elements of a given datatype
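As a hedged illustration (not from the lecture), the C sketch below builds the {(int,0),(char,4)} typemap from the padding example with the struct constructor (introduced later in this part) and queries its bounds. It uses the MPI-1 calls named on the slide (MPI_Type_struct, MPI_Type_lb/ub/extent); these were later deprecated and removed in MPI-3, where MPI_Type_get_extent is the replacement. The exact ub/extent values reported depend on the implementation's alignment rules.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* Typemap {(int,0),(char,4)} from the padding example. */
      int          blocklen[2] = {1, 1};
      MPI_Aint     disp[2]     = {0, sizeof(int)};
      MPI_Datatype types[2]    = {MPI_INT, MPI_CHAR};
      MPI_Datatype intchar;

      MPI_Type_struct(2, blocklen, disp, types, &intchar);
      MPI_Type_commit(&intchar);

      MPI_Aint lb, ub, extent;
      int size;
      MPI_Type_lb(intchar, &lb);          /* lower bound                      */
      MPI_Type_ub(intchar, &ub);          /* upper bound (may include padding) */
      MPI_Type_extent(intchar, &extent);  /* stride between successive elements */
      MPI_Type_size(intchar, &size);      /* bytes of actual data, no padding  */

      printf("lb=%ld ub=%ld extent=%ld size=%d\n",
             (long)lb, (long)ub, (long)extent, size);

      MPI_Type_free(&intchar);
      MPI_Finalize();
      return 0;
  }

On a machine with 4-byte int alignment this typically reports extent=8 but size=5, which is exactly the padding effect described above.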
Ways to create datatypes • Contiguous • Vector • Generalization of contiguous, allows for regular gaps in displacements • Elements separated by multiples of the extent of the input datatype • Hvector • Similar to vector but elements are separated by bytes • Indexed • Array of displacements in the input, displacements are measured in multiples of the input datatype • Hindexed • Similar to indexed, but instead of displacements actual byte positions are used • Struct • Completely general, input is the typemap
Vector datatypes • MPI_Type_vector(count, blocklength, stride, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Each block is blocklength data items long • Separation between successive starting blocks is stride • Function MPI_Type_contiguous can be thought of as a special case of MPI_Type_vector: • MPI_Type_contiguous(count, oldtype, newtype, ierr) = MPI_Type_vector(count, 1, 1, oldtype, newtype,ierr)
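A minimal C sketch (not from the lecture) of a common MPI_Type_vector use case: sending one column of a row-major matrix. The names send_column, NROWS, NCOLS and the destination/tag are assumptions for illustration.

  #include <mpi.h>

  #define NROWS 8
  #define NCOLS 10

  /* Hypothetical helper: send column `col` of a row-major matrix to `dest`. */
  void send_column(double a[NROWS][NCOLS], int col, int dest, MPI_Comm comm)
  {
      MPI_Datatype column;

      /* NROWS blocks of 1 double each, successive blocks NCOLS doubles apart. */
      MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);

      MPI_Send(&a[0][col], 1, column, dest, 0, comm);

      MPI_Type_free(&column);
  }

The stride is given in multiples of the extent of the input datatype (here one double), matching the description above.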
Vector Example • Let oldtype = {(double, 0), (char, 8)} (named_double) • Call MPI_Type_vector(2, 3, 4, named_double, six_named_doubles,ierr) • New MPI data type, which has the following map: • newtype = {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40), (double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104)} • Two blocks of data are taken, each containing 3 structures of the type named_double • The 3 structures within the block are concatenated contiguously • The stride is set to extent of 4 objects of the same type = 64 bytes, since each named_double must have an extent of 16 bytes (word alignment)
Indexed data types • MPI_Type_indexed(count,array_of_blocklengths, array_of_displacements, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Block n is array_of_blocklengths(n) data items long • List of displacements is defined relative to the first item • MPI_Type_vector can be thought of as a special case of MPI_Type_indexed • All array_of_blocklengths entries = vector blocklen • List of displacements are multiples of the stride
Example of using indexing • Suppose you only need a subset of an array • Could send many short messages • Could pack a datatype • Better solution: create a datatype that includes the displacements you need • Example: • List of offsets: 9, 27, 39, 142
Example cont • In this case we have 4 elements (=n_to_move) • Array of displacements (offset()) • Each part of the list is a single element • ‘Block’ lengths are all unity (blocksize()) • The list datatype will be denoted ltype • The following code performs the implicit packing for you:
  MPI_Type_indexed(n_to_move,blocksize,offset,ltype,newtype,ierr)
  MPI_Type_commit(newtype,ierr)
  MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
  MPI_Type_free(newtype,ierr)
When to use H versions • Data may be stored in dynamically allocated regions • No single buffer to measure displacements from • MPI_Address(location,address,ierr) will return the byte address of the element location within an allocated region • Create the list of byte positions for Hindexed using MPI_Address • MPI_BOTTOM can then be used as the starting address for the MPI_Send • MPI_Send(MPI_BOTTOM,1,mytype,dest,tag,comm,ierr)
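A hedged C sketch of the technique just described: three separately allocated values are gathered into one message via absolute byte addresses and an Hindexed type. It uses the MPI-1 calls named on the slide (MPI_Address, MPI_Type_hindexed); MPI_Get_address and MPI_Type_create_hindexed are the modern equivalents. The function name send_scattered and the values are assumptions.

  #include <mpi.h>
  #include <stdlib.h>

  void send_scattered(int dest, MPI_Comm comm)
  {
      /* Three values living in separately allocated memory. */
      double *x = malloc(sizeof(double));
      double *y = malloc(sizeof(double));
      double *z = malloc(sizeof(double));
      *x = 1.0; *y = 2.0; *z = 3.0;

      int      blocklen[3] = {1, 1, 1};
      MPI_Aint disp[3];
      MPI_Address(x, &disp[0]);   /* absolute byte addresses ...          */
      MPI_Address(y, &disp[1]);
      MPI_Address(z, &disp[2]);   /* ... so the send buffer is MPI_BOTTOM */

      MPI_Datatype scattered;
      MPI_Type_hindexed(3, blocklen, disp, MPI_DOUBLE, &scattered);
      MPI_Type_commit(&scattered);

      MPI_Send(MPI_BOTTOM, 1, scattered, dest, 0, comm);

      MPI_Type_free(&scattered);
      free(x); free(y); free(z);
  }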
Packing • On some occasions packing may be desirable over specifying a datatype • e.g. packing across multiple arrays • MPI_Pack allows a user to incrementally add data into a user provided buffer • MPI_Pack(inbuf,incount,datatype,outbuf,outcount,position,comm,ierr) • Packed data is given the MPI_PACKED datatype • MPI_Pack_size can be used to determine how large the required buffer is in bytes • Specify number of elements, MPI datatype, communicator • Communicator is required since heterogeneous environment may require more packing than expected
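A short hedged sketch of how MPI_Pack_size can be used to size the buffer rather than guessing, assuming the same int-plus-100-chars payload as the example that follows. The function name pack_and_send is hypothetical; the returned sizes are an upper bound and may exceed the raw 104 bytes of data.

  #include <mpi.h>
  #include <stdlib.h>

  void pack_and_send(int dest, MPI_Comm comm)
  {
      int  i = 42;
      char c[100] = {0};
      int  sz_int, sz_char, position = 0;

      /* Ask MPI how much space the packed data could need. */
      MPI_Pack_size(1,   MPI_INT,  comm, &sz_int);
      MPI_Pack_size(100, MPI_CHAR, comm, &sz_char);

      int   bufsize = sz_int + sz_char;   /* upper bound on packed size */
      char *buffer  = malloc(bufsize);

      MPI_Pack(&i, 1,   MPI_INT,  buffer, bufsize, &position, comm);
      MPI_Pack(c,  100, MPI_CHAR, buffer, bufsize, &position, comm);
      MPI_Send(buffer, position, MPI_PACKED, dest, 0, comm);

      free(buffer);
  }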
Equivalent examples – both approaches send the same message

Struct datatype:
  int i;
  char c[100];
  MPI_Aint disp[2];
  int blocklen[2] = {1, 100};
  MPI_Datatype type[2] = {MPI_INT, MPI_CHAR};
  MPI_Datatype Type;
  /* create datatype */
  MPI_Get_address(&i, &(disp[0]));
  MPI_Get_address(c, &(disp[1]));
  MPI_Type_create_struct(2, blocklen, disp, type, &Type);
  MPI_Type_commit(&Type);
  /* send */
  MPI_Send(MPI_BOTTOM, 1, Type, 1, 0, MPI_COMM_WORLD);

Packed datatype:
  int i;
  char c[100];
  char buffer[110];
  int position = 0;
  /* pack */
  MPI_Pack(&i, 1, MPI_INT, buffer, 110, &position, MPI_COMM_WORLD);
  MPI_Pack(c, 100, MPI_CHAR, buffer, 110, &position, MPI_COMM_WORLD);
  /* send */
  MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
Receiving & unpacking packed messages • Note the MPI_PACKED datatype can be used to receive any message • To unpack, make successive calls to MPI_Unpack(inbuf,insize,position,outbuf,outcount,datatype,comm,ierr) • The first call starts at position 0 • The second call starts at the position returned by the first MPI_Unpack (value returned in the position variable)
MPI_Unpack example
  int i;
  char c[100];
  MPI_Status status;
  char buffer[110];
  int position = 0;
  /* receive */
  MPI_Recv(buffer, 110, MPI_PACKED, 1, 0, MPI_COMM_WORLD, &status);
  /* unpack */
  MPI_Unpack(buffer, 110, &position, &i, 1, MPI_INT, MPI_COMM_WORLD);
  MPI_Unpack(buffer, 110, &position, c, 100, MPI_CHAR, MPI_COMM_WORLD);
Note on implementation • A packing unit may contain metadata (recall issue of heterogeneity in communicator) • e.g. 64 processors of standard type, 1 additional graphics processor of opposite endian • Any header will contain information on encoding and the size of the unit for error checking • Implies you can’t concatenate two packed messages and send in a single operation • Similarly can’t separate an incoming message into two sections and unpack separately
Communicators & Processor groups
[Diagram: the default WORLD group (MPI_COMM_WORLD) contains all processes P1–P10; subsets of processes form Group 1 (COMM 1), Group 2 (COMM 2) and Group 3 (COMM 3), each with its own communicator.]
Communicator properties • MPI_COMM_WORLD is the default communicator • In reality each communicator is a handle to an object that collectively describes a group of processes • The communicator is not the group • Definition of a new communicator (sometimes) requires that you first define the group • The communicator facilitates communication within the group • Communicators describe a communication “channel” • The processes in a group are ordered • Every process in the group has a rank • Ranks are ordered from 0 to size-1 • Collective operations are defined within the group
Why would I need more than one communicator?
[Diagram: over time, four pieces of data (Data 1–4) are each communicated among a subset of three processes using its own communicator (Communicators 1–4), while the other three processes do not communicate any data.]
Inter- & intra-communicators • Avoid the natural tendency to think of a communicator as encapsulating a number of processes • Remember it is a handle to a communication domain • Communication within a group is facilitated by an intracommunicator • Default communication in MPI_COMM_WORLD is handled by an intracommunicator • Probably 90% of what you want to do can be done with intracommunicators • Communication across groups is facilitated by an intercommunicator • Only point-to-point communication is defined • MPI-2 defines collectives for intercommunicators
Group Management • One way of defining a new communicator requires first that a new group be defined • The default world group can be retrieved via • MPI_Comm_group(comm,group,ierr) • e.g. MPI_Comm_group(MPI_COMM_WORLD,world,ierr) returns the (default) world group • New groups are constructed from old ones by • Inclusion • Exclusion • Union • Intersection • Difference • Inclusion over a range • Exclusion over a range • All group definitions must begin from the default world group • Other default group: MPI_GROUP_EMPTY – may be used as an argument to group creation routines • Groups are freed via MPI_Group_free(group,ierr)
Inclusion / Exclusion • These two calls are the easiest ones to use for deriving groups from the default world group • MPI_Group_incl(group,n,ranks,newgroup,ierr) • Creates newgroup from the n processes in group with ranks ranks[0],…,ranks[n-1] • The process with rank i in the new group corresponds to ranks[i] in the old group • Example: group={a,b,c,d,e}, ranks=(3,1,2), then a handle to group {d,b,c} is returned – note the ordering depends on ranks[] • n=0 will return MPI_GROUP_EMPTY • MPI_Group_excl(group,n,ranks,newgroup) • Creates a new group that does not contain the processes with ranks ranks[0],…,ranks[n-1] • Example: group={a,b,c,d,e}, ranks=(3,1,2), then a handle to group {a,e} is returned • For n=0 the newgroup returned will simply be group
Union and Intersection • MPI_Group_union(grp1,grp2,newgroup,ierr) • All elements of the first group followed by all elements of the second group not in the first • Example: suppose grp1={a,b,c,d}, grp2={a,d,e} • grp1 U grp2 = {a,b,c,d,e} • MPI_Group_intersection(grp1,grp2,newgroup,ierr) • All elements of the first group that are also in the second, ordered as in the first group • Example: grp1 ∩ grp2 = {a,d}
Difference & ranges • MPI_Group_difference(grp1,grp2,newgroup,ierr) • All elements of first not included in the second, ordered in the same order as the first group • Example: suppose grp1={a,b,c,d}, grp2={a,d,e} • grp1\grp2 = {b,c} • MPI_Group_range_incl/excl(group,n,ranges,newgroup,ierr) • Ranges: one dimensional array of integer triplets • Fortran ranges(3,*), C ranges[][3] • Each triplet specifies a range of ranks to be included/excluded
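A hedged C sketch of the range form: a single (first, last, stride) triplet selects every even-ranked process from the world group. The helper name make_even_group is an assumption.

  #include <mpi.h>

  void make_even_group(MPI_Group *even)
  {
      MPI_Group world;
      int size;

      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_group(MPI_COMM_WORLD, &world);

      int ranges[1][3] = { {0, size - 1, 2} };   /* ranks 0,2,4,... */
      MPI_Group_range_incl(world, 1, ranges, even);

      MPI_Group_free(&world);
  }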
Communicator creation • Communicators can be duplicated via a call to MPI_Comm_dup(comm,newcomm,ierr) • New group communicator - • MPI_Comm_create(comm,group,newcomm) • Requires input of the comm associated with the old group • No attributes (i.e. those associated with virtual topologies) are propagated to the new group • New communicators can be further subdivided through the creation of new groups and communicators
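Putting the group route together, a minimal C sketch (names make_subcomm and m are assumptions): retrieve the world group, include a subset of ranks, and create a communicator for it. Note MPI_Comm_create is collective over the old communicator, and processes outside the group receive MPI_COMM_NULL.

  #include <mpi.h>
  #include <stdlib.h>

  /* Build a communicator containing world ranks 0..m-1. */
  void make_subcomm(int m, MPI_Comm *newcomm)
  {
      MPI_Group world, sub;
      int *ranks = malloc(m * sizeof(int));
      for (int i = 0; i < m; i++) ranks[i] = i;      /* ranks to include   */

      MPI_Comm_group(MPI_COMM_WORLD, &world);        /* default world group */
      MPI_Group_incl(world, m, ranks, &sub);         /* select the subset   */
      MPI_Comm_create(MPI_COMM_WORLD, sub, newcomm); /* collective call     */

      MPI_Group_free(&sub);
      MPI_Group_free(&world);
      free(ranks);
  }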
MPI_Comm_split – fast way to define new communicators • Very powerful way of quickly creating (non-overlapping) sub-groups and communicators • The user selects the number of subgroups required and each process supplies a ‘color’ that assigns it to the appropriate subgroup • Defines communicators that communicate between a subset of the original comm • Circumvents the need to create a group • MPI_Comm_split(oldcomm,color,key,newcomm) • Collective operation • Processes passing the same color will be placed into the same communicator • Values of key determine how ranks are assigned in the new communicator • if all processes specify the same key, ranks in the new communicator follow the rank order of the old one
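A hedged C sketch of MPI_Comm_split, splitting the world into row communicators of a logical process grid (the names split_into_rows and ncols are assumptions): the color is the row index, and the key preserves the relative ordering of the world ranks.

  #include <mpi.h>

  void split_into_rows(int ncols, MPI_Comm *rowcomm)
  {
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int color = rank / ncols;   /* processes in one row share a color      */
      int key   = rank % ncols;   /* orders ranks within the new communicator */

      MPI_Comm_split(MPI_COMM_WORLD, color, key, rowcomm);
  }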
Creating intercommunicators • Most of the previous methods implicitly create intracommunicators • MPI_Intercomm_create(local_comm,local_leader,peer_comm,remote_leader,tag,newintercomm) • Call is collective over the union of the two groups • Must provide a parent communicator that can be used to facilitate communication between both groups • Which pair of groups is selected is determined by the local_comm arguments
[Diagram: two groups (Group 2/COMM 2 and Group 3/COMM 3), each with its own local ranks, joined by an intercommunicator.]
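A hedged C sketch of the call above: split the world into two halves, then join them with an intercommunicator using MPI_COMM_WORLD as the peer communicator. The function name make_intercomm, the choice of leaders (world ranks 0 and size/2), and the tag value 99 are assumptions for illustration.

  #include <mpi.h>

  void make_intercomm(MPI_Comm *intercomm)
  {
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int      color = (rank < size / 2) ? 0 : 1;      /* which half am I in? */
      MPI_Comm half;
      MPI_Comm_split(MPI_COMM_WORLD, color, rank, &half);

      int remote_leader = (color == 0) ? size / 2 : 0; /* rank in the peer comm */
      MPI_Intercomm_create(half, 0,                    /* local comm, local leader */
                           MPI_COMM_WORLD, remote_leader,
                           99,                         /* tag, must match on both sides */
                           intercomm);
  }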
Loose ends - communicator management functions • Don’t forget to free communicators when you’ve finished with them • MPI_Comm_free(comm,ierr) • Can also compare communicators to see if they are identical • MPI_Comm_compare(comm1,comm2,result,ierr)
Summary Part 1 • Communicators are communication channels which operate within groups • New communicators can be derived either from new groups, or via MPI_Comm_split
Part 2: Topologies, Cartesian Primitives and messaging performance • Virtual topologies are applicable to applications where data structures reflect a specific domain decomposition • An optional attribute that can be given to an intracommunicator only – not to intercommunicators • May possibly help the runtime mapping of processes onto hardware • We’ll examine one code example that brings together a number of features we have discussed and places them in context
Inadequacies of linear ranking • Linear ranking implies a ring-like abstract topology for the processes • Few applications have a communication pattern that logically reflects a ring • Two or three dimensional grids/tori are very common • The logical arrangement of processes defined by the communication pattern is called the “virtual topology” • Do not confuse the virtual topology with the machine topology • The virtual topology must be mapped to the hardware via an embedding • This may or may not be beneficial
Graphs & Virtual Topologies • The communication pattern defines a graph • Communication graphs are assumed to be symmetric • if a communicates with b, b can communicate with a • Communicators allow messaging between any pair of nodes • so a virtual topology may necessarily neglect some communication modes
[Cartesian example: a 3x4 2d virtual topology; for each process the upper number gives the rank and the lower value gives (column,row).]
Specific example: Poisson Equation on 2d grid • Grid points x_i = i/(n+1), where n defines the number of interior points; the same in the y axis • On the unit square the grid spacing is h = 1/(n+1) • The equation is approximated by the 5-point finite difference u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4*u(i,j) = h**2*f(i,j) • Solve by guessing u initially, then iteratively updating (Jacobi iteration): unew(i,j) = 0.25*( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - h**2*f(i,j) )
[Diagram: the 5-point stencil (i,j), (i-1,j), (i+1,j), (i,j-1), (i,j+1) on a grid of interior and boundary points.]
2d Poisson continued • Calculation of the updated u^k involves looping over all grid points • How do we decompose the problem in parallel? • Simplest choice: 1-d decomposition (slabs) • Solid lines denote data stored on the node • Dotted lines denote all data required for the calculation • Nodes will always require data that they do not hold a copy of locally – ghost cells
[Diagram: the grid split into two horizontal slabs owned by rank 0 and rank 1; each slab needs a row of ghost data from its neighbour.]
Changes to the loop structure • The loop is now over the node's start and end points • Data arrays must include space for the j+1 and j-1 entries in the y direction • Ghost data must be communicated • MPI_Cart topologies create the necessary facilities for this messaging to be done quickly and easily

Single processor:
      integer i,j,n
      double precision u(0:n+1,0:n+1)
      double precision unew(0:n+1,0:n+1)
      do j=1,n
         do i=1,n
            unew(i,j) = 0.25*( u(i-1,j) + u(i,j+1) + u(i,j-1) + u(i+1,j)
     +                         - h**2*f(i,j) )
         end do
      end do

Parallel node:
      integer i,j,n
      double precision u(0:n+1,s-1:e+1)
      double precision unew(0:n+1,s-1:e+1)
      do j=s,e
         do i=1,n
            unew(i,j) = 0.25*( u(i-1,j) + u(i,j+1) + u(i,j-1) + u(i+1,j)
     +                         - h**2*f(i,j) )
         end do
      end do
MPI_Cart_create • Creates a communicator with a Cartesian topology • The 2nd through 5th arguments define the new topology; the 1st argument is the initial communicator and the 6th argument is the new communicator • Example for a logical 2d grid of processors:

      integer dims(2)
      logical isperiodic(2),reorder
      dims(1)=4
      dims(2)=3
      isperiodic(1)=.false.
      isperiodic(2)=.false.
      reorder=.true.
      ndim=2
      call MPI_Cart_create(MPI_COMM_WORLD,
     +     ndim,dims,isperiodic,reorder,
     +     comm2d,ierr)

• dims – dimensions of the axes • isperiodic – are the axis directions periodic? (consider a grid on the Earth, for example) • reorder – rank in the new communicator may be different from the old one (may improve performance)
Important issue • The topology MPI_Cart_create initializes is related to the communication required by the domain decomposition • Not the implicit communication structure of the algorithm • For the 1d decomposition all x values are contained on the node – only y boundaries need be exchanged • In the 2d Poisson example we can choose either a 1d or 2d decomposition
Finding your coordinates & neighbours in the new communicator • MPI_Cart_get(comm1d,1,dims,isperiodic,coords,ierr) • Returns dims, isperiodic as defined in the communicator • The calling process’s coords are returned • You may also want to convert a rank to coordinates • MPI_Cart_coords(comm1d,rank,maxdims,coords,ierr) • If you call it with your own rank then the call returns your coords • When moving data we frequently wish to shift along one axis direction synchronously across all processors • If you just want to send and receive ghost cell values to/from neighbours there is a simple mechanism for determining the neighbouring ranks • MPI_Cart_shift
MPI_Cart_shift • MPI_Cart_shift(comm,direction,shift,src,dest,ierr) • Uses the calling process’s position in the Cartesian topology • Input the direction to shift (note directions are numbered from 0 to n-1) • Input the displacement • >0 = upward shift • <0 = downward shift • Returns the rank of the neighbour in the upper direction (dest) • Returns the rank of the neighbour in the lower direction (src) • If there is no neighbour, MPI_PROC_NULL is returned • Messages to this rank are ignored, and complete immediately
[Diagram: boundary exchanges between rank 0 and rank 1 – data must move upward and downward.]
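As a hedged C sketch (the course's own Fortran exchng1 routine appears later), the upward half of a ghost exchange built on MPI_Cart_shift: boundary processes get MPI_PROC_NULL back and the same MPI_Sendrecv still works, since operations involving MPI_PROC_NULL complete immediately. The name exchange_ghosts and the row-buffer arguments are assumptions; the downward half would mirror this call.

  #include <mpi.h>

  void exchange_ghosts(double *top_row, double *bottom_ghost, int nx,
                       MPI_Comm comm2d)
  {
      int below, above;

      /* Shift by +1 along direction 0: 'below' is the source, 'above' the dest. */
      MPI_Cart_shift(comm2d, 0, 1, &below, &above);

      /* Send my top interior row upward, receive my bottom ghost row from below. */
      MPI_Sendrecv(top_row,      nx, MPI_DOUBLE, above, 0,
                   bottom_ghost, nx, MPI_DOUBLE, below, 0,
                   comm2d, MPI_STATUS_IGNORE);
  }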
Remaining ingredients • Still need to compute the start and end points for each node • If n is divisible by nprocs (best for load balance):

      s = 1 + myrank * (n/nprocs)
      e = s + (n/nprocs) - 1

• When nprocs does not divide n, a naïve solution is:

      s = 1 + myrank * (n/nprocs)
      if (myrank .eq. nprocs-1) then
         e = n
      else
         e = s + floor(n/nprocs) - 1
      end if

• floor(x) returns the largest integer value no greater than x • Better to use MPE_Decomp1d to even things out, but the load balance still won’t be perfect (see the ANL/MSU MPI website)
Prototype code
http://www-unix.mcs.anl.gov/mpi/usingmpi/examples/intermediate/oned_f.htm

      call MPI_CART_CREATE( MPI_COMM_WORLD, 1, numprocs, .false.,
     $     .true., comm1d, ierr )
c
c  Get my position in this communicator, and my neighbors
c
      call MPI_COMM_RANK( comm1d, myid, ierr )
      call MPI_Cart_shift( comm1d, 0, 1, nbrbottom, nbrtop, ierr )
c  Compute the actual decomposition
      call MPE_DECOMP1D( ny, numprocs, myid, s, e )
c  Initialize the right-hand-side (f) and the initial solution guess (a)
      call onedinit( a, b, f, nx, s, e )
c
c  Actually do the computation.  Note the use of a collective operation to
c  check for convergence, and a do-loop to bound the number of iterations.
c
      do 10 it=1, maxit
         call exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( a, f, nx, s, e, b )        ! Jacobi sweep
         call exchng1( b, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( b, f, nx, s, e, a )
         dwork = diff( a, b, nx, s, e )           ! Convergence test
         call MPI_Allreduce( dwork, diffnorm, 1, MPI_DOUBLE_PRECISION,
     $        MPI_SUM, comm1d, ierr )
         if (diffnorm .lt. 1.0e-5) goto 20
c        if (myid .eq. 0) print *, 2*it, ' Difference is ', diffnorm
 10   continue
      if (myid .eq. 0) print *, 'Failed to converge'
 20   continue
      if (myid .eq. 0) then
         print *, 'Converged after ', 2*it, ' Iterations'
      endif

The routine alternately computes into a and then b – hence 2*it at the end.
Which routine determines performance (other than the comp. kernel)? • exchng1 controls the messaging, and how this is coded will strongly affect overall execution time • Recall from last lecture, 5(!) different ways to do the communication: • Blocking sends and receives • very bad idea, potentially unsafe (although in this case it is OK) • Ordered send • match sends and receives perfectly • Sendrecv • the system deals with buffering • Buffered Bsend • Nonblocking Isend
Option #1: Simple sends and receives

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
c
      call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &     nbrtop, 0, comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &     nbrbottom, 0, comm1d, status, ierr )
      call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &     nbrbottom, 1, comm1d, ierr )
      call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &     nbrtop, 1, comm1d, status, ierr )
      return
      end