
High Performance Computing – CISC 811



  1. High Performance Computing – CISC 811. Dr Rob Thacker, Dept of Physics (308A), thacker@physics

  2. Assignment 4 • MPI programming 1 • Will be on website tonight

  3. Today’s Lecture Distributed Memory Computing II • Part 1: User defined (derived) datatypes, packing, communicators • Part 2: Abstract topologies, Cartesian primitives, performance and scaling • Part 3: General programming tips & mixed mode programming

  4. MPI_Gatherv & MPI_Allgatherv • Similar to gather, but varying counts and displacements (a stride) are now allowed • MPI_Gatherv(sendbuf,sendcount,sendtype,recvbuf,recvcounts,displs,recvtype,root,comm,ierr) • recvcounts and displs are arrays • The result is equivalent to • All processes send a message to the root • MPI_Send(sendbuf,sendcount,sendtype,root,…) • The root executes n receives • MPI_Recv(recvbuf+displs[i]*extent(recvtype),recvcounts[i],recvtype,i,…) • Data from process j is placed in the jth portion of the receive buffer on the root process • The jth portion of the buffer begins displs[j] elements (in units of the extent of recvtype) into recvbuf • MPI_Allgatherv – all processes receive the result

  5. Constant stride example • Each of the gsize processes sends 100 ints; the root places each contribution stride (> 100) ints apart in rbuf:

int sendarray[100], gsize;
int *rbuf, *displs, *rcounts, i;
/* stride (>= 100) ints between the start of successive contributions */
MPI_Comm_size(comm, &gsize);
rbuf = (int *)malloc(gsize*stride*sizeof(int));
displs = (int *)malloc(gsize*sizeof(int));
rcounts = (int *)malloc(gsize*sizeof(int));
for (i=0; i<gsize; ++i) {
    displs[i] = i*stride;
    rcounts[i] = 100;
}
MPI_Gatherv(sendarray, 100, MPI_INT, rbuf, rcounts, displs, MPI_INT, root, comm);

  6. Part 1: User-defined (derived) datatypes • MPI comes with plenty of built-in datatypes

  7. Making a trivial new datatype • MPI_Type_contiguous(n,oldtype,newtype,ierror) • Produces a new datatype that is n copies of the old • Each entry is displaced by the extent of oldtype • oldtype can be a derived datatype as well • Must “commit” the type before sending • Use MPI_Type_commit(newtype,ierr) • Free it with MPI_Type_free(newtype,ierr) – there are only a finite number of datatype handles • The following calls are equivalent:

MPI_Send(buffer,count,datatype,dest,tag,comm,ierr)

MPI_Type_contiguous(count,datatype,newtype,ierr)
MPI_Type_commit(newtype,ierr)
MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
MPI_Type_free(newtype,ierr)

  8. Arbitrary datatypes • To derive a datatype from another - • Specify a typemap of datatypes and BYTE displacements • The type signature is the list of basic datatypes within the new datatype: • Type signature directly controls the interpretation of both sent and received data, displacements indicate where to find information • Decodes information contained in the buffers

  9. Upper bound, lower bound and extent • Lb – lower bound of the displacements, gives the position of the first element • Ub – upper bound, the position just past the last byte (rounded up to include any padding) • Padding may be required to align on word boundaries • Extent – difference between the upper and lower bounds, may or may not include padding • Describes the stride between successive instances of the datatype

  10. Padding Example • Suppose we run on a machine that requires ints to be on 4 byte boundaries • Typemap={(int,0),(char,4)} • lb=0 • What about the upper bound? The end of the datatype is at 5 bytes, but the next int needs to be on a 4 byte boundary • Padding = 3 bytes => ub=8 • The extent of the datatype is thus 8 • [Figure: two consecutive instances of the type on 4-byte boundaries – int, char, 3 bytes of pad – with the extent spanning one full instance]

  11. Extent is machine dependent • Since padding is machine dependent, a routine is provided to determine the extent • MPI_Type_extent(datatype,extent,ierror) • Lower and upper bound calls are also provided: • MPI_Type_lb(datatype,displacement,ierror) • MPI_Type_ub(datatype,displacement,ierror) • The total number of bytes of data contained in the datatype is given by • MPI_Type_size(datatype,size,ierror) • Remember – the extent defines the stride in memory between elements of a given datatype
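As an illustration (not from the slides), a small C program that builds the {(int,0),(char,4)} type from the padding example and queries its size and extent; it uses MPI_Type_create_struct and MPI_Type_get_extent, the current names for the MPI_Type_struct and MPI_Type_extent routines mentioned above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Typemap {(int,0),(char,4)} from the padding example */
    int          blocklens[2] = {1, 1};
    MPI_Aint     displs[2]    = {0, 4};
    MPI_Datatype types[2]     = {MPI_INT, MPI_CHAR};
    MPI_Datatype intchar;

    MPI_Type_create_struct(2, blocklens, displs, types, &intchar);
    MPI_Type_commit(&intchar);

    int size;            /* bytes of actual data: 4 + 1 = 5            */
    MPI_Aint lb, extent; /* extent may include alignment padding        */
    MPI_Type_size(intchar, &size);
    MPI_Type_get_extent(intchar, &lb, &extent);
    printf("size = %d bytes, lb = %ld, extent = %ld bytes\n",
           size, (long)lb, (long)extent);

    MPI_Type_free(&intchar);
    MPI_Finalize();
    return 0;
}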

  12. Ways to create datatypes • Contiguous • Vector • Generalization of contiguous, allows for regular gaps in displacements • Elements separated by multiples of the extent of the input datatype • Hvector • Similar to vector but elements are separated by bytes • Indexed • Array of displacements in the input, displacements are measured in multiples of the input datatype • Hindexed • Similar to indexed, but instead of displacements actual byte positions are used • Struct • Completely general, input is the typemap

  13. Vector datatypes • MPI_Type_vector(count, blocklength, stride, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Each block is blocklength data items long • Separation between successive starting blocks is stride • Function MPI_Type_contiguous can be thought of as a special case of MPI_Type_vector: • MPI_Type_contiguous(count, oldtype, newtype, ierr) = MPI_Type_vector(count, 1, 1, oldtype, newtype,ierr)
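For instance, a vector type can describe a strided slice of memory such as one column of a row-major matrix. The following C fragment is a sketch (not from the slides; it assumes MPI is already initialized and at least two processes are running):

/* Describe one column of a row-major NROWS x NCOLS matrix of doubles:
 * NROWS blocks of 1 double each, successive blocks NCOLS doubles apart. */
#define NROWS 4
#define NCOLS 6
double a[NROWS][NCOLS];
MPI_Datatype column;
MPI_Status status;
int rank;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
MPI_Type_commit(&column);

if (rank == 0)          /* send column 2 of a to rank 1            */
    MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
else if (rank == 1)     /* receive it into column 2 of the local a */
    MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD, &status);

MPI_Type_free(&column);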

  14. Vector Example • Let oldtype = {(double, 0), (char, 8)} (named_double) • Call MPI_Type_vector(2, 3, 4, named_double, six_named_doubles,ierr) • New MPI data type, which has the following map: • newtype = {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40), (double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104)} • Two blocks of data are taken, each containing 3 structures of the type named_double • The 3 structures within the block are concatenated contiguously • The stride is set to extent of 4 objects of the same type = 64 bytes, since each named_double must have an extent of 16 bytes (word alignment)

  15. Indexed data types • MPI_Type_indexed(count,array_of_blocklengths, array_of_displacements, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Block n is array_of_blocklengths(n) data items long • List of displacements is defined relative to the first item • MPI_Type_vector can be thought of as a special case of MPI_Type_indexed • All array_of_blocklengths entries = vector blocklen • List of displacements are multiples of the stride

  16. Example of using indexing • Suppose you only need a subset of an array • Could send many short messages • Could pack a datatype • Better solution: create a datatype that includes just the displacements you need • Example: list of offsets 9, 27, 39, 142

  17. Example cont • In this case we have 4 elements (=n_to_move) • Array of displacements (offset()) • Each part of the list is a single element • ‘Block’ lengths are all unity (blocksize()) • The list datatype will be denoted ltype • The following code performs the implicit packing for you:

MPI_Type_indexed(n_to_move,blocksize,offset,ltype,newtype,ierr)
MPI_Type_commit(newtype,ierr)
MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
MPI_Type_free(newtype,ierr)
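A C sketch of the same idea (an illustration, not the course code; it assumes buffer is an array of doubles and that dest and tag are already defined):

/* Send array elements 9, 27, 39 and 142 of buffer in a single message */
int offsets[4]    = {9, 27, 39, 142};  /* displacements, in elements */
int blocksizes[4] = {1, 1, 1, 1};      /* each block is one element  */
MPI_Datatype list_type;

MPI_Type_indexed(4, blocksizes, offsets, MPI_DOUBLE, &list_type);
MPI_Type_commit(&list_type);
MPI_Send(buffer, 1, list_type, dest, tag, MPI_COMM_WORLD);
MPI_Type_free(&list_type);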

  18. When to use the H versions • Data may be stored in dynamically allocated regions • No single buffer to measure displacements from • MPI_Address(location,address,ierr) returns the byte address of the element location • Create the list of byte positions for Hindexed using MPI_Address • MPI_BOTTOM can then be used as the starting address for the MPI_Send • MPI_Send(MPI_BOTTOM,1,mytype,dest,tag,comm,ierr)
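A C sketch of the hindexed approach (an illustration, not from the slides; MPI_Get_address and MPI_Type_create_hindexed are the current names for MPI_Address and MPI_Type_hindexed, and dest/tag are assumed to be defined):

/* Describe two separately malloc'ed blocks by absolute byte address
 * and send them in one message starting from MPI_BOTTOM. */
double *x = malloc(100 * sizeof(double));
double *y = malloc( 50 * sizeof(double));
MPI_Aint addrs[2];
int      lens[2] = {100, 50};
MPI_Datatype mytype;

MPI_Get_address(x, &addrs[0]);
MPI_Get_address(y, &addrs[1]);
MPI_Type_create_hindexed(2, lens, addrs, MPI_DOUBLE, &mytype);
MPI_Type_commit(&mytype);

/* Displacements are absolute addresses, so send from MPI_BOTTOM */
MPI_Send(MPI_BOTTOM, 1, mytype, dest, tag, MPI_COMM_WORLD);
MPI_Type_free(&mytype);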

  19. Packing • On some occasions packing may be desirable over specifying a datatype • e.g. packing across multiple arrays • MPI_Pack allows a user to incrementally add data into a user provided buffer • MPI_Pack(inbuf,incount,datatype,outbuf,outcount,position,comm,ierr) • Packed data is given the MPI_PACKED datatype • MPI_Pack_size can be used to determine how large the required buffer is in bytes • Specify number of elements, MPI datatype, communicator • Communicator is required since heterogeneous environment may require more packing than expected
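For example, a C sketch (an illustration, not from the slides) of sizing the buffer with MPI_Pack_size before packing one int and 100 chars, matching the example on the next slide:

/* Ask MPI how many bytes are needed to pack 1 int and 100 chars, then
 * allocate the buffer; the result may exceed the raw data size because
 * packed units can carry header/metadata. */
int bytes_int, bytes_char, bufsize;
char *buffer;

MPI_Pack_size(1,   MPI_INT,  MPI_COMM_WORLD, &bytes_int);
MPI_Pack_size(100, MPI_CHAR, MPI_COMM_WORLD, &bytes_char);
bufsize = bytes_int + bytes_char;
buffer  = (char *)malloc(bufsize);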

  20. Equivalent examples – both approaches send the same message

Struct datatype:

int i;
char c[100];
MPI_Aint disp[2];
int blocklen[2] = {1, 100};
MPI_Datatype type[2] = {MPI_INT, MPI_CHAR};
MPI_Datatype Type;
/* create datatype */
MPI_Get_address(&i, &(disp[0]));
MPI_Get_address(c, &(disp[1]));
MPI_Type_create_struct(2, blocklen, disp, type, &Type);
MPI_Type_commit(&Type);
/* send */
MPI_Send(MPI_BOTTOM, 1, Type, 1, 0, MPI_COMM_WORLD);

Packed datatype:

int i;
char c[100];
char buffer[110];
int position = 0;
/* pack */
MPI_Pack(&i, 1, MPI_INT, buffer, 110, &position, MPI_COMM_WORLD);
MPI_Pack(c, 100, MPI_CHAR, buffer, 110, &position, MPI_COMM_WORLD);
/* send */
MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);

  21. Receiving & unpacking packed messages • Note the MPI_PACKED datatype can be used to receive any message • To unpack, make successive calls to MPI_Unpack(inbuf,insize,position,outbuf,outcount,datatype,comm,ierr) • The first call starts at position 0 • The second call starts at the position returned by the first MPI_Unpack (the value returned in the position variable)

  22. MPI_Unpack example

int i;
char c[100];
MPI_Status status;
char buffer[110];
int position = 0;
/* receive */
MPI_Recv(buffer, 110, MPI_PACKED, 1, 0, MPI_COMM_WORLD, &status);
/* unpack */
MPI_Unpack(buffer, 110, &position, &i, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Unpack(buffer, 110, &position, c, 100, MPI_CHAR, MPI_COMM_WORLD);

  23. Note on implementation • A packing unit may contain metadata (recall issue of heterogeneity in communicator) • e.g. 64 processors of standard type, 1 additional graphics processor of opposite endian • Any header will contain information on encoding and the size of the unit for error checking • Implies you can’t concatenate two packed messages and send in a single operation • Similarly can’t separate an incoming message into two sections and unpack separately

  24. Communicators & Processor groups • [Figure: processes P1–P10 all belong to the default WORLD group (MPI_COMM_WORLD); subsets of them form Group 1/COMM 1, Group 2/COMM 2 and Group 3/COMM 3]

  25. Communicator properties • MPI_COMM_WORLD is the default communicator • In reality each communicator is a handle to an object that collectively describes a group of processes • The communicator is not the group • Definition of a new communicator (sometimes) requires that you first define the group • The communicator facilitates communication within the group • Communicators describe a communication “channel” • The processes in a group are ordered • Every process in the group has a rank • Ranks are ordered from 0 to size-1 • Collective operations are defined within the group

  26. Why would I need more than one communicator? • [Figure: over time, 3 processes exchange pieces of data (Data 1, Data 2, Data 3, Data 4) within their own communicators (Communicator 1–4), while 3 other processes do not communicate any data]

  27. Inter- & intra-communicators • Avoid the natural tendency to think of communicator as encapsulating a number of processes • Remember it is a handle to a communication domain • Communication within a group is facilitated by an intracommunicator • Default communication in MPI_COMM_WORLD is handled by an intracommunicator • Probably 90% of what you want to do can be done with intracommunicators • Communication across groups is facilitated by an intercommunicator • Only point-to-point communication is defined • MPI-2 defines collectives for intercommunicators

  28. Group Management • One way of defining a new communicator requires first that a new group be defined • The default world group can be retrieved via • MPI_Comm_group(comm,group,ierr) • e.g. MPI_Comm_group(MPI_COMM_WORLD,world,ierr) returns the (default) world group • New groups are constructed from old ones by • Inclusion • Exclusion • Union • Intersection • Difference • Inclusion over a range • Exclusion over a range • All group definitions must ultimately begin from the default world group • Other default group: MPI_GROUP_EMPTY – may be used as an argument to group creation routines • Groups are freed via MPI_Group_free(group,ierr)

  29. Inclusion / Exclusion • These two calls are the easiest ones to use for deriving groups from the default world group • MPI_Group_incl(group,n,ranks,newgroup,ierr) • Creates newgroup from the n processes in group with ranks ranks[0],…,ranks[n-1] • The process with rank i in the new group corresponds to ranks[i] in the old group • Example: group={a,b,c,d,e}, ranks=(3,1,2); a handle to the group {d,b,c} is returned – note the ordering depends on ranks[] • n=0 will return MPI_GROUP_EMPTY • MPI_Group_excl(group,n,ranks,newgroup) • Creates a new group that does not contain the processes with ranks ranks[0],…,ranks[n-1] • Example: group={a,b,c,d,e}, ranks=(3,1,2); a handle to the group {a,e} is returned • For n=0 the newgroup returned is simply group

  30. Union and Intersection • MPI_Group_union(grp1,grp2,newgroup,ierr) • All elements of the first group followed by all elements of the second group not in the first • Example: suppose grp1={a,b,c,d}, grp2={a,d,e} • grp1 U grp2 = {a,b,c,d,e} • MPI_Group_intersection(grp1,grp2,newgroup,ierr) • All elements of the first group that are also in the second, in the order they appear in the first • Example: grp1 ∩ grp2 = {a,d}

  31. Difference & ranges • MPI_Group_difference(grp1,grp2,newgroup,ierr) • All elements of first not included in the second, ordered in the same order as the first group • Example: suppose grp1={a,b,c,d}, grp2={a,d,e} • grp1\grp2 = {b,c} • MPI_Group_range_incl/excl(group,n,ranges,newgroup,ierr) • Ranges: one dimensional array of integer triplets • Fortran ranges(3,*), C ranges[][3] • Each triplet specifies a range of ranks to be included/excluded

  32. Communicator creation • Communicators can be duplicated via a call to MPI_Comm_dup(comm,newcomm,ierr) • New group communicator: • MPI_Comm_create(comm,group,newcomm) • Requires input of the comm associated with the old group • No attributes (e.g. those associated with virtual topologies) are propagated to the new communicator • New communicators can be further subdivided through the creation of new groups and communicators
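A C sketch of this route (illustrative only): retrieve the world group, include the even-ranked processes, and create a communicator for them:

MPI_Group world_group, even_group;
MPI_Comm  even_comm;
int size, i, n_even = 0;
int *even_ranks;

MPI_Comm_size(MPI_COMM_WORLD, &size);
even_ranks = (int *)malloc(((size + 1) / 2) * sizeof(int));
for (i = 0; i < size; i += 2)
    even_ranks[n_even++] = i;

MPI_Comm_group(MPI_COMM_WORLD, &world_group);   /* default world group */
MPI_Group_incl(world_group, n_even, even_ranks, &even_group);
MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);
/* even_comm is MPI_COMM_NULL on the odd-ranked processes */

MPI_Group_free(&world_group);
MPI_Group_free(&even_group);
if (even_comm != MPI_COMM_NULL) MPI_Comm_free(&even_comm);
free(even_ranks);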

  33. MPI_Comm_split – a fast way to define new communicators • Very powerful way of quickly creating (non-overlapping) sub-groups and communicators • The user selects the number of subgroups required and then provides a value (the ‘color’) that assigns each process to the appropriate subgroup • Defines communicators that communicate between a subset of the original comm • Circumvents the need to create a group explicitly • MPI_Comm_split(oldcomm,color,key,newcomm) • Collective operation • Processes passing the same color are placed in the same communicator • The key values determine how ranks are assigned in the new communicator • if all processes pass the same key, the new communicator ranks follow the ordering of the old one
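A minimal C sketch of MPI_Comm_split (illustrative; the grouping into rows of 4 is an arbitrary choice):

/* Split MPI_COMM_WORLD into "row" communicators of 4 processes each.
 * color = which row a process belongs to; key = old rank, so ordering
 * within each row follows the world ranks. */
int world_rank;
MPI_Comm row_comm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int color = world_rank / 4;   /* ranks 0-3 -> row 0, 4-7 -> row 1, ... */
int key   = world_rank;

MPI_Comm_split(MPI_COMM_WORLD, color, key, &row_comm);

int row_rank, row_size;
MPI_Comm_rank(row_comm, &row_rank);   /* rank within the row communicator */
MPI_Comm_size(row_comm, &row_size);

MPI_Comm_free(&row_comm);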

  34. Creating intercommunicators • Most of the previous methods implicitly create intracommunicators • MPI_Intercomm_create(local_comm,local_leader,peer_comm,remote_leader,tag,newintercomm) • The call is collective over the union of the two groups • Must provide a parent (peer) communicator that can be used to facilitate communication between both groups • Which pair of groups is connected is determined by the local_comm arguments • [Figure: two groups (Group 2/COMM 2 and Group 3/COMM 3) joined by an intercommunicator]
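A C sketch of intercommunicator creation (illustrative; assumes at least two processes): split MPI_COMM_WORLD into two halves and join them, using MPI_COMM_WORLD as the peer communicator and world ranks 0 and size/2 as the two local leaders:

int rank, size;
MPI_Comm half_comm, intercomm;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

int color = (rank < size / 2) ? 0 : 1;        /* which half am I in? */
MPI_Comm_split(MPI_COMM_WORLD, color, rank, &half_comm);

/* remote leader = world rank of the *other* half's leader */
int remote_leader = (color == 0) ? size / 2 : 0;
MPI_Intercomm_create(half_comm, 0, MPI_COMM_WORLD, remote_leader,
                     99, &intercomm);         /* tag 99 chosen arbitrarily */

/* Point-to-point calls on intercomm now address ranks in the other half */
MPI_Comm_free(&intercomm);
MPI_Comm_free(&half_comm);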

  35. Loose ends - communicator management functions • Don’t forget to free communicators when you’ve finished with them • MPI_Comm_free(comm,ierr) • Can also compare communicators to see if they are identical • MPI_Comm_compare(comm1,comm2,result,ierr)

  36. Summary Part 1 • Communicators are communication channels which operate within groups • New communicators can be derived either from new groups, or via MPI_Comm_split

  37. Part 2: Topologies, Cartesian Primitives and messaging performance • Applicable to applications where data structures reflect a specific domain decomposition • A virtual topology is an optional attribute that can be given to an intracommunicator only – not to intercommunicators • May possibly help the runtime mapping of processes onto hardware • We’ll examine one code example that brings together a number of features we have discussed and places them in context

  38. Inadequacies of linear ranking • Linear ranking implies a ring-like abstract topology for the processes • Few applications have a communication pattern that logically reflects a ring • Two or three dimensional grids/tori are very common • We call the logical process arrangement defined by the communication pattern the “virtual topology” • Do not confuse the virtual topology with the machine topology • The virtual topology must be mapped to the hardware via an embedding • This may or may not be beneficial

  39. Graphs & Virtual Topologies – Cartesian example • The communication pattern defines a graph • Communication graphs are assumed to be symmetric • if a communicates with b, b can communicate with a • Communicators still allow messaging between any pair of nodes • the virtual topology may simply neglect some communication paths • [Figure: a 3x4 2d virtual topology; the upper number in each box is the rank, the lower value gives (column,row)]

  40. Specific example: Poisson Equation on a 2d grid • x_i = i/(n+1), where n defines the number of points; the same in the y axis • On the unit square the grid spacing is h = 1/(n+1) • The equation is approximated by the 5-point finite difference (Jacobi) update: u_new(i,j) = 0.25*( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - h^2 f(i,j) ) • Solve by guessing for u initially, then iteratively updating u (Jacobi iteration) • [Figure: interior and boundary points, with the 5-point stencil at (i,j) and its neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1)]

  41. 2d Poisson continued • Calculation of the updated u^k involves looping over all grid points • How do we decompose the problem in parallel? • Simplest choice: 1-d decomposition (slabs) • Solid lines denote data stored on the node • Dotted lines denote all data required for the calculation • Nodes will always require data that they do not hold a copy of locally – ghost cells • [Figure: the grid divided into two slabs, Rank=0 below Rank=1]

  42. Changes to the loop structure • The loop is now over the node’s start and end points • Data arrays must include space for the j+1 and j-1 entries in the y direction • Ghost data must be communicated • MPI_Cart topologies create the necessary facilities for this messaging to be done quickly and easily

Single processor:

      integer i,j,n
      double precision u(0:n+1,0:n+1)
      double precision unew(0:n+1,0:n+1)
      do j=1,n
         do i=1,n
            unew(i,j)= 0.25*( u(i-1,j) + u(i,j+1)+u(i,j-1)+u(i+1,j)
     +                        -h**2*f(i,j) )
         end do
      end do

Parallel node:

      integer i,j,n
      double precision u(0:n+1,s-1:e+1)
      double precision unew(0:n+1,s-1:e+1)
      do j=s,e
         do i=1,n
            unew(i,j)= 0.25*( u(i-1,j) + u(i,j+1)+u(i,j-1)+u(i+1,j)
     +                        -h**2*f(i,j) )
         end do
      end do

  43. MPI_Cart_create • Creates a communicator with a Cartesian topology • The 2nd through 5th arguments define the new topology; the 1st argument is the initial communicator, the 6th argument is the new communicator • Example for a logical 2d grid of processors:

      integer dims(2)
      logical isperiodic(2),reorder
      dims(1)=4
      dims(2)=3
      isperiodic(1)=.false.
      isperiodic(2)=.false.
      reorder=.true.
      ndim=2
      call MPI_Cart_create(MPI_COMM_WORLD,
     +                     ndim,dims,isperiodic,reorder,
     +                     comm2d,ierr)

• dims: dimensions of the axes • isperiodic: are the axis directions periodic? (consider a grid on the Earth for example) • reorder: rank in the new communicator may be different from the old one (may improve performance)

  44. Important issue • The topology MPI_Cart_create initializes is related to the communication required by the domain decomposition • Not the implicit communication structure of the algorithm • For the 1d decomposition all x values are contained on node – only y boundaries need be exchanged • In the 2d Poisson example we can choose either a 1d or 2d decomposition

  45. Finding your coordinates & neighbours in the new communicator • MPI_Cart_get(comm1d,1,dims,isperiodic,coords,ierr) • Returns dims, isperiodic as defined in the communicator • The calling process’s coords are returned • You may also want to convert a rank to coordinates • MPI_Cart_coords(comm1d,rank,maxdims,coords,ierr) • If you call it with your own rank, the call returns your coords • When moving data we frequently wish to shift across one axis direction synchronously across all processors • If you just want to send and receive ghost cell values to/from neighbours there is a simple mechanism for determining the neighbouring ranks • MPI_Cart_shift

  46. MPI_Cart_shift • MPI_Cart_shift(comm,direction,shift,src,dest,ierr) • The calling process’s rank is implicit • Input the direction to shift (note directions are numbered from 0 to n-1) • Input the displacement • >0 = upward shift • <0 = downward shift • Returns the rank of the neighbour in the upper direction (dest) • Returns the rank of the neighbour in the lower direction (src) • If there is no neighbour, MPI_PROC_NULL is returned • Messages to this rank are ignored, and complete immediately • [Figure: boundary exchanges between Rank=0 and Rank=1 – data must move upward and downward]
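A C sketch of a ghost-cell exchange built on MPI_Cart_shift (an illustration paralleling the Fortran exchng1 routine shown later; comm1d, nx, s, e and the array a are assumptions about the surrounding code, and ROW(j) is a hypothetical macro giving the address of the first interior element of row j):

/* a holds rows s-1 .. e+1, each nx+2 contiguous doubles long;
 * rows s-1 and e+1 are the ghost rows. */
#define ROW(j) (&a[((j)-(s-1))*(nx+2) + 1])
int nbrbottom, nbrtop;
MPI_Status status;

MPI_Cart_shift(comm1d, 0, 1, &nbrbottom, &nbrtop);

/* send my top row up, receive my bottom ghost row from below */
MPI_Sendrecv(ROW(e),   nx, MPI_DOUBLE, nbrtop,    0,
             ROW(s-1), nx, MPI_DOUBLE, nbrbottom, 0,
             comm1d, &status);
/* send my bottom row down, receive my top ghost row from above */
MPI_Sendrecv(ROW(s),   nx, MPI_DOUBLE, nbrbottom, 1,
             ROW(e+1), nx, MPI_DOUBLE, nbrtop,    1,
             comm1d, &status);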

  47. Remaining ingredients • Still need to compute the start and end points for each node • If n is divisible by nprocs (best for load balance):

      s = 1 + myrank * (n/nprocs)
      e = s + (n/nprocs) - 1

• When nprocs does not divide n, a naïve solution is:

      s = 1 + myrank * (n/nprocs)
      if (myrank .eq. nprocs-1) then
         e = n
      else
         e = s + floor(n/nprocs) - 1
      end if

• floor(x) returns the largest integer value no greater than x • Better to use MPE_Decomp1d to even things out, but the load balance still won’t be perfect (see the ANL/MSU MPI website)
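A C sketch of a more balanced split in the spirit of MPE_Decomp1d (an assumption about what such a routine does, not its actual source): the first n mod nprocs ranks each get one extra point:

/* Balanced 1-d block decomposition of points 1..n over nprocs ranks */
void decomp1d(int n, int nprocs, int myrank, int *s, int *e)
{
    int base = n / nprocs;   /* minimum points per rank            */
    int rem  = n % nprocs;   /* ranks 0..rem-1 get one extra point */

    if (myrank < rem) {
        *s = 1 + myrank * (base + 1);
        *e = *s + base;              /* base+1 points in total */
    } else {
        *s = 1 + rem * (base + 1) + (myrank - rem) * base;
        *e = *s + base - 1;          /* base points in total   */
    }
}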

  48. Prototype code • Source: http://www-unix.mcs.anl.gov/mpi/usingmpi/examples/intermediate/oned_f.htm

      call MPI_CART_CREATE( MPI_COMM_WORLD, 1, numprocs, .false.,
     $                      .true., comm1d, ierr )
c
c Get my position in this communicator, and my neighbors
c
      call MPI_COMM_RANK( comm1d, myid, ierr )
      call MPI_Cart_shift( comm1d, 0, 1, nbrbottom, nbrtop, ierr )
c Compute the actual decomposition
      call MPE_DECOMP1D( ny, numprocs, myid, s, e )
c Initialize the right-hand-side (f) and the initial solution guess (a)
      call onedinit( a, b, f, nx, s, e )
c
c Actually do the computation.  Note the use of a collective operation to
c check for convergence, and a do-loop to bound the number of iterations.
c
      do 10 it=1, maxit
         call exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( a, f, nx, s, e, b )          ! Jacobi sweep
         call exchng1( b, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( b, f, nx, s, e, a )
         dwork = diff( a, b, nx, s, e )             ! Convergence test
         call MPI_Allreduce( dwork, diffnorm, 1, MPI_DOUBLE_PRECISION,
     $                       MPI_SUM, comm1d, ierr )
         if (diffnorm .lt. 1.0e-5) goto 20
c        if (myid .eq. 0) print *, 2*it, ' Difference is ', diffnorm
 10   continue
      if (myid .eq. 0) print *, 'Failed to converge'
 20   continue
      if (myid .eq. 0) then
         print *, 'Converged after ', 2*it, ' Iterations'
      endif

• The routine alternately computes into a and then b – hence the 2*it at the end

  49. Which routine determines performance (other than the computational kernel)? • exchng1 controls the messaging, and how this is coded will strongly affect the overall execution time • Recall from last lecture, 5(!) different ways to do the communication: • Blocking sends and receives • a very bad idea, potentially unsafe (although in this case it is OK) • Ordered send • match sends and receives perfectly • Sendrecv • the system deals with buffering • Buffered Bsend • Nonblocking Isend

  50. Option #1: Simple sends and receives

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
C
      call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &               nbrtop, 0, comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &               nbrbottom, 0, comm1d, status, ierr )
      call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &               nbrbottom, 1, comm1d, ierr )
      call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &               nbrtop, 1, comm1d, status, ierr )
      return
      end
