Computational Methods in Astrophysics Dr Rob Thacker (AT319E) thacker@ap
Today’s Lecture Distributed Memory Computing II • User defined (derived) datatypes, packing, communicators • Abstract topologies, Cartesian primitives, performance and scaling
Building derived types from built-in types • MPI comes with plenty of built-in datatypes (e.g. MPI_INTEGER, MPI_DOUBLE_PRECISION, MPI_CHARACTER)
Making a trivial new datatype • MPI_Type_contiguous(n,oldtype,newtype,ierror) • Produces a new datatype that is n copies of the old • Each entry is displaced by the extent of oldtype • oldtype can itself be a derived datatype • Must "commit" a type before using it in communication: MPI_Type_commit(newtype,ierr) • Free it with MPI_Type_free(newtype,ierr) – only a finite number of datatypes may exist at once • The two call sequences below send identical messages:

MPI_Send(buffer,count,datatype,dest,tag,comm,ierr)

Equivalent calls:

MPI_Type_contiguous(count,datatype,newtype,ierr)
MPI_Type_commit(newtype,ierr)
MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
MPI_Type_free(newtype,ierr)
Arbitrary datatypes • To derive a datatype from another, specify a typemap: a list of (datatype, byte displacement) pairs • The type signature is the list of basic datatypes within the new datatype • The type signature controls how both sent and received data are interpreted; the displacements indicate where in memory to find each item
Vector datatypes • MPI_Type_vector(count, blocklength, stride, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Each block is blocklength data items long • The separation between the starts of successive blocks is stride, measured in units of the extent of oldtype • MPI_Type_contiguous can be thought of as a special case of MPI_Type_vector: • MPI_Type_contiguous(count, oldtype, newtype, ierr) = MPI_Type_vector(count, 1, 1, oldtype, newtype, ierr)
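A common use of MPI_Type_vector is sending one row of a Fortran (column-major) 2d array, whose elements are not contiguous in memory. A minimal sketch, assuming a run on at least two ranks; the array shape and the choice of row 3 are illustrative only:

program vector_row_example
  use mpi
  implicit none
  integer, parameter :: nx = 8, ny = 5
  double precision :: a(nx, ny)
  integer :: rowtype, ierr, myrank, status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  ! A row of a column-major array is non-contiguous: successive
  ! elements of row i are separated by nx array elements.
  call MPI_Type_vector(ny, 1, nx, MPI_DOUBLE_PRECISION, rowtype, ierr)
  call MPI_Type_commit(rowtype, ierr)

  if (myrank == 0) then
    a = 1.0d0
    ! Send row 3 (ny elements, stride nx) to rank 1
    call MPI_Send(a(3,1), 1, rowtype, 1, 0, MPI_COMM_WORLD, ierr)
  else if (myrank == 1) then
    ! Receive it into row 3 of the local array
    call MPI_Recv(a(3,1), 1, rowtype, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  call MPI_Type_free(rowtype, ierr)
  call MPI_Finalize(ierr)
end program vector_row_example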
Vector Example • Let oldtype = {(double, 0), (char, 8)} (named_double) • Call MPI_Type_vector(2, 3, 4, named_double, six_named_doubles,ierr) • New MPI data type, which has the following map: • newtype = {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40), (double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104)} • Two blocks of data are taken, each containing 3 structures of the type named_double • The 3 structures within the block are concatenated contiguously • The stride is set to extent of 4 objects of the same type = 64 bytes, since each named_double must have an extent of 16 bytes (word alignment)
Indexed data types • MPI_Type_indexed(count,array_of_blocklengths, array_of_displacements, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Block n is array_of_blocklengths(n) data items long • List of displacements is defined relative to the first item • MPI_Type_vector can be thought of as a special case of MPI_Type_indexed • All array_of_blocklengths entries = vector blocklen • List of displacements are multiples of the stride
Example of using indexing • Suppose you only need a subset of an array • Could send many short messages • Could pack the data into a buffer • Better solution: create a datatype that includes the displacements you need • Example: list of offsets 9, 27, 39, 142
Example cont • In this case we have 4 elements (=n_to_move) • Array of displacements (offset()) • Each part of the list is a single element • `Block' lengths are all unity (blocksize()) • List datatype will be denoted ltype • The following code performs the implicit packing for you:

MPI_Type_indexed(n_to_move,blocksize,offset,ltype,newtype,ierr)
MPI_Type_commit(newtype,ierr)
MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
MPI_Type_free(newtype,ierr)
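Filled out with declarations, the same idea looks roughly like the sketch below. The subroutine name, the element type MPI_DOUBLE_PRECISION and the fixed offsets are illustrative assumptions, not the lecture's exact code:

subroutine send_subset(buffer, n, dest, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, dest, comm
  double precision, intent(in) :: buffer(n)   ! assumes n > 142
  integer, parameter :: n_to_move = 4
  integer :: blocksize(n_to_move), offset(n_to_move)
  integer :: newtype, ierr

  ! Offsets (in elements, relative to buffer(1)) of the entries we want
  offset = (/ 9, 27, 39, 142 /)
  ! Each block is a single element
  blocksize = 1

  call MPI_Type_indexed(n_to_move, blocksize, offset, &
                        MPI_DOUBLE_PRECISION, newtype, ierr)
  call MPI_Type_commit(newtype, ierr)
  call MPI_Send(buffer, 1, newtype, dest, 0, comm, ierr)
  call MPI_Type_free(newtype, ierr)
end subroutine send_subset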
When to use the H (byte-displacement) versions • Data may be stored in a dynamically allocated region • There is then no single buffer to measure displacements from • MPI_Address(location,address,ierr) returns the byte address of the element location • Create the list of byte displacements for the Hindexed type using MPI_Address • MPI_BOTTOM can then be used as the starting address for the MPI_Send • MPI_Send(MPI_BOTTOM,1,mytype,dest,tag,comm,ierr)
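A possible sketch of this approach using the modern names MPI_Get_address and MPI_Type_create_hindexed (the MPI-2 replacements for MPI_Address and MPI_Type_hindexed); the subroutine and its two separately stored arrays are assumptions for illustration:

subroutine send_two_regions(x, nx, y, ny, dest, comm)
  use mpi
  implicit none
  integer, intent(in) :: nx, ny, dest, comm
  double precision, intent(in) :: x(nx), y(ny)   ! possibly allocated separately
  integer :: blocklens(2), htype, ierr
  integer(kind=MPI_ADDRESS_KIND) :: displs(2)

  ! Absolute byte addresses of the two regions
  call MPI_Get_address(x, displs(1), ierr)
  call MPI_Get_address(y, displs(2), ierr)
  blocklens(1) = nx
  blocklens(2) = ny

  ! Byte-displacement ("H") version of the indexed constructor
  call MPI_Type_create_hindexed(2, blocklens, displs, &
                                MPI_DOUBLE_PRECISION, htype, ierr)
  call MPI_Type_commit(htype, ierr)

  ! With absolute addresses the send buffer argument is MPI_BOTTOM
  call MPI_Send(MPI_BOTTOM, 1, htype, dest, 0, comm, ierr)
  call MPI_Type_free(htype, ierr)
end subroutine send_two_regions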
Packing • On some occasions packing may be preferable to specifying a datatype • e.g. packing across multiple arrays • MPI_Pack allows a user to incrementally add data into a user-provided buffer • MPI_Pack(inbuf,incount,datatype,outbuf,outcount,position,comm,ierr) • Packed data is given the MPI_PACKED datatype • MPI_Pack_size can be used to determine how large the required buffer is in bytes • Specify the number of elements, the MPI datatype and the communicator • The communicator is required because a heterogeneous environment may require more packing than expected
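A hedged Fortran sketch of sizing the buffer with MPI_Pack_size before packing; the subroutine name and the particular arrays being packed are illustrative assumptions:

subroutine pack_and_send(n, x, dest, comm)
  use mpi
  implicit none
  integer, intent(in) :: n(5), dest, comm
  double precision, intent(in) :: x(100)
  integer :: sz1, sz2, bufsize, position, ierr
  character, allocatable :: buffer(:)

  ! Ask MPI for an upper bound on the packed size of each piece
  call MPI_Pack_size(5,   MPI_INTEGER,          comm, sz1, ierr)
  call MPI_Pack_size(100, MPI_DOUBLE_PRECISION, comm, sz2, ierr)
  bufsize = sz1 + sz2          ! may exceed the raw data size (headers, alignment)
  allocate(buffer(bufsize))

  ! Incrementally pack both arrays into one buffer
  position = 0
  call MPI_Pack(n, 5,   MPI_INTEGER,          buffer, bufsize, position, comm, ierr)
  call MPI_Pack(x, 100, MPI_DOUBLE_PRECISION, buffer, bufsize, position, comm, ierr)

  ! Send only the bytes actually used
  call MPI_Send(buffer, position, MPI_PACKED, dest, 0, comm, ierr)
  deallocate(buffer)
end subroutine pack_and_send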
Equivalent examples – both approaches send the same message

Struct datatype:

int i;
char c[100];
MPI_Aint disp[2];
int blocklen[2] = {1, 100};
MPI_Datatype type[2] = {MPI_INT, MPI_CHAR};
MPI_Datatype Type;
/* create datatype */
MPI_Get_address(&i, &(disp[0]));
MPI_Get_address(c, &(disp[1]));
MPI_Type_create_struct(2, blocklen, disp, type, &Type);
MPI_Type_commit(&Type);
/* send */
MPI_Send(MPI_BOTTOM, 1, Type, 1, 0, MPI_COMM_WORLD);

Packed datatype:

int i;
char c[100];
char buffer[110];
int position = 0;
/* pack */
MPI_Pack(&i, 1, MPI_INT, buffer, 110, &position, MPI_COMM_WORLD);
MPI_Pack(c, 100, MPI_CHAR, buffer, 110, &position, MPI_COMM_WORLD);
/* send */
MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
Receiving & unpacking packed messages • Note the MPI_PACKED datatype can be used to receive any message • To unpack, make successive calls to MPI_Unpack(inbuf,insize,position,outbuf,outcount,datatype,comm,ierr) • The first call starts at position 0 • The second call starts at the position returned by the first MPI_Unpack (the value returned in the position variable)
MPI_Unpack example

int i;
char c[100];
MPI_Status status;
char buffer[110];
int position = 0;
/* receive */
MPI_Recv(buffer, 110, MPI_PACKED, 1, 0, MPI_COMM_WORLD, &status);
/* unpack */
MPI_Unpack(buffer, 110, &position, &i, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Unpack(buffer, 110, &position, c, 100, MPI_CHAR, MPI_COMM_WORLD);
Note on implementation • A packing unit may contain metadata (recall issue of heterogeneity in communicator) • e.g. 64 processors of standard type, 1 additional graphics processor of opposite endian • Any header will contain information on encoding and the size of the unit for error checking • Implies you can’t concatenate two packed messages and send in a single operation • Similarly can’t separate an incoming message into two sections and unpack separately
Topologies, Cartesian Primitives and Messaging Performance • Applicable to applications where data structures reflect a specific domain decomposition • A topology is an optional attribute that can be given to an intracommunicator only – not to intercommunicators • May possibly help runtime mapping of processes onto hardware • We'll examine one code example that brings together a number of features we have discussed and places them in context
Inadequacies of linear ranking • Linear ranking implies a ring-like abstract topology for the processes • Few applications have a communication pattern that logically reflects a ring • Two- or three-dimensional grids/tori are very common • The logical process arrangement defined by the communication pattern is called the "virtual topology" • Do not confuse the virtual topology with the machine topology • The virtual topology must be mapped to the hardware via an embedding • This may or may not be beneficial
Graphs & Virtual Topologies • The communication pattern defines a graph • Communication graphs are assumed to be symmetric: if a communicates with b, b can communicate with a • Communicators still allow messaging between any pair of nodes – a virtual topology may neglect some communication paths [Figure: Cartesian example – a 3x4 2d virtual topology; the upper number is the rank, the lower value gives (column,row)]
Specific example: Poisson equation on a 2d grid • Grid points x_i = i/(n+1), where n is the number of interior points (same in the y axis) • On the unit square the grid spacing is h = 1/(n+1) • The equation is approximated by the 5-point stencil: ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4u(i,j) ) / h^2 = f(i,j) • Solve by guessing u initially, then iterating (Jacobi iteration): u_new(i,j) = 0.25*( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - h^2 f(i,j) ) [Figure: 5-point stencil – interior point (i,j) with neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1); boundary points around the edge]
2d Poisson continued • Calculation of the updated u^k involves looping over all grid points • How do we decompose the problem in parallel? • Simplest choice: 1-d decomposition (slabs) • Solid lines denote data stored on a node, dotted lines denote all data required for the calculation • Nodes will always require data that they do not hold locally – ghost cells [Figure: slabs owned by rank 0 and rank 1, with the dotted ghost region overlapping the neighbouring slab]
Changes to the loop structure • The loop is now over the node's start and end points • Data arrays must include space for the j+1 and j-1 entries in the y direction • Ghost data must be communicated • MPI_Cart topologies create the necessary facilities for this messaging to be done quickly and easily

Single processor:

integer i,j,n
double precision u(0:n+1,0:n+1)
double precision unew(0:n+1,0:n+1)
do j=1,n
   do i=1,n
      unew(i,j)= 0.25*( u(i-1,j) + u(i,j+1)+u(i,j-1)+u(i+1,j)
     +                  -h**2*f(i,j))
   end do
end do

Parallel node:

integer i,j,n
double precision u(0:n+1,s-1:e+1)
double precision unew(0:n+1,s-1:e+1)
do j=s,e
   do i=1,n
      unew(i,j)= 0.25*( u(i-1,j) + u(i,j+1)+u(i,j-1)+u(i+1,j)
     +                  -h**2*f(i,j))
   end do
end do
MPI_Cart_create • Creates a communicator with a Cartesian topology • The 2nd through 5th arguments define the new topology; the 1st argument is the initial communicator, the 6th argument is the new communicator • Example for a logical 2d grid of processors:

integer dims(2)
logical isperiodic(2),reorder
dims(1)=4                  ! dimensions of axes
dims(2)=3
isperiodic(1)=.false.      ! are axis directions periodic?
isperiodic(2)=.false.      ! (consider a grid on the Earth, for example)
reorder=.true.             ! rank in new communicator may be different from old
ndim=2                     ! (may improve performance)
call MPI_Cart_create(MPI_COMM_WORLD,
+                    ndim,dims,isperiodic,reorder,
+                    comm2d,ierr)
Important issue • The topology MPI_Cart_create initializes is related to the communication required by the domain decomposition • Not the implicit communication structure of the algorithm • For the 1d decomposition all x values are contained on node – only y boundaries need be exchanged • In the 2d Poisson example we can choose either a 1d or 2d decomposition
Finding your coordinates & neighbours in the new communicator • MPI_Cart_get(comm1d,1,dims,isperiodic,coords,ierr) • Returns dims and isperiodic as defined in the communicator • The calling process's coords are also returned • You may also want to convert a rank to coordinates • MPI_Cart_coords(comm1d,rank,maxdims,coords,ierr) • If you call it with your own rank, it returns your own coords • When moving data we frequently wish to shift along one axis direction simultaneously on all processes • If you just want to send and receive ghost cell values to/from neighbours there is a simple mechanism for determining the neighbouring ranks • MPI_Cart_shift
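A minimal sketch of these calls, assuming comm2d was created as on the MPI_Cart_create slide (a 4x3 non-periodic grid); the neighbour variable names are just a convention:

integer :: dims(2), coords(2), myrank, ierr
integer :: nbrleft, nbrright, nbrbottom, nbrtop
logical :: isperiodic(2)

! Recover the topology description and this process's coordinates
call MPI_Comm_rank(comm2d, myrank, ierr)
call MPI_Cart_get(comm2d, 2, dims, isperiodic, coords, ierr)
! Equivalently, convert any rank to coordinates
call MPI_Cart_coords(comm2d, myrank, 2, coords, ierr)

! Neighbours one step away along each axis
! (MPI_PROC_NULL is returned at a non-periodic boundary)
call MPI_Cart_shift(comm2d, 0, 1, nbrleft, nbrright, ierr)
call MPI_Cart_shift(comm2d, 1, 1, nbrbottom, nbrtop, ierr)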
MPI_Cart_shift • MPI_Cart_shift(comm,direction,shift,src,dest,ierr) • The shift is evaluated relative to the calling process's position in comm • Input the direction to shift (note: directions are numbered from 0 to ndims-1) • Input the displacement: >0 = upward shift, <0 = downward shift • Returns the rank of the neighbour in the upper direction (dest) and the rank of the neighbour in the lower direction (src) • If there is no neighbour, MPI_PROC_NULL is returned • Messages to this rank are ignored and complete immediately [Figure: boundary exchanges between rank 0 and rank 1 – data must move upward and downward]
Remaining ingredients • Still need to compute the start and end points for each node • If n is divisible by nprocs (best for load balance):

s = 1 + myrank * (n/nprocs)
e = s + (n/nprocs) - 1

• When nprocs does not divide n, a naïve solution:

s = 1 + myrank * (n/nprocs)
if (myrank .eq. nprocs-1) then
   e = n
else
   e = s + floor(n/nprocs) - 1
end if

• floor(x) returns the largest integer no greater than x • Better to use MPE_Decomp1d to even things out, but the load balance still won't be perfect (see the ANL/MSU MPI website and the sketch below)
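A sketch of the kind of balanced split MPE_Decomp1d performs (the real routine ships with the ANL/MSU examples; this is just the standard idea, with the first mod(n,nprocs) ranks taking one extra point each):

! Split 1..n as evenly as possible over nprocs ranks
subroutine decomp1d(n, nprocs, myrank, s, e)
  implicit none
  integer, intent(in)  :: n, nprocs, myrank
  integer, intent(out) :: s, e
  integer :: nlocal, deficit

  nlocal  = n / nprocs
  deficit = mod(n, nprocs)
  s = myrank * nlocal + 1 + min(myrank, deficit)
  if (myrank < deficit) nlocal = nlocal + 1
  e = s + nlocal - 1
  if (e > n .or. myrank == nprocs - 1) e = n
end subroutine decomp1d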
http://www-unix.mcs.anl.gov/mpi/usingmpi/examples/intermediate/oned_f.htm

Prototype code:

      call MPI_CART_CREATE( MPI_COMM_WORLD, 1, numprocs, .false.,
     $                      .true., comm1d, ierr )
c
c Get my position in this communicator, and my neighbors
c
      call MPI_COMM_RANK( comm1d, myid, ierr )
      call MPI_Cart_shift( comm1d, 0, 1, nbrbottom, nbrtop, ierr )
c Compute the actual decomposition
      call MPE_DECOMP1D( ny, numprocs, myid, s, e )
c Initialize the right-hand-side (f) and the initial solution guess (a)
      call onedinit( a, b, f, nx, s, e )
c
c Actually do the computation.  Note the use of a collective operation to
c check for convergence, and a do-loop to bound the number of iterations.
c
      do 10 it=1, maxit
         call exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( a, f, nx, s, e, b )            ! Jacobi sweep
         call exchng1( b, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( b, f, nx, s, e, a )
         dwork = diff( a, b, nx, s, e )               ! Convergence test
         call MPI_Allreduce( dwork, diffnorm, 1, MPI_DOUBLE_PRECISION,
     $                       MPI_SUM, comm1d, ierr )
         if (diffnorm .lt. 1.0e-5) goto 20
c        if (myid .eq. 0) print *, 2*it, ' Difference is ', diffnorm
 10   continue
      if (myid .eq. 0) print *, 'Failed to converge'
 20   continue
      if (myid .eq. 0) then
         print *, 'Converged after ', 2*it, ' Iterations'
      endif

The routine alternately computes into a and then b – hence 2*it at the end.
Which routine determines performance (other than the computational kernel)? • exchng1 controls the messaging, and how this is coded will strongly affect overall execution time • Recall from last lecture, 5(!) different ways to do the communication: • Blocking sends and receives – very bad idea, potentially unsafe (although in this case it is OK) • Ordered send – match sends and receives perfectly • Sendrecv – the system deals with buffering • Buffered Bsend • Nonblocking Isend
Option #1: Simple sends and receives

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
C
      call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &               nbrtop, 0, comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &               nbrbottom, 0, comm1d, status, ierr )
      call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &               nbrbottom, 1, comm1d, ierr )
      call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &               nbrtop, 1, comm1d, status, ierr )
      return
      end
Option #2: Matched sends and receives

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      use mpi
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop, rank, coord
      integer status(MPI_STATUS_SIZE), ierr
!
      call MPI_COMM_RANK( comm1d, rank, ierr )
      call MPI_CART_COORDS( comm1d, rank, 1, coord, ierr )
      if (mod( coord, 2 ) .eq. 0) then
         call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 0, comm1d, ierr )
         call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 0, comm1d, status, ierr )
         call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 1, comm1d, ierr )
         call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 1, comm1d, status, ierr )
      else
         call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 0, comm1d, status, ierr )
         call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 0, comm1d, ierr )
         call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 1, comm1d, status, ierr )
         call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 1, comm1d, ierr )
      endif
      return
      end
Option #3: Sendrecv

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
c
      call MPI_SENDRECV(
     &     a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop, 0,
     &     a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom, 0,
     &     comm1d, status, ierr )
      call MPI_SENDRECV(
     &     a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom, 1,
     &     a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop, 1,
     &     comm1d, status, ierr )
      return
      end
Option #4: Buffered Bsend

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      use mpi
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr

      call MPI_BSEND( a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop,
     &                0, comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom,
     &               0, comm1d, status, ierr )
      call MPI_BSEND( a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom,
     &                1, comm1d, ierr )
      call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop,
     &               1, comm1d, status, ierr )
      return
      end

Must also provide buffer space and attach to it via MPI_BUFFER_ATTACH (see the Using MPI book)
Option #5: Nonblocking Isend

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status_array(MPI_STATUS_SIZE,4), ierr, req(4)
C
      call MPI_IRECV (
     &     a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom, 0,
     &     comm1d, req(1), ierr )
      call MPI_IRECV (
     &     a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop, 1,
     &     comm1d, req(2), ierr )
      call MPI_ISEND (
     &     a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop, 0,
     &     comm1d, req(3), ierr )
      call MPI_ISEND (
     &     a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom, 1,
     &     comm1d, req(4), ierr )
C
      call MPI_WAITALL ( 4, req, status_array, ierr )
      return
      end
[Figure: execution time vs Ncpu comparing code speed for the different communication methods (problem size chosen to be large enough to decompose). The blocking send version is by far the worst – what happened?]
What happens in the blocking send version? • The blocking communication can create a serial cascade • The "top" process sends a message to a null process (MPI_PROC_NULL), which completes immediately • It can then receive from another process • The ability to receive then cascades down through the process ranks [Figure: time-line of processes P0–P5, each stuck in SEND until the process above it posts its RECV – completely serial communication]
Observed scaling • Even non-blocking sends do not scale to better than 20% parallel efficiency on 64 nodes – why? [Figure: speed-up vs Ncpu]
Scalability Analysis • Examine execution time to see if time spent in communication is a bottleneck • Time to send n bytes: Timecomm = s + r*n • s = start-up overhead (latency) • r = time to send one byte (~1/bandwidth) • Time in the exchng1 routine ~ 2(s + r*n) • n = 8*nx for double precision data • Communication time per process is fixed regardless of the number of processors! • So the communication does not scale
2-d topology communication time • Excluding processes on the domain edge, each process exchanges its x boundaries and then its y boundaries • For a square grid on p total processes each boundary holds only nx/sqrt(p) points, so Timecomm ~ 4(s + r*n/sqrt(p)) with n = 8*nx as before • Communication time decreases as 1/sqrt(p) – moderately scalable • Note that for a small number of processes the 1d decomposition is actually better (it sends 2 messages rather than 4, so less start-up overhead)
Changes required for the 2-d domain decomposition • Need to update process topology • MPI_Cart_create(MPI_COMM_WORLD,2,.. • Need two calls to MPI_Cart_shift • One for each axis direction • Body of sweep routine must change loop limits • Now have limits in x direction • Arrays must also be updated for smaller sizes • Lastly, exchng1 must be updated to exchng2 • Boundary exchanges in both directions
2d: User-derived datatype for boundary exchanges • The x-boundary consists of non-contiguous memory locations • This is unavoidable: the 2d array is mapped into linear memory (column-major in Fortran) • Use MPI_Type_vector, discussed previously
Solution • Number of blocks = (y end) - (y start) + 1 = (ey+1) - (sy-1) + 1 = ey-sy+3 (the ghost rows are included) • e.g. if end=4, start=2 there are 3 entries • 1 entry per block • Stride = (x end) - (x start) + 1 = (ex+1) - (sx-1) + 1 = ex-sx+3 • Remember the stride is the displacement between the starts of successive blocks • Create and commit:

call MPI_Type_vector(ey-sy+3,1,ex-sx+3,
&                    MPI_DOUBLE_PRECISION, stridetype,ierr)
call MPI_Type_commit(stridetype,ierr)
exchng2d

      subroutine exchng2( a, sx, ex, sy, ey,
     &                    comm2d, stridetype,
     &                    nbrleft, nbrright, nbrtop, nbrbottom )
      include 'mpif.h'
      integer sx, ex, sy, ey, stridetype
      double precision a(sx-1:ex+1, sy-1:ey+1)
      integer nbrleft, nbrright, nbrtop, nbrbottom, comm2d
      integer status(MPI_STATUS_SIZE), ierr, nx
c
      nx = ex - sx + 1
c These are just like the 1-d versions, except for less data
      call MPI_SENDRECV( a(sx,ey), nx, MPI_DOUBLE_PRECISION,
     &                   nbrtop, 0,
     &                   a(sx,sy-1), nx, MPI_DOUBLE_PRECISION,
     &                   nbrbottom, 0, comm2d, status, ierr )
      call MPI_SENDRECV( a(sx,sy), nx, MPI_DOUBLE_PRECISION,
     &                   nbrbottom, 1,
     &                   a(sx,ey+1), nx, MPI_DOUBLE_PRECISION,
     &                   nbrtop, 1, comm2d, status, ierr )
c
c This uses the vector datatype stridetype (which spans sy-1:ey+1)
      call MPI_SENDRECV( a(ex,sy-1), 1, stridetype, nbrright, 0,
     &                   a(sx-1,sy-1), 1, stridetype, nbrleft, 0,
     &                   comm2d, status, ierr )
      call MPI_SENDRECV( a(sx,sy-1), 1, stridetype, nbrleft, 1,
     &                   a(ex+1,sy-1), 1, stridetype, nbrright, 1,
     &                   comm2d, status, ierr )
      return
      end
Overlapping communication and computation • The expensive nature of communication argues for overlapping (pipelining!) it with computation • Increasing divide between CPU speed and message latency means this will only be more important in the future • Not always achievable – e.g. master-slave task decomposition • Nonblocking communication facilitates this kind of overlap • 2d Poisson example • Interior of arrays can be calculated without the ghost cells
Computation that can be performed without ghost cells • All blue points are contained on the node • Red points denote ghost cells • Blue points contained within the inner rectangle use only local data [Figure: 2d domain decomposition – the inner rectangle marks the points that need no ghost data]
Steps in the comms/comp overlapping version • Indicate where data from other processes is to be received • Post non-blocking sends to other processes • Perform the local computation • Check for the arrival of data from other processes • Complete the remaining computation The final algorithm is more complicated than the non-overlapping version, but all the necessary code has already been written (a skeleton is sketched below). This algorithmic structure is used universally to overlap comms & comps.
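A rough skeleton of that structure for the 2d Poisson sweep. This is a sketch only: the helper routines sweep_interior and sweep_edge, the request ordering and the tag values are assumptions, not the lecture's exact code:

subroutine exchng2_overlap( a, f, unew, sx, ex, sy, ey, comm2d, &
                            stridetype, nbrleft, nbrright, nbrtop, nbrbottom )
  use mpi
  integer sx, ex, sy, ey, comm2d, stridetype
  integer nbrleft, nbrright, nbrtop, nbrbottom
  double precision a(sx-1:ex+1, sy-1:ey+1), f(sx-1:ex+1, sy-1:ey+1)
  double precision unew(sx-1:ex+1, sy-1:ey+1)
  integer requests(8), status(MPI_STATUS_SIZE), idx, ierr, nx, k

  nx = ex - sx + 1
  ! 1. Post receives for the four ghost boundaries
  call MPI_Irecv( a(sx,sy-1), nx, MPI_DOUBLE_PRECISION, nbrbottom, 0, comm2d, requests(1), ierr )
  call MPI_Irecv( a(sx,ey+1), nx, MPI_DOUBLE_PRECISION, nbrtop,    1, comm2d, requests(2), ierr )
  call MPI_Irecv( a(sx-1,sy-1), 1, stridetype, nbrleft,  2, comm2d, requests(3), ierr )
  call MPI_Irecv( a(ex+1,sy-1), 1, stridetype, nbrright, 3, comm2d, requests(4), ierr )
  ! 2. Post the matching nonblocking sends
  call MPI_Isend( a(sx,ey),   nx, MPI_DOUBLE_PRECISION, nbrtop,    0, comm2d, requests(5), ierr )
  call MPI_Isend( a(sx,sy),   nx, MPI_DOUBLE_PRECISION, nbrbottom, 1, comm2d, requests(6), ierr )
  call MPI_Isend( a(ex,sy-1), 1, stridetype, nbrright, 2, comm2d, requests(7), ierr )
  call MPI_Isend( a(sx,sy-1), 1, stridetype, nbrleft,  3, comm2d, requests(8), ierr )
  ! 3. Overlap: update the interior points that need no ghost data
  call sweep_interior( a, f, unew, sx, ex, sy, ey )        ! hypothetical helper
  ! 4./5. As each message completes, finish the corresponding edge cells
  do k = 1, 8
     call MPI_Waitany( 8, requests, idx, status, ierr )
     if (idx .le. 4) call sweep_edge( idx, a, f, unew, sx, ex, sy, ey )  ! hypothetical helper
  end do
end subroutine exchng2_overlap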
Structure of the computation of the final pieces • Post 4 MPI_Irecv's and 4 MPI_Isend's • The handles from the 4 Irecv's determine which part of the remaining cells to compute • Use MPI_Waitany to test whether operations have completed • Ignore completion of the sends (but must still ensure they have completed)

do k=1,8
   call MPI_WAITANY( 8, requests, idx, status, ierr )
   ! Use the returned request index to determine which edge
   if (idx.eq.1) then
      do j=sy,ey        ! Left hand cells
         unew(si,j) = ...
      end do
   else if (idx.eq.2) then
      do j=sy,ey        ! Right hand cells
         unew(ei,j) = ...
      end do
   else if (idx.eq.3) then
      do i=sx,ex        ! Upper cells
         unew(i,ej) = ...
      end do
   else if (idx.eq.4) then
      do i=sx,ex        ! Lower cells
         unew(i,sj) = ...
      end do
   end if
end do
Note about overlapping comms and comps • Nonblocking operations were introduced first and foremost to allow the writing of safe programs • Many implementations cannot physically overlap without additional hardware • Communication processor or system communication thread • Thus nonblocking operations may allow the overlapping, but they are not required to • If code that overlaps comms/comps does not increase performance it may not be your fault
3d • Problem sizes scale cubically in three dimensions • The number of grid cells frequently exceeds 10^7 • The large amount of work makes 3d codes natural candidates for parallelization • Domain decomposition over a 3d virtual topology is expected to be best • MPI_Dims_create(nodes,ndims,dims) divides nodes into the most balanced topology it can • e.g. MPI_Dims_create(6,3,dims) gives dims=(3,2,1) • Values of dims may be set before the call as constraints, e.g. with dims=(0,3,0), MPI_Dims_create(6,3,dims) gives dims=(2,3,1) • If nodes is not divisible by the product of the non-zero dims entries, an error is returned
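A minimal sketch of building a 3d virtual topology this way; the program name and the non-periodic choice are illustrative assumptions:

program dims3d_example
  use mpi
  implicit none
  integer :: nprocs, myrank, dims(3), comm3d, ierr
  logical :: periods(3)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  ! Let MPI choose a balanced 3d factorization of nprocs
  dims = 0
  call MPI_Dims_create(nprocs, 3, dims, ierr)

  ! Build the 3d Cartesian communicator (non-periodic, allow reordering)
  periods = .false.
  call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., comm3d, ierr)

  if (myrank == 0) print *, 'Process grid:', dims

  call MPI_Comm_free(comm3d, ierr)
  call MPI_Finalize(ierr)
end program dims3d_example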