Today's lecture covers distributed memory computing in Astrophysics, focusing on user-defined datatypes, packing, and communicators. Topics include abstract topologies, Cartesian primitives, performance, scaling, building defined types, and working with MPI data types. Learn how to create new datatypes, pack data efficiently, and optimize message passing. Gain insights into derived datatypes, contiguous types, vector types, and indexed types. Examples and practical applications are explored to enhance your understanding of MPI data processing techniques.
Computational Methods in Astrophysics Dr Rob Thacker (AT319E) thacker@ap
Today’s Lecture Distributed Memory Computing II • User defined (derived) datatypes, packing, communicators • Abstract topologies, Cartesian primitives, performance and scaling
Build defined types from built-in • MPI comes with plenty of built-in datatypes
Making a trivial new datatype
• MPI_Type_contiguous(n, oldtype, newtype, ierror)
• Produces a new datatype that is n copies of the old
• Each entry is displaced by the extent of oldtype
• oldtype can be a derived datatype as well
• Must “commit” the type before sending: MPI_Type_commit(newtype, ierr)
• Free up with MPI_Type_free(newtype, ierr) – finite number of datatypes

Equivalent calls:

      MPI_Send(buffer,count,datatype,dest,tag,comm,ierr)

      MPI_Type_contiguous(count,datatype,newtype,ierr)
      MPI_Type_commit(newtype,ierr)
      MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
      MPI_Type_free(newtype,ierr)
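A minimal C sketch of the same equivalence (the function and buffer names are illustrative, not from the slides):

#include <mpi.h>

/* Send 100 doubles either directly or via a contiguous derived type;
   both transfer the same bytes. */
void send_block(double *buffer, int dest, MPI_Comm comm)
{
    MPI_Datatype block;                    /* new datatype: 100 doubles */
    MPI_Type_contiguous(100, MPI_DOUBLE, &block);
    MPI_Type_commit(&block);               /* must commit before use    */

    MPI_Send(buffer, 1, block, dest, 0, comm);  /* same as count=100, MPI_DOUBLE */

    MPI_Type_free(&block);                 /* release the handle        */
}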
Arbitrary datatypes
• To derive a datatype from another, specify a typemap of datatypes and BYTE displacements
• The type signature is the list of basic datatypes within the new datatype
• The type signature directly controls the interpretation of both sent and received data; the displacements indicate where to find the information
• It decodes the information contained in the buffers
Vector datatypes • MPI_Type_vector(count, blocklength, stride, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Each block is blocklength data items long • Separation between successive starting blocks is stride • Function MPI_Type_contiguous can be thought of as a special case of MPI_Type_vector: • MPI_Type_contiguous(count, oldtype, newtype, ierr) = MPI_Type_vector(count, 1, 1, oldtype, newtype,ierr)
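A small C sketch (not from the slides) of a common use of MPI_Type_vector: sending one column of a row-major 2-d array. The array shape and function name are illustrative assumptions.

#include <mpi.h>

#define NROWS 8
#define NCOLS 10

/* In a row-major NROWS x NCOLS array of doubles, one column is NROWS
   elements separated by a stride of NCOLS elements. */
void send_column(double a[NROWS][NCOLS], int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    /* count = NROWS blocks, blocklength = 1 element, stride = NCOLS elements */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&a[0][col], 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}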
Vector Example • Let oldtype = {(double, 0), (char, 8)} (named_double) • Call MPI_Type_vector(2, 3, 4, named_double, six_named_doubles,ierr) • New MPI data type, which has the following map: • newtype = {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40), (double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104)} • Two blocks of data are taken, each containing 3 structures of the type named_double • The 3 structures within the block are concatenated contiguously • The stride is set to extent of 4 objects of the same type = 64 bytes, since each named_double must have an extent of 16 bytes (word alignment)
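For concreteness, here is one way the named_double typemap above might be constructed in C before applying the MPI_Type_vector call from the slide. The struct layout and the explicit MPI_Type_create_resized call (to force the 16-byte extent the slide assumes) are illustrative assumptions.

#include <mpi.h>
#include <stddef.h>

/* Reconstruction of the typemap {(double,0),(char,8)}: a struct padded to a
   16-byte extent, then a vector of 2 blocks of 3 with stride 4 (= 64 bytes). */
typedef struct { double d; char c; } named_double;   /* extent 16 with padding */

void build_types(MPI_Datatype *nd_type, MPI_Datatype *six_nd)
{
    int          blocklen[2] = {1, 1};
    MPI_Aint     disp[2]     = {offsetof(named_double, d), offsetof(named_double, c)};
    MPI_Datatype types[2]    = {MPI_DOUBLE, MPI_CHAR};
    MPI_Datatype tmp;

    MPI_Type_create_struct(2, blocklen, disp, types, &tmp);
    /* make the 16-byte extent explicit so strides line up with the array layout */
    MPI_Type_create_resized(tmp, 0, sizeof(named_double), nd_type);
    MPI_Type_free(&tmp);
    MPI_Type_commit(nd_type);

    MPI_Type_vector(2, 3, 4, *nd_type, six_nd);      /* the call on the slide */
    MPI_Type_commit(six_nd);
}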
Indexed data types • MPI_Type_indexed(count,array_of_blocklengths, array_of_displacements, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Block n is array_of_blocklengths(n) data items long • List of displacements is defined relative to the first item • MPI_Type_vector can be thought of as a special case of MPI_Type_indexed • All array_of_blocklengths entries = vector blocklen • List of displacements are multiples of the stride
Example of using indexing
• Suppose you only need a subset of an array
• You could send many short messages
• You could pack a datatype
• Better solution: create a datatype that includes the displacements you need
• Example list of offsets: 9, 27, 39, 142
Example cont
• In this case we have 4 elements (=n_to_move)
• Array of displacements (offset())
• Each part of the list is a single element
• ‘Block’ lengths are all unity (blocksize())
• The list datatype will be denoted ltype
• The following code performs the implicit packing for you:

      MPI_Type_indexed(n_to_move,blocksize,offset,ltype,newtype,ierr)
      MPI_Type_commit(newtype,ierr)
      MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
      MPI_Type_free(newtype,ierr)
When to use the H versions
• Data may be stored in a dynamically allocated region
• There is then no single buffer to refer displacements from
• MPI_Address(location,address,ierr) will return the byte address of the element location within the allocated region
• Create the list of byte positions for Hindexed using MPI_Address
• MPI_BOTTOM can then be used as the starting address for the MPI_Send:
• MPI_Send(MPI_BOTTOM,1,mytype,dest,tag,comm,ierr)
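A hedged C sketch of this pattern, using the modern names MPI_Get_address and MPI_Type_create_hindexed (the slide's MPI_Address is the older name); the three scattered doubles and the function name are illustrative assumptions.

#include <mpi.h>

/* Gather three separately allocated doubles into one message using
   absolute byte addresses and MPI_BOTTOM. */
void send_scattered(double *p0, double *p1, double *p2, int dest, MPI_Comm comm)
{
    int          blocklen[3] = {1, 1, 1};
    MPI_Aint     disp[3];
    MPI_Datatype scattered;

    MPI_Get_address(p0, &disp[0]);
    MPI_Get_address(p1, &disp[1]);
    MPI_Get_address(p2, &disp[2]);

    MPI_Type_create_hindexed(3, blocklen, disp, MPI_DOUBLE, &scattered);
    MPI_Type_commit(&scattered);

    /* displacements are absolute, so the send starts from MPI_BOTTOM */
    MPI_Send(MPI_BOTTOM, 1, scattered, dest, 0, comm);

    MPI_Type_free(&scattered);
}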
Packing
• On some occasions packing may be desirable over specifying a datatype, e.g. packing across multiple arrays
• MPI_Pack allows a user to incrementally add data into a user-provided buffer
• MPI_Pack(inbuf,incount,datatype,outbuf,outcount,position,comm,ierr)
• Packed data is given the MPI_PACKED datatype
• MPI_Pack_size can be used to determine how large the required buffer is in bytes
  • Specify the number of elements, the MPI datatype and the communicator
  • The communicator is required since a heterogeneous environment may require more packing than expected
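A small C sketch, assuming the int-plus-100-chars data of the example that follows, showing how MPI_Pack_size can size the buffer instead of hard-coding it (function name and destination rank are illustrative):

#include <mpi.h>
#include <stdlib.h>

void pack_and_send(int i, char c[100], int dest, MPI_Comm comm)
{
    int sz_int, sz_char, position = 0;

    MPI_Pack_size(1,   MPI_INT,  comm, &sz_int);    /* upper bound in bytes */
    MPI_Pack_size(100, MPI_CHAR, comm, &sz_char);

    int   bufsize = sz_int + sz_char;
    char *buffer  = malloc(bufsize);

    MPI_Pack(&i, 1,   MPI_INT,  buffer, bufsize, &position, comm);
    MPI_Pack(c,  100, MPI_CHAR, buffer, bufsize, &position, comm);

    MPI_Send(buffer, position, MPI_PACKED, dest, 0, comm);
    free(buffer);
}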
Equivalent examples – both datatypes send the same message

Struct datatype:

int i;
char c[100];
MPI_Aint disp[2];
int blocklen[2] = {1, 100};
MPI_Datatype type[2] = {MPI_INT, MPI_CHAR};
MPI_Datatype Type;

/* create datatype */
MPI_Get_address(&i, &(disp[0]));
MPI_Get_address(c, &(disp[1]));
MPI_Type_create_struct(2, blocklen, disp, type, &Type);
MPI_Type_commit(&Type);

/* send */
MPI_Send(MPI_BOTTOM, 1, Type, 1, 0, MPI_COMM_WORLD);

Packed datatype:

int i;
char c[100];
char buffer[110];
int position = 0;

/* pack */
MPI_Pack(&i, 1, MPI_INT, buffer, 110, &position, MPI_COMM_WORLD);
MPI_Pack(c, 100, MPI_CHAR, buffer, 110, &position, MPI_COMM_WORLD);

/* send */
MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
Receiving & unpacking packed messages
• Note the MPI_PACKED datatype can be used to receive any message
• To unpack, make successive calls to MPI_Unpack(inbuf,insize,position,outbuf,outcount,datatype,comm,ierr)
• The first call starts at position 0
• The second call starts at the position returned by the first MPI_Unpack (the value returned in the position variable)
MPI_Unpack example

int i;
char c[100];
MPI_Status status;
char buffer[110];
int position = 0;

/* receive */
MPI_Recv(buffer, 110, MPI_PACKED, 1, 0, MPI_COMM_WORLD, &status);

/* unpack */
MPI_Unpack(buffer, 110, &position, &i, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Unpack(buffer, 110, &position, c, 100, MPI_CHAR, MPI_COMM_WORLD);
Note on implementation • A packing unit may contain metadata (recall issue of heterogeneity in communicator) • e.g. 64 processors of standard type, 1 additional graphics processor of opposite endian • Any header will contain information on encoding and the size of the unit for error checking • Implies you can’t concatenate two packed messages and send in a single operation • Similarly can’t separate an incoming message into two sections and unpack separately
Topologies, Cartesian Primitives and messaging performance
• Applicable to applications where data structures reflect a specific domain decomposition
• An optional attribute that can be given to an intracommunicator only – not to intercommunicators
• May possibly help runtime mapping of processes onto hardware
• We’ll examine one code example that brings together a number of features we have discussed and places them in context
Inadequacies of linear ranking
• Linear ranking implies a ring-like abstract topology for the processes
• Few applications have a communication pattern that logically reflects a ring
• Two or three dimensional grids/tori are very common
• The logical process arrangement defined by the communication pattern is called the “virtual topology”
• Do not confuse the virtual topology with the machine topology
• The virtual topology must be mapped to the hardware via an embedding – this may or may not be beneficial
Graphs & Virtual Topologies
• The communication pattern defines a graph
• Communication graphs are assumed to be symmetric: if a communicates with b, b can communicate with a
• Communicators allow messaging between any pair of nodes, so a virtual topology may necessarily omit some of the communication that actually occurs
Cartesian example: 3x4 2d virtual topology. The upper number is the rank, the lower value gives (column,row).
Specific example: Poisson equation on a 2d grid
• Grid points xi = i/(n+1), where n defines the number of interior points; same in the y axis
• On the unit square the grid spacing is h = 1/(n+1)
• The equation is approximated by a 5-point stencil coupling each interior point (i,j) to its neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1):
  ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4*u(i,j) ) / h**2 = f(i,j)
• Solve by guessing u initially, then iterating (Jacobi iteration):
  unew(i,j) = 0.25*( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - h**2*f(i,j) )
• Boundary values are fixed; only interior points are updated
2d Poisson continued
• Calculation of the updated u at iteration k+1 involves looping over all grid points
• How do we decompose the problem in parallel?
• Simplest choice: 1-d decomposition (slabs), e.g. ranks 0 and 1 each holding one slab
• Solid lines denote data stored on the node
• Dotted lines denote all data required for the calculation
• Nodes will always require data that they do not hold a copy of locally – ghost cells
Changes to the loop structure
• The loop is now over node start and end points
• Data arrays must include space for the j+1 and j-1 entries in the y direction
• Ghost data must be communicated
• MPI Cartesian topologies create the necessary facilities for this messaging to be done quickly and easily

Single processor:

      integer i,j,n
      double precision u(0:n+1,0:n+1)
      double precision unew(0:n+1,0:n+1)
      do j=1,n
         do i=1,n
            unew(i,j)= 0.25*( u(i-1,j) + u(i,j+1)+u(i,j-1)+u(i+1,j)
     +                  -h**2*f(i,j))
         end do
      end do

Parallel node:

      integer i,j,n
      double precision u(0:n+1,s-1:e+1)
      double precision unew(0:n+1,s-1:e+1)
      do j=s,e
         do i=1,n
            unew(i,j)= 0.25*( u(i-1,j) + u(i,j+1)+u(i,j-1)+u(i+1,j)
     +                  -h**2*f(i,j))
         end do
      end do
MPI_Cart_create
• Creates a communicator with a Cartesian topology
• The 2nd through 5th arguments define the new topology; the 1st argument is the initial communicator, the 6th argument returns the new communicator
• Example for a logical 2d grid of processors:

      integer dims(2)
      logical isperiodic(2),reorder
      dims(1)=4                  ! dimensions of axes
      dims(2)=3
      isperiodic(1)=.false.      ! are axis directions periodic?
      isperiodic(2)=.false.      !  (consider a grid on the Earth for example)
      reorder=.true.             ! rank in new communicator may be different
                                 !  from old (may improve performance)
      ndim=2
      call MPI_Cart_create(MPI_COMM_WORLD,
     +     ndim,dims,isperiodic,reorder,
     +     comm2d,ierr)
Important issue
• The topology MPI_Cart_create initializes is related to the communication required by the domain decomposition
• It is not the implicit communication structure of the algorithm
• For the 1d decomposition all x values are contained on the node – only y boundaries need be exchanged
• In the 2d Poisson example we can choose either a 1d or a 2d decomposition
Finding your coordinates & neighbours in the new communicator
• MPI_Cart_get(comm1d,1,dims,isperiodic,coords,ierr)
  • Returns dims and isperiodic as defined in the communicator
  • The calling process's coords are also returned
• You may also want to convert a rank to coordinates:
  • MPI_Cart_coords(comm1d,rank,maxdims,coords,ierr)
  • If you call it with your own rank, it returns your own coords
• When moving data we frequently wish to shift across one axis direction synchronously across all processors
• If you just want to send and receive ghost cell values to/from neighbours there is a simple mechanism for determining the neighbouring ranks: MPI_Cart_shift
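An illustrative C sketch of these query calls on a 2-d Cartesian communicator (the communicator comm2d and the function name are assumed):

#include <mpi.h>
#include <stdio.h>

void report_position(MPI_Comm comm2d)
{
    int dims[2], coords[2], periods[2], rank;

    MPI_Cart_get(comm2d, 2, dims, periods, coords);    /* topology + my coords */

    MPI_Comm_rank(comm2d, &rank);
    MPI_Cart_coords(comm2d, rank, 2, coords);          /* rank -> coordinates  */

    printf("rank %d sits at (%d,%d) of a %dx%d grid\n",
           rank, coords[0], coords[1], dims[0], dims[1]);
}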
MPI_Cart_shift
• MPI_Cart_shift(comm,direction,shift,src,dest,ierr)
• Called by every process in the Cartesian communicator
• Input the direction to shift (note: axes are numbered from 0 to n-1)
• Input the displacement: >0 = upward shift, <0 = downward shift
• Returns the rank of the neighbour in the upper direction (dest)
• Returns the rank of the neighbour in the lower direction (src)
• If there is no neighbour, MPI_PROC_NULL is returned
  • A message to this rank is ignored, and completes immediately
Boundary exchanges: data must move upward and downward.
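A minimal C sketch of neighbour lookup with MPI_Cart_shift on a 2-d communicator; which axis corresponds to left/right versus bottom/top is just a convention assumed here, as are the names:

#include <mpi.h>

/* Ranks on a non-periodic edge come back as MPI_PROC_NULL. */
void find_neighbours(MPI_Comm comm2d,
                     int *nbrleft, int *nbrright, int *nbrbottom, int *nbrtop)
{
    /* axis 0, displacement +1: src = lower neighbour, dest = upper neighbour */
    MPI_Cart_shift(comm2d, 0, 1, nbrleft,   nbrright);
    MPI_Cart_shift(comm2d, 1, 1, nbrbottom, nbrtop);
}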
Remaining ingredients
• Still need to compute the start and end points for each node
• If n is divisible by nprocs (best for load balance):

      s = 1 + myrank * (n/nprocs)
      e = s + (n/nprocs) - 1

• When nprocs does not divide n, a naïve solution is:

      s = 1 + myrank * (n/nprocs)
      if (myrank .eq. nprocs-1) then
         e = n
      else
         e = s + floor(n/nprocs) - 1
      end if

• floor(x) returns the largest integer value no greater than x
• Better to use MPE_Decomp1d to even things out, but the load balance still won't be perfect (see the ANL/MSU MPI website); a sketch of such a decomposition follows
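A sketch, in C, of an even decomposition in the spirit of MPE_Decomp1d (this is not the actual MPE routine, just an illustration of the idea): the first (n mod nprocs) ranks get one extra point, so block sizes differ by at most one.

void decomp1d(int n, int nprocs, int myrank, int *s, int *e)
{
    int base = n / nprocs;       /* minimum points per rank        */
    int rem  = n % nprocs;       /* leftover points to distribute  */

    if (myrank < rem) {
        *s = 1 + myrank * (base + 1);
        *e = *s + base;          /* base+1 points                  */
    } else {
        *s = 1 + myrank * base + rem;
        *e = *s + base - 1;      /* base points                    */
    }
}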
http://www-unix.mcs.anl.gov/mpi/usingmpi/examples/intermediate/oned_f.htm

Prototype code:

      call MPI_CART_CREATE( MPI_COMM_WORLD, 1, numprocs, .false.,
     $     .true., comm1d, ierr )
c
c Get my position in this communicator, and my neighbors
c
      call MPI_COMM_RANK( comm1d, myid, ierr )
      call MPI_Cart_shift( comm1d, 0, 1, nbrbottom, nbrtop, ierr )
c Compute the actual decomposition
      call MPE_DECOMP1D( ny, numprocs, myid, s, e )
c Initialize the right-hand-side (f) and the initial solution guess (a)
      call onedinit( a, b, f, nx, s, e )
c
c Actually do the computation.  Note the use of a collective operation to
c check for convergence, and a do-loop to bound the number of iterations.
c
      do 10 it=1, maxit
         call exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( a, f, nx, s, e, b )       ! Jacobi sweep
         call exchng1( b, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( b, f, nx, s, e, a )
         dwork = diff( a, b, nx, s, e )          ! Convergence test
         call MPI_Allreduce( dwork, diffnorm, 1, MPI_DOUBLE_PRECISION,
     $        MPI_SUM, comm1d, ierr )
         if (diffnorm .lt. 1.0e-5) goto 20
c        if (myid .eq. 0) print *, 2*it, ' Difference is ', diffnorm
 10   continue
      if (myid .eq. 0) print *, 'Failed to converge'
 20   continue
      if (myid .eq. 0) then
         print *, 'Converged after ', 2*it, ' Iterations'
      endif

The routine alternately computes into a and then b – hence the 2*it at the end.
Which routine determines performance (other than comp. kernel)? • exchng1 controls the messaging and how this is coded will strongly affect overall execution time • Recall from last lecture, 5(!) different ways to do communication: • Blocking sends and receives • very bad idea, potentially unsafe (although in this case it is OK) • Ordered send • match send and receive perfectly • Sendrecv • systems deals with buffering • Buffered Bsend • Nonblocking Isend
Option #1: Simple sends and receives

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
C
      call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &               nbrtop, 0, comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &               nbrbottom, 0, comm1d, status, ierr )
      call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &               nbrbottom, 1, comm1d, ierr )
      call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &               nbrtop, 1, comm1d, status, ierr )
      return
      end
Option #2: Matched sends and receives

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      use mpi
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop, rank, coord
      integer status(MPI_STATUS_SIZE), ierr
!
      call MPI_COMM_RANK( comm1d, rank, ierr )
      call MPI_CART_COORDS( comm1d, rank, 1, coord, ierr )
      if (mod( coord, 2 ) .eq. 0) then
         call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 0, comm1d, ierr )
         call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 0, comm1d, status, ierr )
         call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 1, comm1d, ierr )
         call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 1, comm1d, status, ierr )
      else
         call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 0, comm1d, status, ierr )
         call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 0, comm1d, ierr )
         call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 1, comm1d, status, ierr )
         call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 1, comm1d, ierr )
      endif
      return
      end
Option #3: Sendrecv

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
c
      call MPI_SENDRECV(
     &     a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop, 0,
     &     a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom, 0,
     &     comm1d, status, ierr )
      call MPI_SENDRECV(
     &     a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom, 1,
     &     a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop, 1,
     &     comm1d, status, ierr )
      return
      end
Option #4: Buffered Bsend

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      use mpi
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr

      call MPI_BSEND( a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop,
     &                0, comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom,
     &               0, comm1d, status, ierr )
      call MPI_BSEND( a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom,
     &                1, comm1d, ierr )
      call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop,
     &               1, comm1d, status, ierr )
      return
      end

You must also provide buffer space and attach it via MPI_BUFFER_ATTACH (see the Using MPI book).
Option #5: Nonblocking Isend

      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status_array(MPI_STATUS_SIZE,4), ierr, req(4)
C
      call MPI_IRECV (
     &     a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom, 0,
     &     comm1d, req(1), ierr )
      call MPI_IRECV (
     &     a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop, 1,
     &     comm1d, req(2), ierr )
      call MPI_ISEND (
     &     a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop, 0,
     &     comm1d, req(3), ierr )
      call MPI_ISEND (
     &     a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom, 1,
     &     comm1d, req(4), ierr )
C
      call MPI_WAITALL ( 4, req, status_array, ierr )
      return
      end
Comparison of code speed for the different communication methods
(Plot: execution time vs. Ncpu; problem size chosen to be large enough to decompose.)
Blocking send is by far the worst example – what happened?
What happens in the blocking sending version?
(Figure: processes P0–P5, each posting a SEND followed by a RECV, with time running downward – a completely serial communication.)
• The blocking communication can create a serial cascade
• The “top” process sends a message to a null process, which completes immediately
• It can then receive from another process
• The ability to receive then cascades through the process ranks
Observed scaling
(Plot: speed-up vs. Ncpu.)
• Even non-blocking sends do not scale better than 20% efficiency on 64 nodes – why?
Scalability Analysis
• Examine the execution time to see if time in communication is a bottleneck
• Time to send n bytes: Timecomm = s + r*n
  • s = start-up overhead (latency)
  • r = time to send one byte (~1/bandwidth)
• Time in the exchng1 routine ~ 2(s + r*n), with n = 8*nx for double precision data
• This communication time per process is fixed regardless of the number of processors!
• The communication doesn't scale
2-d topology communication time
• Excluding processors on the domain edge, each process exchanges its x boundaries and then its y boundaries
• For a square grid of n×n points on p total processors, each boundary holds n/sqrt(p) points, so
  Timecomm ~ 4( s + r*8*n/sqrt(p) )
• Communication time decreases as a function of 1/sqrt(p) – moderately scalable
• Note that for a small number of processors the 1d decomposition is actually better (fewer messages, so less start-up overhead)
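To make the 1-d versus 2-d comparison concrete, here is a small C sketch that evaluates both cost models with made-up values for s, r and n (illustrative assumptions, not measurements):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double s = 1.0e-5, r = 1.0e-9;   /* assumed: 10 us start-up, 1 ns/byte */
    int    n = 1024;                 /* assumed: grid points per side      */

    for (int p = 4; p <= 256; p *= 4) {
        double t1d = 2.0 * (s + r * 8.0 * n);                   /* independent of p     */
        double t2d = 4.0 * (s + r * 8.0 * n / sqrt((double)p)); /* shrinks as 1/sqrt(p) */
        printf("p=%4d  T_1d=%.2e s  T_2d=%.2e s\n", p, t1d, t2d);
    }
    return 0;
}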
Changes required for the 2-d domain decomposition • Need to update process topology • MPI_Cart_create(MPI_COMM_WORLD,2,.. • Need two calls to MPI_Cart_shift • One for each axis direction • Body of sweep routine must change loop limits • Now have limits in x direction • Arrays must also be updated for smaller sizes • Lastly, exchng1 must be updated to exchng2 • Boundary exchanges in both directions
2d: User derived datatype for boundary exchanges • x-boundary consists of non-contiguous memory locations • Unavoidable due to 2d array being mapped into memory • Use MPI_Type_vector discussed previously
Solution
• Number of blocks = (y end) – (y start) + 1 = ey-sy+3, counting over the full local y-extent from sy-1 to ey+1 (ghost rows included)
  • e.g. if end=4 and start=2 there are 4-2+1 = 3 entries
• 1 entry per block
• Stride = (x end) – (x start) + 1 = ex-sx+3, the full local x-extent
  • Remember the stride is the displacement between successive entries
• Create and commit:

      call MPI_Type_vector( ey-sy+3, 1, ex-sx+3,
     &                      MPI_DOUBLE_PRECISION, stridetype, ierr )
      call MPI_Type_commit( stridetype, ierr )
exchng2d

      subroutine exchng2( a, sx, ex, sy, ey,
     &                    comm2d, stridetype,
     &                    nbrleft, nbrright, nbrtop, nbrbottom )
      include 'mpif.h'
      integer sx, ex, sy, ey, stridetype
      double precision a(sx-1:ex+1, sy-1:ey+1)
      integer nbrleft, nbrright, nbrtop, nbrbottom, comm2d
      integer status(MPI_STATUS_SIZE), ierr, nx
c
      nx = ex - sx + 1
c These are just like the 1-d versions, except for less data
      call MPI_SENDRECV( a(sx,ey), nx, MPI_DOUBLE_PRECISION,
     &                   nbrtop, 0,
     &                   a(sx,sy-1), nx, MPI_DOUBLE_PRECISION,
     &                   nbrbottom, 0, comm2d, status, ierr )
      call MPI_SENDRECV( a(sx,sy), nx, MPI_DOUBLE_PRECISION,
     &                   nbrbottom, 1,
     &                   a(sx,ey+1), nx, MPI_DOUBLE_PRECISION,
     &                   nbrtop, 1, comm2d, status, ierr )
c
c This uses the vector datatype stridetype
      call MPI_SENDRECV( a(ex,sy), 1, stridetype, nbrright, 0,
     &                   a(sx-1,sy), 1, stridetype, nbrleft, 0,
     &                   comm2d, status, ierr )
      call MPI_SENDRECV( a(sx,sy), 1, stridetype, nbrleft, 1,
     &                   a(ex+1,sy), 1, stridetype, nbrright, 1,
     &                   comm2d, status, ierr )
      return
      end
Overlapping communication and computation • The expensive nature of communication argues for overlapping (pipelining!) it with computation • Increasing divide between CPU speed and message latency means this will only be more important in the future • Not always achievable – e.g. master-slave task decomposition • Nonblocking communication facilitates this kind of overlap • 2d Poisson example • Interior of arrays can be calculated without the ghost cells
Computation that can be performed without ghost cells
(Figure: 2d domain decomposition.)
• All blue points are contained on the node
• Red points denote ghost cells
• Blue points contained within the rectangle use only local data
Steps in comms/comp overlapping version • Indicate where data from other processes is to be received • Post non-blocking sends to other processes • Perform local computation • Check for arrival of data from other processes • Complete remaining computation Final algorithm is more complicated than the non-overlapping version, but all the necessary code has already been written. This algorithmic structure is used universally to overlap comms & comps.
Structure of computation of final pieces
• Post 4 MPI_Irecv's and 4 MPI_Isend's
• The handles from the 4 Irecv's determine which part of the remaining cells to compute
• Use MPI_WAITANY to test whether operations have completed
• Ignore completion of the sends (but must still ensure they have completed)

      do 100 k=1,8
         call MPI_WAITANY( 8, requests, idx, status, ierr )
!        Use the index returned by MPI_WAITANY to determine which edge
         if (idx .eq. 1) then
            do j=sy,ey              ! Left hand cells
               unew(si,j) = ...
            end do
         else if (idx .eq. 2) then
            do j=sy,ey              ! Right hand cells
               unew(ei,j) = ...
            end do
         else if (idx .eq. 3) then
            do i=sx,ex              ! Upper cells
               unew(i,ej) = ...
            end do
         else if (idx .eq. 4) then
            do i=sx,ex              ! Lower cells
               unew(i,sj) = ...
            end do
         end if
 100  continue
Note about overlapping comms and comps • Nonblocking operations were introduced first and foremost to allow the writing of safe programs • Many implementations cannot physically overlap without additional hardware • Communication processor or system communication thread • Thus nonblocking operations may allow the overlapping, but they are not required to • If code that overlaps comms/comps does not increase performance it may not be your fault
3d
• Problem sizes scale cubically in three dimensions: the number of grid cells frequently exceeds 10^7
• The large amount of work makes 3d codes natural candidates for parallelization
• Domain decomposition is expected to work best with a 3d virtual topology
• MPI_Dims_create(nodes,ndims,dims) divides nodes into the most load-balanced topology it can
  • e.g. MPI_Dims_create(6,3,dims) gives dims=(3,2,1)
• Values of dims may be set before the call, e.g. with dims=(0,3,0), MPI_Dims_create(6,3,dims) gives dims=(2,3,1)
• If nodes is not divisible by the product of the pre-set dims entries, an error is returned
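An illustrative C sketch combining MPI_Dims_create with MPI_Cart_create for a 3-d topology (the function name and the non-periodic, reorder-enabled choices are assumptions):

#include <mpi.h>

MPI_Comm make_comm3d(MPI_Comm comm)
{
    int nodes, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
    MPI_Comm comm3d;

    MPI_Comm_size(comm, &nodes);
    MPI_Dims_create(nodes, 3, dims);         /* e.g. 6 -> (3,2,1)       */
    MPI_Cart_create(comm, 3, dims, periods,  /* non-periodic boundaries */
                    1, &comm3d);             /* reorder = 1             */
    return comm3d;
}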