Communication in Data Parallel Languages
Bryan Carpenter
NPAC at Syracuse University, Syracuse, NY 13244
dbc@npac.syr.edu
Goals of this lecture • Discuss patterns of communication needed to implement various constructs in High Performance Fortran. • Introduce some libraries that have been developed to support these communication patterns.
Contents of Lecture • Patterns of communication • Regular communications • Irregular array accesses. • “Unscheduled” access to remote data. • Libraries for distributed array communication • CHAOS/PARTI • Adlib
I. Patterns of communication • The last lecture gave translations of simple HPF fragments that involved no inter-processor communication. • Realistic HPF programs translate to SPMD programs with communications for accessing array elements on other processors. • Will take a pattern-oriented look at the required communications.
Classifying communication patterns • Will discuss 5 situations: • Array assignments • Stencil problems • Reductions and transformational intrinsics • General subscripting in array-parallel code • Accessing remote data in task parallel code • Don’t claim these are exhaustive, but cover many of the cases arising in useful parallel programs.
1. Array assignments • Variants of the assignment A = B hide many communication patterns. • A and B may be any conforming array sections. • Will see that the required communication patterns encompass most of those occurring in the collective communications of MPI.
Array assignment with neighbor communication

!HPF$ PROCESSORS P(4)
      REAL A(50), B(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK) ONTO P

      A(1:49) = B(2:50)

• In emitted SPMD program, communications might be implemented with MPI_SEND, MPI_RECV, or MPI_SENDRECV.
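For illustration, a minimal sketch of how the generated node code might perform this shift, assuming a simplified uniform block size; the names SHIFT_EDGE, B_LOC, B_GHOST, and BLK are invented here and are not from the lecture.

      SUBROUTINE SHIFT_EDGE(B_LOC, B_GHOST, BLK)
        INCLUDE 'mpif.h'
        INTEGER :: BLK, RANK, NPROC, LEFT, RIGHT, IERR
        INTEGER :: STATUS(MPI_STATUS_SIZE)
        REAL :: B_LOC(BLK)   ! this process's block of B
        REAL :: B_GHOST      ! receives the right neighbour's first element

        CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
        CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
        LEFT  = RANK - 1
        RIGHT = RANK + 1
        IF (LEFT .LT. 0)      LEFT  = MPI_PROC_NULL
        IF (RIGHT .GE. NPROC) RIGHT = MPI_PROC_NULL

        ! Each process passes its first B element to the left neighbour and
        ! receives the right neighbour's first element; afterwards the last
        ! locally owned element of A can be computed from B_GHOST.
        CALL MPI_SENDRECV(B_LOC(1), 1, MPI_REAL, LEFT,  0, &
                          B_GHOST,  1, MPI_REAL, RIGHT, 0, &
                          MPI_COMM_WORLD, STATUS, IERR)
      END SUBROUTINE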
Array assignment with gather communication

!HPF$ PROCESSORS P(4)
      REAL A(50, 50), B(50)
!HPF$ DISTRIBUTE A(*, BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK) ONTO P

      A(:, 1) = B

• In emitted SPMD program, communications might be implemented with MPI_GATHER. • If assignment direction reversed, might use MPI_SCATTER.
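A minimal sketch of the gather, under the simplifying assumption of equal block sizes (the uneven last block of a real BLOCK distribution would need MPI_GATHERV); GATHER_COLUMN, B_LOC, and COL1 are invented names.

      SUBROUTINE GATHER_COLUMN(B_LOC, BLK, COL1, ROOT)
        INCLUDE 'mpif.h'
        INTEGER :: BLK, ROOT, IERR
        REAL :: B_LOC(BLK)   ! this process's block of B
        REAL :: COL1(*)      ! significant on ROOT only: receives all of B, i.e. A(:, 1)

        ! Every process contributes its block; the owner of column 1 of A
        ! (rank ROOT) receives the blocks in process order.
        CALL MPI_GATHER(B_LOC, BLK, MPI_REAL, COL1, BLK, MPI_REAL, &
                        ROOT, MPI_COMM_WORLD, IERR)
      END SUBROUTINE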
Array assignment gathering to all

!HPF$ PROCESSORS P(4)
      REAL A(50, 50), B(50), C(50)
!HPF$ DISTRIBUTE A(*, BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK) ONTO P
!HPF$ ALIGN C(I) WITH A(I, *)

      C = B

• C has replicated, collapsed alignment. • Communication in the emitted SPMD program might be implemented with MPI_ALLGATHER.
Array assignments with all-to-all communication

      REAL A(50, 50), B(50, 50)
!HPF$ DISTRIBUTE A(*, BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK, *) ONTO P

      A = B

or

      REAL A(50), B(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ DISTRIBUTE B(CYCLIC) ONTO P

      A = B

• Communication in the emitted SPMD program might be implemented with MPI_ALLTOALL.
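In practice the amounts exchanged between pairs of processes differ, so a plausible translation packs local elements by destination and uses the vector variant MPI_ALLTOALLV. A hedged skeleton only; all names are invented, and the counts and displacements are assumed to be computed by the compiler or runtime.

      SUBROUTINE REMAP_ALLTOALL(SENDBUF, SENDCNT, SDISPL, &
                                RECVBUF, RECVCNT, RDISPL, NPROC)
        INCLUDE 'mpif.h'
        INTEGER :: NPROC, IERR
        INTEGER :: SENDCNT(NPROC), SDISPL(NPROC)
        INTEGER :: RECVCNT(NPROC), RDISPL(NPROC)
        REAL :: SENDBUF(*), RECVBUF(*)

        ! SENDBUF holds this process's elements of B already sorted by
        ! destination process; one collective call delivers every element
        ! to its new owner, and the receiver then unpacks RECVBUF into A.
        CALL MPI_ALLTOALLV(SENDBUF, SENDCNT, SDISPL, MPI_REAL, &
                           RECVBUF, RECVCNT, RDISPL, MPI_REAL, &
                           MPI_COMM_WORLD, IERR)
      END SUBROUTINE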
2. Stencil problems • Updates where element updated in terms of fixed “footprint” of neighboring elements. • Arise in solution of PDEs, cellular automata, image processing…
Famous example: Jacobi relaxation for Laplace

      FORALL (I = 2:N-1, J = 2:N-1) &
        U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1) + &
                          U(I-1, J) + U(I+1, J))

• Can be recast in terms of array assignment:

      U(2:N-1, 2:N-1) = &
        0.25 * (U(2:N-1, 1:N-2) + U(2:N-1, 3:N) &
                + U(1:N-2, 2:N-1) + U(3:N, 2:N-1))
Array assignment version of Laplace solver • Introduce temporaries T1, …, T4 aligned to section U(2:N-1, 2:N-1). Then:

      T1 = U(2:N-1, 1:N-2)
      T2 = U(2:N-1, 3:N)
      T3 = U(1:N-2, 2:N-1)
      T4 = U(3:N, 2:N-1)

      U(2:N-1, 2:N-1) = 0.25 * (T1 + T2 + T3 + T4)

• Assignments to Ts need shift communications, as described earlier. Final assignment is pure computation—no communication.
Problems with array assignment implementation • Fine in terms of volume of inter-processor communication, but • problematic because it involves multiple memory-memory copies of whole arrays. • Original loop had good cache-locality: • spatial locality • temporal locality • Splitting into multiple array copies—multiple loops—causes memory cache to be loaded and flushed several times.
An aside: array syntax and cache. • “Array syntax” style of parallel programming evolved to support SIMD and vector processors. • Direct transcription may lead to poor use of cache on modern microprocessors. • On modern computers, memory access costs typically dominate over costs of arithmetic. • Compiler may have to work hard to fuse sequences of array assignments into loops with good locality… un-vectorizing!
Better approach: Translation using ghost regions

      REAL U(0:BLK_SIZE1+1, 0:BLK_SIZE2+1)
      REAL T(BLK_SIZE1, BLK_SIZE2)

      … Update ghost area of U with values from neighbors

      DO L1 = 1, BLK_COUNT1
        DO L2 = 1, BLK_COUNT2
          T(L1, L2) = 0.25 * (U(L1, L2-1) + U(L1, L2+1) + U(L1-1, L2) + U(L1+1, L2))
        ENDDO
      ENDDO

      … Copy T to interior part of U (local copying)
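A hedged sketch of what the elided ghost-area update might look like along the second (column) dimension, assuming a one-deep ghost region; the first dimension would need a strided datatype or packing. LEFT and RIGHT are the neighbouring ranks along this dimension (MPI_PROC_NULL at the boundary), and all names here are invented.

      SUBROUTINE EXCHANGE_COLS(U, BLK1, BLK2, LEFT, RIGHT)
        INCLUDE 'mpif.h'
        INTEGER :: BLK1, BLK2, LEFT, RIGHT, IERR
        INTEGER :: STATUS(MPI_STATUS_SIZE)
        REAL :: U(0:BLK1+1, 0:BLK2+1)

        ! Send my first interior column left, receive my right ghost column.
        CALL MPI_SENDRECV(U(1, 1),      BLK1, MPI_REAL, LEFT,  0, &
                          U(1, BLK2+1), BLK1, MPI_REAL, RIGHT, 0, &
                          MPI_COMM_WORLD, STATUS, IERR)

        ! Send my last interior column right, receive my left ghost column.
        CALL MPI_SENDRECV(U(1, BLK2),   BLK1, MPI_REAL, RIGHT, 0, &
                          U(1, 0),      BLK1, MPI_REAL, LEFT,  0, &
                          MPI_COMM_WORLD, STATUS, IERR)
      END SUBROUTINE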
3. Reductions and other transformational intrinsics • Fortran intrinsic procedures operating on whole arrays, often returning an array result. • Some just “reshuffle” elements of arrays: • CSHIFT, EOSHIFT, TRANSPOSE, SPREAD, etc. Implementation similar to array assignments. • Another important class reduces arrays to scalar or lower-rank arrays: • SUM, PRODUCT, MAXVAL, MINVAL, etc. Communication pattern quite different. Similar to MPI_REDUCE.
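As an illustration, a minimal sketch (invented names, not from the lecture) of how SUM over a block-distributed array might be translated: reduce the local block, then combine the partial sums so every process holds the scalar result.

      REAL FUNCTION DIST_SUM(A_LOC, NLOC)
        INCLUDE 'mpif.h'
        INTEGER :: NLOC, I, IERR
        REAL :: A_LOC(NLOC), S, G

        ! Reduce the locally held block first ...
        S = 0.0
        DO I = 1, NLOC
          S = S + A_LOC(I)
        ENDDO

        ! ... then combine the partial sums across all processes.
        CALL MPI_ALLREDUCE(S, G, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, IERR)
        DIST_SUM = G
      END FUNCTION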
Parallel prefix • Arithmetic operation generalizing reduction. Different optimal communication pattern. • Example: FORALL (I = 1 : N) RES (I) = SUM(A (1 : I)) The I-th element of the result contains the sum of the elements of A up to I. • HPF library function SUM_PREFIX. • Communication pattern like MPI_SCAN.
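A sketch of how a block-distributed prefix sum might be organized (illustrative names only): each process forms local prefix sums, an MPI_SCAN of the block totals supplies an offset, and the offset is added back in. MPI_SCAN is inclusive, so each process subtracts its own total.

      SUBROUTINE DIST_PREFIX_SUM(A_LOC, RES_LOC, NLOC)
        INCLUDE 'mpif.h'
        INTEGER :: NLOC, I, IERR
        REAL :: A_LOC(NLOC), RES_LOC(NLOC), TOTAL, OFFSET

        ! Local (serial) prefix sums over this process's block.
        TOTAL = 0.0
        DO I = 1, NLOC
          TOTAL = TOTAL + A_LOC(I)
          RES_LOC(I) = TOTAL
        ENDDO

        ! Inclusive scan of the block totals; subtracting my own total
        ! leaves the sum of all blocks strictly to my left.
        CALL MPI_SCAN(TOTAL, OFFSET, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, IERR)
        OFFSET = OFFSET - TOTAL

        DO I = 1, NLOC
          RES_LOC(I) = RES_LOC(I) + OFFSET
        ENDDO
      END SUBROUTINE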
4. General subscripting in array parallel statements • Often programs execute code like FORALL (I = 1:50) RES (I) = A (IND (I)) where RES, A, IND are all distributed arrays, and the subscript is not a simple linear function of the loop index variables. • Cannot be reduced to simple array assignment of type 1. • Cannot be expressed in a single MPI collective operation.
Single phase of communication not enough FORALL (I = 1:50) RES (I) = A (IND (I)) • Assume IND aligned with RES (if not, do an array assignment to make it so). • Owner of RES(I) also owns IND(I), so knows where the source element lives. • But the owner of the source element, A(IND(I)), generally does not know it must be moved. • Some request communication is needed.
Detailed example

!HPF$ PROCESSORS P(4)
      REAL A(50)
!HPF$ DISTRIBUTE A(CYCLIC) ONTO P
      REAL RES(50)
      INTEGER IND(50)
!HPF$ DISTRIBUTE RES(BLOCK) ONTO P
!HPF$ DISTRIBUTE IND(BLOCK) ONTO P

      IND = (/ 5, 41, 7, . . . /)

      FORALL (I = 1:50) RES (I) = A (IND (I))
Problem • First processor can deal with assignment of values A(5), A(41) to RES(1), RES(2) locally. • Element A(7), assigned to RES(3) on first processor, lives on third processor. • First processor knows A(7) is required; third processor, owner, does not. • Some two-way communication needed.
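A hedged sketch of the two-phase pattern for this example, with invented names throughout: requested global indices, pre-sorted by owning process, are sent to their owners with one MPI_ALLTOALLV; the owners look the values up in their local blocks of the CYCLIC-distributed A and return them with a second MPI_ALLTOALLV. The incoming request counts ANSCNT are assumed to have been obtained beforehand, for instance by an MPI_ALLTOALL of REQCNT.

      SUBROUTINE GATHER_CYCLIC(A_LOC, NA_LOC, REQ, NREQ, REQCNT, ANSCNT, VAL, NPROC)
        INCLUDE 'mpif.h'
        INTEGER :: NA_LOC, NREQ, NPROC, IERR, J, G, NIN
        REAL :: A_LOC(NA_LOC), VAL(NREQ)
        INTEGER :: REQ(NREQ), REQCNT(NPROC), ANSCNT(NPROC)
        INTEGER :: REQDSP(NPROC), ANSDSP(NPROC)
        INTEGER, ALLOCATABLE :: INCOMING(:)
        REAL,    ALLOCATABLE :: REPLY(:)

        ! Displacements from the counts.
        REQDSP(1) = 0
        ANSDSP(1) = 0
        DO J = 2, NPROC
          REQDSP(J) = REQDSP(J-1) + REQCNT(J-1)
          ANSDSP(J) = ANSDSP(J-1) + ANSCNT(J-1)
        ENDDO
        NIN = ANSDSP(NPROC) + ANSCNT(NPROC)
        ALLOCATE(INCOMING(NIN), REPLY(NIN))

        ! Phase 1: send each requested global index to its owner.
        CALL MPI_ALLTOALLV(REQ, REQCNT, REQDSP, MPI_INTEGER, &
                           INCOMING, ANSCNT, ANSDSP, MPI_INTEGER, &
                           MPI_COMM_WORLD, IERR)

        ! Owners translate global index G to a local subscript of their
        ! CYCLIC block and look up the value.
        DO J = 1, NIN
          G = INCOMING(J)
          REPLY(J) = A_LOC((G - 1) / NPROC + 1)
        ENDDO

        ! Phase 2: return the values along the reverse paths.
        CALL MPI_ALLTOALLV(REPLY, ANSCNT, ANSDSP, MPI_REAL, &
                           VAL, REQCNT, REQDSP, MPI_REAL, &
                           MPI_COMM_WORLD, IERR)
        DEALLOCATE(INCOMING, REPLY)
      END SUBROUTINE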
Generalizations of subscripting • Non-linear subscript may be on LHS:

      FORALL (I = 1:50) RES (IND (I)) = A (I)

• Subscripted array multidimensional:

      FORALL (I = 1:50) &
        RES (I) = A (IND1 (I), IND2(I))

• Subscripts linear but dependent:

      FORALL (I = 1:50) RES (I) = A (I, I)

• Loop multidimensional:

      FORALL (I = 1:50, J = 1:50) &
        RES (I, J) = A (IND (I, J))
General case. Source rank S, dest rank R. • Examples special cases of general gather:

      FORALL (I1 = 1:N1, …, IR = 1:NR) &
        RES(I1, …, IR) = &
          SRC(IND1(I1, …, IR), …, INDS(I1, …, IR))

and general scatter:

      FORALL (I1 = 1:N1, …, IS = 1:NS) &
        RES(IND1(I1, …, IS), …, INDR(I1, …, IS)) = &
          SRC(I1, …, IS)
5. Accessing remote data from task-parallel code segments • INDEPENDENT DO loop of HPF:

!HPF$ INDEPENDENT
      DO I = 1, 10
        . . .
      ENDDO

• Variable-uses must have no loop-carried data dependence—can execute in parallel. But • nothing forces all variables used in given iteration to have same home—communication needed—and • arbitrary control flow within iteration. Cannot predict before execution what accesses actually happen.
Similar problem calling PURE procedures from FORALL • This is allowed:

      PURE REAL FUNCTION FOO(X)
        REAL, INTENT(IN) :: X
        . . .
      END

      FORALL (I = 1 : N) RES (I) = FOO(1.0 * I)

• PURE procedures restricted—no side-effects. But • nothing to prevent them reading global data—random element of distributed array in common block, say. • Accesses not predictable from parallel points of call.
One-sided communication • Beyond point-to-point communication, apparently need ability to one-sidedly access memory of a remote processor. • Generally needed where parallel tasks are not sharing a single, logical, “loosely synchronous” thread of control. • MPI 2 standard added one-sided communication functionality to MPI. Still not widely implemented.
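For orientation, a minimal sketch of MPI-2 one-sided access, assuming each process exposes its local block of a distributed array in a window. Fence synchronization is just one possible choice, window creation and the fences are collective (so every process must call this routine), and all names (GET_REMOTE, A_LOC, OWNER, LOCIDX) are invented.

      SUBROUTINE GET_REMOTE(A_LOC, NLOC, OWNER, LOCIDX, VALUE)
        INCLUDE 'mpif.h'
        INTEGER :: NLOC, OWNER, LOCIDX, IERR, WIN, SIZEOFREAL
        INTEGER(KIND=MPI_ADDRESS_KIND) :: WINSIZE, DISP
        REAL :: A_LOC(NLOC), VALUE

        ! Expose the local block in a window.
        CALL MPI_TYPE_SIZE(MPI_REAL, SIZEOFREAL, IERR)
        WINSIZE = NLOC * SIZEOFREAL
        CALL MPI_WIN_CREATE(A_LOC, WINSIZE, SIZEOFREAL, MPI_INFO_NULL, &
                            MPI_COMM_WORLD, WIN, IERR)

        ! Read element LOCIDX of OWNER's block without OWNER taking part
        ! explicitly; the fences delimit the access epoch.
        CALL MPI_WIN_FENCE(0, WIN, IERR)
        DISP = LOCIDX - 1
        CALL MPI_GET(VALUE, 1, MPI_REAL, OWNER, DISP, 1, MPI_REAL, WIN, IERR)
        CALL MPI_WIN_FENCE(0, WIN, IERR)

        CALL MPI_WIN_FREE(WIN, IERR)
      END SUBROUTINE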
II. Libraries for distributed array communication • We have seen that the communication patterns in languages like HPF are quite complex. • Often impractical for a compiler to generate low-level MPI calls to execute these. • Hence the development of higher-level libraries that directly support operations on distributed arrays. • This section of the talk will consider: • CHAOS/PARTI • Adlib
1. CHAOS/PARTI • Series of libraries developed at University of Maryland. • Original PARTI primitives designed to deal with irregular scientific computations. • Also includes Multiblock PARTI—handles problems defined over multiple regular grids.
Irregular applications • Typical case: physical problem discretized on unstructured mesh, represented as a graph. • Example: arrays X, Y defined over nodes of graph. I-th edge of graph connects nodes EDGE1(I) and EDGE2(I).
Characteristic inner loop

      DO I = 1, NEDGE
        Y(EDGE1(I)) = Y(EDGE1(I)) + &
          F(X(EDGE1(I)), X(EDGE2(I)))
        Y(EDGE2(I)) = Y(EDGE2(I)) + &
          G(X(EDGE1(I)), X(EDGE2(I)))
      ENDDO

• The value of Y at a node is a sum of terms, each depending on the values of X at the two ends of an edge connected to that node.
An irregular problem graph Note: uses irregular distribution of X, Y—1, 2, 5 on P0; 3, 4, 6 on P1. Inessential to discussion here—by permuting node labelling could use HPF-like regular block distribution.
Locality of reference • An important class of physical problems, while irregular, has the property of locality of reference: • local loops go over locally held elements of the indirection vectors; • by suitably partitioning nodes, the majority of locally held elements of the indirection vectors (EDGE1, EDGE2) reference locally held elements of the data arrays (X, Y). • In the example, only edge (2,3), held on P0, causes a non-local reference.
PARTI ghost regions • PARTI primitives exploit locality of reference; ghost region method, similar to stencil updates. • Indirection vectors preprocessed to convert global index values to local subscripts. • Indirection to non-local element—local subscript goes into ghost region of data array. • Ghost regions filled or flushed by collective communication routines, called outside local processing loop.
Simplified irregular loop

      DO I = 1, N
        X(IA(I)) = X(IA(I)) + Y(IB(I))
      ENDDO

• Inspector phase takes distribution of data arrays, global subscripts, and returns local subscripts and communication schedules. • Executor phase takes communication schedules. Collective GATHER fills ghost regions. Does local arithmetic. Collective SCATTER_ADD flushes ghost regions back to X array, accumulating.
PARTI inspector and executor for simple irregular loop

C Create required schedules (inspector)
      CALL LOCALIZE(DAD_X, SCHED_IA, IA, LOC_IA, I_BLK_COUNT,
     &              OFF_PROC_X)
      CALL LOCALIZE(DAD_Y, SCHED_IB, IB, LOC_IB, I_BLK_COUNT,
     &              OFF_PROC_Y)

C Actual computation (executor)
      CALL GATHER(Y(Y_BLK_SIZE+1), Y, SCHED_IB)
      CALL ZERO_OUT_BUFFER(X(X_BLK_SIZE+1), OFF_PROC_X)
      DO L = 1, I_BLK_COUNT
         X(LOC_IA(L)) = X(LOC_IA(L)) + Y(LOC_IB(L))
      ENDDO
      CALL SCATTER_ADD(X(X_BLK_SIZE+1), X, SCHED_IA)
Features • Communication schedule created by analysing requested set of accesses. • Send lists of accessed elements to processors that own them. • Detect appropriate aggregations and redundancy eliminations. • End result: digested list of messages that must be sent, received, with local source and destination elements.
Lessons from CHAOS/PARTI • Important inspector-executor model. • Construction of communication schedules should be separated from execution of schedules. • One benefit: common situation where pattern of subscripting constant over many iterations of an outer loop. • Can lift inspector phase out of main loop. No need to repeat computations every time.
2. Adlib • High-level runtime library, designed to support translation of data parallel languages. • Early version implemented in 1994 in the shpf project at Southampton University. • Much improved version produced during the Parallel Compiler Runtime Consortium (PCRC) project at Syracuse.
Features • Built-in model of distributed arrays and sections. • Equivalent to HPF 1.0 model, plus ghost extensions and general block distribution from HPF 2.0. • Collective communication library. • Direct support for array section assignments, ghost region updates, F90 array intrinsics, general gather/scatter. • Implemented on top of MPI. • Adlib kernel implemented in C++. • Object-based distributed array descriptor (DAD)—see previous lecture. Schedule classes.
Communication schedules • All collective operations based on communication schedule objects. • Each kind of operation has associated class of schedules. • Particular instances—involving specific arrays and other parameters—created by class constructors. • Executing schedule object initiates communications, etc, to effect operation.
Advantages of schedule-based approach • As in CHAOS/PARTI, schedules can be reused when the same pattern of accesses is repeated. • Even in the single-use case, component communications generally have to be aggregated and sorted for efficiency. • Requires creation of temporary data structures—essentially the schedule. • Adlib does only essential optimization at schedule-creation time. Supports, but does not assume, amortization over many executions.
The Remap class • Characteristic example of communication schedule class:

class Remap {
  Remap(const DAD* dst, const DAD* src, const int len);
  void execute(void* dstDat, void* srcDat);
};
Operation of Remap • Effects the “array assignment” form of communication. • Copies data between two arrays or sections of same shape and element type. • Source and destination can have any, unrelated mapping. • Similar to operation called regular section move in Multiblock PARTI.
Methods on Remap class • Constructor • dst: DAD for destination array. • src: DAD for source array. • len: Length in bytes of an individual element. • execute() • dstDat: address of the local elements of the destination array. • srcDat: address of the local elements of the source array.
HPF example generally needing communication

      SUBROUTINE ADD_VEC(A, B, C)
      REAL A(:), B(:), C(:)
!HPF$ INHERIT A, B, C

      FORALL (I = 1:SIZE(A)) A(I) = B(I) + C(I)
      END

• In general arrays A, B, C have different mapping. • May copy B, C to temporaries with same mapping as A.
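One plausible source-level restatement of that strategy (a sketch of the idea, not actual compiler output; the temporaries TB, TC and their ALIGN directives are invented here): remap B and C into temporaries aligned with A, after which the FORALL is purely local. The remap assignments are exactly the kind of operation the Remap schedule described above performs.

      SUBROUTINE ADD_VEC(A, B, C)
      REAL A(:), B(:), C(:)
!HPF$ INHERIT A, B, C
      REAL TB(SIZE(A)), TC(SIZE(A))
!HPF$ ALIGN TB(I) WITH A(I)
!HPF$ ALIGN TC(I) WITH A(I)

      TB = B      ! remap communication between unrelated mappings
      TC = C      ! remap communication between unrelated mappings
      FORALL (I = 1:SIZE(A)) A(I) = TB(I) + TC(I)   ! now purely local
      END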