Communication in Data Parallel Languages
Bryan Carpenter
NPAC at Syracuse University, Syracuse, NY 13244
dbc@npac.syr.edu
Goals of this lecture • Discuss patterns of communication needed to implement various constructs in High Performance Fortran. • Introduce some libraries that have been developed to support these communication patterns.
Contents of Lecture • Patterns of communication • Regular communications • Irregular array accesses. • “Unscheduled” access to remote data. • Libraries for distributed array communication • CHAOS/PARTI • Adlib
I. Patterns of communication • The last lecture gave translations of simple HPF fragments that involved no inter-processor communication. • Realistic HPF programs translate to SPMD programs with communications for accessing array elements on other processors. • Will take a pattern-oriented look at the required communications.
Classifying communication patterns • Will discuss 5 situations: • Array assignments • Stencil problems • Reductions and transformational intrinsics • General subscripting in array-parallel code • Accessing remote data in task parallel code • Don’t claim these are exhaustive, but cover many of the cases arising in useful parallel programs.
1. Array assignments • Variants of the assignment A = B hide many communication patterns. • A and B may be any conforming array sections. • Will see that the required communication patterns encompass most of those occurring in the collective communications of MPI.
Array assignment with neighbor communication

!HPF$ PROCESSORS P(4)
      REAL A(50), B(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK) ONTO P

      A(1:49) = B(2:50)

• In emitted SPMD program, communications might be implemented with MPI_SEND, MPI_RECV, or MPI_SENDRECV.
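For illustration, a minimal sketch of how the generated node code might perform this shift, assuming a simplified uniform block size; the names SHIFT_EDGE, B_LOC, B_GHOST, and BLK are invented here and are not from the lecture.

      SUBROUTINE SHIFT_EDGE(B_LOC, B_GHOST, BLK)
        INCLUDE 'mpif.h'
        INTEGER :: BLK, RANK, NPROC, LEFT, RIGHT, IERR
        INTEGER :: STATUS(MPI_STATUS_SIZE)
        REAL :: B_LOC(BLK)   ! this process's block of B
        REAL :: B_GHOST      ! receives the right neighbour's first element

        CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
        CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
        LEFT  = RANK - 1
        RIGHT = RANK + 1
        IF (LEFT .LT. 0)      LEFT  = MPI_PROC_NULL
        IF (RIGHT .GE. NPROC) RIGHT = MPI_PROC_NULL

        ! Each process passes its first B element to the left neighbour and
        ! receives the right neighbour's first element; afterwards the last
        ! locally owned element of A can be computed from B_GHOST.
        CALL MPI_SENDRECV(B_LOC(1), 1, MPI_REAL, LEFT,  0, &
                          B_GHOST,  1, MPI_REAL, RIGHT, 0, &
                          MPI_COMM_WORLD, STATUS, IERR)
      END SUBROUTINE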
Array assignment with gather communication

!HPF$ PROCESSORS P(4)
      REAL A(50, 50), B(50)
!HPF$ DISTRIBUTE A(*, BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK) ONTO P

      A(:, 1) = B

• In emitted SPMD program, communications might be implemented with MPI_GATHER. • If assignment direction reversed, might use MPI_SCATTER.
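A minimal sketch of the gather, under the simplifying assumption of equal block sizes (the uneven last block of a real BLOCK distribution would need MPI_GATHERV); GATHER_COLUMN, B_LOC, and COL1 are invented names.

      SUBROUTINE GATHER_COLUMN(B_LOC, BLK, COL1, ROOT)
        INCLUDE 'mpif.h'
        INTEGER :: BLK, ROOT, IERR
        REAL :: B_LOC(BLK)   ! this process's block of B
        REAL :: COL1(*)      ! significant on ROOT only: receives all of B, i.e. A(:, 1)

        ! Every process contributes its block; the owner of column 1 of A
        ! (rank ROOT) receives the blocks in process order.
        CALL MPI_GATHER(B_LOC, BLK, MPI_REAL, COL1, BLK, MPI_REAL, &
                        ROOT, MPI_COMM_WORLD, IERR)
      END SUBROUTINE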
Array assignment gathering to all

!HPF$ PROCESSORS P(4)
      REAL A(50, 50), B(50), C(50)
!HPF$ DISTRIBUTE A(*, BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK) ONTO P
!HPF$ ALIGN C(I) WITH A(I, *)

      C = B

• C has replicated, collapsed alignment. • Communication in the emitted SPMD program might be implemented with MPI_ALLGATHER.
Array assignments with all-to-all communication

      REAL A(50, 50), B(50, 50)
!HPF$ DISTRIBUTE A(*, BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK, *) ONTO P

      A = B

or

      REAL A(50), B(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ DISTRIBUTE B(CYCLIC) ONTO P

      A = B

• Communication in the emitted SPMD program might be implemented with MPI_ALLTOALL.
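In practice the amounts exchanged between pairs of processes differ, so a plausible translation packs local elements by destination and uses the vector variant MPI_ALLTOALLV. A hedged skeleton only; all names are invented, and the counts and displacements are assumed to be computed by the compiler or runtime.

      SUBROUTINE REMAP_ALLTOALL(SENDBUF, SENDCNT, SDISPL, &
                                RECVBUF, RECVCNT, RDISPL, NPROC)
        INCLUDE 'mpif.h'
        INTEGER :: NPROC, IERR
        INTEGER :: SENDCNT(NPROC), SDISPL(NPROC)
        INTEGER :: RECVCNT(NPROC), RDISPL(NPROC)
        REAL :: SENDBUF(*), RECVBUF(*)

        ! SENDBUF holds this process's elements of B already sorted by
        ! destination process; one collective call delivers every element
        ! to its new owner, and the receiver then unpacks RECVBUF into A.
        CALL MPI_ALLTOALLV(SENDBUF, SENDCNT, SDISPL, MPI_REAL, &
                           RECVBUF, RECVCNT, RDISPL, MPI_REAL, &
                           MPI_COMM_WORLD, IERR)
      END SUBROUTINE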
2. Stencil problems • Updates where element updated in terms of fixed “footprint” of neighboring elements. • Arise in solution of PDEs, cellular automata, image processing…
Famous example: Jacobi relaxation for Laplace

      FORALL (I = 2:N-1, J = 2:N-1) &
        U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1) + &
                          U(I-1, J) + U(I+1, J))

• Can be recast in terms of array assignment:

      U(2:N-1, 2:N-1) = &
        0.25 * (U(2:N-1, 1:N-2) + U(2:N-1, 3:N) &
                + U(1:N-2, 2:N-1) + U(3:N, 2:N-1))
Array assignment version of Laplace solver • Introduce temporaries T1, …, T4 aligned to section U(2:N-1, 2:N-1). Then:

      T1 = U(2:N-1, 1:N-2)
      T2 = U(2:N-1, 3:N)
      T3 = U(1:N-2, 2:N-1)
      T4 = U(3:N, 2:N-1)

      U(2:N-1, 2:N-1) = 0.25 * (T1 + T2 + T3 + T4)

• Assignments to Ts need shift communications, as described earlier. Final assignment is pure computation—no communication.
Problems with array assignment implementation • Fine in terms of volume of inter-processor communication, but • problematic because it involves multiple memory-memory copies of whole arrays. • Original loop had good cache-locality: • spatial locality • temporal locality • Splitting into multiple array copies—multiple loops—causes memory cache to be loaded and flushed several times.
An aside: array syntax and cache. • “Array syntax” style of parallel programming evolved to support SIMD and vector processors. • Direct transcription may lead to poor use of cache on modern microprocessors. • On modern computers, memory access costs typically dominate over costs of arithmetic. • Compiler may have to work hard to fuse sequences of array assignments into loops with good locality… un-vectorizing!
Better approach: Translation using ghost regions

      REAL U(0:BLK_SIZE1+1, 0:BLK_SIZE2+1)
      REAL T(BLK_SIZE1, BLK_SIZE2)

      … Update ghost area of U with values from neighbors

      DO L1 = 1, BLK_COUNT1
        DO L2 = 1, BLK_COUNT2
          T(L1, L2) = 0.25 * (U(L1, L2-1) + U(L1, L2+1) + U(L1-1, L2) + U(L1+1, L2))
        ENDDO
      ENDDO

      … Copy T to interior part of U (local copying)
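A hedged sketch of what the elided ghost-area update might look like along the second (column) dimension, assuming a one-deep ghost region; the first dimension would need a strided datatype or packing. LEFT and RIGHT are the neighbouring ranks along this dimension (MPI_PROC_NULL at the boundary), and all names here are invented.

      SUBROUTINE EXCHANGE_COLS(U, BLK1, BLK2, LEFT, RIGHT)
        INCLUDE 'mpif.h'
        INTEGER :: BLK1, BLK2, LEFT, RIGHT, IERR
        INTEGER :: STATUS(MPI_STATUS_SIZE)
        REAL :: U(0:BLK1+1, 0:BLK2+1)

        ! Send my first interior column left, receive my right ghost column.
        CALL MPI_SENDRECV(U(1, 1),      BLK1, MPI_REAL, LEFT,  0, &
                          U(1, BLK2+1), BLK1, MPI_REAL, RIGHT, 0, &
                          MPI_COMM_WORLD, STATUS, IERR)

        ! Send my last interior column right, receive my left ghost column.
        CALL MPI_SENDRECV(U(1, BLK2),   BLK1, MPI_REAL, RIGHT, 0, &
                          U(1, 0),      BLK1, MPI_REAL, LEFT,  0, &
                          MPI_COMM_WORLD, STATUS, IERR)
      END SUBROUTINE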
3. Reductions and other transformational intrinsics • Fortran intrinsic procedures operating on whole arrays, often returning an array result. • Some just “reshuffle” elements of arrays: • CSHIFT, EOSHIFT, TRANSPOSE, SPREAD, etc. Implementation similar to array assignments. • Another important class reduces arrays to scalar or lower-rank arrays: • SUM, PRODUCT, MAXVAL, MINVAL, etc. Communication pattern quite different. Similar to MPI_REDUCE.
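As an illustration, a minimal sketch (invented names, not from the lecture) of how SUM over a block-distributed array might be translated: reduce the local block, then combine the partial sums so every process holds the scalar result.

      REAL FUNCTION DIST_SUM(A_LOC, NLOC)
        INCLUDE 'mpif.h'
        INTEGER :: NLOC, I, IERR
        REAL :: A_LOC(NLOC), S, G

        ! Reduce the locally held block first ...
        S = 0.0
        DO I = 1, NLOC
          S = S + A_LOC(I)
        ENDDO

        ! ... then combine the partial sums across all processes.
        CALL MPI_ALLREDUCE(S, G, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, IERR)
        DIST_SUM = G
      END FUNCTION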
Parallel prefix • Arithmetic operation generalizing reduction. Different optimal communication pattern. • Example: FORALL (I = 1 : N) RES (I) = SUM(A (1 : I)) The I-th element of the result contains the sum of the elements of A up to I. • HPF library function SUM_PREFIX. • Communication pattern like MPI_SCAN.
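A sketch of how a block-distributed prefix sum might be organized (illustrative names only): each process forms local prefix sums, an MPI_SCAN of the block totals supplies an offset, and the offset is added back in. MPI_SCAN is inclusive, so each process subtracts its own total.

      SUBROUTINE DIST_PREFIX_SUM(A_LOC, RES_LOC, NLOC)
        INCLUDE 'mpif.h'
        INTEGER :: NLOC, I, IERR
        REAL :: A_LOC(NLOC), RES_LOC(NLOC), TOTAL, OFFSET

        ! Local (serial) prefix sums over this process's block.
        TOTAL = 0.0
        DO I = 1, NLOC
          TOTAL = TOTAL + A_LOC(I)
          RES_LOC(I) = TOTAL
        ENDDO

        ! Inclusive scan of the block totals; subtracting my own total
        ! leaves the sum of all blocks strictly to my left.
        CALL MPI_SCAN(TOTAL, OFFSET, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, IERR)
        OFFSET = OFFSET - TOTAL

        DO I = 1, NLOC
          RES_LOC(I) = RES_LOC(I) + OFFSET
        ENDDO
      END SUBROUTINE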
4. General subscripting in array parallel statements • Often programs execute code like FORALL (I = 1:50) RES (I) = A (IND (I)) where RES, A, IND are all distributed arrays, and the subscript is not a simple linear function of the loop index variables. • Cannot be reduced to simple array assignment of type 1. • Cannot be expressed in a single MPI collective operation.
Single phase of communication not enough FORALL (I = 1:50) RES (I) = A (IND (I)) • Assume IND aligned with RES (if not, do an array assignment to make it so). • Owner of RES(I) also owns IND(I), so knows where the source element lives. • But the owner of the source element, A(IND(I)), generally does not know it must be moved. • Some request communication is needed.
Detailed example

!HPF$ PROCESSORS P(4)
      REAL A(50)
!HPF$ DISTRIBUTE A(CYCLIC) ONTO P
      REAL RES(50)
      INTEGER IND(50)
!HPF$ DISTRIBUTE RES(BLOCK) ONTO P
!HPF$ DISTRIBUTE IND(BLOCK) ONTO P

      IND = (/ 5, 41, 7, . . . /)

      FORALL (I = 1:50) RES (I) = A (IND (I))
Problem • First processor can deal with assignment of values A(5), A(41) to RES(1), RES(2) locally. • Element A(7), assigned to RES(3) on first processor, lives on third processor. • First processor knows A(7) is required; third processor, owner, does not. • Some two-way communication needed.
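A hedged sketch of the two-phase pattern for this example, with invented names throughout: requested global indices, pre-sorted by owning process, are sent to their owners with one MPI_ALLTOALLV; the owners look the values up in their local blocks of the CYCLIC-distributed A and return them with a second MPI_ALLTOALLV. The incoming request counts ANSCNT are assumed to have been obtained beforehand, for instance by an MPI_ALLTOALL of REQCNT.

      SUBROUTINE GATHER_CYCLIC(A_LOC, NA_LOC, REQ, NREQ, REQCNT, ANSCNT, VAL, NPROC)
        INCLUDE 'mpif.h'
        INTEGER :: NA_LOC, NREQ, NPROC, IERR, J, G, NIN
        REAL :: A_LOC(NA_LOC), VAL(NREQ)
        INTEGER :: REQ(NREQ), REQCNT(NPROC), ANSCNT(NPROC)
        INTEGER :: REQDSP(NPROC), ANSDSP(NPROC)
        INTEGER, ALLOCATABLE :: INCOMING(:)
        REAL,    ALLOCATABLE :: REPLY(:)

        ! Displacements from the counts.
        REQDSP(1) = 0
        ANSDSP(1) = 0
        DO J = 2, NPROC
          REQDSP(J) = REQDSP(J-1) + REQCNT(J-1)
          ANSDSP(J) = ANSDSP(J-1) + ANSCNT(J-1)
        ENDDO
        NIN = ANSDSP(NPROC) + ANSCNT(NPROC)
        ALLOCATE(INCOMING(NIN), REPLY(NIN))

        ! Phase 1: send each requested global index to its owner.
        CALL MPI_ALLTOALLV(REQ, REQCNT, REQDSP, MPI_INTEGER, &
                           INCOMING, ANSCNT, ANSDSP, MPI_INTEGER, &
                           MPI_COMM_WORLD, IERR)

        ! Owners translate global index G to a local subscript of their
        ! CYCLIC block and look up the value.
        DO J = 1, NIN
          G = INCOMING(J)
          REPLY(J) = A_LOC((G - 1) / NPROC + 1)
        ENDDO

        ! Phase 2: return the values along the reverse paths.
        CALL MPI_ALLTOALLV(REPLY, ANSCNT, ANSDSP, MPI_REAL, &
                           VAL, REQCNT, REQDSP, MPI_REAL, &
                           MPI_COMM_WORLD, IERR)
        DEALLOCATE(INCOMING, REPLY)
      END SUBROUTINE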
Generalizations of subscripting • Non-linear subscript may be on LHS:

      FORALL (I = 1:50) RES (IND (I)) = A (I)

• Subscripted array multidimensional:

      FORALL (I = 1:50) &
        RES (I) = A (IND1 (I), IND2(I))

• Subscripts linear but dependent:

      FORALL (I = 1:50) RES (I) = A (I, I)

• Loop multidimensional:

      FORALL (I = 1:50, J = 1:50) &
        RES (I, J) = A (IND (I, J))
General case. Source rank S, dest rank R. • Examples special cases of general gather:

      FORALL (I1 = 1:N1, …, IR = 1:NR) &
        RES(I1, …, IR) = &
          SRC(IND1(I1, …, IR), …, INDS(I1, …, IR))

and general scatter:

      FORALL (I1 = 1:N1, …, IS = 1:NS) &
        RES(IND1(I1, …, IS), …, INDR(I1, …, IS)) = &
          SRC(I1, …, IS)
5. Accessing remote data from task-parallel code segments • INDEPENDENT DO loop of HPF:

!HPF$ INDEPENDENT
      DO I = 1, 10
        . . .
      ENDDO

• Variable-uses must have no loop-carried data dependence—can execute in parallel. But • nothing forces all variables used in given iteration to have same home—communication needed—and • arbitrary control flow within iteration. Cannot predict before execution what accesses actually happen.
Similar problem calling PURE procedures from FORALL • This is allowed:

      PURE REAL FUNCTION FOO(X)
        REAL, INTENT(IN) :: X
        . . .
      END

      FORALL (I = 1 : N) RES (I) = FOO(1.0 * I)

• PURE procedures restricted—no side-effects. But • nothing to prevent them reading global data—random element of distributed array in common block, say. • Accesses not predictable from parallel points of call.
One-sided communication • Beyond point-to-point communication, apparently need ability to one-sidedly access memory of a remote processor. • Generally needed where parallel tasks are not sharing a single, logical, “loosely synchronous” thread of control. • MPI 2 standard added one-sided communication functionality to MPI. Still not widely implemented.
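For orientation, a minimal sketch of MPI-2 one-sided access, assuming each process exposes its local block of a distributed array in a window. Fence synchronization is just one possible choice, window creation and the fences are collective (so every process must call this routine), and all names (GET_REMOTE, A_LOC, OWNER, LOCIDX) are invented.

      SUBROUTINE GET_REMOTE(A_LOC, NLOC, OWNER, LOCIDX, VALUE)
        INCLUDE 'mpif.h'
        INTEGER :: NLOC, OWNER, LOCIDX, IERR, WIN, SIZEOFREAL
        INTEGER(KIND=MPI_ADDRESS_KIND) :: WINSIZE, DISP
        REAL :: A_LOC(NLOC), VALUE

        ! Expose the local block in a window.
        CALL MPI_TYPE_SIZE(MPI_REAL, SIZEOFREAL, IERR)
        WINSIZE = NLOC * SIZEOFREAL
        CALL MPI_WIN_CREATE(A_LOC, WINSIZE, SIZEOFREAL, MPI_INFO_NULL, &
                            MPI_COMM_WORLD, WIN, IERR)

        ! Read element LOCIDX of OWNER's block without OWNER taking part
        ! explicitly; the fences delimit the access epoch.
        CALL MPI_WIN_FENCE(0, WIN, IERR)
        DISP = LOCIDX - 1
        CALL MPI_GET(VALUE, 1, MPI_REAL, OWNER, DISP, 1, MPI_REAL, WIN, IERR)
        CALL MPI_WIN_FENCE(0, WIN, IERR)

        CALL MPI_WIN_FREE(WIN, IERR)
      END SUBROUTINE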
II. Libraries for distributed array communication • We have seen that the communication patterns in languages like HPF are quite complex. • Often impractical for a compiler to generate low-level MPI calls to execute these. • Hence the development of higher-level libraries that directly support operations on distributed arrays. • This section of the talk will consider: • CHAOS/PARTI • Adlib
1. CHAOS/PARTI • Series of libraries developed at University of Maryland. • Original PARTI primitives designed to deal with irregular scientific computations. • Also includes Multiblock PARTI—handles problems defined over multiple regular grids.
Irregular applications • Typical case: physical problem discretized on unstructured mesh, represented as a graph. • Example: arrays X, Y defined over nodes of graph. I-th edge of graph connects nodes EDGE1(I) and EDGE2(I).
Characteristic inner loop

      DO I = 1, NEDGE
        Y(EDGE1(I)) = Y(EDGE1(I)) + &
          F(X(EDGE1(I)), X(EDGE2(I)))
        Y(EDGE2(I)) = Y(EDGE2(I)) + &
          G(X(EDGE1(I)), X(EDGE2(I)))
      ENDDO

• The value of Y at a node is a sum of terms, each depending on the values of X at the two ends of an edge connected to that node.
An irregular problem graph Note: uses irregular distribution of X, Y—1, 2, 5 on P0; 3, 4, 6 on P1. Inessential to discussion here—by permuting node labelling could use HPF-like regular block distribution.
Locality of reference • An important class of physical problems, while irregular, has the property of locality of reference: • local loops go over locally held elements of the indirection vectors; • by suitably partitioning nodes, the majority of locally held elements of the indirection vectors (EDGE1, EDGE2) reference locally held elements of the data arrays (X, Y). • In the example, only edge (2,3), held on P0, causes a non-local reference.
PARTI ghost regions • PARTI primitives exploit locality of reference; ghost region method, similar to stencil updates. • Indirection vectors preprocessed to convert global index values to local subscripts. • Indirection to non-local element—local subscript goes into ghost region of data array. • Ghost regions filled or flushed by collective communication routines, called outside local processing loop.
Simplified irregular loop

      DO I = 1, N
        X(IA(I)) = X(IA(I)) + Y(IB(I))
      ENDDO

• Inspector phase takes distribution of data arrays, global subscripts, and returns local subscripts and communication schedules. • Executor phase takes communication schedules. Collective GATHER fills ghost regions. Does local arithmetic. Collective SCATTER_ADD flushes ghost regions back to X array, accumulating.
PARTI inspector and executor for simple irregular loop

C Create required schedules (inspector)
      CALL LOCALIZE(DAD_X, SCHED_IA, IA, LOC_IA, I_BLK_COUNT,
     &              OFF_PROC_X)
      CALL LOCALIZE(DAD_Y, SCHED_IB, IB, LOC_IB, I_BLK_COUNT,
     &              OFF_PROC_Y)

C Actual computation (executor)
      CALL GATHER(Y(Y_BLK_SIZE+1), Y, SCHED_IB)
      CALL ZERO_OUT_BUFFER(X(X_BLK_SIZE+1), OFF_PROC_X)
      DO L = 1, I_BLK_COUNT
         X(LOC_IA(L)) = X(LOC_IA(L)) + Y(LOC_IB(L))
      ENDDO
      CALL SCATTER_ADD(X(X_BLK_SIZE+1), X, SCHED_IA)
Features • Communication schedule created by analysing requested set of accesses. • Send lists of accessed elements to processors that own them. • Detect appropriate aggregations and redundancy eliminations. • End result: digested list of messages that must be sent, received, with local source and destination elements.
Lessons from CHAOS/PARTI • Important inspector-executor model. • Construction of communication schedules should be separated from execution of schedules. • One benefit: common situation where pattern of subscripting constant over many iterations of an outer loop. • Can lift inspector phase out of main loop. No need to repeat computations every time.
2. Adlib • High-level runtime library, designed to support translation of data parallel languages. • Early version implemented in 1994 in the shpf project at Southampton University. • Much improved version produced during the Parallel Compiler Runtime Consortium (PCRC) project at Syracuse.
Features • Built-in model of distributed arrays and sections. • Equivalent to HPF 1.0 model, plus ghost extensions and general block distribution from HPF 2.0. • Collective communication library. • Direct support for array section assignments, ghost region updates, F90 array intrinsics, general gather/scatter. • Implemented on top of MPI. • Adlib kernel implemented in C++. • Object-based distributed array descriptor (DAD)—see previous lecture. Schedule classes.
Communication schedules • All collective operations based on communication schedule objects. • Each kind of operation has associated class of schedules. • Particular instances—involving specific arrays and other parameters—created by class constructors. • Executing schedule object initiates communications, etc, to effect operation.
Advantages of schedule-based approach • As in CHAOS/PARTI, schedules can be reused when the same pattern of accesses is repeated. • Even in the single-use case, component communications generally have to be aggregated and sorted for efficiency. • Requires creation of temporary data structures—essentially the schedule. • Adlib does only essential optimization at schedule-creation time. Supports, but does not assume, amortization over many executions.
The Remap class • Characteristic example of communication schedule class:

class Remap {
  Remap(const DAD* dst, const DAD* src, const int len);
  void execute(void* dstDat, void* srcDat);
};
Operation of Remap • Effects the “array assignment” form of communication. • Copies data between two arrays or sections of same shape and element type. • Source and destination can have any, unrelated mapping. • Similar to operation called regular section move in Multiblock PARTI.
Methods on Remap class • Constructor • dst: DAD for destination array. • src: DAD for source array. • len: Length in bytes of an individual element. • execute() • dstDat: address of the local elements of the destination array. • srcDat: address of the local elements of the source array.
HPF example generally needing communication

      SUBROUTINE ADD_VEC(A, B, C)
      REAL A(:), B(:), C(:)
!HPF$ INHERIT A, B, C

      FORALL (I = 1:SIZE(A)) A(I) = B(I) + C(I)
      END

• In general arrays A, B, C have different mapping. • May copy B, C to temporaries with same mapping as A.
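One plausible source-level restatement of that strategy (a sketch of the idea, not actual compiler output; the temporaries TB, TC and their ALIGN directives are invented here): remap B and C into temporaries aligned with A, after which the FORALL is purely local. The remap assignments are exactly the kind of operation the Remap schedule described above performs.

      SUBROUTINE ADD_VEC(A, B, C)
      REAL A(:), B(:), C(:)
!HPF$ INHERIT A, B, C
      REAL TB(SIZE(A)), TC(SIZE(A))
!HPF$ ALIGN TB(I) WITH A(I)
!HPF$ ALIGN TC(I) WITH A(I)

      TB = B      ! remap communication between unrelated mappings
      TC = C      ! remap communication between unrelated mappings
      FORALL (I = 1:SIZE(A)) A(I) = TB(I) + TC(I)   ! now purely local
      END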