Compiling High Performance Fortran Allen and Kennedy, Chapter 14
Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary
Motivation for HPF • Scalable distributed-memory multiprocessors require "message passing" to communicate data between processors • Approach 1: Use MPI calls in Fortran/C code
Motivation for HPF: MPI implementation

Consider the following sum reduction:

PROGRAM SUM
  REAL A(10000)
  READ (9) A
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END

The hand-coded message-passing (SPMD) version:

PROGRAM SUM
  REAL A(100), BUFF(100)
  IF (PID == 0) THEN
    DO IP = 0, 99
      READ (9) BUFF(1:100)
      IF (IP == 0) THEN
        A(1:100) = BUFF(1:100)
      ELSE
        SEND(IP, BUFF, 100)
      ENDIF
    ENDDO
  ELSE
    RECV(0, A, 100)
  ENDIF
  /* Actual sum reduction code here */
  IF (PID == 0) THEN
    SEND(1, SUM, 1)
  ELSE
    RECV(PID-1, T, 1)
    SUM = SUM + T
    IF (PID < 99) THEN
      SEND(PID+1, SUM, 1)
    ELSE
      SEND(0, SUM, 1)
    ENDIF
  ENDIF
  IF (PID == 0) THEN
    RECV(99, SUM, 1)
    PRINT SUM
  ENDIF
END
Motivation for HPF • Disadvantages of the MPI approach • The user has to rewrite the program in SPMD form [Single Program Multiple Data] • The user has to manage data movement [send & receive], data placement, and synchronization • Messy and hard to master
Motivation for HPF • Approach 2: Use HPF • HPF is an extended version of Fortran 90: Fortran 90 features plus a few directives • Directives tell how data is laid out in processor memories in the parallel machine configuration. For example: • !HPF$ DISTRIBUTE A(BLOCK) • Directives also assist in identifying parallelism. For example: • !HPF$ INDEPENDENT
Motivation for HPF

The same sum reduction code:

PROGRAM SUM
  REAL A(10000)
  READ (9) A
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END

When written in HPF:

PROGRAM SUM
  REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
  READ (9) A
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END

• Minimal modification, easy to write
• But now the compiler has to do more work
Motivation for HPF • Advantages of HPF • User needs only to write some easy directives; need not write the whole program in SPMD form • User does not need to manage data movement [send & receive] and synchronization • Simple and easy to master
Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary
HPF Compilation Overview

Running example:

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
  DO I = 2, 10000
S1: A(I) = B(I-1) + C
  ENDDO
  DO I = 1, 10000
S2: B(I) = A(I)
  ENDDO
ENDDO

Step 1: Dependence Analysis — used for communication analysis; the fact used here is that no dependence is carried by the I loops.
HPF Compilation Overview (continued)

For the same running example, the next steps are:
• Step 2: Distribution Analysis
• Step 3: Computation Partitioning — partition so as to distribute the work of the I loops
HPF Compilation Overview (continued)

Step 4: Communication Analysis and Placement — communication is required for B(0) on each J iteration; B(0) is a shadow region.

REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    DO I = 2, 100
S1:   A(I) = B(I-1) + C
    ENDDO
    DO I = 1, 100
S2:   B(I) = A(I)
    ENDDO
ENDDO
HPF Compilation Overview (continued)

Step 5: Optimization — aggregation, overlapping communication with computation, recognition of reductions. Moving S1's loop between the send and the receive overlaps communication with computation:

REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
    DO I = 2, 100
S1:   A(I) = B(I-1) + C
    ENDDO
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    DO I = 1, 100
S2:   B(I) = A(I)
    ENDDO
ENDDO
Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary
Basic Loop Compilation • Distribution Propagation and analysis • Analyze what distribution holds for a given array at a given point in the program • Difficult due to • REALIGN and REDISTRIBUTE directives • Distribution of formal parameters inherited from calling procedure • Use “Reaching Decompositions” data flow analysis and its interprocedural version
Basic Loop Compilation • For simplicity, assume a single distribution for an array at all points in a subprogram • Define the block size B = CEIL(N/P) • For example, if array A of size N is block-distributed over P processors, processor p owns elements A(pB+1 : (p+1)B) (clipped to N), where B = CEIL(N/P)
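The block-size and ownership arithmetic above can be sketched in Python (an illustrative model, not HPF compiler code; the function names are my own):

```python
import math

def block_size(n, p):
    """Block size B = ceil(N/P) for an N-element array BLOCK-distributed
    over P processors."""
    return math.ceil(n / p)

def owned_range(pid, n, p):
    """1-based global index range [lo, hi] of the elements owned by
    processor pid, clipped to N for the last processor."""
    b = block_size(n, p)
    return pid * b + 1, min((pid + 1) * b, n)

# With A(10000) over 100 processors, B = 100 and processor 0 owns A(1:100).
```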
Iteration Partitioning (Basic Loop Compilation)
• Dividing work among processors: computation partitioning determines which iterations of a loop will be executed on which processor
• Owner-computes rule: iteration I is executed on the owner of A(I)

REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
DO I = 1, 10000
  A(I) = A(I) + C
ENDDO

• With 100 processors: the first 100 iterations run on processor 0, the next 100 on processor 1, and so on
Iteration Partitioning • When a loop contains multiple statements in a recurrence, choose a single partitioning reference, say A(f(I)) • The processor responsible for performing the computation for iteration I is the owner of A(f(I)), i.e. p = FLOOR((f(I) - 1) / B) • The set of indices executed on processor p is { I : pB + 1 <= f(I) <= (p+1)B }
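The owner-computes partitioning can be sketched as follows (a hedged Python illustration; `f` stands for the subscript function of the partitioning reference, e.g. f(I) = I + 1 for A(I+1)):

```python
def iterations_on(pid, n_iters, b, f):
    """Iterations I in 1..n_iters executed on processor pid under the
    owner-computes rule: those whose partitioning reference A(f(I))
    falls in pid's block, i.e. pid*b + 1 <= f(I) <= (pid+1)*b."""
    return [I for I in range(1, n_iters + 1)
            if pid * b + 1 <= f(I) <= (pid + 1) * b]

# For A(I+1) = B(I) + C with block size 100:
# processor 0 executes I = 1..99, processor 1 executes I = 100..199, ...
```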
Iteration Partitioning
• Have to map the global loop index to a local loop index
• The smallest value in each processor's index set maps to local index 1

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO
Iteration Partitioning

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO

• Map the global iteration space I to the local iteration space i as follows (block size 100): i = I - 100*PID + 1
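A small sketch of this global-to-local mapping (a hypothetical helper, assuming block size 100 as in the example):

```python
def global_to_local(I, pid, b=100):
    """Map global iteration I, executed on processor pid, to the local
    iteration number i for the loop A(I+1) = B(I) + C: i = I - pid*b + 1,
    so the smallest iteration owned by an interior processor maps to 1."""
    return I - pid * b + 1

# Processor 1 executes global iterations 100..199 as local iterations 1..100;
# processor 0 starts at global iteration 1, which maps to local i = 2.
```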
Iteration Partitioning • Adjust array subscripts for local iterations: the reference A(I+1) becomes A(i), and B(I) becomes B(i-1)
Iteration Partitioning
• For interior processors the code becomes:

DO i = 1, 100
  A(i) = B(i-1) + C
ENDDO

• Adjust the lower bound for the first processor and the upper bound of the last processor to take care of boundary conditions:

lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100) - 1) hi = MOD(N,100) + 1
DO i = lo, hi
  A(i) = B(i-1) + C
ENDDO
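The lo/hi boundary adjustment can be modeled like this (a sketch under the example's assumptions: block size 100, N small enough that A(N+1) stays in bounds):

```python
import math

def local_bounds(pid, n, b=100):
    """Local loop bounds for DO I = 1, N with A(I+1) = B(I) + C:
    processor 0 starts at i = 2 (no iteration writes A(1)), and the
    last participating processor stops at MOD(N, b) + 1."""
    last_p = math.ceil((n + 1) / b) - 1
    lo = 2 if pid == 0 else 1
    hi = (n % b) + 1 if pid == last_p else b
    return lo, hi

# With N = 950: processor 0 runs i = 2..100, processor 9 runs i = 1..51.
```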
Communication Generation

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
...
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO

• No communication is required for iterations in [100p+1 : 100p+99]
• Iterations which require receiving data are [100p : 100p] (B(100p) must arrive from processor p-1)
• Iterations which require sending data are [100p+100 : 100p+100] (B(100p+100) is needed by processor p+1)
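Which iterations need communication can be checked directly from ownership (an illustrative sketch, same block-size-100 assumptions as above):

```python
def comm_for_iteration(I, b=100):
    """For A(I+1) = B(I) + C: iteration I runs on the owner of A(I+1);
    if B(I) lives on the previous processor, that owner must receive it
    (and the previous processor must send it)."""
    owner_lhs = I // b          # owner of A(I+1) = floor((I+1-1)/b)
    owner_rhs = (I - 1) // b    # owner of B(I)
    return "recv" if owner_lhs != owner_rhs else "local"

# Iteration 100p needs a receive; iterations 100p+1 .. 100p+99 are local.
```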
Communication Generation

After inserting the receive:

lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100) - 1) hi = MOD(N,100) + 1
DO i = lo, hi
  IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
  A(i) = B(i-1) + C
ENDDO

The send must happen in the 101st iteration:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
DO i = lo, hi+1
  IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
  IF (i <= hi) THEN
    A(i) = B(i-1) + C
  ENDIF
  IF (i == hi+1 && PID /= lastP) SEND(PID+1, B(100), 1)
ENDDO
Communication Generation

Move the SEND outside the loop:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO i = lo, hi
    IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
    A(i) = B(i-1) + C
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF
Communication Generation

Move the receive outside the loop and peel the first iteration:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (lo == 1 && PID /= 0) THEN
    RECV(PID-1, B(0), 1)
    A(1) = B(0) + C
  ENDIF
  ! lo = MAX(lo, 1+1) == 2
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF
Communication Generation

Finally, move the SEND ahead of the receive, so communication can overlap computation:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
  IF (lo == 1 && PID /= 0) THEN
    RECV(PID-1, B(0), 1)
    A(1) = B(0) + C
  ENDIF
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
ENDIF
Communication Generation
• When is such a rearrangement legal?
• Model a receive as a copy from a global to a local location, and a send as a copy from a local to a global location:

IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      B(0) = Bg(0) ! RECV
      A(1) = B(0) + C
    ENDIF
    DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100) ! SEND
ENDIF

• The rearrangement is legal because there is no chain of dependences from S1 to S2
Communication Generation

REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
...
DO I = 1, N
  A(I+1) = A(I) + C
ENDDO

would be rewritten as:

IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      A(0) = Ag(0) ! RECV
      A(1) = A(0) + C
    ENDIF
    DO i = 2, hi
      A(i) = A(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Ag(100) = A(100) ! SEND
ENDIF

• Here the rearrangement would NOT be correct: the recurrence on A creates a chain of dependences from S1 to S2, so the send cannot be moved ahead of the receive
Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary
Communication Vectorization

REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = B(I,J) + C
  ENDDO
ENDDO

Using basic loop compilation gives:

DO J = 1, M
  lo = 1
  IF (PID == 0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID == lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
    IF (lo == 1) THEN
      RECV(PID-1, B(0,J), 1)
      A(1,J) = B(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDIF
ENDDO
Communication Vectorization

Distribute the J loop:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
  ENDDO
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, B(0,J), 1)
      A(1,J) = B(0,J) + C
    ENDIF
  ENDDO
  DO J = 1, M
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDDO
ENDIF
Communication Vectorization

Vectorize the messages: one M-element message replaces M one-element messages:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (lo == 1) THEN
    RECV(PID-1, B(0,1:M), M)
    DO J = 1, M
      A(1,J) = B(0,J) + C
    ENDDO
  ENDIF
  DO J = 1, M
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
ENDIF
Communication Vectorization

DO J = 1, M
  lo = 1
  IF (PID == 0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID == lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
S1:   IF (PID /= lastP) Bg(100,J) = B(100,J)
      IF (lo == 1) THEN
S2:     B(0,J) = Bg(0,J)
S3:     A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
S4:     A(i,J) = B(i-1,J) + C
      ENDDO
  ENDIF
ENDDO

• Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if the communication statements are not involved in a recurrence carried by the outer loop
Communication Vectorization

REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = A(I,J) + B(I,J)
  ENDDO
ENDDO

• Can the sends be done before the receives? Can the communication be vectorized?

REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J+1) = A(I,J) + C
  ENDDO
ENDDO

• Can the sends be done before the receives? Can the communication be fully vectorized?
Overlapping Communication and Computation

Before:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
S1: IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
ENDIF

After moving the independent loop L1 between the send and the receive:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
S1: IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
ENDIF
Pipelining

REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = A(I,J) + C
  ENDDO
ENDDO

Initial code generation for the I loop gives:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J), 1)
      A(1,J) = A(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = A(i-1,J) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
  ENDDO
ENDIF

• The communication could be vectorized, but that gives up the pipelined parallelism
Pipelining
• Pipelined parallelism with communication [timeline figure omitted]
• Pipelined parallelism with communication overhead [timeline figure omitted]
Pipelining: Blocking

Per-column communication (as above):

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J), 1)
      A(1,J) = A(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = A(i-1,J) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
  ENDDO
ENDIF

Blocked by a factor of K:

...
IF (PID <= lastP) THEN
  DO J = 1, M, K
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J:J+K-1), K)
      DO j = J, J+K-1
        A(1,j) = A(0,j) + C
      ENDDO
    ENDIF
    DO j = J, J+K-1
      DO i = 2, hi
        A(i,j) = A(i-1,j) + C
      ENDDO
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
  ENDDO
ENDIF
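The choice of blocking factor K trades pipeline startup against message overhead. A rough cost model (my own hedged sketch, not a formula from the chapter): with M columns, P processors, per-column compute cost c and per-message startup cost alpha, each processor works in M/K strips plus P-1 pipeline-fill stages, each strip costing K*c + alpha:

```python
def pipeline_time(K, M, P, c, alpha):
    """Rough model of pipelined execution time with blocking factor K:
    (M/K + P - 1) stages, each costing K*c compute plus alpha startup."""
    return (M / K + P - 1) * (K * c + alpha)

# K = 1 maximizes overlap but pays alpha on every column; when message
# startup dominates, a larger K wins.
```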
Other Optimizations • Alignment and replication • Identification of common recurrences • Storage management • Minimize temporary storage used for communication • Space taken for temporary storage should be at most equal to the space taken by the arrays themselves • Interprocedural optimizations
Summary • HPF is easy to code, but hard to compile • Steps required to compile HPF programs • Basic loop compilation • Communication generation • Optimizations • Communication vectorization • Overlapping communication with computation • Pipelining