Compiling High Performance Fortran Allen and Kennedy, Chapter 14
Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary
Motivation for HPF • Scalable distributed-memory multiprocessors require "message passing" to communicate data between processors • Approach 1: Use MPI calls in Fortran/C code
Motivation for HPF: MPI implementation

Consider the following sum reduction:

PROGRAM SUM
  REAL A(10000)
  READ (9) A
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END

The hand-coded message-passing (SPMD) version:

PROGRAM SUM
  REAL A(100), BUFF(100)
  IF (PID == 0) THEN
    DO IP = 0, 99
      READ (9) BUFF(1:100)
      IF (IP == 0) THEN
        A(1:100) = BUFF(1:100)
      ELSE
        SEND(IP, BUFF, 100)
      ENDIF
    ENDDO
  ELSE
    RECV(0, A, 100)
  ENDIF
  /* Actual sum reduction code here */
  IF (PID == 0) THEN
    SEND(1, SUM, 1)
  ELSE
    RECV(PID-1, T, 1)
    SUM = SUM + T
    IF (PID < 99) THEN
      SEND(PID+1, SUM, 1)
    ELSE
      SEND(0, SUM, 1)
    ENDIF
  ENDIF
  IF (PID == 0) THEN
    RECV(99, SUM, 1)
    PRINT SUM
  ENDIF
END
Motivation for HPF • Disadvantages of the MPI approach • The user has to rewrite the program in SPMD form [Single Program Multiple Data] • The user has to manage data movement [send & receive], data placement, and synchronization • Messy and hard to master
Motivation for HPF • Approach 2: Use HPF • HPF is an extended version of Fortran 90: Fortran 90 features plus a few directives • Directives tell how data is laid out in processor memories in the parallel machine configuration. For example: • !HPF$ DISTRIBUTE A(BLOCK) • Directives also assist in identifying parallelism. For example: • !HPF$ INDEPENDENT
Motivation for HPF

The same sum reduction code:

PROGRAM SUM
  REAL A(10000)
  READ (9) A
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END

When written in HPF:

PROGRAM SUM
  REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
  READ (9) A
  SUM = 0.0
  DO I = 1, 10000
    SUM = SUM + A(I)
  ENDDO
  PRINT SUM
END

• Minimal modification, easy to write
• But now the compiler has to do more work
Motivation for HPF • Advantages of HPF • User needs only to write some easy directives; need not write the whole program in SPMD form • User does not need to manage data movement [send & receive] and synchronization • Simple and easy to master
Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary
HPF Compilation Overview

Running example:

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
  DO I = 2, 10000
S1: A(I) = B(I-1) + C
  ENDDO
  DO I = 1, 10000
S2: B(I) = A(I)
  ENDDO
ENDDO

Step 1: Dependence Analysis — used for communication analysis; the fact used here is that no dependence is carried by the I loops.
HPF Compilation Overview (continued)

For the same running example, the next steps are:
• Step 2: Distribution Analysis
• Step 3: Computation Partitioning — partition so as to distribute the work of the I loops
HPF Compilation Overview (continued)

Step 4: Communication Analysis and Placement — communication is required for B(0) on each J iteration; B(0) is a shadow region.

REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    DO I = 2, 100
S1:   A(I) = B(I-1) + C
    ENDDO
    DO I = 1, 100
S2:   B(I) = A(I)
    ENDDO
ENDDO
HPF Compilation Overview (continued)

Step 5: Optimization — aggregation, overlapping communication with computation, recognition of reductions. Moving S1's loop between the send and the receive overlaps communication with computation:

REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
    DO I = 2, 100
S1:   A(I) = B(I-1) + C
    ENDDO
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    DO I = 1, 100
S2:   B(I) = A(I)
    ENDDO
ENDDO
Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary
Basic Loop Compilation • Distribution Propagation and analysis • Analyze what distribution holds for a given array at a given point in the program • Difficult due to • REALIGN and REDISTRIBUTE directives • Distribution of formal parameters inherited from calling procedure • Use “Reaching Decompositions” data flow analysis and its interprocedural version
Basic Loop Compilation • For simplicity, assume a single distribution for an array at all points in a subprogram • Define the block size B = CEIL(N/P) • For example, if array A of size N is block-distributed over P processors, processor p owns elements A(pB+1 : (p+1)B) (clipped to N), where B = CEIL(N/P)
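The block-size and ownership arithmetic above can be sketched in Python (an illustrative model, not HPF compiler code; the function names are my own):

```python
import math

def block_size(n, p):
    """Block size B = ceil(N/P) for an N-element array BLOCK-distributed
    over P processors."""
    return math.ceil(n / p)

def owned_range(pid, n, p):
    """1-based global index range [lo, hi] of the elements owned by
    processor pid, clipped to N for the last processor."""
    b = block_size(n, p)
    return pid * b + 1, min((pid + 1) * b, n)

# With A(10000) over 100 processors, B = 100 and processor 0 owns A(1:100).
```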
Iteration Partitioning (Basic Loop Compilation)
• Dividing work among processors: computation partitioning determines which iterations of a loop will be executed on which processor
• Owner-computes rule: iteration I is executed on the owner of A(I)

REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
DO I = 1, 10000
  A(I) = A(I) + C
ENDDO

• With 100 processors: the first 100 iterations run on processor 0, the next 100 on processor 1, and so on
Iteration Partitioning • When a loop contains multiple statements in a recurrence, choose a single partitioning reference, say A(f(I)) • The processor responsible for performing the computation for iteration I is the owner of A(f(I)), i.e. p = FLOOR((f(I) - 1) / B) • The set of indices executed on processor p is { I : pB + 1 <= f(I) <= (p+1)B }
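The owner-computes partitioning can be sketched as follows (a hedged Python illustration; `f` stands for the subscript function of the partitioning reference, e.g. f(I) = I + 1 for A(I+1)):

```python
def iterations_on(pid, n_iters, b, f):
    """Iterations I in 1..n_iters executed on processor pid under the
    owner-computes rule: those whose partitioning reference A(f(I))
    falls in pid's block, i.e. pid*b + 1 <= f(I) <= (pid+1)*b."""
    return [I for I in range(1, n_iters + 1)
            if pid * b + 1 <= f(I) <= (pid + 1) * b]

# For A(I+1) = B(I) + C with block size 100:
# processor 0 executes I = 1..99, processor 1 executes I = 100..199, ...
```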
Iteration Partitioning
• Have to map the global loop index to a local loop index
• The smallest value in each processor's index set maps to local index 1

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO
Iteration Partitioning

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO

• Map the global iteration space I to the local iteration space i as follows (block size 100): i = I - 100*PID + 1
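A small sketch of this global-to-local mapping (a hypothetical helper, assuming block size 100 as in the example):

```python
def global_to_local(I, pid, b=100):
    """Map global iteration I, executed on processor pid, to the local
    iteration number i for the loop A(I+1) = B(I) + C: i = I - pid*b + 1,
    so the smallest iteration owned by an interior processor maps to 1."""
    return I - pid * b + 1

# Processor 1 executes global iterations 100..199 as local iterations 1..100;
# processor 0 starts at global iteration 1, which maps to local i = 2.
```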
Iteration Partitioning • Adjust array subscripts for local iterations: the reference A(I+1) becomes A(i), and B(I) becomes B(i-1)
Iteration Partitioning
• For interior processors the code becomes:

DO i = 1, 100
  A(i) = B(i-1) + C
ENDDO

• Adjust the lower bound for the first processor and the upper bound of the last processor to take care of boundary conditions:

lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100) - 1) hi = MOD(N,100) + 1
DO i = lo, hi
  A(i) = B(i-1) + C
ENDDO
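The lo/hi boundary adjustment can be modeled like this (a sketch under the example's assumptions: block size 100, N small enough that A(N+1) stays in bounds):

```python
import math

def local_bounds(pid, n, b=100):
    """Local loop bounds for DO I = 1, N with A(I+1) = B(I) + C:
    processor 0 starts at i = 2 (no iteration writes A(1)), and the
    last participating processor stops at MOD(N, b) + 1."""
    last_p = math.ceil((n + 1) / b) - 1
    lo = 2 if pid == 0 else 1
    hi = (n % b) + 1 if pid == last_p else b
    return lo, hi

# With N = 950: processor 0 runs i = 2..100, processor 9 runs i = 1..51.
```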
Communication Generation

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
...
DO I = 1, N
  A(I+1) = B(I) + C
ENDDO

• No communication is required for iterations in [100p+1 : 100p+99]
• Iterations which require receiving data are [100p : 100p] (B(100p) must arrive from processor p-1)
• Iterations which require sending data are [100p+100 : 100p+100] (B(100p+100) is needed by processor p+1)
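Which iterations need communication can be checked directly from ownership (an illustrative sketch, same block-size-100 assumptions as above):

```python
def comm_for_iteration(I, b=100):
    """For A(I+1) = B(I) + C: iteration I runs on the owner of A(I+1);
    if B(I) lives on the previous processor, that owner must receive it
    (and the previous processor must send it)."""
    owner_lhs = I // b          # owner of A(I+1) = floor((I+1-1)/b)
    owner_rhs = (I - 1) // b    # owner of B(I)
    return "recv" if owner_lhs != owner_rhs else "local"

# Iteration 100p needs a receive; iterations 100p+1 .. 100p+99 are local.
```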
Communication Generation

After inserting the receive:

lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100) - 1) hi = MOD(N,100) + 1
DO i = lo, hi
  IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
  A(i) = B(i-1) + C
ENDDO

The send must happen in the 101st iteration:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
DO i = lo, hi+1
  IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
  IF (i <= hi) THEN
    A(i) = B(i-1) + C
  ENDIF
  IF (i == hi+1 && PID /= lastP) SEND(PID+1, B(100), 1)
ENDDO
Communication Generation

Move the SEND outside the loop:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO i = lo, hi
    IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
    A(i) = B(i-1) + C
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF
Communication Generation

Move the receive outside the loop and peel the first iteration:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (lo == 1 && PID /= 0) THEN
    RECV(PID-1, B(0), 1)
    A(1) = B(0) + C
  ENDIF
  ! lo = MAX(lo, 1+1) == 2
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF
Communication Generation

Finally, move the SEND ahead of the receive, so communication can overlap computation:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (PID /= lastP) SEND(PID+1, B(100), 1)
  IF (lo == 1 && PID /= 0) THEN
    RECV(PID-1, B(0), 1)
    A(1) = B(0) + C
  ENDIF
  DO i = 2, hi
    A(i) = B(i-1) + C
  ENDDO
ENDIF
Communication Generation
• When is such a rearrangement legal?
• Model a receive as a copy from a global to a local location, and a send as a copy from a local to a global location:

IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      B(0) = Bg(0) ! RECV
      A(1) = B(0) + C
    ENDIF
    DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100) ! SEND
ENDIF

• The rearrangement is legal because there is no chain of dependences from S1 to S2
Communication Generation

REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
...
DO I = 1, N
  A(I+1) = A(I) + C
ENDDO

would be rewritten as:

IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      A(0) = Ag(0) ! RECV
      A(1) = A(0) + C
    ENDIF
    DO i = 2, hi
      A(i) = A(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Ag(100) = A(100) ! SEND
ENDIF

• Here the rearrangement would NOT be correct: the recurrence on A creates a chain of dependences from S1 to S2, so the send cannot be moved ahead of the receive
Overview • Motivation for HPF • Overview of compiling HPF programs • Basic Loop Compilation for HPF • Optimizations for compiling HPF • Results and Summary
Communication Vectorization

REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = B(I,J) + C
  ENDDO
ENDDO

Using basic loop compilation gives:

DO J = 1, M
  lo = 1
  IF (PID == 0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID == lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
    IF (lo == 1) THEN
      RECV(PID-1, B(0,J), 1)
      A(1,J) = B(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDIF
ENDDO
Communication Vectorization

Distribute the J loop:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
  ENDDO
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, B(0,J), 1)
      A(1,J) = B(0,J) + C
    ENDIF
  ENDDO
  DO J = 1, M
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDDO
ENDIF
Communication Vectorization

Vectorize the messages: one M-element message replaces M one-element messages:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  IF (lo == 1) THEN
    RECV(PID-1, B(0,1:M), M)
    DO J = 1, M
      A(1,J) = B(0,J) + C
    ENDDO
  ENDIF
  DO J = 1, M
    DO i = 2, hi
      A(i,J) = B(i-1,J) + C
    ENDDO
  ENDDO
  IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
ENDIF
Communication Vectorization

DO J = 1, M
  lo = 1
  IF (PID == 0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID == lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
S1:   IF (PID /= lastP) Bg(100,J) = B(100,J)
      IF (lo == 1) THEN
S2:     B(0,J) = Bg(0,J)
S3:     A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
S4:     A(i,J) = B(i-1,J) + C
      ENDDO
  ENDIF
ENDDO

• Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if the communication statements are not involved in a recurrence carried by the outer loop
Communication Vectorization

REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = A(I,J) + B(I,J)
  ENDDO
ENDDO

• Can the sends be done before the receives? Can the communication be vectorized?

REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J+1) = A(I,J) + C
  ENDDO
ENDDO

• Can the sends be done before the receives? Can the communication be fully vectorized?
Overlapping Communication and Computation

Before:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
S1: IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
ENDIF

After moving the independent loop L1 between the send and the receive:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
S1: IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
ENDIF
Pipelining

REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
  DO I = 1, N
    A(I+1,J) = A(I,J) + C
  ENDDO
ENDDO

Initial code generation for the I loop gives:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J), 1)
      A(1,J) = A(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = A(i-1,J) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
  ENDDO
ENDIF

• The communication could be vectorized, but that gives up the pipelined parallelism
Pipelining
• Pipelined parallelism with communication [timeline figure omitted]
• Pipelined parallelism with communication overhead [timeline figure omitted]
Pipelining: Blocking

Per-column communication (as above):

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
  DO J = 1, M
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J), 1)
      A(1,J) = A(0,J) + C
    ENDIF
    DO i = 2, hi
      A(i,J) = A(i-1,J) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
  ENDDO
ENDIF

Blocked by a factor of K:

...
IF (PID <= lastP) THEN
  DO J = 1, M, K
    IF (lo == 1) THEN
      RECV(PID-1, A(0,J:J+K-1), K)
      DO j = J, J+K-1
        A(1,j) = A(0,j) + C
      ENDDO
    ENDIF
    DO j = J, J+K-1
      DO i = 2, hi
        A(i,j) = A(i-1,j) + C
      ENDDO
    ENDDO
    IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
  ENDDO
ENDIF
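The choice of blocking factor K trades pipeline startup against message overhead. A rough cost model (my own hedged sketch, not a formula from the chapter): with M columns, P processors, per-column compute cost c and per-message startup cost alpha, each processor works in M/K strips plus P-1 pipeline-fill stages, each strip costing K*c + alpha:

```python
def pipeline_time(K, M, P, c, alpha):
    """Rough model of pipelined execution time with blocking factor K:
    (M/K + P - 1) stages, each costing K*c compute plus alpha startup."""
    return (M / K + P - 1) * (K * c + alpha)

# K = 1 maximizes overlap but pays alpha on every column; when message
# startup dominates, a larger K wins.
```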
Other Optimizations • Alignment and replication • Identification of common recurrences • Storage management • Minimize temporary storage used for communication • Space taken for temporary storage should be at most equal to the space taken by the arrays themselves • Interprocedural optimizations
Summary • HPF is easy to code, but hard to compile • Steps required to compile HPF programs • Basic loop compilation • Communication generation • Optimizations • Communication vectorization • Overlapping communication with computation • Pipelining