Loop Tiling for Iterative Stencil Computations

Marta Jiménez

What is an Iterative Stencil Computation?
    DO K = 1, NITER            /* time-step loop */
      do J = ...
        do I = ...
          {A(I,J), A(I+1,J),…}
        enddo
      enddo
      {wrapped-around computations}
    ENDDO
[Figure: matrix A]
• Iterative stencil computations (ISC) are often performed in PDE solvers, geometric modeling (GM) and image processing (IP)
• Examples: swim, tomcatv and mgrid (from the SPEC95 benchmarks); Jacobi
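As a concrete instance of the loop structure above, here is a minimal Python sketch of a Jacobi iteration. The function names `jacobi_step` and `jacobi` and the 4-point averaging stencil are illustrative choices, not code from the benchmarks named on the slide, and the sketch has no wrapped-around computations.

```python
def jacobi_step(a):
    """One sweep: every interior point becomes the mean of its 4 neighbours."""
    n, m = len(a), len(a[0])
    new = [row[:] for row in a]              # boundary values are copied unchanged
    for j in range(1, n - 1):
        for i in range(1, m - 1):
            new[j][i] = 0.25 * (a[j - 1][i] + a[j + 1][i]
                                + a[j][i - 1] + a[j][i + 1])
    return new


def jacobi(a, niter):
    """Time-step loop: apply the stencil sweep niter times."""
    for _ in range(niter):
        a = jacobi_step(a)
    return a


grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 4.0                             # single hot spot in the centre
result = jacobi(grid, 1)
```

Each time-step reads the whole array again, which is why the time-step loop is the one worth tiling for reuse.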

Loop Tiling
• Loop tiling divides the iteration space (IS) into regular tiles so that the working set fits in the memory level being exploited
• It can be applied hierarchically (multilevel tiling)
• Current loop tiling algorithms are limited to loops that:
    • are “perfectly” nested
    • are fully permutable
    • define a rectangular IS
• However, in iterative stencil computations the loops are:
    • NOT perfectly nested
    • NOT fully permutable
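For the simple case that existing algorithms do handle (a perfectly nested, fully permutable, rectangular loop nest), tiling can be sketched as below. The example is a matrix transpose chosen only for illustration; `transpose_tiled` and the `tile` parameter are hypothetical names.

```python
def transpose_tiled(a, tile):
    """Transpose a (n x m) by visiting the iteration space tile by tile."""
    n, m = len(a), len(a[0])
    out = [[None] * n for _ in range(m)]
    for jj in range(0, n, tile):                     # loops over tiles
        for ii in range(0, m, tile):
            for j in range(jj, min(jj + tile, n)):   # loops inside one tile
                for i in range(ii, min(ii + tile, m)):
                    out[i][j] = a[j][i]
    return out


matrix = [[1, 2, 3],
          [4, 5, 6]]
transposed = transpose_tiled(matrix, 2)
```

Every (j, i) iteration is still executed exactly once; only the order changes, so that the data touched inside one tile can stay in the targeted memory level.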

Today’s talk
• Show how loop tiling can be applied to iterative stencil computations
    • based on Song & Li’s paper [PLDI99]
    • define a program model
• One level of 1D-Tiling (cache)
    • program example: SWIM
• Two levels of tiling
    • 2D-Tiling at the cache level
    • 1D-Tiling at the register level (based on Jiménez et al. [ICS98][HPCA98])
• Performance results
    • loop tiling on EV5 & EV6

Steps
1- Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li
2- Perform 2D-Tiling for the cache level
3- Perform 1D-Tiling for the register level

1st Step: achieve desired program model
• Program model:
    DO K = 1, NITER            /* time-step loop */
      do J1 = LJ1, UJ1
        do I1 = LI1, UI1
          {A(I,J), A(I+1,J),…}
        enddo
      enddo
      . . .
      do Jm = LJm, UJm
        do Im = LIm, UIm
          {A(I,J), A(I+1,J),…}
        enddo
      enddo
    ENDDO
• Usually, programs are NOT written directly in this form
• We must apply a set of transformations to reach this program model

SWIM original code
• Main loop:
    ... initializations
    90 NCYCLE = NCYCLE + 1
       CALL CALC1
       CALL CALC2
       IF (NCYCLE >= ITMAX) STOP
       IF (NCYCLE <= 1) THEN
         CALL CALC3Z
       ELSE
         CALL CALC3
       ENDIF
       GO TO 90
• Subroutine:
    SUBROUTINE CALCX
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
    c wrapped-around computations
      do J = 1, N
        ...
      enddo
      do I = 1, M
        ...
      enddo
• Transformations
    • Inline subroutines
    • Convert the GO TO into a DO-loop
    • Peel iterations of the time-step loop to eliminate the IF-statements guarded by NCYCLE
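The effect of converting the GO TO into a DO-loop and peeling the first and last iterations can be sketched by comparing call traces. This is a hedged model, not the SWIM source: the subroutine calls are replaced by strings, the function names are hypothetical, and ITMAX >= 2 is assumed.

```python
def original_order(itmax):
    """Call sequence of the GO TO version of the main loop."""
    trace = []
    ncycle = 0
    while True:                                  # stands in for "GO TO 90"
        ncycle += 1
        trace += ["CALC1", "CALC2"]
        if ncycle >= itmax:                      # IF (NCYCLE >= ITMAX) STOP
            break
        trace.append("CALC3Z" if ncycle <= 1 else "CALC3")
    return trace


def peeled_order(itmax):
    """Same sequence with the first and last iterations peeled (itmax >= 2)."""
    trace = ["CALC1", "CALC2", "CALC3Z"]         # peeled iteration NCYCLE = 1
    for k in range(2, itmax):                    # DO K = 2, ITMAX-1: no IFs left
        trace += ["CALC1", "CALC2", "CALC3"]
    trace += ["CALC1", "CALC2"]                  # peeled iteration NCYCLE = ITMAX
    return trace
```

After peeling, the remaining loop runs from K = 2 to ITMAX-1 and contains no conditionals on NCYCLE, which matches the loop bounds used on the following slides.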

Wrapped-around Computations
    DO K = 2, ITMAX-1
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
      wrapped-around comp
      do J = 1, N
        ...
      enddo
      do I = 1, M
        ...
      enddo
      ...
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
      ...
      ...
    ENDDO
[Figure: J-I iteration spaces of CALC1, CALC2 and CALC3]

Wrapped-around Computations
• Projection along direction I
    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
      wrapped-around comp
      do J = 1, N
        ...
      enddo
      do J = 1,N
        ...
      enddo
      wrapped-around comp
      do J = 1, N
        ...
      enddo
      ...
    ENDDO
[Figure: iteration space projected onto the J axis]
• Another way of dealing with the wrapped-around computations is to perform code sinking

1st Step: achieved program model
• Flow dependencies & iteration space for SWIM (projection along direction I)
    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
    ENDDO
[Figure: iteration space over J (1..N) and the K-loop (time), showing CALC1, CALC2 and CALC3 for K=2 and K=3]


1D-Tiling
[Figure: tiles over the J axis (1..N) for time-steps K=2, K=3 and K=4, annotated with OFFSET-i and SLOPE]
• Dependencies are violated
• Tiling parameters: SLOPE, OFFSET-i
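The role of SLOPE and the per-loop offsets can be illustrated on a toy 1D stencil with two J loops per time-step. This is a sketch under stated assumptions, not Song & Li's algorithm and not SWIM: the stencil, the function names, and the values SLOPE = 2 and OFFSETS = (0, 1) were derived by hand for this particular example.

```python
SLOPE = 2          # tiles shift left by SLOPE points per time-step (toy value)
OFFSETS = (0, 1)   # extra left shift of loop nest 1 and loop nest 2 (toy values)


def sweep_reference(a, steps):
    """Untiled version: two J loops inside a time-step loop."""
    a = list(a)
    b = list(a)
    n = len(a)
    for _ in range(steps):
        for j in range(1, n - 1):                # nest 1: read a, write b
            b[j] = (a[j - 1] + a[j] + a[j + 1]) / 3.0
        for j in range(1, n - 1):                # nest 2: copy b back into a
            a[j] = b[j]
    return a


def sweep_tiled(a, steps, tile):
    """Skewed 1D tiling: the time-step loop runs inside each J tile."""
    a = list(a)
    b = list(a)
    n = len(a)
    lo, hi = 1, n - 2
    # Tiles must extend far enough right to cover J = hi at the last time-step.
    last_start = hi + SLOPE * (steps - 1) + max(OFFSETS)
    for jj in range(lo, last_start + 1, tile):   # loop over tiles
        for k in range(steps):                   # time-step loop
            for nest, off in enumerate(OFFSETS):
                first = max(lo, jj - SLOPE * k - off)
                last = min(hi, jj + tile - 1 - SLOPE * k - off)
                for j in range(first, last + 1):
                    if nest == 0:
                        b[j] = (a[j - 1] + a[j] + a[j + 1]) / 3.0
                    else:
                        a[j] = b[j]
    return a
```

With rectangular tiles (SLOPE = 0 and equal offsets) a tile would read neighbours that a previous tile has already advanced to a later time-step. Shifting the tile boundaries left as K increases keeps every read on the value the untiled program would see.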

2D-Tiling
[Figure: tiles over the J axis (1..N) and the I axis (1..M) along the K (time-step) loop]
• Tiling parameters: SLOPE and OFFSET-i for each tiled dimension (J and I)
• Computed using the JI-loop distance subgraph

JI-loop Distance Subgraph
[Figure: distance subgraph with three nodes (JI1-loop, JI2-loop, JI3-loop); the edges are flow dependencies, anti-dependencies and output dependencies, labeled with distance vectors such as [1,-1,-1], [1,-1,0], [1,0,-1], [1,0,0] and [0,0,0]]
• Each node represents a JI-loop nest
• Each edge represents a dependence (distance vector)

Wrapped-around Computations
• SWIM: projection along direction I
    DO K = 2, ITMAX-1
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
      do J = 1,N
        ...
      enddo
      wrapped-around
    ENDDO
[Figure: iteration space over J (1..N) and the K-loop (time) for K=2 and K=3]
• Backward dependencies with large distances make tiling unprofitable
    • apply Circular Loop Skewing to shorten the backward dependencies

Circular Loop Skewing
• Shortens backward dependencies by changing the iteration order
[Figure: J axis (1, 2, ..., N) before and after skewing for K=2 and K=3, annotated with BETA-i and DELTA]
• CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)

Circular Loop Skewing
    DO K = 2, ITMAX-1
      do JX = 1+BETA1+DELTA(K-2), N+BETA1+DELTA(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
      do JX = 1+BETA2+DELTA(K-2), N+BETA2+DELTA(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
      do JX = 1+BETA3+DELTA(K-2), N+BETA3+DELTA(K-2)
        J = MOD(JX-1, N) + 1
        ...
      enddo
      wrapped-around
    ENDDO
[Figure: J axis (1..N) for K=2 and K=3, annotated with BETA-i and DELTA]
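The index mapping used by the skewed loops can be checked in isolation. The sketch below reproduces only the JX range and the J = MOD(JX-1, N) + 1 mapping from the code above; the function name `cls_order` and the sample values are illustrative.

```python
def cls_order(n, beta, delta, k):
    """J values visited, in order, by one circularly skewed J loop at time-step k."""
    start = 1 + beta + delta * (k - 2)          # lower bound of JX
    # JX runs over n consecutive values; MOD wraps them back into 1..n.
    return [(jx - 1) % n + 1 for jx in range(start, start + n)]


order_k2 = cls_order(6, 0, 2, 2)    # no skew yet at the first time-step
order_k3 = cls_order(6, 0, 2, 3)    # start rotated by DELTA = 2
```

Each loop still visits every J from 1 to N exactly once; only the starting point rotates with BETA-i and with DELTA per time-step, which is what changes the iteration order.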

2nd Step: 2D-Tiling for cache level
• SWIM: projection along direction I
• CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
• Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0
    DO JJ = ...
    DO II = ...
    DO K = ...
      if (first tile) then
        do JX = ...
          offsets iter.
        enddo
      endif
      do JX = ...
        Iter. inside tile
      enddo
      do JX = ...
        Iter. inside tile
      enddo
      do JX = ...
        Iter. inside tile
      enddo
    ENDDO
[Figure: tiled and skewed J axis (1..N) for K=2, K=3 and K=4]


3rd Step: 1D-Tiling for register level
    DO JJ = ...
    DO II = ...
    DO K = ...
      ...
      do JX = LJ, UJ
        J = MOD (JX-1, N)+1
        do IX = LI, UI
          I = MOD (IX-1, M)+1
          [loop body: {I,J}]
        enddo
      enddo
      ...
    ENDDO
[Figure: unrolled iterations wrapping around on the J axis (N-2, N-1, N, 1, 2) and the I axis (M-2, M-1, M, 1, 2)]
• The MOD operation introduced by CLS prevents us from fully unrolling the loop
• First apply Index Set Splitting to loop J

Index Set Splitting
• ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space
    DO JJ = ...
    DO II = ...
    DO K = ...
      ...
      do JX = LJ, min(N,UJ)
        J = JX
        do IX = ...
        enddo
      enddo
      do JX = max(N+1,LJ), UJ
        J = JX-N
        do IX = ...
        enddo
      enddo
      ...
    ENDDO
[Figure: ISS splits the J axis (N-2, N-1, N, 1, 2) at the wrap-around point; I axis (M-2, M-1, M, 1, 2)]
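A small sketch can confirm that the two split loops visit the same J values as the MOD version. It assumes 1 <= LJ and UJ <= 2N, so that the range wraps around at most once; that bound and the function names are assumptions of this example.

```python
def j_with_mod(lj, uj, n):
    """J values of the original loop: J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def j_with_iss(lj, uj, n):
    """J values after index set splitting; assumes 1 <= lj and uj <= 2*n."""
    visited = []
    for jx in range(lj, min(n, uj) + 1):         # first loop: J = JX
        visited.append(jx)
    for jx in range(max(n + 1, lj), uj + 1):     # second loop: J = JX - N
        visited.append(jx - n)
    return visited
```

For example, with N = 6, LJ = 5 and UJ = 8 both versions visit J = 5, 6, 1, 2, but the split version has no MOD inside either loop, so each loop can be unrolled with consecutive J values.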

3rd Step: 1D-Tiling for register level
    DO JJ = ...
    DO II = ...
    DO K = ...
      ...
      do JX = LJ, min(N,UJ)-3+1, 3
        J = JX
        do IX = ...
          [loop body: {J}]
          [loop body: {J+1}]
          [loop body: {J+2}]
        enddo
      enddo
      do JX = JX, min(N,UJ)
        J = JX
        do IX = ...
          [loop body: {J}]
        enddo
      enddo
      ...
    ENDDO
[Figure: after ISS, iterations on the J axis (N-2, N-1, N, 1, 2) and the I axis (M-2, M-1, M, 1, 2)]
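The unrolled loop and its cleanup loop can be sketched as follows. The sketch records which (J, I) iterations are executed rather than running a real loop body; the function names and the `ui`/`li` bounds are illustrative, and the reordering is only legal when the dependencies of the real loop body allow it.

```python
def visit_original(lj, uj, li, ui):
    """(J, I) iterations of the plain loop nest, in execution order."""
    return [(j, i) for j in range(lj, uj + 1) for i in range(li, ui + 1)]


def visit_unrolled(lj, uj, li, ui):
    """Loop J unrolled by 3 with the copies fused inside loop I."""
    visited = []
    jx = lj
    while jx <= uj - 3 + 1:                  # do JX = LJ, UJ-3+1, 3
        for ix in range(li, ui + 1):
            visited.append((jx, ix))         # loop body: {J}
            visited.append((jx + 1, ix))     # loop body: {J+1}
            visited.append((jx + 2, ix))     # loop body: {J+2}
        jx += 3
    while jx <= uj:                          # cleanup loop: do JX = JX, UJ
        for ix in range(li, ui + 1):
            visited.append((jx, ix))
        jx += 1
    return visited
```

The three copies of the body work on adjacent J values for the same I, which is what lets values shared between them stay in registers.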

Code Transformations Summary
1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
    • Inline subroutines
    • Convert the GO TO into a DO-loop
    • Peel iterations of the time-step loop to eliminate IF-statements
2- Perform 2D-Tiling for the cache level
    • Construct the JI-loop distance subgraph
    • Compute DELTA and the BETAs and apply CLS to shorten backward dependencies
    • Update the JI-loop distance subgraph
    • Compute the OFFSETs and SLOPE and tile the IS
3- Perform 1D-Tiling for the register level
    • Index Set Splitting
    • Tiling in a straightforward manner

Performance Results (SWIM)
• Architecture: EV56 (500 MHz, L1: 8KB, L2: 96KB), EV6 (500 MHz, L1: 64KB, L2: 4MB)
• Compiler invocation:
    • f77 -O5 -arch ev56 (EV5)
    • kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)
• Programs:
    • 1D-Tiling for the cache level: loop J, TS=4 (EV5), TS=8 (EV6)
    • 2D-Tiling for the cache level: TSIxJ=32x16 (EV5), TSIxJ=40x12 (EV6)
    • 1D-Tiling for the register level: loop J, TS=4 (EV5 & EV6)
• Execution time:
           ORI      ORI + RT   1D       1D + RT   2D       2D + RT
    EV5    1519s    1533s      1023s    999s      1009s    677s
    EV6    439s     658s       294s     371s      578s     296s
[Figure: speedup for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]
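The speedups plotted on this slide can be recomputed from the execution times as the time of ORI divided by the time of each version. The sketch assumes the six times of each machine are listed in the same order as the six labels (ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT); that pairing is an interpretation of the slide layout.

```python
LABELS = ["ORI", "ORI + RT", "1D", "1D + RT", "2D", "2D + RT"]

# Execution times in seconds, in the order of LABELS (assumed pairing).
TIMES = {
    "EV5": [1519, 1533, 1023, 999, 1009, 677],
    "EV6": [439, 658, 294, 371, 578, 296],
}


def speedups(machine):
    """Speedup of each version over ORI on the given machine."""
    base = TIMES[machine][0]
    return {label: base / t for label, t in zip(LABELS, TIMES[machine])}


ev5 = speedups("EV5")
ev6 = speedups("EV6")
```

Under that pairing, the best EV5 version is 2D + RT at roughly 2.24x over ORI, and the best EV6 version is 1D at roughly 1.49x.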

Performance Results EV5 (SWIM)
• Architecture: EV56 (500 MHz, L1: 8KB, L2: 96KB)
• Compiler invocations:
    • base: kf77 -O5 -arch ev56
    • no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch …..
[Figure: speedup over ORI (base) for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

Performance Results EV6 (SWIM)
• Architecture: EV6 (500 MHz, L1: 64KB, L2: 4MB)
• Compiler invocations:
    • base: f77 -O5 -arch ev6
    • no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch …..
[Figure: speedup over ORI (base) for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

Code for Result Verification
    DO K = 2, ITMAX-1
      ...
      do J = 1,N
        ...
      enddo
      result verification
      IF (MOD(K,MPRINT).eq.0) THEN
        do I =
          do J =
            UCHECK = UCHECK + {UNEW(I,J)}
          enddo
          UNEW (I,I) = . . .
        enddo
        PRINTS
      ENDIF
      do J = 1,N
        ...
      enddo
    ENDDO
• NEW in SPEC2000!!
• Apply strip-mining to loop K (only useful if MPRINT is large)
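Strip-mining loop K around the verification code can be sketched as below. This is a simplified model: the verification is placed at the end of a time-step rather than between two J loops as in the code above, the work is replaced by trace entries, and the function names are hypothetical.

```python
def original_trace(k_lo, k_hi, mprint):
    """Time-step loop with the verification IF inside it."""
    trace = []
    for k in range(k_lo, k_hi + 1):
        trace.append(("step", k))
        if k % mprint == 0:                      # IF (MOD(K,MPRINT).eq.0)
            trace.append(("verify", k))
    return trace


def strip_mined_trace(k_lo, k_hi, mprint):
    """Loop K split into strips that end at multiples of mprint."""
    trace = []
    kk = k_lo
    while kk <= k_hi:
        next_multiple = ((kk + mprint - 1) // mprint) * mprint
        strip_end = min(k_hi, next_multiple)
        for k in range(kk, strip_end + 1):       # inner K loop: no IF, can be tiled
            trace.append(("step", k))
        if strip_end % mprint == 0:              # verification between strips
            trace.append(("verify", strip_end))
        kk = strip_end + 1
    return trace
```

Each inner strip contains at most MPRINT time-steps, so time-step tiling only has room to pay off inside a strip when MPRINT is large.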
