Processor Architectures and Program Mapping

Data Memory Management, Part b: Loop transformations & Data Reuse
5KK70, TU/e
Henk Corporaal, Bart Mesman

Thanks to the IMEC DTSE experts: Erik Brockmeyer (IMEC, Leuven, Belgium), and also Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda, Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, and others.

DM methodology

• C-in
• Analysis/Preprocessing
• Dataflow Transformations
• Loop/control-flow transformations
• Data Reuse
• Storage Cycle Budget Distribution
• Memory Allocation and Assignment
• Memory Layout organisation
• Address optimization
• C-out

@HC 5KK70 Platform-based Design

Locality of Reference

Separate loops:
    for (i=0; i < 8; i++) A[i] = …;
    for (i=0; i < 8; i++) B[7-i] = f(A[i]);

Merged loop:
    for (i=0; i < 8; i++) {
        A[i] = …;
        B[7-i] = f(A[i]);
    }

[Figure: location versus time for both versions, marking production and consumption of each A element]

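Regularity

Version 1 (A is consumed in reverse production order):
    for (i=0; i < 8; i++) A[i] = …;
    for (i=0; i < 8; i++) B[i] = f(A[7-i]);

Version 2 (A is consumed in production order):
    for (i=0; i < 8; i++) A[i] = …;
    for (i=0; i < 8; i++) B[7-i] = f(A[i]);

[Figure: location versus time for both versions, marking production and consumption of each A element]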

Enabling Reuse

Separate loops:
    for (i=0; i < 8; i++) B[i] = f1(A[i]);
    for (i=0; i < 8; i++) C[i] = f2(A[i]);

Merged loop:
    for (i=0; i < 8; i++) {
        B[i] = f1(A[i]);
        C[i] = f2(A[i]);
    }

[Figure: location versus time for both versions, marking the two consumptions of each A element]

How to do these loop transformations automatically?

• Requires cost function
• Requires technique

Let's introduce some terminology:
• iteration spaces
• polytopes
• ordering vector / execution order

Iteration space and polytopes

    // assume A[][] exists
    for (i=1; i<6; i++) {
        for (j=2; j<6; j++) {
            B[i][j] = g( A[i-1][j-2] );
        }
    }

[Figure: i (0..5) versus j (0..5) grid showing the iteration space, the consumption space, the production space and the dependency vector]

Example with 3 polytopes

Algorithm having 3 loops:

    A: for (i=1; i<=N; ++i)
         for (j=1; j<=N-i+1; ++j)
           a[i][j] = in[i][j] + a[i-1][j];

    B: for (p=1; p<=N; ++p)
         b[p][1] = f( a[N-p+1][p], a[N-p][p] );

    C: for (k=1; k<=N; ++k)
         for (l=1; l<=k; ++l)
           b[k][l+1] = g( b[k][l] );

[Figure: the three polytopes A (axes i, j), B (axis p) and C (axes k, l)]

Common iteration space

    for (i=1; i<=(2*N+1); ++i)
      for (j=1; j<=2*N; ++j) {
        if (i>=1 && i<=N && j>=1 && j<=N-i+1)
          a[i][j] = in[i][j] + a[i-1][j];
        if (i==N+1 && j>=1 && j<=N)
          b[j][1] = f( a[N-j+1][j], a[N-j][j] );
        if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=i-1)  /* l<=k with k=i-N-1, l=j-N */
          b[i-N-1][j-N+1] = g( b[i-N-1][j-N] );
      }

Initial solution having a common iteration space:
• Bad locality
• Bad regularity
• Requires 2N memory locations
• Many dummy iterations

[Figure: the common iteration space, i = 1..2*N+1 versus j = 1..2*N, with the three polytopes and the ordering vector]

Cost function needed for automation

• Regularity
  • Equal direction for dependency vectors
  • Avoid that dependency vectors cross each other
  • Good for storage size
• Temporal locality
  • Equal length of all dependency vectors
  • Good for storage size
  • Good for data reuse

Regularity

[Figure: two dependency patterns, one regular and one irregular]

Bad regularity limits the ordering freedom

[Figure: the common iteration space, i = 1..2*N+1 versus j = 1..2*N; ordering freedom = 90 degrees]

Locality estimates

• Dependency vector length is a measure for locality
• Candidate estimates over the dependency lengths di between a production (P) and its consumptions (C): Sum{di}, Max{di}, spanning tree
• Q: Which length is the best estimate?

[Figure: productions (P) and consumptions (C) connected by dependency vectors of length di]

Three step approach for loop transformation tool

• Affine loop transformations
  • Only geometric information is available during placement
  • Rotation, skewing, interchange, reversal
• Polytope placement
  • Only geometric information is available during placement
  • Translation
• Choose ordering vector

Combined transformation: [formula not preserved in this transcript]

Three step approach for loop transformation tool

• Affine loop transformations
• Polytope placement
• Choose ordering vector

    A: (i: 1..N):: (j: 1 .. N-i+1)::
         a[i][j] = in[i][j] + a[i-1][j];
    B: (p: 1..N)::
         b[p][1] = f( a[N-p+1][p], a[N-p][p] );
    C: (k: 1..N):: (l: 1..k)::
         b[N-k+1][l+1] = g( b[N-k+1][l] );

[Figure: the three polytopes with axes i, j (A), p (B) and k, l (C)]


Three step approach for loop transformation tool

• Affine loop transformations
• Polytope placement = merging loops
• Choose ordering vector

Choose optimal ordering vector

[Figure: the merged polytopes traversed with Ordering Vector 1 and with Ordering Vector 2]

From the Polyhedral model back to C

• Affine loop transformations
• Polytope placement
• Choose ordering vector

    for (j=1; j<=N; ++j) {
      for (i=1; i<=N-j+1; ++i)
        a[i][j] = in[i][j] + a[i-1][j];
      b[j][1] = f( a[N-j+1][j], a[N-j][j] );
      for (l=1; l<=j; ++l)
        b[j][l+1] = g( b[j][l] );
    }

Optimized solution having a common iteration space:
• Optimal locality
• Optimal regularity
• Requires 2 memory locations

[Figure: the merged polytopes with axes j, i and l]

Loop trafo - cavity detection

• X-Y loop interchange
• Buffer size goes from N x M to N x (2GB+1)

[Figure: Scanner, Gauss Blur x and Gauss Blur y stages, each connected through an N x M buffer; Gauss Blur x runs in X order and Gauss Blur y in Y order]

Loop trafo - cavity (1)

1. Transform: interchange
2. Translate: merge
3. Order

Loop trafo - cavity (2)

1. Transform: interchange
2. Translate: merge
3. Order

[Figure: x-blur filter]


Loop trafo - cavity (3)

Comparing different translations (step 2)

[Figure: two alternative placements, Translate 1 and Translate 2]

Loop trafo - cavity (4)

Combining (merging) multiple polytopes, then ordering (step 3)

[Figure: two polytopes merged into one]

Result on gauss filter

    for (y=0; y<M+GB; ++y) {
      for (x=0; x<N+GB; ++x) {
        /* x blur of row y */
        if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) {
          gauss_x_compute = 0;
          for (k=-GB; k<=GB; ++k)
            gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)];
          gauss_x_image[x][y] = gauss_x_compute/tot;
        } else if (x<N && y<M)
          gauss_x_image[x][y] = 0;

        /* y blur of row y-GB, delayed until its x-blurred inputs exist */
        if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
          gauss_xy_compute = 0;
          for (k=-GB; k<=GB; ++k)
            gauss_xy_compute += gauss_x_image[x][y-GB+k]*Gauss[abs(k)];
          gauss_xy_image[x][y-GB] = gauss_xy_compute/tot;
        } else if (x<N && (y-GB)>=0 && (y-GB)<M)
          gauss_xy_image[x][y-GB] = 0;
      }
    }

Intermezzo

• Before we continue with data reuse, have a look at other loop transformations


Memory hierarchy and Data reuse

• Determine reuse candidates
• Combine reuse candidates into reuse chains
• If an array has multiple access statements, combine the chains into reuse trees
• Determine number of layers (if architecture is not fixed)
• Select candidates and assign to memory layers
• Add extra transfers between the different memory layers (for scratchpad RAM; not for caches)

[Figure: data paths connected to memory layers 1, 2 and 3]

TI C55@200MHz example platform

• L0, processor partition: register file + core (TMS320vc5510@200MHz). Size: 2x16 registers. Bandwidth: 4.8 Gwords/s. Vdd = 1.5 V, P = unknown.
• L1, variable size RAM partition: size 320 kB, bandwidth 400M words/s.
  • 32 single-port blocks of 4Kx16 (total 256 kB): 1 element in 1 cycle
  • 8 dual-port blocks of 4Kx16 (total 64 kB): 2 elements in 1 cycle
• L1, ROM partition (data/program/DMA): size 32 kB (16Kx16), bandwidth 100M words/s; first 3 cycles, next 2 cycles. It seems this can be accessed in parallel with the 256 kB memory.
• L2, off-chip fixed size RAM partition: SRAM/EPROM/SDRAM/SBSRAM, single port, size 16 MB (max. 8Mx16), bandwidth 50M words/s.

Exploiting Memory Hierarchy for reduced Power: principle

[Figure: processor data paths and register file access array A directly in main memory M; #A = 100% of the accesses, at P = 1 per access]

P total (before) = 100%

Exploiting Memory Hierarchy for reduced Power: principle

[Figure: copies A’ and A’’ are placed in smaller memories between main memory M and the register file; each layer is annotated with its share of the accesses and its relative power per access P]

P total (before) = 100%
P total (after) = 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%

Data reuse decision and memory hierarchy: principle

Customized connections in the memory subsystem to bypass the memory hierarchy and avoid the overhead.

[Figure: processor data paths and register file connected to main memory M holding arrays A and B, with copies A’ and A’’ and customized connections that bypass intermediate layers]

Step 1: identify arrays with data reuse potential

    for (i=0; i<4; i++)
      for (j=0; j<3; j++)
        for (k=0; k<6; k++)
          … = A[i*4+k];

[Figure: array index versus time; copy1 to copy4 are accessed in time frames 1 to 4, showing intra-copy reuse within a time frame and inter-copy reuse between consecutive copies]

Importance of high level cost estimate

    for (i=0; i<4; i++)
      for (j=0; j<3; j++)
        for (k=0; k<6; k++)
          … = A[i*4+k];

Array copies are stored in-place!

[Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, copy size Mk = 6]

Step 1: determine gains
Intra-copy reuse factor

    for (i=0; i<4; i++)
      for (j=0; j<3; j++)
        for (k=0; k<6; k++)
          … = A[i*4+k];

The j iterator is not present in the index expression, so there is intra-copy reuse: intra-copy reuse factor = 3.

[Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, copy size Mk = 6]

Step 1: determine gains
Inter-copy reuse factor

    for (i=0; i<n; i++)
      for (j=0; j<3; j++)
        for (k=0; k<6; k++)
          … = A[i*4+k];

The i iterator has a smaller weight (4) than the k range (6), so consecutive copies overlap and there is inter-copy reuse: inter-copy reuse factor = 1/(1-1/3) = 3/2.

[Figure: array index versus time; copy1 to copy4 in time frames 1 to 4 (n = 4), copy size Mk = 6]

Possibility for multi-level hierarchy

    for (i=0; i<10; i++)
      for (j=0; j<2; j++)
        for (k=0; k<3; k++)
          for (l=0; l<3; l++)
            for (m=0; m<5; m++)
              … = A[i*15+k*5+m];

[Figure: array index versus time; time frames 1 and 2 use a copy of size Mk = 15, and their sub-frames (tf 1.1 to tf 1.6, tf 2.1 to tf 2.3, ...) use copies of size Mm = 5]

Step 2: determine data reuse chains for each memory access

• Many reuse possibilities
• Prune for promising ones
• Cost estimate needed

[Figure: four alternative reuse chains from array A to the access R1(A): direct, through A’, through a different A’, and through A’ and A’’]

Cost function needs both size and number of accesses to intermediate array

Estimate #misses from different levels for one iteration of i:

    for (i=0; i<10; i++)
      for (j=0; j<2; j++)
        for (k=0; k<3; k++)
          for (l=0; l<3; l++)
            for (m=0; m<5; m++)
              … = A[i*15+k*5+m];

Annotated counts: 2*3*5 = 30, 3*5 = 15, 2*3*3*5 = 90

[Figure: #misses (0 to 100) versus copy size in #elements (0 to 20), with copy candidates Gm at 5 elements and Gk at 15 elements, for the chain A, A’, R1(A)]

Very simplistic power and area estimation for different data-reuse versions

[Figure: the four reuse chains from A to R1(A), each layer annotated with its number of accesses, its size and its energy]

Step 3: determine data reuse trees for multiple accesses

Access R1(A), reuse chain A, A’, A’’:

    for (i=0; i<10; i++)
      for (j=0; j<2; j++)
        for (k=0; k<3; k++)
          for (l=0; l<3; l++)
            for (m=0; m<5; m++)
              … = A[i*15+k*5+m];

Access R2(A), reuse chain A, A’:

    for (x=0; x<8; x++)
      for (y=0; y<5; y++)
        … = A[i*5+y];

Step 3: determine data reuse trees for multiple accesses

[Figure: the reuse chains of R1(A) (A, A’, A’’) and R2(A) (A, A’) combined into a single reuse tree rooted at A]

Step 4: Determine number of layers

[Figure: data reuse trees of arrays A and B next to the hierarchy layers Layer1, Layer2, Layer3, the foreground memory and the datapath]

Step 5: Select and assign reuse candidates

[Figure: data reuse trees of A ending in foreground memory (FG), numbered candidate hierarchy assignments 1 to 5, and the hierarchy layers]

Step 5: All freedom in array to memory hierarchy

[Figure: data reuse trees of A and B mapped onto the hierarchy layers]

Step 5: Prune reuse graph (platform independent)

Quite some solutions never make sense.

[Figure: hierarchy-layer assignments with full freedom next to the pruned set]

Step 5: Prune reuse graph further (platform dependent)

[Figure: the pruned hierarchy-layer assignments next to the final solution for a 4 layer platform, with A, A', B, B' and foreground memory (FG)]

Assign all data reuse trees (multiple arrays) to memory hierarchy

[Figure: the reuse trees of A (copies A’, A’’; accesses R1(A), R2(A)) and B (copies B’, B’’, B’’’; access R1(B)) assigned to Layer 1, Layer 2 and Layer 3]
