Processor Architectures and Program Mapping

Processor Architectures and Program Mapping Data Memory Management Part b: Loop transformations & Data Reuse 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman

Thanks to the IMEC DTSE experts: Erik Brockmeyer IMEC, Leuven, Belgium and also Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda, Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, e.a.

DM methodology C-in Analysis/Preprocessing Dataflow Transformations Loop/control-flow transformations Data Reuse Storage Cycle Budget Distribution Memory Allocation and Assignment Memory Layout organisation Address optimization C-out H.C. TD5102

Location Production Consumption Time Location Production Consumption Time Locality of Reference for (i=0; i < 8; i++) A[i] = …; for (i=0; i < 8; i++) B[7-i] = f(A[i]); for (i=0; i < 8; i++) A[i] = …; B[7-i] = f(A[i]); H.C. TD5102

Location Production Consumption Time Location Time Regularity for (i=0; i < 8; i++) A[i] = …; for (i=0; i < 8; i++) B[i] = f(A[7-i]); for (i=0; i < 8; i++) A[i] = …; for (i=0; i < 8; i++) B[7-i] = f(A[i]); H.C. TD5102

Location Consumption Consumption Time Location Consumption Consumption Time Enabling Reuse for (i=0; i < 8; i++) B[i] = f1(A[i]); for (i=0; i < 8; i++) C[i] = f2(A[i]); for (i=0; i < 8; i++) B[i] = f1(A[i]); C[i] = f2(A[i]); H.C. TD5102

How to do these loop transformations automatically? • Requires cost function • Requires technique Let's introduce some terminology • iteration spaces • polytopes • ordering vector / execution order H.C. TD5102

// assume A[][] exists for (i=1; i<6; i++) { for (j=2; j<6; j++) { B[i][j]= g( A[i-1][j-2]); } } Iteration space and polytopes i 5 4 3 2 1 --- iteration space --- consumption space --- production space --- dependency vector 0 j 0 1 2 3 4 5 H.C. TD5102

C B A Example with 3 polytopes Algorithm having 3 loops: A: for (i=1; i<=N; ++i) for (j=1; j<=N-i+1; ++j) a[i][j] = in[i][j] + a[i-1][j]; B: for (p=1; p<=N; ++p) b[p][1] = f( a[N-p+1][p], a[N-p][p] ); C: for (k=1; k<=N; ++k) for (l=1; l<=k; ++k) b[k][l+1] = g (b[k][l]); l k p i j H.C. TD5102

Common iteration space for (i=1; i<=(2*N+1); ++i) for (j=1; j<=2*N; ++j) if (i>=1 && i<=N && j>=1 && j<=N-i+1) a[i][j] = in[i][j] + a[i-1][j]; if (i==N+1 && j>=1 && j<=N) b[j][1] = f( a[N-j+1][j], a[N-j][j] ); if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=N+k) b[i-N-1][j-N+1] = g (b[i-N-1][j-N]); • Initial solution having a common iteration space: • Bad locality • Bad regularity • Requires 2N memory locations • Many dummy iterations 2*N+1 i 1 1 2*N j Ordering vector H.C. TD5102

Cost function needed for automation • Regularity • Equal direction for dependency vectors • Avoid that dependency vectors cross each other • Good for storage size • Temporal locality • Equal length of all dependency vectors • Good for storage size • Good for data reuse H.C. TD5102

Regularity Regular Irregular H.C. TD5102

Bad regularity limits the ordering freedom 2*N+1 i 1 1 2*N j Ordering freedom = 90 degrees H.C. TD5102

C C P C P C C C C C C Locality estimates Sum{di} Max {di} Spanning tree C di P C C C P = production C = consumption Dependency vector length is measure for locality Q: Which length is the best estimate? H.C. TD5102

Three step approach for loop transformation tool • Affine loop transformations • Only geometric information is available during placement • Rotation, skewing, interchange, reverse • Polytope placement • Only geometric information is available during placement • Translation • Choose ordering vector Combined transformation: H.C. TD5102

j i p k l Three step approach for loop transformation tool • Affine loop transformations • Polytope placement • Choose ordering vector A: (i: 1..N):: (j: 1 .. N-i+1):: a[i][j] = in[i][j] + a[i-1][j]; B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] ); C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] ); H.C. TD5102

Three step approach for loop transformation tool • Affine loop transformations • Polytope placement • Choose ordering vector H.C. TD5102

Three step approach for loop transformation tool • Affine loop transformations • Polytope placement = merging loops • Choose ordering vector H.C. TD5102

Choose optimal ordering vector Ordering Vector 1 Ordering Vector 2 H.C. TD5102

j i l From the Polyhedral model back to C • Affine loop transformations • Polytope placement • Choose ordering vector for (j=1; j<=N; ++j) { for (i=1; i<=N-j+1; ++i) a[i][j] = in[i][j] + a[i-1][j]; b[j][1] = f( a[N-j+1][j], a[N-j][j] ); for (l=1; l<=j; ++l) b[j][l+1] = g( b[j][l] ); } • Optimized solution having a common iteration space: • Optimal locality • Optimal regularity • Requires 2 memory locations H.C. TD5102

Loop trafo - cavity detection N x M N x M N x M Scanner Gauss Blur x Gauss Blur y X X-Y Loop Interchange Y From N x M toN x (2GB+1) buffer size H.C. TD5102

1 Transform: interchange Translate: merge 2 Order 3 Loop trafo-cavity (1) H.C. TD5102

1 Transform: interchange Translate: merge 2 Order 3 Loop trafo-cavity (2) x-blur filter: H.C. TD5102

Loop trafo - cavity detection N x M N x M N x M Scanner Gauss Blur x Gauss Blur y X X-Y Loop Interchange Y From N x M toN x (2GB+1) buffer size H.C. TD5102

2 2 Translate 1: Translate 2: 3 Loop trafo-cavity (3) Comparing different translations H.C. TD5102

Order 3 3 Loop trafo-cavity (4) Combining (merging) multiple polytopes + = H.C. TD5102

Result on gauss filter for (y=0; y<M+GB; ++y) { for (x=0; x<N+GB; ++x) { if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) { gauss_x_compute = 0; for (k=-GB; k<=GB; ++k) gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)]; gauss_x_image[x][y] = gauss_x_compute/tot; } else if (x<N && y<M) gauss_x_image[x][y] = 0; if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) { gauss_xy_compute = 0; for (k=-GB; k<=GB; ++k) gauss_xy_compute += gauss_x_image[x][y-GB+k]* Gauss[abs(k)]; gauss_xy_image[x][y-GB] = gauss_xy_compute/tot; } else if (x<N && (y-GB)>=0 && (y-GB)<M) gauss_xy_image[x][y-GB] = 0; H.C. TD5102

Intermezzo • Before we continue with data reuse, have a look at other loop transformations H.C. TD5102

DM methodology C-in Analysis/Preprocessing Dataflow Transformations Loop/control-flow transformations Data Reuse Storage Cycle Budget Distribution Memory Allocation and Assignment Memory Layout organisation Address optimization C-out H.C. TD5102

Layer 3 Layer 2 Data paths Layer 1 Memory hierarchy and Data reuse • Determines reuse candidates • Combine reuse candidates into reuse chains • If multiple access statements/array combine into reuse trees • Determine number of layers (if architecture is not fixed) • Select candidates and assign to memory layers • Add extra transfers between the different memory layers(for scratchpad RAM; not for caches) H.C. TD5102

TI C55@200MHz example platform L2 Offchip Fixed size RAM partition BW: 50M Word/s single port MAX: 8MBx16 Size 16 MB SRAM/EPROM/ SDRAM/SBSRAM Bandwidth 50M words/s ROM partition L1 ROM (Data/program/DMA) Size 32kB 16Kx16 first 3 cycles, next 2 cycles ROM Bandwidth 100M words/S It seems this can be in parallel with the 256Kb memory BW: 400M Word/s dual port 32x Total 256Kb 4Kx16 4Kx16 4Kx16 Variable size RAM partition sing sing sing 1 elem in 1 cycle Size 320kB Bandwidth 400M words/s 8x Total 64Kb 4Kx16 4Kx16 4Kx16 2 elem in 1 cycle dual dual dual Processor partition L0 Size 2x16 registers Register file + Core TMS320vc5510@200MHz Bandwidth 4.8Gwords/s Vdd= 1.5 V P = unknown H.C. TD5102

#A = 100% M P = 1 P total (before) = 100% Exploiting Memory Hierarchy for reduced Power: principle Processor Data Paths Processor Data Paths Register File Register File A P = 1 H.C. TD5102

100% 100% A’’ A’ A’ M A M A 10% 5% 1% P = 0.01 P = 0.3 P = 0.1 P = 1 P = 1 P = 1 P = 1 Exploiting Memory Hierarchy for reduced Power: principle Processor Data Paths Processor Data Paths Register File Register File P total (before) = 100% P total (after) = 100%x0.01+10%x0.1+1%x1 = 3% H.C. TD5102

customized connections A’’ A’ Data reuse decision and memory hierarchy: principle Processor Data Paths Processor Data Paths Register File Register File M B A Customized connections in the memory subsystem to bypass the memory hierarchy and avoid the overhead. H.C. TD5102

copy2 copy1 copy3 copy4 Time frame 2 Time frame 1 Time frame 4 Time frame 3 Step 1: identify arrays with data reuse potential for (i=0; i<4; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; intra-copy reuse array index inter-copy reuse time H.C. TD5102

Importance of high level cost estimate for (i=0; i<4; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; array index Array copies are stored in-place! Mk 6 time copy2 copy1 copy3 copy4 Time frame 1 Time frame 2 Time frame 3 Time frame 4 H.C. TD5102

j iterator =not present so intra-copy reuse 3 intra-copy reuse factor= 3 copy2 copy1 copy3 copy4 Time frame 1 Time frame 2 Time frame 4 Time frame 3 Step 1: determine gains Intra-copy reuse factor for (i=0; i<4; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; array index Mk 6 time H.C. TD5102

for (i=0; i<4; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; inter-copy reuse factor = 1/(1-1/3)=3/2 i iterator has smaller weight than k range so inter-copy reuse copy2 copy1 copy3 copy4 Time frame 1 Time frame 2 Time frame 4 Time frame 3 Step 1: determine gains Inter-copy reuse factor for (i=0; i<n; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; array index Mk 6 time H.C. TD5102

Mk 15 Mm Mm tf 1.1 tf 2.3 tf 1.3 tf 1.2 tf 2.2 tf 1.4 tf 2.1 tf 1.5 tf 1.6 tf 1 tf 2 tf 4 tf 5 tf 9 tf 7 tf 3 tf 8 tf 6 5 5 time frame 1 time frame 2 Possibility for multi-level hierarchy for (i=0; i<10; i++) for (j=0; j<2; j++) for (k=0; k<3; k++) for (l=0; l<3; l++) for (m=0; m<5; m++) … = A[i*15+k*5+m]; array index time H.C. TD5102

A A A A Many reuse possibilities A’ Prune for promising ones A’ A’ Cost estimate needed A’’ R1(A) R1(A) R1(A) R1(A) Step 2: determine data reuse chains for each memory access H.C. TD5102

100 80 60 #misses 40 estimate size 20 Gk 0 15 0 5 10 15 20 Gm #elements R1(A) 5 A’ A’ Cost function needs both size and number of accesses to intermediate array estimate #misses from different levels for one iteration of i for (i=0; i<10; i++) for (j=0; j<2; j++) for (k=0; k<3; k++) for (l=0; l<3; l++) for (m=0; m<5; m++) … = A[i*15+k*5+m]; 2*3*5 =30 3*5 =15 2*3*3*5 =90 H.C. TD5102

51 38 155 165 135 150 35 170 A 30 15 90 15 A A A 45 22 135 22 150 150 150 150 30 15 15 120 A’ 6 45 105 A’ A’ 5 7 16 15 15 30 120 A’’ accesses size energy 6 x 5 y 90 90 90 90 z R1(A) R1(A) R1(A) R1(A) Very simplistic power and area estimation for different data-reuse versions H.C. TD5102

A A’ for (x=0; x<8; x++) for (y=0; y<5; y++) … = A[i*5+y]; R2(A) Step 3: determine data reuse trees for multiple accesses for (i=0; i<10; i++) for (j=0; j<2; j++) for (k=0; k<3; k++) for (l=0; l<3; l++) for (m=0; m<5; m++) … = A[i*15+k*5+m]; A A’ A’’ R1(A) H.C. TD5102

Reuse tree A A A’ A’ R2(A) A’’ R1(A) Step 3: determine data reuse trees for multiple accesses A A’ A’ R2(A) A’’ R1(A) H.C. TD5102

A A B Layer 1 B A’ B’ A’ Layer 2 B’ A’ R2(A) B’’ A’’ B’’’ Layer 3 A’’ A’ B’’’ R1(A) R1(B) R1(A) R2(A) R1(B) Assign all data reuse trees (multiple arrays) to memory hierarchy H.C. TD5102

Hierarchy layers Layer1 Layer2 Layer3 Foreground mem. Datapath Step 4: Determine number of layers Data reuse trees B Data reuse trees A H.C. TD5102

all 1 3 2 4 5 A A A FG FG Step 5: Select and assign reuse candidates hierarchy assignments Hierarchy layers Data reuse trees H.C. TD5102

Data reuse trees B Step 5: All freedom in array to memory hierarchy Hierarchy layers Data reuse trees A H.C. TD5102

Hierarchy layers Pruned Step 5: Prune reuse graph (platform independent) Hierarchy layers Full freedom Quite some solutions never make sense H.C. TD5102

Hierarchy layers Pruned Final solution 4 layer platform Final solution 4 layer platform A A' B B' FG FG Step 5: Prune reuse graph further (platform dependent) H.C. TD5102

Processor Architectures and Program Mapping