570 likes | 677 Views
Processor Architectures and Program Mapping. Data Memory Management Part b: Loop transformations & Data Reuse. 5KK70 TU/e Henk Corporaal Bart Mesman. Thanks to the IMEC DTSE experts:. Erik Brockmeyer IMEC, Leuven, Belgium and also
E N D
Processor Architectures and Program Mapping Data Memory Management Part b: Loop transformations & Data Reuse 5KK70 TU/e Henk Corporaal Bart Mesman
Thanks to the IMEC DTSE experts: Erik Brockmeyer IMEC, Leuven, Belgium and also Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda, Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, e.a.
DM methodology C-in Analysis/Preprocessing Dataflow Transformations Loop/control-flow transformations Data Reuse Storage Cycle Budget Distribution Memory Allocation and Assignment Memory Layout organisation Address optimization C-out @HC 5KK70 Platform-based Design
Location Production Consumption Time Location Production Consumption Time Locality of Reference for (i=0; i < 8; i++) A[i] = …; for (i=0; i < 8; i++) B[7-i] = f(A[i]); for (i=0; i < 8; i++) A[i] = …; B[7-i] = f(A[i]); @HC 5KK70 Platform-based Design
Location Production Consumption Time Location Time Regularity for (i=0; i < 8; i++) A[i] = …; for (i=0; i < 8; i++) B[i] = f(A[7-i]); for (i=0; i < 8; i++) A[i] = …; for (i=0; i < 8; i++) B[7-i] = f(A[i]); @HC 5KK70 Platform-based Design
Location Consumption Consumption Time Location Consumption Consumption Time Enabling Reuse for (i=0; i < 8; i++) B[i] = f1(A[i]); for (i=0; i < 8; i++) C[i] = f2(A[i]); for (i=0; i < 8; i++) B[i] = f1(A[i]); C[i] = f2(A[i]); @HC 5KK70 Platform-based Design
How to do these loop transformations automatically? • Requires cost function • Requires technique Let's introduce some terminology • iteration spaces • polytopes • ordering vector, which determines the execution order @HC 5KK70 Platform-based Design
// assume A[][] exists for (i=1; i<6; i++) { for (j=2; j<6; j++) { B[i][j]= g( A[i-1][j-2]); } } Iteration space and polytopes i 5 4 3 2 1 --- iteration space --- consumption space --- production space --- dependency vector 0 j 0 1 2 3 4 5 @HC 5KK70 Platform-based Design
C B A Example with 3 polytopes Algorithm having 3 loops: A: for (i=1; i<=N; ++i) for (j=1; j<=N-i+1; ++j) a[i][j] = in[i][j] + a[i-1][j]; B: for (p=1; p<=N; ++p) b[p][1] = f( a[N-p+1][p], a[N-p][p] ); C: for (k=1; k<=N; ++k) for (l=1; l<=k; ++k) b[k][l+1] = g (b[k][l]); l k p i j @HC 5KK70 Platform-based Design
Common iteration space for (i=1; i<=(2*N+1); ++i) for (j=1; j<=2*N; ++j) if (i>=1 && i<=N && j>=1 && j<=N-i+1) a[i][j] = in[i][j] + a[i-1][j]; if (i==N+1 && j>=1 && j<=N) b[j][1] = f( a[N-j+1][j], a[N-j][j] ); if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=N+k) b[i-N-1][j-N+1] = g (b[i-N-1][j-N]); • Initial solution having a common iteration space: • Bad locality • Bad regularity • Requires 2N memory locations • Many dummy iterations 2*N+1 i 1 1 2*N j Ordering vector @HC 5KK70 Platform-based Design
Cost function needed for automation • Regularity • Equal direction for dependency vectors • Avoid that dependency vectors cross each other • Good for storage size • Temporal locality • Equal length of all dependency vectors • Good for storage size • Good for data reuse @HC 5KK70 Platform-based Design
Regularity Regular Irregular @HC 5KK70 Platform-based Design
Bad regularity limits the ordering freedom 2*N+1 i 1 1 2*N j Ordering freedom = 90 degrees @HC 5KK70 Platform-based Design
C C P C P C C C C C C Locality estimates: a few options Sum{di} Max {di} Spanning tree C di P C C C P = production C = consumption Dependency vector length is measure for locality Q: Which length is the best estimate? @HC 5KK70 Platform-based Design
Three step approach for loop transformation tool • Affine loop transformations • Rotation, skewing, interchange, reverse • Only geometric information is needed • Polytope placement • Translation • Only geometric information is needed • Choose ordering vector • Generate the code Combined transformation: @HC 5KK70 Platform-based Design
j i p k l Three step approach for loop transformation tool • Affine loop transformations • Polytope placement • Choose ordering vector A: (i: 1..N):: (j: 1 .. N-i+1):: a[i][j] = in[i][j] + a[i-1][j]; B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] ); C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] ); @HC 5KK70 Platform-based Design
Three step approach for loop transformation tool • Affine loop transformations • Polytope placement • Choose ordering vector @HC 5KK70 Platform-based Design
Three step approach for loop transformation tool • Affine loop transformations • Polytope placement = merging loops • Choose ordering vector @HC 5KK70 Platform-based Design
Choose optimal ordering vector Ordering Vector 1 Ordering Vector 2 @HC 5KK70 Platform-based Design
j i l From the Polyhedral model back to C • Affine loop transformations • Polytope placement • Choose ordering vector for (j=1; j<=N; ++j) { for (i=1; i<=N-j+1; ++i) a[i][j] = in[i][j] + a[i-1][j]; b[j][1] = f( a[N-j+1][j], a[N-j][j] ); for (l=1; l<=j; ++l) b[j][l+1] = g( b[j][l] ); } • Optimized solution having a common iteration space: • Optimal locality • Optimal regularity • Requires 2 memory locations @HC 5KK70 Platform-based Design
Loop trafo - cavity detection N x M N x M N x M Scanner Gauss Blur x Gauss Blur y X X-Y Loop Interchange Y From N x M toN x (2GB+1) buffer size @HC 5KK70 Platform-based Design
1 Transform: interchange Translate: merge 2 Order 3 Loop trafo-cavity (1) @HC 5KK70 Platform-based Design
1 Transform: interchange Translate: merge 2 Choose Order 3 Loop trafo-cavity (2) x-blur filter: @HC 5KK70 Platform-based Design
Loop trafo - cavity detection N x M N x M N x M Scanner Gauss Blur x Gauss Blur y X X-Y Loop Interchange Y From N x M toN x (2GB+1) buffer size @HC 5KK70 Platform-based Design
2 2 Translate 1: Translate 2: 3 Loop trafo-cavity (3) Comparing different translations @HC 5KK70 Platform-based Design
Order 3 3 Loop trafo-cavity (4) Combining (merging) multiple polytopes + = @HC 5KK70 Platform-based Design
Result on gauss filter for (y=0; y<M+GB; ++y) { for (x=0; x<N+GB; ++x) { if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) { gauss_x_compute = 0; for (k=-GB; k<=GB; ++k) gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)]; gauss_x_image[x][y] = gauss_x_compute/tot; } else if (x<N && y<M) gauss_x_image[x][y] = 0; if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) { gauss_xy_compute = 0; for (k=-GB; k<=GB; ++k) gauss_xy_compute += gauss_x_image[x][y-GB+k]* Gauss[abs(k)]; gauss_xy_image[x][y-GB] = gauss_xy_compute/tot; } else if (x<N && (y-GB)>=0 && (y-GB)<M) gauss_xy_image[x][y-GB] = 0; @HC 5KK70 Platform-based Design
Intermezzo • Before we continue with data reuse, have a look at other loop transformations • check the discussed slides !! @HC 5KK70 Platform-based Design
DM methodology C-in Analysis/Preprocessing Dataflow Transformations Loop/control-flow transformations Data Reuse Storage Cycle Budget Distribution Memory Allocation and Assignment Memory Layout organisation Address optimization C-out @HC 5KK70 Platform-based Design
Layer 3 Layer 2 Data paths Layer 1 Memory hierarchy and Data reuse • Determines reuse candidates • Combine reuse candidates into reuse chains • If multiple access statements/array combine into reuse trees • Determine number of layers (if architecture is not fixed) • Select candidates and assign to memory layers • Add extra transfers between the different memory layers(for scratchpad RAM; not for caches) @HC 5KK70 Platform-based Design
TI C55@200MHz example platform L2 Offchip Fixed size RAM partition BW: 50M Word/s single port MAX: 8MBx16 Size 16 MB SRAM/EPROM/ SDRAM/SBSRAM Bandwidth 50M words/s ROM partition L1 ROM (Data/program/DMA) Size 32kB 16Kx16 first 3 cycles, next 2 cycles ROM Bandwidth 100M words/S It seems this can be in parallel with the 256Kb memory BW: 400M Word/s dual port 32x Total 256Kb 4Kx16 4Kx16 4Kx16 Variable size RAM partition sing sing sing 1 elem in 1 cycle Size 320kB Bandwidth 400M words/s 8x Total 64Kb 4Kx16 4Kx16 4Kx16 2 elem in 1 cycle dual dual dual Processor partition L0 Size 2x16 registers Register file + Core TMS320vc5510@200MHz Bandwidth 4.8Gwords/s Vdd= 1.5 V P = unknown @HC 5KK70 Platform-based Design
#A = 100% M P = 1 P total (before) = 100% Exploiting Memory Hierarchy for reduced Power: principle Processor Data Paths Processor Data Paths Register File Register File A P = 1 @HC 5KK70 Platform-based Design
100% 100% A’’ A’ A’ M A M A 10% 5% 1% P = 0.01 P = 0.3 P = 0.1 P = 1 P = 1 P = 1 P = 1 Exploiting Memory Hierarchy for reduced Power: principle Processor Data Paths Processor Data Paths Register File Register File P total (before) = 100% P total (after) = 100%x0.01+10%x0.1+1%x1 = 3% @HC 5KK70 Platform-based Design
customized connections A’’ A’ Data reuse decision and memory hierarchy: principle Processor Data Paths Processor Data Paths Register File Register File M B A Customized connections in the memory subsystem to bypass the memory hierarchy and avoid the overhead. @HC 5KK70 Platform-based Design
copy2 copy1 copy3 copy4 Time frame 2 Time frame 1 Time frame 4 Time frame 3 Step 1: identify arrays with data reuse potential for (i=0; i<4; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; intra-copy reuse array index inter-copy reuse time @HC 5KK70 Platform-based Design
Importance of high level cost estimate for (i=0; i<4; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; array index Array copies are stored in-place! Mk 6 time copy2 copy1 copy3 copy4 Time frame 1 Time frame 2 Time frame 3 Time frame 4 @HC 5KK70 Platform-based Design
j iterator =not present so intra-copy reuse 3 intra-copy reuse factor= 3 copy2 copy1 copy3 copy4 Time frame 1 Time frame 2 Time frame 4 Time frame 3 Step 1: determine gains Intra-copy reuse factor for (i=0; i<4; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; array index Mk 6 time @HC 5KK70 Platform-based Design
for (i=0; i<4; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; inter-copy reuse factor = 1/(1-1/3)=3/2 i iterator has smaller weight than k range so inter-copy reuse copy2 copy1 copy3 copy4 Time frame 1 Time frame 2 Time frame 4 Time frame 3 Step 1: determine gains Inter-copy reuse factor for (i=0; i<n; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) … = A[i*4+k]; array index Mk 6 time @HC 5KK70 Platform-based Design
Mk 15 Mm Mm tf 1.1 tf 2.3 tf 1.3 tf 1.2 tf 2.2 tf 1.4 tf 2.1 tf 1.5 tf 1.6 tf 1 tf 2 tf 4 tf 5 tf 9 tf 7 tf 3 tf 8 tf 6 5 5 time frame 1 time frame 2 Possibility for multi-level hierarchy for (i=0; i<10; i++) for (j=0; j<2; j++) for (k=0; k<3; k++) for (l=0; l<3; l++) for (m=0; m<5; m++) … = A[i*15+k*5+m]; array index time @HC 5KK70 Platform-based Design
A A A A Many reuse possibilities A’ Prune for promising ones A’ A’ Cost estimate needed A’’ R1(A) R1(A) R1(A) R1(A) Step 2: determine data reuse chains for each memory access @HC 5KK70 Platform-based Design
100 80 60 #misses 40 estimate size 20 Gk 0 15 0 5 10 15 20 Gm #elements R1(A) 5 A’ A’ Cost function needs both size and number of accesses to intermediate array estimate #misses from different levels for one iteration of i for (i=0; i<10; i++) for (j=0; j<2; j++) for (k=0; k<3; k++) for (l=0; l<3; l++) for (m=0; m<5; m++) … = A[i*15+k*5+m]; 2*3*5 =30 3*5 =15 2*3*3*5 =90 @HC 5KK70 Platform-based Design
51 38 155 165 135 150 35 170 A 30 15 90 15 A A A 45 22 135 22 150 150 150 150 30 15 15 120 A’ 6 45 105 A’ A’ 5 7 16 15 15 30 120 A’’ accesses size energy 6 x 5 y 90 90 90 90 z R1(A) R1(A) R1(A) R1(A) Very simplistic power and area estimation for different data-reuse versions @HC 5KK70 Platform-based Design
A A’ R2(A) Step 3: determine data reuse trees for multiple accesses for (i=0; i<10; i++) for (j=0; j<2; j++) for (k=0; k<3; k++) for (l=0; l<3; l++) for (m=0; m<5; m++) … = A[i*15+k*5+m]; A A’ for (x=0; x<8; x++) for (y=0; y<5; y++) … = A[i*5+y]; A’’ R1(A) @HC 5KK70 Platform-based Design
Reuse tree A A A’ A’ R2(A) A’’ R1(A) Step 3: determine data reuse trees for multiple accesses A A’ A’ R2(A) A’’ R1(A) @HC 5KK70 Platform-based Design
Hierarchy layers Layer1 Layer2 Layer3 Foreground mem. Datapath Step 4: Determine number of layers Data reuse trees B Data reuse trees A @HC 5KK70 Platform-based Design
all 1 3 2 4 5 A A A FG FG Step 5: Select and assign reuse candidates hierarchy assignments Hierarchy layers Data reuse trees @HC 5KK70 Platform-based Design
Data reuse trees B Step 5: All freedom in array to memory hierarchy Hierarchy layers Data reuse trees A @HC 5KK70 Platform-based Design
Hierarchy layers Pruned Step 5: Prune reuse graph (platform independent) Hierarchy layers Full freedom Quite some solutions never make sense @HC 5KK70 Platform-based Design
Hierarchy layers Pruned Final solution 4 layer platform Final solution 4 layer platform A A' B B' FG FG Step 5: Prune reuse graph further (platform dependent) @HC 5KK70 Platform-based Design
A A B Layer 1 B A’ B’ A’ Layer 2 B’ A’ R2(A) B’’ A’’ B’’’ Layer 3 A’’ A’ B’’’ R1(A) R1(B) R1(A) R2(A) R1(B) Assign all data reuse trees (multiple arrays) to memory hierarchy @HC 5KK70 Platform-based Design