Data Memory Management in Embedded Systems

ASCI Winterschool on Embedded SystemsMarch 2004Renesse Data Memory Management Henk Corporaal Peter Knijnenburg

Data Memory Management Overview • Motivation • Example application • DMM steps • Results Notes: • We concentrate on Static Data Memory Management‘ • The Data Transfer and Storage Exploration (DTSE)methodology, on which these slides are based, has been developed by IMEC, Leuven

SDRAM SDRAM Serial I/O video-in B[j] = A[i*4+k]; B[j] = A[i*4+k]; B[j] = A[i*4+k]; PCI bridge video-out timers I2C I/O Data storage bottleneck audio-out I$ VLIW cpu audio-in Data transfer bottleneck D$ D$ The underlying idea for (i=0;i<n;i++) for (j=0; j<3; j++) for (k=1; k<7; k++) B[j] = A[i*4+k];

Platform architecture model Level-2 Level-3 Level-4 Level-1 SCSI bus bus bus Chip on-chip busses bus-if bridge SCSI Disk L2 Cache ICache CPUs DCache Main Memory Disk HW accel Local Memory Local Memory Disk Local Memory

Power(memory) = 33 Power(arithmetic) Data transfer and storage power

Data transfer and data storage specific rewrites in the application code Positioning in the Y-chart Architecture Instance Applications Applications Applications Mapping Performance Analysis Performance Numbers

Mapping • Given • architecture SuperDuperXYZ • reference C code for applicatione.g. MPEG-4 Motion Estimation • Task • map application on this architecture • But … wait a moment me@work> sdcc -o mpeg4_me mpeg4_me.cThank you for running SuperDuperXYZ compiler.Your program uses 257321886 bytes memory, 78 Watt, 428798765291 clock cycles Let’s help the compiler

Application domain: Computer Tomography in medical imaging Algorithm: Cavity detection in CT-scans Detect dark regions in successive images Indicate cavity in brain Bad news for owner of brain Application example

Application Max Value Reference (conceptual) C code for the algorithm • all functions: image_in[N x M]t-1 -> image_out[N x M]t • new value of pixel depends on its neighbors • neighbor pixels read from background memory • approximately 110 lines of C code (ignoring file I/O etc) • experiments with N x M = 640 x 400 pixels • straightforward implementation: 6 image buffers Gauss Blur x Gauss Blur y Compute Edges Reverse Detect Roots

Avoid N-port Memories within real-time constraints local latch 1 & bank 1 Processor Data Paths L1 cache L2 cache Cache Bank Combine local latch N & bank N Introduce Locality Reduce redundant transfers Exploit memory hierarchy DMM principles Off-chip SDRAM Exploit limited life-time

C-in DMM steps Preprocessing Dataflow transformations Loop transformations Data reuse Memory hierarchy layer assignment Cycle budget distribution Memory allocation and assignment Data layout Address expression optimization C-out

The DMM steps • Preprocessing • Rewrite code in 3 layers (parts) • Selective inlining, Single Assignment form, .... • Data flow transformations • Eliminate redundant transfers and storage • Loop and control flow transformations • Improve regularity of accesses and data locality • Data re-use and memory hierarchy layer assignment • Determine when to move which data between memories to meet the cycle budget of the application with low cost • Determine in which layer to put the arrays (and copies)

The DMM steps Per memory layer: • Cycle budget distribution • determine memory access constraints for given cycle budget • Memory allocation and assignment • which memories to use, and where to put the arrays • Data layout • determine how to combine and put arrays into memories • Address expression optimizations

Preprocessing: Dividing an application in the 3 layers Module1a LAYER1 Module2 Module3 Module1b - testbench call - dynamic event behaviour Synchronisation - mode selection for (i=0;i<N; i++) for (j=0; j<M; j++) LAYER2 if (i == 0) B[i][j] = 1; else B[i][j] = func1(A[i][j], A[i-1][j]); int func1(int a, int b) LAYER3 { return a*b; }

Layered code structure main(){ /* Layer 1 code */ read_image(IN_NAME, image_in); cav_detect(); write_image(image_out); } void cav_detect() { /* Layer 2 code */ for (x=GB; x<=N-1-GB; ++x) { for (y=GB; y<=M-1-GB; ++y) { gauss_x_tmp = 0; for (k=-GB; k<=GB; ++k) { gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } } }

N M N-2 M-2 Data-flow trafo - cavity detection for (x=0; x<N; ++x) for (y=0; y<M; ++y) gauss_x_image[x][y]=0; for (x=1; x<=N-2; ++x) { for (y=1; y<=M-2; ++y) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) { gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } } } #accesses: N * M + (N-2) * (M-2)

N M M-2 N-2 Data-flow trafo - cavity detection for (x=0; x<N; ++x) for (y=0; y<M; ++y) if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) { gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } else { gauss_x_image[x][y] = 0; } } } #accesses: N * M gain is ± 50 %

Data-flow transformation • In total 5 types of data-flow transformations: • advanced signal substitution and propagation • algebraic transformations (associativity etc.) • shifting “delay lines” • re-computation • transformations to eliminate bottlenecksfor subsequent loop transformations

Data-flow transformation - result

for (j=1; j<=M; j++) for (i=1; i<=N; i++) A[i]= foo(A[i]); for (i=1; i<=N; i++) out[i] = A[i]; for (i=1; i<=N; i++) { for (j=1; j<=M; j++) { A[i] = foo(A[i]); } out[i] = A[i]; } storage size 1 storage size N Loop transformations • Loop transformations • improve regularity of accesses • improve temporal locality: production  consumption • Expected influence • reduce temporary storage and (anticipated) background storage

scandevice Data enters Cavity Detectorrow-wise serial scan Buffer =image_in GaussBlur loop Cavity Detector

N x M Gauss Blur x X-Y Loop Interchange Loop trafo - cavity detection N x M Scanner X Y From double bufferto single buffer

Loop trafo - cavity detection N x (2GB+1) N x 3 Gauss Blur x Gauss Blur y Compute Edges Repeated fold and loop merge 3(offset arrays) 2GB+1 From N x M toN x (3) buffer size From N x M toN x (2GB+1) buffer size

Improve regularity and locality Loop Merging !! Impossible due to dependencies! for (y=0;y<M;y++) for (x=0;x<N;x++) /* 1st filtering code */ for (y=0;y<M;y++) for (x=0;x<N;x++) /* 2nd filtering code */ for (y=0;y<M;y++) for (x=0;x<N;x++) /* 1st filtering code */ for (x=0;x<N;x++) /* 2nd filtering code */

Data dependencies between1st and 2nd loop for (y=0;y<M;y++) for (x=0;x<N;x++) … gauss_x_image[x][y] = … for (y=0;y<M;y++) for (x=0;x<N;x++) … for (k=-GB; k<=GB; k++) … = … gauss_x_image[x][y+k] …

Enable merging withLoop Folding (bumping) for (y=0;y<M;y++) for (x=0;x<N;x++) … gauss_x_image[x][y] = … for (y=0+GB;y<M+GB;y++) for (x=0;x<N;x++) … y-GB … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y+k-GB] …

Y-loop merging on 1st and 2nd loop nest for (y=0;y<M+GB;y++) if (y<M) for (x=0;x<N;x++) … gauss_x_image[x][y] = … if (y>=GB) for (x=0;x<N;x++) if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] … else

Simplify conditionsin merged loop nest for (y=0;y<M+GB;y++) for (x=0;x<N;x++) if (y<M) … gauss_x_image[x][y] = … for (x=0;x<N;x++) if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] … else if (y>=GB)

End result of global loop trafo for (y=0; y<M+GB+2; ++y) { for (x=0; x<N+2; ++x) { … if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) { gauss_xy_compute[x][y-GB][0] = 0; for (k=-GB; k<=GB; ++k) gauss_xy_compute[x][y-GB][GB+k+1] = gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)]; gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot; } else if (x<N && (y-GB)>=0 && (y-GB)<M) gauss_xy_image[x][y-GB] = 0; …

Loop transformations - result

M’’ M’ M Main memory P = 0.01 P = 0.1 P = 1 Data re-use & memory hierarchy • Introduce memory hierarchy • reduce number of reads from main memory • heavily accessed arrays stored in smaller memories Processor Data Paths Reg File #A = 100 100 10 1 P (original) = # access x power/access = 100 P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3

int[2][6] A;for (h=0; h<N; h++) for (i=0; i<2; i++) for (j=0; j<3; j++) for (k=1; k<7; k++) B[j] = A[i][k]; array index (6 * i + k) iterations Data re-use • Data flow transformations to introduce extracopies of heavily accessed signals • Step 1: figure out data re-use possibilities • Step 2: calculate possible gain • Step 3: decide on data assignment to memory hierarchy

array index 6*2 6*1 N*2*3*6 iterations frame1 frame2 frame3 CPU Data re-use • Data flow transformations to introduce extracopies of heavily accessed signals • Step 1: figure out data re-use possibilities • Step 2: calculate possible gain • Step 3: decide on data assignment to memory hierarchy 1*2*1*6 N*2*1*6

Data re-use tree image_in gauss_xy/comp_edge gauss_x image_out N*M M*3 M*3 M*3 N*M N*M N*M*3 N*M*3 N*M 0 1*1 N*1 3*3 1*3 N*M N*M*8 N*M*8 N*M*3 3*1 N*M*3 CPU CPU CPU CPU CPU

L3 L2 L1 Memory hierarchy assignment image_in image_out gauss_xy comp_edge gauss_x N*M N*M 1MB SDRAM 0 N*M M*3 M*3 M*3 16KB Cache N*M*3 N*M N*M N*M*3 N*M*3 128 B RegFile 1*1 1*1 3*1 3*3 3*3 N*M*3 N*M*8 N*M*8 N*M*8 N*M*8

Data-reuse - cavity detection code Code before reuse transformation for (y=0; y<M+3; ++y) { for (x=0; x<N+2; ++x) { if (x>=1 && x<=N-2 && y>=1 && y<=M-2) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; gauss_x_image[x][y]= foo(gauss_x_compute); } else { if (x<N && y<M) gauss_x_lines[x][y] = 0; } /* Other merged code omitted … */ } }

Data-reuse - cavity code Code after reuse transformation detection for (y=0; y<M+3; ++y) { for (x=0; x<N+2; ++x) { /* first in_pixel initialized */ if (x==0 && y>=1 && y<=M-2) for (k=0; k<1; ++k) in_pixels[(x+k)%3][y%1] = image_in[x+k][y]; /* copy rest of in_pixel's in row */ if (x>=0 && x<=N-2 && y>=1 && y<=M-2) in_pixels[(x+1)%3][y%1] = image_in[x+1][y]; if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) { gauss_x_tmp=0; for (k=-1; k<=1; ++k) gauss_x_tmp += in_pixels[(x+k)%3][y%1]*Gauss[Abs(k)]; gauss_x_lines[x][y%3]= foo(gauss_x_tmp); } else if (x<N && y<M) gauss_x_lines[x][y%3] = 0;

Data reuse & memory hierarchy

Data layout optimization • At this point multi-dimensional arraysare to be assigned to physical memories • Data layout optimization determines exactly where in each memory an array should be placed, to • reduce memory size by “in-placing” arrays that do not overlap in time (disjoint lifetimes) • to avoid cache misses due to conflicts • exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences

A A A A D D B B D D C C C C B B E E E E In-place mapping Inter in-place addresses Intra in-place time

In-place mapping - results

The last step: ADOPT (Address OPTimization) • Increased execution time introduced by DMM • Complicated address arithmetic (modulo!) • Additional complex control flow • Additional transformations needed to • Simplify control flow • Simplify address arithmetic: common sub-expression elimination, modulo expansion, … • Match remaining expressions on target machine

for (i=-8; i<=8; i++) { for (j=- 4; j<=3; j++) { for (k=- 4; k<=3; k++) { Ad[ ] = A[ ]; }} dist += A[ ]-Ad[ ]; } ADOPT example From Full-search Motion Estimation for (i=- 8; i<=8; i++) { for (j=- 4; j<=3; j++) { for (k=- 4; k<=3; k++) A[((208+i)*257+8+j)*257+ 16+i+k] = B[(8+j)*257+16+i+k]; } dist += A[3096] - B[((208+i)*257+4)*257+ 16+i-4]; } cse1 = (33025*i+6869616)*2; cse3 = 1040+i; cse4 = j*257+1032; cse5 = k+cse4; cse5+cse1 = cse5+cse3 3096 cse1 Algebraic transformations at word-level

Address optimization - result

Conclusion • In embedded applications exploring data transfer and storage issues should be done at system level • DTSE is a methodology for Data Transfer and Storage Exploration based on manual and/or tool-assisted code rewriting • Platform independent high-level transformations • Platform dependent transformations exploit platform characteristics (efficient use of cache, local memories) • Substantial reduction in power and memory size demonstrated on MPEG-4, OFDM, H.263, ADSL, ...

Data Memory Management in Embedded Systems

Data Memory Management in Embedded Systems

Presentation Transcript

Embedded Systems

EMBEDDED SYSTEMS

Embedded Systems

Embedded Systems

Embedded Systems

Embedded Systems

ASCI Winterschool on Embedded Systems March 2004 Renesse

Embedded Systems

Embedded Systems

Embedded Systems

ASCI Winterschool on Embedded Systems March 2004 Renesse

Embedded Systems

ASCI Winterschool on Embedded Systems March 2004 Renesse

EMBEDDED SYSTEMS

Voter Registration 27 th October, 2004, ASCI