Embedded Computer Architecture. Data Management Part c: SCBD, MAA, and Data Layout. 5KK73 TU/e Henk Corporaal. Part 3 overview. Recap on design flow Platform dependent steps SCBD: Storage Cycle Budget Distribution MAA: Memory Allocation and Assignment Data layout techniques for RAM
Embedded Computer Architecture Data Management Part c: SCBD, MAA, and Data Layout 5KK73 TU/e Henk Corporaal
Part 3 overview
• Recap on design flow
• Platform dependent steps
• SCBD: Storage Cycle Budget Distribution
• MAA: Memory Allocation and Assignment
• Data layout techniques for RAM
• Data layout techniques for Caches
• Results
• Conclusions
Thanks to the IMEC DTSE people
Embedded Computer Architecture 5KK73 @H.C.
Design flow
[Figure: design flow with the stages Concurrent OO spec, Remove OO overhead, Dynamic memory mgmt, Task concurrency mgmt, Physical memory mgmt, Address optimization, and SW/HW co-design, followed by the SW design flow and the HW design flow. A DM label marks the data management part.]
DM steps
C-in
• Preprocessing
• Dataflow transformations
• Loop transformations
• Data reuse
• Memory hierarchy layer assignment
• Cycle budget distribution
• Memory allocation and assignment
• Data layout
• Address optimization
C-out
Today: cycle budget distribution, memory allocation and assignment, data layout
Result of memory hierarchy assignment for cavity detection
[Figure: the arrays image_in, gauss_x, gauss_xy, comp_edge and image_out mapped onto three layers: L2 = 1 MB SDRAM, L1 = 16 KB cache, L0 = 128 B register file. Copy sizes (N*M, M*3, 3*3, 3*1, 1*1) and access counts (N*M, N*M*3, N*M*8) are annotated on the copies and transfers.]
Data reuse: cavity detection code
Code after reuse transformation (partly)

    for (y=0; y<M+3; ++y) {
      for (x=0; x<N+2; ++x) {
        /* first in_pixels initialized */
        if (x==0 && y>=1 && y<=M-2)
          in_pixels[x%3] = image_in[x][y];
        /* copy rest of in_pixels in row */
        if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
          in_pixels[(x+1)%3] = image_in[x+1][y];
        if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
          gauss_x_tmp = 0;
          for (k=-1; k<=1; ++k)  // 3x1 filter
            gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
          gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
        } else if (x<N && y<M)
          gauss_x_lines[x][y%3] = 0;
        /* rest of the loop body is not shown on the slide */
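The in_pixels[x%3] idiom can be tried on a single image line. The sketch below is an illustration, not the course code: LINE_LEN, the GAUSS weights and the function names are assumptions. It compares a direct 3-tap filter with a version that fetches every input pixel once into a 3-entry rolling buffer.

```c
enum { LINE_LEN = 8 };                 /* hypothetical line length */

/* Hypothetical kernel: GAUSS[0] is the centre weight, GAUSS[1] the side weight. */
static const int GAUSS[2] = { 2, 1 };

/* Reference: every output pixel reads its three input pixels directly. */
void filter_direct(const int in[LINE_LEN], int out[LINE_LEN])
{
    for (int x = 0; x < LINE_LEN; ++x) {
        if (x >= 1 && x <= LINE_LEN - 2)
            out[x] = GAUSS[1] * in[x - 1] + GAUSS[0] * in[x] + GAUSS[1] * in[x + 1];
        else
            out[x] = 0;                /* border pixels are not filtered */
    }
}

/* Reuse version: each input pixel is copied once into a 3-entry buffer. */
void filter_reuse(const int in[LINE_LEN], int out[LINE_LEN])
{
    int in_pixels[3];
    for (int x = 0; x < LINE_LEN; ++x) {
        if (x == 0)
            in_pixels[0] = in[0];                 /* first pixel of the line */
        if (x <= LINE_LEN - 2)
            in_pixels[(x + 1) % 3] = in[x + 1];   /* overwrites in[x-2], no longer needed */
        if (x >= 1 && x <= LINE_LEN - 2) {
            int acc = 0;
            for (int k = -1; k <= 1; ++k)
                acc += in_pixels[(x + k) % 3] * GAUSS[k < 0 ? -k : k];
            out[x] = acc;
        } else {
            out[x] = 0;
        }
    }
}

/* Returns 1 when both versions produce the same output line. */
int reuse_matches_direct(void)
{
    int in[LINE_LEN], a[LINE_LEN], b[LINE_LEN];
    for (int x = 0; x < LINE_LEN; ++x)
        in[x] = (7 * x) % 11;          /* arbitrary test data */
    filter_direct(in, a);
    filter_reuse(in, b);
    for (int x = 0; x < LINE_LEN; ++x)
        if (a[x] != b[x])
            return 0;
    return 1;
}
```

Calling reuse_matches_direct() checks that the buffered version computes the same line while reading each element of in[] only once.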
Storage Cycle Budget Distribution & Memory Allocation and Assignment
Define the memory organization that provides enough bandwidth at minimal cost
Balancing memory bandwidth
Reduce the maximum number of loads/stores per cycle.
[Figure: two plots of required memory bandwidth over time, one with a high peak and one with a low, balanced profile.]
Data management approach
[Figure: one of the many possible schedules]
• Idea: find a schedule which
  • fits in the number of cycles (= budget)
  • reduces the number of ports
  • avoids multi-ported memories
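As a rough illustration of why the schedule matters: the number of simultaneous accesses in the busiest cycle determines how many ports (or parallel memories) are needed. The sketch below uses hypothetical access counts and function names, and it ignores data dependencies, which a real scheduler has to respect.

```c
/* Peak number of memory accesses in any single cycle of a schedule. */
int peak_accesses(const int accesses_per_cycle[], int ncycles)
{
    int peak = 0;
    for (int c = 0; c < ncycles; ++c)
        if (accesses_per_cycle[c] > peak)
            peak = accesses_per_cycle[c];
    return peak;
}

/* Lower bound on the peak for a given budget: ceil(total / budget).
   Dependencies between accesses may make this bound unreachable. */
int balanced_lower_bound(int total_accesses, int cycle_budget)
{
    return (total_accesses + cycle_budget - 1) / cycle_budget;
}

/* Six accesses in four cycles, scheduled unevenly (hypothetical numbers). */
int demo_unbalanced_peak(void)
{
    static const int schedule[4] = { 4, 1, 0, 1 };
    return peak_accesses(schedule, 4);
}

/* The same six accesses spread more evenly over the same budget. */
int demo_balanced_peak(void)
{
    static const int schedule[4] = { 2, 2, 1, 1 };
    return peak_accesses(schedule, 4);
}
```

In this toy case the unbalanced schedule needs four simultaneous accesses, while the balanced one needs two, which matches the lower bound for six accesses in four cycles.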
Data management approach; details
Conflict cost calculation
Key issues:
• Number of conflicts
• Self conflicts
• Chromatic number (at least the size of the maximum clique)
Self conflict → dual-port memory
Reschedule
Chromatic number → minimum number of single-port memories
Reschedule
A lower number of conflicts gives larger assignment freedom
[Figure: two schedules over the arrays A, B, C, D, each with its conflict graph and the resulting memory configuration. The first conflict graph leaves one solution for the memory configuration; the second, with fewer conflicts, leaves multiple solutions.]
Reschedule
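The step from schedule to conflict graph to memory assignment can be sketched in a few lines. This is a minimal illustration under assumptions: the Cycle type, the function names and the example schedules are hypothetical, and the greedy colouring only gives an upper bound on the chromatic number, not necessarily the minimum.

```c
enum { MAX_ARRAYS = 8, MAX_PER_CYCLE = 4 };

/* Arrays (by index) that are accessed in the same cycle. */
typedef struct {
    int n;
    int arrays[MAX_PER_CYCLE];
} Cycle;

/* Builds the conflict matrix; returns the number of arrays with a self
   conflict (two accesses to the same array in one cycle). */
int build_conflicts(const Cycle *sched, int ncycles,
                    int conflict[MAX_ARRAYS][MAX_ARRAYS])
{
    int self_conflicts = 0;
    for (int a = 0; a < MAX_ARRAYS; ++a)
        for (int b = 0; b < MAX_ARRAYS; ++b)
            conflict[a][b] = 0;
    for (int c = 0; c < ncycles; ++c)
        for (int p = 0; p < sched[c].n; ++p)
            for (int q = p + 1; q < sched[c].n; ++q) {
                int a = sched[c].arrays[p], b = sched[c].arrays[q];
                if (a == b && !conflict[a][a])
                    ++self_conflicts;          /* would need a dual-port memory */
                conflict[a][b] = conflict[b][a] = 1;
            }
    return self_conflicts;
}

/* Greedy colouring: each colour stands for one single-port memory.
   Returns the number of memories used. */
int assign_memories(int narrays, int conflict[MAX_ARRAYS][MAX_ARRAYS],
                    int mem_of[MAX_ARRAYS])
{
    int nmem = 0;
    for (int a = 0; a < narrays; ++a) {
        int used[MAX_ARRAYS] = { 0 };
        for (int b = 0; b < a; ++b)
            if (conflict[a][b])
                used[mem_of[b]] = 1;
        int m = 0;
        while (used[m])
            ++m;                               /* first memory without a conflict */
        mem_of[a] = m;
        if (m + 1 > nmem)
            nmem = m + 1;
    }
    return nmem;
}

/* Hypothetical schedule over A=0, B=1, C=2, D=3: {A,B} {C,D} {A,C} {B,D}. */
int demo_memories(void)
{
    static const Cycle sched[4] = {
        { 2, { 0, 1 } }, { 2, { 2, 3 } }, { 2, { 0, 2 } }, { 2, { 1, 3 } }
    };
    int conflict[MAX_ARRAYS][MAX_ARRAYS];
    int mem_of[MAX_ARRAYS];
    build_conflicts(sched, 4, conflict);
    return assign_memories(4, conflict, mem_of);
}

/* Hypothetical schedule where A is accessed twice in the same cycle. */
int demo_self_conflicts(void)
{
    static const Cycle sched[2] = { { 2, { 0, 0 } }, { 2, { 1, 2 } } };
    int conflict[MAX_ARRAYS][MAX_ARRAYS];
    return build_conflicts(sched, 2, conflict);
}
```

For the first schedule the conflict graph is a 4-cycle, so two single-port memories suffice (A and D in one, B and C in the other). The second schedule reports one self conflict, the case that forces a dual-port memory unless the accesses are rescheduled.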
Conflict Directed Ordering is used to find a good schedule
• Reduce intervals until all conflicts known
• Driven by cost of conflicts
• Constructive algorithm
[Figure: read and write accesses R(A), R(B), R(C), R(D), W(A), W(B), W(C), W(D) placed in time slots 1 to 6, with some placements still open (?).]
Local optimization is not good for global optimization
Budget distribution has large impact on memory cost
Decreasing basic block length until target cycle budget is met
What's the effect of merging loops?
• More scheduling freedom!
Reschedule
Memory allocation and assignment
Memory Allocation and Assignment substeps
• Memory allocation (allocation = select number and type of memories)
• Array-to-memory assignment
• Port assignment and bus sharing
[Figure: memories 1, 2, 3 are allocated; arrays A, B, C, D are assigned to them; ports and buses are then assigned and shared.]
Influence of MAA
• Bit width
• Address range
• Nr. memories
• Nr. ports
• Assign arrays to memory
• Memory interconnect
• Minimize power & area
[Figure: memories MEMORY-1 to MEMORY-N, each characterized by (maximum) bitwidth, (maximum) size and number of ports (R/W/RW); arrays A and B are shown as bit patterns with unused bits marked X.]
Example of bus sharing possibilities
[Figure: a given schedule of accesses R(A), R(B), W(A), W(B), W(C) and three alternative bus configurations for memories m1, m2, m3 holding arrays A, C, B.]
Decreasing cycle budget limits freedom and raises cost
Example: resulting Pareto curve for DAB synchro application
[Figure: energy cost versus cycle budget. Annotations on the curve: minimum budget; self conflict, forcing dual-port memory; conflict graph changed, change in assignment; conflict graph changed, but no impact on assignment; sequential.]
Example conflict graph for cavity detection
MAA result
Power:
On-chip area:
Data layout: how to put data into memory
Memory data layout for custom and cache architectures
[Figure: processing elements (PE) connected to memories MEM1 and MEM2, directly or through a CACHE. Arrays A, B, C, F, G, H and copies A', B' have to be placed; question marks indicate the open placement choices.]
Intra-array in-place mapping reduces size of one array

    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[i][j] = f(a[i-1][j]);

[Figure: memory addresses of the elements a[i][j] plotted against time; the window between row i-1 and row i contains the maximum number of live elements.]
This number depends on the layout! Compare, e.g., row-major and column-major ordering.
Two-phase mapping of array elements onto addresses
[Figure: the array domains A, B, C are first mapped by a storage order onto abstract addresses aA, aB, aC; allocation then maps the abstract addresses onto real addresses a.]
Exploration of storage orders for 2-dimensional array: 8 options
[Figure: variable domain with axes a1 and a2, mapped onto memory addresses a = ???]
• a = 3a1 + a2
• a = 3a1 + (2-a2)
• a = 3(1-a1) + a2
• a = 3(1-a1) + (2-a2)
• a = 2a2 + a1
• a = 2(2-a2) + a1
• a = 2a2 + (1-a1)
• a = 2(2-a2) + (1-a1)
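The eight formulas imply a 2x3 variable domain (a1 in 0..1, a2 in 0..2). A small check, with hypothetical function names, confirms that each of them maps this domain one-to-one onto the addresses 0..5, so every option is a legal storage order:

```c
enum { D1 = 2, D2 = 3 };   /* a1 in 0..1, a2 in 0..2 */

/* Address of element (a1, a2) under storage order 0..7, in the listed sequence. */
int storage_address(int order, int a1, int a2)
{
    switch (order) {
    case 0:  return 3 * a1 + a2;
    case 1:  return 3 * a1 + (2 - a2);
    case 2:  return 3 * (1 - a1) + a2;
    case 3:  return 3 * (1 - a1) + (2 - a2);
    case 4:  return 2 * a2 + a1;
    case 5:  return 2 * (2 - a2) + a1;
    case 6:  return 2 * a2 + (1 - a1);
    case 7:  return 2 * (2 - a2) + (1 - a1);
    default: return -1;    /* unknown order */
    }
}

/* Returns 1 if the order maps the domain one-to-one onto addresses 0..5. */
int is_valid_order(int order)
{
    int seen[D1 * D2] = { 0 };
    for (int a1 = 0; a1 < D1; ++a1)
        for (int a2 = 0; a2 < D2; ++a2) {
            int a = storage_address(order, a1, a2);
            if (a < 0 || a >= D1 * D2 || seen[a])
                return 0;
            seen[a] = 1;
        }
    return 1;
}

/* Counts how many of the eight candidate orders are valid. */
int count_valid_orders(void)
{
    int count = 0;
    for (int order = 0; order < 8; ++order)
        count += is_valid_order(order);
    return count;
}
```

The eight options come from two choices of the fastest-varying dimension times two directions per dimension.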
Chosen storage order determines window size

    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[i][j] = f(a[i-1][j]);

Row-major ordering: a = 5i+j

    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[5*i+j] = f(a[5*i+j-5]);

  Highest live address: 5*i+j
  Lowest live address: 5*i+j-5
  Difference + 1 = Window: 6

Column-major ordering: a = 5j+i

    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[5*j+i] = f(a[5*j+i-1]);

  Highest live address: 5*4+i-1
  Lowest live address: 5*0+i-1
  Difference + 1 = Window: 21
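The two window sizes can be reproduced by brute force. The sketch below makes its own assumptions about liveness: row 0 is live before the loop starts, an element stays live up to and including the iteration that reads it, and the last row stays live until the loop nest ends. Function names are hypothetical.

```c
#include <limits.h>

enum { ROWS = 5, COLS = 5 };

typedef int (*storage_order_fn)(int i, int j);

int row_major(int i, int j) { return COLS * i + j; }
int col_major(int i, int j) { return ROWS * j + i; }

/* Largest (highest live address - lowest live address + 1) over all
   iterations of:  for i in 1..4, for j in 0..4: a[i][j] = f(a[i-1][j]).
   Iteration (i, j) is time step t = (i-1)*COLS + j. */
int window_size(storage_order_fn addr)
{
    const int steps = (ROWS - 1) * COLS;
    int window = 0;
    for (int t = 0; t < steps; ++t) {
        int lo = INT_MAX, hi = -1;
        for (int i = 0; i < ROWS; ++i)
            for (int j = 0; j < COLS; ++j) {
                /* written at iteration (i, j); row 0 exists from the start */
                int born = (i == 0) ? 0 : (i - 1) * COLS + j;
                /* last read at iteration (i+1, j); last row kept to the end */
                int dies = (i + 1 < ROWS) ? i * COLS + j : steps - 1;
                if (born <= t && t <= dies) {
                    int a = addr(i, j);
                    if (a < lo) lo = a;
                    if (a > hi) hi = a;
                }
            }
        if (hi >= lo && hi - lo + 1 > window)
            window = hi - lo + 1;
    }
    return window;
}
```

Under these assumptions window_size(row_major) evaluates to 6 and window_size(col_major) to 21, the values given for the two orderings.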
Static allocation: no in-place mapping
[Figure: memory size versus time; each of the arrays A to E occupies its own address range (aA to aE) for the whole time.]
Windowed allocation: intra-array in-place mapping
[Figure: memory size versus time for arrays A to E in two variants, "static, windowed" and "dynamic, windowed"; WA marks the window of array A.]
Dynamic allocation: inter-array in-place mapping
[Figure: memory size versus time; the address ranges aA to aE of arrays A to E are placed so that arrays can share memory locations over time.]
Dynamic allocation strategy with common window
[Figure: memory size versus time for arrays A to E, "dynamic, common window".]
Expressing memory data layout in source code
Example: array of 10x20 elements
A: offset 120, no window
B: storage order [20, 2], offset 134, window 78

Before:

    bit8 B[10][20];
    bit6 A[30];
    for (x=0; x<10; ++x)
      for (y=0; y<20; ++y) {
        … = A[3*x-y];
        B[x][y] = …;
      }

After:

    bit8 memory[334];
    bit8* B = (bit8*)&memory[134];
    bit6* A = (bit6*)&memory[120];
    for (x=0; x<10; ++x)
      for (y=0; y<20; ++y) {
        … = A[3*x-y];
        B[(x*20+y*2)%78] = …;
      }
Example of memory data layout for storage size reduction

    int x[W], y[W];
    for (i1=0; i1 < W; i1++)
      x[i1] = getInput();
    for (i2=0; i2 < W; i2++) {
      sum = 0;
      for (di2=-N; di2 <= N; di2++) {
        sum += c[N+di2] * x[wrap(i2+di2,W)];
      }
      y[i2] = sum;
    }
    for (i3=0; i3 < W; i3++)
      putOutput(y[i3]);
Occupied address-time domain of x[] and y[]
Optimized source code after memory data layout

    int mem1[N+W];
    for (i1=0; i1 < W; i1++)
      mem1[N+i1] = getInput();
    for (i2=0; i2 < W; i2++) {
      sum = 0;
      for (di2=-N; di2 <= N; di2++) {
        sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
      }
      mem1[i2] = sum;
    }
    for (i3=0; i3 < W; i3++)
      putOutput(mem1[i3]);
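Whether x[] and y[] may share mem1[] can be checked by running both versions on the same data. The sketch below is an assumption-laden illustration: W, N and the coefficients c[] are arbitrary, getInput()/putOutput() are replaced by arrays, and wrap() is assumed to clamp its index to 0..W-1, since the slide does not define it. With a true modulo wrap-around the first x elements would still be read near the end of the loop, so this particular overlap would presumably not be safe.

```c
enum { W = 16, N = 2 };                            /* hypothetical sizes */

static const int c[2 * N + 1] = { 1, 2, 3, 2, 1 }; /* hypothetical coefficients */

/* Assumption: wrap() clamps the index to the valid range 0..w-1. */
static int wrap(int i, int w)
{
    return i < 0 ? 0 : (i >= w ? w - 1 : i);
}

/* Original layout: separate arrays x[] and y[]. */
void filter_separate(const int in[W], int out[W])
{
    int x[W], y[W];
    for (int i1 = 0; i1 < W; i1++)
        x[i1] = in[i1];
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -N; di2 <= N; di2++)
            sum += c[N + di2] * x[wrap(i2 + di2, W)];
        y[i2] = sum;
    }
    for (int i3 = 0; i3 < W; i3++)
        out[i3] = y[i3];
}

/* In-place layout: x[k] lives at mem1[N+k], y[k] at mem1[k]. */
void filter_inplace(const int in[W], int out[W])
{
    int mem1[N + W];
    for (int i1 = 0; i1 < W; i1++)
        mem1[N + i1] = in[i1];
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -N; di2 <= N; di2++)
            sum += c[N + di2] * mem1[N + wrap(i2 + di2, W)];
        mem1[i2] = sum;      /* overwrites x[i2-N], which is dead by now */
    }
    for (int i3 = 0; i3 < W; i3++)
        out[i3] = mem1[i3];
}

/* Returns 1 when both layouts produce identical output. */
int layouts_agree(void)
{
    int in[W], a[W], b[W];
    for (int i = 0; i < W; i++)
        in[i] = 3 * i + 1;   /* arbitrary test data */
    filter_separate(in, a);
    filter_inplace(in, b);
    for (int i = 0; i < W; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The in-place version needs N+W words instead of 2W, because at iteration i2 all reads touch addresses at or above max(N, i2) while earlier results sit below i2.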
Optimized OAT domain after memory data layout
In-place mapping for cavity detection example
• Input image is partly consumed by the time first results for output image are ready
[Figure: index/address versus time for Image_in and Image_out, and for the combined array Image.]
In-place: cavity detection code

Before in-place mapping:

    for (y=0; y<=M+3; ++y) {
      for (x=0; x<N+5; ++x) {
        image_out[x-5][y-3] = …;
        /* code removed */
        … = image_in[x+1][y];
      }
    }

After in-place mapping:

    for (y=0; y<=M+3; ++y) {
      for (x=0; x<N+5; ++x) {
        image[x-5][y-3] = …;
        /* code removed */
        … = image[x+1][y];
      }
    }
Cavity detection summary
Overall result:
• Local accesses reduced by factor 3
• Memory size reduced by factor 5
• Power reduced by factor 5
• System bus load reduced by factor 12
• Performance worsened by factor 6
The last step: ADOPT (Address OPTimization)
• Increased execution time introduced by DTSE
  • Complicated address arithmetic (modulo: a%b)
  • Additional complex control flow
• Additional transformations needed to
  • Simplify control flow
  • Simplify address arithmetic: common sub-expression elimination, modulo expansion, …
  • Match remaining expressions on target machine
ADOPT principles
• How to avoid % in address expressions, like

    int A[7];
    for (i=0; i<…; i++)
      … A[i % 7]

• Increase buffer size to a power of 2:

    i % 8  =>  i & 0x07

• Use an if-statement with a separate wrap-around counter:

    int A[7];
    for (i=0, j=0; i<…; i++) {
      … A[j]
      if (++j == 7) j = 0;
    }
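Both replacements can be checked against the modulo they stand in for. In this sketch BUF, BUF_POW2 and the function names are illustrative; the mask form relies on the buffer size being a power of two and on a non-negative index.

```c
enum { BUF = 7, BUF_POW2 = 8 };

/* Reference address expression with a modulo. */
int index_modulo(int i) { return i % BUF; }

/* Power-of-two buffer: the modulo becomes a bit mask (valid for i >= 0). */
int index_masked(int i) { return i & (BUF_POW2 - 1); }

/* Wrap-around counter: one compare instead of a modulo per access. */
int wrap_increment(int j) { return (j + 1 == BUF) ? 0 : j + 1; }

/* Returns 1 if, for i = 0..n-1, the counter equals i % BUF and the
   masked index equals i % BUF_POW2. */
int variants_agree(int n)
{
    int j = 0;
    for (int i = 0; i < n; ++i) {
        if (j != index_modulo(i))
            return 0;
        if (index_masked(i) != i % BUF_POW2)
            return 0;
        j = wrap_increment(j);
    }
    return 1;
}
```

The mask variant trades one extra buffer element for cheaper addressing; the counter variant keeps the buffer at 7 elements but adds a conditional to the loop body.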
ADOPT principles: CSE
Example: Full-search Motion Estimation, applying Common Subexpression Elimination (CSE)
Algebraic transformations at word-level

Before:

    for (i=-8; i<=8; i++) {
      for (j=-4; j<=3; j++) {
        for (k=-4; k<=3; k++)
          A[((208+i)*257+8+j)*257+16+i+k] = B[(8+j)*257+16+i+k];
      }
      dist += A[3096] - B[((208+i)*257+4)*257+16+i-4];
    }

After:

    for (i=-8; i<=8; i++) {
      cse1 = (33025*i+6869616)*2;
      cse3 = 1040+i;
      for (j=-4; j<=3; j++) {
        cse4 = j*257+1032;
        for (k=-4; k<=3; k++) {
          cse5 = k+cse4;
          A[cse5+cse1] = B[cse5+cse3];
        }
      }
      dist += A[3096] - B[cse1];
    }
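The cse expressions can be checked against the original index expressions by enumerating the whole iteration space. The function name below is hypothetical; only the index arithmetic is evaluated, no arrays are touched.

```c
/* Returns 1 if the CSE forms equal the original index expressions
   for every (i, j, k) in the loop nest. */
int cse_matches_original(void)
{
    for (int i = -8; i <= 8; ++i) {
        int cse1 = (33025 * i + 6869616) * 2;
        int cse3 = 1040 + i;
        for (int j = -4; j <= 3; ++j) {
            int cse4 = j * 257 + 1032;
            for (int k = -4; k <= 3; ++k) {
                int cse5 = k + cse4;
                int a_idx = ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k;
                int b_idx = (8 + j) * 257 + 16 + i + k;
                if (a_idx != cse5 + cse1 || b_idx != cse5 + cse3)
                    return 0;
            }
        }
        /* index of the B operand in the dist update */
        int d_idx = ((208 + i) * 257 + 4) * 257 + 16 + i - 4;
        if (d_idx != cse1)
            return 0;
    }
    return 1;
}
```

After the rewrite, the innermost loop computes one addition for cse5 and one more per array index, instead of two multiplications and several additions per access.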
Conclusion on Data Management
• In multi-media applications, exploring data transfer and storage issues should be done at source code level
• DMM method
• Reducing number of external memory accesses
• Reducing external memory size
• Trade-offs between internal memory complexity and speed
• Platform independent high-level transformations
• Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
• Substantial energy reduction