Data-Flow Analysis in the Memory Management of Real-Time Multimedia Processing Systems
Florin Balasa
University of Illinois at Chicago
Introduction

Real-time multimedia processing systems (video and image processing, real-time 3D rendering, audio and speech coding, medical imaging, etc.)

• A large part of the power dissipation is due to data transfer and data storage: fetching operands from an off-chip memory for an addition consumes 33 times more power than the computation itself [Catthoor 98]
• The area cost is often largely dominated by memories
Introduction

In the early years of high-level synthesis, memory management tasks were tackled at the scalar level.

Algebraic techniques, similar to those used in modern compilers, make it possible to handle memory management at the non-scalar level.

Requirement: addressing the entire class of affine specifications
• multidimensional signals with (complex) affine indexes
• loop nests whose boundaries are affine functions of the iterators
• conditions: relational and/or logical operators applied to affine functions
Outline • Memory size computation using data dependence analysis • Hierarchical memory allocation based on data reuse analysis • Data-flow driven data partitioning for on/off- chip memories • Conclusions
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++) {
    … A[2i+3j+1][5i+j+2][4i+6j+3] …
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …
  }

How many memory locations are necessary to store the array references A[2i+3j+1][5i+j+2][4i+6j+3] and B[i+k][j+k]?
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

Is the number of locations the number of iterator triplets (i,j,k), that is, 512³? No: distinct triplets may access the same element, e.g., both (i,j,k) = (0,1,1) and (i,j,k) = (1,2,0) access B[1][2].
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

Is it the number of index values (i+k, j+k), that is, 1023² (since 0 <= i+k, j+k <= 1022)? No: some index pairs in this range are never accessed, e.g., B[0][512] is not accessed for any (i,j,k).
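Both failed guesses can be checked mechanically by brute force. A small Python sketch (with the loop bound shrunk from 511 to 16 so that the enumeration stays cheap; the variable names are ours):

```python
N = 16  # small stand-in for the 511 of the slides, so enumeration stays cheap

# All index pairs (i+k, j+k) actually produced by the loop nest:
touched = {(i + k, j + k)
           for i in range(N + 1)
           for j in range(N + 1)
           for k in range(N + 1)}

print(len(touched))          # 817 distinct locations for N = 16
print((N + 1) ** 3)          # 4913 iterator triplets: an overcount
print((2 * N + 1) ** 2)      # 1089 points in the bounding box: also an overcount
```

The distinct-location count sits strictly between neither guess being right: many triplets collide on the same element, yet not every pair in the bounding box is reached.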
Computation of array reference size

… A[2i+3j+1][5i+j+2][4i+6j+3] …

[Figure: the 2-D iterator space (i, j) and the 3-D index space of A[x][y][z], related by x = 2i+3j+1, y = 5i+j+2, z = 4i+6j+3]
Computation of array reference size

Remark: the iterator space may have ``holes’’ too.

for (i=4; i<=8; i++)
  for (j=i-2; j<=i+2; j+=2)
    … C[i+j] …

After loop normalization:

for (i=4; i<=8; i++)
  for (j=0; j<=2; j++)
    … C[2i+2j-2] …

[Figure: the original iterator space (4 <= i <= 8, i-2 <= j <= i+2, j stepping by 2) and the normalized, dense iterator space]
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A[2i+3j+1][5i+j+2][4i+6j+3] …

The index space is the image of the iterator space under the affine mapping:

  [x]   [2 3]         [1]
  [y] = [5 1] · [i] + [2]
  [z]   [4 6]   [j]   [3]

  Iterator space: 0 <= i, j <= 511
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

[Figure: the 3-D iterator space (i, j, k) and the 2-D index space of B[x][y], related by x = i+k, y = j+k]
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

  [x]   [1 0 1]   [i]
  [y] = [0 1 1] · [j]
                  [k]

  Iterator space: 0 <= i, j, k <= 511
Computation of array reference size

Any array reference can be modeled as a linearly bounded lattice (LBL):

  LBL = { x = T·i + u | A·i >= b }

where x = T·i + u is the affine mapping, and { i | A·i >= b } is the iterator space: the polytope defined by the scope of the nested loops and by the iterator-dependent conditions. The LBL is the image of this polytope under the affine mapping.
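As a concrete (if naive) illustration of the LBL model, the sketch below enumerates the points x = T·i + u of an LBL over a given bounding box of the iterators. The helper name `lbl_points` and the enumeration approach are ours; the slides work symbolically rather than by enumeration:

```python
import itertools
import numpy as np

def lbl_points(T, u, A, b, box):
    """Enumerate the points x = T.i + u of a linearly bounded lattice whose
    iterators satisfy A.i >= b, by scanning the given bounding box
    (a list of (lo, hi) ranges, one per iterator). Naive by design: it
    enumerates instead of reasoning symbolically, so small boxes only."""
    T, u, A, b = map(np.asarray, (T, u, A, b))
    points = set()
    for i in itertools.product(*(range(lo, hi + 1) for lo, hi in box)):
        i = np.asarray(i)
        if np.all(A @ i >= b):                       # inside the polytope?
            points.add(tuple(int(v) for v in T @ i + u))
    return points

# The A[2i+3j+1][5i+j+2][4i+6j+3] reference with the bound shrunk to 3:
pts = lbl_points(T=[[2, 3], [5, 1], [4, 6]], u=[1, 2, 3],
                 A=[[1, 0], [-1, 0], [0, 1], [0, -1]], b=[0, -3, 0, -3],
                 box=[(0, 3), (0, 3)])
print(len(pts))   # 16 = 4 x 4 iterator points: this mapping is one-to-one
```

Note that for this reference the number of index points equals the number of iterator points; the next slides show how to decide that property in general.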
Computation of array reference size

The size of the array reference is the size of its index space, which is itself an LBL:

  LBL = { x = T·i + u | A·i >= b } ,   f : Z^n → Z^m ,   f(i) = T·i + u

Is the function f a one-to-one mapping? If YES, then size(index space) = size(iterator space).
Computation of array reference size

  f : Z^n → Z^m ,   f(i) = T·i + u

Any integer matrix T can be brought to the form [Minoux 86]

            [ H 0 ]
  P·T·S  =  [ G 0 ]

where H is a nonsingular lower-triangular matrix, S is a unimodular matrix, and P is a row permutation. When rank(H) = m <= n, H is the Hermite Normal Form.
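The reduction can be sketched in a few lines of Python: a simplified Hermite-style reduction that brings T to lower-triangular form H = T·S by unimodular column operations. It omits the row permutation P and the sign/size normalization of a true Hermite Normal Form, and the name `column_reduce` is ours:

```python
import numpy as np

def column_reduce(T):
    """Bring the integer matrix T to lower-triangular form H = T.S using
    unimodular column operations (Euclid's algorithm across each row).
    Returns (H, S); S is unimodular, so T and H have the same rank."""
    H = np.array(T, dtype=int)
    m, n = H.shape
    S = np.eye(n, dtype=int)
    row, col = 0, 0
    while row < m and col < n:
        # zero out this row to the right of `col` by column operations
        while any(H[row, c] != 0 for c in range(col + 1, n)):
            # move the smallest nonzero entry of the row into column `col`
            c = min((c for c in range(col, n) if H[row, c] != 0),
                    key=lambda c: abs(H[row, c]))
            H[:, [col, c]] = H[:, [c, col]]
            S[:, [col, c]] = S[:, [c, col]]
            for c2 in range(col + 1, n):            # reduce the other columns
                q = H[row, c2] // H[row, col]
                H[:, c2] -= q * H[:, col]
                S[:, c2] -= q * S[:, col]
        if H[row, col] != 0:
            col += 1
        row += 1
    return H, S

# T of the A[2i+3j+1][5i+j+2][4i+6j+3] reference:
T = [[2, 3], [5, 1], [4, 6]]
H, S = column_reduce(T)
print(H.tolist())   # [[1, 0], [-4, 13], [2, 0]]
print(S.tolist())   # [[-1, 3], [1, -2]]
```

The top 2×2 block of H is nonsingular, so rank(H) = 2 = n: the mapping of this reference is one-to-one.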
Computation of array reference size

Case 1: rank(H) = n  ⇒  the function f is one-to-one.

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A[2i+3j+1][5i+j+2][4i+6j+3] …

  With P = I3 and S = [-1  3]
                      [ 1 -2] :

            [ 1  0]
  P·T·S  =  [-4 13]   ⇒   H = [ 1  0] ,   G = [ 2  0 ]
            [ 2  0]           [-4 13]

  rank(H) = 2 = n, so:

  Nr. locations A[ ][ ][ ] = size( 0 <= i, j <= 511 ) = 512 × 512
Computation of array reference size

Case 2: rank(H) < n

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

  With P = I2 and S = [1 0 -1]
                      [0 1 -1]
                      [0 0  1] :

  P·T·S  =  [1 0 0]  =  [ H 0 ] ,   H = I2
            [0 1 0]

With the change of iterators (I, J, K) = (i+k, j+k, k), the constraints { 0 <= i, j, k <= 511 } become { 0 <= I-K, J-K, K <= 511 }. Eliminating K:

  | B[i+k][j+k] | = size( 0 <= I, J <= 1022 , I-511 <= J <= I+511 ) = 784,897
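The count 784,897 can be sanity-checked by direct enumeration of the projected index space (a brute-force check, not the symbolic computation of the slides):

```python
# Enumerate the index space { 0 <= I, J <= 1022 , I - 511 <= J <= I + 511 }
# by summing, for each I, the number of admissible J values:
count = sum(min(1022, I + 511) - max(0, I - 511) + 1 for I in range(1023))
print(count)  # 784897

# Closed form: the 1023 x 1023 box minus the two triangular corners
# cut off by the constraint |I - J| <= 511:
print(1023 ** 2 - 511 * 512)  # 784897
```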
Computation of array reference size

[Figure: the index space of the array reference B[i+k][j+k]]
Computation of array reference size

Computing the size of an integer polytope: the Fourier-Motzkin elimination.

Given an n-dim polytope { a_i·x >= b_i }, separate the inequalities according to the variable x_n:

  1.  x_n >= D_i(x_1, …, x_{n-1})
  2.  x_n <= E_j(x_1, …, x_{n-1})
  3.  0 <= F_k(x_1, …, x_{n-1})

Eliminating x_n yields the (n-1)-dim polytope

  D_i(x_1, …, x_{n-1}) <= E_j(x_1, …, x_{n-1}) ,   0 <= F_k(x_1, …, x_{n-1})

Repeating down to a 1-dim polytope gives the range of x_1; for each value of x_1 in that range, add the size of the corresponding (n-1)-dim polytope.
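The recursion above can be sketched directly in Python: `fm_eliminate` performs one elimination step, and `count_points` projects down to x_1, scans its range, and recurses on each slice. This is a naive exact counter for small systems (the names are ours; production tools use far smarter counting):

```python
def fm_eliminate(rows, k):
    """One Fourier-Motzkin step: eliminate x_k from a system of
    inequalities sum_j a[j]*x[j] >= b, each given as a pair (a, b)."""
    lower = [(a, b) for a, b in rows if a[k] > 0]   # give x_k >= D_i(...)
    upper = [(a, b) for a, b in rows if a[k] < 0]   # give x_k <= E_j(...)
    out = [(a, b) for a, b in rows if a[k] == 0]    # 0 <= F_k(...)
    for al, bl in lower:            # pair every lower with every upper bound
        for au, bu in upper:        # so that the x_k terms cancel
            a = [-au[k] * al[j] + al[k] * au[j] for j in range(len(al))]
            out.append((a, -au[k] * bl + al[k] * bu))
    return out

def count_points(rows, n):
    """Count the integer points of an n-variable polytope by the recursion
    on the slide: project away x_n .. x_2, scan x_1's range, recurse."""
    if n == 0:
        return 1 if all(b <= 0 for _, b in rows) else 0
    proj = rows
    for k in range(n - 1, 0, -1):          # eliminate x_n, ..., x_2
        proj = fm_eliminate(proj, k)
    lo, hi = None, None                    # integer range of x_1
    for a, b in proj:
        if a[0] > 0:
            lo = -(-b // a[0]) if lo is None else max(lo, -(-b // a[0]))
        elif a[0] < 0:
            hi = b // a[0] if hi is None else min(hi, b // a[0])
        elif b > 0:
            return 0                       # constant constraint violated
    if lo is None or hi is None:
        raise ValueError("unbounded polytope")
    total = 0
    for x0 in range(lo, hi + 1):           # slice at x_1 = x0 and recurse
        sliced = [(a[1:], b - a[0] * x0) for a, b in rows]
        total += count_points(sliced, n - 1)
    return total

# The index space of B[i+k][j+k], scaled down (with N = 511 this
# reproduces the 784,897 of the previous slides):
N = 63
rows = [([1, 0], 0), ([-1, 0], -2 * N),      #  0 <= I <= 2N
        ([0, 1], 0), ([0, -1], -2 * N),      #  0 <= J <= 2N
        ([1, -1], -N), ([-1, 1], -N)]        #  |I - J| <= N
print(count_points(rows, 2))                 # 12097
print((2 * N + 1) ** 2 - N * (N + 1))        # closed form: 12097
```

Scanning the range of x_1 and recursing on each slice keeps the count exact over the integers, even though a single Fourier-Motzkin projection is only exact over the rationals: infeasible slices simply contribute zero.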
Memory size computation

#define n 6
for ( j=0; j<n; j++ ) {
  A[j][0] = in0;
  for ( i=0; i<n; i++ )
    A[j][i+1] = A[j][i] + 1;
}
for ( i=0; i<n; i++ ) {
  alpha[i] = A[i][n+i];
  for ( j=0; j<n; j++ )
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
}
for ( j=0; j<n; j++ )
  B[j] = A[j][2*n];
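One low-tech way to see what a memory size computation must deliver for this code is to simulate it: record, for every array element, the time of its write and of its last read, and take the peak number of simultaneously live values. The sketch below does exactly that; the one-statement-per-time-step model and the helper names are ours, while the slides compute the same quantity symbolically, at the polyhedral level:

```python
n, t = 6, 0
birth, death = {}, {}

def write(sig):
    """Each statement instance is one time step; it 'births' its result."""
    global t
    t += 1
    birth[sig] = t

def read(sig):
    death[sig] = t + 1   # operand must still be alive when the write fires

for j in range(n):                         # first loop nest
    write(('A', j, 0))
    for i in range(n):
        read(('A', j, i)); write(('A', j, i + 1))
for i in range(n):                         # second loop nest
    read(('A', i, n + i)); write(('alpha', i))
    for j in range(n):
        read(('A', j, n + i))
        if j >= i:                         # the alpha branch of the ternary
            read(('alpha', i))
        write(('A', j, n + i + 1))
for j in range(n):                         # output loop
    read(('A', j, 2 * n)); write(('B', j))

T_end = t
for sig in birth:                          # never-read values (the outputs
    death.setdefault(sig, T_end)           # in B) stay alive to the end

deltas = [0] * (T_end + 2)                 # +1 at birth, -1 right after death
for sig in birth:
    deltas[birth[sig]] += 1
    deltas[death[sig] + 1] -= 1

peak = occupancy = 0
for d in deltas:
    occupancy += d
    peak = max(peak, occupancy)
print(peak, len(birth))    # peak simultaneous storage vs. total scalars (90)
```

The gap between the two printed numbers is the whole point of memory size computation: the 90 scalars produced by the code never need 90 locations, because most values die shortly after they are written.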
Memory size computation

Decompose the LBLs of the array references into non-overlapping pieces.

Intersection of two LBLs:

  LBL1 = { x = T1·i1 + u1 | A1·i1 >= b1 }
  LBL2 = { x = T2·i2 + u2 | A2·i2 >= b2 }

Setting T1·i1 + u1 = T2·i2 + u2 yields a Diophantine system of equations; its solutions, constrained by { A1·i1 >= b1 , A2·i2 >= b2 }, define a new polytope: the intersection of the two LBLs.
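On the illustrative code (n = 6), the overlap between the store A[j][i+1] of the first loop nest and the load A[j][n+i] of the second one can be found by brute-force enumeration instead of solving the Diophantine system symbolically (an illustration only; the names are ours):

```python
# Index spaces of the two references to A, enumerated directly:
n = 6
lbl1 = {(j, i + 1) for i in range(n) for j in range(n)}   # stores: columns 1..6
lbl2 = {(j, n + i) for i in range(n) for j in range(n)}   # loads: columns 6..11
overlap = lbl1 & lbl2
print(sorted(overlap))   # only the column A[*][6] is both written and read here
```

Exactly this overlap (and its complement in each reference) is what the non-overlapping decomposition must expose, so that the lifetime of every piece can be tracked separately.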
Memory size computation

Keeping minimal the set of inequalities in the LBL intersection:

for ( i=0; i<n; i++ )
  for ( j=0; j<n; j++ )
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];

The iterator space of the first branch,

  { 0 <= i <= n-1 , 0 <= j <= n-1 , j+1 <= i }   (5 inequalities)

reduces to

  { 0 <= j , i <= n-1 , j+1 <= i }   (3 inequalities)
Memory size computation

Keeping minimal the set of inequalities in the LBL intersection: the decomposition theorem of polyhedra [Motzkin 1953]

  Polyhedron = { x | C·x = d , A·x >= b }
             = { x | x = V·α + L·β + R·γ ,  α, γ >= 0 ,  Σ αi = 1 }

(the columns of V, L, and R are the vertices, lines, and extreme rays of the polyhedron)
Memory size computation LBL’s of signal A (illustrative example)
Polyhedral data-dependence graphs

[Figure: the data-dependence graph at granularity level 0 and at granularity level 1]
[Figure: the data-dependence graph at granularity level 2, and the scalar-level data-dependence graph]
Polyhedral data-dependence graph

[Figure: the polyhedral data-dependence graph of the motion detection algorithm [Chan 93], annotated with the number of scalars per node and the number of dependencies per arc]
Memory size computation

[Figure: memory size variation during the motion detection algorithm]
Memory size computation

To handle high-throughput applications:
• Extract the (largely hidden) parallelism from the initially specified code
• Find the lowest degree of parallelism that meets the throughput/hardware requirements
• Perform memory size computation for code with explicit parallelism instructions
Hierarchical memory allocation

A large part of the power dissipation in data-dominated applications is due to data transfers and data storage.

Power cost reduction: introduce a memory hierarchy that exploits the temporal locality of the data accesses.

  Power dissipation = f( memory size , access frequency )
Hierarchical memory allocation

  Power dissipation = f( memory size , access frequency )

[Figure: heavily used data is copied into a layer of small memories, backed by a layer of large memories]
Hierarchical memory allocation

Trade-offs between a hierarchical and a non-hierarchical distribution:
• Lower power consumption: heavily used data is accessed from smaller memories
• Higher power consumption: additional transfers are needed to store copies of data
• Larger area: additional area overhead (copies of data, addressing logic)
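The trade-off can be made concrete with a toy model. The numbers and the per-access energy model below are invented for this sketch (real memory models are far more detailed); the point is only that copying a heavily reused block into a small memory pays off once the block is accessed often enough:

```python
# Hypothetical per-access energy model: energy grows with memory size,
# here arbitrarily as sqrt(size), in arbitrary units.
def energy(size):
    return size ** 0.5

big, small = 1024 * 1024, 1024   # memory sizes in words (invented)
accesses_total = 10_000          # accesses to one heavily reused block

# Copying the block to the small memory: read each word from the big
# memory and write it into the small one.
copy_cost = small * (energy(big) + energy(small))

flat = accesses_total * energy(big)                        # no hierarchy
hierarchical = copy_cost + accesses_total * energy(small)  # with the copy
print(flat, hierarchical)   # copying wins once the block is reused enough
```

With few accesses the copy overhead dominates and the flat organization wins; the break-even access count is exactly what the data reuse exploration of the next slides has to find, per candidate copy.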
Hierarchical memory allocation

Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:

1. Data reuse exploration: which intermediate copies of data are necessary for accessing data in a power- and area-efficient way?
2. Memory allocation & assignment: a distributed (hierarchical) memory architecture (memory layers, memory size/ports/address logic, signal-to-memory and signal-to-port assignment)
Hierarchical memory allocation

Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:

1. Data reuse exploration: the array partitions considered as copy candidates are the LBLs obtained from the recursive intersection of the array references
2. Memory allocation & assignment:

  Cost = a · Σ P_read/write( N_bits , N_words , f_read/write )
       + b · Σ Area( N_bits , N_words , N_ports , technology )
Partitioning for on/off-chip memories

[Figure: the CPU's memory address space is split between an on-chip SRAM (1-cycle access, like the cache) and an off-chip DRAM (10-20 cycles per access)]

Goal: an optimal data mapping to the SRAM/DRAM that maximizes the performance of the application.
Partitioning for on/off-chip memories

Total conflict factor: the total number of array accesses exposed to cache conflicts; it ranks the arrays by the importance of mapping them to the on-chip SRAM.

The polyhedral data-dependence graph provides precise information about the relative lifetimes of the different parts of the arrays.
Conclusions

• Algebraic techniques are powerful non-scalar instruments in the memory management of multimedia signal processing
• Data-dependence analysis at the polyhedral level is useful in many memory management tasks:
  - memory size computation for behavioral specifications
  - hierarchical memory allocation
  - data partitioning between on- and off-chip memories

The End