Embedded Computer Architecture 2 Block A Introduction
The organisation of the course • 8 sessions • Examination • Lecturers • Jan Kuper (Zilverling 4102, telephone: 3785) j.kuper@utwente.nl • André Kokkeler (Zilverling 4096, telephone: 4291) • Course Material • Powerpoint presentations on Teletop • Presentations will differ from last year
The aims of the course • Show the relation between the algorithm and the architecture. • Derive the architecture from the algorithm (if possible) by means of transformations • Derive architectural requirements from the algorithm (by means of transformations)
The aims of the course [Figure: course overview diagram with the labels Applications, Algorithms, subset ECA2, Transformations, Analysis methods, Design, Architecture, Platform (HW/SW); two parts of +/- 4 sessions each]
The design process A design description may express: • Behavior: Expresses the relation between the input and the output value-streams of the system • Structure: Describes how the system is decomposed into subsystems and how these subsystems are connected • Geometry: Describes where the different parts are located.
Abstraction levels • Behavior: Application, Algorithm, Basic operator, Boolean logic, Physical level • Geometry: Board level, Layout, Cell • Structure: Block level, Processing element, Basic block, Transistor
Specification overloading • Specification overloading means that the specification gives a possibly unwanted implementation suggestion, i.e. the behavioral specification expresses structure. • In practice, a behavioral specification always contains structure.
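Example: same function, same behavior; different expressions suggest different structures and hence different designs. [Figure: two dataflow graphs that compute the same output z from the inputs a and b with the constant 2; the first uses one multiplier and one adder, the second uses two multipliers and one adder]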
Our focus • Array processors. • Systolic arrays. • Architectures for embedded algorithms such as digital signal processing algorithms.
Array processor An array processor is a structure in which identical processing elements (PEs) are arranged regularly. [Figure: PE arrays in 1 dimension and in 2 dimensions]
Array processor [Figure: PE array in 3 dimensions]
Systolic array In a systolic array processor all communication paths contain at least one unit delay (register). [Figure: 2-dimensional PE array in which every connection carries a register or delay] Delay constraints are local. Therefore the array can be extended without limit and without changing the cells.
Array Processors • Can be approached from: • Application • Algorithm • Architecture • Technology • We will focus on • Algorithm → Architecture • Derive the architecture from the algorithm
Array processors: Application areas • Speech processing • Image processing (video, medical, ...) • Radar • Weather • Medical signal processing • Geology • ... These applications perform many simple calculations on a lot of data in a short time. General purpose processors do not provide sufficient processing power.
Example video processing • 1000 operations per pixel (is not that much) • 1024 x 1024 pixels per frame (high density TV) • 50 frames per second (100 Hz TV) • approx. 50 G operations per second • < 1 Watt available • Pentium 2 GHz: 2 G operations per second • > 30 Watt • required: 25 Pentiums, 750 Watt
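The arithmetic behind these numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `video_budget` and its parameters are made up for illustration, and the slide rounds 1024 x 1024 pixels to one million, which is why it arrives at 50 G operations, 25 processors and 750 Watt instead of the slightly larger exact values.

```python
def video_budget(ops_per_pixel, width, height, fps, cpu_ops_per_s, cpu_watt):
    """Return (required ops/s, number of processors, total power in Watt)."""
    required = ops_per_pixel * width * height * fps
    cpus = required / cpu_ops_per_s      # fractional number of processors
    return required, cpus, cpus * cpu_watt


# Numbers taken from the slide: 2 G operations per second and 30 W per Pentium.
required, cpus, watt = video_budget(1000, 1024, 1024, 50, 2_000_000_000, 30)
print(f"{required / 1e9:.1f} G ops/s, {cpus:.1f} processors, {watt:.0f} W")
# prints: 52.4 G ops/s, 26.2 processors, 786 W
```

Either way the power budget of less than 1 Watt is exceeded by almost three orders of magnitude, which is the point of the example.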
Description of the algorithms • In practice the algorithms are described (specified) in: • some programming language. • In our (toy) examples we use: • programming languages • algebraic descriptions
Examples of algorithms we will use: • Filter • Matrix algebra • Transformations like the Fourier transform and the Z transform • Sorting • ...
Graphs • Graphs are applicable for describing • behavior • structure • Dependency graphs consist of: • nodes expressing operations or functions • edges expressing data dependencies or the flow of data • Graphs are our vehicles to describe the design flow from algorithm to architecture
Design flow: idea → program (imperative) → single assignment code (functional) → recurrent relations → dependency graph → signal flow graph
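Example: Sorting: the idea [Figure: the element 8 is inserted into the sorted sequence 12 10 9 8 5 3 2 1. An empty place is needed, so the elements behind the insertion point are shifted one position, giving 12 10 9 8 8 5 3 2 1.]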
[Figure: swapping the incoming value x with the array element m_j, between its neighbours m_{j-1} and m_{j+1}, in three steps using a temporary y: y := m_j; m_j := x; x := y]
Sorting: inserting one element
Identical descriptions of swapping:
    if (x >= m[j]) { y = m[j]; m[j] = x; x = y; }
    if (x >= m[j]) swap(m[j], x);
    m[j], x = MaxMin(m[j], x);
Inserting an element into a sorted array of i elements such that the order is preserved:
    m[i] = -infinity;
    for (j = 0; j < i+1; j++) { m[j], x = MaxMin(m[j], x); }
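The insertion step can be tried out directly. The following is a minimal Python sketch, not the lecturers' code: `max_min` and `insert` are names chosen here, and `-math.inf` stands in for the "minus infinity" that marks the empty place.

```python
import math


def max_min(a, b):
    """Return the pair (larger, smaller)."""
    return (a, b) if a >= b else (b, a)


def insert(m, i, x):
    """Insert x into m[0:i], which must be sorted in descending order.

    m needs room for at least i+1 entries; m[i] is the empty place.
    """
    m[i] = -math.inf                    # the empty place holds minus infinity
    for j in range(i + 1):
        m[j], x = max_min(m[j], x)      # keep the larger, carry the smaller on
    return m


print(insert([10, 9, 5, 3, None], 4, 8))   # prints: [10, 9, 8, 5, 3]
```

Every position is visited exactly once and the same operation is applied everywhere, which is what later makes the loop body a candidate for a processing element.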
Sorting: The program
Sorting N elements in an array is composed of N times inserting an element into a sorted array (of i elements at step i) such that the order is preserved. An empty array is ordered.
input:
    int in[0:N-1], x[0:N-1], m[0:N-1], out[0:N-1];
    for (int i = 0; i < N; i++) { x[i] = in[i]; m[i] = -infinity; }
body:
    for (int i = 0; i < N; i++)
      { for (int j = 0; j < i+1; j++)
        { m[j], x[i] = MaxMin(m[j], x[i]); } }
output:
    for (int j = 0; j < N; j++) { out[j] = m[j]; }
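For readers who want to run the program, here is a hedged Python transcription of the three parts (input, body, output). The function name `sort_descending` is an assumption of this sketch; the loop structure follows the slide.

```python
import math


def max_min(a, b):
    """Return the pair (larger, smaller)."""
    return (a, b) if a >= b else (b, a)


def sort_descending(values):
    n = len(values)
    # input: x[i] = in[i]; m[i] = -infinity
    x = list(values)
    m = [-math.inf] * n
    # body: insert x[i] into the sorted prefix m[0:i]
    for i in range(n):
        for j in range(i + 1):
            m[j], x[i] = max_min(m[j], x[i])
    # output: out[j] = m[j]
    return m


print(sort_descending([3, 1, 5, 2]))   # prints: [5, 3, 2, 1]
```

Note that `m[j]` and `x[i]` are overwritten many times during the run. That is exactly what the single assignment transformation removes.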
Sorting: Towards ‘Single assignment’ • Single assignment: Each scalar variable is assigned only once • Why? • Goal is a data dependency graph, with • nodes expressing operations or functions • edges expressing data dependencies or the flow of data
Sorting: Towards ‘Single assignment’ Single assignment: Each scalar variable is assigned only once. Why? Code: x = a+b; x = c*d; Nodes: a "+" node with inputs a and b and output x, and a "*" node with inputs c and d and output x. Graph: how do you connect these?
Sorting: Towards ‘Single assignment’ Single assignment: Each scalar variable is assigned only once. Why? Code: x = a+b; x = c*d; This description is already optimized towards implementation (memory optimization). But fundamentally you produce two different values, e.g. x1 and x2.
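The renaming of x into x1 and x2 can be mechanised. The sketch below is an illustration only: the statement format `(target, op, operands)` and the function `to_single_assignment` are assumptions made here, and the simple numbering scheme does not guard against a name such as `x1` already being in use.

```python
def to_single_assignment(statements):
    """Rename targets so that every name is assigned exactly once.

    statements: list of (target, op, (operand, operand)) tuples.
    Operands that refer to an earlier target are renamed to its latest version.
    """
    version = {}    # name -> number of assignments seen so far
    current = {}    # name -> latest versioned name
    result = []
    for target, op, operands in statements:
        operands = tuple(current.get(name, name) for name in operands)
        version[target] = version.get(target, 0) + 1
        new_name = f"{target}{version[target]}"
        current[target] = new_name
        result.append((new_name, op, operands))
    return result


code = [("x", "+", ("a", "b")), ("x", "*", ("c", "d"))]
print(to_single_assignment(code))
# prints: [('x1', '+', ('a', 'b')), ('x2', '*', ('c', 'd'))]
```

After renaming, every value has exactly one producer, so each tuple can serve as one node of the dependency graph and each operand as one incoming edge.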
Sorting: Towards ‘Single assignment’
Single assignment: Each scalar variable is assigned only once.
Start with m[j]: m[j] at loop index i depends on the value at loop index i-1.
    for (int i = 0; i < N; i++)
      { for (j = 0; j < i+1; j++)
        { m[j], x[i] = MaxMin(m[j], x[i]); } }
hence,
    for (int i = 0; i < N; i++)
      { for (j = 0; j < i+1; j++)
        { m[i,j], x[i] = MaxMin(m[i-1,j], x[i]); } }
Sorting: Towards ‘Single assignment’
x[i] at loop index j depends on the value at loop index j-1.
    for (int i = 0; i < N; i++)
      { for (j = 0; j < i+1; j++)
        { m[i,j], x[i] = MaxMin(m[i-1,j], x[i]); } }
hence,
    for (int i = 0; i < N; i++)
      { for (j = 0; j < i+1; j++)
        { m[i,j], x[i,j] = MaxMin(m[i-1,j], x[i,j-1]); } }
Sorting: The algorithm in ‘single assignment’
input:
    int in[0:N-1], out[0:N-1], x[0:N-1,-1:N-1], m[-1:N-1,0:N-1];
    for (int i = 0; i < N; i++) { x[i,-1] = in[i]; m[i-1,i] = -infinity; }
body:
    for (int i = 0; i < N; i++)
      { for (j = 0; j < i+1; j++)
        { m[i,j], x[i,j] = MaxMin(m[i-1,j], x[i,j-1]); } }
output:
    for (int j = 0; j < N; j++) { out[j] = m[N-1,j]; }
All scalar variables are assigned only once. The algorithm satisfies the single assignment property.
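The single assignment property can be checked mechanically. In the Python sketch below, dictionaries keyed by `(i, j)` stand in for the two-dimensional arrays (which conveniently allows the index -1), and a helper `assign`, which is an addition of this sketch, raises an error if any element is written twice.

```python
import math


def max_min(a, b):
    """Return the pair (larger, smaller)."""
    return (a, b) if a >= b else (b, a)


def assign(table, key, value):
    """Write table[key] once; a second write violates single assignment."""
    if key in table:
        raise ValueError(f"{key} assigned twice")
    table[key] = value


def sort_single_assignment(values):
    n = len(values)
    x = {}   # x[i, j] for 0 <= i < n and -1 <= j <= i
    m = {}   # m[i, j] for -1 <= i < n and 0 <= j < n

    # input
    for i in range(n):
        assign(x, (i, -1), values[i])
        assign(m, (i - 1, i), -math.inf)

    # body
    for i in range(n):
        for j in range(i + 1):
            new_m, new_x = max_min(m[i - 1, j], x[i, j - 1])
            assign(m, (i, j), new_m)
            assign(x, (i, j), new_x)

    # output
    return [m[n - 1, j] for j in range(n)]


print(sort_single_assignment([3, 1, 5, 2]))   # prints: [5, 3, 2, 1]
```

The price of single assignment is memory: the one-dimensional m and x have become two-dimensional. In return, every value now has a unique name and a unique producer.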
Sorting: Recurrent relation
A description in single assignment can be directly translated into a recurrent relation:
declaration: in[0:N-1], out[0:N-1], x[0:N-1, -1:N-1], m[-1:N-1, 0:N-1]
input: x[i,-1] = in[i]; m[i-1,i] = -infinity
body: m[i,j], x[i,j] = MaxMin(m[i-1,j], x[i,j-1])
output: out[j] = m[N-1,j]
area: 0 <= i < N; 0 <= j < i+1
Notice that the order of these relations is arbitrary.
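Sorting: Body in two dimensions body: m[i,j], x[i,j] = MaxMin(m[i-1,j], x[i,j-1]) The body is executed for all i and j, hence two dimensions. [Figure: one MaxMin node at position (i, j), with i running downward and j running to the right; m[i-1,j] enters from above and x[i,j-1] from the left; m[i,j] leaves at the bottom and x[i,j] at the right]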
Variable naming and index assignment • A variable associated to an arrow gets the indices of the processing element that delivers its value. • Local constants get the indices of the processing element that they are in. [Figure: processing element PE_{i,j} at position (i, j) with incoming values c_{i-1,j}, b_{i-1,j-1} and a_{i,j-1}, outgoing values a_{i,j}, b_{i,j} and c_{i,j}, and local constant v_{i,j}]
Sorting: Body implementation
body: m[i,j], x[i,j] = MaxMin(m[i-1,j], x[i,j-1])
    if (m[i-1,j] <= x[i,j-1])
      { m[i,j] = x[i,j-1]; x[i,j] = m[i-1,j]; }
    else
      { m[i,j] = m[i-1,j]; x[i,j] = x[i,j-1]; }
[Figure: the cell built from a comparator and two multiplexers with inputs labelled 0 and 1, selecting m[i,j] and x[i,j] from m[i-1,j] and x[i,j-1]]
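Sorting: Implementation (N = 4), PE = MaxMin [Figure: triangular array of PEs, with i running downward and j running to the right. Row i contains the PEs for j = 0..i. The input x[i,-1] enters row i from the left; the initial values m[-1,0], m[0,1], m[1,2] and m[2,3] enter the diagonal PEs from above; the outputs m[3,0], m[3,1], m[3,2] and m[3,3] leave at the bottom.]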
Sorting: Example (N = 4) Input: 3 1 5 2 Output: 5 3 2 1
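The N = 4 example can be traced through the triangular array row by row. The following Python sketch models one processing element as the function `pe` and wires the rows together in `run_array`; both names are chosen here, and it is an assumption of this sketch that the inputs 3, 1, 5, 2 enter the rows in that order.

```python
import math


def pe(m_in, x_in):
    """One MaxMin cell: returns (m_out, x_out)."""
    if m_in <= x_in:
        return x_in, m_in
    return m_in, x_in


def run_array(inputs):
    """Return the m outputs of every row of the triangular PE array."""
    rows = []
    above = []                           # m values delivered by the previous row
    for i, x in enumerate(inputs):       # row i has PEs for j = 0..i
        m_in = above + [-math.inf]       # the diagonal PE receives -infinity
        row = []
        for j in range(i + 1):
            m_out, x = pe(m_in[j], x)    # x travels left to right along the row
            row.append(m_out)
        rows.append(row)
        above = row
    return rows


for row in run_array([3, 1, 5, 2]):
    print(row)
# prints:
# [3]
# [3, 1]
# [5, 3, 1]
# [5, 3, 2, 1]
```

Each row holds the sorted result of the inputs seen so far, and the last row is the output of the array.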
Dependency Graphs and Signal Flow Graphs • The array processor describes: • the way in which the processors are arranged and • the way in which the data is communicated between the processing elements. Hence, the graph describes the dependencies of the data that is communicated. Said differently: the graph describes the way in which the data values at the outputs of a processing element depend on the data at the outputs of the other processing elements. So we may consider it as a Dependency Graph.
Recurrent relations For simple algorithms the transformation from single assignment code to a recurrent relation is simple. • Questions to answer: • How do recurrent relations influence the dependency graph? • How can recurrent relations be manipulated such that the behavior remains the same and the structure of the dependency graph is changed? We will answer these questions by means of an example: Matrix-Vector multiplication.
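Matrix Vector multiplication: $c_i = \sum_{j=0}^{N-1} a_{i,j} b_j$ for $0 \le i < K$. Recurrent relations: $s_{i,-1} = 0$; $s_{i,j} = s_{i,j-1} + a_{i,j} b_j$ for $0 \le j < N$; $c_i = s_{i,N-1}$. Alternative (because $+$ is associative): $s_{i,N} = 0$; $s_{i,j} = s_{i,j+1} + a_{i,j} b_j$ for $0 \le j < N$; $c_i = s_{i,0}$.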
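Matrix Vector multiplication The basic cell is described by: $s_{i,j} = s_{i,j-1} + a_{i,j} b_j$. We have two indices i and j, so the dependency graph can be described as a two-dimensional array. [Figure: a PE at position (i, j), with i running downward and j running to the right; it holds a_{i,j}, receives b_j from above and s_{i,j-1} from the left, multiplies a_{i,j} by b_j, adds the product to s_{i,j-1} and delivers s_{i,j} to the right]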
DG-1 of the Matrix Vector multiplication (K = 4, N = 3) [Figure: 4 x 3 array of PEs. b_0, b_1 and b_2 are fed to all PEs of their column. Row i starts with s_{i,-1} = 0 at the left and delivers s_{i,2} = c_i at the right.] b_0, b_1 and b_2 are global dependencies. Therefore this graph is called a globally recursive graph.
DG-2 of the Matrix Vector multiplication (K = 4, N = 3) [Figure: 4 x 3 array of PEs. b_0, b_1 and b_2 are fed to all PEs of their column. Row i starts with s_{i,3} = 0 at the right and delivers c_i = s_{i,0} at the left.]
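Recurrent relations: Conclusion Summing from $j = 0$ upward results in $s_{i,j} = s_{i,j-1} + a_{i,j} b_j$; summing from $j = N-1$ downward results in $s_{i,j} = s_{i,j+1} + a_{i,j} b_j$. An associative operation such as the addition in this sum thus results in two different recurrent relations and in two different dependency graphs. Other associative operations are for example ‘AND’ and ‘OR’.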
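Both recurrent relations for $c_i = \sum_{j=0}^{N-1} a_{i,j} b_j$ can be written down and compared. This is a hedged Python sketch with function names chosen here: `matvec_forward` accumulates from $s_{i,-1} = 0$ with $s_{i,j} = s_{i,j-1} + a_{i,j} b_j$, and `matvec_backward` accumulates from $s_{i,N} = 0$ with $s_{i,j} = s_{i,j+1} + a_{i,j} b_j$.

```python
def matvec_forward(a, b):
    """c[i] = s[i][N-1], accumulating from j = 0 upward."""
    n = len(b)
    c = []
    for row in a:
        s = 0                              # s[i][-1] = 0
        for j in range(n):
            s = s + row[j] * b[j]          # s[i][j] = s[i][j-1] + a[i][j]*b[j]
        c.append(s)
    return c


def matvec_backward(a, b):
    """c[i] = s[i][0], accumulating from j = N-1 downward."""
    n = len(b)
    c = []
    for row in a:
        s = 0                              # s[i][N] = 0
        for j in range(n - 1, -1, -1):
            s = s + row[j] * b[j]          # s[i][j] = s[i][j+1] + a[i][j]*b[j]
        c.append(s)
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # K = 4, N = 3
b = [1, 2, 3]
print(matvec_forward(a, b))    # prints: [14, 32, 50, 68]
print(matvec_backward(a, b))   # prints: [14, 32, 50, 68]
```

With integers the two orders give identical results. With floating-point numbers addition is not exactly associative, so the two dependency graphs may differ in the last bits.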
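Changing global data dependencies into local data dependencies $c_i = \sum_{j=0}^{N-1} a_{i,j} b_j$ Global data dependencies resist manipulating the dependency graph. [Figure: with global data dependencies, b_j is fed directly to every PE in column j; with local data dependencies, each PE receives d_{i-1,j} from the PE above it, computes s_{i,j} and passes the value on]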
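Changing global data dependencies into local data dependencies (K = 4, N = 3) So the matrix-vector multiplication $c_i = \sum_{j=0}^{N-1} a_{i,j} b_j$ becomes the relations: $d_{-1,j} = b_j$; $d_{i,j} = d_{i-1,j}$; $s_{i,-1} = 0$; $s_{i,j} = s_{i,j-1} + a_{i,j} d_{i-1,j}$; $c_i = s_{i,N-1}$. [Figure: locally recursive graph, a 4 x 3 array of PEs. b_0 = d_{-1,0}, b_1 = d_{-1,1} and b_2 = d_{-1,2} enter at the top and are passed down from PE to PE as d_{0,j}, d_{1,j}, ... Row i starts with s_{i,-1} = 0 at the left and delivers s_{i,2} = c_i at the right.]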
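Alternative transformation from global data dependencies to local data dependencies $c_i = \sum_{j=0}^{N-1} a_{i,j} b_j$ [Figure: with global data dependencies, b_j is fed directly to every PE in column j; with the alternative local data dependencies, b_j enters from below and each PE delivers d_{i,j} upward while computing s_{i,j}]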
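Changing global data dependencies into local data dependencies (K = 4, N = 3) So the alternative locally recursive graph becomes, for $c_i = \sum_{j=0}^{N-1} a_{i,j} b_j$, the relations: $d_{K,j} = b_j$; $d_{i,j} = d_{i+1,j}$; $s_{i,-1} = 0$; $s_{i,j} = s_{i,j-1} + a_{i,j} d_{i+1,j}$; $c_i = s_{i,N-1}$. [Figure: 4 x 3 array of PEs. b_0 = d_{4,0}, b_1 = d_{4,1} and b_2 = d_{4,2} enter at the bottom and are passed up from PE to PE as d_{2,j}, d_{1,j}, ... Row i starts with s_{i,-1} = 0 at the left and delivers s_{i,2} = c_i at the right.]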
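The two locally recursive variants can be simulated to confirm that they still compute the matrix-vector product. The Python sketch below is an illustration under assumptions: the function names are chosen here, dictionaries keyed by `(i, j)` play the role of the edges of the graph, and d is taken to be an unchanged copy of b_j handed from one PE to its vertical neighbour, entering at row -1 in the first variant and at row K in the second.

```python
def matvec_local_down(a, b):
    """b_j enters at the top (d[-1, j]) and is passed down the column."""
    k, n = len(a), len(b)
    d = {(-1, j): b[j] for j in range(n)}
    s = {(i, -1): 0 for i in range(k)}
    for i in range(k):                      # top row first
        for j in range(n):
            d[i, j] = d[i - 1, j]           # pass b_j on to the PE below
            s[i, j] = s[i, j - 1] + a[i][j] * d[i - 1, j]
    return [s[i, n - 1] for i in range(k)]


def matvec_local_up(a, b):
    """b_j enters at the bottom (d[K, j]) and is passed up the column."""
    k, n = len(a), len(b)
    d = {(k, j): b[j] for j in range(n)}
    s = {(i, -1): 0 for i in range(k)}
    for i in range(k - 1, -1, -1):          # bottom row first
        for j in range(n):
            d[i, j] = d[i + 1, j]           # pass b_j on to the PE above
            s[i, j] = s[i, j - 1] + a[i][j] * d[i + 1, j]
    return [s[i, n - 1] for i in range(k)]


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # K = 4, N = 3
b = [1, 2, 3]
print(matvec_local_down(a, b))   # prints: [14, 32, 50, 68]
print(matvec_local_up(a, b))     # prints: [14, 32, 50, 68]
```

In both variants every PE reads only values delivered by a direct neighbour. The results are the same, but the order in which the rows can be evaluated is reversed, which is what distinguishes the two dependency graphs.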
Dependency Graphs Conclusions: • Associative operations give two alternative DGs. • Transformation from global to local dependencies gives two alternative DGs. • Input, output and intermediate edges will be treated separately.