300 likes | 432 Views
Efficient Radar Processing Via Array and Index Algebras. Lenore R. Mullin, Daniel J. Rosenkrantz, and Harry B. Hunt III, Xingmin Luo University at Albany, SUNY NSF CCR 0105536. Outline. Overview Motivation Radar Software Processing: to exceed 1 x 10 12 ops/second
E N D
Efficient Radar Processing Via Array and Index Algebras Lenore R. Mullin, Daniel J. Rosenkrantz, and Harry B. Hunt III, Xingmin Luo University at Albany, SUNY NSF CCR 0105536
Outline • Overview • Motivation • Radar Software Processing: to exceed 1 x 1012 ops/second • The Mapping Problem: • Efficient Use of Memory Hierarchy, Portable, Scalable, … • Radar uses Linear and Multi-linear Operators: • Array Based Operations • Array Operations Require Array Algebra and Index Calculus • Array Algebra: MoA and Index Calculus: Psi Calculus • Reshape to use Processor/Memory Hierarchy Efficiently: Lift Dimension • High-Level Monolithic Operations: Remove Temporaries • Time Domain Convolution • Benefits of Using MoA and Psi Calculus
Levels of Processor/Memory Hierarchy • Can be Modeled by Increasing Dimensionality of Data Array. • Additional dimension for each level of the hierarchy. • Envision data as reshaped to reflect increased dimensionality. • Calculus automatically transforms algorithm to reflect reshaped data array. • Data, layout, data movement, and scalarization automatically generated based on reshaped data array.
Levels of Processor/Memory Hierarchycontinued • Math and indexing operations in same expression • Framework for design space search • Rigorous and provably correct • Extensible to complex architectures Mathematics of Arrays y= conv (x) Map Approach Example: “raising” array dimensionality < 0 1 2 > x: < 0 1 2 … 35 > < 3 4 5 > P0 Main Memory < 6 7 8 > < 9 10 11 > L2 Cache < 12 13 14 > L1 Cache < 15 16 17 > P1 Memory Hierarchy Map: < 18 19 20 > < 21 22 23 > < 24 25 26 > < 27 28 29 > P2 Parallelism < 30 31 32 > < 33 34 35 >
Application DomainSignal Processing Pulse Compression Doppler Filtering Beamforming Detection 3-d Radar Data Processing Composition of Monolithic Array Operations Convolution Matrix Multiply Change algorithmto better match hardware/memory/communication. Lift dimension algebraically • Hardware Info: • - Memory • - Processor Algorithm is Input Architectural Information is Input Model processors(dim=dim+1); Model time-variance(dim=dim+1); Model Level 1 cache(dim=dim+1) Model All Three: dim=dim+3
Current Abstraction Approaches Some Modern Programming Languages with Monolithic Arrays C++ w/classesfunctions, templates Fortran 95 ZPL MATLAB PETE AST Preprocessor Loop transformationsTheories Grammar Changes CompilerAST OptimizationsGrammar Changes StandardCompilerOptimizations Interpreted Compiled PartialAlgebras Requireshighly skilledprogrammers Blas, Linpack, LAPACK, SCALAPACK ATLAS PVL, Blitz++,MTL Libraries Libraries High Performance Fine Tune Scalable/Portable Classical CompilerTechnology &Optimization Even when operations compose, they don’t compose, X(YZ) without temporary arrays
Outline • Overview • Array Algebra: MoA and Index Calculus: Psi Calculus • Reshape to use Processor/Memory Hierarchy Efficiently: Lift Dimension • High-Level Monolithic Operations: Remove Temporaries • Time Domain Convolution • Benefits of Using MoA and Psi Calculus
PSI Calculus • Basic Properties: • Index calculus: Centers around psi function. • Shape polymorphic functions and operators: • Operations are defined using shapes and psi. • Fundamental type is the array modeled as (shape_vector, components). • scalars are 0-dimensional arrays, that is: (empty_vector, scalar value). • Denotational Normal Form(DNF) = reduced form in Cartesian coordinates (independent of data layout: row major, column major, regular sparse, …) • Operational Normal Form(ONF) = reduced form for 1-d memory layout(s).
Psi Reduction This becomes by “psi” Reduction A=cat(rev(B), rev(C)) A[i]=B[B.size-1-i] if 0≤ i < B.size A[i]=C[C.size+B.size-1-i] if B.size ≤ i < B.size+C.size) ONF has minimum number of reads/writes PSI Calculus rules applied mechanically to produce ONF which is easily translated to optimal loop implementation
Convolution: PSI Calculus Description < 0 0 1 2 . . . > < 0 1 2 3 . . . > < 1 2 3 4 . . . > < 0 0 1 > < 0 1 2 > < 1 2 3 > < 0 0 7 > < 0 6 14 > < 5 12 21 > Definition of y=conv(h,x) y[n]= where x‘has N elements, h has M elements, 0≤n<N+M-1, andx’ is x padded by M-1 zeros on either end Psi Calculus Algorithm step Algorithm and PSI Calculus Description x= < 1 2 3 4 > h= < 5 6 7 > Initial step x= < 1 2 3 4 > h= < 5 6 7 > Form x’ x’= x’=cat(reshape(<k-1>, <0>), cat(x, reshape(<k-1>,<0>)))= < 0 0 1 . . . 4 0 0 > rotate x’ (N+M-1) times x’ rot= x’ rot=binaryOmega(rotate,0,iota(N+M-1), 1 x’) take the size of h part of x’rot x’ final= x’ final=binaryOmega(take,0,reshape<N+M-1>,<M>,=1,x’ rot Prod= multiply Prod=binaryOmega (*,1, h,1,x’ final) sum Y=unaryOmega (sum, 1, Prod) < 7 20 38 . . . > Y= PSI Calculus operators compose to form higher level operations
Experimental Platform and Method Hardware • DY4 CHAMP-AV Board • Contains 4 MPC7400’s and 1 MPC 8420 • MPC7400 (G4) • 450 MHz • 32 KB L1 data cache • 2 MB L2 cache • 64 MB memory/processor Software • VxWorks 5.2 • Real-time OS • GCC 2.95.4 (non-official release) • GCC 2.95.3 with patches for VxWorks • Optimization flags: -O3 -funroll-loops -fstrict-aliasing Method • Run many iterations, report average, minimum, maximum time • From 10,000,000 iterations for small data sizes, to 1000 for large data sizes • All approaches run on same data • Only average times shown here • Only one G4 processor used • Use of the VxWorks OS resulted in very low variability in timing • High degree of confidence in results
Experiment: Conv(x,h) • Cost of temporaries in regular C++ approach more pronounced due to large number of operations • Cost of expression tree manipulation also more pronounced
Convolution and Dimension Lifting • Model Processor and Level 1 cache. • Start with 1-d inputs(input dimension). • Envision 2nd dimension ranging over output values. • Envision Processors • Reshaped into a 3rd dimension. • The 2nd dimension is partitioned. • Envision Cache • Reshaped into a 4th dimension. • The 1st dimension is partitioned. • “psi” Reduce to Normal Form
Envision 2nd dimension ranging over output values. • Let tz=N+M-1 • M=th=4 • N=tx 4 x S h3 h2 h1 h0 0 0 0 x0 tz tz
- Envision Processors • Reshaped into a 3rd dimension. • The 2nd dimension is partitioned. Let p = 2 -4 -4 - -4- x x S S tz tz 2 2 2
Envision Cache • Reshaped into a 4th dimension • The 1st dimension is partitioned. x x S S 2 2 S S 2 2 x x 2 2 tz/2 tz 2 Tz/2 Tz/2 2
ONF for the Convolution Decomposition with Processors & Cache Generic form- 4 dimensional after “psi” Reduction • For i0= 0 to p-1 do: • For i11= 0 to tz/p –1 do: • sum 0 • For icacherow= 0 to M/cache -1 do: • For i3 = 0 to cache –1 do: • sum sum + h [(M-((icacherow* cache) + i3))-1]* x’[(((tz/p * i0)+i1) + icacherow* cache) + i3)] Let tz=N+M-1 M=th N=tx P r o c e s s o r l o o p C a c h e l o o p T I m e l o o p sum is calculated for each element of y. Time Domain
Outline • Overview • Array Algebra: MoA and Index Calculus: Psi Calculus • Time Domain Convolution • Other algorithms in Radar • Modified Gram-Schmidt QR Decompositions • MOA to ONF • Experiments • Composition of Matrix Multiplication in Beamforming • MoA to DNF • Experiments • FFT • Benefits of Using Moa and Psi Calculus
Algorithms in Radar Mechanize Using Expression Templates Time DomainConvolution (x,y) ONF for 1proc • Lift dimension • - Processor • - L1 cache • reformulate Use toreason about RAW Manual description & derivation for 1 processor DNF DNF ONF Thoughtson an Abstract Machine Modified Gram SchmidtQR (A) ImplementDNF/ONFFortran 90 CompilerOptimizationsDNF to ONF Benchmark at NCSAw/LAPACK A x (BH x C) Beamforming MoA & y Calculus
ONF for the QRDecomposition with Processors & Cache Initialization ProcessorLoop ComputeNorm MainLo o p ProcessorLoop Normalize ProcessorCache Loop DoTProduct Processor CacheLoop Ortothogonalize Modified Gram Schmidt
DNF for the Composition of A x (BH x C) Generic form- 4 dimensional • Z=0 • For i=0 to n-1 do: • For j=0 to n-1 do: • For k=0 to n-1 do: • z[k;]z[k;]+A[k;j]xX[j;i]xB[i;] Given A, B, X, Z n by n arrays Beamforming
Mechanizing MoA and Psi Reduction Index Theory IntroducedAbrams 1972 • MoA & y calculus theory: Mullin ’88 • Prototype compiler: output C, F90, HPF: Mullin and Thibault’94 • HPF compiler: AST manipulations: Mullin, et al ‘96 • SAC: functional C: Mullin and Bodo’96 • C++ classes: Helal, Sameh and Mullin’01 • C++ expression templates: Mullin, Ruttledge, Bond’02 • PVL with the portable expression template engine(PETE) • Parallel and distributed processing • Abstract machine • Automate cost and determine optimizations: minimize search space Fortran C Theory applied to embedded systems C++ Lifting Compiler Optimizations to Application Programmer Interface
On-going research • we are implementing the psi calculus using expression templates. • we are building on work done at MIT and we are working with MTL library developers (lumsdaine) at Indiana University and STL library developer, musser, at rpi.
Benefits of Using Moa and Psi Calculus • Processor/Memory Hierarchy can be modeled by reshaping data using an extra dimension for each level. • Composition of monolithic operations can be reexpressed as composition of operations on smaller data granularities • Matches memory hierarchy levels • Avoids materialization of intermediate arrays. • Algorithm can be automatically(algebraically) transformed to reflect array reshapings above. • Facilitates programming expressed at a high level • Facilitates intentional program design and analysis • Facilitates portability • This approach is applicable to many other problems in radar.
Email and Question? • Lenore R. Mullin, lenore@cs.albany.edu • Daniel J. Rosenkrantz, djr@cs.albany.edu • Harry B. Hunt III, hunt@cs.albany.edu • Xingmin Luo, xluo@cs.albany.edu *The End*
Typical C++ Operator Overloading Example: A=B+C vector add 2 temporary vectors created Main 1. Pass B and C references to operator + Additional Memory Use B&, C& • Static memory • Dynamic memory (also affects execution time) Operator + 2. Create temporary result vector 3. Calculate results, store in temporary 4. Return copy of temporary temp B+C temp Additional Execution Time temp copy 5. Pass results reference to operator= • Cache misses/page faults • Time to create anew vector • Time to create a copy of a vector • Time to destructboth temporaries temp copy & Operator = temp copy A 6. Perform assignment
C++ Expression Templates and PETE Parse Tree Expression Type Expression + BinaryNode<OpAdd, Reference<Vector>, Reference<Vector > > ExpressionTemplates A=B+C C B Main Parse trees, not vectors, created Parse trees, not vectors, created 1. Pass B and Creferences to operator + Reduced Memory Use B&, C& Operator + • Parse tree only contains references 2. Create expressionparse tree + B& C& 3. Return expressionparse tree copy Reduced Execution Time 4. Pass expression treereference to operator • Better cache use • Loop fusion style optimization • Compile-time expression tree manipulation copy & Operator = 5. Calculate result andperform assignment B+C A • PETE, the Portable Expression Template Engine, is available from theAdvanced Computing Laboratory at Los Alamos National Laboratory • PETE provides: • Expression template capability • Facilities to help navigate and evaluating parse trees PETE: http://www.acl.lanl.gov/pete
Implementing Psi Calculus with Expression Templates take 4 drop rev 3 B size=4 A[i]=B[-i+6] size=7 A[i] =B[-(i+3)+9] =B[-i+6] size=10 A[i] =B[-i+B.size-1] =B[-i+9] size=10 A[i]=B[i] Example: A=take(4,drop(3,rev(B))) B=<1 2 3 4 5 6 7 8 9 10> A=<7 6 5 4> Recall: Psi Reduction for 1-d arrays always yields one or more expressions of the form: x[i]=y[stride*i+ offset] l ≤ i < u 1. Form expression tree 3. Apply Psi Reduction rules 2. Add size information take size=4 drop 4 size=7 Reduction Size info rev 3 size=10 size=10 B 4. Rewrite as sub-expressions with iterators at the leaves, and loop bounds information at the root • Iterators used for efficiency, rather than recalculating indices for each i • One “for” loop to evaluate each sub-expression iterator: offset=6stride=-1 size=4