A uniform algebraically-based approach to computational physics and efficient programming
James E. Raynolds, College of Nanoscale Science and Engineering, University at Albany, State University of New York, Albany, NY 12309
Lenore Mullin, Computer Science, University at Albany, State University of New York, Albany, NY 12309
Matrix Example
• In Fortran 90, an array expression such as E = A + B + C + D is evaluated pairwise:
• First temporary computed: T1 = A + B
• Second temporary: T2 = T1 + C
• Last operation: E = T2 + D
Matrix Example (cont)
• Intermediate temporaries consume memory and add processing operations.
• Solution: compose the index operations instead (see the sketch below).
• Loop over i, j: E(i,j) = A(i,j) + B(i,j) + C(i,j) + D(i,j)
• No temporaries are created.
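A minimal sketch of the contrast in Python/NumPy, standing in for the Fortran 90 example; the array names and sizes are illustrative assumptions:

```python
import numpy as np

n = 64
A, B, C, D = (np.random.rand(n, n) for _ in range(4))

# Pairwise evaluation, as a naive compiler would do it:
T1 = A + B          # first temporary
T2 = T1 + C         # second temporary
E  = T2 + D         # last operation

# Composed index operations: one loop nest, no temporaries.
E2 = np.empty((n, n))
for i in range(n):
    for j in range(n):
        E2[i, j] = A[i, j] + B[i, j] + C[i, j] + D[i, j]

assert np.allclose(E, E2)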
Need for formalism
• Few problems are as simple as this matrix sum.
• The formalism is designed to handle extremely complicated situations systematically.
• Goal: composition of algorithms.
• For example, radar is the composition of numerous algorithms: QR(FFT(X)).
• Optimizations are classically done sequentially, even when parallel processors and nodes are used: FFT (or DFT?) first, then QR.
• Optimizations can instead be carried out across algorithms, processors, and memories.
MoA and Psi Calculus
• Basic properties:
• An index calculus: the psi function.
• Shape-polymorphic functions and operators: operations are defined using shapes and psi.
• MoA defines some useful operations and functions; as long as shapes define functions and operations, any new function or operation may be defined and reduced.
• The fundamental type is the array: scalars are 0-dimensional arrays.
• Denotational Normal Form (DNF) = reduced form in Cartesian coordinates (independent of data layout: row major, column major, regular sparse, …).
• Operational Normal Form (ONF) = reduced form for 1-d memory layout(s); it defines how to build the code on processor/memory hierarchies, and the ONF reveals the loops and control.
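A minimal row-major model of the psi indexing function in Python; the code realization is an assumption for illustration, since MoA defines psi abstractly over shapes, independent of layout:

```python
import numpy as np

def psi(idx, shape, flat):
    """Select from the flat data vector by a (possibly partial)
    index vector: a full index yields a scalar (0-d array),
    a partial index yields the remaining sub-array."""
    d = len(idx)
    offset = 0
    for i, n in zip(idx, shape[:d]):
        assert 0 <= i < n
        offset = offset * n + i
    stride = int(np.prod(shape[d:]))   # size of the selected component
    return flat[offset * stride : (offset + 1) * stride].reshape(shape[d:])

A = np.arange(24)                    # data laid out in row major
print(psi((1, 2), (2, 3, 4), A))     # the <1 2>-th length-4 row: 20..23
print(psi((1, 2, 3), (2, 3, 4), A))  # one element: 1*12 + 2*4 + 3 = 23
```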
Applications: Levels of Processor/Memory Hierarchy
• Can be modeled by increasing the dimensionality of the data array:
• an additional dimension for each level of the hierarchy.
• Envision the data as reshaped/transposed to reflect the mapping to the increased dimensionality.
• An index calculus automatically transforms the algorithm to reflect the restructured data array.
• Data, layout, data movement, and scalarization are automatically generated from the MoA descriptions and the Psi Calculus definitions of array operations, functions, and their compositions.
• Arrays are of any dimension, even 0, i.e., scalars.
Processor/Memory Hierarchy (continued)
• Mathematics of Arrays: an expression such as y = conv(x) combines intricate math with intricate memory accesses (indexing).
• Math and indexing operations in the same expression
• Framework for design space search
• Rigorous and provably correct
• Extensible to complex architectures
• Approach: map. Example: "raising" array dimensionality. The memory hierarchy map reshapes the vector x: <0 1 2 … 35> into the 12 by 3 array of rows <0 1 2>, <3 4 5>, …, <33 34 35>; the rows are distributed across processors P0, P1, P2 (parallelism), and each processor's block moves through its hierarchy: Main Memory, L2 Cache, L1 Cache. A sketch follows below.
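A sketch of the map in NumPy; the three-processor split and the block shape are assumptions read off the figure:

```python
import numpy as np

x = np.arange(36)            # x: <0 1 2 ... 35>

rows = x.reshape(12, 3)      # the 12 rows <0 1 2>, <3 4 5>, ..., <33 34 35>
hier = x.reshape(3, 4, 3)    # raised dimensionality: processor x row x element

# hier[p] is the block processor p owns; further reshapes of hier[p]
# would model cache-sized tiles (L2, L1) as additional dimensions.
print(hier[0])               # P0's rows: <0 1 2> ... <9 10 11>
```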
Manipulation of an array
• Given a 3 by 5 by 4 array:
• Shape vector: <3 5 4>
• Index vector: <i j k>, with 0 ≤ i < 3, 0 ≤ j < 5, 0 ≤ k < 4
• Used to select a single element (a full index) or a sub-array (a partial index); see the sketch below.
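A short NumPy check of the selection and its row-major (ONF-style) offset:

```python
import numpy as np

shape = (3, 5, 4)
A = np.arange(np.prod(shape)).reshape(shape)

i, j, k = 1, 2, 3                           # an example index vector <1 2 3>
offset = (i * shape[1] + j) * shape[2] + k  # row-major offset: 31
assert A[i, j, k] == A.reshape(-1)[offset]

# A partial index <1 2> selects the length-4 sub-array at that position:
assert A[i, j].shape == (4,)
```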
More Definitions
• Reverse: given an array A with shape <n>,
• the reversal is given through indexing: (reverse A)[i] = A[(n - 1) - i]
• Example: the reverse of <0 1 2 3> is <3 2 1 0>; a sketch follows below.
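A sketch of reversal as a composed index map in NumPy; the data values are arbitrary:

```python
import numpy as np

n = 8
A = 10 * np.arange(n)            # <0 10 20 ... 70>

rev = (n - 1) - np.arange(n)     # the composed index map i -> n-1-i
B = A[rev]                       # materialized only here, in one gather
assert (B == A[::-1]).all()
```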
Some Psi Calculus Operations Built Using psi & Shapes
• take (Vector A, int N): forms a vector of the first N elements of A. [index permutation]
• drop (Vector A, int N): forms a vector of the last (A.size - N) elements of A. [index permutation]
• rotate (Vector A, int N): forms a vector of the last N elements of A concatenated to the other elements of A. [index permutation]
• cat (Vector A, Vector B): forms a vector that is the concatenation of A and B. [index permutation]
• unaryOmega (Operation Op, dimension D, Array A): applies the unary operator Op to the D-dimensional components of A (like a forall loop). [operator]
• binaryOmega (Operation Op, dimension Adim, Array A, dimension Bdim, Array B): applies the binary operator Op to the Adim-dimensional components of A and the Bdim-dimensional components of B (like a forall loop). [operator]
• reshape (Vector A, Vector B): reshapes B into an array having A.size dimensions, where the length in each dimension is given by the corresponding element of A. [restructuring]
• iota (int N): forms a vector of size N containing the values 0 .. N-1. [index generation]
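Illustrative NumPy one-liners for a few of these operations; these are hedged stand-ins, not the formal shape-based definitions:

```python
import numpy as np

def iota(n):      return np.arange(n)           # index generation
def take(A, n):   return A[:n]                  # first n elements
def drop(A, n):   return A[n:]                  # all but the first n
def rotate(A, n): return np.roll(A, n)          # last n moved to the front
def cat(A, B):    return np.concatenate((A, B))

A = iota(5)                                     # <0 1 2 3 4>
assert (take(A, 2)   == [0, 1]).all()
assert (drop(A, 2)   == [2, 3, 4]).all()
assert (rotate(A, 2) == [3, 4, 0, 1, 2]).all()
assert (cat(take(A, 2), drop(A, 2)) == A).all()
```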
New FFT algorithm: record speed
• Maximize in-cache operations through repeated transpose-reshape operations.
• Similar to partitioning for a parallel implementation.
• Do as many operations in cache as possible.
• Re-materialize the array to achieve locality.
• Continue processing in cache and repeat the process. A sketch of the overall pattern follows below.
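As a hedged sketch, the closely related Cooley-Tukey n = r*c factorization (the "four-step" FFT) exhibits the reshape / in-cache row transform / transpose pattern the following slides walk through; using np.fft.fft for the small in-cache transforms is an illustrative assumption, not the authors' kernel:

```python
import numpy as np

def four_step_fft(x, r, c):
    """FFT of length n = r*c via small transforms plus a transpose:
    reshape, column FFTs, twiddle scaling, row FFTs, and a final
    transpose that reorders the output."""
    n = r * c
    A = x.reshape(r, c)
    B = np.fft.fft(A, axis=0)                  # length-r transforms
    k1 = np.arange(r)[:, None]
    m2 = np.arange(c)[None, :]
    C = B * np.exp(-2j * np.pi * k1 * m2 / n)  # twiddle weights
    D = np.fft.fft(C, axis=1)                  # length-c transforms
    return D.T.reshape(n)                      # final index permutation

x = np.random.rand(32).astype(complex)
assert np.allclose(four_step_fft(x, 8, 4), np.fft.fft(x))
```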
Example
• Assume cache size c = 4; input vector length n = 32; number of rows r = n/c = 8.
• Generate a vector of indices: iota(32) = <0 1 2 … 31>.
• Use the reshape operator to generate an 8 by 4 matrix (see below).
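In NumPy terms, a sketch of iota followed by reshape:

```python
import numpy as np

c, n = 4, 32
r = n // c            # 8 rows

v = np.arange(n)      # iota(32): <0 1 ... 31>
X = v.reshape(r, c)   # the starting 8 x 4 matrix
print(X[0], X[7])     # rows <0 1 2 3> and <28 29 30 31>
```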
Starting Matrix
• Each row is of length equal to the cache size c.
• The standard butterfly is applied to each row, as sketched below.
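A hedged sketch of one radix-2 butterfly pass applied to every row at once; this is a stand-in for the per-row stages, and the weight vector w is an illustrative assumption:

```python
import numpy as np

def row_butterfly(X, w):
    """Combine the two halves of every length-c row:
    (a, b) -> (a + w*b, a - w*b), all rows at once, all in cache."""
    c = X.shape[1]
    a, b = X[:, :c // 2], w * X[:, c // 2:]
    return np.hstack((a + b, a - b))

X = np.arange(32, dtype=complex).reshape(8, 4)
w = np.exp(-2j * np.pi * np.arange(2) / 4)  # example stage weights
Y = row_butterfly(X, w)                      # shape stays 8 x 4
```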
Next transpose
• To continue further would induce cache misses, so transpose and reshape.
• The transpose-reshape operation is composed over indices (only the result is materialized; see the sketch below).
• The transpose turns the 8 by 4 matrix into a 4 by 8 matrix, which the reshape returns to 8 rows of length 4.
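A sketch of the composition in NumPy: the index map of the transpose-reshape is built once, and only the result B is ever materialized:

```python
import numpy as np

r, c = 8, 4
A = 10 * np.arange(r * c)        # flat data after the row stage

# Index map of reshape(transpose(reshape(A, (r, c))), (r, c)),
# composed once over indices.
idx = np.arange(r * c).reshape(r, c).T.reshape(r, c)

B = A[idx]                       # a single gather, no temporaries
assert (B == A.reshape(r, c).T.reshape(r, c)).all()
```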
Resulting Transpose-Reshape
• Materialize the transpose-reshaped array B.
• Carry out the butterfly operation on each row.
• The weights are re-ordered.
• Access patterns are standard.
Transpose-Reshape again
• As before, to proceed further would induce cache misses, so:
• do the transpose-reshape again (composing the indices).
• The transpose is composed with the reshape exactly as in the previous step.
Last step (in this example)
• Materialize the composed transpose-reshaped array C.
• Carry out the last step of the FFT.
• This last step corresponds to cycles of length 2 involving elements 0 and 16, 1 and 17, etc.
Final Transpose
• The data has been permuted numerous times by the multiple reshape-transposes.
• We could reverse the transformations one by one, but that would take multiple steps and multiple writes.
• Viewing the problem as an n-cube (a hypercube for radix 2) lets us use the number of reshape-transposes as the argument to a rotate (or shift) of a vector generated from the dimensions of the hypercube.
• This rotated vector is used as the argument to a binary transpose, which permutes everything at once (see the sketch below).
• Express the algorithm algebraically, Psi-reduce to the DNF, then to the ONF for a generic design.
• The ONF has only two loops, no matter what dimension hypercube (or n-cube for radix n) we start with.
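A sketch of the mechanism in NumPy; the step count and the rotation direction are illustrative assumptions:

```python
import numpy as np

n, d = 32, 5                                 # radix 2: n = 2**5 hypercube
steps = 2                                    # transpose-reshapes taken

axes = tuple(np.roll(np.arange(d), steps))   # rotate the dimension vector
cube = np.arange(n).reshape((2,) * d)
y = cube.transpose(axes).reshape(n)          # one binary transpose:
                                             # everything permuted in one pass
```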
Summary
• All operations are carried out in cache, at the price of re-arranging the data.
• Data blocks can be of any size (powers of the radix): they need not equal the cache size.
• Optimum performance is a tradeoff between the reduction of cache misses and the cost of the transpose-reshape operations.
• The number of transpose-reshape operations is determined by the data block size (cache size).
• Record performance: up to a factor of 4 better than libraries.