Implementation of Parallel Algorithms for Heterogeneous Platforms
Implementation issues
• Heterogeneous parallel algorithms
  • Design and analysis: good progress over the last decade
  • Scientific software based on the algorithms: very little done
  • Why?
• Implementation of the algorithms in a portable and self-adaptable form
  • A non-trivial and very tedious task in itself
  • Poses additional challenges
Implementation issues (ctd)
• Accuracy of the hardware model
  • A lot of extra code is needed to provide accurate values of the parameters of the heterogeneous hardware
• Portability
  • Automatic tuning of the program to any executing platform
  • Possibly dynamically changing performance characteristics
  • More complex extra code is needed
Implementation issues (ctd)
• Heterogeneous parallel algorithm
  • Designed in a generic, parameterized form
• Parameters
  • Problem parameters
    • Parameters of the problem to be solved
    • Example: the size of the matrix to be factorized
    • Can only be provided by the user
Implementation issues (ctd)
• Parameters (ctd)
  • Algorithmic parameters
    • Represent different variations and configurations of the algorithm
    • Examples: the size of the matrix block in local computations, the total number of processes executing the algorithm, the arrangement of the processes
    • Do not change the result of computations, but have an impact on the performance
    • (Optimal) values can be provided by the user, or found by the software implementing the algorithm
Implementation issues (ctd)
• Parameters (ctd)
  • Platform parameters
    • Parameters of the performance model of the executing heterogeneous platform
    • Examples: the speed of the processors, the bandwidth and latency of the communication links
    • Have a major impact on the performance of the program
Implementation issues (ctd)
• A good program implementing a heterogeneous parallel algorithm
  • Should provide accurate platform parameters
  • Should provide optimal values of (some) algorithmic parameters
Implementation issues (ctd)
• Program code
  • Core code
    • Implements the algorithm for each valid combination of the values of its parameters
  • Extra code
    • Solves the problems of finding accurate platform parameters and optimal algorithmic parameters
    • Non-trivial and significant in amount
Implementation issues (ctd)
• How programming systems can help
  • Core code
    • Automation would mean automatic design of heterogeneous parallel algorithms => unrealistic
  • Extra code
    • Can and should be provided by such systems
    • Application-specific code (generated by a compiler from the specification of the algorithm)
    • Non-application-specific code (run-time system and libraries)
Implementation issues (ctd)
• Programming systems for heterogeneous parallel computing
  • No miracles: nothing for dummies
  • Help qualified algorithm designers implement their algorithms
  • Automate non-trivial but routine computations and communications
Implementation issues (ctd)
• Heterogeneous programming systems
  • Can also help in efficient implementation of traditional homogeneous parallel algorithms
    • The whole computation is partitioned into equal chunks
    • Each chunk is performed by a separate process
    • The number of processes run by each processor is proportional to the relative speed of the processor (see the sketch below)
  • The code for accurate estimation of platform parameters, optimization of algorithmic parameters, and optimal mapping of processes to the processors can be provided by the programming system
  • The programmer just specifies the algorithm
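A minimal C sketch of the proportional process allocation described above; the speed values and the round-robin handling of leftover processes are illustrative assumptions, not part of any particular programming system.

    #include <stdio.h>

    /* Allocate nproc_total processes over p processors in proportion
       to their relative speeds. Hypothetical helper for illustration. */
    static void allocate_processes(int p, const double speed[],
                                   int nproc_total, int nproc[])
    {
        double sum = 0.0;
        for (int i = 0; i < p; i++) sum += speed[i];

        int assigned = 0;
        for (int i = 0; i < p; i++) {
            nproc[i] = (int)(nproc_total * speed[i] / sum); /* floor */
            assigned += nproc[i];
        }
        /* Hand out the leftover processes round-robin */
        for (int i = 0; assigned < nproc_total; i = (i + 1) % p) {
            nproc[i]++;
            assigned++;
        }
    }

    int main(void)
    {
        double speed[] = { 4.0, 2.0, 1.0 };   /* relative speeds */
        int nproc[3];
        allocate_processes(3, speed, 14, nproc);
        for (int i = 0; i < 3; i++)
            printf("processor %d: %d processes\n", i, nproc[i]);
        return 0;
    }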
Estimation of performance models
• Accurate estimation of performance parameters
  • A key to efficient implementation of the algorithm
  • Wrong estimation => poor performance
• Estimation of constant models of processors
  • p, S = {s1, …, sp}, where si are relative speeds
  • Absolute speeds are also used, but only for convenience
• General approach (see the sketch below)
  • Run the same benchmark code on each processor
  • Use the execution time to calculate its relative speed
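A minimal sketch of this general approach in C with MPI (one process per processor); the benchmark kernel is a placeholder, and normalizing so that the fastest processor has speed 1 is an illustrative choice.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholder benchmark kernel: any code representative of the
       application's computations can be substituted here. */
    static void benchmark(void)
    {
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; i++) x += 1.0 / (double)(i + 1);
    }

    int main(int argc, char *argv[])
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double t = MPI_Wtime();
        benchmark();
        t = MPI_Wtime() - t;

        /* Gather the execution times measured on all processors */
        double *times = malloc(p * sizeof *times);
        MPI_Allgather(&t, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE,
                      MPI_COMM_WORLD);

        /* Relative speed: inverse of execution time, normalized so
           that the fastest processor has speed 1 */
        double tmin = times[0];
        for (int i = 1; i < p; i++) if (times[i] < tmin) tmin = times[i];
        if (rank == 0)
            for (int i = 0; i < p; i++)
                printf("s%d = %.3f\n", i + 1, tmin / times[i]);

        free(times);
        MPI_Finalize();
        return 0;
    }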
Estimation of constant performance models of heterogeneous processors
• No single universal benchmark code
  • Should be carefully designed for each application
• Efficiency
  • Not an issue if the application is to be run multiple times on the same cluster with stable and reproducible performance characteristics
    • The benchmark code can be separated from the application and run once
    • Its execution time can be neglected compared to the total time of all subsequent executions of the application
  • An issue otherwise
    • Each execution is in a unique environment
    • The benchmark code should be a part of the application
Estimation of constant performance models of heterogeneous processors (ctd)
• Simple case: data parallel applications
  • One process per processor
  • Iterative computations
  • Static data layout
    • The same task of the same size is solved at each iteration
    • Processed data may be different but follow the same pattern
  • All processors solve a task of the same size at any one iteration of the main loop
    • Load is balanced by different numbers of iterations
  • Benchmark: any one iteration of the main loop
    • Efficient and representative
Estimation of constant performance models of heterogeneous processors (ctd)
• Sample application
  • Parallel matrix multiplication, C = A×B, based on one-dimensional horizontal partitioning of A and C
  • One-to-one mapping between slices and processors
  • All processors compute their slices in parallel by executing a loop, each iteration of which computes one row of C
Estimation of constant performance models of heterogeneous processors (ctd)
• Benchmark code (see the sketch below)
  • Multiplication of one n-element row by an n×n matrix
  • The relative speed during the execution of this benchmark and of the application is the same
  • For each processor, the execution time = the execution time of one iteration × the number of iterations
• Can be made even more efficient
  • Multiplication of one row by a number of adjacent columns
  • A balance between accuracy (fluctuations) and efficiency
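A minimal C sketch of such a benchmark, multiplying one n-element row by (a stripe of) an n×n matrix; the timing with clock_gettime and the ncols parameter for the cheaper column-stripe variant are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Benchmark: multiply one n-element row by the first ncols columns
       of an n x n matrix (ncols = n reproduces one full iteration of
       the application's main loop). Returns the elapsed time. */
    static double row_x_matrix(int n, int ncols, const double *row,
                               const double *matrix, double *result)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int j = 0; j < ncols; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += row[k] * matrix[(size_t)k * n + j]; /* column j */
            result[j] = s;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void)
    {
        int n = 2000;
        double *row = malloc(n * sizeof *row);
        double *matrix = malloc((size_t)n * n * sizeof *matrix);
        double *result = malloc(n * sizeof *result);
        for (int i = 0; i < n; i++) row[i] = 1.0;
        for (size_t i = 0; i < (size_t)n * n; i++) matrix[i] = 1.0;

        /* A stripe of adjacent columns (ncols < n) trades some
           accuracy for a cheaper benchmark */
        double t = row_x_matrix(n, n, row, matrix, result);
        printf("one iteration: %.6f s\n", t);
        free(row); free(matrix); free(result);
        return 0;
    }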
Estimation of constant performance models of heterogeneous processors (ctd)
• Not that simple case: data parallel applications
  • One process per processor
  • Iterative computations
  • Static data layout
    • For each processor, the same task of the same size is solved at each iteration
    • At each iteration of the main loop, different processors solve tasks of different sizes
  • Load is balanced by different task sizes, given the same number of iterations
• Benchmark code
  • Extra problem: choosing the most representative task size
Estimation of constant performance models of heterogeneous processors (ctd)
• Example
  • Parallel matrix multiplication, C = A×B, based on a two-dimensional q×t Cartesian partitioning of the matrices
  • One-to-one mapping between rectangles and processors
  • At each step k of the main loop of the algorithm:
    • The pivot column of r×r blocks of matrix A is broadcast horizontally
    • The pivot row of r×r blocks of matrix B is broadcast vertically
    • Each processor Pij updates its rectangle cij of matrix C with the product of its parts of the pivot column and the pivot row
  • At each iteration, processor Pij updates an hi×wj matrix by the product of hi×r and r×wj matrices (see the sketch below)
    • The same task size for all iterations
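A minimal C sketch of the local update processor Pij performs at each step: its hi×wj block of C is incremented by the product of an hi×r part of the pivot column and an r×wj part of the pivot row. The names hi, wj, r follow the slide; the plain triple loop is an illustrative stand-in for an optimized GEMM kernel.

    #include <stdio.h>

    /* One step of the 2-D algorithm on processor Pij:
       c (hi x wj) += a (hi x r) * b (r x wj), row-major storage */
    static void local_update(int hi, int wj, int r, double *c,
                             const double *a, const double *b)
    {
        for (int i = 0; i < hi; i++)
            for (int k = 0; k < r; k++) {
                double aik = a[i * r + k];
                for (int j = 0; j < wj; j++)
                    c[i * wj + j] += aik * b[k * wj + j];
            }
    }

    int main(void)
    {
        double a[2 * 2] = { 1, 2, 3, 4 };          /* hi x r  */
        double b[2 * 3] = { 1, 0, 1, 0, 1, 0 };    /* r  x wj */
        double c[2 * 3] = { 0 };                   /* hi x wj */
        local_update(2, 3, 2, c, a, b);
        printf("c[0][0] = %.0f\n", c[0]);          /* 1*1 + 2*0 = 1 */
        return 0;
    }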
Estimation of constant performance models of heterogeneous processors (ctd)
• Benchmark code
  • All processors perform the same number of iterations
  • Load is balanced by using different task sizes
  • => No single task size is fully representative: it does not reproduce the real layout in full
• For any heterogeneous platform
  • There exists a range of task sizes with approximately constant relative speeds
  • If the matrix partitions fall into this range, any task size from this range can be used for the benchmark code (see the sketch below)
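A minimal C sketch of how such a range might be checked empirically: a representative kernel is timed at several task sizes and the resulting absolute speeds compared; if they are approximately equal, any size in the probed range can serve as the benchmark task size. The do_task kernel, its nominal operation count, and the probed sizes are illustrative assumptions.

    #include <stdio.h>
    #include <time.h>

    /* Illustrative stand-in for the application's task of size n;
       returns a nominal operation count */
    static double do_task(int n)
    {
        volatile double s = 0.0;
        for (long i = 0; i < (long)n * n * n; i++) s += 1.0;
        return (double)n * n * n;
    }

    int main(void)
    {
        int sizes[] = { 50, 100, 200, 400 };
        for (int i = 0; i < 4; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            double ops = do_task(sizes[i]);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double t = (t1.tv_sec - t0.tv_sec)
                     + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            printf("size %4d: %.2f Mop/s\n", sizes[i], ops / t / 1e6);
        }
        return 0;
    }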
Estimation of constant performance models of heterogeneous processors (ctd)
• More difficult case: data parallel applications
  • One process per processor
  • Iterative computations
  • Static data layout
    • For each processor, tasks of different sizes are solved at different iterations
    • At each iteration of the main loop, different processors solve tasks of different sizes
Estimation of constant performance models of heterogeneous processors (ctd)
• Example
  • Heterogeneous parallel LU factorization
  • At each iteration, the main body of each processor's computations falls into the update of the trailing submatrix
  • Task sizes asymptotically decrease to zero
    • => Task sizes vary in a very wide range
    • Unrealistic to assume that the relative speed will remain constant within such a wide range
    • => No single task size will accurately estimate the relative speed for all processors
Estimation of constant performance models of heterogeneous processors (ctd)
• Benchmark code
  • Different iterations have different computation costs
  • We can focus on the most costly iterations
    • Some number of the first iterations
    • Assume that the task sizes for these iterations fall into the range where the relative speed is approximately constant
Estimation of constant performance models of heterogeneous processors (ctd)
• Summary
  • A benchmark code solving a task of some fixed size, representing one iteration of the main loop of the application, can be efficient and accurate for many data parallel applications performing iterative computations
Estimation of constant performance models of heterogeneous processors (ctd)
• Programming systems providing basic support for accurate estimation of relative speeds
  • mpC, HeteroMPI
    • The recon statement and the HMPI_Recon() function
  • The benchmark code is provided by the programmer
  • Execution of the statement
    • The code is executed by all processors in parallel
    • The execution times are used to obtain their relative speeds
  • The programmer fully controls the accuracy of estimation
    • What code to run, and where and when to run it
Estimation of non-constant performance models of heterogeneous processors
• Non-constant models
  • Functional, band
• Straightforward estimation of a functional model (see the sketch below)
  • Assume a single performance variable, x
  • The interval [a, b] is divided into equal subintervals [xi, xi+1]
  • Execute the application for each task size xi
  • Build a piecewise linear approximation of the speed function f(x)
  • At each next step:
    • Bisect the subintervals and run the application at the midpoints
    • Build the next piecewise approximation
    • Stop if the error criterion is satisfied; otherwise, repeat the step
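A minimal C sketch of this straightforward procedure; measure_speed is a placeholder for timing the application at task size x (here a smooth synthetic function), and the relative-error stopping criterion comparing each midpoint measurement with its linear interpolation is an illustrative choice.

    #include <math.h>
    #include <stdio.h>

    /* Placeholder: run the application at task size x and return
       the measured speed (synthetic smooth function here) */
    static double measure_speed(double x)
    {
        return 1000.0 / (1.0 + x / 500.0);
    }

    int main(void)
    {
        double a = 100.0, b = 3200.0, eps = 0.01;
        int n = 5;                        /* initial number of points */
        double x[1024], f[1024];
        for (int i = 0; i < n; i++) {     /* initial equal subintervals */
            x[i] = a + (b - a) * i / (n - 1);
            f[i] = measure_speed(x[i]);
        }
        for (int converged = 0; !converged && 2 * n - 1 <= 1024; ) {
            converged = 1;
            double nx[1024], nf[1024];
            int m = 0;
            /* bisect each subinterval; compare the midpoint measurement
               with the current piecewise linear prediction */
            for (int i = 0; i < n - 1; i++) {
                nx[m] = x[i]; nf[m] = f[i]; m++;
                double xm = 0.5 * (x[i] + x[i + 1]);
                double pred = 0.5 * (f[i] + f[i + 1]);
                double fm = measure_speed(xm);
                if (fabs(fm - pred) > eps * fabs(fm)) converged = 0;
                nx[m] = xm; nf[m] = fm; m++;
            }
            nx[m] = x[n - 1]; nf[m] = f[n - 1]; m++;
            for (int i = 0; i < m; i++) { x[i] = nx[i]; f[i] = nf[i]; }
            n = m;
        }
        for (int i = 0; i < n; i++)
            printf("f(%.0f) = %.1f\n", x[i], f[i]);
        return 0;
    }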
Estimation of non-constant performance models of heterogeneous processors (ctd)
• The straightforward estimation procedure
  • Can be very expensive: a big number of points may be needed for convergence
• Minimization of the cost
  • An open problem
  • Only one approach so far (based on the use of the speed band)
Estimation of non-constant performance models of heterogeneous processors (ctd)
• Figures
  • Obtaining a cut
  • Two types of speed bands
  • Initial approximation
  • Approximation of the increasing section
  • Approximation of the non-increasing section
  • Possible scenarios when the next experimental point falls in the area of the current trapezoidal approximation
Optimization of algorithmic parameters
• Algorithmic parameters
  • Have a significant impact on performance
  • Two types
• Parameters not changing the volume of computation and communication
  • Example: the size of the matrix block in local computations
  • Main optimization approach: locally run a benchmark code for different values of the parameter (see the sketch below)
    • Can be done once (upon installation) or at runtime
  • Example: ATLAS
    • Optimizes the parameters of performance-critical operations (matrix multiplication)
    • Design: a highly parameterized code generator
    • Effect: up to tens of times faster than the reference BLAS
• Parameters changing the volume of computation and/or communication
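A minimal C sketch of this benchmark-driven tuning for one such parameter: the block size of a blocked matrix multiplication kernel is varied and the fastest value kept. The candidate sizes and the naive kernel are illustrative; ATLAS itself generates and times many kernel variants, which is far more elaborate.

    #include <stdio.h>
    #include <time.h>

    #define N 512
    static double A[N][N], B[N][N], C[N][N];

    /* Blocked matrix multiplication parameterized by block size nb
       (nb must divide N in this simplified sketch) */
    static void blocked_mm(int nb)
    {
        for (int ii = 0; ii < N; ii += nb)
            for (int kk = 0; kk < N; kk += nb)
                for (int jj = 0; jj < N; jj += nb)
                    for (int i = ii; i < ii + nb; i++)
                        for (int k = kk; k < kk + nb; k++)
                            for (int j = jj; j < jj + nb; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }

    int main(void)
    {
        /* Try candidate block sizes and keep the fastest: the parameter
           changes the performance but not the result */
        int candidates[] = { 16, 32, 64, 128 }, best = 0;
        double tbest = 1e30;
        for (int c = 0; c < 4; c++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            blocked_mm(candidates[c]);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double t = (t1.tv_sec - t0.tv_sec)
                     + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            printf("nb = %3d: %.3f s\n", candidates[c], t);
            if (t < tbest) { tbest = t; best = candidates[c]; }
        }
        printf("selected block size: %d\n", best);
        return 0;
    }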
Optimization of algorithmic parameters (ctd)
• Algorithmic parameters with a direct effect on performance
  • Example: the logical shape of the processor arrangement
  • If the shape is an input parameter of the algorithm, then a self-adaptable implementation should include finding its (sub)optimal value
• Typical: the number and the ordering of processors
  • Involving all available processors may not be optimal, due to high communication costs
  • => A self-adaptable implementation should find the optimal subset of processors and properly order it
• Optimization approaches
  • Straightforward: running benchmarks
  • More efficient: based on the use of performance models of the implemented algorithms
Optimization of algorithmic parameters (ctd)
• Model-based optimization of algorithmic parameters
  • Originally proposed in the mpC programming language
  • The idea: allow the programmer to describe the main performance-related features of the algorithm
    • The number of processors executing the algorithm
    • The total volume of computations performed by each processor during the execution of the algorithm
    • The total volume of data communicated between each pair of processors during the execution of the algorithm
  • The description
    • Is parameterized by the problem and algorithmic parameters
    • Defines a performance model of the algorithm
Optimization of algorithmic parameters (ctd)
• The description
  • Is translated into code used at runtime to estimate the execution time of the algorithm (without real execution), for each combination of platform and algorithmic parameters
• mpC provides the timeof operator
  • Its only operand is a fully specified algorithm
  • The result is the estimated execution time
  • Can be used to implement self-adaptable applications
    • Adaptable not only to different platforms but also to different states of the same platform at different runs
• HeteroMPI: HMPI_Timeof() has the same functionality
Optimization of algorithmic parameters (ctd)
• Example: matrix multiplication, C = A×B, on heterogeneous processors
  • Algorithm parameters: n, p, s1, …, sp
  • Application implementing the algorithm
    • Goal: minimize the execution time, not the computation time
Optimization of algorithmic parameters (ctd)
• Self-adaptable application design
  • The application should find the optimal subset of the processors minimizing the execution time
  • Assume, for simplicity, a homogeneous communication layer
    • The optimal subset will then always include the fastest processors
  • Finding the p optimal processors out of the q available (see the sketch below):
    • Use the benchmark multiplying one n-element row by a dense n×n matrix to estimate s1, …, sq
    • Re-arrange the processors so that s1 ≥ … ≥ sq
    • Given t0 = ∞, for i = 1 while i ≤ q:
      • Estimate ti, the execution time given the optimal partitioning of the matrices over processors P1, …, Pi
      • If ti < ti−1, then i = i + 1 and continue; else p = i − 1 and stop
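A minimal C sketch of this incremental search. estimate_time is a hypothetical stand-in for the model-based estimate (in mpC this role is played by the timeof operator, as shown in the code further below); the cost formula inside it, balanced computation plus a communication term growing with the number of processors, is an illustrative assumption only.

    #include <stdio.h>
    #include <stdlib.h>
    #include <float.h>

    /* Hypothetical model-based estimate of the execution time of the
       matrix multiplication on the i fastest processors */
    static double estimate_time(int i, int n, const double s[])
    {
        double ssum = 0.0;
        for (int k = 0; k < i; k++) ssum += s[k];
        double comp = (double)n * n * n / ssum;        /* balanced work */
        double comm = 1e-3 * (double)n * n * (i - 1);  /* grows with i */
        return comp + comm;
    }

    static int cmp_desc(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x < y) - (x > y);
    }

    int main(void)
    {
        int q = 6, n = 1000, p = q;       /* default: use all q */
        double s[] = { 40.0, 90.0, 10.0, 70.0, 30.0, 20.0 };
        qsort(s, q, sizeof s[0], cmp_desc);   /* s1 >= ... >= sq */

        double tprev = DBL_MAX;               /* t0 = infinity */
        for (int i = 1; i <= q; i++) {
            double t = estimate_time(i, n, s);
            if (t >= tprev) { p = i - 1; break; } /* Pi did not help */
            tprev = t;
        }
        printf("optimal number of processors: %d\n", p);
        return 0;
    }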
Optimization of algorithmic parameters (ctd)
• Code in mpC. Performance model of the algorithm:

    algorithm AxB(int p, int n, int d[p]) {
      coord I=p;
      node { I>=0: bench*(d[I]); };
      link (J=p) { I!=J: length(double)*(d[I]*n) [J]->[I]; };
    };
Optimization of algorithmic parameters (ctd)
• mpC. The rest of the relevant code:

    // Run a benchmark code in parallel on all physical
    // processors to update the estimation of their speeds
    {
      repl double *row, *matrix, *result;
      // memory allocation for row, matrix, and result
      // initialization of row, matrix, and result
      ...
      recon RowxMatrix(row, matrix, result, n);
    }
Optimization of algorithmic parameters (ctd)
• mpC. The rest of the relevant code:

    // Get the total number of physical processors
    q = MPC_Get_number_of_processors();
    // Get the speeds of the physical processors
    speeds = calloc(q, sizeof(double));
    MPC_Get_processors_info(NULL, speeds);
    // Sort the speeds in descending order
    qsort(speeds+1, q-1, sizeof(double), compar);
Optimization of algorithmic parameters (ctd)
• mpC. The rest of the relevant code:

    // Calculate the optimal number of physical processors
    [host]: {
      int p, *d;
      struct {int p; double t;} min;
      double t;
      d = calloc(q, sizeof(int));
      min.p = 0;
      min.t = DBL_MAX;
      for(p=1; p<=q; p++) {
        // Partition C over p involved physical processors
        Partition(p, speeds, d, n);
        // Estimate the execution time of the matrix multiplication
        // on p physical processors
        t = timeof(algorithm AxB(p, n, d));
        if(t<min.t) { min.p = p; min.t = t; }
      }
      p = min.p;
    }
Implementation of homogeneous algorithms for heterogeneous platforms
• The HeHo approach
  • Multiple processes per processor
  • Problem: the optimal configuration of the application
    • The optimal subset of heterogeneous processors
    • The optimal distribution of processes over the processors
• mpC/HeteroMPI automate
  • Accurate estimation of platform parameters
  • Optimization of algorithmic parameters, including the number of parallel processes and their arrangement
  • Optimal mapping of the parallel processes to the heterogeneous processors