Advances in the Optimization of Parallel Routines (I)

Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain, dis.um.es/~domingo. Universidad Politécnica de Valencia

Outline • A little history • Modelling Linear Algebra Routines • Installation routines • Autotuning routines • Modifications to libraries’ hierarchy • Polylibraries • Algorithmic schemes • Heterogeneous systems • Peer to peer computing

Collaborations and self-references • Modelling Linear Algebra Routines • + J. Cuenca + J. González: • Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001 • Towards the Design of an Automatically Tuned Linear Algebra Library. 2002 • + J. Cuenca + L. P. García + J. González + A. Vidal: • Empirical Modelling of Parallel Linear Algebra Routines. 2003

Collaborations and self-references • Installation routines • + G. Carrillo: • Installation routines for linear algebra libraries on LANs. 2000 • + G. Carrillo + J. Cuenca + J. González: • Optimización automática de rutinas paralelas de álgebra lineal. 2000

Collaborations and self-references • Autotuning routines • + J. Cuenca + J. González: • Automatic parameterization of parallel linear algebra routines. 2001 • + J. Cuenca: • Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002

Collaborations and self-references • Modifications to the libraries’ hierarchy • + J. Cuenca + J. González: • Architecture of an Automatic Tuned Linear Algebra Library. 2002 - 2004

Collaborations and self-references • Polylibraries • + P. Alberti + P. Alonso + J. Cuenca + A. Vidal: • Designing Polylibraries to Speed Up Parallel Computations. 2003

Collaborations and self-references • Algorithmic schemes • + J. P. Martínez: • Automatic Optimization in Parallel Dynamic Programming Schemes. 2004

Collaborations and self-references • Heterogeneous systems • + J. Cuenca + J. Dongarra + J. González + K. Roche: • Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003 • + J. Cuenca + J. P. Martínez: • Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004

A little history • Parallel optimization in the past: • Hand-optimization for each platform • Time-consuming • Incompatible with hardware evolution • Incompatible with changes in the system (architecture and basic libraries) • Unsuitable for systems with variable workloads • Misuse by non-expert users

A little history • Initial solutions to this situation: • Problem-specific solutions • Polyalgorithms • Installation tests

A little history • Problem-specific solutions: • Brewer (1994): Sorting Algorithms, Differential Equations • Frigo (1997): FFTW: The Fastest Fourier Transform in the West • LAWRA (1997): Linear Algebra With Recursive Algorithms

A little history • Polyalgorithms: • Brewer • FFTW • PHiPAC (1997): Linear Algebra

A little history • Installation tests: • ATLAS (2001): Dense Linear Algebra, sequential • Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm • I-LIB (2000): some parallel linear algebra routines

A little history • Parallel optimization today: • Optimization based on computational kernels • Systematic development of routines • Auto-optimization of routines • Middleware for auto-optimization

A little history • Optimization based on computational kernels: • Efficient kernels (BLAS) and algorithms based on these kernels • Auto-optimization of the basic kernels (ATLAS)

A little history • Systematic development of routines: • FLAME project • R. van de Geijn + E. Quintana + … • Dense Linear Algebra • Based on Object Oriented Design • LAWRA • Dense Linear Algebra • For Shared Memory Systems

A little history • Auto-optimization of routines: • At installation time: • ATLAS, Dongarra + Whaley • I-LIB, Kanada + Katagiri + Kuroda • SOLAR, Cuenca + Giménez + González • LFC, Dongarra + Roche • At execution time: • Solve a reduced problem in each processor (Kalinov + Lastovetsky) • Use a system evaluation tool (NWS)

A little history • Middleware for auto-optimization: • LFC: • Middleware for Dense Linear Algebra Software in Clusters. • Hierarchy of autotuning libraries: • Include installation routines in the libraries, to be used in the development of higher-level libraries • FIBER: • Proposal of general middleware • Evolution of I-LIB • mpC: • For heterogeneous systems

A little history • Parallel optimization in the future?: • Skeletons and languages • Heterogeneous and variable-load systems • Distributed systems • P2P computing

A little history • Skeletons and languages: Develop skeletons for parallel algorithmic schemes together with execution time models, and provide users with these as libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)

A little history • Heterogeneous and variable-load systems: Heterogeneous algorithms: unbalanced distribution of data (static or dynamic). Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic). Variable-load systems are treated as dynamic heterogeneous systems.

A little history • Distributed systems: Intrinsically heterogeneous and variable-load. Very high communication cost. Special middleware is necessary (Globus, NWS). There can be servers to attend to client queries.

A little history • P2P computing: Users can join and leave dynamically. All users are of the same type (initially). It is distributed, heterogeneous and variable-load. But special middleware is necessary.

Outline • A little history • Modelling Linear Algebra Routines • Installation routines • Autotuning routines • Modifications to libraries’ hierarchy • Polylibraries • Algorithmic schemes • Heterogeneous systems • Peer to peer computing

Modelling Linear Algebra Routines An accurate prediction of the execution time is necessary in order to select: • The number of processes • The number of processors • Which processors • The number of rows and columns of processes (the topology) • The assignment of processes to processors • The computational block size (in linear algebra algorithms) • The communication block size • The algorithm (polyalgorithms) • The routine or library (polylibraries)

Modelling Linear Algebra Routines Cost of a parallel program: $t_{parallel} = t_{arith} + t_{com} + t_{over} - t_{overlap}$ • $t_{arith}$: arithmetic time • $t_{com}$: communication time • $t_{over}$: overhead, for synchronization, imbalance, process creation, ... • $t_{overlap}$: overlapping of communication and computation

Modelling Linear Algebra Routines Estimation of the time: $t_{parallel} \approx t_{arith} + t_{com}$. Considering computation and communication divided into a number of steps: $t_{parallel} \approx \sum_i (t_{arith,i} + t_{com,i})$. For each part of the formula, the value used is that of the process which gives the highest value.

Modelling Linear Algebra Routines The time depends on the problem size (n) and the system size (p), but also on some ALGORITHMIC PARAMETERS, such as the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors

Modelling Linear Algebra Routines And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (tc), the start-up time (ts) and the word-sending time (tw)

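Modelling Linear Algebra Routines LU factorisation (Golub - Van Loan):

$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & U_{33} \end{pmatrix}$$

Step 1: $A_{11} = L_{11} U_{11}$ (LU factorisation without blocks)
Step 2: $L_{11} U_{1j} = A_{1j}$, $j = 2, 3$ (multiple lower triangular systems)
Step 3: $L_{i1} U_{11} = A_{i1}$, $i = 2, 3$ (multiple upper triangular systems)
Step 4: $A_{ij} \leftarrow A_{ij} - L_{i1} U_{1j}$, $i, j = 2, 3$ (update south-east blocks)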
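
The four steps can be illustrated with a minimal sequential sketch in plain Python. It is an illustration rather than the routine used in the talk: the names block_lu and multiply_factors are hypothetical, no pivoting is performed (so the input is assumed to be safe for plain LU, for example strictly diagonally dominant), and the factors overwrite the input matrix, with L holding an implicit unit diagonal.

```python
def block_lu(a, b):
    """Overwrite the square matrix a (list of lists) with its L and U factors.

    Assumption: no pivoting, so a must admit a plain LU factorisation.
    """
    n = len(a)
    for k in range(0, n, b):
        e = min(k + b, n)  # current diagonal block covers rows/cols k..e-1
        # Step 1: unblocked LU of the diagonal block, A11 = L11 U11.
        for p in range(k, e):
            for i in range(p + 1, e):
                a[i][p] /= a[p][p]
                for j in range(p + 1, e):
                    a[i][j] -= a[i][p] * a[p][j]
        # Step 2: solve L11 U12 = A12 (lower triangular systems, unit diagonal).
        for j in range(e, n):
            for i in range(k, e):
                for p in range(k, i):
                    a[i][j] -= a[i][p] * a[p][j]
        # Step 3: solve L21 U11 = A21 (upper triangular systems).
        for i in range(e, n):
            for j in range(k, e):
                for p in range(k, j):
                    a[i][j] -= a[i][p] * a[p][j]
                a[i][j] /= a[j][j]
        # Step 4: update the south-east blocks, A22 <- A22 - L21 U12.
        for i in range(e, n):
            for j in range(e, n):
                s = 0.0
                for p in range(k, e):
                    s += a[i][p] * a[p][j]
                a[i][j] -= s
    return a


def multiply_factors(lu):
    """Rebuild L*U from the packed factors (L has an implicit unit diagonal)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else lu[i][p]
                out[i][j] += l_ip * lu[p][j]
    return out
```

Calling block_lu on a copy of a matrix and then multiply_factors on the result should give back the original matrix up to rounding, whatever block size b is chosen.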

Modelling Linear Algebra Routines If the blocks are of size 1, all the operations counted in the execution time are performed on individual elements. If the block size is b, the cost is expressed in terms of k3 and k2, the costs of operations performed with BLAS 3 or BLAS 2.

Modelling Linear Algebra Routines But the cost of different operations of the same level is different, and the theoretical cost could be better modelled with a separate cost parameter for each basic routine. Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...
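
As a rough illustration of a model with one parameter per basic routine, the sketch below counts leading-order floating point operations per kernel for a sequential block LU and weights each count by its own cost. The function names, the use of leading-order counts and the assumption that b divides n are illustrative choices, not the formula from the slide.

```python
def block_lu_flops(n, b):
    """Leading-order flop counts per kernel for a sequential block LU.

    Assumes b divides n. For each block step with trailing size s:
      getf2 on a b x b block:          2/3 b^3
      two trsm with b x s / s x b rhs: 2 b^2 s
      gemm update of the s x s block:  2 b s^2
    """
    counts = {"getf2": 0.0, "trsm": 0.0, "gemm": 0.0}
    for k in range(0, n, b):
        s = n - k - b  # size of the trailing (south-east) submatrix
        counts["getf2"] += 2.0 * b ** 3 / 3.0
        counts["trsm"] += 2.0 * b * b * s
        counts["gemm"] += 2.0 * b * s * s
    return counts


def modelled_time(counts, cost_per_flop):
    """Sum of (flops of each kernel) x (cost per flop of that kernel)."""
    return sum(counts[name] * cost_per_flop[name] for name in counts)
```

With these counts the three contributions add up to 2n^3/3, the usual LU operation count, and for b much smaller than n most of the work falls in gemm, which is why the cost of that kernel dominates the prediction.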

Modelling Linear Algebra Routines The value of each System Parameter (SP) can depend on the problem size (n) and on the values of the Algorithmic Parameters (AP), such as b. The formula has the form $t(n) = f(n, AP, SP)$ with $SP = g(n, AP)$, and what we want is to obtain the values of AP with which the lowest execution time is obtained

Modelling Linear Algebra Routines The values of the System Parameters could be obtained • With installation routines associated with each linear algebra routine • From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization • At execution time, by testing the system conditions prior to the call to the routine

Modelling Linear Algebra Routines These values can be obtained as single values (traditional method) or as functions of the Algorithmic Parameters. In the latter case, a multidimensional table of values, indexed by the problem size and the Algorithmic Parameters, is stored. When a problem of a particular size is solved, the execution time is estimated with the values of the stored size closest to the real size, and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
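
The table lookup and the selection of Algorithmic Parameters can be sketched as follows. Everything concrete here is an assumption made for illustration: the table contents are invented numbers, the only Algorithmic Parameter is the block size b, and the cost formula is a toy model (a gemm-dominated cubic term plus a BLAS 2 term that grows with b), not the formula from the talk.

```python
# Hypothetical installation data: stored problem size -> {block size: k3_gemm}.
# Values are invented; real ones would come from installation routines.
K3_GEMM_TABLE = {
    512: {16: 6.0e-9, 32: 5.0e-9, 64: 5.5e-9},
    1024: {16: 6.0e-9, 32: 4.8e-9, 64: 4.2e-9},
    2048: {16: 6.2e-9, 32: 5.0e-9, 64: 4.0e-9},
}
K2 = 2.0e-8  # hypothetical cost of a BLAS 2 operation, taken as constant


def closest_size(n, table=K3_GEMM_TABLE):
    """Stored problem size closest to the real size n."""
    return min(table, key=lambda size: abs(size - n))


def predicted_time(n, b, table=K3_GEMM_TABLE):
    """Toy model: (2/3) n^3 k3(n, b) + b n^2 k2, with k3 read from the table."""
    k3 = table[closest_size(n, table)][b]
    return (2.0 / 3.0) * n ** 3 * k3 + b * n ** 2 * K2


def best_block_size(n, table=K3_GEMM_TABLE):
    """Block size whose predicted execution time is lowest."""
    candidates = table[closest_size(n, table)]
    return min(candidates, key=lambda b: predicted_time(n, b, table))
```

With this invented table, a problem of size 600 uses the entries stored for 512 and a problem of size 2000 uses those stored for 2048, and the selected block size changes between the two cases.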

Modelling Linear Algebra Routines Parallel block LU factorisation. [Figure: the matrix, the mesh of processors, and the distribution of computations in the first step]

Modelling Linear Algebra Routines [Figure: distribution of computations in successive steps (second step, third step)]

Modelling Linear Algebra Routines The cost of parallel block LU factorisation. Tuning Algorithmic Parameters: • block size: b • 2D mesh of p processors: p = r × c, d = max(r, c). System Parameters: • cost of arithmetic operations: k2,getf2, k3,trsm, k3,gemm • communication parameters: ts, tw
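
Since p = r × c, the candidate meshes for a given number of processors are the factor pairs of p. A minimal sketch that lists them together with d = max(r, c) (the function name mesh_shapes is a hypothetical choice):

```python
def mesh_shapes(p):
    """All (r, c, d) with r * c == p and d = max(r, c), in increasing r."""
    return [(r, p // r, max(r, p // r)) for r in range(1, p + 1) if p % r == 0]


if __name__ == "__main__":
    for r, c, d in mesh_shapes(8):
        print(f"r={r} c={c} d={d}")
```

An autotuning routine would evaluate the cost model for each of these shapes, combined with each candidate block size, and keep the combination with the lowest predicted time.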

Modelling Linear Algebra Routines The cost of parallel block QR factorisation. Tuning Algorithmic Parameters: • block size: b • 2D mesh of p processors: p = r × c. System Parameters: • cost of arithmetic operations: k2,geqr2, k2,larft, k3,gemm, k3,trmm • communication parameters: ts, tw

Modelling Linear Algebra Routines The same basic operations appear repeatedly in different higher-level routines. The information generated for one routine (say, LU) could therefore be stored and used for other routines (e.g. QR), and a common format is necessary to store the information.

Modelling Linear Algebra Routines Parallel QR factorisation on an IBM SP2 with 8 processors. [Figure: execution time in seconds (0 to 80) against problem size (512 to 3584) for three series: mean, model and optimum] • “mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user) • “optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters • “model” is the execution time with the values selected with the model

Modelling Linear Algebra Routines Parameter selection for the QR algorithm

IBM SP2
            p=4            p=8
   n      b   r   c      b   r   c
1024     16   1   4     16   1   8
2048     32   1   4     16   1   8
3072     32   1   4     32   2   4
4096     32   1   4     32   2   4

Origin 2000
            p=4            p=8
   n      b   r   c      b   r   c
1024     32   4   1     32   4   2
2048     64   4   1     32   4   2
3072      -   -   -     32   4   2
4096      -   -   -     64   4   2

Network of Pentium III with Fast Ethernet
            p=4            p=8
   n      b   r   c      b   r   c
1024     16   1   4     16   1   8
2048     16   1   4     16   1   8
3072     32   1   4     32   1   8
4096     32   1   4     32   1   8

Installation Routines In the formulas for the parallel block LU factorisation, the values of the System Parameters (k2,getf2, k3,trsm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c)

Installation Routines This is done by running, at installation time, Installation Routines associated with the linear algebra routine, and storing the information generated for use at running time. Therefore, each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.

Installation Routines k3,gemm is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c). Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.
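
An installation routine of this kind can be sketched as a timing loop over a multiplication of the stated shape. This is only an illustration under clear assumptions: the names matmul and estimate_k3_gemm are hypothetical, n is assumed divisible by r and c, and the multiplication is a naive pure-Python triple loop, so the number obtained measures the Python interpreter rather than a tuned BLAS gemm. A real installation routine would time the library kernel itself.

```python
import time


def matmul(a, b_mat):
    """Naive product of an m x k matrix and a k x q matrix (lists of lists)."""
    m, k, q = len(a), len(b_mat), len(b_mat[0])
    out = [[0.0] * q for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(q):
                out[i][j] += a_ip * b_mat[p][j]
    return out


def estimate_k3_gemm(n, b, r, c, repeats=3):
    """Estimated time per flop for an (n/r x b) by (b x n/c) multiplication."""
    rows, cols = n // r, n // c
    left = [[1.0] * b for _ in range(rows)]
    right = [[1.0] * cols for _ in range(b)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(left, right)
        best = min(best, time.perf_counter() - start)
    flops = 2.0 * rows * b * cols  # one multiply and one add per inner step
    return best / flops
```

Running estimate_k3_gemm for several values of n, b, r and c and storing the results would give a table of k3,gemm values indexed by problem size and Algorithmic Parameters.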

Installation Routines k3,trsm is estimated by solving two multiple triangular systems, one upper triangular of size b × n/c and another lower triangular of size n/r × b. Thus, two parameters are estimated, one depending on n, b and c, and the other depending on n, b and r. As for the previous parameter, values can be obtained for different problem sizes.
