Automatic optimization of parallel linear algebra software
Domingo Giménez
Department of Programming, Languages and Systems
Teaching: Algorithms and Parallel Programming
Ph.D. advisor of Javier Cuenca, in collaboration with José Gonzalez (Department of Computer Architecture)
Javier Cuenca
Department of Computer Architecture
Teaching: Computer Structure
Ph.D. student: Automatic optimization of parallel linear algebra software
University of Murcia, SPAIN
Current Situation of Linear Algebra Parallel Routines
Linear algebra: highly optimizable operations, but the optimizations are platform-specific.
Traditional method: hand-optimization for each platform
• Time-consuming
• Incompatible with hardware evolution
• Incompatible with changes in the system (architecture and basic libraries)
• Unsuitable for systems with variable workload
• Misuse by non-expert users
Solutions to this situation?
Some groups and projects: ATLAS, GrADS, LAWRA, FLAME, I-LIB
But the problem is very complex.
Our approach
• Parameterised routines: system parameters and algorithmic parameters
• System parameters are obtained at installation time, using an analytical model of the routine and simple installation routines, with a reduced number of executions at installation time
• Algorithmic parameters are obtained from the analytical model, using the system parameters obtained in the installation process
Our approach: the scheme
[Diagram] Three phases: DESIGN (LAR-DESIGNER), INSTALLATION (SYSTEM MANAGER) and LIBRARY. Design: LAR, LAR MODELLING, LAR-MOD, IMPLEMENTATION OF LAR-ERs, LAR-ERs. Installation: BL, EXECUTION OF LAR-ERs, LAR-IF, LAR-SPF, OAP SELECTION, LAR-OAPF. Library: INCLUSION PROCESS.
Design: Modelling the LAR
[Diagram] DESIGN (LAR-DESIGNER): LAR, LAR MODELLING, LAR-MOD
LAR-MOD: Analytical Model of LAR
The behaviour of the algorithm on the platform is defined by
Texec = f(SPs, n, APs)
• SPs = f(n, APs): System Parameters
• APs: Algorithmic Parameters
• n: Problem Size
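To make the form Texec = f(SPs, n, APs) concrete, here is a minimal sketch of a toy model. The cost formula (a generic n³ arithmetic term shared among p processors plus a per-step message term) and every name in it (`SystemParams`, `AlgParams`, `t_exec`) are assumptions made for this illustration, not the model used in the talk. For simplicity the SPs are treated as constants, whereas the slide notes that they may themselves depend on n and the APs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemParams:
    tc: float  # arithmetic cost per floating-point operation (seconds)
    ts: float  # communication start-up time (seconds)
    tw: float  # word-sending time (seconds per word)


@dataclass(frozen=True)
class AlgParams:
    b: int  # block size
    p: int  # number of processors


def t_exec(sp: SystemParams, n: int, ap: AlgParams) -> float:
    """Hypothetical model: arithmetic term plus communication term."""
    arithmetic = sp.tc * n**3 / ap.p
    if ap.p == 1:
        return arithmetic  # no communication in the sequential case
    steps = n / ap.b       # assumed: one communication step per block column
    words = n * ap.b       # assumed: words sent per step
    communication = steps * (sp.ts + words * sp.tw)
    return arithmetic + communication
```

Evaluating `t_exec` for several `AlgParams` values with fixed `SystemParams` is the kind of comparison that an analytical model makes possible without running the routine itself.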
LAR-MOD: Analytical Model of LAR
System Parameters (SPs) reflect what determines the LAR's performance:
• Hardware platform: physical characteristics and current conditions
• Basic libraries
LAR-MOD: Analytical Model of LAR
Two kinds of SPs:
• Communication System Parameters (CSPs)
• Arithmetic System Parameters (ASPs)
LAR-MOD: Analytical Model of LAR
Communication System Parameters (CSPs):
• ts: start-up time
• tw: word-sending time
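A start-up time and a word-sending time are commonly combined in the linear cost t(m) = ts + m·tw for a message of m words. Assuming that form (the slides do not state it), ts and tw can be estimated from a few timed messages by a least-squares line fit. The function name `fit_ts_tw` and the input format are hypothetical; this is a sketch, not the authors' installation routine.

```python
def fit_ts_tw(samples):
    """Least-squares fit of t(m) = ts + m * tw.

    samples: list of (message_size_in_words, measured_time_in_seconds).
    Returns the pair (ts, tw).
    """
    k = len(samples)
    if k < 2:
        raise ValueError("need at least two measurements")
    mean_m = sum(m for m, _ in samples) / k
    mean_t = sum(t for _, t in samples) / k
    sxx = sum((m - mean_m) ** 2 for m, _ in samples)
    if sxx == 0:
        raise ValueError("message sizes must not all be equal")
    sxy = sum((m - mean_m) * (t - mean_t) for m, t in samples)
    tw = sxy / sxx              # slope: time per word
    ts = mean_t - tw * mean_m   # intercept: start-up time
    return ts, tw
```

The measurements would come from timing messages of different sizes on the target network; only a handful of sizes is needed, which fits the goal of a reduced number of executions at installation time.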
LAR-MOD: Analytical Model of LAR
Arithmetic System Parameters (ASPs):
• tc: arithmetic cost
• Using BLAS: k1, k2 and k3
LAR-MOD: Analytical Model of LAR
How to estimate each SP?
1. Obtain the kernel of the performance cost of the LAR
2. Make an Estimation Routine from this kernel
Design: Making the LAR-ERs
[Diagram] DESIGN (LAR-DESIGNER): LAR, LAR MODELLING, LAR-MOD, IMPLEMENTATION OF LAR-ERs, LAR-ERs
LAR-ERs: Estimation Routines
Arithmetic System Parameters (ASPs): the Estimation Routine is built from the computation kernel of the LAR
• Similar storage scheme
• Similar quantity of data
Communication System Parameters (CSPs): the Estimation Routine is built from the communication kernel of the LAR
• Similar kind of communication
• Similar quantity of data
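An estimation routine for an arithmetic parameter can be sketched as timing a small computation kernel and dividing by its operation count. In the sketch below a plain matrix product stands in for the LAR's computation kernel; that choice, the names `matmul_kernel` and `estimate_tc`, and the default sizes are all assumptions for illustration. A real estimation routine would use the kernel, storage scheme and data sizes of the routine being installed.

```python
import time


def matmul_kernel(a, b_mat, n):
    """Plain n x n matrix product; stands in for the LAR's computation kernel."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b_mat[k][j]
    return c


def estimate_tc(n=40, repeats=3):
    """Estimate seconds per floating-point operation from the fastest run."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b_mat = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul_kernel(a, b_mat, n)
        best = min(best, time.perf_counter() - start)
    flops = 2 * n**3  # one multiply and one add per inner iteration
    return best / flops
```

Taking the fastest of several runs reduces the influence of other load on the machine; on a system with variable workload, the current conditions would instead be part of what is measured.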
Design: process finished
[Diagram] DESIGN (LAR-DESIGNER), hand-made, only once: LAR, LAR MODELLING, LAR-MOD, IMPLEMENTATION OF LAR-ERs, LAR-ERs
Installation: Running the LAR-ERs
[Diagram] INSTALLATION (SYSTEM MANAGER) added to the design scheme: BL, EXECUTION OF LAR-ERs, LAR-IF, LAR-SPF
Installation: obtaining the OAP
[Diagram] INSTALLATION (SYSTEM MANAGER): BL, EXECUTION OF LAR-ERs, LAR-IF, LAR-SPF, OAP SELECTION, LAR-OAPF
Installation: obtaining the OAP
Algorithmic Parameters (APs): once the SP values are known, the optimum values for the APs (OAP) are calculated:
• b: block size
• p: number of processors
• r × c: logical topology, grid configuration (logical 2D mesh)
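One simple way to calculate the OAP from known SP values is to evaluate the model for every candidate (b, p, r × c) and keep the minimum. The sketch below does this by exhaustive search. The cost function `model_time` is a made-up toy, not the model from the talk, and the name `select_oap` is hypothetical; only the search pattern is the point.

```python
def model_time(n, b, r, c, tc, ts, tw):
    """Hypothetical execution-time model on an r x c logical mesh (toy only)."""
    p = r * c
    arithmetic = tc * n**3 / p
    if p == 1:
        return arithmetic
    steps = n / b
    # Toy communication term: each step sends a block row and a block column.
    words = n * b / r + n * b / c
    return arithmetic + steps * (2 * ts + words * tw)


def select_oap(n, tc, ts, tw, block_sizes, max_procs):
    """Return (time, b, p, (r, c)) minimising the model over the candidate APs."""
    best = None
    for b in block_sizes:
        for p in range(1, max_procs + 1):
            for r in range(1, p + 1):
                if p % r:
                    continue  # r must divide p to form an r x c mesh
                c = p // r
                t = model_time(n, b, r, c, tc, ts, tw)
                if best is None or t < best[0]:
                    best = (t, b, p, (r, c))
    return best
```

Because only the model is evaluated, the search costs far less than running the routine for each candidate. Ties are resolved in favour of the first candidate found.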
Installation: putting it all together
[Diagram] Full scheme (DESIGN, INSTALLATION, LIBRARY), with the INCLUSION PROCESS and the LIBRARY added to the installation elements: BL, EXECUTION OF LAR-ERs, LAR-IF, LAR-SPF, OAP SELECTION, LAR-OAPF
Installation process finished
[Diagram] Full scheme: DESIGN (LAR-DESIGNER), INSTALLATION (SYSTEM MANAGER), LIBRARY
Experiments
• LAR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem. Platform: SGI Origin 2000
• LAR: Gaussian elimination. Platform: NoW (heterogeneous system)
• LAR: block LU factorization. Platforms: IBM SP2, SGI Origin 2000, NoW
Basic libraries: reference BLAS, machine BLAS, ATLAS
Jacobi on Origin 2000
[Figure] Comparison of execution times using different sets of Algorithmic Parameters (8 processors).
LU on IBM SP2
[Figure] Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors.
LU on Origin 2000
[Figure] Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4, 8 and 16 processors.
LU on NoW
[Figure] Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 processors, using machine BLAS and ATLAS as basic libraries.
Gaussian elimination on Heterogeneous NoW
[Figure] Panels: Homogeneous, Hybrid, Heterogeneous. Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time.
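The quotient reported in these experiments can be computed as follows, assuming that the optimum execution time is the lowest measured time among the parameter sets that were actually tried. The function name `model_quotient` and the dictionary layout are hypothetical.

```python
def model_quotient(measured_times, model_choice):
    """Ratio of the time obtained with the model's parameters to the best time.

    measured_times: dict mapping an AP tuple, e.g. (b, p), to its measured
    execution time in seconds.
    model_choice: the AP tuple selected by the model; must be a key of the dict.
    """
    optimum = min(measured_times.values())
    return measured_times[model_choice] / optimum
```

A quotient of 1 means the model selected the best of the tested parameter sets; larger values measure how much time is lost by trusting the model.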
Future Work
• We aim to develop a methodology valid for a wide range of systems and to include it in the design of linear algebra libraries; this requires analysing the methodology on more systems and with more routines
• The basic linear algebra library to use can be considered as another parameter
• An installation strategy common to a set of routines must be developed
• At the moment we are analysing routines individually, but it may be preferable to analyse algorithmic schemes
http://www.carm.es/ctyc/turismo/turismoenlaregion/TURISMO/turismo.htm