
Auto-optimization of linear algebra parallel routines: the Cholesky factorization


Presentation Transcript


  1. Auto-optimization of linear algebra parallel routines: the Cholesky factorization. Luis-Pedro García, Servicio de Apoyo a la Investigación Tecnológica, Universidad Politécnica de Cartagena, Spain, luis.garcia@sait.upct.es; Javier Cuenca, Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Spain, javiercm@ditec.um.es; Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain, domingo@dif.um.es. Parco 2005

  2. Outline • Introduction • Parallel routine for the Cholesky factorization • Experimental Results • Conclusions Parco 2005

  3. Introduction • Our goal: to obtain linear algebra parallel routines with auto-optimization capacity • The approach: model the behavior of the algorithm • This work: improve the model for the communication costs when: • The routine uses different types of MPI communication mechanisms • The system has more than one interconnection network • The communication parameters vary with the volume of the communication Parco 2005

  4. Introduction • Theoretical and experimental study of the algorithm, leading to the selection of the algorithmic parameters (AP) • In linear algebra parallel routines, typical AP and system parameters (SP) are: • AP: b, p = r x c and the basic library • SP: k1, k2, k3, ts and tw • An analytical model of the execution time: T(n) = f(n,AP,SP) Parco 2005

  5. Parallel Cholesky factorization • The n x n matrix is mapped through a block cyclic 2-D distribution onto a two-dimensional mesh of p = r x c processes (in ScaLAPACK style) (a) First step (b) Second step (c) Third step Figure 1. Work distribution in the first three steps, with n/b = 6 and p = 2 x 3 Parco 2005
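Not from the slides, but as a quick illustration of the distribution just described: a minimal C sketch of the ScaLAPACK-style 2-D block-cyclic mapping, where global block (I, J) is owned by process (I mod r, J mod c). With n/b = 6 and p = 2 x 3 it prints the block-to-process ownership map that underlies Figure 1.

```c
#include <stdio.h>

/* Process coordinates that own global block (I, J) in a 2-D block-cyclic
 * distribution over an r x c process mesh (ScaLAPACK convention). */
static void owner(int I, int J, int r, int c, int *pr, int *pc)
{
    *pr = I % r;   /* block rows are dealt cyclically over process rows    */
    *pc = J % c;   /* block columns are dealt cyclically over process cols */
}

int main(void)
{
    int b = 64, n = 6 * b, r = 2, c = 3;       /* n/b = 6 blocks, p = 2 x 3 mesh */

    for (int I = 0; I < n / b; I++) {          /* print the owner of every block */
        for (int J = 0; J < n / b; J++) {
            int pr, pc;
            owner(I, J, r, c, &pr, &pc);
            printf("(%d,%d) ", pr, pc);
        }
        printf("\n");
    }
    return 0;
}
```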

  6. Parallel Cholesky factorization • The general model: t(n) = f(n,AP,SP) • Problem size: • n matrix size • Algorithmic parameters (AP): • b block size • p = r x c processes • System parameters (SP), SP = g(n,AP): • k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk, the costs of the basic arithmetic operations • ts(p) start-up time • tws(n,p), twd(n,p) word-sending times for the different types of communication, with tcom(n,p) = ts(p) + n·tw(n,p) Parco 2005

  7. Parallel Cholesky factorization • Theoretical model: the predicted execution time is the sum of the arithmetic cost and the communication cost, T = tarit + tcom Parco 2005
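As a hedged illustration of how such a model drives the auto-optimization, the sketch below evaluates T = tarit + tcom for every candidate block size b and mesh p = r x c and keeps the combination with the lowest predicted time. The cost functions tarit and tcom here are placeholders with made-up constants, not the expressions of the slide's model; in the real routine they would be built from the measured SPs k(n,b,p), ts(p), tws(n,p) and twd(n,p).

```c
/* Hedged sketch: AP selection by minimising the modelled time T = tarit + tcom.
 * Both cost functions are placeholders; the real routine uses the model
 * expressions together with the measured system parameters. */
#include <stdio.h>
#include <float.h>

static double tarit(int n, int b, int r, int c)
{
    (void)b;                                   /* placeholder ignores the block size */
    return (double)n * n * n / (3.0 * r * c);  /* Cholesky flops spread over p processes */
}

static double tcom(int n, int b, int r, int c)
{
    (void)r; (void)c;                          /* placeholder ignores the mesh shape */
    double ts = 50.0, tw = 0.01;               /* made-up start-up and word-sending times */
    return (n / (double)b) * (ts + (double)n * b * tw);  /* one broadcast per block step */
}

int main(void)
{
    int n = 2048, p = 8;                       /* problem size and number of processes */
    int best_b = 0, best_r = 0;
    double best_t = DBL_MAX;

    for (int b = 16; b <= 128; b *= 2)         /* candidate block sizes */
        for (int r = 1; r <= p; r++)           /* candidate meshes r x c with r*c = p */
            if (p % r == 0) {
                int c = p / r;
                double t = tarit(n, b, r, c) + tcom(n, b, r, c);
                if (t < best_t) { best_t = t; best_b = b; best_r = r; }
            }

    printf("chosen AP: b = %d, p = %d x %d\n", best_b, best_r, p / best_r);
    return 0;
}
```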

  8. Experimental Results • Systems: • A network of four Intel Pentium 4 nodes (P4net) connected by a Fast Ethernet switch, enabling parallel communications between them. The MPI library used is MPICH • A network of four quad-processor HP AlphaServer nodes (HPC160), using Shared Memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc) for the communications between processes. An MPI library optimized for Shared Memory and for MemoryChannel has been used. Parco 2005

  9. Experimental Results • How to estimate the arithmetic SPs • With routines performing some basic operation (dgemm, dsyrk, dtrsm) with the same data-access scheme used in the algorithm • How to estimate the communication SPs • With routines that communicate rows or columns in the logical mesh of processes: • With a broadcast of an MPI derived data type between processes in the same column • With a broadcast of an MPI predefined data type between processes in the same row • In both cases the experiments are repeated several times to obtain an average value Parco 2005
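A minimal sketch of such a measurement, under the assumption that MPI_COMM_WORLD stands in for one row or column communicator of the mesh and that a buffer of the predefined MPI_DOUBLE type is broadcast (the derived-data-type case only changes the datatype argument): the broadcast is repeated several times and the average time per call is reported, as on the slide; fitting the averages against the message size then yields estimates of ts and tw.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 100   /* number of repetitions averaged, as described on the slide */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int n = 1024; n <= 1024 * 1024; n *= 4) {      /* message sizes in doubles */
        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) buf[i] = 1.0;

        MPI_Barrier(MPI_COMM_WORLD);                    /* start all processes together */
        double t0 = MPI_Wtime();
        for (int k = 0; k < REPS; k++)
            MPI_Bcast(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / REPS;           /* average time per broadcast */

        if (rank == 0)
            printf("n = %7d doubles: %10.2f usec per broadcast\n", n, t * 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```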

  10. Experimental Results • Lowest execution time with the optimized version of BLAS and LAPACK for Pentium 4 and for Alpha Table 1. Values of arithmetic system parameters (in µsec) in Pentium 4 with BLASopt Table 2. Values of arithmetic system parameters (in µsec) in Alpha with CXML Parco 2005

  11. Experimental Results • But other SPs can depend on n and b, for example: k2,potf2 Table 3. Values of k2,potf2 (in µsec) in Pentium 4 with BLASopt Table 4. Values of k2,potf2 (in µsec) in Alpha with CXML Parco 2005

  12. Experimental Results • Communication system parameters • Broadcast cost for an MPI predefined data type, tws Table 5. Values of tws (in µsecs) in P4net Table 6. Values of tws (in µsecs) in HPC160 Parco 2005

  13. Experimental Results • Communication system parameters • Word-sending time of a broadcast for an MPI derived data type, twd Table 7. Values of twd (in µsecs) obtained experimentally for different b and p Parco 2005

  14. Experimental Results • Communication system parameters • Start-up time of the MPI broadcast, ts • It can be considered that ts(n,p) ≈ ts(p) Table 8. Values of ts (in µsecs) obtained experimentally for different numbers of processes Parco 2005

  15. Experimental Results • P4net Parco 2005

  16. Experimental Results • HPC160smp Parco 2005

  17. Experimental Results • Parameters selection in P4net Table 9. Parameters selection for the Cholesky factorization in P4net Parco 2005

  18. Experimental Results • Parameters selection in HPC160 Table 10. Parameters selection for the Cholesky factorization in HPC160 with shared memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc) Parco 2005

  19. Conclusions • The method has been applied successfully to the Cholesky factorization and can be applied to other linear algebra routines • It is necessary to use different costs for the different types of MPI communication mechanisms, and different costs for the communication parameters in systems with more than one interconnection network • It is also necessary to decide the optimal allocation of processes per node according to the speed of the interconnection networks (hybrid systems) Parco 2005
