
Auto-optimization of linear algebra parallel routines: the Cholesky factorization


Presentation Transcript


  1. Auto-optimization of linear algebra parallel routines: the Cholesky factorization. Luis-Pedro García, Servicio de Apoyo a la Investigación Tecnológica, Universidad Politécnica de Cartagena, Spain, luis.garcia@sait.upct.es; Javier Cuenca, Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Spain, javiercm@ditec.um.es; Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain, domingo@dif.um.es. Parco 2005

  2. Outline • Introduction • Parallel routine for the Cholesky factorization • Experimental Results • Conclusions Parco 2005

  3. Introduction • Our goal: to obtain linear algebra parallel routines with auto-optimization capacity • The approach: model the behavior of the algorithm • This work: improve the model for the communication costs when: • The routine uses different types of MPI communication mechanisms • The system has more than one interconnection network • The communication parameters vary with the volume of the communication Parco 2005

  4. Introduction • Theoretical and experimental study of the algorithm, leading to the selection of the algorithmic parameters (AP) • In linear algebra parallel routines, typical AP and system parameters (SP) are: • AP: b, p = r x c and the basic library • SP: k1, k2, k3, ts and tw • An analytical model of the execution time: T(n) = f(n,AP,SP) Parco 2005

  5. Parallel Cholesky factorization • The n x n matrix is mapped through a block cyclic 2-D distribution onto a two-dimensional mesh of p = r x c processes (in ScaLAPACK style) (a) First step (b) Second step (c) Third step Figure 1. Work distribution in the first three steps, with n/b = 6 and p = 2 x 3 Parco 2005
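Not from the slides, but as a quick illustration of the distribution just described: a minimal C sketch of the ScaLAPACK-style 2-D block-cyclic mapping, where global block (I, J) is owned by process (I mod r, J mod c). With n/b = 6 and p = 2 x 3 it prints the block-to-process ownership map that underlies Figure 1.

```c
#include <stdio.h>

/* Process coordinates that own global block (I, J) in a 2-D block-cyclic
 * distribution over an r x c process mesh (ScaLAPACK convention). */
static void owner(int I, int J, int r, int c, int *pr, int *pc)
{
    *pr = I % r;   /* block rows are dealt cyclically over process rows    */
    *pc = J % c;   /* block columns are dealt cyclically over process cols */
}

int main(void)
{
    int b = 64, n = 6 * b, r = 2, c = 3;       /* n/b = 6 blocks, p = 2 x 3 mesh */

    for (int I = 0; I < n / b; I++) {          /* print the owner of every block */
        for (int J = 0; J < n / b; J++) {
            int pr, pc;
            owner(I, J, r, c, &pr, &pc);
            printf("(%d,%d) ", pr, pc);
        }
        printf("\n");
    }
    return 0;
}
```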

  6. Parallel Cholesky factorization • The general model: t(n) = f(n,AP,SP) • Problem size: • n matrix size • Algorithmic parameters (AP): • b block size • p = r x c processes • System parameters (SP), SP = g(n,AP): • k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk, the costs of the basic arithmetic operations • ts(p) start-up time • tws(n,p), twd(n,p) word-sending times for the different types of communication, with tcom(n,p) = ts(p) + n·tw(n,p) Parco 2005

  7. Parallel Cholesky factorization • Theoretical model: the predicted execution time is the sum of the arithmetic cost and the communication cost, T = tarit + tcom Parco 2005
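As a hedged illustration of how such a model drives the auto-optimization, the sketch below evaluates T = tarit + tcom for every candidate block size b and mesh p = r x c and keeps the combination with the lowest predicted time. The cost functions tarit and tcom here are placeholders with made-up constants, not the expressions of the slide's model; in the real routine they would be built from the measured SPs k(n,b,p), ts(p), tws(n,p) and twd(n,p).

```c
/* Hedged sketch: AP selection by minimising the modelled time T = tarit + tcom.
 * Both cost functions are placeholders; the real routine uses the model
 * expressions together with the measured system parameters. */
#include <stdio.h>
#include <float.h>

static double tarit(int n, int b, int r, int c)
{
    (void)b;                                   /* placeholder ignores the block size */
    return (double)n * n * n / (3.0 * r * c);  /* Cholesky flops spread over p processes */
}

static double tcom(int n, int b, int r, int c)
{
    (void)r; (void)c;                          /* placeholder ignores the mesh shape */
    double ts = 50.0, tw = 0.01;               /* made-up start-up and word-sending times */
    return (n / (double)b) * (ts + (double)n * b * tw);  /* one broadcast per block step */
}

int main(void)
{
    int n = 2048, p = 8;                       /* problem size and number of processes */
    int best_b = 0, best_r = 0;
    double best_t = DBL_MAX;

    for (int b = 16; b <= 128; b *= 2)         /* candidate block sizes */
        for (int r = 1; r <= p; r++)           /* candidate meshes r x c with r*c = p */
            if (p % r == 0) {
                int c = p / r;
                double t = tarit(n, b, r, c) + tcom(n, b, r, c);
                if (t < best_t) { best_t = t; best_b = b; best_r = r; }
            }

    printf("chosen AP: b = %d, p = %d x %d\n", best_b, best_r, p / best_r);
    return 0;
}
```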

  8. Experimental Results • Systems: • A network of four Intel Pentium 4 nodes (P4net) connected by a Fast Ethernet switch, enabling parallel communications between them. The MPI library used is MPICH • A network of four quad-processor HP AlphaServer nodes (HPC160), using Shared Memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc) for the communications between processes. An MPI library optimized for Shared Memory and for MemoryChannel has been used. Parco 2005

  9. Experimental Results • How to estimate the arithmetic SPs • With routines performing some basic operation (dgemm, dsyrk, dtrsm) with the same data-access scheme used in the algorithm • How to estimate the communication SPs • With routines that communicate rows or columns in the logical mesh of processes: • With a broadcast of an MPI derived data type between processes in the same column • With a broadcast of an MPI predefined data type between processes in the same row • In both cases the experiments are repeated several times to obtain an average value Parco 2005
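A minimal sketch of such a measurement, under the assumption that MPI_COMM_WORLD stands in for one row or column communicator of the mesh and that a buffer of the predefined MPI_DOUBLE type is broadcast (the derived-data-type case only changes the datatype argument): the broadcast is repeated several times and the average time per call is reported, as on the slide; fitting the averages against the message size then yields estimates of ts and tw.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 100   /* number of repetitions averaged, as described on the slide */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int n = 1024; n <= 1024 * 1024; n *= 4) {      /* message sizes in doubles */
        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) buf[i] = 1.0;

        MPI_Barrier(MPI_COMM_WORLD);                    /* start all processes together */
        double t0 = MPI_Wtime();
        for (int k = 0; k < REPS; k++)
            MPI_Bcast(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / REPS;           /* average time per broadcast */

        if (rank == 0)
            printf("n = %7d doubles: %10.2f usec per broadcast\n", n, t * 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```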

  10. Experimental Results • Lowest execution time with the optimized version of BLAS and LAPACK for Pentium 4 and for Alpha Table 1. Values of arithmetic system parameters (in µsec) in Pentium 4 with BLASopt Table 2. Values of arithmetic system parameters (in µsec) in Alpha with CXML Parco 2005

  11. Experimental Results • But other SPs can depend on n and b, for example: k2,potf2 Table 3. Values of k2,potf2 (in µsec) in Pentium 4 with BLASopt Table 4. Values of k2,potf2 (in µsec) in Alpha with CXML Parco 2005

  12. Experimental Results • Communication system parameters • Broadcast cost for an MPI predefined data type, tws Table 5. Values of tws (in µsecs) in P4net Table 6. Values of tws (in µsecs) in HPC160 Parco 2005

  13. Experimental Results • Communication system parameters • Word-sending time of a broadcast for an MPI derived data type, twd Table 7. Values of twd (in µsecs) obtained experimentally for different b and p Parco 2005

  14. Experimental Results • Communication system parameters • Start-up time of the MPI broadcast, ts • It can be considered that ts(n,p) ≈ ts(p) Table 8. Values of ts (in µsecs) obtained experimentally for different numbers of processes Parco 2005

  15. Experimental Results • P4net Parco 2005

  16. Experimental Results • HPC160smp Parco 2005

  17. Experimental Results • Parameters selection in P4net Table 9. Parameters selection for the Cholesky factorization in P4net Parco 2005

  18. Experimental Results • Parameters selection in HPC160 Table 10. Parameters selection for the Cholesky factorization in HPC160 with shared memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc) Parco 2005

  19. Conclusions • The method has been applied successfully to the Cholesky factorization and can be applied to other linear algebra routines • It is necessary to use different costs for the different types of MPI communication mechanisms, and different costs for the communication parameters in systems with more than one interconnection network • It is also necessary to decide the optimal allocation of processes per node according to the speed of the interconnection networks (hybrid systems) Parco 2005
