Advances in the Optimization of Parallel Routines (I)

Learn about the optimization of parallel routines for linear algebra, installation, autotuning, library modifications, polylibraries, algorithmic schemes, heterogeneous systems, and peer-to-peer computing.

Presentation Transcript


  1. Advances in the Optimization of Parallel Routines (I) Domingo Giménez Departamento de Informática y Sistemas Universidad de Murcia, Spain dis.um.es/~domingo Universidad Politécnica de Valencia

  2. Outline • A little history • Modelling Linear Algebra Routines • Installation routines • Autotuning routines • Modifications to libraries’ hierarchy • Polylibraries • Algorithmic schemes • Heterogeneous systems • Peer to peer computing

  3. Collaborations and self-references • Modelling Linear Algebra Routines • + J. Cuenca + J. González: • Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001 • Towards the Design of an Automatically Tuned Linear Algebra Library. 2002 • + J. Cuenca + L. P. García + J. González + A. Vidal: • Empirical Modelling of Parallel Linear Algebra Routines. 2003

  4. Collaborations and self-references • Installation routines • + G. Carrillo: • Installation routines for linear algebra libraries on LANs. 2000 • + G. Carrillo + J. Cuenca + J. González: • Optimización automática de rutinas paralelas de álgebra lineal. 2000

  5. Collaborations and self-references • Autotuning routines • + J. Cuenca + J. González: • Automatic parameterization of parallel linear algebra routines. 2001 • + J. Cuenca: • Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002

  6. Collaborations and self-references • Modifications to the libraries’ hierarchy • + J. Cuenca + J. González: • Architecture of an Automatic Tuned Linear Algebra Library. 2002 - 2004

  7. Collaborations and self-references • Polylibraries • + P. Alberti + P. Alonso + J. Cuenca + A. Vidal: • Designing Polylibraries to Speed Up Parallel Computations. 2003

  8. Collaborations and self-references • Algorithmic schemes • + J. P. Martínez: • Automatic Optimization in Parallel Dynamic Programming Schemes. 2004

  9. Collaborations and self-references • Heterogeneous systems • + J. Cuenca + J. Dongarra + J. González + K. Roche: • Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003 • + J. Cuenca + J. P. Martínez: • Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004

  10. Outline • A little history • Modelling Linear Algebra Routines • Installation routines • Autotuning routines • Modifications to libraries’ hierarchy • Polylibraries • Algorithmic schemes • Heterogeneous systems • Peer to peer computing

  11. A little history • Parallel optimization in the past: • Hand-optimization for each platform • Time consuming • Incompatible with hardware evolution • Incompatible with changes in the system (architecture and basic libraries) • Unsuitable for systems with variable workloads • Misuse by non-expert users

  12. A little history • Initial solutions to this situation: • Problem-specific solutions • Polyalgorithms • Installation tests

  13. A little history • Problem-specific solutions: • Brewer (1994): Sorting Algorithms, Differential Equations • Frigo (1997): FFTW: The Fastest Fourier Transform in the West • LAWRA (1997): Linear Algebra With Recursive Algorithms

  14. A little history • Polyalgorithms: • Brewer • FFTW • PHiPAC (1997): Linear Algebra

  15. A little history • Installation tests: • ATLAS (2001): Dense Linear Algebra, sequential • Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm • I-LIB (2000): some parallel linear algebra routines

  16. A little history • Parallel optimization today: • Optimization based on computational kernels • Systematic development of routines • Auto-optimization of routines • Middleware for auto-optimization

  17. A little history • Optimization based on computational kernels: • Efficient kernels (BLAS) and algorithms based on these kernels • Auto-optimization of the basic kernels (ATLAS)

  18. A little history • Systematic development of routines: • FLAME project • R. van de Geijn + E. Quintana + … • Dense Linear Algebra • Based on Object Oriented Design • LAWRA • Dense Linear Algebra • For Shared Memory Systems

  19. A little history • Auto-optimization of routines: • At installation time: • ATLAS, Dongarra + Whaley • I-LIB, Kanada + Katagiri + Kuroda • SOLAR, Cuenca + Giménez + González • LFC, Dongarra + Roche • At execution time: • Solve a reduced problem in each processor (Kalinov + Lastovetsky) • Use a system evaluation tool (NWS)

  20. A little history • Middleware for auto-optimization: • LFC: • Middleware for Dense Linear Algebra Software in Clusters. • Hierarchy of autotuning libraries: • Include in the libraries installation routines to be used in the development of higher level libraries • FIBER: • Proposal of general middleware • Evolution of I-LIB • mpC: • For heterogeneous systems

  21. A little history • Parallel optimization in the future?: • Skeletons and languages • Heterogeneous and variable-load systems • Distributed systems • P2P computing

  22. A little history • Skeletons and languages: Develop skeletons for parallel algorithmic schemes together with execution time models and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)

  23. A little history • Heterogeneous and variable-load systems: • Heterogeneous algorithms: unbalanced distribution of data (static or dynamic) • Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic) • Variable-load systems treated as dynamic heterogeneous systems

  24. A little history • Distributed systems: • Intrinsically heterogeneous and variable-load • Very high cost of communications • Special middleware is necessary (Globus, NWS) • There can be servers that attend to client queries

  25. A little history • P2P computing: • Users can join and leave dynamically • All the users are of the same type (initially) • It is distributed, heterogeneous and variable-load • But special middleware is necessary

  26. Outline • A little history • Modelling Linear Algebra Routines • Installation routines • Autotuning routines • Modifications to libraries’ hierarchy • Polylibraries • Algorithmic schemes • Heterogeneous systems • Peer to peer computing

  27. Modelling Linear Algebra Routines An accurate prediction of the execution time is necessary in order to select: • The number of processes • The number of processors • Which processors • The number of rows and columns of processes (the topology) • The assignment of processes to processors • The computational block size (in linear algebra algorithms) • The communication block size • The algorithm (polyalgorithms) • The routine or library (polylibraries)

  28. Modelling Linear Algebra Routines Cost of a parallel program: $t = t_{arith} + t_{comm} + t_{over} - t_{overlap}$ • $t_{arith}$: arithmetic time • $t_{comm}$: communication time • $t_{over}$: overhead, for synchronization, imbalance, process creation, ... • $t_{overlap}$: overlapping of communication and computation

  29. Modelling Linear Algebra Routines Estimation of the time: [equation] Considering computation and communication divided into a number of steps: [equation] For each part of the formula, the value taken is that of the process which gives the highest value.
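A minimal sketch of this step-wise estimate, assuming that per-step, per-process arithmetic and communication times are available as plain lists. The function name `estimate_time` and the sample numbers are illustrative and do not come from the presentation; overhead and overlapping are ignored here.

```python
from typing import Sequence


def estimate_time(arith: Sequence[Sequence[float]],
                  comm: Sequence[Sequence[float]]) -> float:
    """Sum over steps of the slowest arithmetic and slowest communication part.

    arith[i][q] and comm[i][q] hold the time (seconds) that process q
    spends on arithmetic and on communication in step i.
    """
    total = 0.0
    for step_arith, step_comm in zip(arith, comm):
        # Each part of the formula takes the value of the process
        # that gives the highest value in that step.
        total += max(step_arith) + max(step_comm)
    return total


# Two steps, three processes (made-up timings).
arith_times = [[1.0, 1.5, 1.2], [0.5, 0.4, 0.6]]
comm_times = [[0.2, 0.1, 0.3], [0.1, 0.1, 0.1]]
print(estimate_time(arith_times, comm_times))
```

Taking the maximum separately for arithmetic and communication gives an upper bound on the step time, since the slowest process in one part need not be the slowest in the other.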

  30. Modelling Linear Algebra Routines The time depends on the problem size (n) and the system size (p), but also on some ALGORITHMIC PARAMETERS, such as the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors

  31. Modelling Linear Algebra Routines The time also depends on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system: typically the cost of an arithmetic operation (tc), the start-up time (ts) and the word-sending time (tw)

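  32. Modelling Linear Algebra Routines LU factorisation (Golub - Van Loan):

$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & U_{33} \end{pmatrix}$$

  • Step 1: (unblocked LU factorisation) • Step 2: (multiple lower triangular systems) • Step 3: (multiple upper triangular systems) • Step 4: (update south-east blocks)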
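The four steps can be sketched in plain Python as follows. This is an illustrative version without pivoting, which is an assumption of the sketch (the slide does not say whether pivoting is used); the names `block_lu` and `multiply_lu` are hypothetical, and a real library would call BLAS/LAPACK kernels for each step.

```python
def block_lu(a, b):
    """Blocked LU without pivoting, in place on a square list of lists.

    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U. b is the block size.
    """
    n = len(a)
    for k in range(0, n, b):
        e = min(k + b, n)
        # Step 1: unblocked LU factorisation of the diagonal block.
        for j in range(k, e):
            for i in range(j + 1, e):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: multiple lower triangular systems, L11 * U12 = A12.
        for c in range(e, n):
            for i in range(k, e):
                for j in range(k, i):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 3: multiple upper triangular systems, L21 * U11 = A21.
        for i in range(e, n):
            for j in range(k, e):
                for t in range(k, j):
                    a[i][j] -= a[i][t] * a[t][j]
                a[i][j] /= a[j][j]
        # Step 4: update the south-east blocks, A22 -= L21 * U12.
        for i in range(e, n):
            for c in range(e, n):
                for j in range(k, e):
                    a[i][c] -= a[i][j] * a[j][c]
    return a


def multiply_lu(lu):
    """Rebuild L * U from the packed factors, to check the factorisation."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for t in range(min(i, j) + 1):
                l_it = 1.0 if t == i else lu[i][t]
                out[i][j] += l_it * lu[t][j]
    return out


# Diagonally dominant test matrix, so no pivoting is needed.
matrix = [[10.0, 1.0, 2.0, 3.0],
          [1.0, 12.0, 3.0, 4.0],
          [2.0, 3.0, 14.0, 5.0],
          [3.0, 4.0, 5.0, 16.0]]
factors = block_lu([row[:] for row in matrix], 2)
print(multiply_lu(factors))
```

Step 1 works on individual elements (BLAS 2 style), while steps 2 to 4 operate on whole blocks (BLAS 3 style), which is why the block size b changes the balance between the two kinds of operation.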

  33. Modelling Linear Algebra Routines The execution time is: [equation] If the blocks are of size 1, all the operations are performed on individual elements, but if the block size is b the cost is: [equation] where k3 and k2 are the costs of operations performed with BLAS 3 and BLAS 2

  34. Modelling Linear Algebra Routines But the cost of different operations of the same level is different, and the theoretical cost could be better modelled as: [equation] Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...

  35. Modelling Linear Algebra Routines The value of each System Parameter can depend on the problem size (n) and on the value of the Algorithmic Parameters (b). The formula has the form: [equation] What we want is to obtain the values of the Algorithmic Parameters (AP) with which the lowest execution time is obtained
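A hedged sketch of how such a model could be used to choose the block size for a sequential blocked LU. The per-step flop counts below are the usual approximations for an unblocked factorisation, two triangular solves and a matrix-matrix update; they are an assumption of this sketch and not the formula from the presentation. The names `lu_model_time` and `best_block_size`, and the sample cost functions, are hypothetical.

```python
def lu_model_time(n, b, k2_getf2, k3_trsm, k3_gemm):
    """Modelled time of a blocked LU of order n with block size b.

    Each k* argument is a System Parameter given as a function of b that
    returns the cost in seconds per floating point operation of that kernel.
    """
    total = 0.0
    for k in range(0, n, b):
        bb = min(b, n - k)      # size of the current diagonal block
        m = n - k - bb          # order of the trailing (south-east) part
        total += k2_getf2(b) * (2.0 / 3.0) * bb ** 3   # unblocked LU
        total += k3_trsm(b) * 2.0 * bb ** 2 * m        # two triangular solves
        total += k3_gemm(b) * 2.0 * bb * m ** 2        # south-east update
    return total


def best_block_size(n, candidates, **system_parameters):
    """Return the candidate b with the lowest modelled execution time."""
    return min(candidates,
               key=lambda b: lu_model_time(n, b, **system_parameters))


# Made-up System Parameters: level 3 kernels get cheaper per flop as b grows,
# while the level 2 kernel stays comparatively expensive.
sample_sp = {
    "k2_getf2": lambda b: 5e-9,
    "k3_trsm": lambda b: 1e-9 * (1.0 + 8.0 / b),
    "k3_gemm": lambda b: 1e-9 * (1.0 + 8.0 / b),
}
print(best_block_size(1024, [8, 16, 32, 64, 128], **sample_sp))
```

With equal constant costs for the three kernels and b dividing n, the modelled time reduces to the cost per flop times (2/3)n³, independent of b; the block size only matters once the System Parameters differ between kernels or vary with b.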

  36. Modelling Linear Algebra Routines The values of the System Parameters could be obtained • With installation routines associated with each linear algebra routine • From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization • At execution time, by testing the system conditions prior to the call to the routine

  37. Modelling Linear Algebra Routines These values can be obtained as simple values (traditional method) or as a function of the Algorithmic Parameters. In the latter case, a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored. When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size, and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time
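The lookup described here can be sketched as below, assuming the stored table is a nested dictionary indexed by problem size and then by block size. The table layout, the function names, the toy cost model and all numbers are illustrative assumptions, not data from the presentation.

```python
def nearest_size(stored_sizes, n):
    """Return the stored problem size closest to the real size n."""
    return min(stored_sizes, key=lambda s: abs(s - n))


def select_parameters(table, n, model):
    """Pick the Algorithmic Parameter value with the lowest predicted time.

    table[size][ap] is a dict of System Parameter values measured at
    installation time; model(n, ap, sp) returns a predicted time in seconds.
    """
    size = nearest_size(list(table), n)
    best_ap, best_time = None, float("inf")
    for ap, sp in table[size].items():
        predicted = model(n, ap, sp)
        if predicted < best_time:
            best_ap, best_time = ap, predicted
    return best_ap, best_time


def toy_model(n, b, sp):
    """Placeholder cost model: a level 3 term plus a level 2 term."""
    return sp["k3"] * (2.0 / 3.0) * n ** 3 + sp["k2"] * b * n ** 2


# Made-up installation data: the level 3 cost improves with the block size.
sample_table = {
    1024: {16: {"k3": 2.0e-9, "k2": 6e-9},
           32: {"k3": 1.5e-9, "k2": 6e-9},
           64: {"k3": 1.4e-9, "k2": 6e-9}},
    2048: {16: {"k3": 2.1e-9, "k2": 6e-9},
           32: {"k3": 1.6e-9, "k2": 6e-9},
           64: {"k3": 1.3e-9, "k2": 6e-9}},
}
block, predicted_time = select_parameters(sample_table, 1100, toy_model)
print(block, predicted_time)
```

The System Parameters are read from the row of the closest stored size, while the model itself is evaluated with the real problem size n.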

  38. Modelling Linear Algebra Routines Parallel block LU factorisation [Figure panels: matrix; processors; distribution of computations in the first step]

  39. Modelling Linear Algebra Routines Distribution of computations on successive steps [Figure panels: second step; third step]

  40. Modelling Linear Algebra Routines The cost of parallel block LU factorisation: [equation] • Tuning Algorithmic Parameters: block size: b; 2D-mesh of p processors: p = r × c, d = max(r,c) • System Parameters: cost of arithmetic operations: k2,getf2, k3,trsmm, k3,gemm; communication parameters: ts, tw

  41. Modelling Linear Algebra Routines The cost of parallel block QR factorisation: [equation] • Tuning Algorithmic Parameters: block size: b; 2D-mesh of p processors: p = r × c • System Parameters: cost of arithmetic operations: k2,geqr2, k2,larft, k3,gemm, k3,trmm; communication parameters: ts, tw

  42. Modelling Linear Algebra Routines The same basic operations appear repeatedly in different higher level routines: the information generated for one routine (let’s say LU) could be stored and used for other routines (e.g. QR), so a common format is necessary to store the information

  43. Modelling Linear Algebra Routines

  44. Modelling Linear Algebra Routines Parallel QR factorisation, IBM-SP2, 8 processors [Figure: execution time in seconds (0 to 80) against problem size (512 to 3584) for the series “mean”, “model” and “optimum”] • “mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (execution time which could be obtained by a non-expert user) • “optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters • “model” is the execution time with the values selected with the model

  45. Modelling Linear Algebra Routines Parameter selection for the QR algorithm

      IBM SP2
                   p=4            p=8
         n      b   r   c      b   r   c
      1024     16   1   4     16   1   8
      2048     32   1   4     16   1   8
      3072     32   1   4     32   2   4
      4096     32   1   4     32   2   4

      Origin 2000
                   p=4            p=8
         n      b   r   c      b   r   c
      1024     32   4   1     32   4   2
      2048     64   4   1     32   4   2
      3072      -   -   -     32   4   2
      4096      -   -   -     64   4   2

      Network of Pentium III with Fast Ethernet
                   p=4            p=8
         n      b   r   c      b   r   c
      1024     16   1   4     16   1   8
      2048     16   1   4     16   1   8
      3072     32   1   4     32   1   8
      4096     32   1   4     32   1   8

  46. Outline • A little history • Modelling Linear Algebra Routines • Installation routines • Autotuning routines • Modifications to libraries’ hierarchy • Polylibraries • Algorithmic schemes • Heterogeneous systems • Peer to peer computing

  47. Installation Routines In the formulas for the parallel block LU factorisation, the values of the System Parameters (k2,getf2, k3,trsmm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c)

  48. Installation Routines This is done by running, at installation time, the Installation Routines associated with the linear algebra routine, and storing the information generated to be used at running time ⇒ Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed

  49. Installation Routines k3,gemm is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c). Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes
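A rough sketch of what such an installation routine might look like. The pure-Python triple loop is only a stand-in, assumed here so that the example is self-contained; an actual installation routine would time the system's BLAS matrix-matrix kernel instead. The function name `estimate_k3_gemm` and the `repeats` parameter are hypothetical.

```python
import time


def estimate_k3_gemm(n, b, r, c, repeats=3):
    """Estimate the cost per flop (seconds) of an (n/r x b) * (b x n/c) update.

    The update C -= A * B is timed 'repeats' times and the best time is
    divided by the flop count 2 * (n/r) * b * (n/c).
    """
    rows, cols = n // r, n // c
    a = [[1.0] * b for _ in range(rows)]
    bm = [[1.0] * cols for _ in range(b)]
    cm = [[0.0] * cols for _ in range(rows)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(rows):
            c_row, a_row = cm[i], a[i]
            for t in range(b):
                a_it, b_row = a_row[t], bm[t]
                for j in range(cols):
                    c_row[j] -= a_it * b_row[j]
        best = min(best, time.perf_counter() - start)
    return best / (2.0 * rows * b * cols)


# Estimate the parameter for one (n, b, r, c) combination.
print(estimate_k3_gemm(n=128, b=16, r=2, c=2))
```

Running the routine for several values of n, b, r and c and storing the results gives the kind of table of System Parameter values that is consulted at running time.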

  50. Installation Routines k3,trsmm: two multiple triangular systems are solved, one upper triangular of size b × n/c, and another lower triangular of size n/r × b. Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r. As for the previous parameter, values can be obtained for different problem sizes
