Solving Challenging Numerical Linear Algebra Algorithms Using GPU Accelerators
Hatem Ltaief, KAUST Supercomputing Laboratory
Stanimire Tomov, University of Tennessee, Knoxville
ISC'12 Tutorial, Hamburg, June 17, 2012
Outline
• MAGMA: LAPACK for GPUs
• Methodology overview
• Use both GPUs and multicore CPUs
• MAGMA: from single to multi-GPU support
• One-sided factorizations and linear solvers
• Two-sided factorizations and eigensolvers
• Dynamic scheduling approaches to DLA
• MAGMA algorithms with dynamic scheduling
• Conclusions
MAGMA: LAPACK for GPUs
• MAGMA
  – Matrix Algebra on GPU and Multicore Architectures
  – Aims to provide LAPACK/ScaLAPACK functionality on hybrid architectures
  – http://icl.cs.utk.edu/magma/
• MAGMA BLAS
  – A subset of BLAS for GPUs, highly optimized for NVIDIA GPUs
  – Fast GEMM for Fermi (CUBLAS 3.2) [IJHPCA'10]
• MAGMA developers & collaborators
  – UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
  – A community effort, similar to LAPACK/ScaLAPACK
A New Generation of DLA Software
MAGMA hybrid algorithms (heterogeneity-friendly) rely on:
• a hybrid scheduler
• hybrid kernels
MAGMA 1.1
• 50+ hybrid LAPACK algorithms have been developed (200+ routines in total)
• Every algorithm is available in 4 precisions (s/c/d/z)
• There are 3 mixed-precision algorithms (zc & ds)
• These are hybrid algorithms, expressed in terms of BLAS
• MAGMA BLAS: a subset of GPU BLAS, optimized for Tesla and Fermi GPUs
MAGMA Methodology
A methodology to use all available resources:
• MAGMA uses a HYBRIDIZATION methodology based on
  – representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES among them
  – properly SCHEDULING task execution over the multicore and GPU hardware components
• Successfully applied to fundamental linear algebra algorithms
  – one- and two-sided factorizations and solvers
  – iterative linear solvers and eigensolvers
• Productivity
  – high level; leverages prior developments; exceeds homogeneous solutions in performance
• Hybrid CPU+GPU algorithms (small tasks for the multicores, large tasks for the GPUs)
Hybrid Algorithms
One-sided factorizations (LU, QR, Cholesky)
• Hybridization
  – Panels (Level 2 BLAS) are factored on the CPU using LAPACK
  – Trailing matrix updates (Level 3 BLAS) are done on the GPU, using "look-ahead"
A Hybrid Algorithm Example
• Left-looking hybrid Cholesky factorization in MAGMA 1.0 (a sketch of the idea follows below)
• The difference from LAPACK: the 3 additional lines shown in red on the slide
• Line 10 of the listing (done on the CPU) is overlapped with work on the GPU
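The red-line listing is a figure on the slide and is not reproduced here. Below is a minimal, hypothetical C sketch of the same idea (not MAGMA's actual source): a left-looking blocked Cholesky (lower triangular) with the matrix resident on the GPU, cuBLAS doing the Level 3 updates, and LAPACK's dpotrf_ factoring the nb-by-nb diagonal "panel" on the CPU. The stream and asynchronous-copy machinery MAGMA uses for full overlap is simplified away; the overlap here comes from the asynchronous DGEMM launch.

    #include <cublas_v2.h>

    void dpotrf_(const char *uplo, const int *n, double *a,
                 const int *lda, int *info);              /* LAPACK, on the CPU */

    #define dA(i, j) (dA + (size_t)(j) * ldda + (i))      /* device matrix access */

    int hybrid_potrf_lower(cublasHandle_t h, int n, int nb,
                           double *dA, int ldda,
                           double *work /* host buffer, >= nb*nb */)
    {
        const double one = 1.0, mone = -1.0;
        int info = 0;

        for (int j = 0; j < n; j += nb) {
            int jb = (nb < n - j) ? nb : n - j;

            /* Update the diagonal block with the previously factored columns. */
            cublasDsyrk(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, j,
                        &mone, dA(j, 0), ldda, &one, dA(j, j), ldda);

            /* GPU -> CPU transfer (the kind of line shown in red on the slide). */
            cublasGetMatrix(jb, jb, sizeof(double), dA(j, j), ldda, work, jb);

            /* Launch the (asynchronous) update of the sub-diagonal part of the
               block column; it runs on the GPU while the CPU factors below. */
            if (j + jb < n)
                cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, n - j - jb, jb, j,
                            &mone, dA(j + jb, 0), ldda, dA(j, 0), ldda,
                            &one, dA(j + jb, j), ldda);

            /* The CPU "panel" factorization, overlapped with the DGEMM above. */
            dpotrf_("L", &jb, work, &jb, &info);
            if (info != 0) return info;

            /* CPU -> GPU transfer (another red line). */
            cublasSetMatrix(jb, jb, sizeof(double), work, jb, dA(j, j), ldda);

            /* Finish the block column on the GPU. */
            if (j + jb < n)
                cublasDtrsm(h, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                            CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, n - j - jb, jb,
                            &one, dA(j, j), ldda, dA(j + jb, j), ldda);
        }
        return 0;
    }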
From Single to Multiple GPU Support
• Data distribution
  – 1-D block-cyclic distribution: nb-wide panels are dealt out as GPU0, GPU1, GPU2, GPU0, ... (a tiny owner-computation sketch follows below)
• Algorithm
  – The GPU holding the current panel sends it to the CPU
  – All updates are done in parallel on the GPUs
  – Look-ahead is done with the GPU holding the next panel
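As a tiny illustration of the 1-D block-cyclic layout, the owner of a panel is just a round-robin computation (helper names here are hypothetical, not MAGMA API):

    /* nb-wide panels are dealt round-robin over num_gpus devices, so
       panel p (0-based) and global column j live on these GPUs: */
    int panel_owner(int p, int num_gpus)          { return p % num_gpus; }
    int column_owner(int j, int nb, int num_gpus) { return (j / nb) % num_gpus; }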
LU Factorization (multi-GPU)
[Performance chart; annotation: matrix out of GPU memory]
Out of GPU Memory Algorithms
• Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using the existing algorithms)
• The rest of the matrix stays on the CPU
• Left-looking versions minimize writing on the CPU
• The steps (see the sketch below):
  1. Copy A2 to the GPU
  2. Update A2 using A1 (a panel of A1 at a time)
  3. Factor the updated A2 using the existing hybrid code
  4. Copy the factored A2 back to the CPU
• Trivially extended to multiple GPUs: A2 is "larger", with a 1-D block-cyclic distribution, again reusing the existing algorithms
[Figure: factored sub-matrix A1 on the CPU, sub-matrix A2 to be factored on the GPU, untouched part of the matrix]
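A hedged C sketch of the loop above. All helper names (Block, next_submatrix, copy_to_gpu, update_with_panel, hybrid_factor, copy_to_cpu) are hypothetical placeholders for the existing MAGMA building blocks, not real API calls:

    typedef struct { int first_panel, npanels; double *host, *dev; } Block;

    Block next_submatrix(double *A, int first_panel, int npanels);
    void  copy_to_gpu(Block *A2);
    void  update_with_panel(Block *A2, double *A, int panel);
    void  hybrid_factor(Block *A2);        /* the existing hybrid code */
    void  copy_to_cpu(Block *A2);

    /* Factor a matrix larger than GPU memory, one GPU-sized group of
       panels (A2) at a time; A1 is the already factored left part. */
    void factor_out_of_gpu_memory(double *A, int total_panels, int group)
    {
        for (int first = 0; first < total_panels; first += group) {
            Block A2 = next_submatrix(A, first, group);
            copy_to_gpu(&A2);                        /* step 1 */
            for (int p = 0; p < first; p++)          /* step 2: apply A1, */
                update_with_panel(&A2, A, p);        /* one panel at a time */
            hybrid_factor(&A2);                      /* step 3 */
            copy_to_cpu(&A2);                        /* step 4 */
        }
    }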
Hybrid Algorithms
Two-sided factorizations (to bidiagonal, tridiagonal, and upper Hessenberg forms) for eigenvalue and singular value problems
• Hybridization
  – Trailing matrix updates (Level 3 BLAS) are done on the GPU (similar to the one-sided factorizations)
  – Panels (Level 2 BLAS) are hybrid:
    operations with memory footprint restricted to the panel are done on the CPU;
    the time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU
Multi-GPU Two-Sided Factorizations
• Need high-performance multi-GPU Level 2 BLAS (e.g., 50% of the flops in the tridiagonal reduction are in SYMV)
[Chart: performance of DSYMV on multiple M2090s]
• T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical Report, 03/2012.
MAGMA Tridiagonalization in DP
• 50% of the flops are in SYMV
• Memory bound, i.e., it does not scale well on multicore CPUs (see the arithmetic below)
• Use the GPU's high memory bandwidth and an optimized SYMV
• 8x speedup over 12 Intel cores (X5660 @ 2.8 GHz)
Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
• T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical Report, 03/2012.
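Why SYMV is memory bound, in one line of arithmetic: a symmetric n x n DSYMV does 2n^2 flops but must stream at least n^2/2 matrix elements (exploiting symmetry), i.e., about 4n^2 bytes in double precision, so its performance is capped near half the memory bandwidth in Gflop/s. The bandwidth figure below is an assumed effective number for a Fermi-class GPU, not a value from the slides:

    #include <stdio.h>

    int main(void) {
        double bw = 140.0;                  /* GB/s, assumed effective bandwidth */
        double flops_per_byte = 2.0 / 4.0;  /* (2 n^2 flops) / (4 n^2 bytes)     */
        printf("DSYMV ceiling: ~%.0f Gflop/s\n", bw * flops_per_byte);
        return 0;                           /* ~70 Gflop/s on one such GPU       */
    }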
Further GPU Kernel Optimizations
[Performance chart: Fermi C2070]
A. Abdelfattah, J. Dongarra, D. Keyes and H. Ltaief, Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators, VECPAR, Japan, 2012.
From Static to Dynamic Scheduling…
• Static scheduling may stall in situations where work is available
• Hand-tuned optimizations
• Hardware heterogeneity
• Kernel heterogeneity
• Separation of concerns
• ⇒ Dynamic runtime system
Punch Lines
Productivity! Productivity! Productivity! Oh… Did I say Productivity?
Block Algorithms
• Panel-update sequence
  – Transformations are blocked/accumulated within the panel (Level 2 BLAS)
  – Transformations are then applied at once to the trailing submatrix (Level 3 BLAS)
• Parallelism hidden inside the BLAS
• Fork-join model (sketched below)
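For concreteness, a hedged sketch of that rhythm for a blocked right-looking Cholesky on the CPU, using the standard LAPACK/BLAS routines (a real DPOTRF adds argument checks and tuning). Each iteration joins on the sequential Level-2 panel (dpotf2_) before forking into the BLAS-parallel Level-3 updates (dtrsm_/dsyrk_):

    void dpotf2_(const char*, const int*, double*, const int*, int*);
    void dtrsm_(const char*, const char*, const char*, const char*,
                const int*, const int*, const double*, const double*,
                const int*, double*, const int*);
    void dsyrk_(const char*, const char*, const int*, const int*,
                const double*, const double*, const int*,
                const double*, double*, const int*);

    int blocked_potrf(int n, int nb, double *A, int lda)
    {
        const double one = 1.0, mone = -1.0;
        for (int j = 0; j < n; j += nb) {
            int jb = (nb < n - j) ? nb : n - j, m = n - j - jb, info;
            /* Panel: sequential Level 2 work on the diagonal block. */
            dpotf2_("L", &jb, &A[j + (size_t)j * lda], &lda, &info);
            if (info) return info;
            if (m > 0) {
                /* Update: Level 3 work, parallel inside the BLAS. */
                dtrsm_("R", "L", "T", "N", &m, &jb, &one,
                       &A[j + (size_t)j * lda], &lda,
                       &A[j + jb + (size_t)j * lda], &lda);
                dsyrk_("L", "N", &m, &jb, &mone,
                       &A[j + jb + (size_t)j * lda], &lda, &one,
                       &A[j + jb + (size_t)(j + jb) * lda], &lda);
            }
        }
        return 0;
    }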
Leveraging Block Algorithms…
[Figure: column-major data layout vs. tile data layout]
Lessons Learnt from PLASMA
• PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures: http://icl.cs.utk.edu/plasma/
• Tile algorithms on homogeneous x86 cores
  – Parallelism is brought to the fore
  – May require a redesign of the linear algebra algorithms
  – Translation to tile data layout
  – Removes unnecessary synchronization points between panel-update sequences
• DAG execution, where nodes represent tasks and edges define the dependencies between them
• Dynamic runtime system environment: QUARK
Example: Tile QR Factorization
[Figure: first panel factorization and the corresponding updates; DAG for a 4x4-tile matrix]
Let's go crazy!
[Figure: DAG of a 20x20-tile QR factorization]
Dynamic Scheduling
• Conceptually similar to out-of-order processor scheduling:
  – dynamic runtime DAG scheduler
  – out-of-order execution flow of fine-grained tasks
  – tasks are scheduled as soon as their dependencies are satisfied
  – producer-consumer relationships
• Data-flow programming model: a five-decades-old concept
  – Think "how things connect" rather than "how things happen"
  – Assembly line
  – Inherently parallel
Matrices Over Runtime Systems at Exascale (MORSE)
• Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems"
• Runtime challenges due to the ever-growing hardware complexity
• Algorithmic challenges to fully exploit the hardware capabilities
• Integrated into the MAGMA software stack
MAGMA-MORSE: x86 + Multiple GPUs
• Lessons learned from PLASMA!
• CUDA-based hybrid systems
• New high-performance numerical kernels
• StarPU runtime system (Augonnet et al., INRIA Bordeaux)
• Both x86 and GPUs ⇒ hybrid computations
• Similar to LAPACK in functionality
Achieving High Levels of Productivity
• From sequential nested-loop code to parallel execution

    for (k = 0; k < min(MT, NT); k++) {
        zgeqrt(A[k][k], ...);
        for (n = k+1; n < NT; n++)
            zunmqr(A[k][k], A[k][n], ...);
        for (m = k+1; m < MT; m++) {
            ztsqrt(A[k][k], A[m][k], ...);
            for (n = k+1; n < NT; n++)
                ztsmqr(A[m][k], A[k][n], A[m][n], ...);
        }
    }
Achieving High Levels of Productivity
• The same loop, with each kernel inserted as a task into the StarPU runtime instead of executed immediately

    for (k = 0; k < min(MT, NT); k++) {
        starpu_insert_task(&cl_zgeqrt, k, k, ...);
        for (n = k+1; n < NT; n++)
            starpu_insert_task(&cl_zunmqr, k, n, ...);
        for (m = k+1; m < MT; m++) {
            starpu_insert_task(&cl_ztsqrt, m, k, ...);
            for (n = k+1; n < NT; n++)
                starpu_insert_task(&cl_ztsmqr, m, n, k, ...);
        }
    }
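A note on why this works: StarPU derives the dependencies between the inserted tasks at runtime from the data each task touches and its declared access mode (read, write, or read-write), so the DAG of the tile QR factorization is recovered automatically from what still looks like the original sequential loop nest.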
Hybrid Architecture Targeted
• PCIe x16 interconnect: 64 Gb/s, a very thin pipe!
• Fermi C2050: 448 CUDA cores, 515 Gflop/s (double precision peak)
Performance Charts
• Cholesky
• QR
• LU
• Symmetric matrix inversion
Cholesky Performance
[Chart: 8 Intel x86 cores + 3 Tesla GPUs]
E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, Software for GPUs, GPU Computing Gems, vol. 2, 2011.
QR Performance
[Chart: 8 AMD x86 cores + 4 Tesla GPUs]
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.
QR Performance
[Chart: 8 AMD x86 cores + 4 Tesla GPUs; the CPU cores contribute ~+200 Gflop/s to the hybrid run, even though 12 cores alone reach only ~150 Gflop/s]
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.
Performance Breakdown
• Task distribution observed on StarPU:
  – sgeqrt: 20% of the tasks ran on GPUs
  – stsmqr: 92.5% of the tasks ran on GPUs
• Taking advantage of heterogeneity!
  – Only do what you are good at
  – Don't do what you are not good at
LU Performance
[Chart: 8 Intel x86 cores + 3 Tesla GPUs]
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief and S. Tomov, ACS/IEEE International Conference on Computer Systems and Applications (best paper award), 2011.
Symmetric Matrix Inversion
• A⁻¹, seriously???
• YES!
  – A critical component of the variance-covariance matrix computation in statistics
• Three steps (see the sketch below):
  1. Cholesky factorization (DPOTRF)
  2. Inversion of the Cholesky factor (DTRTRI)
  3. Product of the inverted Cholesky factor with its transpose (DLAUUM)
• Built on previous work by E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou, and L. Rosenberg
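As a reference point, the three steps map directly onto standard LAPACK calls; this is, in fact, the sequence LAPACK's DPOTRI performs internally (the GPU version schedules tile variants of the same three sweeps). A minimal CPU version:

    void dpotrf_(const char*, const int*, double*, const int*, int*);
    void dtrtri_(const char*, const char*, const int*, double*, const int*, int*);
    void dlauum_(const char*, const int*, double*, const int*, int*);

    /* On entry, A is symmetric positive definite (lower triangle stored);
       on exit, its lower triangle holds the lower triangle of A^{-1}. */
    int spd_inverse(int n, double *A, int lda)
    {
        int info;
        dpotrf_("L", &n, A, &lda, &info);      /* step 1: A = L * L^T          */
        if (info) return info;
        dtrtri_("L", "N", &n, A, &lda, &info); /* step 2: L := L^{-1}          */
        if (info) return info;
        dlauum_("L", &n, A, &lda, &info);      /* step 3: (L^{-1})^T * L^{-1}  */
        return info;                           /*         = A^{-1}             */
    }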
A⁻¹ Performance
[Chart: 8 Intel x86 cores + 2 Fermi GPUs]
H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC'11, India.
Summary and Future Directions
• Two methodologies for solving challenging DLA problems
  – Static scheduling (performance)
  – Dynamic scheduling (productivity)
• LAPACK-compliant API
• Source code freely available in MAGMA
• What's next?
  – Extended numerical functionality
  – Distributed-memory systems
Collaborators / Support
• MAGMA [Matrix Algebra on GPU and Multicore Architectures] team: http://icl.cs.utk.edu/magma/
• PLASMA [Parallel Linear Algebra for Scalable Multicore Architectures] team: http://icl.cs.utk.edu/plasma
• Collaborating partners:
  – University of Tennessee, Knoxville
  – University of California, Berkeley
  – University of Colorado, Denver
  – INRIA, France
  – KAUST, Saudi Arabia