Numerical Libraries in High Performance Computing Dmitry Pekurovsky dmitry@sdsc.edu CIEG workshop, April 25 2007
Tutorial Outline • General comments • Overview of commonly used libraries • MASS • ScaLAPACK and its components • BLAS, BLACS, PBLAS, LAPACK • PETSc • NAG • ESSL, PESSL • FFTW • P3DFFT • Lab assignment
Numerical libraries • Use numerical (a.k.a. math or scientific) libraries whenever possible • Minimize code development effort • Improved performance • Robust, accurate, up-to-date numerical algorithms • Collection (archive) of precompiled routines • Use nm -a to look inside if needed • Linking: mpxlf … /lib/loc/libnum.a OR mpxlf … -L/lib/loc -lnum • Reminder: order of linking matters • If two libraries contain the same routine name, the first one to appear on the linking command line gets used • -l options should come after the object files (.o) of your source • If library A calls routines from library B, -lA has to come before -lB (see the example below)
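For example, with two hypothetical archives libsolve.a and libblas.a installed under /usr/local/apps, where the solver library calls BLAS routines, the link line would be ordered like this:

mpxlf driver.o mysolver.o -L/usr/local/apps/solve -lsolve -L/usr/local/apps/blas -lblas

Both -l options come after the .o files, and -lsolve precedes -lblas because libsolve depends on libblas.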
Calling Fortran Libraries from C/C++ • Fortran uses column-major, C/C++ uses row-major array storage • In C array indices begin with 0, but in Fortran they begin with 1 • Pass arguments by reference • Pass characters as strings • In C++, prototype functions to be called as follows: • extern "Fortran" func(int &n,…); • Link with the -lxlf90_r flag added • More details: see http://www.sdsc.edu/us/newsletter/common/newsletter.php?issue=2005-10&corner=software or ask SDSC consulting
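As an illustration of these conventions, here is a minimal sketch of calling a Fortran routine from C (the routine dscale is hypothetical; the lowercase symbol name with no trailing underscore matches xlf defaults on AIX and may need adjusting on other platforms):

/* Fortran side, for reference:
      subroutine dscale(n, alpha, x)
      integer n
      double precision alpha, x(n)
*/
void dscale(int *n, double *alpha, double *x);   /* all arguments passed by reference */

int main(void) {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double alpha = 2.0;
    int n = 4;
    dscale(&n, &alpha, x);   /* scalars passed by address; the array decays to a pointer */
    return 0;
}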
Third-party libraries • Advantages: • Portable code • Extended set of features • Disadvantages: • Suboptimal performance • Sometimes costly, though many are free • Sometimes, need to install yourself • A good set is already installed on SDSC systems • Look in /usr/local/apps directory for 64-bit apps, in /usr/local/apps32 for 32-bit apps • Examples: ScaLapack, PETSc, FFTW, NAG
Vendor-supplied libraries (specific to a given architecture) • Advantages: • Highly optimized performance (for a given processor, memory system etc) • Installed, tested, supported • Usually free for the user • Disadvantages: • Sometimes code not portable to other platforms • Some features may not be implemented • Examples on IBM: ESSL, MASS • Examples on Intel: MKL
MASS – Mathematical Acceleration SubSystem • IBM product • Scalar Library Functions • Replace standard system intrinsics from libm • Call from Fortran or C • No source code modification required • Double precision routines: sin cos tan atan sqrt sinh cosh tanh atan2 rsqrt exp dnint log pow(C/C++) x**y(Fortran) • Provide better performance • Sometimes less accurate results (usually in the last bit) • Linking: -L/usr/local/apps/mass -lmass
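Since no source changes are needed, using scalar MASS is purely a link-time decision. A minimal sketch (the file names and compiler drivers are placeholders; the library path matches the SDSC location above):

/* sumexp.c -- ordinary libm-style code; nothing MASS-specific in the source */
#include <math.h>
double sum_exp(const double *x, int n) {
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += exp(x[i]);   /* resolved from libmass.a when -lmass precedes the default libm */
    return s;
}

Link with, e.g.: xlc_r -O3 main.o sumexp.c -L/usr/local/apps/mass -lmass -lm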
Scalar MASS Performance (cycles per call, length-1000 loop)
Columns for each processor (604e, pwr3, pwr4+, pwr5): libm cycles, MASS cycles, speedup.

Function Range | 604e: libm mass spdup | pwr3: libm mass spdup | pwr4+: libm mass spdup | pwr5: libm mass spdup
acos B | 91 91 1.0 | 70 69 1.0 | 108 116 0.9 | 113 115 1.0
acosh G | 318 182 1.7 | 256 159 1.6 | 483 253 1.9 | 527 248 2.1
asin B | 84 83 1.0 | 63 63 1.0 | 100 101 1.0 | 112 116 1.0
asinh D | 352 232 1.5 | 264 178 1.5 | 463 301 1.5 | 482 286 1.7
atan2 D | 601 103 5.8 | 441 79 5.6 | 754 100 7.5 | 847 95 8.9
atanh B | 244 143 1.7 | 186 116 1.6 | 300 169 1.8 | 312 165 1.9
cbrt D | 279 282 1.0 | 216 216 1.0 | 328 333 1.0 | 335 336 1.0
cos B | 53 25 2.1 | 38 16 2.4 | 53 21 2.5 | 65 29 2.2
cos D | 76 45 1.7 | 58 37 1.6 | 74 58 1.3 | 86 60 1.4
cosh D | 191 45 4.2 | 154 32 4.8 | 237 46 5.2 | 255 50 5.1
cosisin B | 111 58 1.9 | 82 34 2.4 | 101 46 2.2 | 123 50 2.5
cosisin D | 161 97 1.7 | 125 77 1.6 | 151 102 1.5 | 170 105 1.6
div D | 32 32 1.0 | 8.8 8.8 1.0 | 28 28 1.0 | 29 29 1.0
exp D | 120 35 3.4 | 106 27 3.9 | 162 35 4.6 | 168 40 4.2
expm1 D | 133 41 3.2 | 111 28 4.0 | 218 43 5.1 | 198 43 4.6
Scalar MASS Performance (cycles per call, length-1000 loop), cont.

Function Range | 604e: libm mass spdup | pwr3: libm mass spdup | pwr4+: libm mass spdup | pwr5: libm mass spdup
log C | 134 55 2.4 | 105 52 2.0 | 204 84 2.4 | 210 86 2.4
log10 C | 133 56 2.4 | 107 57 1.9 | 206 90 2.3 | 207 92 2.3
log1p H | 157 58 2.7 | 124 47 2.6 | 201 68 3.0 | 220 70 3.1
pow C | 309 113 2.7 | 235 90 2.6 | 453 146 3.1 | 463 150 3.1
qdrt C | 134 98 1.4 | 116 88 1.3 | 250 156 1.6 | 283 159 1.8
rcbrt D | 309 309 1.0 | 236 237 1.0 | 354 354 1.0 | 362 363 1.0
rec D | 32 32 1.0 | 9.2 8.8 1.0 | 29 29 1.0 | 29 29 1.0
rqdrt C | 149 100 1.5 | 126 95 1.3 | 243 155 1.6 | 276 153 1.8
rsqrt C | 86 52 1.7 | 70 52 1.3 | 120 73 1.6 | 156 74 2.1
sin B | 52 27 1.9 | 37 16 2.3 | 54 24 2.3 | 66 32 2.1
sin D | 78 45 1.7 | 61 37 1.6 | 77 57 1.4 | 82 60 1.4
sincos B | 108 56 1.9 | 80 34 2.4 | 101 48 2.1 | 121 54 2.2
sinh D | 250 50 5.0 | 197 35 5.6 | 345 52 6.6 | 329 55 6.0
sqrt C | 69 50 1.4 | 59 46 1.3 | 128 72 1.8 | 143 73 2.0
tan D | 137 72 1.9 | 112 43 2.6 | 194 62 3.1 | 183 64 2.9
tanh F | 256 64 4.0 | 199 47 4.2 | 357 65 5.5 | 333 67 5.0
Vector MASS library - MASSV • Provides optimized versions of intrinsic functions for vectorized operations • Example: replace for(i=0; i < N; i++) Z[i] = exp(x[i]); with a single call vexp(Z,x,N); • Takes advantage of pipelining: the evaluations Z(1)=exp(x(1)), Z(2)=exp(x(2)), Z(3)=exp(x(3)), … overlap in a software-pipelined loop instead of completing one at a time
Vector MASS library cont. • Some source code modification required • Note: accuracy may differ even from scalar MASS functions (usually in the last bit) • Library source code available for portability to other platforms • Linking: -lmassvp4 on power4 platforms, otherwise -lmassv • MASS URL: http://www-306.ibm.com/software/awdtools/mass/
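A minimal sketch of the source change from C (an assumption here: since the MASSV routines follow Fortran conventions, the vector length is passed by address when calling from C; check massv.h or the MASS documentation for the exact prototype):

void vexp(double z[], double x[], int *n);   /* MASSV vector exponential */

#define N 1000

int main(void) {
    static double x[N], z[N];
    int n = N, i;
    for (i = 0; i < N; i++) x[i] = 0.001 * i;

    /* scalar version:  for (i = 0; i < N; i++) z[i] = exp(x[i]);  */
    vexp(z, x, &n);    /* vectorized replacement: one call evaluates the whole array */
    return 0;
}

Compile and link with, e.g.: xlc_r vexp_test.c -L/usr/local/apps/mass -lmassvp4 (on POWER4, per the flags above).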
Vector MASS Performance -- double precision functions (cycles per evaluation, length-1000 loop)
Columns for each processor: libm cycles, MASSV cycles (massv / massvp3 / massvp4 / massvp5), speedup.

Function Range | 604e: libm massv spdup | pwr3: libm massvp3 spdup | pwr4+: libm massvp4 spdup | pwr5: libm massvp5 spdup
vacos B | 91 41 2.2 | 70 17 4.1 | 108 24 4.5 | 113 23 4.9
vacosh G | 318 39 8.2 | 256 15 17.1 | 483 21 23.0 | 527 19 27.7
vasin B | 84 41 2.0 | 63 17 3.7 | 100 24 4.2 | 112 23 4.9
vasinh D | 352 44 8.0 | 264 18 14.7 | 463 23 20.1 | 482 22 21.9
vatan2 D | 601 40 15.0 | 441 27 16.3 | 754 60 12.6 | 847 57 14.9
vatanh B | 244 41 6.0 | 186 16 11.6 | 300 21 14.3 | 312 19 16.4
vcbrt D | 279 20 14.0 | 216 8.8 24.5 | 328 11 29.8 | 335 12 27.9
vcos B | 53 9.3 5.7 | 38 4 9.5 | 53 6.5 8.2 | 65 6.5 10.0
vcos D | 76 27 2.8 | 58 12 4.8 | 74 20 3.7 | 86 20 4.3
vcosh D | 191 25 7.6 | 154 9.2 16.7 | 237 14 16.9 | 255 13 19.6
vcosisin B | 111 19 5.8 | 82 8 10.3 | 101 13 7.8 | 123 12 10.3
vcosisin D | 161 29 5.6 | 125 12 10.4 | 151 21 7.2 | 170 20 8.5
vdiv D | 32 11 2.9 | 8.8 6.8 1.3 | 28 13 2.2 | 29 13 2.2
vexp D | 120 16 7.5 | 106 6.4 16.6 | 162 14 11.6 | 168 13 12.9
vexpm1 D | 133 19 7.0 | 111 8 13.9 | 218 12 18.2 | 198 12 16.5
vlog C | 134 20 6.7 | 105 7.2 14.6 | 204 9.7 21.0 | 210 9.5 22.1
Vector MASS Performance - double precision functions (cycles per evaluation, length-1000 loop), cont.

Function Range | 604e: libm massv spdup | pwr3: libm massvp3 spdup | pwr4+: libm massvp4 spdup | pwr5: libm massvp5 spdup
vlog10 C | 133 19 7.0 | 107 7.6 14.1 | 206 10 20.6 | 207 9.9 20.9
vlog1p H | 157 26 6.0 | 124 11 11.3 | 201 13 15.5 | 220 12 18.3
vpow C | 309 52 5.9 | 235 20 11.8 | 453 29 15.6 | 463 30 15.4
vqdrt C | 134 15 8.9 | 116 6 19.3 | 250 8.2 30.5 | 283 8 35.4
vrcbrt D | 309 20 15.5 | 236 8.8 26.8 | 354 11 32.2 | 362 11 32.9
vrec D | 32 10 3.2 | 9.2 6.4 1.4 | 29 12 2.4 | 29 11 2.6
vrqdrt C | 149 14 10.6 | 126 6 21.0 | 243 7.8 31.2 | 276 7 39.4
vrsqrt C | 86 16 5.4 | 70 8.8 8.0 | 120 18 6.7 | 156 17 9.2
vsin B | 52 11 4.7 | 37 4.8 7.7 | 54 7.2 7.5 | 66 6.9 9.6
vsin D | 78 27 2.9 | 61 12 5.1 | 77 20 3.9 | 82 19 4.3
vsinh D | 250 27 9.3 | 197 10 19.7 | 345 15 23.0 | 329 13 25.3
vsqrt C | 69 16 4.3 | 59 8.8 6.7 | 128 17 7.5 | 143 17 8.4
vtan D | 137 31 4.4 | 112 13 8.6 | 194 19 10.2 | 183 17 10.8
vtanh F | 256 35 7.3 | 199 13 15.3 | 357 19 18.8 | 333 18 18.5
ScaLapack • Scalable Linear Algebra PACKage, http://www.netlib.org/scalapack • Developed by a team from the University of Tennessee, University of California Berkeley, ORNL, Rice U., UCLA, UIUC, etc. • Support in Commercial Packages • NAG Parallel Library • IBM PESSL • CRAY Scientific Library • VNI IMSL • Fujitsu, HP/Convex, Hitachi, NEC • Handles dense and banded matrices
BLAS (Basic Linear Algebra Subprograms), cont. • Use blocked (Level 3) operations, e.g. the DGEMM sketch below • Optimized for a given memory hierarchy – e.g. good use of cache • Other libraries build on top of BLAS • On any given system, try to use a system-optimized version
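As an illustration of a blocked Level 3 operation, here is a sketch of calling DGEMM from C using the Fortran conventions described earlier (the symbol name dgemm without a trailing underscore is an assumption matching AIX/xlf naming; with other Fortran compilers it is often dgemm_):

/* Computes C := alpha*A*B + beta*C with column-major (Fortran-order) arrays */
void dgemm(char *transa, char *transb, int *m, int *n, int *k,
           double *alpha, double *a, int *lda, double *b, int *ldb,
           double *beta, double *c, int *ldc);

#define M 4
#define K 3
#define N 5

int main(void) {
    static double a[M*K], b[K*N], c[M*N];   /* column-major storage */
    int m = M, n = N, k = K, lda = M, ldb = K, ldc = M, i;
    double alpha = 1.0, beta = 0.0;

    for (i = 0; i < M*K; i++) a[i] = 1.0;
    for (i = 0; i < K*N; i++) b[i] = 2.0;

    dgemm("N", "N", &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc);
    /* every entry of C is now 3 * (1.0 * 2.0) = 6.0 */
    return 0;
}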
LAPACK • Linear Algebra PACKage • System of linear equations • Eigenvalue problems • Built on top of BLAS • On SDSC systems, installed at /usr/local/apps/LAPACK
BLACS - Basic Linear Algebra Communication Subprograms • Built on top of MPI
BLACS basics
nprow=1; npcol=3; order='r'; zero=0;
blacs_get(zero, zero, icontxt);
blacs_gridinit(icontxt, order, nprow, npcol);
blacs_gridinfo(icontxt, nprow, npcol, myrow, mycol);
• icontxt plays a role like MPI_COMM_WORLD • blacs_gridinit specifies the process-grid topology (here row-major order, 'r') • blacs_gridinfo returns this process's coordinates in the grid • Here we define a 1 x 3 grid of processes
PBLAS • Parallel version of BLAS • Built on top of BLAS and BLACS
Block-cyclic partitioning (2D) • Example: a 9 x 9 global matrix distributed with a 2 x 2 block size, cyclically over a 2 x 3 process grid PE(0,0) … PE(1,2) (figure omitted)
Syntax for descinit descinit(idesc, m, n, mb, nb, i, j, icon, mla, ier)
OUT idesc Descriptor
IN m Row Size (Global)
IN n Column Size (Global)
IN mb Row Block Size
IN nb Column Block Size
IN i Starting Grid Row
IN j Starting Grid Column
IN icon BLACS context
IN mla Leading Dimension of Local Matrix
OUT ier Error number
The starting grid location is usually (i=0, j=0).
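For instance, a descriptor for the 9 x 9 matrix of the block-cyclic example above might be set up as follows (a sketch in C continuing the BLACS snippet, so icontxt, myrow and mycol are assumed to have been set by blacs_gridinfo; the un-underscored symbol names follow the AIX convention, and numroc is the standard ScaLAPACK tool routine that returns the number of locally owned rows or columns):

int numroc(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
void descinit(int *desc, int *m, int *n, int *mb, int *nb, int *irsrc,
              int *icsrc, int *ictxt, int *mla, int *info);

int desca[9];                        /* a dense-matrix descriptor has 9 entries     */
int m = 9, n = 9, mb = 2, nb = 2;    /* global sizes and block sizes                */
int irsrc = 0, icsrc = 0, info;      /* grid row/column that owns block (1,1)       */
int nprow = 2;                       /* number of process rows in the 2 x 3 grid    */
int mla = numroc(&m, &mb, &myrow, &irsrc, &nprow);   /* local rows on this process  */
if (mla < 1) mla = 1;                /* leading dimension must be at least 1        */
descinit(desca, &m, &n, &mb, &nb, &irsrc, &icsrc, &icontxt, &mla, &info);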
ScaLapack API • Prepend LAPACK equivalent names with P: routines are named PXYYZZZ, where X = data type, YY = matrix type, ZZZ = computation performed
ScaLapack API, cont. • Matrix Types (YY) PXYYZZZ • DB General Band (diagonally dominant-like) • DT general Tridiagonal (Diagonally dominant-like) • GB General Band • GE GEneral matrices (e.g., unsymmetric, rectangular, etc.) • GG General matrices, Generalized problem • HE Complex Hermitian • OR Orthogonal Real • PB Positive definite Banded (symmetric or Hermitian) • PO Positive definite (symmetric or Hermitian) • PT Positive definite Tridiagonal (symmetric or Hermitian) • ST Symmetric Tridiagonal Real • SY SYmmetric • TR TRiangular (possibly quasi-triangular) • TZ TrapeZoidal • UN UNitary complex
ScaLapack API, cont. Drivers (ZZZ) PXYYZZZ • SL: Linear Equations (SVX*) • SV: LU Solver • VD: Singular Value • EV: Eigenvalue (EVX**) • GVX: Generalized Eigenvalue *Expert (X) versions also estimate condition numbers, provide error bounds, and equilibrate the system **Selected eigenvalues
ScaLapack function call example Solving Ax=B for a general square matrix A (N x N), B (N x NRHS) CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO ) • Global indices: IA = JA = IB = JB = 1 • IPIV: integer array of length N; on output it holds the pivot indices defining the permutation (row i of the matrix was interchanged with row IPIV(i))
ScaLapack performance tips • Use native BLAS • Choose a suitable number of processors, so as not to make the local matrix size too large (1000x1000 is about right) • Try to use square processor grid (Nrow=Ncol) • Block size = 64
PETSc • Portable Extensible Toolkit for Scientific Computation • Intended to facilitate the scalable (parallel) solution of linear and non-linear partial differential equations • Focus on problems discretized using semi or fully implicit methods • Developed by Argonne National Laboratory, MCS
PETSc • Fully usable from Fortran, C, and C++ • Uses MPI for all parallel communications • Object-oriented • Designed for both efficiency and ease of use • Available freely from http://www.mcs.anl.gov/petsc • Installed on Datastar at /usr/local/apps/petsc-2.3.1
PETSc features • Parallel Vectors and discrete functions • Distributed arrays (for regular grid problems) • Parallel vector scatter/gathers (for unstructured grid problems) • Parallel (sparse) matrices • Parallel Krylov subspace methods • Parallel preconditioners • Parallel (Newton-based) nonlinear solvers • Parallel time-stepping code
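A minimal sketch of PETSc's object-based style using a parallel vector (this follows the current PETSc 3.x calling sequence; a few signatures differ slightly in the 2.3.1 version installed on Datastar, e.g. VecDestroy there takes the Vec rather than its address):

#include <petscvec.h>

int main(int argc, char **argv)
{
    Vec       x;
    PetscInt  n = 100;
    PetscReal nrm;

    PetscInitialize(&argc, &argv, NULL, NULL);   /* also initializes MPI if needed */

    VecCreate(PETSC_COMM_WORLD, &x);             /* parallel vector object          */
    VecSetSizes(x, PETSC_DECIDE, n);             /* let PETSc choose local sizes    */
    VecSetFromOptions(x);
    VecSet(x, 1.0);                              /* set every entry to 1            */
    VecNorm(x, NORM_2, &nrm);
    PetscPrintf(PETSC_COMM_WORLD, "||x|| = %g\n", (double)nrm);

    VecDestroy(&x);                              /* VecDestroy(x) in PETSc 2.3.x    */
    PetscFinalize();
    return 0;
}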
PETSc Numerical Components • Nonlinear Solvers: Newton-based methods (line search, trust region), other • Time Steppers: Euler, Backward Euler, pseudo time-stepping, other • Krylov Subspace Methods: GMRES, CG, CGS, Bi-CG-stab, TFQMR, Richardson, Chebyshev, other • Preconditioners: Additive Schwarz, Block Jacobi, ILU, ICC, LU (sequential only), other • Matrices: Compressed Sparse Row (AIJ), Blocked Compressed Sparse Row (BAIJ), Block Diagonal, Dense, other • Index Sets: indices, block indices, stride, other • Vectors
MPI features used by PETSc • Communicators and attributes to manage messages inside PETSc objects • Nonblocking sends and receives to overlap communication with computation • Global reduction operations • Derived data types
NAG – Numerical Algorithms Group ® • Proprietary package, currently installed on Datastar in /usr/local/apps/nag (64-bit only) • Differential Equations • Linear Equations • Interpolation, Curve and Surface Fitting • Linear Algebra • Eigensystem Analysis • Statistical Analysis • Random Number Generator • Fourier Transforms, Summation of Series • Time series analysis • Mesh generation
ESSL - Engineering and Scientific Subroutine Library • IBM product – optimized for IBM platforms • Over 400 high performance mathematical functions • C/C++ and Fortran, 32 and 64 bit, thread-safe • API is Fortran-based. From C/C++ must use Fortran calling conventions, e.g. column-major array • Linking: -lessl • ESSL User Guide: http://publib.boulder.ibm.com/epubs/pdf/am501403.pdf
ESSL features • Numerical functions categories • Linear Algebra - vector-scalar, sparse vector-scalar, matrix-vector, sparse matrix-vector • Matrix Operations - addition, subtraction, multiplication, rank-k updates, transpose • Linear Algebra Equation solvers - dense, banded, sparse, linear least-squares • Eigensystem Analysis - standard, generalized • Signal Processing - Fourier transform, convolutions, correlations • Sorting & Searching • Interpolation - polynomial, cubic spline • Numerical Quadrature • Random Number Generation • Utilities
ESSL example – 1D real-to-complex Fast Fourier Transform (FFT)
#include <essl.h>
#include <stdlib.h>
#define M 64           /* number of transforms */
#define N 128          /* length of each transform */

int main() {
  static double R[M][N];
  static dcmplx C[M][N/2+1];           /* double-precision complex output */
  double *work1, *work2;
  int nwork1 = 25000, nwork2 = 20000;  /* placeholder aux sizes; compute from the naux formulas in the ESSL manual */

  /* ... fill R with input data ... */
  work1 = malloc(nwork1*sizeof(double));
  work2 = malloc(nwork2*sizeof(double));
  /* init=1: initialization call (fills work1); init=0: perform the transform */
  drcft(1, R, N, C, N/2+1, N, M, 1, 1.0, work1, nwork1, work2, nwork2);
  drcft(0, R, N, C, N/2+1, N, M, 1, 1.0, work1, nwork1, work2, nwork2);
  /* ... use C ... */
  return 0;
}
PESSL • Parallel counterpart of ESSL (ESSL: serial; PESSL: parallel) • Invokes ESSL for on-node computation • Uses MPI for communication • PESSL Categories • Subset of Level 2, 3 PBLAS • Subset of ScaLAPACK (dense, banded) • Sparse Routines • Subset of ScaLAPACK Eigensystem Routines • Fourier Transforms • Uniform Random Number Generation • BLACS • Link with -lpessl
FFTW – Fastest Fourier Transform in the West • Third party software, free – http://www.fftw.org • Latest version installed on SDSC systems in /usr/local/apps/fftw312d and /usr/local/apps/fftw312s • 1, 2, and 3 dimensional FFTs, and related functions (sin/cos transforms, convolution etc.) • Uses "plans" that capture the transform parameters, e.g. fftw_plan plan1 = fftw_plan_dft_r2c_2d(M, N, A, B, FFTW_ESTIMATE); fftw_execute(plan1); • Work arrays allocated and initialized behind the scenes • FFTW_ESTIMATE, FFTW_MEASURE, FFTW_PATIENT planner flags trade planning time for transform speed • +'s: Portable, free • -'s: Performance about 3 times worse than ESSL's FFTs!
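A minimal self-contained sketch of the plan-based workflow for a 2D real-to-complex transform (the array sizes are arbitrary placeholders):

#include <fftw3.h>

#define M 64
#define N 128

int main(void) {
    double *in = fftw_malloc(sizeof(double) * M * N);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * M * (N/2 + 1));
    fftw_plan p;
    int i;

    for (i = 0; i < M * N; i++) in[i] = (double)i;   /* fill with some data */

    /* Plan once; FFTW_MEASURE would search for a faster algorithm at plan time */
    p = fftw_plan_dft_r2c_2d(M, N, in, out, FFTW_ESTIMATE);
    fftw_execute(p);                                  /* transform in -> out  */

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}

Link with -lfftw3 (double precision) from the install location above.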
3D FFT in parallel • FFTW version 3 does not have an MPI implementation • PESSL and other libraries use 1D processor decomposition (slabs/planes) • Limitation: Np <= N • Not enough for some state-of-the-art simulations, for example turbulence in a periodic domain of 4096^3 grid points • 2D decomposition is crucial for scaling on petascale platforms
2D decomposition (figure: the 3D domain split into pencils along the z, x, and y axes at successive stages)
2D decomposition • 2D decomposition: processors form a 2D grid P1 x P2 • Columns (pencils) instead of slabs • Local geometry: • Stage 1: N x K x M • Stage 2: K x N x M • Stage 3: K x M x N • K = N/P1, M = N/P2 • At each stage, transform along the local dimension (N) • Software scaling limit removed! Now the limit is P <= N^2
Performance of DNS (2D decomposition) on Blue Gene at SDSC and at IBM's T.J. Watson Lab, and on SDSC's Datastar. VN: two processors per node; CO: one processor per node
P3DFFT • Open source library for efficient, highly scalable 3D FFT on parallel platforms • Built on top of an optimized 1D FFT library • Currently ESSL or FFTW • In the future, more libraries • Uses 2D decomposition (1D decomposition is included as a special case) • Developed at SDSC • Outcome of a Strategic Applications Collaborations project (DNS turbulence code) • Available at http://www.sdsc.edu/us/resources/p3dfft.php
P3DFFT: features • Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms • Written in Fortran 90 with MPI • Implemented as an F90 module • Single or double precision • Arbitrary dimensions • Handles many uneven cases (Ni does not have to be divisible by Pj) • Assumes Ni >= Pj • Can do either in-place or out-of-place transforms • In the out-of-place case the input and output arrays should not overlap • Memory requirements: input and output arrays + 1 buffer