Using Numerical Libraries on Blue Horizon - ESSL & PESSL (http://www.npaci.edu/BlueHorizon) Tim Kaiser and Donald Frederick frederik@sdsc.edu San Diego Supercomputer Center
Using Numerical Libraries: Overview • Why use ESSL/PESSL ? • How to use ESSL/PESSL • Setup • Linking libraries
Using Numerical Libraries: Overview
• Why?
  • Fast - written for the POWER3 CPU and the SP system architecture
  • Robust, accurate - up-to-date numerical methods
  • Commonly useful algorithms - linear algebra, FFT, numerical integration, etc.
  • Contains IBM versions of HPC “standard” libraries - LAPACK, BLAS, PBLAS, BLACS, etc.
• How to
• Setup
• Linking libraries
Presenter note (Laura C. Nett): Describe the breakdown of the machine and how to request processors (CPUs). The number of nodes and tasks is requested through the POE or LoadLeveler (LL) environment, whereas the number of threads is set via an environment variable. With fast communications, current hardware is limited to 4 MPI tasks per node, so 4 PEs will sit idle; you are not charged for them. The new Colony switch will change that. When using threads, set the number of threads to spawn per MPI task.
Using Numerical Libraries - ESSL
• API - Fortran based. From C and C++, calls must use Fortran calling conventions
• 3 run-time libraries (each supporting 32- and 64-bit)
  • ESSL SMP Library - thread-safe version of the ESSL subroutines for use on RS/6000 SMP; a subset of the subroutines is multi-threaded (libesslsmp.a)
  • ESSL Thread-Safe Library - for users who wish to develop their own multi-threaded programs (libessl_r.a)
  • ESSL POWER Library - tuned for the POWER3 architecture (libesslp2.a, libesslp2_r.a)
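The "Fortran calling conventions" bullet can be illustrated with a small sketch. The routine below is a hypothetical stand-in, not an ESSL subroutine; it only mimics the shape of a Fortran-style interface as seen from C++: scalars are passed by reference, and the matrix is a flat array in column-major order with a leading dimension lda. The names colmajor_matvec and demo_matvec are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in with a Fortran-style interface (NOT an ESSL routine).
// Computes y = A*x for an m-by-n matrix A stored column-major with leading
// dimension lda. Scalars are passed by reference, as a Fortran routine expects.
void colmajor_matvec(const int &m, const int &n, const double *a,
                     const int &lda, const double *x, double *y) {
    assert(m >= 0 && n >= 0 && lda >= m);
    for (int i = 0; i < m; ++i) y[i] = 0.0;
    for (int j = 0; j < n; ++j)        // columns outermost: unit-stride access
        for (int i = 0; i < m; ++i)
            y[i] += a[i + j * lda] * x[j];
}

// Builds the 2 x 3 matrix [[1,2,3],[4,5,6]] column by column and
// multiplies it by the vector (1,1,1).
std::vector<double> demo_matvec() {
    const int m = 2, n = 3, lda = 2;
    const std::vector<double> a = {1, 4, 2, 5, 3, 6};  // columns are contiguous
    const std::vector<double> x = {1, 1, 1};
    std::vector<double> y(m);
    colmajor_matvec(m, n, a.data(), lda, x.data(), y.data());
    return y;
}
```

The point of the sketch is the data layout: a C++ caller has to hand over matrices column by column, which is the transpose of the usual C row-major layout.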
Using Numerical Libraries - ESSL • ESSL Numerical Mathematics Categories • Linear Algebra - vector-scalar, sparse vector-scalar, matrix-vector, sparse matrix-vector • Matrix Operations - addition, subtraction, multiplication, rank-k updates, transpose • Linear Algebra Equation solvers - dense, banded, sparse, linear least-squares • Eigensystem Analysis - standard, generalized • Signal Processing - Fourier transform, convolutions, correlations • Sorting & Searching • Interpolation - polynomial, cubic spline • Numerical Quadrature • Random Number Generation • Utilities
Using Numerical Libraries - ESSL • Multi-Threaded ESSL SMP Categories • Vector-scalar Linear Algebra • Matrix Vector Linear Algebra • Matrix Operations • Dense Linear Algebra • Sparse Linear Algebra • Fourier Transforms • Convolutions and Correlations
Presenter note (Laura C. Nett): Interactive jobs run much as on the SP2, except that you need to define the number of CPUs by specifying the number of nodes and tasks. If you want threads to be spawned, define this before running poe. The interactive CPUs are set aside for use during the day and, unlike on the SP2, they are not shared. There are various environment variables that you may want to set; they are defined in all lower case. The tasks per node is a number between 1 and 8 (the number of CPUs on a node), but for straight MPI code using fast communications the maximum is 4.
Using Numerical Libraries - PESSL
• P[arallel]ESSL - distributed-memory (message-passing) parallel numerical library
• Supports the SPMD parallel programming model
• Uses ESSL subroutines
• Uses PE MPI or threaded library for communication
• API - Fortran; from C and C++, use Fortran calling conventions
Using Numerical Libraries - PESSL
• Invokes ESSL
• PESSL Categories
  • Subset of Level 2, 3 PBLAS
  • Subset of ScaLAPACK (dense, banded)
  • Sparse Routines
  • Subset of ScaLAPACK Eigensystem Routines
  • Fourier Transforms
  • Uniform Random Number Generation
  • BLACS
Using Numerical Libraries - MASS
MASS (Mathematical Acceleration Subsystem) - a set of functions that replace several of the mathematical functions in libm.a (sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y) with faster versions.
• MASS scalar routines are not thread-safe.
• MASS has thread-safe vector intrinsics.
Reference - IBM documentation and the SDSC “SCAN” article “Vector intrinsic functions II: IBM SP”, www.npaci.edu/online/v3.13/SCAN1.html
Link with:
• xlf90 -lmassvp3 -L/usr/local/apps/mass/lib your_source.f
• xlc -lmassvp3 -L/usr/local/apps/mass/lib -lm your_source.c
Using Numerical Libraries - ESSL Example
Examples in: /work/training/examples/pessl
• Copy the examples to your home directory: cp /work/training/examples/pessl/* .
• Run make
• Edit the LoadLeveler script “ll.script”
Using Numerical Libraries - PESSL LA Solver Fundamentals
• Vectors and matrices must be distributed across your processes before you call the Parallel ESSL subroutines.
• You must define your process grid and distribute your data according to the distribution technique required by the Parallel ESSL subroutine you are using.
• A parallel machine with k processes is often thought of as a one-dimensional linear array of processes labeled 0, 1, ..., k-1. For performance reasons, it is sometimes useful to map this one-dimensional array onto a logical two-dimensional rectangular grid, which is referred to as a process grid.
• Most PESSL subroutines support block-cyclic distribution. The Banded Linear Algebraic Equations and the Fourier transform subroutines support only block distribution.
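As a rough illustration of block-cyclic distribution, the sketch below maps a zero-based global index to its owning process and local index in one dimension. It assumes the first block lives on process 0; the names Location and block_cyclic are hypothetical and are not part of PESSL. For a matrix on a two-dimensional process grid, the same map would be applied independently to the row index and the column index.

```cpp
#include <cassert>

// Owner process and position within that process's local storage.
struct Location {
    int proc;
    int local;
};

// One-dimensional block-cyclic map (zero-based, first block on process 0).
// Blocks of nb consecutive elements are dealt out to the processes in
// round-robin order; each process stores its blocks back to back.
Location block_cyclic(int g, int nb, int nprocs) {
    assert(g >= 0 && nb > 0 && nprocs > 0);
    const int blk = g / nb;                       // global block number
    const int proc = blk % nprocs;                // round-robin owner
    const int local = (blk / nprocs) * nb + g % nb;
    return {proc, local};
}
```

With nb = 2 and 3 processes, global elements 0-1 go to process 0, 2-3 to process 1, 4-5 to process 2, and 6-7 wrap around to process 0 again. Plain block distribution is the special case in which nb is large enough that each process receives a single block.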
Using Numerical Libraries - PESSL LA Solver Fundamentals: Layout
• Similar to MPI, but you specify the processor topology
• icontxt is like MPI_COMM_WORLD
• blacs_gridinit specifies the topology
• blacs_gridinfo gets processor information
• Here we define a 3-element (1 x 3) array of processors:
    nprow=1; npcol=3; order='r'; zero=0;
    blacs_get(zero,zero,icontxt);
    blacs_gridinit(icontxt,&order,nprow,npcol);
    blacs_gridinfo(icontxt,nprow,npcol,myrow,mycol);
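To make the topology concrete, here is a minimal sketch of the coordinates each process would be expected to see after the grid is set up. It does not call BLACS; grid_coords and GridCoord are hypothetical helpers. The sketch assumes that processes are numbered 0 to nprow*npcol-1, that an order of 'C' means column-major placement, and that any other value is treated as row-major.

```cpp
#include <cassert>

// Position of a process in the logical nprow x npcol grid.
struct GridCoord {
    int row;
    int col;
};

// Maps process number p onto the grid (assumed ordering, see lead-in):
// row-major fills each grid row left to right before moving down,
// column-major fills each grid column top to bottom before moving right.
GridCoord grid_coords(int p, int nprow, int npcol, char order) {
    assert(nprow > 0 && npcol > 0);
    assert(p >= 0 && p < nprow * npcol);
    if (order == 'C' || order == 'c') {
        return {p % nprow, p / nprow};   // column-major placement
    }
    return {p / npcol, p % npcol};       // row-major placement
}
```

For the 1 x 3 grid on the slide, every process sits in row 0 and its column equals its process number, so the two orderings coincide. They differ only when both grid dimensions exceed 1.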
Using Numerical Libraries - PESSL LA Solver Example Examples in /work/training/examples/pessl Sparse matrix solver (F90) • make pde90
Using Numerical Libraries - PESSL - 3D FFT in Fortran
• Our array is 4 x 2 x 6
• We use 3 processors, so we dimension 4 x 2 x 2 on each processor
• We use a Fortran 90 “case” statement to set up the data
• Look at pessl.f
      COMPLEX*16 X(0:3,0:1,0:1)
      ORDER = 'R'
      NPROW = 1
      NPCOL = 3
      CALL BLACS_GET(0, 0, ICONTXT)
      CALL BLACS_GRIDINIT(ICONTXT, ORDER, NPROW, NPCOL)
      CALL BLACS_GRIDINFO(ICONTXT, NPROW, NPCOL, MYROW, MYCOL)
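The 4 x 2 x 2 local dimension follows from splitting the third dimension of the 4 x 2 x 6 array into equal blocks across the 3 processes. The sketch below builds one process's local slab in Fortran (column-major) element order. It is an illustration only: local_slab is a hypothetical helper, and the fill value (the element's global linear index) is an assumption chosen to make the mapping visible, not the data used in pessl.f.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Sizes from the slide: a 4 x 2 x 6 global array on 3 processes.
const int N1 = 4, N2 = 2, N3 = 6, NPROCS = 3;
const int N3_LOCAL = N3 / NPROCS;  // 2 planes of the third dimension each

// Returns the 4 x 2 x 2 slab owned by process p, stored column-major
// (first index fastest). Each element holds its global linear index.
std::vector<std::complex<double>> local_slab(int p) {
    assert(p >= 0 && p < NPROCS);
    std::vector<std::complex<double>> x(N1 * N2 * N3_LOCAL);
    for (int kl = 0; kl < N3_LOCAL; ++kl) {
        const int k = p * N3_LOCAL + kl;  // global index along dimension 3
        for (int j = 0; j < N2; ++j) {
            for (int i = 0; i < N1; ++i) {
                const int global = i + N1 * (j + N2 * k);
                x[i + N1 * (j + N2 * kl)] =
                    std::complex<double>(static_cast<double>(global), 0.0);
            }
        }
    }
    return x;
}
```

Process 0 ends up with global planes 0-1 of the third dimension, process 1 with planes 2-3, and process 2 with planes 4-5.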
Using Numerical Libraries - PESSL - 3D FFT in C++
• Define the arrays in reverse order (Fortran-style)
• Provide interfaces for the BLACS routines
• Create a contiguous block of data for input
  • Size is i*j*k*2 (two values, real and imaginary, per complex element)
  • Cannot use the usual matrix allocation routines
• Look at c_ex03.C
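One way to obtain such a contiguous block is sketched below. This is not the code from c_ex03.C; ComplexBlock3D is a hypothetical helper that assumes the real and imaginary parts are interleaved as consecutive doubles and that the first index varies fastest, as in Fortran. Allocating an array of row pointers instead would scatter the rows across memory, which is presumably why the usual matrix allocation routines cannot be used here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A single contiguous buffer of n1*n2*n3*2 doubles holding a 3D complex
// array in Fortran order: index i varies fastest, and each element is
// stored as (real, imaginary) in adjacent doubles.
class ComplexBlock3D {
public:
    ComplexBlock3D(int n1, int n2, int n3)
        : n1_(n1), n2_(n2),
          data_(static_cast<std::size_t>(n1) * n2 * n3 * 2, 0.0) {
        assert(n1 > 0 && n2 > 0 && n3 > 0);
    }

    double &re(int i, int j, int k) { return data_[offset(i, j, k)]; }
    double &im(int i, int j, int k) { return data_[offset(i, j, k) + 1]; }

    // Pointer to the start of the block, suitable for passing to a
    // routine that expects one flat array.
    double *raw() { return data_.data(); }
    std::size_t size() const { return data_.size(); }

private:
    // Offset (in doubles) of the real part of element (i, j, k).
    std::size_t offset(int i, int j, int k) const {
        return 2 * (static_cast<std::size_t>(i) +
                    n1_ * (j + static_cast<std::size_t>(n2_) * k));
    }

    int n1_, n2_;
    std::vector<double> data_;
};
```

For the 4 x 2 x 2 local array of the Fortran example, the buffer would hold 32 doubles.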
Using Numerical Libraries - PESSL Compiling
• C++
  • Need to declare interfaces:
    extern "FORTRAN" void blacs_gridinfo(const int &, const int &, const int &, const int &, const int &);
    extern "FORTRAN" void blacs_gridinit(const int &, char *, const int &, const int &);
    extern "FORTRAN" void blacs_get(const int &, const int &, const int &);
  • mpCC -lm -lblas -lessl -lblacs -lpessl -lcomplex c_ex02.C -o c_ex02.exe
• Fortran
  • mpxlf90 -O3 -bmaxdata:0x80000000 -lm -lblas -lessl -lblacs -lpessl pessl2.f -o pessl2.exe
• Fortran SMP nodes
  • mpxlf90_r -O3 -bmaxdata:0x80000000 -lm -lblas -lesslsmp -lblacssmp -lpesslsmp pessl2.f -o pessl2.smp
Using Numerical Libraries - PESSL - FFT run times
• Run times for a 3D complex FFT on Blue Horizon using the PESSL routine pdcft3.
• The data was complex, with 8 bytes each for the real and imaginary parts.
• Times are for 4 processors per node.
• The data was left in transposed form after the FFT.
• Times are given twice: sorted by constant size and by constant number of processors.
Using Numerical Libraries
• References
  • “Engineering and Scientific Subroutine Library (ESSL) for AIX Guide and Reference”, www.npaci.edu/BlueHorizon/Docs/essl_guide_ref_v3r12.pdf
  • “Parallel Engineering and Scientific Subroutine Library (PESSL) for AIX Guide and Reference”, www.npaci.edu/BlueHorizon/Docs/pessl_guide_ref_v2r12.pdf
  • IBM RS/6000 SP References, www.rs6000.ibm.com/resource/aix_resource/sp_books