Libraries and Program Performance
NERSC User Services Group

An Embarrassment of Riches
• Serial Libraries
• Threaded Libraries
• Parallel Libraries (Distributed & Threaded)
Elementary Math Functions
• Three libraries provide elementary math functions:
  • C/Fortran intrinsics
  • MASS/MASSV (Mathematical Acceleration Subsystem)
  • ESSL/PESSL (Engineering and Scientific Subroutine Library)
• Language intrinsics are the most convenient, but not the best performers
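To make the difference concrete, here is a minimal sketch (not from the original slides) contrasting an intrinsic loop with the corresponding vector MASS call; the vexp(y, x, n) interface reflects my reading of the MASSV documentation and should be checked against the MASS manual.

  ! Sketch: intrinsic exp versus the vector MASS routine vexp.
  ! Link with -lmassv; the vexp interface here is an assumption to verify.
  program mass_sketch
    implicit none
    integer, parameter :: n = 1000000
    real(8) :: x(n), y_intrinsic(n), y_massv(n)
    integer :: i

    do i = 1, n
       x(i) = 1.0d-6 * real(i, 8)
    end do

    ! Intrinsic: one exp evaluation per element, generated by the compiler
    do i = 1, n
       y_intrinsic(i) = exp(x(i))
    end do

    ! MASSV: a single call evaluates exp over the whole vector
    call vexp(y_massv, x, n)

    ! Largest difference between the two results (expected to be tiny)
    write(*,*) maxval(abs(y_massv - y_intrinsic))
  end program mass_sketch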
Elementary Functions in Libraries
• MASS
  • sqrt rsqrt exp log sin cos tan atan atan2 sinh cosh tanh dnint x**y
• MASSV
  • cos dint exp log sin tan div rsqrt sqrt atan
See http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
Other Intrinsics in Libraries
• ESSL
  • Linear Algebra Subprograms
  • Matrix Operations
  • Linear Algebraic Equations
  • Eigensystem Analysis
  • Fourier Transforms, Convolutions, Correlations, and Related Computations
  • Sorting and Searching
  • Interpolation
  • Numerical Quadrature
  • Random Number Generation
Comparing Elementary Functions
• Loop schema for elementary functions:

     99 write(6,98)
     98 format( " sqrt: " )
        x = pi/4.0
        ! HPMToolkit start/stop calls bracket the timed region
        call f_hpmstart(1,"sqrt")
        do 100 i = 1, loopceil
           y = sqrt(x)
           x = y * y
    100 continue
        call f_hpmstop(1)
        write(6,101) x
    101 format( " x = ", g21.14 )
Comparing Elementary Functions
• Execution schema for elementary functions:

    setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
    module load hpmtoolkit
    module load mass
    module list
    setenv L1 "-Wl,-v,-bC:massmap"
    xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest
    timex masstest < input > mathout
Results Examined
• Answers after 50e6 iterations
• User execution time
• Number of floating-point and FMA instructions
• Operation rate in Mflip/s
Results Observed
• No difference in answers
• Best times/rates at -O3 or -O4
• ESSL no different from intrinsics
• MASS much faster than intrinsics
Comparing Higher-Level Functions
• Several sources of a matrix-multiply function:
  • User-coded scalar computation
  • Fortran intrinsic matmul
  • Single-processor ESSL dgemm
  • Multi-threaded SMP ESSL dgemm
  • Single-processor IMSL dmrrrr (32-bit)
  • Single-processor NAG f01ckf
  • Multi-threaded SMP NAG f01ckf
Sample Problem
• Multiply dense matrices:
  • A(1:n,1:n) = i + j
  • B(1:n,1:n) = j - i
  • C(1:n,1:n) = A * B
• Output C to verify the result
Kernel of user matrix multiply

      do i = 1, n
         do j = 1, n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         enddo
      enddo

      call f_hpmstart(1,"Matrix multiply")
      ! j-k-i ordering gives stride-1 access to a and c (column-major)
      do j = 1, n
         do k = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      call f_hpmstop(1)
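For comparison with the hand-coded kernel above, here is a minimal sketch of the same product through the standard BLAS dgemm interface that ESSL provides; the problem size and setup are illustrative, not the benchmark configuration.

  ! Sketch: C = A*B via dgemm; link with -lessl (or -lesslsmp for threads).
  program dgemm_sketch
    implicit none
    integer, parameter :: n = 1000
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)
    real(8) :: alpha, beta
    integer :: i, j

    allocate(a(n,n), b(n,n), c(n,n))
    do j = 1, n
       do i = 1, n
          a(i,j) = real(i+j, 8)
          b(i,j) = real(j-i, 8)
       end do
    end do

    alpha = 1.0d0
    beta  = 0.0d0
    ! C := alpha*A*B + beta*C ; 'N','N' = no transposes; the matrices are
    ! square, so n serves as row count, column count, and leading dimension.
    call dgemm('N', 'N', n, n, n, alpha, a, n, b, n, beta, c, n)

    write(*,*) 'c(1,1), c(n,n) =', c(1,1), c(n,n)
  end program dgemm_sketch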
Comparison of Matrix Multiply (N1 = 5,000)

  Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
  User Scalar   1,490                 168    106    (slowest)
  Intrinsic     1,477                 169    106    (slowest)
  ESSL            195               1,280     13.9
  IMSL            194               1,290     13.8
  NAG             195               1,280     13.9
  ESSL-SMP         14              17,800      1.0  (fastest)
  NAG-SMP          14              17,800      1.0  (fastest)
Observations on Matrix Multiply
• The fastest times came from the two SMP libraries, ESSL-SMP and NAG-SMP, each reaching about 74% of peak node performance
• The single-processor library functions all took about 14 times more wall-clock time than the SMP versions, each reaching roughly 85% of single-processor peak
• The worst times came from the user code and the Fortran intrinsic, which took over 100 times more wall-clock time than the SMP libraries
Comparison of Matrix Multiply (N2 = 10,000)

  Version    Wall Clock (sec)   Mflip/s   Scaled Time
  ESSL-SMP   101                19,800    1.01
  NAG-SMP    100                19,900    1.00

• Scaling with problem size (complexity increase ~8x):

  Version    Wall Clock (N2/N1)   Mflip/s (N2/N1)
  ESSL-SMP   7.2                  1.10
  NAG-SMP    7.1                  1.12

Both ESSL-SMP and NAG-SMP showed roughly 10% performance gains at the larger problem size.
Observations on Scaling
• Scaling of problem size was done only for the SMP libraries, to keep run times reasonable
• Doubling N increases the computational complexity of dense matrix multiplication by a factor of 8
• Performance actually increased for both routines at the larger problem size
ESSL-SMP Performance vs. Number of Threads
• All for N = 10,000
• Number of threads controlled by the environment variable OMP_NUM_THREADS
Parallelism: Choices Based on Problem Size
Doing a month's work in a few minutes!
• Three good choices:
  • ESSL / LAPACK
  • ESSL-SMP
  • ScaLAPACK
• Only beyond a certain problem size is there any opportunity for parallelism
[Chart: matrix-matrix multiply performance vs. problem size]
Larger Functions: FFTs
• ESSL, FFTW, NAG, IMSL
  See http://www.nersc.gov/nusers/resources/software/libs/math/fft/
• We looked at ESSL, NAG, and IMSL
• One-D, forward and reverse
One-D FFTs
• NAG
  • c06eaf - forward
  • c06ebf - inverse, conjugate needed
  • c06faf - forward, work-space needed
  • c06ebf - inverse, work-space & conjugate needed
• IMSL
  • z_fast_dft - forward & reverse, separate arrays
• ESSL
  • drcft - forward & reverse, work-space & initialization step needed (see the sketch below)
• All have size constraints on their data sets
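A rough sketch of ESSL's init-then-transform calling pattern for drcft, assuming the argument order (init, x, inc2x, y, inc2y, n, m, isign, scale, aux1, naux1, aux2, naux2); the work-space sizes below are placeholders, and the exact minimums come from formulas in the ESSL documentation.

  ! Sketch only: verify the drcft argument list and aux sizes against the
  ! ESSL manual before use.  Link with -lessl.
  program drcft_sketch
    implicit none
    integer, parameter :: n = 4096                       ! transform length
    integer, parameter :: naux1 = 30000, naux2 = 30000   ! placeholder sizes
    real(8)    :: x(n), aux1(naux1), aux2(naux2), scale
    complex(8) :: y(n/2 + 1)
    integer    :: i

    do i = 1, n
       x(i) = sin(2.0d0 * 3.141592653589793d0 * 8.0d0 * (i - 1) / n)
    end do
    scale = 1.0d0

    ! init = 1: build the transform tables in aux1; no transform is performed
    call drcft(1, x, n, y, n/2+1, n, 1, 1, scale, aux1, naux1, aux2, naux2)
    ! init = 0: perform the forward real-to-complex transform
    call drcft(0, x, n, y, n/2+1, n, 1, 1, scale, aux1, naux1, aux2, naux2)

    write(*,*) 'y(9) =', y(9)
  end program drcft_sketch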
One-D FFT Measurement
• 2^24 real*8 data points input (a synthetic signal)
• Each transform ran in a 20-iteration loop
• All performed both forward and inverse transforms on the same data
• Input and inverse outputs were identical
• Measured with HPMToolkit

      second1 = rtc()
      call f_hpmstart(33,"nag2 forward")
      do loopind = 1, loopceil
         w(1:n) = x(1:n)
         call c06faf( w, n, work, ifail )
      end do
      call f_hpmstop(33)
      second2 = rtc()
One-D FFT Performance

  NAG
    c06eaf  fwd   25.182 sec.    54.006 Mflip/s
    c06ebf  inv   24.465 sec.    40.666 Mflip/s
    c06faf  fwd   29.451 sec.    46.531 Mflip/s
    c06ebf  inv   24.469 sec.    40.663 Mflip/s
    (required a data copy for each iteration of each transform)
  IMSL
    z_fast_dft  fwd   71.479 sec.    46.027 Mflip/s
    z_fast_dft  inv   71.152 sec.    48.096 Mflip/s
  ESSL
    drcft  init    0.032 sec.    62.315 Mflip/s
    drcft  fwd     3.573 sec.   274.009 Mflip/s
    drcft  init    0.058 sec.    96.384 Mflip/s
    drcft  inv     3.616 sec.   277.650 Mflip/s
ESSL and ESSL-SMP: "Easy" Parallelism
( -qsmp=omp -qessl -lomp -lesslsmp )
• For simple problems you can dial in the local data size by adjusting the number of threads
• Cache reuse can lead to superlinear speedup
• NH II node has 128 MB cache!
Parallelism Beyond One Node: MPI
Distributed Data Decomposition
[Figures: a 2D problem decomposed over 4 tasks and over 20 tasks]
• Distributed parallelism (MPI) requires both local and global addressing contexts (see the sketch below)
• Dimensionality of the decomposition can have a profound impact on scalability
  • Consider the surface-to-volume ratio: surface = communication (MPI), volume = local work (HPM)
• Decomposition is often a cause of load imbalance, which can reduce parallel efficiency
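A minimal sketch (my own illustration, not from the slides) of the local/global bookkeeping behind a 1-D slab decomposition: each rank computes the thickness and global offset of its slab, the same quantities FFTW's local_sizes call reports in the example that follows.

  ! Sketch: divide nx planes of a global array among MPI ranks as contiguous
  ! slabs, handling the remainder when nx is not a multiple of the task count.
  program slab_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: nx = 100      ! global extent of the decomposed axis
    integer :: ierr, rank, ntasks
    integer :: lnx, lxs, base, extra

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

    base  = nx / ntasks            ! minimum slab thickness
    extra = mod(nx, ntasks)        ! the first 'extra' ranks get one more plane
    if (rank < extra) then
       lnx = base + 1
       lxs = rank * lnx
    else
       lnx = base
       lxs = extra * (base + 1) + (rank - extra) * base
    end if

    ! Local index x = 0..lnx-1 corresponds to global index x + lxs
    write(*,'(a,i4,a,i6,a,i6)') 'rank ', rank, ': lnx =', lnx, '  lxs =', lxs

    call MPI_Finalize(ierr)
  end program slab_sketch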
Example FFTW: 3D
• Popular for its portability and performance
• Also consider PESSL's FFTs (not treated here)
• Uses a slab (1-D) data decomposition
• Direct algorithms for transforms of dimensions of size: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64
• For parallel FFTW calls, transforms are done in place

Plan:
  plan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz,
                                FFTW_FORWARD, flags);
  fftwnd_mpi_local_sizes(plan, &lnx, &lxs, &lnyt, &lyst, &lsize);

Transform:
  fftwnd_mpi(plan, 1, data, work, transform_flags);

What are lnx, lxs, lnyt, lyst, and lsize? (See the next slide.)
FFTW: Data Decomposition
• Each MPI rank owns a portion of the problem
[Figure: a global nx x ny x nz array; this rank owns planes lxs .. lxs+lnx-1 along x]

Local address context:
  for (x = 0; x < lnx; x++)
    for (y = 0; y < ny; y++)
      for (z = 0; z < nz; z++) {
        data[x][y][z] = f(x + lxs, y, z);
      }

Global address context:
  for (x = 0; x < nx; x++)
    for (y = 0; y < ny; y++)
      for (z = 0; z < nz; z++) {
        data[x][y][z] = f(x, y, z);
      }
FFTW: Parallel Performance
• FFT performance can be a complex function of problem size: the prime factors of the dimensions and the concurrency determine performance
• Consider data decompositions and paddings that lead to optimal local data sizes: cache use and prime factors
FFTW: Wisdom
• Runtime performance optimization that can be stored to a file
• Wise options: FFTW_MEASURE | FFTW_USE_WISDOM
• Unwise option: FFTW_ESTIMATE
• Wisdom works better for serial FFTs; there is some benefit for parallel FFTs, but it must amortize the increase in planning overhead