
Libraries and Program Performance



  1. Libraries and Program Performance
  NERSC User Services Group

  2. An Embarrassment of Riches: Serial

  3. Threaded Libraries (Threaded)

  4. Parallel Libraries (Distributed & Threaded)

  5. Elementary Math Functions
  • Three libraries provide elementary math functions:
    • C/Fortran intrinsics
    • MASS/MASSV (Math Acceleration Subroutine System)
    • ESSL/PESSL (Engineering and Scientific Subroutine Library)
  • Language intrinsics are the most convenient, but not the best performers

  6. Elementary Functions in Libraries
  • MASS: sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y
  • MASSV: cos, dint, exp, log, sin, tan, div, rsqrt, sqrt, atan
  • See http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
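  To make the calling-style difference concrete, here is a minimal Fortran sketch contrasting the scalar intrinsic with a MASSV vector routine. It assumes the usual MASSV convention in which the vector routines carry a leading "v" and operate on whole arrays, e.g. vsqrt(y, x, n) sets y(i) = sqrt(x(i)) for i = 1..n; the array names and length are arbitrary, and the program must be linked against the MASS libraries as on the following slides.

      ! sketch only: scalar intrinsic vs. MASSV vector call
      integer, parameter :: n = 100000
      real*8 :: x(n), y(n)
      integer :: i
      x = 2.0d0
      do i = 1, n              ! intrinsic: one scalar call per element
        y(i) = sqrt(x(i))
      end do
      call vsqrt(y, x, n)      ! MASSV: one call for the whole array
      write(6,*) y(1)
      end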

  7. Other Intrinsics in Libraries
  • ESSL
    • Linear Algebra Subprograms
    • Matrix Operations
    • Linear Algebraic Equations
    • Eigensystem Analysis
    • Fourier Transforms, Convolutions, Correlations, and Related Computations
    • Sorting and Searching
    • Interpolation
    • Numerical Quadrature
    • Random Number Generation

  8. Comparing Elementary Functions
  • Loop schema for elementary functions (the HPM calls bracket the timed kernel):

      99  write(6,98)
      98  format( " sqrt: " )
          x = pi/4.0
          call f_hpmstart(1,"sqrt")
          do 100 i = 1, loopceil
            y = sqrt(x)
            x = y * y
      100 continue
          call f_hpmstop(1)
          write(6,101) x
      101 format( " x = ", g21.14 )

  9. Comparing Elementary Functions
  • Execution schema for elementary functions:

      setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
      module load hpmtoolkit
      module load mass
      module list
      setenv L1 "-Wl,-v,-bC:massmap"
      xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest
      timex masstest < input > mathout

  10. Results Examined
  • Answers after 50e6 iterations
  • User execution time
  • Number of floating-point and FMA instructions
  • Operation rate in Mflip/s

  11. Results Observed
  • No difference in answers
  • Best times/rates at -O3 or -O4
  • ESSL no different from intrinsics
  • MASS much faster than intrinsics

  12. Comparing Higher Level Functions
  • Several sources of a matrix-multiply function:
    • User-coded scalar computation
    • Fortran intrinsic matmul
    • Single-processor ESSL dgemm
    • Multi-threaded SMP ESSL dgemm
    • Single-processor IMSL dmrrrr (32-bit)
    • Single-processor NAG f01ckf
    • Multi-threaded SMP NAG f01ckf

  13. Sample Problem
  • Multiply dense matrices:
    • A(1:n,1:n) = i + j
    • B(1:n,1:n) = j - i
    • C(1:n,1:n) = A * B
  • Output C to verify the result

  14. Kernel of user matrix multiply

      do i = 1, n
        do j = 1, n
          a(i,j) = real(i+j)
          b(i,j) = real(j-i)
        enddo
      enddo
      call f_hpmstart(1,"Matrix multiply")
      do j = 1, n
        do k = 1, n
          do i = 1, n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          enddo
        enddo
      enddo
      call f_hpmstop(1)
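  For the library versions in the comparison, the timed triple loop collapses to a single call. A minimal sketch using ESSL's dgemm (the standard BLAS DGEMM interface, which ESSL provides); the problem size and the verification write are illustrative only:

      ! sketch only: c = 1.0*a*b + 0.0*c via
      ! dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
      integer, parameter :: n = 500
      real*8 :: a(n,n), b(n,n), c(n,n)
      integer :: i, j
      do j = 1, n
        do i = 1, n
          a(i,j) = real(i+j)
          b(i,j) = real(j-i)
        enddo
      enddo
      c = 0.0d0
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      write(6,*) 'c(1,1) = ', c(1,1)
      end

  Linked against the serial library this runs on one processor; linked against the SMP library (-lesslsmp, as shown on slide 25) the same call is threaded.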

  15. Comparison of Matrix Multiply (N1 = 5,000)

      Version        Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
      User scalar         1,490             168      106   (slowest)
      Intrinsic           1,477             169      106   (slowest)
      ESSL                  195           1,280       13.9
      IMSL                  194           1,290       13.8
      NAG                   195           1,280       13.9
      ESSL-SMP               14          17,800        1.0  (fastest)
      NAG-SMP                14          17,800        1.0  (fastest)

  16. Observations on Matrix Multiply
  • Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both reached 74% of peak node performance
  • All the single-processor library functions took 14 times more wall-clock time than the SMP versions, each reaching about 85% of single-processor peak
  • Worst times were from the user code and the Fortran intrinsic, which took about 100 times more wall-clock time than the SMP libraries
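  A quick back-of-the-envelope check of those percentages (arithmetic not on the slide): 17,800 Mflip/s at 74% of node peak implies a node peak of roughly 24 Gflop/s, and 1,280 Mflip/s at about 85% of single-processor peak implies roughly 1.5 Gflop/s per processor, consistent with a 16-processor node.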

  17. Comparison of Matrix Multiply (N2 = 10,000)

      Version     Wall Clock (sec)   Mflip/s   Scaled Time
      ESSL-SMP         101            19,800       1.01
      NAG-SMP          100            19,900       1.00

  • Scaling with problem size (complexity increase ~8x):

      Version     Wall Clock (N2/N1)   Mflip/s (N2/N1)
      ESSL-SMP          7.2                 1.10
      NAG-SMP           7.1                 1.12

  • Both ESSL-SMP and NAG-SMP showed 10% performance gains at the larger problem size.

  18. Observations on Scaling
  • Scaling of problem size was only done for the SMP libraries, to keep run times reasonable
  • Doubling N increases the computational complexity of dense matrix multiplication by a factor of 8 (worked out below)
  • Performance actually increased for both routines at the larger problem size
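  Worked out from the numbers on the previous slides: a dense N x N multiply takes about 2N^3 floating-point operations, so doubling N raises the count from 2N^3 to 2(2N)^3 = 16N^3, a factor of 8. At an unchanged Mflip/s rate the wall-clock time would also grow 8x; the observed ratios of 7.2 and 7.1 fall short of 8 precisely because the rate improved by about 10% (8/7.2 is roughly 1.11).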

  19. ESSL-SMP Performance vs. Number of Threads
  • All for N = 10,000
  • Number of threads controlled by the environment variable OMP_NUM_THREADS
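  Per the slide, ESSL-SMP takes its thread count from OMP_NUM_THREADS (set, for example, with setenv OMP_NUM_THREADS 16 before running; the value 16 is only illustrative). A minimal sketch of how one might confirm the setting from Fortran, assuming an OpenMP-aware compile (e.g. xlf90_r -qsmp=omp):

      ! sketch only: omp_get_max_threads() reports the limit implied by
      ! OMP_NUM_THREADS, the same setting the slide says ESSL-SMP honors
      program threadcheck
        use omp_lib
        implicit none
        write(6,*) 'up to ', omp_get_max_threads(), ' threads'
      end program threadcheck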

  20. Parallelism: Choices Based on Problem Size
  • Doing a month's work in a few minutes!
  • Three good choices:
    • ESSL / LAPACK
    • ESSL-SMP
    • ScaLAPACK
  • Only beyond a certain problem size is there any opportunity for parallelism
  (chart: matrix-matrix multiply performance vs. problem size)

  21. Larger Functions: FFTs
  • ESSL, FFTW, NAG, IMSL; see http://www.nersc.gov/nusers/resources/software/libs/math/fft/
  • We looked at ESSL, NAG, and IMSL
  • One-D, forward and reverse

  22. One-D FFTs
  • NAG
    • c06eaf - forward
    • c06ebf - inverse, conjugate needed
    • c06faf - forward, work-space needed
    • c06ebf - inverse, work-space & conjugate needed
  • IMSL
    • z_fast_dft - forward & reverse, separate arrays
  • ESSL
    • drcft - forward & reverse, work-space & initialization step needed
  • All have size constraints on their data sets

  23. One-D FFT Measurement
  • 2^24 real*8 data points input (a synthetic signal)
  • Each transform ran in a 20-iteration loop
  • All performed both forward and inverse transforms on the same data
  • Input and inverse outputs were identical
  • Measured with HPMToolkit:

      second1 = rtc()
      call f_hpmstart(33,"nag2 forward")
      do loopind = 1, loopceil
        w(1:n) = x(1:n)                   ! copy input; the transform is in place
        call c06faf( w, n, work, ifail )
      end do
      call f_hpmstop(33)
      second2 = rtc()

  24. One-D FFT Performance

      NAG   c06eaf      fwd    25.182 sec     54.006 Mflip/s
            c06ebf      inv    24.465 sec     40.666 Mflip/s
            c06faf      fwd    29.451 sec     46.531 Mflip/s
            c06ebf      inv    24.469 sec     40.663 Mflip/s
            (required a data copy for each iteration for each transform)
      IMSL  z_fast_dft  fwd    71.479 sec     46.027 Mflip/s
            z_fast_dft  inv    71.152 sec     48.096 Mflip/s
      ESSL  drcft       init    0.032 sec     62.315 Mflip/s
            drcft       fwd     3.573 sec    274.009 Mflip/s
            drcft       init    0.058 sec     96.384 Mflip/s
            drcft       inv     3.616 sec    277.650 Mflip/s

  25. ESSL and ESSL-SMP: "Easy" Parallelism (-qsmp=omp -qessl -lomp -lesslsmp)
  • For simple problems, the local data size can be dialed in by adjusting the number of threads
  • Cache reuse can lead to superlinear speedup
  • NH II node has 128 MB cache!

  26. Parallelism Beyond One Node: MPI
  (figures: distributed data decomposition of a 2D problem in 4 tasks and in 20 tasks)
  • Distributed parallelism (MPI) requires both local and global addressing contexts
  • Dimensionality of the decomposition can have a profound impact on scalability
  • Consider the surface-to-volume ratio: surface = communication (MPI), volume = local work (HPM)
  • Decomposition is often the cause of load imbalance, which can reduce parallel efficiency
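  A concrete illustration of the surface-to-volume argument (numbers chosen only for illustration): decompose a 1024 x 1024 2D grid over 16 tasks. A 1D (slab) decomposition gives each task a 1024 x 64 strip, which exchanges two 1024-point edges, about 2,048 halo points for 65,536 local points (~3%). A 2D decomposition gives each task a 256 x 256 block, which exchanges four 256-point edges, about 1,024 halo points for the same 65,536 local points (~1.6%). The higher-dimensional decomposition does the same local work with half the communication surface.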

  27. Example FFTW: 3D
  • Popular for its portability and performance
  • Also consider PESSL's FFTs (not treated here)
  • Uses a slab (1D) data decomposition
  • Direct algorithms for transforms of dimensions of size 1-16, 32, and 64
  • For parallel FFTW calls, transforms are done in place

  Plan:
      fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz, FFTW_FORWARD, flags);
      fftwnd_mpi_local_sizes(plan, &lnx, &lxs, &lnyt, &lyst, &lsize);
  Transform:
      fftwnd_mpi(plan, 1, data, work, transform_flags);

  What are these? (lnx, lxs, lnyt, lyst, lsize - see the next slide)

  28. FFTW: Data Decomposition
  • Each MPI rank owns a portion of the problem: of the nx planes, this rank holds lnx planes starting at plane lxs

  Local address context:
      for (x = 0; x < lnx; x++)
        for (y = 0; y < ny; y++)
          for (z = 0; z < nz; z++) {
            data[x][y][z] = f(x + lxs, y, z);
          }

  Global address context:
      for (x = 0; x < nx; x++)
        for (y = 0; y < ny; y++)
          for (z = 0; z < nz; z++) {
            data[x][y][z] = f(x, y, z);
          }

  29. FFTW: Parallel Performance
  • FFT performance may be a complex function of problem size: the prime factors of the dimensions and the concurrency determine performance
  • Consider data decompositions and paddings that lead to optimal local data sizes: cache use and prime factors

  30. FFTW: Wisdom
  • Runtime performance optimization that can be stored to a file
  • Wise options: FFTW_MEASURE | FFTW_USE_WISDOM
  • Unwise options: FFTW_ESTIMATE
  • Wisdom works better for serial FFTs; there is some benefit for parallel FFTs, but it must amortize the increase in planning overhead
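  As a pointer for readers who want to try this: in the FFTW 2.x C API the accumulated wisdom can be written out and read back with the export/import wisdom-to-file calls (fftw_export_wisdom_to_file and fftw_import_wisdom_from_file); treat the exact names as an assumption to be checked against the installed FFTW version.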
