Libraries and Program Performance
NERSC User Services Group

An Embarrassment of Riches
• Serial Libraries
• Threaded Libraries
• Parallel Libraries (Distributed & Threaded)
Elementary Math Functions
• Three libraries provide elementary math functions:
   • C/Fortran intrinsics
   • MASS/MASSV (Math Acceleration Subroutine System)
   • ESSL/PESSL (Engineering and Scientific Subroutine Library / Parallel ESSL)
• Language intrinsics are the most convenient, but not the best performers
Elementary Functions in Libraries
• MASS
   • sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y
• MASSV
   • cos, dint, exp, log, sin, tan, div, rsqrt, sqrt, atan
See http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
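To show the difference in calling style, here is a minimal sketch (not from the original slides) of replacing a scalar intrinsic loop with a MASSV vector call, assuming the documented vexp(y, x, n) interface (output array, input array, element count) and linking against the MASSV library:

      program massv_demo
      implicit none
      integer, parameter :: n = 1000000
      real*8 x(n), y(n), z(n)
      integer i

      do i = 1, n
         x(i) = 1.0d-6 * dble(i)
      end do

!     Scalar intrinsic version: one exp() per loop iteration
      do i = 1, n
         z(i) = exp(x(i))
      end do

!     MASSV version: one library call for the whole array
!     (assumed interface: vexp(output, input, count))
      call vexp(y, x, n)

      write(6,*) 'max difference = ', maxval(abs(y - z))
      end

The vector routines amortize call overhead across the whole array, which is consistent with the MASS-vs-intrinsics timings reported in the results below.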
Other Intrinsics in Libraries
• ESSL
   • Linear Algebra Subprograms
   • Matrix Operations
   • Linear Algebraic Equations
   • Eigensystem Analysis
   • Fourier Transforms, Convolutions, Correlations, and Related Computations
   • Sorting and Searching
   • Interpolation
   • Numerical Quadrature
   • Random Number Generation
Comparing Elementary Functions
• Loop schema for elementary functions:

 99   write(6,98)
 98   format( " sqrt: " )
      x = pi/4.0
      call f_hpmstart(1,"sqrt")
      do 100 i = 1, loopceil
         y = sqrt(x)
         x = y * y
100   continue
      call f_hpmstop(1)
      write(6,101) x
101   format( " x = ", g21.14 )
Comparing Elementary Functions
• Execution schema for elementary functions:

setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
module load hpmtoolkit
module load mass
module list
setenv L1 "-Wl,-v,-bC:massmap"
xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest
timex masstest < input > mathout
Results Examined • Answers after 50e6 iterations • User execution time • # Floating and FMA instructions • Operation rate in Mflip/sec
Results Observed • No difference in answers • Best times/rates at -O3 or -O4 • ESSL no different from intrinsics • MASS much faster than intrinsics
Comparing Higher Level Functions
• Several sources of a matrix-multiply function:
   • User-coded scalar computation
   • Fortran intrinsic matmul
   • Single-processor ESSL dgemm
   • Multi-threaded SMP ESSL dgemm
   • Single-processor IMSL dmrrrr (32-bit)
   • Single-processor NAG f01ckf
   • Multi-threaded SMP NAG f01ckf
Sample Problem
• Multiply dense matrices:
   • A(1:n,1:n) = i + j
   • B(1:n,1:n) = j - i
   • C(1:n,1:n) = A * B
• Output C to verify the result
Kernel of user matrix multiply

      do i=1,n
         do j=1,n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         enddo
      enddo

      call f_hpmstart(1,"Matrix multiply")
      do j=1,n
         do k=1,n
            do i=1,n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      call f_hpmstop(1)
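For comparison, the library versions replace the triple loop with a single call. This is a sketch (not from the original slides) of the ESSL/BLAS dgemm equivalent, assuming n, a, b, and c are declared as in the kernel above, c is initialized to zero, and the HPM section number is arbitrary:

!     C := 1.0*A*B + 0.0*C  (standard Level-3 BLAS interface, provided by ESSL)
      call f_hpmstart(2,"dgemm")
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      call f_hpmstop(2)

Linking the same source against -lessl or -lesslsmp selects the single-processor or multi-threaded version without changing the call.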
Comparison of Matrix Multiply (N1=5,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar   1,490                 168    106  (slowest)
Intrinsic     1,477                 169    106  (slowest)
ESSL            195               1,280    13.9
IMSL            194               1,290    13.8
NAG             195               1,280    13.9
ESSL-SMP         14              17,800     1.0  (fastest)
NAG-SMP          14              17,800     1.0  (fastest)
Observations on Matrix Multiply
• Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both obtained 74% of peak node performance
• All the single-processor library functions took about 14 times more wall-clock time than the SMP versions, each obtaining about 85% of peak for a single processor
• Worst times were from the user code and the Fortran intrinsic, which took over 100 times more wall-clock time than the SMP libraries
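As a rough consistency check, assuming the 16-way POWER3 nodes of that era (about 1,500 Mflip/s peak per processor, 24,000 Mflip/s per node; these peak figures are an assumption, not stated on the slide): 17,800 / 24,000 is roughly 74% of node peak for the SMP libraries, and 1,280 / 1,500 is roughly 85% of single-processor peak for the serial libraries, matching the percentages quoted above.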
Comparison of Matrix Multiply (N2=10,000)

Version    Wall Clock (sec)   Mflip/s   Scaled Time
ESSL-SMP   101                19,800    1.01
NAG-SMP    100                19,900    1.00

Scaling with problem size (complexity increase ~8x):

Version    Wall Clock (N2/N1)   Mflip/s (N2/N1)
ESSL-SMP   7.2                  1.10
NAG-SMP    7.1                  1.12

Both ESSL-SMP and NAG-SMP showed 10% performance gains at the larger problem size.
Observations on Scaling
• Scaling of problem size was only done for the SMP libraries, to keep run times reasonable
• Doubling N gives an 8x increase in computational complexity for dense matrix multiplication, since the work grows as O(N^3)
• Performance actually increased for both routines at the larger problem size
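A worked check using the ~2N^3 flops of a dense matrix multiply: N1 = 5,000 gives 2 x 5,000^3 = 2.5 x 10^11 flops, which at 17,800 Mflip/s is about 14 seconds; N2 = 10,000 gives 2.0 x 10^12 flops (8 times more), which at 19,800 Mflip/s is about 101 seconds. Both agree with the wall-clock times in the tables above.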
ESSL-SMP Performance vs. Number of Threads • All for N=10,000 • Number of threads controlled by environment variable OMP_NUM_THREADS
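For example, in the csh style of the execution schema earlier in the deck, setenv OMP_NUM_THREADS 16 before launching the ESSL-SMP executable selects 16 threads; repeating the run at 1, 2, 4, 8, and 16 threads is how a thread-count scan like this one can be generated.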
Parallelism: Choices Based on Problem Size
Doing a month's work in a few minutes!
• Three good choices: ESSL / LAPACK, ESSL-SMP, ScaLAPACK
• Only beyond a certain problem size is there any opportunity for parallelism
(Figure: matrix-matrix multiply performance versus problem size)
Larger Functions: FFTs
• ESSL, FFTW, NAG, IMSL
See http://www.nersc.gov/nusers/resources/software/libs/math/fft/
• We looked at ESSL, NAG, and IMSL
• One-D, forward and reverse
One-D FFT’s • NAG • c06eaf - forward • c06ebf - inverse, conjugate needed • c06faf - forward, work-space needed • c06ebf - inverse, work-space & conjugate needed • IMSL • z_fast_dft - forward & reverse, separate arrays • ESSL • drcft - forward & reverse, work-space & initialization step needed • All have size constraints on their data sets
One-D FFT Measurement
• 2^24 real*8 data points input (a synthetic signal)
• Each transform ran in a 20-iteration loop
• All performed both forward and inverse transforms on the same data
• Input and inverse outputs were identical
• Measured with HPMToolkit

      second1 = rtc()
      call f_hpmstart(33,"nag2 forward")
      do loopind = 1, loopceil
         w(1:n) = x(1:n)
         call c06faf( w, n, work, ifail )
      end do
      call f_hpmstop(33)
      second2 = rtc()
One-D FFT Performance

NAG
   c06eaf  fwd   25.182 sec.    54.006 Mflip/s
   c06ebf  inv   24.465 sec.    40.666 Mflip/s
   c06faf  fwd   29.451 sec.    46.531 Mflip/s
   c06ebf  inv   24.469 sec.    40.663 Mflip/s
   (required a data copy for each iteration of each transform)

IMSL
   z_fast_dft  fwd   71.479 sec.    46.027 Mflip/s
   z_fast_dft  inv   71.152 sec.    48.096 Mflip/s

ESSL
   drcft  init    0.032 sec.    62.315 Mflip/s
   drcft  fwd     3.573 sec.   274.009 Mflip/s
   drcft  init    0.058 sec.    96.384 Mflip/s
   drcft  inv     3.616 sec.   277.650 Mflip/s
ESSL and ESSL-SMP: "Easy" Parallelism ( -qsmp=omp -qessl -lomp -lesslsmp )
• For simple problems you can dial in the local data size by adjusting the number of threads
• Cache reuse can lead to superlinear speedup
• An NH II node has 128 MB of cache!
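To see why superlinear speedup is possible: a 10,000 x 10,000 real*8 matrix occupies 800 MB, far more than the 128 MB node cache, but as the thread count grows each thread works on a smaller share of the blocked computation, so a larger fraction of its working set can stay cache-resident. The 800 MB figure is simple arithmetic; the cache-residency argument is the usual explanation for the effect noted on this slide.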
Parallelism Beyond One Node: MPI
• Distributed parallelism (MPI) requires both local and global addressing contexts
• Dimensionality of the decomposition can have a profound impact on scalability
• Consider the surface-to-volume ratio: surface = communication (MPI), volume = local work (HPM)
• Decomposition is often a cause of load imbalance, which can reduce parallel efficiency
(Figures: distributed data decomposition of a 2D problem over 4 tasks and over 20 tasks)
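A worked example of the surface-to-volume argument: for a 1,024 x 1,024 grid on 16 tasks, a 1D (slab) decomposition gives each task a 64 x 1,024 block with 2 x 1,024 = 2,048 boundary points to communicate, while a 4 x 4 2D decomposition gives 256 x 256 blocks with 4 x 256 = 1,024 boundary points; the local work (volume) is identical, but the communication (surface) is halved.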
Example FFTW: 3D
• Popular for its portability and performance
• Also consider PESSL's FFTs (not treated here)
• Uses a slab (1D) data decomposition
• Direct algorithms for transforms of dimensions of size: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64
• For parallel FFTW calls, transforms are done in place

Plan:
   plan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz, FFTW_FORWARD, flags);
   fftwnd_mpi_local_sizes(plan, &lnx, &lxs, &lnyt, &lyst, &lsize);
Transform:
   fftwnd_mpi(plan, 1, data, work, transform_flags);

What are lnx, lxs, lnyt, lyst, and lsize? See the data decomposition on the next slide.
FFTW: Data Decomposition
Each MPI rank owns a portion of the nx x ny x nz problem: lnx planes of the x dimension, starting at global index lxs.

Local address context (each rank fills only its own slab):
   for(x=0; x<lnx; x++)
     for(y=0; y<ny; y++)
       for(z=0; z<nz; z++)
         data[x][y][z] = f(x+lxs, y, z);

Global address context (the undistributed equivalent):
   for(x=0; x<nx; x++)
     for(y=0; y<ny; y++)
       for(z=0; z<nz; z++)
         data[x][y][z] = f(x, y, z);
FFTW: Parallel Performance
• FFT performance may be a complex function of problem size: the prime factors of the dimensions and the concurrency determine performance
• Consider data decompositions and paddings that lead to optimal local data sizes: cache use and prime factors
FFTW: Wisdom
• Runtime performance optimization; can be stored to a file
• Wise options: FFTW_MEASURE | FFTW_USE_WISDOM
• Unwise options: FFTW_ESTIMATE
• Wisdom works better for serial FFTs; there is some benefit for parallel FFTs, but it must amortize the increase in planning overhead