Libraries and Their Performance
Frank V. Hale
Thomas M. DeBoni
NERSC User Services Group
Part I: Single Node Performance Measurement • Use of hpmcount for measurement of total code performance • Use of HPM Toolkit for measurement of code section performance • Vector operations generally give better performance than scalar (indexed) operations • Shared-memory, SMP parallelism can be very effective and easy to use
Demonstration Problem • Compute π using random points in the unit square (ratio of points inside the inscribed circle to total points in the square) • Use an input file containing a sequence of 134,217,728 uniformly distributed random numbers in the range 0-1, stored as unformatted 8-byte floating-point numbers (1 gigabyte of data)
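The estimate follows from an area ratio: the inscribed circle of radius 0.5 has area π/4 while the unit square has area 1, so the fraction of uniformly random points landing inside the circle approaches π/4, and the codes below report π ≈ 4 × (points inside circle) / (total points).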
A first Fortran code

% cat estpi1.f
      implicit none
      integer i,points,circle
      real*8 x,y
      read(*,*)points
      open(10,file="runiform1.dat",status="old",form="unformatted")
      circle = 0
c     repeat for each (x,y) data point: read and compute
      do i=1,points
        read(10)x
        read(10)y
        if (sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5) circle = circle + 1
      enddo
      write(*,*)"Estimated pi using ",points," points as ",
     .  ((4.*circle)/points)
      end
Compile and Run with hpmcount

% cat jobestpi1
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi1.out
#@ error = jobestpi1.out
#@ environment = COPY_ALL
#@ queue
setenv FC "xlf_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 "
$FC -o estpi1 estpi1.f
echo "10000" > estpi1.dat
hpmcount ./estpi1 <estpi1.dat
exit
Some Observations • Performance is not very good at all, less than 1 Mflip/s (peak is 1,500 Mflip/s per processor) • Scalar approach to computation • Scalar I/O mixed with scalar computation Suggestions: • Separate I/O from computation • Use vector operations on dynamically allocated vector data structures
A second code, Fortran 90

% cat estpi2.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
c     dynamically allocated vector data structures
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      write(*,*)"Estimated pi using ",points," points as ",
     &  ((4.*circle)/points)
      end
Observations on Second Code • Operations on whole vectors should be faster, but • No real improvement in performance of total code was observed. • Suspect that most time is being spent on I/O. • I/O is now separate from computation, so the code is easy to instrument in sections
Instrument code sections with HPM Toolkit Four sections to be separately measured: • Data structure initialization • Read data • Estimate π • Write output Calls to f_hpmstart and f_hpmstop around each section.
Instrumented Code (1 of 2)

% cat estpi3.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
Instrumented Code (2 of 2)

      call f_hpmstart(2,"Read data")
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      call f_hpmstop(2)
      call f_hpmstart(3,"Estimate pi")
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      call f_hpmstop(3)
      call f_hpmstart(4,"Write output")
      write(*,*)"Estimated pi using ",points," points as ",
     &  ((4.*circle)/points)
      call f_hpmstop(4)
      call f_hpmterminate(0)
      end
Notes on Instrumented Code • The entire executable code is enclosed between f_hpminit and f_hpmterminate • Code sections are enclosed between f_hpmstart and f_hpmstop • Descriptive text labels appear in the output file(s)
Compile and Run with HPM Toolkit

% cat jobestpi3
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi3.out
#@ error = jobestpi3.out
#@ environment = COPY_ALL
#@ queue
module load hpmtoolkit
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o estpi3 estpi3.f
echo "10000000" > estpi3.dat
./estpi3 <estpi3.dat
exit
Notes on Use of HPM Toolkit • Must load module hpmtoolkit • Need to include header file f_hpm.h in Fortran code, and give preprocessor directions to compiler with -qsuffix • Performance output in a file named like perfhpmNNNN.MMMMM where NNNN is the task id and MMMMM is the process id • Message from sample executable: libHPM output in perfhpm0000.21410
Comparison of Code Sections (10,000,000 points)
Observations on Sections • Optimizing the estimation of π has little effect because the code spends 99% of its time reading the data • Can the I/O be optimized?
Reworking the I/O • Whole-array I/O versus scalar I/O • With scalar I/O (one number per record) the file is twice as big (8 bytes for the number, 8 bytes for the end-of-record marker) • The whole-array I/O file has only one end-of-record marker • Only one call to the Fortran read routine is needed for whole-array I/O: read(10)xy • Some fancy array footwork is needed to sort x(1), y(1), x(2), y(2), … x(n), y(n) out of the xy array: x = xy(1::2) and y = xy(2::2)
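The slides do not show how the single-record file runiform.dat was produced; a minimal sketch, assuming it is simply the original one-number-per-record file rewritten as one unformatted record (the program name mkvec and this conversion step are illustrative, not from the original):

      program mkvec
c     hypothetical conversion step: read the original scalar-record
c     file and rewrite it as a single unformatted record so that a
c     single read(10)xy can load all of the data at once
      implicit none
      integer :: i, n
      real(kind=8), allocatable, dimension(:) :: xy
      n = 134217728
      allocate (xy(n))
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,n
        read(10)xy(i)
      enddo
      close(10)
      open(20,file="runiform.dat",status="new",form="unformatted")
      write(20)xy
      close(20)
      end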
Revised Data Structures and I/O

% cat estpi4.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x, y, xy
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (xy(2*points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
      call f_hpmstart(2,"Read data")
      open(10,file="runiform.dat",status="old",form="unformatted")
      read(10)xy
      x = xy(1::2)
      y = xy(2::2)
      call f_hpmstop(2)
Vector I/O Code Sections (10,000,000 points)
Observations on New Sections • Reading the data as a whole array rather than one value at a time cut the read time from 89.9 to 3.16 seconds, a 96% reduction in I/O time • There was no performance penalty for the additional data structure complexity • I/O design can have a very significant performance impact! • Total code performance measured with hpmcount is now 15.4 Mflip/s, roughly 20 times the 0.801 Mflip/s of the scalar-I/O code
Automatic Shared-Memory (SMP) Parallelization • IBM Fortran provides a -qsmp option for automatic shared-memory parallelization, allowing multithreaded computation within a node • The default number of threads is 16; the thread count is controlled by the OMP_NUM_THREADS environment variable • Also allows use of the SMP version of the ESSL library, -lesslsmp
Compiler Options • The source code is the same as the previous, vector-operation example, estpi4.f • The options -qsmp and -lesslsmp enable automatic shared-memory parallelism (SMP) and link the SMP ESSL library • Compiler command line:
xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp -lesslsmp -o estpi5 estpi4.f
SMP Code Sections (10,000,000 points)
Observations on SMP Code • The computational section now runs at 1,100 Mflip/s, or 4.6% of the theoretical peak of 24,000 Mflip/s on a 16-processor node • The computational section is now 12 times faster, with no changes to the source code • Recommendation: always use the thread-safe compilers (with the _r suffix) and -qsmp unless there is a good reason to do otherwise • There are no explicit parallelism directives in the source code; all threading is within the library
Too Many Threads Can Spoil Performance • Each node has 16 processors; running more threads than processors usually will not improve performance
Sidebar: Cost of a Misaligned Common Block • User code with Fortran 77-style common blocks may receive a seemingly innocuous warning: 1514-008 (W) Variable … is misaligned. This may affect the efficiency of the code. • How much can this affect the efficiency of the code? • Test: put arrays x and y in a misaligned common block, with a 1-byte character in front of them (a sketch follows below)
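A minimal sketch of the kind of misaligned common block used for the test; the block name /xydata/, the pad variable, and the array size are illustrative, not taken from the original code:

      program misaligned
      implicit none
c     the 1-byte character placed ahead of the 8-byte arrays pushes
c     x and y off their natural 8-byte boundaries, which triggers
c     compiler warning 1514-008 and slows access to the arrays
      character*1 pad
      real*8 x(10000000), y(10000000)
      common /xydata/ pad, x, y
      pad = ' '
      x = 0.5d0
      y = 0.5d0
      write(*,*) x(1) + y(1)
      end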
Potential Cost of Misaligned Common Blocks • 10,000,000 points used for computing π • Properly aligned, dynamically allocated x and y used 0.064 seconds at 1,100 Mflip/s • Misaligned, statically allocated x and y in a common block used 0.834 seconds at 88.4 Mflip/s • The misaligned common block slowed the computation by roughly a factor of 12
Part I Conclusion • hpmcount can be used to measure the performance of the total code • The HPM Toolkit can be used to measure the performance of discrete code sections • Optimization effort must be focused effectively • Fortran 90 vector operations are generally faster than Fortran 77 scalar operations • Use of automatic SMP parallelization may provide an easy performance boost • I/O may be the largest factor in "whole code" performance • Misaligned common blocks can be very expensive
Part II: Comparing Libraries • In the rich user environment on seaborg, there are many alternative ways to do the same computation • The HPM Toolkit provides the tools to compare alternative approaches to the same computation
Dot Product Functions • User-coded scalar computation • User-coded vector computation • Single-processor ESSL ddot • Multi-threaded SMP ESSL ddot • Single-processor IMSL ddot • Single-processor NAG f06eaf • Multi-threaded SMP NAG f06eaf
Sample Problem • Test the Cauchy-Schwarz inequality for N vectors of length N: (X•Y)² <= (X•X)(Y•Y) • Generate 2N random numbers (array x2) • Use the first N for X; (X•X) is computed once • Vary the vector Y: for i=1,n, y = 2.0*x2(i:n+(i-1)); the first Y is 2X, the second is 2*x2(2:N+1), etc. • Compute (2*N)+1 dot products of length N
Instrumented Code Section for Dot Products

      call f_hpmstart(1,"Dot products")
      xx = ddot(n,x,1,x,1)
      do i=1,n
        y = 2.0*x2(i:n+(i-1))
        yy = ddot(n,y,1,y,1)
        xy = ddot(n,x,1,y,1)
        diffs(i) = (xx*yy)-(xy*xy)
      enddo
      call f_hpmstop(1)
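The NAG driver libsnag.f is not listed in the slides; a sketch of its instrumented section, assuming f06eaf takes the same (n, x, incx, y, incy) argument list as the ddot call above and that the rest of the driver is unchanged:

      call f_hpmstart(1,"Dot products")
c     f06eaf is assumed to be declared real*8, like ddot above
      xx = f06eaf(n,x,1,x,1)
      do i=1,n
        y = 2.0*x2(i:n+(i-1))
        yy = f06eaf(n,y,1,y,1)
        xy = f06eaf(n,x,1,y,1)
        diffs(i) = (xx*yy)-(xy*xy)
      enddo
      call f_hpmstop(1)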
Two User Coded Functions

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n),dp
      dp = 0.
      do i=1,n
        dp = dp + x(i)*y(i)    ! User scalar loop
      enddo
      myddot = dp
      return
      end

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n)
      myddot = sum(x*y)        ! User vector computation
      return
      end
Compile and Run User Functions

module load hpmtoolkit
echo "100000" > libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o libs0 libs0.f
./libs0 <libs.dat
$FC -o libs0a libs0a.f
./libs0a <libs.dat
Compile and Run ESSL Versions

setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -lessl"
$FC -o libs1 libs1.f
./libs1 <libs.dat

setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp -lesslsmp"
$FC -o libs1smp libs1.f
./libs1smp <libs.dat
Compile and Run IMSL Version

module load imsl
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $IMSL"
$FC -o libs1imsl libs1.f
./libs1imsl <libs.dat
module unload imsl
Compile and Run NAG Versions

module load nag_64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG"
$FC -o libs1nag libsnag.f
./libs1nag <libs.dat
module unload nag_64

module load nag_smp64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG_SMP6 -qsmp=omp -qnosave"
$FC -o libs1nagsmp libsnag.f
./libs1nagsmp <libs.dat
module unload nag_smp64
First Comparison of Dot Product (N=100,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)   Notes
User Scalar   246                203       1.72
User Vector   249                201       1.74
ESSL          145                346       1.01
ESSL-SMP      408                123       2.85                        Slowest
IMSL          143                351       1.00                        Fastest
NAG           250                200       1.75
NAG-SMP       180                278       1.26
Comments on First Comparisons • The best results, by just a little, were obtained using the IMSL library, with ESSL a close second • Third best was the NAG-SMP routine, with benefits from multi-threaded computation • The user coded routines and NAG were about 75% slower than the ESSL and IMSL routines. In general, library routines are highly optimized and better than user coded routines. • The ESSL-SMP library did very poorly on this computation; this unexpected result may be due to data structures in the library, or perhaps the number of threads (default is 16).
ESSL-SMP Performance vs. Number of Threads • All for N=100,000 • Number of threads controlled by environment variable OMP_NUM_THREADS
Revised First Comparison of Dot Product (N=100,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)   Notes
User Scalar   246                203       4.9
User Vector   249                201       5.0
ESSL          145                346       2.9
ESSL-SMP      50                 1000      1.0                         Fastest (4 threads)
IMSL          143                351       2.9
NAG           250                200       5.0                         Slowest
NAG-SMP       180                278       3.6

Tuning the number of threads is very, very important for SMP codes!
Scaling up the Problem • The first comparisons were for N=100,000, computing 200,001 dot products of vectors of length 100,000 • The second comparison, for N=200,000, computes 400,001 dot products of vectors of length 200,000 • This increases the computational complexity by a factor of 4 (see the operation count below)
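A rough operation count behind the factor of 4: each dot product of length N costs about 2N flops, the loop performs 2N+1 of them, and forming each scaled Y costs another N flops per iteration, for roughly 5N² flops in total. That is about 5 x 10^10 flops for N=100,000 (consistent with the roughly 246 s at 203 Mflip/s measured for the user-coded scalar version) and about 2 x 10^11 flops for N=200,000, four times as many.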
Second Comparison of Dot Product (N=200,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)   Notes
User Scalar   1090               183       2.17
User Vector   1180               169       2.35                        Slowest
ESSL          739                271       1.47
ESSL-SMP      503                398       1.00                        Fastest
IMSL          725                276       1.44
NAG           1120               179       2.23
NAG-SMP       864                231       1.72
Comments on Second Comparisons (N=200,000) • Now the best results are from the ESSL-SMP library, with the default 16 threads • The next best group is ESSL, IMSL, and NAG-SMP, taking 50-75% longer than the ESSL-SMP routine • The worst results came from single-threaded NAG and the user-coded routines • What is the impact of the number of threads on ESSL-SMP performance, given that it is already the best?
ESSL-SMP Performance vs. Number of Threads • All for N=200,000 • Number of threads controlled by environment variable OMP_NUM_THREADS
Revised Second Comparison of Dot Product (N=200,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)   Notes
User Scalar   1090               183       7.5
User Vector   1180               169       8.1                         Slowest
ESSL          739                271       5.1
ESSL-SMP      146                1370      1.0                         Fastest (6 threads)
IMSL          725                276       5.0
NAG           1120               179       7.7
NAG-SMP       864                231       5.9