S3D: Comparing Performance of XT3+XT4 with XT4 Sameer Shende tau-team@cs.uoregon.edu
Acknowledgements • Alan Morris [UO] • Kevin Huck [UO] • Allen D. Malony [UO] • Kenneth Roche [ORNL] • Bronis R. de Supinski [LLNL] • John Mellor-Crummey [Rice] • Nick Wright [SDSC] • Jeff Larkin [Cray, Inc.] The performance data presented here is available at: http://www.cs.uoregon.edu/research/tau/s3d
TAU Parallel Performance System • http://www.cs.uoregon.edu/research/tau/ • Multi-level performance instrumentation • Multi-language automatic source instrumentation • Flexible and configurable performance measurement • Widely-ported parallel performance profiling system • Computer system architectures and operating systems • Different programming languages and compilers • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid
The Story So Far... • Scalability study of S3D using TAU • 3D scatter plots and the mapping of ranks to physical processors point to the XT3/XT4 partitioning • Memory and network performance on the XT3 partition cause the rest of the application to slow down • Hypothesis: Running S3D on a ‘pure’ XT4 system will improve performance significantly • Ran a 6400-core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4)...
3D Scatter Plots • Plot four routines along X, Y, Z, and Color axes • Each routine has a range (max, min) • Each process (rank) has a unique position along the three axes and a unique color • Allows us to examine the distribution of nodes (clusters)
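The positioning rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not TAU's own implementation: the function names `normalize` and `scatter_coordinates` and the sample timings are hypothetical, and each routine's values are assumed to be scaled linearly into [0, 1] using that routine's (min, max) range.

```python
def normalize(values):
    """Scale per-rank values linearly into [0, 1] using their (min, max) range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All ranks identical: place them at the middle of the axis.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scatter_coordinates(x_times, y_times, z_times, color_times):
    """Return one (x, y, z, color) tuple per rank, one routine per axis."""
    axes = [normalize(a) for a in (x_times, y_times, z_times, color_times)]
    return list(zip(*axes))


# Hypothetical exclusive times (seconds) for four routines on four ranks.
points = scatter_coordinates(
    [10.0, 12.0, 30.0, 31.0],
    [5.0, 5.5, 9.0, 9.5],
    [100.0, 98.0, 20.0, 22.0],
    [1.0, 1.0, 3.0, 3.0],
)
```

In this made-up data, ranks 0 and 1 land near one corner of the cube and ranks 2 and 3 near the opposite one, which is the kind of clustering the plots are meant to expose.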
Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters! • Previous work proved: Blue nodes are XT3, Red are XT4
3D Triangle Mesh Display • Plot MPI rank, routine name, and exclusive time along X, Y and Z axes • Color can be shown by a fourth metric • Scalable view • Suitable for very large number of processors
XT3+XT4: MPI_Wait • Gap represents XT3 nodes
3D View: Large MPI_Wait times on most CPUs • To improve performance, we must reduce MPI_Wait time on the other CPUs
3D View: XT3 Partition, Imbalance • On XT3: MPI_Wait takes less time, other routines take more time!
Getting Back to MPI_Wait() • MPI_Wait takes less time on XT3 nodes • Other routines take longer
XT3+XT4: MPI_Wait - Sorted by Exclusive Time • MPI_Wait takes 435.84 seconds on rank 3101 • It takes 15.49 seconds on rank 0! • Rank 3101 is on XT4, rank 0 is on XT3
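Why would a rank on the slower XT3 node spend far less time in MPI_Wait? A toy model helps: if all ranks must synchronize at the end of each step, a rank's wait time is roughly the slowest rank's compute time minus its own. The sketch below assumes this simplified bulk-synchronous behaviour. The function name `wait_times` and the compute times are invented for illustration and are not S3D measurements.

```python
def wait_times(compute_times):
    """Per-rank idle time when every rank waits for the slowest one each step."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]


# Hypothetical per-step compute times in seconds: one slow node, three fast nodes.
compute = [1.5, 1.0, 1.0, 1.0]
waits = wait_times(compute)
# The slow rank (index 0) never waits; each fast rank idles while it catches up.
```

Under this assumption the slow rank shows the smallest wait time and the fast ranks show the largest, matching the pattern reported for rank 0 (XT3) and rank 3101 (XT4).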
Improving S3D Performance • Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly and reduce the time spent idling in MPI_Wait
XT4: Mean Profile Sorted by Exclusive Time • MPI_Wait has moved down!
Comparing XT4 with XT3+XT4 • On XT4, MPI_Wait takes 26% of the time it takes on the combined XT3+XT4 system!
XT4: 3D View • The “exp” loop [~1GFlop] takes most time now!
XT4 Scatter Plot (After) • MPI_Wait takes from 78 to 121 s now!
Comparing Performance • Hypothesis confirmed: XT4 is faster than XT3+XT4 • Inclusive time down from 1935 to 1702 s • 12% improvement • Saved 24853.3 minutes (414 hours) of wallclock time summed over all 6400 cores (233 s per core)! • Reduction in MPI_Wait time is most significant • 390 s (mean) down to 104 s (mean) • Lessons learned: • Slower XT3 nodes can have a significant impact on a large-scale S3D run • S3D harness testcase does not perform well on non-homogeneous nodes • We recommend running S3D on XT4 partition only! • #PBS -lfeature=xt4
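The headline figures on this slide can be checked with a little arithmetic. The sketch below assumes that the saved time is the per-core difference in inclusive time multiplied by the 6400 cores of the run; that assumption reproduces the quoted 24853.3 minutes. The variable names are illustrative only.

```python
CORES = 6400
inclusive_mixed_s = 1935  # XT3+XT4 inclusive time, from the slide
inclusive_xt4_s = 1702    # XT4-only inclusive time, from the slide

saved_per_core_s = inclusive_mixed_s - inclusive_xt4_s
improvement_pct = 100.0 * saved_per_core_s / inclusive_mixed_s

# Assumption: total savings are summed over every core in the run.
saved_total_min = saved_per_core_s * CORES / 60.0
saved_total_h = saved_total_min / 60.0

print(f"{saved_per_core_s} s per core, {improvement_pct:.1f}% improvement")
print(f"{saved_total_min:.1f} minutes ({saved_total_h:.0f} hours) over {CORES} cores")
```

The same approach applied to the rounded MPI_Wait means gives 104 / 390, a little under 27%, which is close to the 26% quoted in the earlier comparison.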
Discussion • Did we get optimal performance on XT4 nodes? • Are the nodes performing at similar rates uniformly now? • Let us see the std. deviation plot of all routines...
XT4: Standard Deviation • IO routines!
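A standard-deviation view of this kind can be approximated as follows: for each routine, compute the spread of its exclusive time across ranks, then sort routines by that spread. This is a minimal sketch using Python's standard library, not TAU's implementation. The helper names, the routine name `COMPUTE_LOOP`, and all timings are hypothetical.

```python
import statistics


def stddev_by_routine(profile):
    """Map routine name -> population standard deviation of its per-rank times."""
    return {name: statistics.pstdev(times) for name, times in profile.items()}


def rank_by_imbalance(profile):
    """Routine names sorted from most to least variable across ranks."""
    dev = stddev_by_routine(profile)
    return sorted(dev, key=dev.get, reverse=True)


# Hypothetical exclusive times (seconds) per rank; not measured S3D data.
profile = {
    "WRITE_SAVEFILE": [2.0, 40.0, 42.0, 44.0],
    "MPI_Wait": [100.0, 104.0, 102.0, 106.0],
    "COMPUTE_LOOP": [50.0, 50.0, 50.0, 50.0],
}
ordering = rank_by_imbalance(profile)
```

Routines at the top of `ordering` are the ones whose time differs most from rank to rank, which is where an I/O routine would surface if only some ranks were slow to write.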
WRITE_SAVEFILE • Rank 0 is quicker!
I/O Becomes a Bottleneck: XT3, XT3+XT4... • WRITE_SAVEFILE • MPI_Wait
Conclusions • Using pure XT4 improved performance by 12% • Need to investigate I/O in XT4/Lustre further to achieve better performance... • Discuss I/O issues with S3D developers
S3D - Building with TAU
• Change name of compiler in build/make.XT3
  • ftn => tau_f90.sh
  • cc => tau_cc.sh
• Set compile time environment variables
  • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
    • Disabled tracking message communication statistics in TAU
    • MPI_Comm_compare() is not called inside TAU’s MPI wrapper
    • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
  • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    • Selective instrumentation file eliminates instrumentation in lightweight routines
    • Pre-process Fortran source code using cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI event counter selection in job submission script:
  • export TAU_THROTTLE=1
  • export COUNTER1=GET_TIME_OF_DAY
  • export COUNTER2=PAPI_FP_INS
  • export COUNTER3=PAPI_L1_DCM
  • export COUNTER4=PAPI_TOT_INS
  • export COUNTER5=PAPI_L2_DCM
Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
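To audit which routines a selective instrumentation file leaves out, the exclude list can be read back with a short script. This is a hedged sketch, not part of TAU: the function name `parse_exclude_list` is hypothetical, the sample is a shortened copy of the file above, and the parser only handles literal names, one per line (it does not interpret wildcards such as the "#" used in the instrument section).

```python
def parse_exclude_list(text):
    """Return the routine names between BEGIN_EXCLUDE_LIST and END_EXCLUDE_LIST."""
    names, inside = [], False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN_EXCLUDE_LIST":
            inside = True
        elif line == "END_EXCLUDE_LIST":
            inside = False
        elif inside and line:
            names.append(line)
    return names


# Shortened sample in the same layout as select.tau.
sample = """BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
THERMCHEM_M::MIXCP
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
"""
excluded = parse_exclude_list(sample)
```

Lines outside the exclude section, such as the loop instrumentation directive, are ignored by design.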
Getting Access to TAU on Jaguar • set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path) • Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.* • Makefile.tau-mpi-pdt-pgi (flat profile) • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir) • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile) • Binaries of S3D can be found in: • ~sameer/scratch/S3D-BINARIES • withtau • papi, multiplecounters, mpi, pdt, pgi options • without_tau
Concluding Discussion • Performance tools must be used effectively • More intelligent performance systems for productive use • Evolve to application-specific performance technology • Deal with scale by “full range” performance exploration • Autonomic and integrated tools • Knowledge-based and knowledge-driven process • Performance observation methods do not necessarily need to change in a fundamental sense • More automatically controlled and more efficient use • Develop next-generation tools and deliver to community • Open source with support by ParaTools, Inc. • http://www.cs.uoregon.edu/research/tau
Support Acknowledgements • Department of Energy (DOE) • Office of Science • LLNL, LANL, ORNL, ASC • PERI