S3D: Performance Impact of Hybrid XT3/XT4. Sameer Shende tau-team@cs.uoregon.edu. Acknowledgements. Alan Morris [UO] Kevin Huck [UO] Allen D. Malony [UO] Kenneth Roche [ORNL] Bronis R. de Supinski [LLNL] John Mellor-Crummey [Rice] Nick Wright [SDSC] Jeff Larkin [Cray, Inc.]
S3D: Performance Impact of Hybrid XT3/XT4 Sameer Shende tau-team@cs.uoregon.edu
Acknowledgements • Alan Morris [UO] • Kevin Huck [UO] • Allen D. Malony [UO] • Kenneth Roche [ORNL] • Bronis R. de Supinski [LLNL] • John Mellor-Crummey [Rice] • Nick Wright [SDSC] • Jeff Larkin [Cray, Inc.] The performance data presented here is available at: http://www.cs.uoregon.edu/research/tau/s3d
TAU Parallel Performance System • http://www.cs.uoregon.edu/research/tau/ • Multi-level performance instrumentation • Multi-language automatic source instrumentation • Flexible and configurable performance measurement • Widely-ported parallel performance profiling system • Computer system architectures and operating systems • Different programming languages and compilers • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid
The Story So Far... • Scalability study of S3D using TAU • MPI_Wait • I/O (WRITE_SAVEFILE) • Loop: ComputeSpeciesDiffFlux (630-656) [Rice, SDSC] • Loop: ReactionRateBounds (374-386) [exp] • 3D Scatter plots pointed to a single “slow” node before • Identifying individual nodes by mapping ranks to nodes within TAU • Cray utilities: nodeinfo, xtshowmesh, xtshowcabs • Ran a 6400 core simulation to identify XT3/XT4 partition performance issues (removed -feature=xt3)
Total Runtime Breakdown by Events - Time • Labeled events: WRITE_SAVEFILE, MPI_Wait
Case Study • Harness testcase • Platform: Jaguar Combined Cray XT3/XT4 at ORNL • 6400p • Goal: • To evaluate the performance impact of combined XT3/XT4 nodes on S3D executions • Performance evaluation of MPI_Wait • Study mapping of MPI ranks to nodes
3D Scatter Plots • Plot four routines along X, Y, Z, and Color axes • Each routine has a range (max, min) • Each process (rank) has a unique position along the three axes and a unique color • Allows us to examine the distribution of nodes (clusters)
3D Triangle Mesh Display • Plot MPI rank, routine name, and exclusive time along X, Y and Z axes • Color can be shown by a fourth metric • Scalable view • Suitable for very large number of processors
Zoom, Change Color to L1 Data Cache Misses • Loop in ComputeSpeciesDiffFlux (630-656) has high L1 DCMs (red) • Takes longer to execute on this “slice” of processors. So do other routines. Slower memory?
Changing Color to MFLOPS • Loop in ComputeSpeciesDiffFlux (630-656) has lower MFLOPS (dark blue)
Getting Back to MPI_Wait() • Why does MPI_Wait take less time on these cores? • What does the profile of MPI_Wait look like?
MPI_Wait - Sorted by Exclusive Time • MPI_Wait takes 435.84 seconds on rank 3101 • It takes 59.6 s on rank 3233 and 29.2 s on rank 3200 • It takes 15.49 seconds on rank 0! • How is rank 3101 different from rank 0?
Comparing PAPI Floating Point Instructions • PAPI_FP_INS are the same - as expected
Comparing Performance - MFLOPS • For the memory-intensive loop in ComputeSpeciesDiffFlux, rank 0 achieves only 65% of rank 3101's MFLOPS (114 vs. 174)!
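The comparison above is a ratio of two derived rates. Because both ranks execute the same number of floating-point instructions (PAPI_FP_INS), a lower MFLOPS rate translates directly into a longer loop time. The sketch below works through that arithmetic; the `mflops` helper and the instruction count `fp_ins` are hypothetical, and only the 114 and 174 MFLOPS figures come from the slide.

```python
def mflops(fp_ins: float, seconds: float) -> float:
    """Millions of floating-point instructions per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return fp_ins / (seconds * 1e6)

# Rates reported on the slide for the ComputeSpeciesDiffFlux loop.
rank0_rate = 114.0      # MFLOPS on rank 0
rank3101_rate = 174.0   # MFLOPS on rank 3101

# Rank 0's rate as a fraction of rank 3101's rate.
rate_ratio = rank0_rate / rank3101_rate

# With equal instruction counts, time is inversely proportional to rate.
time_ratio = rank3101_rate / rank0_rate

# Hypothetical instruction count, used only to illustrate the helper.
fp_ins = 1.0e9
t_rank0 = fp_ins / (rank0_rate * 1e6)
t_rank3101 = fp_ins / (rank3101_rate * 1e6)

print(f"rate ratio: {rate_ratio:.1%}")
print(f"rank 0 loop time vs rank 3101: {time_ratio:.2f}x")
print(f"check: {mflops(fp_ins, t_rank0):.0f} vs {mflops(fp_ins, t_rank3101):.0f} MFLOPS")
```

Under these assumptions the loop takes roughly one and a half times as long on rank 0 as on rank 3101.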
Comparing MFLOPS: Rank 3101 vs Rank 0 • Rank 0 appears to be “slower” than rank 3101 • Are there other nodes that are similarly slow but have shorter wait times? • What does the MPI_Wait profile look like across all nodes?
MPI_Wait Profile • What is this rank?
MPI_Wait Profile Shifts at rank 114! • Ranks 0 through 113 take less time in MPI_Wait than 114...
Another Shift in MPI_Wait() • This shift is observed in ranks 3200 through 3313 • Again 114 processors... (like ranks 0 through 113) • Hmm... • How do other routines perform on these ranks? • What are the physical node ids?
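Shifts like these can also be located programmatically by scanning the per-rank MPI_Wait times for large jumps between neighboring ranks. The following is a minimal sketch, not a TAU feature: `find_shifts`, the threshold `factor`, and the synthetic profile are hypothetical. The times are illustrative values, and only the rank boundaries (0..113 and 3200..3313 out of 6400) mirror the slides.

```python
from typing import List

def find_shifts(times: List[float], factor: float = 1.5) -> List[int]:
    """Return each rank r whose time differs from rank r-1 by more than `factor`x."""
    shifts = []
    for r in range(1, len(times)):
        lo, hi = sorted((times[r - 1], times[r]))
        if lo > 0 and hi / lo > factor:
            shifts.append(r)
    return shifts

def synthetic_profile(n_ranks: int = 6400) -> List[float]:
    """Illustrative MPI_Wait times: lower on ranks 0..113 and 3200..3313."""
    low, high = 20.0, 60.0  # made-up seconds, not measured data
    return [
        low if (r < 114 or 3200 <= r < 3314) else high
        for r in range(n_ranks)
    ]

profile = synthetic_profile()
shift_ranks = find_shifts(profile)
print(shift_ranks)  # [114, 3200, 3314]
```

A measured profile is noisy and contains outliers, so a real scan would likely need smoothing or a windowed comparison before applying a threshold.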
MPI_Wait • While MPI_Wait takes less time on these CPUs, other routines take longer • Points to a load imbalance!
MetaData for Ranks 3200 and 0 • Rank 3200 and 0 both lie on the same physical node nid03406!
Mapping Ranks from TAU to Physical Processors • Ranks 0..113 lie on processors 3406..3551 • Ranks 3200..3313 are also on 3406..3551
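Once each rank's node name is known, grouping ranks by node shows which ranks share hardware. The sketch below assumes the per-rank hostname has already been extracted into a dictionary; `ranks_by_node` and the placeholder node names are hypothetical, and only the pairing of ranks 0 and 3200 on nid03406 comes from the slides.

```python
from collections import defaultdict
from typing import Dict, List

def ranks_by_node(hostnames: Dict[int, str]) -> Dict[str, List[int]]:
    """Group MPI ranks by the node name they ran on."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for rank in sorted(hostnames):
        groups[hostnames[rank]].append(rank)
    return dict(groups)

# rank -> node name. Only the nid03406 entries are taken from the slides;
# the other node names are placeholders for illustration.
hostnames = {
    0: "nid03406",
    3200: "nid03406",
    114: "placeholder-node-a",
    3314: "placeholder-node-b",
}

groups = ranks_by_node(hostnames)
shared = {node: ranks for node, ranks in groups.items() if len(ranks) > 1}
print(shared)  # {'nid03406': [0, 3200]}
```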
Results from Cray’s nodeinfo Utility • Processors 3406..3551 (physical ids) are located on the XT3 partition • XT3 partition has slow DDR-400 memory (5986 MB/s) • XT3 has a slower SS1 (1109 MB/s) interconnect • XT4 partition has faster DDR2-667 memory modules (7147 MB/s) and a faster Seastar2 (SS2) (2022 MB/s) interconnect
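To put the nodeinfo numbers in proportion, the short sketch below computes the XT4-to-XT3 ratio for each figure quoted above. The dictionary layout and key names are illustrative; the bandwidth values are the ones on the slide.

```python
# Bandwidth figures quoted on the slide, in MB/s.
XT3 = {"memory": 5986, "interconnect": 1109}
XT4 = {"memory": 7147, "interconnect": 2022}

# Ratio of XT4 to XT3 for each component.
ratios = {key: XT4[key] / XT3[key] for key in XT3}

for key, value in ratios.items():
    print(f"{key}: XT4 is {value:.2f}x XT3")
```

On these figures the interconnect gap (about 1.8x) is considerably larger than the memory gap (about 1.2x).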
Location of Physical Nodes in the Cabinets • Using the Cray utilities xtshowcabs and xtshowmesh • All nodes marked with a Job “c” came from our S3D job
xtshowcabs • Nodes marked with a “c” are from our S3D run • What does the mesh look like?
xtshowmesh (1 of 2) • Nodes marked with a “c” are from our S3D run
xtshowmesh (2 of 2) • Nodes marked with a “c” are from our S3D run
Conclusions • Using a combination of XT3/XT4 nodes slowed down parts of S3D • The application spends a considerable amount of time spinning/polling in MPI_Wait • The load imbalance is probably caused by non-uniform nodes • Conducted a performance characterization of S3D • This data will help derive communication models that explain the performance data observed [John Mellor-Crummey, Rice] • Techniques to improve cache memory utilization in the loops identified by TAU will help overall performance [SDSC, Rice] • I/O characterization of S3D will help identify I/O scaling issues
S3D - Building with TAU • Change name of compiler in build/make.XT3 • ftn => tau_f90.sh • cc => tau_cc.sh • Set compile time environment variables • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi • Disabled tracking message communication statistics in TAU • MPI_Comm_compare() is not called inside TAU’s MPI wrapper • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess' • Selective instrumentation file eliminates instrumentation in lightweight routines • Pre-process Fortran source code using cpp before compiling • Set runtime environment variables for instrumentation control and PAPI event counter selection in job submission script: • export TAU_THROTTLE=1 • export COUNTER1=GET_TIME_OF_DAY • export COUNTER2=PAPI_FP_INS • export COUNTER3=PAPI_L1_DCM • export COUNTER4=PAPI_TOT_INS • export COUNTER5=PAPI_L2_DCM
Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
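When maintaining a file like this, it can help to list which routines are currently excluded. The sketch below is a small stand-alone reader, not TAU's own parser; it assumes the file places each marker and each routine name on its own line, and `excluded_routines` is a hypothetical helper. The sample text is a shortened version of the file above.

```python
from typing import List

def excluded_routines(text: str) -> List[str]:
    """Collect the names between BEGIN_EXCLUDE_LIST and END_EXCLUDE_LIST."""
    names: List[str] = []
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN_EXCLUDE_LIST":
            inside = True
        elif line == "END_EXCLUDE_LIST":
            inside = False
        elif inside and line:
            names.append(line)
    return names

# Shortened sample in the same layout as select.tau.
sample = """\
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
THERMCHEM_M::MIXCP
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
"""

names = excluded_routines(sample)
print(names)  # ['MCADIF', 'GETRATES', 'THERMCHEM_M::MIXCP']
```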
Getting Access to TAU on Jaguar • set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path) • Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.* • Makefile.tau-mpi-pdt-pgi (flat profile) • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir) • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile) • Binaries of S3D can be found in: • ~sameer/scratch/S3D-BINARIES • withtau • papi, multiplecounters, mpi, pdt, pgi options • without_tau
Concluding Discussion • Performance tools must be used effectively • More intelligent performance systems for productive use • Evolve to application-specific performance technology • Deal with scale by “full range” performance exploration • Autonomic and integrated tools • Knowledge-based and knowledge-driven process • Performance observation methods do not necessarily need to change in a fundamental sense • More automatically controlled and efficient use • Develop next-generation tools and deliver to community • Open source with support by ParaTools, Inc. • http://www.cs.uoregon.edu/research/tau
Support Acknowledgements • Department of Energy (DOE) • Office of Science • LLNL, LANL, ORNL, ASC • PERI