
S3D: Performance Impact of Hybrid XT3/XT4


Presentation Transcript


  1. S3D: Performance Impact of Hybrid XT3/XT4 Sameer Shende tau-team@cs.uoregon.edu

  2. Acknowledgements • Alan Morris [UO] • Kevin Huck [UO] • Allen D. Malony [UO] • Kenneth Roche [ORNL] • Bronis R. de Supinski [LLNL] • John Mellor-Crummey [Rice] • Nick Wright [SDSC] • Jeff Larkin [Cray, Inc.] The performance data presented here is available at: http://www.cs.uoregon.edu/research/tau/s3d

  3. TAU Parallel Performance System • http://www.cs.uoregon.edu/research/tau/ • Multi-level performance instrumentation • Multi-language automatic source instrumentation • Flexible and configurable performance measurement • Widely-ported parallel performance profiling system • Computer system architectures and operating systems • Different programming languages and compilers • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid

  4. The Story So Far... • Scalability study of S3D using TAU • MPI_Wait • I/O (WRITE_SAVEFILE) • Loop: ComputeSpeciesDiffFlux (630-656) [Rice, SDSC] • Loop: ReactionRateBounds (374-386) [exp] • 3D scatter plots previously pointed to a single “slow” node • Identified individual nodes by mapping ranks to nodes within TAU • Cray utilities: nodeinfo, xtshowmesh, xtshowcabs • Ran a 6400-core simulation to identify XT3/XT4 partition performance issues (removed -feature=xt3)

  5. Total Runtime Breakdown by Events - Time (WRITE_SAVEFILE, MPI_Wait)

  6. Relative Efficiency

  7. MPI Scaling

  8. Relative Efficiency & Speedup for One Event

  9. ParaProf’s Source Browser (8 core profile)

  10. Case Study • Harness test case • Platform: Jaguar, combined Cray XT3/XT4 at ORNL • 6400 processors • Goal: evaluate the performance impact of combined XT3/XT4 nodes on S3D executions • Performance evaluation of MPI_Wait • Study the mapping of MPI ranks to nodes

  11. TAU: ParaProf Profile

  12. Overall Mean Profile: Exclusive Wallclock Time

  13. Overall Inclusive Time

  14. Mean Mflops observed over all ranks

  15. Inclusive Total Instructions Executed

  16. Total Instructions Executed (Exclusive)

  17. Comparing Exclusive PAPI Counters, MFlops

  18. 3D Scatter Plots • Plot four routines along X, Y, Z, and Color axes • Each routine has a range (max, min) • Each process (rank) has a unique position along the three axes and a unique color • Allows us to examine the distribution of nodes (clusters)
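
  A note on the mapping described above: it amounts to min-max normalization of each routine’s per-rank metric, with the normalized values used as the X, Y, Z coordinates and the color of that rank’s point. A minimal sketch of that normalization in C, assuming the per-rank exclusive times have already been extracted from the profiles (the array names and sizes are illustrative, not ParaProf internals):

      /* Map four per-rank metrics onto X, Y, Z and color in [0,1],
       * mirroring the scatter-plot mapping described above.
       * (Illustrative sketch, not ParaProf source code.) */
      #include <stdio.h>

      #define NRANKS 6400
      #define NROUTINES 4

      /* metric[r][i]: exclusive time of routine r on rank i (filled elsewhere) */
      static double metric[NROUTINES][NRANKS];

      static void normalize(const double *v, int n, double *out) {
          double lo = v[0], hi = v[0];
          for (int i = 1; i < n; i++) {
              if (v[i] < lo) lo = v[i];
              if (v[i] > hi) hi = v[i];
          }
          double span = (hi > lo) ? (hi - lo) : 1.0;  /* guard against a flat range */
          for (int i = 0; i < n; i++)
              out[i] = (v[i] - lo) / span;            /* position in [0,1] */
      }

      int main(void) {
          static double coord[NROUTINES][NRANKS];     /* x, y, z, color per rank */
          for (int r = 0; r < NROUTINES; r++)
              normalize(metric[r], NRANKS, coord[r]);
          /* clusters of nearby points now correspond to groups of similar ranks */
          printf("rank 0 -> (%.2f, %.2f, %.2f) color %.2f\n",
                 coord[0][0], coord[1][0], coord[2][0], coord[3][0]);
          return 0;
      }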

  19. Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!

  20. 3D Triangle Mesh Display • Plot MPI rank, routine name, and exclusive time along the X, Y and Z axes • A fourth metric can be shown as color • Scalable view • Suitable for a very large number of processors

  21. MPI_Wait: 3D View

  22. 3D View: Zooming In... Jagged Edges!

  23. 3D View: Uh Oh!

  24. Zoom, Change Color to L1 Data Cache Misses • The loop in ComputeSpeciesDiffFlux (630-656) has high L1 DCMs (red) • It takes longer to execute on this “slice” of processors, and so do other routines. Slower memory?
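
  The L1 data cache miss and floating-point counts behind these colorings come from hardware counters read through PAPI; in the S3D runs TAU collects them automatically for each instrumented loop. For reference, a minimal sketch of counting the same two events (PAPI_L1_DCM and PAPI_FP_INS) around a loop with PAPI’s low-level API; the loop body is a stand-in, not the ComputeSpeciesDiffFlux kernel:

      /* Count L1 data cache misses and floating-point instructions
       * around a loop -- the same PAPI events TAU records here.
       * The loop body is a placeholder, not the S3D kernel. */
      #include <stdio.h>
      #include <papi.h>

      #define N (1 << 20)

      int main(void) {
          static double a[N], b[N];
          long long counts[2];
          int evset = PAPI_NULL;

          if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
              fprintf(stderr, "PAPI init failed\n");
              return 1;
          }
          PAPI_create_eventset(&evset);
          PAPI_add_event(evset, PAPI_L1_DCM);
          PAPI_add_event(evset, PAPI_FP_INS);

          for (int i = 0; i < N; i++)
              b[i] = 1.0;

          PAPI_start(evset);
          for (int i = 0; i < N; i++)          /* stand-in for the loop at     */
              a[i] = a[i] * 1.000001 + b[i];   /* ComputeSpeciesDiffFlux:630-656 */
          PAPI_stop(evset, counts);

          printf("L1 DCM = %lld, FP_INS = %lld (a[0] = %f)\n",
                 counts[0], counts[1], a[0]);
          return 0;
      }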

  25. Changing Color to MFLOPS • The loop in ComputeSpeciesDiffFlux (630-656) shows lower Mflops (dark blue)

  26. Getting Back to MPI_Wait() • Why does MPI_Wait take less time on these cores? • What does the profile of MPI_Wait look like?

  27. MPI_Wait - Sorted by Exclusive Time • MPI_Wait takes 435.84 seconds on rank 3101 • It takes 59.6 s on rank 3233 and 29.2 s on rank 3200 • It takes 15.49 seconds on rank 0! • How is rank 3101 different from rank 0?

  28. Comparing Ranks 3101 and 0 (extremes)

  29. Comparing Inclusive Times - Same for S3D

  30. Comparing PAPI Floating Point Instructions • PAPI_FP_INS counts are the same, as expected

  31. Comparing Performance - MFLOPS • For the memory-intensive loop in ComputeSpeciesDiffFlux, rank 0 achieves only 65% of rank 3101’s Mflops (114 vs. 174 Mflops)!
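
  The Mflops values compared here are a derived metric: the PAPI_FP_INS count divided by wallclock time in microseconds. A tiny helper showing that arithmetic, with made-up sample values (the real counts and times live in the TAU profiles):

      /* Derived metric: MFLOPS = PAPI_FP_INS / (wallclock seconds * 1e6).
       * The sample inputs below are illustrative, not values from S3D. */
      #include <stdio.h>

      static double mflops(long long fp_ins, double seconds) {
          return (double)fp_ins / (seconds * 1.0e6);
      }

      int main(void) {
          long long fp = 1000000000LL;   /* hypothetical FP_INS count */
          /* identical instruction counts, different elapsed times:
           * the slower rank reports proportionally lower Mflops */
          printf("fast rank: %.0f Mflops\n", mflops(fp, 5.0));   /* 200 */
          printf("slow rank: %.0f Mflops\n", mflops(fp, 7.5));   /* 133 */
          return 0;
      }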

  32. Comparing MFLOPS: Rank 3101 vs Rank 0 • Rank 0 appears to be “slower” than rank 3101 • Are there other nodes that are similarly slow with lower wait times? • What does the MPI_Wait profile look like over all nodes?

  33. MPI_Wait Profile • What is this rank?

  34. MPI_Wait Profile Shifts at rank 114! • Ranks 0 through 113 take less time in MPI_Wait than ranks 114 onward...

  35. Another Shift in MPI_Wait() • This shift is observed in ranks 3200 through 3313 • Again 114 processors... (like ranks 0 through 113) • Hmm... • How do other routines perform on these ranks? • What are the physical node ids?

  36. MPI_Wait • While MPI_Wait takes less time on these CPUs, other routines take longer • This points to a load imbalance!
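
  This is the classic signature of load imbalance with nonblocking communication: each rank posts its sends and receives, does its local computation, then blocks in MPI_Wait until its partners catch up, so ranks on faster nodes accumulate the most wait time and ranks on slower nodes the least. A minimal sketch of that pattern (a generic ring exchange, not S3D’s actual communication code):

      /* Generic nonblocking exchange: ranks that finish compute() sooner
       * sit in MPI_Wait until slower partners arrive, so low MPI_Wait
       * time marks the slow ranks. (Illustrative pattern, not S3D code.) */
      #include <mpi.h>
      #include <stdio.h>

      enum { N = 1024 };

      static void compute(double *w) {          /* stand-in for local work;   */
          for (int i = 0; i < N; i++)           /* its runtime depends on the */
              w[i] = w[i] * 0.5 + 1.0;          /* speed of the node          */
      }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          double sendbuf[N], recvbuf[N], work[N];
          for (int i = 0; i < N; i++) { sendbuf[i] = rank; work[i] = 0.0; }
          int right = (rank + 1) % size, left = (rank - 1 + size) % size;
          MPI_Request req[2];

          MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
          MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

          compute(work);                        /* local work overlaps the exchange */

          double t0 = MPI_Wtime();
          MPI_Wait(&req[0], MPI_STATUS_IGNORE); /* this is the time charged */
          MPI_Wait(&req[1], MPI_STATUS_IGNORE); /* to MPI_Wait()            */
          printf("rank %d waited %.6f s\n", rank, MPI_Wtime() - t0);

          MPI_Finalize();
          return 0;
      }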

  37. Identifying Physical Processors using Metadata

  38. MetaData for Ranks 3200 and 0 • Ranks 3200 and 0 both lie on the same physical node, nid03406!

  39. Mapping Ranks from TAU to Physical Processors • Ranks 0..113 lie on processors 3406..3551 • Ranks 3200..3313 are also on processors 3406..3551
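
  TAU records the node name as per-rank metadata, which is what makes this mapping possible. The same rank-to-node table can be produced directly with MPI by gathering each rank’s processor name to rank 0, as in this small sketch (illustrative only, not how TAU collects its metadata):

      /* Print an MPI-rank -> physical-node map, the same information TAU
       * stores as per-rank metadata (e.g. node nid03406 in this study).
       * Illustrative sketch, not TAU's metadata implementation. */
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, size, len;
          char name[MPI_MAX_PROCESSOR_NAME] = {0};
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          MPI_Get_processor_name(name, &len);

          /* gather every rank's node name to rank 0 */
          char *all = NULL;
          if (rank == 0)
              all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
          MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                     all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                     0, MPI_COMM_WORLD);

          if (rank == 0) {
              for (int r = 0; r < size; r++)
                  printf("rank %d -> %s\n", r, all + (size_t)r * MPI_MAX_PROCESSOR_NAME);
              free(all);
          }
          MPI_Finalize();
          return 0;
      }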

  40. Results from Cray’s nodeinfo Utility • Processors 3406..3551 (physical ids) are located on the XT3 partition • The XT3 partition has slower DDR-400 memory (5986 MB/s) and a slower SS1 interconnect (1109 MB/s) • The XT4 partition has faster DDR2-667 memory modules (7147 MB/s) and a faster Seastar2 (SS2) interconnect (2022 MB/s)

  41. Location of Physical Nodes in the Cabinets • Using the Cray xtshowcabs and xtshowmesh utilities • All nodes marked with job “c” came from our S3D job

  42. xtshowcabs • Nodes marked with a “c” are from our S3D run • What does the mesh look like?

  43. xtshowmesh (1 of 2) • Nodes marked with a “c” are from our S3D run

  44. xtshowmesh (2 of 2) • Nodes marked with a “c” are from our S3D run

  45. Conclusions • Using a combination of XT3/XT4 nodes slowed down parts of S3D • The application spends a considerable amount of time spinning/polling in MPI_Wait • The load imbalance is probably caused by non-uniform nodes • Conducted a performance characterization of S3D • This data will help derive communication models that explain the performance data observed [John Mellor-Crummey, Rice] • Techniques to improve cache memory utilization in the loops identified by TAU will help overall performance [SDSC, Rice] • I/O characterization of S3D will help identify I/O scaling issues

  46. S3D - Building with TAU • Change the name of the compiler in build/make.XT3 • ftn => tau_f90.sh • cc => tau_cc.sh • Set compile-time environment variables • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi • Disabled tracking of message communication statistics in TAU (nocomm build): MPI_Comm_compare() is not called inside TAU’s MPI wrapper • Choose callpath, PAPI counters, MPI profiling, and PDT for source instrumentation • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess' • Selective instrumentation file eliminates instrumentation in lightweight routines • Pre-process Fortran source code with cpp before compiling • Set runtime environment variables for instrumentation control and PAPI counter selection in the job submission script: • export TAU_THROTTLE=1 • export COUNTER1=GET_TIME_OF_DAY • export COUNTER2=PAPI_FP_INS • export COUNTER3=PAPI_L1_DCM • export COUNTER4=PAPI_TOT_INS • export COUNTER5=PAPI_L2_DCM

  47. Selective Instrumentation in TAU
      % cat select.tau
      BEGIN_EXCLUDE_LIST
      MCADIF
      GETRATES
      TRANSPORT_M::MCAVIS_NEW
      MCEDIF
      MCACON
      CKYTCP
      THERMCHEM_M::MIXCP
      THERMCHEM_M::MIXENTH
      THERMCHEM_M::GIBBSENRG_ALL_DIMT
      CKRHOY
      MCEVAL4
      THERMCHEM_M::HIS
      THERMCHEM_M::CPS
      THERMCHEM_M::ENTROPY
      END_EXCLUDE_LIST
      BEGIN_INSTRUMENT_SECTION
      loops routine="#"
      END_INSTRUMENT_SECTION

  48. Getting Access to TAU on Jaguar • set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path) • Choose stub makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.* • Makefile.tau-mpi-pdt-pgi (flat profile) • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir) • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile) • Binaries of S3D can be found in ~sameer/scratch/S3D-BINARIES • withtau (built with papi, multiplecounters, mpi, pdt, pgi options) • without_tau

  49. Concluding Discussion • Performance tools must be used effectively • More intelligent performance systems for productive use • Evolve to application-specific performance technology • Deal with scale by “full range” performance exploration • Autonomic and integrated tools • Knowledge-based and knowledge-driven process • Performance observation methods do not necessarily need to change in a fundamental sense • More automated control and more efficient use • Develop next-generation tools and deliver them to the community • Open source with support by ParaTools, Inc. • http://www.cs.uoregon.edu/research/tau

  50. Support Acknowledgements • Department of Energy (DOE) • Office of Science • LLNL, LANL, ORNL, ASC • PERI
