Phase-Based Parallel Performance Profiling


  1. Phase-Based Parallel Performance Profiling Allen D. Malony, Sameer Shende, Alan Morris {malony,sameer,amorris}@cs.uoregon.edu Department of Computer and Information Science Performance Research Laboratory NeuroInformatics Center University of Oregon

  2. Outline of Talk • Motivation • Models in parallel scientific applications • Phases and performance mapping • Problem description • Motivating example • Profiling techniques • Flat, callpath, phase profiling • Approach and implementation • Applications • Future work and concluding remarks

  3. Motivation • Scientific applications designed based on models • Computational: structural, logical, numerical models, … • Correctness: execution order, data consistency, … • Performance: expected, factors, parallelism/scalability, … • Computational models form developer’s “mental” model • How the program is intended to behave and perform • Want to relate performance model to computation model • View performance data with respect to “mental” model • Better identify problems and guide tuning decisions • Must link computational abstractions to performance • Bridge semantic gap – measurements → “mental” model

  4. Performance Mapping • General problem of linking performance to computation • Performance mapping (Irvin and Miller, ‘96; Shende, ‘01) • Associate (map) measured performance data • To higher level, semantic representations • Those with model significance to the user • What is the difficulty of making the association • Depends on performance information • performance events/state visible from instrumentation • what performance data can be measured • How the performance information is used in mapping • Difficulty in how performance information is presented • Model-based views (LeBlanc et al., ‘90)

  5. Phases and Performance Mapping • Like to support the association between model and data • Concept of “phases” is common in scientific applications • How developers think about structure, logic, numerics • How performance can be interpreted (Worley, ‘92) • Worthwhile to consider support for phases • In performance measurement • Bridge semantic gap in parallel performance mapping? • tracing has long demonstrated the benefits! (Heath, ‘91) • phase-based analysis and interpretation • Main contribution • Support for phases in parallel performance profiling

  6. Problem Description • Performance measured as a consequence of events • Events represent actions that occur during execution • Events of interest determine performance information • Events have semantics and context (pragmatics) • Semantics • Defines what the event represents • Example: subroutine entry • Context • Properties of the state in which event occurred • Example: subroutine’s calling parent • Interrogate context to map event performance data

  7. Motivating Example – Multi-Physics Application • Assembly of physical objects • Different shapes • Different materials • Calculate physics • Heat transfer • Mechanical stress • Within / between objects • Iterate to error tolerance • How is performance attributed? • Between events (e.g., routines) and execution components • With respect to computational objects (e.g., data objects) [Figure: object assembly annotated with routines stress(), heat(), MPIrecv(), MPIsend(), and other routines]

  8. Context and Standard Profiling • Flat profiles • Context is whole program (i.e., program code) • Performance distribution across (static) program structure • Cannot differentiate dynamics (e.g., callpath or objects) • Callgraph / callpath profiles • Identify parent-child calling relationships at execution • Context is calling (event) parent / calling (event) path • Extend event semantics to encode context • create new event with callpath name • requires dynamic event creation for complex callpaths • burdens event mechanisms for context identification • simple performance associations require many events

  9. Context and Phase Profiling • View the program execution as a collection of phases • Transition between phases (sequenced, nested) • easiest to think of as phase hierarchy (or phase graph) • Phases are not events • phase boundaries can mark entry/exit events • Context is the current phase • How do we know what phase we are in? • Phases are identified separately from events • phases are not encoded in event names • event mechanisms are not overloaded • A phase profile is event performance attributed to phases • Phase-specific performance profiles (flat or callpath)

  10. Approach (Flat Profile) • Create a profile object for each entry/exit event • Each profile object has a name • Static profile object (static event) • event has a single instance (single name) • Dynamic profile object (dynamic event) • event can have multiple instances (created dynamically) • Inclusive and exclusive performance statistics • Must maintain an event stack (or callstack), as sketched below • Contexts are generally thought of as code locations • Dynamic events do allow for dynamic context awareness • User code can check “state” and create new events • BUT it can only see one level of event!
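
To make the inclusive/exclusive bookkeeping concrete, here is a minimal C sketch under our own assumptions (the names and structures are illustrative, not TAU's implementation): each profile object accumulates inclusive time, and exclusive time is derived by subtracting the inclusive time of children recorded on the event stack.

    #include <time.h>

    #define MAX_DEPTH 64

    typedef struct {
        const char *name;       /* event name (e.g., routine name)      */
        double inclusive;       /* time including children              */
        double exclusive;       /* time with child time subtracted      */
    } Profile;

    typedef struct {
        Profile *prof;          /* profile object for this event        */
        double start;           /* entry timestamp                      */
        double child_time;      /* inclusive time of completed children */
    } Frame;

    static Frame stack[MAX_DEPTH];  /* the event (call) stack; no overflow check in this sketch */
    static int depth = 0;

    static double now(void) { return (double)clock() / CLOCKS_PER_SEC; }

    void event_entry(Profile *p) {
        stack[depth].prof = p;
        stack[depth].start = now();
        stack[depth].child_time = 0.0;
        depth++;
    }

    void event_exit(void) {
        Frame *f = &stack[--depth];
        double incl = now() - f->start;
        f->prof->inclusive += incl;
        f->prof->exclusive += incl - f->child_time;  /* children excluded */
        if (depth > 0)
            stack[depth - 1].child_time += incl;     /* credit the parent */
    }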

  11. Approach (Callpath Profile) • Show event calling (nesting) relationships • Create a profile object for each event calling context • Each profile object has a name that encodes the callpath • Static profile object • callpath has a single instance (single name) • Dynamic profile object • callpath can have multiple instances (created dynamically) • Reuse event mechanisms • Interrogate the event stack to form event names (see the sketch below) • “main => f1 => f2 => MPI_Send” • Inclusive and exclusive performance statistics • Callpath length and callgraph depth options
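
A hedged sketch of how such callpath names could be formed (the helper below is our own illustration, not TAU's API): on event entry, walk the event stack from the bottom and join the event names with " => " to obtain the profile-object key.

    #include <string.h>

    /* Join the names on the event stack into a callpath key such as
     * "main => f1 => f2 => MPI_Send". */
    void build_callpath(char *buf, size_t len,
                        const char *names[], int depth) {
        buf[0] = '\0';
        for (int i = 0; i < depth; i++) {
            if (i > 0) strncat(buf, " => ", len - strlen(buf) - 1);
            strncat(buf, names[i], len - strlen(buf) - 1);
        }
    }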

  12. Approach (Phase Profile) • A phase is an execution abstraction • Two questions • How to inform the measurement systems about phases? • How to collect the performance data? • Create a phase object when new phase is created • Each phase object has a name • Static and dynamic phase objects • Phase relationships • Phases may be nested (cannot overlap) • “Active” phase object follows scoping rules • Default (top-level) phase is outermost event (e.g., main)

  13. Approach (Phase Profile - API) • Phase creation
      TAU_PHASE_CREATE_STATIC(var, name, type, group)
      TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
      TAU_GLOBAL_PHASE(var, name, type, group)
      TAU_GLOBAL_PHASE_EXTERNAL(var)
  • Global phases have global scope (accessible anywhere) • External declarations for defined phases outside file scope • Phase control
      TAU_PHASE_START(var)
      TAU_PHASE_STOP(var)
      TAU_GLOBAL_PHASE_START(var)
      TAU_GLOBAL_PHASE_STOP(var)
  • Collects a callgraph profile (depth 2) PER PHASE! • Phases default to standard events (when phase profiling is disabled)
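
A small C usage sketch of this API, assuming an empty type string, TAU's generic TAU_USER profiling group, and a hypothetical routine solve_x():

    #include <TAU.h>

    extern void solve_x(void);   /* hypothetical solver routine */

    void timestep(void) {
        /* Static phase: a single instance for the life of the program. */
        TAU_PHASE_CREATE_STATIC(xphase, "x_solve phase", "", TAU_USER);
        TAU_PHASE_START(xphase);
        solve_x();               /* events inside are attributed to this phase */
        TAU_PHASE_STOP(xphase);
    }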

  14. Approach (Phase Profile - Data Collection) • Leverages performance mapping and callpath profiling • Phase entry • Phase object pushed to measurement (event) callstack • Phase / event entry • Need to determine (event, phase) tuple • traverse callstack to find enclosing phase • construct key for (event, phase) tuple • Maintain global map (sketched below) • new keys for new (event, phase) tuples put into global map • create new profile object for every (event, phase) tuple • search global map to determine if the tuple occurred before • Use mapping support to store performance data on exit
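
The global map could look like the following C sketch (illustrative only; the structures and the linear-search table are our simplification, not TAU's implementation): build a key from the (event, phase) tuple, return the existing profile if the tuple occurred before, and otherwise create a new profile object.

    #include <stdio.h>
    #include <string.h>

    #define MAX_TUPLES 1024

    typedef struct {
        char key[256];    /* "phase => event"      */
        double time;      /* accumulated time      */
        long calls;       /* number of occurrences */
    } TupleProfile;

    static TupleProfile map[MAX_TUPLES];   /* global (event, phase) map */
    static int map_size = 0;

    /* Find the profile for an (event, phase) tuple, creating it on a miss.
     * The enclosing phase name is assumed to come from callstack traversal. */
    TupleProfile *tuple_profile(const char *phase, const char *event) {
        char key[256];
        snprintf(key, sizeof key, "%s => %s", phase, event);
        for (int i = 0; i < map_size; i++)
            if (strcmp(map[i].key, key) == 0)
                return &map[i];              /* tuple occurred before */
        if (map_size == MAX_TUPLES)
            return NULL;                     /* table full (sketch only) */
        strcpy(map[map_size].key, key);      /* new tuple: new profile */
        return &map[map_size++];
    }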

  15. Multi-Physics Example – Instrumentation [Figure: phase instrumentation of the multi-physics code: an iterate phase containing a heat phase (events heat(), MPIrecv()) and a stress phase (events stress(), MPIsend(), other routines). Only two events!]

  16. Implementation • Parallel profiling in the TAU performance system • Flat profiling • Callpath and callgraph (2-level callpath) profiling • Phase profiling • Multiple performance metrics • Execution time • Hardware performance counters (using PAPI) • Scalable to tens of thousands of processors • Profile analysis and data management tools • ParaProf parallel profile analyzer / visualizer • PerfDMF parallel profile database

  17. Application – NAS Parallel Benchmarks • Phase profiling can provide more refined profile results • Specific to phase localities • Defining phases is an application-specific issue • Apply understanding of computational models • Unfortunately, we were not the application developers • How to decide on phases and phase instrumentation? • Informed by application documentation and code • Look at NAS parallel benchmark application suite • Identify benchmarks with phase behavior • SP, BT, LU (simulated CFD codes) and CG • Focus on BT

  18. NAS BT – Phase Analysis • Emulates a CFD application • System of linear equations • Implicit finite-difference discretization of Navier-Stokes • Solve three sets of uncoupled systems of equations • in X, Y, Z directions • Block tridiagonal with 5x5 blocks • Square number of processors • Phase analysis • Highlight performance for each solution direction • Identified in code by three main functions • x_solve, y_solve, z_solve • Static phases

  19. NAS BT – Instrumentation
      call TAU_PHASE_CREATE_STATIC(xsolvephase, 'x_solve phase')
      call TAU_PHASE_START(xsolvephase)
      call x_solve
      call TAU_PHASE_STOP(xsolvephase)
      call TAU_PHASE_CREATE_STATIC(ysolvephase, 'y_solve phase')
      call TAU_PHASE_START(ysolvephase)
      call y_solve
      call TAU_PHASE_STOP(ysolvephase)
      call TAU_PHASE_CREATE_STATIC(zsolvephase, 'z_solve phase')
      call TAU_PHASE_START(zsolvephase)
      call z_solve
      call TAU_PHASE_STOP(zsolvephase)

  20. NAS BT – Flat Profile [Figure: flat profile of BT] How is MPI_Wait() distributed relative to solver direction? Application routine names reflect phase semantics

  21. NAS BT – Phase Profile (Main and X, Y, Z) Main phase shows nested phases and immediate events

  22. Application – MFIX • Multiphase Flow with Interphase eXchanges (MFIX) • National Energy Technology Laboratory (NETL) • Study physical/chemical properties in fluid-solid systems • hydrodynamics, heat transfer, chemical reactions • Characteristic of large-scale iterative simulations • major loop executed as simulation advances in time • Testcase • Models ozone decomposition in a bubbling fluidized bed • Flat profile • Iterate phase profile • Demonstrate dynamic phases

  23. MFIX – Phase Instrumentation (ITERATE)
      SUBROUTINE ITERATE(IER, NIT)
        character(11) taucharary
        integer tauiteration / 0 /
        integer profiler(2) / 0, 0 /
        save profiler, tauiteration
        write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
        tauiteration = tauiteration + 1
        call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
        call TAU_PHASE_START(profiler)
        ! WORK
        call TAU_PHASE_STOP(profiler)
      END SUBROUTINE ITERATE
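
Note how TAU_PHASE_CREATE_DYNAMIC pairs with the write into taucharary: each call to ITERATE builds a fresh phase name from the iteration counter, so every iteration becomes its own dynamic phase and per-iteration behavior can be separated in the profile, as the next slide shows.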

  24. MFIX – Phase Profile (MPI_Waitall) [Figure: phase profile showing dynamic phases, one per iteration] In the 51st iteration, time spent in MPI_Waitall was 85.81 secs; across all 92 iterations, the total time spent in MPI_Waitall was 4137.9 secs.

  25. MFIX Iterate Phase Behavior

  26. Concluding Discussion and Future Work • Phase-based profiling can help to bridge the semantic gap • Computational models ↔ performance measurements • Application-specific performance analysis • Implemented phase profiling in TAU • Demonstrated phase profiling • NAS BT benchmark and MFIX application • Also used in S3D, Uintah, Flash on large-scale platforms • Requires application-specific knowledge • Might be possible to link to automatic phase identification • Based on memory tracing or application state change • Can this idea be extended to global parallel phases? • Working on better ways to present phase performance

  27. Support Acknowledgements • Department of Energy (DOE) • Office of Science contracts • University of Utah ASCI Level 1 sub-contract • ASC/NNSA Level 3 contract • Department of Defense (DoD) • HPC Modernization Office (HPCMO) • Programming Environment and Training (PET) • NSF • Research Centre Juelich • Los Alamos National Laboratory • www.cs.uoregon.edu/research/paracomp/tau
