Phase-Based Parallel Performance Profiling. Allen D. Malony, Sameer Shende, Alan Morris {malony,sameer,amorris}@cs.uoregon.edu Department of Computer and Information Science Performance Research Laboratory NeuroInformatics Center University of Oregon.
Outline of Talk • Motivation • Models in parallel scientific applications • Phases and performance mapping • Problem description • Motivating example • Profiling techniques • Flat, callpath, phase profiling • Approach and implementation • Applications • Future work and concluding remarks
Motivation • Scientific applications designed based on models • Computational: structural, logical, numerical models, … • Correctness: execution order, data consistency, … • Performance: expected, factors, parallelism/scalability, … • Computational models form developer’s “mental” model • How the program is intended to behave and perform • Want to relate performance model to computation model • View performance data with respect to “mental” model • Better identify problems and guide tuning decisions • Must link computational abstractions to performance • Bridge semantic gap – measurements ↔ “mental” model
Performance Mapping • General problem of linking performance to computation • Performance mapping (Irvin and Miller, ’96; Shende, ’01) • Associate (map) measured performance data • To higher level, semantic representations • Those with model significance to the user • What is the difficulty of making the association? • Depends on performance information • performance events/state visible from instrumentation • what performance data can be measured • How the performance information is used in mapping • Difficulty in how performance information is presented • Model-based views (LeBlanc et al., ’90)
Phases and Performance Mapping • Would like to support the association between model and data • Concept of “phases” is common in scientific applications • How developers think about structure, logic, numerics • How performance can be interpreted (Worley, ’92) • Worthwhile to consider support for phases • In performance measurement • Bridge semantic gap in parallel performance mapping? • tracing has long demonstrated the benefits! (Heath, ’91) • phase-based analysis and interpretation • Main contribution • Support for phases in parallel performance profiling
Problem Description • Performance measured as a consequence of events • Events represent actions that occur during execution • Events of interest determine performance information • Events have semantics and context (pragmatics) • Semantics • Defines what the event represents • Example: subroutine entry • Context • Properties of the state in which event occurred • Example: subroutine’s calling parent • Interrogate context to map event performance data
Motivating Example – Multi-Physics Application • Assembly of physical objects • Different shapes • Different materials • Calculate physics • Heat transfer • Mechanical stress • Within / between objects • Iterate to error tolerance • How is performance attributed? • Between events (e.g., routines) and execution components • With respect to computational objects (e.g., data objects) [Figure labels: heat(), stress(), MPIrecv(), MPIsend(), other routines]
Context and Standard Profiling • Flat profiles • Context is whole program (i.e., program code) • Performance distribution across (static) program structure • Cannot differentiate dynamics (e.g., callpath or objects) • Callgraph / callpath profiles • Identify parent-child calling relationships at execution • Context is calling (event) parent / calling (event) path • Extend event semantics to encode context • create new event with callpath name • requires dynamic event creation for complex callpaths • burdens event mechanisms for context identification • simple performance associations require many events
Context and Phase Profiling • View the program execution as a collection of phases • Transition between phases (sequenced, nested) • easiest to think of as phase hierarchy (or phase graph) • Phases are not events • phase boundaries can mark entry/exit events • Context is the current phase • How do we know what phase we are in? • Phases are identified separately from events • phases are not encoded in event names • event mechanisms are not overloaded • A phase profile is event performance attributed to phases • Phase-specific performance profiles (flat or callpath)
Approach (Flat Profile) • Create a profile object for each entry/exit event • Each profile object has a name • Static profile object (static event) • event has a single instance (single name) • Dynamic profile object (dynamic event) • event can have multiple instances (created dynamically) • Inclusive and exclusive performance statistics • Must maintain an event stack (or callstack) • Contexts are generally thought of as code locations • Dynamic events do allow for dynamic context awareness • User code can check “state” and create new events • BUT only see one level of event!
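The event-stack bookkeeping behind inclusive and exclusive statistics can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not TAU's implementation: the class `FlatProfiler`, its `enter`/`leave` methods, and the hand-supplied timestamps are hypothetical names chosen for this example, and the routine names are borrowed from the multi-physics example.

```python
from collections import defaultdict


class FlatProfiler:
    """Toy flat profiler (hypothetical): one profile entry per event name."""

    def __init__(self):
        # Each stack entry is [name, start_time, time_spent_in_children].
        self.stack = []
        self.inclusive = defaultdict(float)
        self.exclusive = defaultdict(float)
        self.calls = defaultdict(int)

    def enter(self, name, now):
        self.stack.append([name, now, 0.0])

    def leave(self, name, now):
        top_name, start, child_time = self.stack.pop()
        if top_name != name:
            raise ValueError(f"exit of {name!r} does not match entry of {top_name!r}")
        elapsed = now - start
        self.inclusive[name] += elapsed
        # Exclusive time is the elapsed time minus time spent in child events.
        self.exclusive[name] += elapsed - child_time
        self.calls[name] += 1
        if self.stack:
            # Charge the parent with this event's inclusive time.
            self.stack[-1][2] += elapsed


# Timestamps are passed explicitly so the example is deterministic.
prof = FlatProfiler()
prof.enter("main", 0.0)
prof.enter("heat", 1.0)
prof.enter("MPIrecv", 2.0)
prof.leave("MPIrecv", 5.0)
prof.leave("heat", 6.0)
prof.enter("stress", 6.0)
prof.enter("MPIrecv", 7.0)
prof.leave("MPIrecv", 8.0)
prof.leave("stress", 10.0)
prof.leave("main", 10.0)

for name in sorted(prof.inclusive):
    print(name, prof.calls[name], prof.inclusive[name], prof.exclusive[name])
```

In this sketch the single `MPIrecv` entry merges the time spent under `heat` and under `stress`, which is the "only see one level of event" limitation noted on the slide.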
Approach (Callpath Profile) • Show event calling (nesting) relationships • Create a profile object for each event calling context • Each profile object has a name that encodes the callpath • Static profile object • callpath has a single instance (single name) • Dynamic profile object • callpath can have multiple instances (created dynamically) • Reuse event mechanisms • Interrogate the event stack to form event names • “main => f1 => f2 => MPI_Send” • Inclusive and exclusive performance statistics • Callpath length and callgraph depth options
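Forming a callpath name by interrogating the event stack might look like the following. This is a hedged Python sketch: the function `callpath_name` and its `depth` parameter are hypothetical stand-ins for the callpath length option mentioned on the slide, not TAU identifiers.

```python
def callpath_name(event_stack, depth=None):
    """Join the innermost `depth` events on the stack into a callpath name.

    `depth=None` keeps the full path; `depth=2` gives a parent/child
    (callgraph-style) name. Both the function and parameter are illustrative.
    """
    if depth is not None and depth < 1:
        raise ValueError("depth must be at least 1")
    path = event_stack if depth is None else event_stack[-depth:]
    return " => ".join(path)


stack = ["main", "f1", "f2", "MPI_Send"]
full = callpath_name(stack)
two_level = callpath_name(stack, depth=2)
print(full)
print(two_level)
```

Each distinct name would key its own profile object, which is why deep or highly varied callpaths lead to many events.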
Approach (Phase Profile) • A phase is an execution abstraction • Two questions • How to inform the measurement system about phases? • How to collect the performance data? • Create a phase object when a new phase is created • Each phase object has a name • Static and dynamic phase objects • Phase relationships • Phases may be nested (cannot overlap) • “Active” phase object follows scoping rules • Default (top-level) phase is outermost event (e.g., main)
Approach (Phase Profile - API)
• Phase creation
    TAU_PHASE_CREATE_STATIC(var, name, type, group)
    TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
    TAU_GLOBAL_PHASE(var, name, type, group)
    TAU_GLOBAL_PHASE_EXTERNAL(var)
• Global phases have global scope (accessible anywhere)
• External declarations for phases defined outside file scope
• Phase control
    TAU_PHASE_START(var)
    TAU_PHASE_STOP(var)
    TAU_GLOBAL_PHASE_START(var)
    TAU_GLOBAL_PHASE_STOP(var)
• Collects a callgraph profile (depth 2) PER PHASE!
• Phases default to standard events (when phase profiling is disabled)
Approach (Phase Profile - Data Collection) • Leverages performance mapping and callpath profiling • Phase entry • Phase object pushed to measurement (event) callstack • Phase / event entry • Need to determine (event, phase) tuple • traverse callstack to find enclosing phase • construct key for (event, phase) tuple • Maintain global map • new keys for new (event, phase) tuples put into global map • create new profile object for every (event, phase) tuple • search global map to determine if tuple occurred before • Use mapping support to store performance data on exit
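The data-collection steps above can be sketched as a small Python model. It is an illustration under assumptions rather than TAU's code: `PhaseProfiler`, `enter`, `leave`, and the dictionary-based "global map" are hypothetical names, timestamps are supplied by hand, and only inclusive time is recorded to keep the example short.

```python
class PhaseProfiler:
    """Toy phase profiler (hypothetical): profile objects keyed by (event, phase)."""

    def __init__(self):
        self.stack = []     # entries: (name, is_phase, start_time, key)
        self.profiles = {}  # global map: (event, phase) -> statistics

    def _enclosing_phase(self):
        # Traverse the callstack from the top to find the nearest phase.
        for name, is_phase, _start, _key in reversed(self.stack):
            if is_phase:
                return name
        return None  # only the outermost (default) phase has no enclosing phase

    def enter(self, name, now, is_phase=False):
        # The key is built before pushing, so a phase is charged to its parent phase.
        key = (name, self._enclosing_phase())
        if key not in self.profiles:  # has this tuple occurred before?
            self.profiles[key] = {"calls": 0, "inclusive": 0.0}
        self.stack.append((name, is_phase, now, key))

    def leave(self, name, now):
        top, _is_phase, start, key = self.stack.pop()
        if top != name:
            raise ValueError(f"exit of {name!r} does not match entry of {top!r}")
        record = self.profiles[key]  # store performance data on exit
        record["calls"] += 1
        record["inclusive"] += now - start


prof = PhaseProfiler()
prof.enter("main", 0.0, is_phase=True)         # default top-level phase
prof.enter("heat phase", 1.0, is_phase=True)
prof.enter("heat", 1.0)
prof.enter("MPIrecv", 2.0)
prof.leave("MPIrecv", 5.0)
prof.leave("heat", 6.0)
prof.leave("heat phase", 6.0)
prof.enter("stress phase", 6.0, is_phase=True)
prof.enter("stress", 6.0)
prof.enter("MPIrecv", 7.0)
prof.leave("MPIrecv", 8.0)
prof.leave("stress", 10.0)
prof.leave("stress phase", 10.0)
prof.leave("main", 10.0)

for key in sorted(prof.profiles, key=str):
    print(key, prof.profiles[key])
```

Unlike a flat profile, the time in `MPIrecv` is now split between the heat phase and the stress phase without encoding the phase in the event name.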
Multi-Physics Example Instrumentation [Figure: phases (iterate phase, heat phase, stress phase) and events (heat(), stress(), MPIrecv(), MPIsend(), other routines); annotation: only two events!]
Implementation • Parallel profiling in the TAU performance system • Flat profiling • Callpath and callgraph (2-level callpath) profiling • Phase profiling • Multiple performance metrics • Execution time • Hardware performance counters (using PAPI) • Scalable to tens of thousands of processors • Profile analysis and data management tools • ParaProf parallel profile analyzer / visualizer • PerfDMF parallel profile database
Application – NAS Parallel Benchmarks • Phase profiling can provide more refined profile results • Specific to phase localities • Defining phases is an application-specific issue • Apply understanding of computational models • Unfortunately, we were not the application developers • How to decide on phases and phase instrumentation? • Informed by application documentation and code • Look at NAS parallel benchmark application suite • Identify benchmarks with phase behavior • SP, BT, LU (simulated CFD codes) and CG • Focus on BT
NAS BT – Phase Analysis • Emulates a CFD application • System of linear equations • Implicit finite-difference discretization of Navier-Stokes • Solve three sets of uncoupled systems of equations • in X, Y, Z directions • Block tridiagonal with 5x5 blocks • Square number of processors • Phase analysis • Highlight performance for each solution direction • Identified in code by three main functions • x_solve, y_solve, z_solve • Static phases
NAS BT – Instrumentation

    call TAU_PHASE_CREATE_STATIC(xsolvephase, 'x_solve phase')
    call TAU_PHASE_START(xsolvephase)
    call x_solve
    call TAU_PHASE_STOP(xsolvephase)

    call TAU_PHASE_CREATE_STATIC(ysolvephase, 'y_solve phase')
    call TAU_PHASE_START(ysolvephase)
    call y_solve
    call TAU_PHASE_STOP(ysolvephase)

    call TAU_PHASE_CREATE_STATIC(zsolvephase, 'z_solve phase')
    call TAU_PHASE_START(zsolvephase)
    call z_solve
    call TAU_PHASE_STOP(zsolvephase)
NAS BT – Flat Profile • How is MPI_Wait() distributed relative to solver direction? • Application routine names reflect phase semantics
NAS BT – Phase Profile (Main and X, Y, Z) Main phase shows nested phases and immediate events
Application – MFIX • Multiphase Flow with Interphase eXchanges (MFIX) • National Energy Technology Laboratory (NETL) • Study physical/chemical properties in fluid-solid systems • hydrodynamics, heat transfer, chemical reactions • Characteristic of large-scale iterative simulations • major loop executed as simulation advances in time • Testcase • Models ozone decomposition in a bubbling fluidized bed • Flat profile • Iterate phase profile • Demonstrate dynamic phases
MFIX – Phase Instrumentation (ITERATE)

    SUBROUTINE ITERATE(IER, NIT)
      character(11) taucharary
      integer tauiteration / 0 /
      integer profiler(2) / 0, 0 /
      save profiler, tauiteration

      write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
      tauiteration = tauiteration + 1
      call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
      call TAU_PHASE_START(profiler)
      ! WORK
      call TAU_PHASE_STOP(profiler)
    END SUBROUTINE ITERATE
MFIX – Phase Profile (MPI_Waitall) • Dynamic phases, one per iteration • In the 51st iteration, time spent in MPI_Waitall was 85.81 secs • Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations
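To put the two MPI_Waitall figures quoted on this slide side by side, a short calculation can compare the 51st iteration against the per-iteration mean. This is only a worked arithmetic sketch; it assumes both numbers were measured the same way, and the variable names are illustrative.

```python
total_secs = 4137.9   # MPI_Waitall time across all iterations (from the slide)
iterations = 92       # number of dynamic ITERATE phases (from the slide)
iter51_secs = 85.81   # MPI_Waitall time in the 51st iteration (from the slide)

mean_secs = total_secs / iterations
ratio = iter51_secs / mean_secs

print(f"mean per iteration: {mean_secs:.2f} s")
print(f"iteration 51 relative to mean: {ratio:.2f}x")
```

The mean works out to roughly 45 seconds per iteration, so the 51st iteration spends nearly twice the average time in MPI_Waitall, a difference that a flat profile's single total would hide.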
Concluding Discussion and Future Work • Phase-based profiling can help to bridge semantic gap • Computational models ↔ performance measurements • Application-specific performance analysis • Implemented phase profiling in TAU • Demonstrated phase profiling • NAS BT benchmark and MFIX application • Also used in S3D, Uintah, Flash on large-scale platforms • Requires application-specific knowledge • Might be possible to link to auto phase identification • Based on memory tracing or application state change • Can this idea be extended to global parallel phases? • Working on better ways to present phase performance
Support Acknowledgements • Department of Energy (DOE) • Office of Science contracts • University of Utah ASCI Level 1 sub-contract • ASC/NNSA Level 3 contract • Department of Defense (DoD) • HPC Modernization Office (HPCMO) • Programming Environment and Training (PET) • NSF • Research Centre Juelich • Los Alamos National Laboratory • www.cs.uoregon.edu/research/paracomp/tau