This presentation outlines the goals, the performance state and event model, and the performance measurement model for a performance tools interface for OpenMP. It also discusses the event generation interface, a proposal based on directive transformation, and experience using TAU with OpenMP/MPI.
Performance Tools Interface for OpenMP
A presentation to the OpenMP Futures Committee
Allen D. Malony
malony@cs.uoregon.edu
Computer & Information Science Department
Computational Science Institute
University of Oregon
Outline
• Goals for OpenMP performance tools interface
• Performance state and event model
  • Fork-join execution states and events
• Performance measurement model
• Event generation (callback) interface
• Proposal based on directive transformation
  • Sample transformations
  • Comments and other additions
• Describing execution context
• General issues
• Experience using TAU with OpenMP/MPI
Goals for an OMP Performance Tools Interface
• Goal 1: Expose OpenMP events and execution states to a performance measurement system
  • What are the OpenMP events / states of interest?
  • What is the nature (mechanism) of the interface?
• Goal 2: Make the performance measurement interface portable
  • “Standardize” on interface mechanism
  • Define interface semantics and information
• Goal 3: Support source-level and compiler-level implementation of the interface
  • Source transformation and compiler transformation
Performance State and Event Model
• Based on a performance model for (nested) fork-join parallelism, multi-threaded work-sharing, and thread-based synchronization
• Defined with respect to a multi-level state view
  • Level 1: serial and parallel states (with nesting)
  • Level 2: work-sharing states (per team thread)
  • Level 3: synchronization states (per team thread)
  • Level 4: runtime system (thread) states
• Events reflect state transitions
  • State enter / exit (begin / end)
  • State graph with event edges
Fork-Join Execution States and Events

Parallel region operation            Master event   Slave event   State
master starts serial execution            X                         S
parallel region begins                    X
slaves started                            X
team begins parallel execution            X              X          P
team threads hit barrier                  X              X
slaves end; master exits barrier          X              X
master resumes serial execution           X                         S
Performance Measurement Model
• Serial performance
  • Detect serial transition points
  • Standard events and statistics within serial regions
  • Time spent in serial execution
  • Locations of serial execution in program
• Parallel performance
  • Detect parallel transition points
  • Time spent in parallel execution
  • Region perspective and work-sharing perspective
  • Performance profiles kept per region
  • More complex parallel states of execution
Event Generation (Callback) Interface
• Generic event callback function (pseudo format)
  • omperf(eventID, contextID[, data])
  • Single callback routine
  • Must define events (not necessarily standardize)
  • Places the burden on the callback routine to interpret eventID
  • omperf_{begin/end}(eventID, contextID[, data])
• Directive-specific callback functions (pseudo format)
  • omperf_{directive}_{begin/end/…}(contextID[, data])
  • Standardize function names
  • What about execution context data?
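Purely as an illustration, the two callback styles could be declared along the following lines in C. The type names (omperf_event_t, omperf_context_t) and the specific begin/end variants shown are assumptions for this sketch, not part of the proposal:

```c
/* Hypothetical C prototypes for the two callback styles sketched above. */
typedef int omperf_event_t;    /* assumed: event identifier                    */
typedef int omperf_context_t;  /* assumed: context identifier (or the ID of a
                                  context descriptor, see below)               */

/* Style 1: a single generic callback; the tool interprets eventID. */
void omperf(omperf_event_t eventID, omperf_context_t contextID, void *data);
void omperf_begin(omperf_event_t eventID, omperf_context_t contextID, void *data);
void omperf_end(omperf_event_t eventID, omperf_context_t contextID, void *data);

/* Style 2: directive-specific callbacks with standardized names. */
void omperf_parallel_begin(omperf_context_t contextID);
void omperf_parallel_end(omperf_context_t contextID);
void omperf_do_begin(omperf_context_t contextID);
void omperf_do_end(omperf_context_t contextID);
```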
Instrumentation Alternatives
• Source-level instrumentation
  • Manual instrumentation (TAU performance measurement)
  • Directive transformation
• Compiler instrumentation
  • Could allow more efficient implementation
  • JOMP (EPCC), Java Instrumentation Suite (Barcelona)
• Runtime system instrumentation
  • Used to see RTL events, in addition to OMP events
  • GuideView (KAI/Intel)
• Dynamic instrumentation
Proposal Based on Directive Transformation
• Consider a source-level approach
  • For each OMP directive, generate an “instrumented” version which calls the performance event API
  • What is the event model for each directive?
• Issues
  • OMP RTL execution behavior is not fully exposed
  • May not be able to generate an equivalent form
  • Possible conflicts with directive optimization
  • May be less efficient
  • Hard to access RTL events and information
• Sample transformations (B. Mohr, KFA)
Example: parallel regions, work-sharing (do)

Parallel region (parallel)
  #omp parallel          →  omperf_parallel_fork(regionID)
                             #omp parallel
                             omperf_parallel_begin(regionID)
  #omp end parallel      →  omperf_parallel_end(regionID)
                             omperf_barrier_begin(regionID)
                             #omp barrier
                             omperf_barrier_end(regionID)
                             #omp end parallel
                             omperf_parallel_join(regionID)

Work-sharing (do/for)
  #omp do                →  omperf_do_begin(loopID)
                             #omp do
  #omp end do nowait     →  #omp end do nowait
                             omperf_do_end(loopID)
  #omp end do            →  #omp end do nowait
                             omperf_do_end(loopID)
                             omperf_barrier_begin(loopID)
                             #omp barrier
                             omperf_barrier_end(loopID)

(#omp is just pseudo notation; IDs vs. context descriptor, see below)
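To make the parallel-region transformation concrete, here is a minimal C/OpenMP sketch of what the instrumented form could look like after source rewriting. The stub callback bodies and the integer regionID are simplifications for illustration; a real implementation would pass a context descriptor generated by the tool.

```c
/* A minimal sketch of an instrumented parallel region, assuming the
 * directive-specific callbacks proposed above; omperf_* are stubbed out
 * here only so the example compiles and runs (e.g. cc -fopenmp). */
#include <stdio.h>
#include <omp.h>

static void omperf_parallel_fork(int id)  { printf("fork  region %d\n", id); }
static void omperf_parallel_begin(int id) { printf("begin region %d (thread %d)\n", id, omp_get_thread_num()); }
static void omperf_parallel_end(int id)   { printf("end   region %d (thread %d)\n", id, omp_get_thread_num()); }
static void omperf_parallel_join(int id)  { printf("join  region %d\n", id); }
static void omperf_barrier_begin(int id)  { (void)id; }
static void omperf_barrier_end(int id)    { (void)id; }

int main(void)
{
    const int regionID = 1;            /* would come from a descriptor table */

    omperf_parallel_fork(regionID);    /* master thread, before the fork     */
    #pragma omp parallel
    {
        omperf_parallel_begin(regionID);   /* every team thread              */
        /* ... original body of the parallel region ...                      */
        omperf_parallel_end(regionID);
        omperf_barrier_begin(regionID);    /* implicit barrier made explicit */
        #pragma omp barrier
        omperf_barrier_end(regionID);
    }
    omperf_parallel_join(regionID);    /* master thread, after the join      */
    return 0;
}
```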
Example: work-sharing (sections)

Work-sharing (sections)
  #omp sections                    →  omperf_sections_begin(sectionsID)
                                       #omp sections
  #omp section (first section)     →  #omp section
                                       omperf_section_begin(sectionID)
  #omp section (other sections)    →  omperf_section_end(prevsectionID)
                                       #omp section
                                       omperf_section_begin(sectionID)
  #omp end sections nowait         →  omperf_section_end(lastsectionID)
                                       #omp end sections nowait
                                       omperf_sections_end(sectionsID)
  #omp end sections                →  omperf_section_end(lastsectionID)
                                       #omp end sections nowait
                                       omperf_barrier_begin(sectionsID)
                                       #omp barrier
                                       omperf_barrier_end(sectionsID)
                                       omperf_sections_end(sectionsID)
Example: work-sharing (single, master)

Work-sharing (single)
  #omp single              →  omperf_single_enter(singleID)
                               #omp single
                               omperf_single_begin(singleID)
  #omp end single nowait   →  omperf_single_end(singleID)
                               #omp end single nowait
                               omperf_single_exit(singleID)
  #omp end single          →  omperf_single_end(singleID)
                               #omp end single nowait
                               omperf_barrier_begin(singleID)
                               #omp barrier
                               omperf_barrier_end(singleID)
                               omperf_single_exit(singleID)

Work-sharing (master)
  #omp master              →  #omp master
                               omperf_master_begin(regionID)
  #omp end master          →  omperf_master_end(regionID)
                               #omp end master
Example: synchronization (critical, atomic, lock)

Mutual exclusion (critical section)
  #omp critical            →  omperf_critical_enter(criticalID)
                               #omp critical
                               omperf_critical_begin(criticalID)
  #omp end critical        →  omperf_critical_end(criticalID)
                               #omp end critical
                               omperf_critical_exit(criticalID)

Mutual exclusion (atomic)
  #omp atomic              →  omperf_atomic_begin(atomicID)
  atomic-expr-stmt             #omp atomic
                               atomic-expr-stmt
                               omperf_atomic_end(atomicID)

Mutual exclusion (lock routines)
  omp_set_lock(lockID)     →  omperf_lock_set(lockID)
                               omp_set_lock(lockID)
                               omperf_lock_acquire(lockID)
  omp_unset_lock(lockID)   →  omp_unset_lock(lockID)
                               omperf_lock_unset(lockID)
  omp_test_lock(lockID)    →  …

Overhead issues here
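One way to realize the lock-routine transformation without rewriting every call site is to interpose wrapper functions. The sketch below illustrates that idea; the instr_* wrapper names are hypothetical, and the omperf_lock_* callbacks follow the event names used above but are stubbed out so the example compiles and runs.

```c
/* Sketch: wrapping the OpenMP lock routines to emit set/acquire/unset
 * events, matching the lock transformation shown above. Build with -fopenmp. */
#include <stdio.h>
#include <omp.h>

/* Stub callbacks standing in for a real measurement library. */
static void omperf_lock_set(omp_lock_t *l)     { printf("wait %p\n", (void *)l); }
static void omperf_lock_acquire(omp_lock_t *l) { printf("hold %p\n", (void *)l); }
static void omperf_lock_unset(omp_lock_t *l)   { printf("free %p\n", (void *)l); }

/* Wrapper an instrumented build could call in place of omp_set_lock:
 * record the attempt, block in the real routine, record the acquisition. */
static void instr_omp_set_lock(omp_lock_t *lock)
{
    omperf_lock_set(lock);
    omp_set_lock(lock);
    omperf_lock_acquire(lock);
}

static void instr_omp_unset_lock(omp_lock_t *lock)
{
    omp_unset_lock(lock);
    omperf_lock_unset(lock);
}

int main(void)
{
    omp_lock_t lock;
    omp_init_lock(&lock);
    #pragma omp parallel
    {
        instr_omp_set_lock(&lock);
        /* ... critical work ... */
        instr_omp_unset_lock(&lock);
    }
    omp_destroy_lock(&lock);
    return 0;
}
```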
Comments
• Appropriate transformations for short-cut directives
  • #omp parallel do, #omp parallel sections
• Performance initialization and termination routines
  • omperf_init(), omperf_finalize()
• User-defined naming to use in context description
  • New attribute? New directive? Runtime function?
• RTL events and information
  • How to get thread information efficiently?
  • How to get thread-specific context data?
• Supports portability and source-based analysis tools
Other Additions
• Support for user-defined events
  • !$omp perf event ...
  • #pragma omp perf event …
  • Place at arbitrary points in program
  • Translated by compiler into a corresponding omperf() call (see the sketch below)
• Measurement control
  • !$omp perf on/off
  • #pragma omp perf on/off
  • Place at “consistent” points in program
  • Translated by compiler into omperf_on/off()
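A rough sketch of how an instrumenting compiler might lower these proposed directives. The omperf_on/off names come from the slide above; the string-argument omperf_event form and the stub bodies are assumptions made only for this illustration:

```c
/* Sketch: possible lowering of "#pragma omp perf ..." directives.
 * The original directives are shown in comments above each call. */
#include <stdio.h>

static int measuring = 1;  /* crude on/off switch for the sketch */
static void omperf_event(const char *name) { if (measuring) printf("event: %s\n", name); }
static void omperf_off(void) { measuring = 0; }
static void omperf_on(void)  { measuring = 1; }

int main(void)
{
    /* user wrote:  #pragma omp perf event "phase 1 done"  */
    omperf_event("phase 1 done");

    /* user wrote:  #pragma omp perf off                   */
    omperf_off();
    /* ... work that should not be measured ...            */

    /* user wrote:  #pragma omp perf on                    */
    omperf_on();
    omperf_event("phase 2 done");
    return 0;
}
```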
Describing Execution Context (B. Mohr)
• Describe different contexts through a context descriptor (shown here in valid C; the original proposal used WORD for an unspecified word-sized type):

  struct region_descr {
      const char *name;            /* region name       */
      const char *filename;        /* source file name  */
      int begin_lineno;            /* begin line #      */
      int end_lineno;              /* end line #        */
      long data[4];                /* unspecified data  */
      struct region_descr *next;
  };

• Generate context descriptors in global static memory:

  struct region_descr rd42675 = { "r1", "foo.c", 5, 13 };

• Table of context descriptors
Describing Execution Context (continued)
• Pass the descriptor address (or ID) to the performance callback
• Advantages:
  • Full context information available, including source reference
  • But minimal runtime overhead
    • just one argument needs to be passed
    • implementation doesn’t need to dynamically allocate memory for performance data
    • context data initialization at compile time
  • Context data is kept together with the executable
    • avoids problems of locating (the right) separate context description file at runtime
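A minimal sketch, under the assumptions above, of how a compiler-generated descriptor could be passed to a callback. The simplified descriptor layout and the callback taking a descriptor pointer are illustrative choices, not specified by the proposal:

```c
/* Sketch: statically generated context descriptor handed to a callback. */
#include <stdio.h>

/* Descriptor layout following the region_descr proposal above
 * (simplified: no data words, no next pointer). */
struct region_descr {
    const char *name;
    const char *filename;
    int begin_lineno;
    int end_lineno;
};

/* Compiler-generated descriptor in global static memory: no runtime
 * allocation, no separate context description file to locate. */
static struct region_descr rd42675 = { "r1", "foo.c", 5, 13 };

/* Hypothetical callback that receives the descriptor address. */
static void omperf_parallel_begin(const struct region_descr *r)
{
    printf("enter %s (%s:%d-%d)\n",
           r->name, r->filename, r->begin_lineno, r->end_lineno);
}

int main(void)
{
    /* The transformed directive passes just one argument: */
    omperf_parallel_begin(&rd42675);
    return 0;
}
```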
General Issues
• Portable performance measurement interface
  • OMP event-oriented (directives and RTL operation)
  • Generic “standardized” performance event interface
  • Not specific to any particular measurement library
  • Cross-language support
• Performance measurement library approach
  • Profiling and tracing
  • No built-in (non-portable) measurement
• Overheads vs. perturbation
  • Iteration measurement overhead can be serious
• Dynamic instrumentation – is it possible?
TAU Architecture (architecture diagram; not reproduced here)
Hybrid Parallel Computation (OpenMP + MPI)
• Portable hybrid parallel programming
  • OpenMP for shared-memory parallel programming
    • Fork-join model
    • Loop-level parallelism
  • MPI for cross-box message-based parallelism
• OpenMP performance measurement
  • Interface to OpenMP runtime system (RTS events)
  • Compiler support and integration
• 2D Stommel model of ocean circulation
  • Jacobi iteration, 5-point stencil (schematic sketch below)
  • Timothy Kaiser (San Diego Supercomputing Center)
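For orientation, the hybrid structure being measured looks roughly like this: MPI ranks own blocks of the grid and OpenMP parallelizes the stencil loops within each rank. This is a schematic Jacobi-style sketch, not the actual Stommel model code; halo exchange and the convergence test are elided.

```c
/* Schematic MPI + OpenMP structure of a Jacobi 5-point stencil sweep.
 * Illustration only: boundary exchange between MPI ranks and the
 * convergence test are omitted for brevity. Build with mpicc -fopenmp. */
#include <stdio.h>
#include <mpi.h>

#define NX 256
#define NY 256

static double a[NX][NY], b[NX][NY];

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int iter = 0; iter < 100; iter++) {
        /* MPI: exchange boundary rows/columns with neighbouring ranks here. */

        /* OpenMP: loop-level parallelism inside each MPI process. */
        #pragma omp parallel for
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

        #pragma omp parallel for
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                a[i][j] = b[i][j];
    }

    printf("rank %d finished\n", rank);
    MPI_Finalize();
    return 0;
}
```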
OpenMP + MPI Ocean Modeling (Trace)
(trace view, not reproduced here: shows thread/message pairing and integrated OpenMP + MPI events)
OpenMP + MPI Ocean Modeling (HW Profile)

% configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc
            -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/lib

(profile view, not reproduced here: shows integrated OpenMP + MPI events and floating-point instruction counts)