
Performance Tools Interface for OpenMP: A Presentation to the OpenMP Futures Committee

This presentation outlines the goals, the performance state and event model, and the performance measurement model for a performance tools interface for OpenMP. It also discusses an event generation interface, a proposal based on directive transformation, and experience using TAU with OpenMP/MPI.


Presentation Transcript


  1. Performance Tools Interface for OpenMP: A presentation to the OpenMP Futures Committee • Allen D. Malony (malony@cs.uoregon.edu) • Computer & Information Science Department, Computational Science Institute, University of Oregon

  2. Outline • Goals for OpenMP performance tools interface • Performance state and event model • Fork-join execution states and events • Performance measurement model • Event generation (callback) interface • Proposal based on directive transformation • Sample transformations • Comments and other additions • Describing execution context • General Issues • Experience using TAU with OpenMP/MPI

  3. Goals for an OMP Performance Tools Interface • Goal 1: Expose OpenMP events and execution states to a performance measurement system • What are the OpenMP events / states of interest? • What is the nature (mechanism) of the interface? • Goal 2: Make the performance measurement interface portable • “Standardize” on interface mechanism • Define interface semantics and information • Goal 3: Support source-level and compiler-level implementation of interface • Source transformation and compiler transformation

  4. Performance State and Event Model • Based on performance model for (nested) fork-join parallelism, multi-threaded work-sharing, and thread-based synchronization • Define with respect to multi-level state view • Level 1: serial and parallel states (with nesting) • Level 2: work-sharing states (per team thread) • Level 3: synchronization states (per team thread) • Level 4: runtime system (thread) states • Events reflect state transitions • State enter / exit (begin / end) • State graph with event edges
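The nested state view on this slide can be sketched as a small state stack in C, where every enter or exit is one event. This is only an illustration: the type and function names (`perf_state`, `state_tracker`, `state_enter`, `state_exit`) are hypothetical and not part of the proposal, and a real tool would keep one such tracker per thread.

```c
/* Hypothetical state names; the slide only names the levels. */
typedef enum { ST_SERIAL, ST_PARALLEL, ST_WORKSHARE, ST_SYNC } perf_state;

#define MAX_DEPTH 16

typedef struct {
    perf_state stack[MAX_DEPTH]; /* nesting of currently open states */
    int depth;                   /* number of entries in use */
    int events;                  /* state transitions seen so far */
} state_tracker;

/* Execution starts in the serial state. */
void tracker_init(state_tracker *t)
{
    t->stack[0] = ST_SERIAL;
    t->depth = 1;
    t->events = 0;
}

/* A "begin" event: push the new state. Returns -1 if nesting is too deep. */
int state_enter(state_tracker *t, perf_state s)
{
    if (t->depth >= MAX_DEPTH) return -1;
    t->stack[t->depth++] = s;
    t->events++;
    return 0;
}

/* An "end" event: only the innermost open state may be exited. */
int state_exit(state_tracker *t, perf_state s)
{
    if (t->depth <= 1 || t->stack[t->depth - 1] != s) return -1;
    t->depth--;
    t->events++;
    return 0;
}

perf_state state_current(const state_tracker *t)
{
    return t->stack[t->depth - 1];
}
```

Rejecting a mismatched exit keeps the state graph well formed, which is what lets events be read as edges between states.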

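  5. Fork-Join Execution States and Events • Each parallel region operation is listed with the thread(s) generating the event (master, slave) and the resulting state (S = serial, P = parallel) • master starts serial execution (master; state S) • parallel region begins (master) • slaves started (slave) • team begins parallel execution (master, slave; state P) • team threads hit barrier (master, slave) • slaves end; master exits barrier (master, slave) • master resumes serial execution (master; state S)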

  6. Performance Measurement Model • Serial performance • Detect serial transition points • Standard events and statistics within serial regions • Time spent in serial execution • Locations of serial execution in program • Parallel performance • Detect parallel transition points • Time spent in parallel execution • Region perspective and work-sharing perspective • Performance profiles kept per region • More complex parallel states of execution
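As a rough illustration of "performance profiles kept per region", the sketch below accumulates call counts and elapsed time per region ID from begin/end events. Everything here is an assumption made for the example: the names (`region_profile`, `profile_begin`, `profile_end`), the fixed table size, and the choice to pass timestamps in as arguments. It is single-threaded; a real measurement library would keep per-thread data.

```c
#define MAX_REGIONS 8

typedef struct {
    long   calls;   /* completed begin/end pairs */
    double total;   /* accumulated time in the region */
    double start;   /* timestamp of the open begin event */
    int    active;  /* nonzero while the region is open */
} region_profile;

static region_profile profiles[MAX_REGIONS];

static int valid_region(int id)
{
    return id >= 0 && id < MAX_REGIONS;
}

/* Region begin event: remember the timestamp. */
int profile_begin(int regionID, double now)
{
    if (!valid_region(regionID) || profiles[regionID].active) return -1;
    profiles[regionID].start = now;
    profiles[regionID].active = 1;
    return 0;
}

/* Region end event: add the elapsed time to the region's profile. */
int profile_end(int regionID, double now)
{
    if (!valid_region(regionID) || !profiles[regionID].active) return -1;
    profiles[regionID].total += now - profiles[regionID].start;
    profiles[regionID].calls++;
    profiles[regionID].active = 0;
    return 0;
}
```

Passing the timestamp in keeps the timer source outside the profile logic, so the same bookkeeping could serve wall-clock time or a hardware counter value.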

  7. Event Generation (Callback) Interface • Generic event callback function (pseudo format) • omperf(eventID, contextID[, data]) • Single callback routine • Must define events (not necessarily standardize) • Place burden on callback routine to interpret eventID • omperf_{begin/end}(eventID, contextID[, data]) • Directive-specific callback functions (pseudo format) • omperf_{directive}_{begin/end/…}(contextID[, data]) • Standardize function names • What about execution context data?
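The two callback styles on this slide might look as follows in C. This is a sketch under assumptions: the slide gives only pseudo formats, so the argument types, the event enum, and the helper `omperf_count` are invented for illustration, and the counters are not thread-safe.

```c
#include <stddef.h>

/* Hypothetical event IDs; the slide says events must be defined
   but not necessarily standardized. */
typedef enum {
    EV_PARALLEL_BEGIN,
    EV_PARALLEL_END,
    EV_DO_BEGIN,
    EV_DO_END,
    EV_COUNT
} omperf_event;

static long event_counts[EV_COUNT];

/* Generic form: a single entry point that must interpret eventID itself. */
void omperf(int eventID, int contextID, void *data)
{
    (void)contextID;
    (void)data;
    if (eventID >= 0 && eventID < EV_COUNT)
        event_counts[eventID]++;
}

/* Directive-specific form: the event is encoded in the function name,
   so only the context needs to be passed. */
void omperf_parallel_begin(int contextID)
{
    omperf(EV_PARALLEL_BEGIN, contextID, NULL);
}

void omperf_parallel_end(int contextID)
{
    omperf(EV_PARALLEL_END, contextID, NULL);
}

/* Illustration-only accessor; returns -1 for an unknown event. */
long omperf_count(int eventID)
{
    return (eventID >= 0 && eventID < EV_COUNT) ? event_counts[eventID] : -1;
}
```

Here the directive-specific functions simply forward to the generic one, which shows that the two styles differ in what is standardized (event IDs versus function names) rather than in what they can express.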

  8. Instrumentation Alternatives • Source-level instrumentation • Manual instrumentation • TAU performance measurement • Directive transformation • Compiler instrumentation • Could allow more efficient implementation • JOMP (EPCC), Java Instrumentation Suite (Barcelona) • Runtime system instrumentation • Use to see RTL events, in addition to OMP events • GuideView (KAI/Intel) • Dynamic instrumentation

  9. Proposal Based on Directive Transformation • Consider source-level approach • For each OMP directive, generate an “instrumented” version which calls the performance event API. • What is the event model for each directive? • Issues • OMP RTL execution behavior is not fully exposed • May not be able to generate equivalent form • Possible conflicts with directive optimization • May be less efficient • Hard to access RTL events and information • Sample transformations (B. Mohr, KFA)

  10. Example: parallel regions, work-sharing (do) • Notation: original directive → instrumented form; #omp is just pseudo notation • IDs vs. context descriptor (see below) • Parallel region (parallel) • #omp parallel → omperf_parallel_fork(regionID); #omp parallel; omperf_parallel_begin(regionID) • #omp end parallel → omperf_parallel_end(regionID); omperf_barrier_begin(regionID); #omp barrier; omperf_barrier_end(regionID); #omp end parallel; omperf_parallel_join(regionID) • Work-sharing (do/for) • #omp do → omperf_do_begin(loopID); #omp do • #omp end do nowait → #omp end do nowait; omperf_do_end(loopID) • #omp end do → #omp end do nowait; omperf_do_end(loopID); omperf_barrier_begin(loopID); #omp barrier; omperf_barrier_end(loopID)
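In C, the parallel-region transformation from this slide could be written out by hand roughly as below. The `omperf_*` names come from the slide; their bodies, the `int` region ID, the event enum, and `run_region` are assumptions for the sketch. Compiled without OpenMP support the pragmas are ignored and the region runs on one thread, which still exercises the same event order.

```c
/* Event kinds counted by this sketch (hypothetical enum). */
enum { EV_FORK, EV_BEGIN, EV_END, EV_BARRIER_BEGIN, EV_BARRIER_END,
       EV_JOIN, EV_KINDS };

static long event_count[EV_KINDS];

/* Record one event; the atomic update keeps the counter consistent
   when several team threads report at the same time. */
static void record(int kind)
{
    #pragma omp atomic
    event_count[kind]++;
}

static void omperf_parallel_fork(int regionID)  { (void)regionID; record(EV_FORK); }
static void omperf_parallel_begin(int regionID) { (void)regionID; record(EV_BEGIN); }
static void omperf_parallel_end(int regionID)   { (void)regionID; record(EV_END); }
static void omperf_barrier_begin(int regionID)  { (void)regionID; record(EV_BARRIER_BEGIN); }
static void omperf_barrier_end(int regionID)    { (void)regionID; record(EV_BARRIER_END); }
static void omperf_parallel_join(int regionID)  { (void)regionID; record(EV_JOIN); }

/* Instrumented form of a parallel region; returns the team size. */
int run_region(int regionID)
{
    int team_size = 0;

    omperf_parallel_fork(regionID);          /* master only, before the fork */
    #pragma omp parallel
    {
        omperf_parallel_begin(regionID);     /* every team thread */

        #pragma omp atomic
        team_size++;                         /* stand-in for the region body */

        omperf_parallel_end(regionID);
        omperf_barrier_begin(regionID);      /* explicit barrier makes the wait visible */
        #pragma omp barrier
        omperf_barrier_end(regionID);
    }
    omperf_parallel_join(regionID);          /* master only, after the join */
    return team_size;
}
```

The explicit barrier before the end of the region is what exposes barrier wait time as its own event pair; the implicit barrier at the end of the region cannot be instrumented from source.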

  11. Example: work-sharing (sections) • Work-sharing (sections) • #omp sections → omperf_sections_begin(sectionsID); #omp sections • #omp section (first section only) → #omp section; omperf_section_begin(sectionID) • #omp section (other sections only) → omperf_section_end(prevsectionID); #omp section; omperf_section_begin(sectionID) • #omp end sections nowait → omperf_section_end(lastsectionID); #omp end sections nowait; omperf_sections_end(sectionsID) • #omp end sections → omperf_section_end(lastsectionID); #omp end sections nowait; omperf_barrier_begin(sectionsID); #omp barrier; omperf_barrier_end(sectionsID); omperf_sections_end(sectionsID)

  12. Example: work-sharing (single, master) • Work-sharing (single) • #omp single → omperf_single_enter(singleID); #omp single; omperf_single_begin(singleID) • #omp end single nowait → omperf_single_end(singleID); #omp end single nowait; omperf_single_exit(singleID) • #omp end single → omperf_single_end(singleID); #omp end single nowait; omperf_barrier_begin(singleID); #omp barrier; omperf_barrier_end(singleID); omperf_single_exit(singleID) • Work-sharing (master) • #omp master → #omp master; omperf_master_begin(regionID) • #omp end master → omperf_master_end(regionID); #omp end master

  13. Example: synchronization (critical, atomic, lock) • Mutual exclusion (critical section) • #omp critical → omperf_critical_enter(criticalID); #omp critical; omperf_critical_begin(criticalID) • #omp end critical → omperf_critical_end(criticalID); #omp end critical; omperf_critical_exit(criticalID) • Mutual exclusion (atomic) • #omp atomic → omperf_atomic_begin(atomicID); #omp atomic; atomic-expr-stmt; omperf_atomic_end(atomicID) • Mutual exclusion (lock routines) • omp_set_lock(lockID) → omperf_lock_set(lockID); omp_set_lock(lockID); omperf_lock_acquire(lockID) • omp_unset_lock(lockID) → omp_unset_lock(lockID); omperf_lock_unset(lockID) • omp_test_lock(lockID) → … • Overhead issues here

  14. Comments • Appropriate transformations for short-cut directives • #omp parallel do, #omp parallel sections • Performance initialization and termination routines • omperf_init(), omperf_finalize() • User-defined naming to use in context description • New attribute? New directive? Runtime function? • RTL events and information • How to get thread information efficiently? • How to get thread-specific context data? • Supports portability and source-based analysis tools

  15. Other Additions • Support for user-defined events • !$omp perf event ... • #pragma omp perf event … • Place at arbitrary points in program • Translated by compiler into corresponding omperf() • Measurement control • !$omp perf on/off • #pragma omp perf on/off • Place at “consistent” points in program • Translated by compiler into omperf_on/off()
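One way the measurement-control calls could behave is sketched below: `omperf_on()` and `omperf_off()` toggle a flag that the event callback checks before recording anything. The function names come from the slide; the signatures, the global flag, and the `omperf_recorded` accessor are assumptions, and a per-thread flag might be preferable in a real implementation.

```c
#include <stddef.h>

static int  omperf_enabled = 1;  /* measurement is on by default (assumption) */
static long recorded;            /* events actually recorded */

void omperf_on(void)  { omperf_enabled = 1; }
void omperf_off(void) { omperf_enabled = 0; }

/* Event callback: drops the event cheaply while measurement is off. */
void omperf(int eventID, int contextID, void *data)
{
    (void)eventID;
    (void)contextID;
    (void)data;
    if (!omperf_enabled)
        return;
    recorded++;
}

/* Illustration-only accessor. */
long omperf_recorded(void)
{
    return recorded;
}
```

Because an event that begins while measurement is on could end while it is off, the on/off points need to sit where no instrumented construct is open, which is presumably what "consistent" points means on the slide.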

  16. Describing Execution Context (B. Mohr) • Describe different contexts through context descriptor • struct region_descr { char name[]; /* region name */ char filename[]; /* source file name */ int begin_lineno; /* begin line # */ int end_lineno; /* end line # */ WORD data[4]; /* unspecified data */ struct region_descr* next; }; • Generate context descriptors in global static memory: • struct region_descr rd42675 = { "r1", "foo.c", 5, 13 }; • Table of context descriptors

  17. Describing Execution Context (continued) • Pass descriptor address (or ID) to performance callback • Advantages: • Full context information available, including source reference • But minimal runtime overhead • just one argument needs to be passed • implementation doesn’t need to dynamically allocate memory for performance data!! • context data initialization at compile time • Context data is kept together with executable • avoids problems of locating (the right) separate context description file at runtime
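A compilable variant of the context descriptor might look like the sketch below. It departs from the slide in two labeled ways: the `char name[]` members are replaced by `const char *` so the struct is valid C with more than one string, and `WORD` is assumed to be a `long`. Using `data[0]` as a call counter and the `describe` helper are also just illustrations of what a measurement library could do with the descriptor.

```c
#include <stdio.h>

typedef long WORD;  /* assumption: the slide leaves WORD undefined */

struct region_descr {
    const char *name;           /* region name */
    const char *filename;       /* source file name */
    int begin_lineno;           /* begin line # */
    int end_lineno;             /* end line # */
    WORD data[4];               /* unspecified data, owned by the tool */
    struct region_descr *next;  /* link for a table of descriptors */
};

/* Generated at compile time in global static memory. */
static struct region_descr rd42675 = { "r1", "foo.c", 5, 13, {0}, NULL };

/* The callback receives only the descriptor address; it can keep its own
   bookkeeping in data[] without allocating memory at run time. */
void omperf_parallel_begin(struct region_descr *rd)
{
    rd->data[0]++;  /* illustration: count entries to this region */
}

/* Format the source reference carried by the descriptor. */
int describe(const struct region_descr *rd, char *buf, size_t n)
{
    return snprintf(buf, n, "%s (%s:%d-%d)",
                    rd->name, rd->filename,
                    rd->begin_lineno, rd->end_lineno);
}
```

Since the descriptor lives in the executable's static data, the source reference is available to the tool at any event without a lookup in a separate file.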

  18. General Issues • Portable performance measurement interface • OMP event-oriented (directives and RTL operation) • Generic “standardized” performance event interface • Not specific to any particular measurement library • Cross-language support • Performance measurement library approach • Profiling and tracing • No built-in (non-portable) measurement • Overheads vs. perturbation • Iteration measurement overhead can be serious • Dynamic instrumentation – is it possible?

  19. TAU Architecture (architecture diagram; not reproduced in the transcript)

  20. Hybrid Parallel Computation (OpenMP + MPI) • Portable hybrid parallel programming • OpenMP for shared memory parallel programming • Fork-join model • Loop level parallelism • MPI for cross-box message-based parallelism • OpenMP performance measurement • Interface to OpenMP runtime system (RTS events) • Compiler support and integration • 2D Stommel model of ocean circulation • Jacobi iteration, 5-point stencil • Timothy Kaiser (San Diego Supercomputer Center)

  21. OpenMP + MPI Ocean Modeling (Trace) • Thread message pairing • Integrated OpenMP + MPI events

  22. OpenMP + MPI Ocean Modeling (HW Profile) • % configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/libo • Integrated OpenMP + MPI events • FP instructions
