Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting
Bernd Mohr, Felix Wolf, Forschungszentrum Jülich, John von Neumann-Institut für Computing, Zentralinstitut für Angewandte Mathematik, 52425 Jülich, {b.mohr,f.wolf}@fz-juelich.de
Allen Malony, Sameer Shende, University of Oregon, Department of Computer and Information Science, Eugene, Oregon 97403, {malony,sameer}@cs.uoregon.edu
Outline • Introduction • Proposed OpenMP Performance Tool Interface • Prototype Implementation • Examples • Future Work
Introduction • Motivation • “Standard” OpenMP performance tools interface similar in spirit to the MPI profiling interface (PMPI) • Goals • Expose OpenMP parallel execution to the performance measurement system • Define it at the abstraction level of the OpenMP programming model • Make the performance measurement interface portable • across different platforms • across all OpenMP supported languages • across different performance tools • Allow flexibility in how the interface is applied
Proposed OpenMP Performance Tool Interface • POMP • OpenMP Directive Instrumentation • OpenMP Runtime Library Routine Instrumentation • Performance Monitoring Library Control • User Code Instrumentation • Context Descriptors • Conditional Compilation • Conditional / Selective Transformations • Remarks • C/C++ OpenMP Pragma Instrumentation • Implementation Issues • Open Issues
OpenMP Directive Instrumentation • Insert calls to pomp_NAME_TYPE(d) at appropriate places around directives • NAME: name of the OpenMP construct • TYPE: • fork, join: mark a change in the degree of parallelism • enter, exit: flag entering/exiting an OpenMP construct • begin, end: mark start/end of the body of a construct • d: context descriptor • Observation of the implicit barrier at DO, SECTIONS, WORKSHARE, SINGLE constructs • Add NOWAIT to construct • Make barrier explicit
Example: !$OMP PARALLEL DO Instrumentation
Original:
    !$OMP PARALLEL DO clauses...
    do loop
    !$OMP END PARALLEL DO
Instrumented:
    call pomp_parallel_fork(d)
    !$OMP PARALLEL other-clauses...
    call pomp_parallel_begin(d)
    call pomp_do_enter(d)
    !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
    do loop
    !$OMP END DO NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_do_exit(d)
    call pomp_parallel_end(d)
    !$OMP END PARALLEL
    call pomp_parallel_join(d)
OpenMP Runtime Library Routine Instrumentation • Transform • omp_###_lock() → pomp_###_lock() • omp_###_nest_lock() → pomp_###_nest_lock() [ ### = init | destroy | set | unset | test ] • POMP version • Calls omp version internally • Can do extra stuff before and after call • Transformations of other OpenMP API functions necessary?
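The wrapper idea on this slide can be sketched in C as follows. This is only an illustration under stated assumptions: the two counters and the function demo_lock_wrappers are hypothetical, the serial stand-ins for the omp lock routines exist only so the sketch builds without OpenMP support, and a real POMP library would record timestamps or events instead of counting.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without OpenMP support. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Hypothetical measurement state: how often the wrappers were used. */
static int lock_acquires = 0;
static int lock_releases = 0;

void pomp_init_lock(omp_lock_t *l)    { omp_init_lock(l); }
void pomp_destroy_lock(omp_lock_t *l) { omp_destroy_lock(l); }

void pomp_set_lock(omp_lock_t *l) {
    /* "extra stuff" before the call would go here, e.g. reading a timer */
    omp_set_lock(l);          /* the omp version is called internally */
    lock_acquires++;          /* safe: the lock is held at this point */
}

void pomp_unset_lock(omp_lock_t *l) {
    lock_releases++;          /* still holding the lock here */
    omp_unset_lock(l);
}

/* What user code looks like after omp_###_lock() -> pomp_###_lock(). */
int demo_lock_wrappers(void) {
    omp_lock_t lock;
    int i, shared = 0;
    pomp_init_lock(&lock);
    for (i = 0; i < 3; i++) {
        pomp_set_lock(&lock);
        shared += i;
        pomp_unset_lock(&lock);
    }
    pomp_destroy_lock(&lock);
    assert(shared == 3);
    assert(lock_acquires == lock_releases);
    return lock_acquires;
}
```

Updating the counters while the lock is held keeps the sketch race-free only when all threads use the same lock; a real library would need per-thread or atomic bookkeeping.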
Performance Monitoring Library Control • Give programmer control over performance monitoring at runtime • !$OMP INST [ INIT | FINALIZE | ON | OFF ] • Translated into • pomp_init(), pomp_finalize() • pomp_on(), pomp_off() • Ignored in “normal” OpenMP compilation mode • Alternatives • !$POMP? • Use conditional compilation with explicit POMP calls
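One way a measurement library might honor these four control calls is sketched below. The state variables, pomp_record_event (a stand-in for any pomp_NAME_TYPE(d) routine), and demo_control are hypothetical, and the choice that pomp_init() switches recording on by default is an assumption, not something the slides specify.

```c
#include <assert.h>

/* Hypothetical library state. */
static int  pomp_initialized = 0;
static int  pomp_tracing     = 0;
static long pomp_events      = 0;

/* Assumption: recording is enabled by default after initialization. */
void pomp_init(void)     { pomp_initialized = 1; pomp_tracing = 1; pomp_events = 0; }
void pomp_finalize(void) { pomp_tracing = 0; pomp_initialized = 0; }
void pomp_on(void)       { if (pomp_initialized) pomp_tracing = 1; }
void pomp_off(void)      { pomp_tracing = 0; }

/* Stand-in for any pomp_NAME_TYPE(d): records only while tracing is on. */
void pomp_record_event(void) { if (pomp_tracing) pomp_events++; }

long demo_control(void) {
    pomp_init();          /* !$OMP INST INIT     */
    pomp_record_event();  /* recorded            */
    pomp_off();           /* !$OMP INST OFF      */
    pomp_record_event();  /* dropped             */
    pomp_record_event();  /* dropped             */
    pomp_on();            /* !$OMP INST ON       */
    pomp_record_event();  /* recorded            */
    pomp_finalize();      /* !$OMP INST FINALIZE */
    assert(!pomp_tracing);
    return pomp_events;
}
```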
User Code Instrumentation • Compiler / transformation tool should insert pomp_begin(d) and pomp_end(d) calls at the beginning and end of each(?) user function • Allow user-specified arbitrary (non-function) code regions:
    !$OMP INST BEGIN ( <region name> )
    arbitrary user code
    !$OMP INST END ( <region name> )
• Alternatives • !$POMP? • Use conditional compilation with explicit POMP calls (descriptor?)
Context Descriptors • Describe execution contexts through context descriptor
    typedef struct ompregdescr {
      char* name;                      /* construct */
      char* sub_name;                  /* region name */
      int num_sections;
      char* filename;                  /* src filename */
      int begin_line1, begin_lineN;    /* begin line # */
      int end_line1, end_lineN;        /* end line # */
      WORD data[4];                    /* perf. data */
      struct ompregdescr* next;
    } OMPRegDescr;
• Generate context descriptors in global static memory:
    OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };
• Pass address to POMP functions
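A compilable sketch of how such a descriptor might be generated and consumed is shown below. It makes several assumptions that are not in the slides: string fields are declared as const char pointers, WORD is replaced by void* (assuming it is a pointer-sized slot), and ToolRegion, tool_region_for, and demo_descriptor are hypothetical names for a tool attaching its own record to data[0] on first use.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ompregdescr {
    const char *name;              /* construct */
    const char *sub_name;          /* region name */
    int num_sections;
    const char *filename;          /* src filename */
    int begin_line1, begin_lineN;  /* begin line # */
    int end_line1, end_lineN;      /* end line # */
    void *data[4];                 /* perf. data; assumption: WORD is pointer-sized */
    struct ompregdescr *next;
} OMPRegDescr;

/* As emitted by the instrumentor: static storage, data[] and next start as zero. */
static OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };

/* Hypothetical per-region record owned by the measurement tool. */
typedef struct { long visits; } ToolRegion;
static ToolRegion tool_regions[8];
static int tool_regions_used = 0;

/* Attach a tool record on first use (not thread-safe in this sketch). */
static ToolRegion *tool_region_for(OMPRegDescr *r) {
    if (r->data[0] == NULL) {
        assert(tool_regions_used < 8);
        r->data[0] = &tool_regions[tool_regions_used++];
    }
    return (ToolRegion *)r->data[0];
}

/* Only one argument, the descriptor address, crosses the call boundary. */
void pomp_critical_enter(OMPRegDescr *r) { tool_region_for(r)->visits++; }

long demo_descriptor(void) {
    pomp_critical_enter(&rd42675);
    pomp_critical_enter(&rd42675);
    return ((ToolRegion *)rd42675.data[0])->visits;
}
```

Because the descriptor lives in static memory and is initialized at compile time, the POMP routine does not allocate anything for the source information itself; only the tool's own record is set up lazily.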
Conditional Compilation • C, C++, [Fortran, if supported]
    #ifdef _POMP
    arbitrary user code
    #endif
• Fortran Free Form
    !P$ arbitrary user code
• Fortran Fixed Form
    CP$ arbitrary
    *P$ user
    !P$ code
• Usual restrictions apply
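For the C case, the #ifdef _POMP guard might be used as in the following sketch. The function names and the counter are hypothetical; the only element taken from the slides is the _POMP macro, which is assumed to be defined only when the instrumentor processes the file.

```c
#include <assert.h>

/* Hypothetical counter updated only by instrumentation-specific code. */
static int pomp_user_marks = 0;

int user_marks(void) { return pomp_user_marks; }

int compute(int n) {
    int i, sum = 0;
    assert(n >= 0);
#ifdef _POMP
    pomp_user_marks++;    /* compiled only when _POMP is defined */
#endif
    for (i = 1; i <= n; i++) sum += i;
    return sum;
}

/* Reports whether this translation unit was built with _POMP defined. */
int instrumentation_compiled_in(void) {
#ifdef _POMP
    return 1;
#else
    return 0;
#endif
}
```

In a normal build the guarded line disappears entirely, so the uninstrumented program pays no cost for it.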
Conditional / Selective Transformations • (Temporarily) disable / re-enable POMP instrumentation at compile time • !$OMP NOINSTRUMENT • !$OMP INSTRUMENT • Alternative: • !$POMP?
C/C++ OpenMP Pragma Instrumentation • No END pragmas • instrumentation for “closing” part follows structured block • adding nowait has to be done in the “opening part” • Body instrumentation wraps the structured block:
    #pragma omp XXX
    { pomp_###_begin(d); structured block; pomp_###_end(d); }
• Simple differences in language • no “call” keyword • “;” after each statement • !$OMP → #pragma omp
Example: #pragma omp sections Instrumentation
Original:
    #pragma omp sections
    {
      #pragma omp section
        structured block;
      #pragma omp section
        structured block;
    }
Instrumented:
    pomp_sections_enter(d);
    #pragma omp sections nowait
    {
      #pragma omp section
      { pomp_section_begin(d);
        structured block;
        pomp_section_end(d); }
      #pragma omp section
      { pomp_section_begin(d);
        structured block;
        pomp_section_end(d); }
    }
    pomp_barrier_enter(d);
    #pragma omp barrier
    pomp_barrier_exit(d);
    pomp_sections_exit(d);
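The instrumented sections pattern above can be tried out with the C sketch below. The reduced OMPRegDescr, the event counter, and demo_sections are hypothetical stand-ins; the POMP routines here only count calls. Without OpenMP support the pragmas are ignored and the code runs serially; with it, the orphaned sections construct is executed by a single thread, so the plain counter is adequate for this sketch.

```c
#include <assert.h>

typedef struct { const char *name; } OMPRegDescr;   /* reduced stand-in */

static int n_events = 0;                 /* hypothetical event log: a counter */
static void log_event(void) { n_events++; }

void pomp_sections_enter(OMPRegDescr *d) { (void)d; log_event(); }
void pomp_sections_exit(OMPRegDescr *d)  { (void)d; log_event(); }
void pomp_section_begin(OMPRegDescr *d)  { (void)d; log_event(); }
void pomp_section_end(OMPRegDescr *d)    { (void)d; log_event(); }
void pomp_barrier_enter(OMPRegDescr *d)  { (void)d; log_event(); }
void pomp_barrier_exit(OMPRegDescr *d)   { (void)d; log_event(); }

static OMPRegDescr d = { "sections" };

int demo_sections(void) {
    int a = 0, b = 0;
    pomp_sections_enter(&d);
#pragma omp sections nowait
    {
#pragma omp section
        { pomp_section_begin(&d); a = 1; pomp_section_end(&d); }
#pragma omp section
        { pomp_section_begin(&d); b = 2; pomp_section_end(&d); }
    }
    /* nowait removed the implicit barrier; the explicit one is measurable */
    pomp_barrier_enter(&d);
#pragma omp barrier
    pomp_barrier_exit(&d);
    pomp_sections_exit(&d);
    assert(a + b == 3);
    return n_events;
}
```

Inside a real parallel region each thread would call these routines concurrently, so an actual library would keep per-thread event buffers rather than one shared counter.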
Implementation Issues • pomp_NAME_TYPE(d) more efficient / simpler than pomp_event(POMP_TYPE, POMP_NAME, fname, line#, ...) • Inlining of POMP calls possible • Context descriptors • Full context information available, incl. source reference • But minimal runtime overhead • just one argument needs to be passed • no need to dynamically allocate memory for data!! • context data initialization at compile time • Context data is kept together with executable • Allows for separate compilation • Potentially too much overhead for ATOMIC, CRITICAL, MASTER, SINGLE, and OpenMP lock calls • --pomp-disable=construct-list
Open Issues • ORDERED? • FLUSH? • Instrumentation of PARALLEL DO / FOR loop iterations • Potentially allows measurement of influence of loop scheduling policies • Overhead?? • Allow passing additional user information to POMP library • Conditional compilation • Extra parameter to !$OMP INST BEGIN/END • ... • Specification of extent of user code instrumentation • Additional pragmas/directives? • Separate (outside source code) specification? • OpenMP Runtime Instrumentation necessary?
Prototype Implementation: OPARI • OpenMP Pragma And Region Instrumentor (OPARI) • Source-to-source translator to insert POMP calls around OpenMP constructs and API functions • Supports • Fortran 77 and Fortran 90, OpenMP 2.0 • C and C++, OpenMP 1.0 • Runtime Library Control (init, finalize, on, off) • (Manual) User Code Instrumentation (begin, end) • Conditional Compilation (#ifdef _POMP, !P$) • Conditional / Selective Transformation ([no]instrument) • Preserves source code information (#line line file) • ~ 2000 lines of C++ code
OPARI • Limitations • Fortran: • END DO and END PARALLEL DO directives required • atomic expression on line by itself • C/C++: • structured blocks: simple expression statement or block (compound statement) • Exception: for statement after parallel for • Could be fixed by enhancing OPARI’s parsing capabilities • Source code and documentation available at http://www.fz-juelich.de/zam/kojak/opari/
Prototype Implementation: POMP Library • EXtensible PERformance Tool (EXPERT) • Automatic event trace analyzer • http://www.fz-juelich.de/zam/kojak/expert/ • Tuning and Analysis Utilities (TAU) • Performance analysis framework • http://www.acl.lanl.gov/tau/ • Required ~ 1 day to implement tool specific POMP libraries
Prototype Implementation: EXPERT POMP Library
    void pomp_for_enter(OMPRegDescr* r) {
      /* Get EPILOG region descriptor stored in r */
      ElgRegion* e = (ElgRegion*)(r->data[0]);
      /* If not yet there, initialize and store it */
      if (! e) e = ElgRegion_Init(r);
      /* Record enter event */
      elg_enter(e->rid);
    }

    void pomp_for_exit(OMPRegDescr* r) {
      /* Record collective exit event */
      elg_omp_collexit();
    }
Prototype Implementation: TAU POMP Library
    TAU_GLOBAL_TIMER(tfor, "for enter/exit", "[OpenMP]", OpenMP);

    void pomp_for_enter(OMPRegDescr* r) {
    #ifdef TAU_AGGREGATE_OPENMP_TIMINGS
      TAU_GLOBAL_TIMER_START(tfor);
    #endif
    #ifdef TAU_OPENMP_REGION_VIEW
      TauStartOpenMPRegionTimer();
    #endif
    }

    void pomp_for_exit(OMPRegDescr* r) {
      ...
    }
Examples • EXPERT • REMO: Weather Forecast • DKRZ Germany • MPI + OpenMP (experimental) • TAU • Stommel: Ocean Circulation Simulation • SDSC • MPI + OpenMP • event trace based Vampir • profile based RACY
Future Work • Measure typical POMP calling overhead • EPCC OpenMP Microbenchmarks? • Investigate “formal” standardization with OpenMP forum [OpenMP Supplemental Standard?] • OpenMP programmers: • What do you expect from an OpenMP performance tool? • Tool developers: • Download and try out OPARI • Implement POMP interface for your tool • Tell us about problems, comments, enhancements • OpenMP ARB members: • What do we need to do next?
Conclusion • POMP OpenMP Performance Tool Interface • Portable • Flexible • Efficient • Defined at the abstraction level of the OpenMP programming model • Standard? • Prototype Software • OpenMP Pragma And Region Instrumentor (OPARI) http://www.fz-juelich.de/zam/kojak/opari/ • Tuning and Analysis Utilities (TAU) http://www.acl.lanl.gov/tau/
!$OMP PARALLEL Instrumentation
    call pomp_parallel_fork(d)
    !$OMP PARALLEL
    call pomp_parallel_begin(d)
    structured block
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_parallel_end(d)
    !$OMP END PARALLEL
    call pomp_parallel_join(d)
!$OMP DO Instrumentation
    call pomp_do_enter(d)
    !$OMP DO
    do loop
    !$OMP END DO NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_do_exit(d)
!$OMP WORKSHARE Instrumentation
    call pomp_workshare_enter(d)
    !$OMP WORKSHARE
    structured block
    !$OMP END WORKSHARE NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_workshare_exit(d)
!$OMP SECTIONS Instrumentation
    call pomp_sections_enter(d)
    !$OMP SECTIONS
    !$OMP SECTION
    call pomp_section_begin(d)
    structured block
    call pomp_section_end(d)
    !$OMP SECTION
    call pomp_section_begin(d)
    structured block
    call pomp_section_end(d)
    !$OMP END SECTIONS NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_sections_exit(d)
Synchronization Constructs Instrumentation 1
    call pomp_single_enter(d)
    !$OMP SINGLE
    call pomp_single_begin(d)
    structured block
    call pomp_single_end(d)
    !$OMP END SINGLE NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_single_exit(d)

    !$OMP MASTER
    call pomp_master_begin(d)
    structured block
    call pomp_master_end(d)
    !$OMP END MASTER
Synchronization Constructs Instrumentation 2
    call pomp_critical_enter(d)
    !$OMP CRITICAL
    call pomp_critical_begin(d)
    structured block
    call pomp_critical_end(d)
    !$OMP END CRITICAL
    call pomp_critical_exit(d)

    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)

    call pomp_atomic_enter(d)
    !$OMP ATOMIC
    atomic expression
    call pomp_atomic_exit(d)
Automatic Analysis • EXtensible PERformance Tool (EXPERT) • programmable, extensible, flexible performance property specification • based on event patterns • analyzes along three hierarchical dimensions • performance properties (general → specific) • dynamic call tree position • location (machine → node → process → thread) • Done: fully functional demonstration prototype • Work in Progress: • optimization / generalization • more performance properties • source code and time line displays
Expert Result Presentation • Interconnected weighted tree browser • scalable, still accurate • Each node has a weight • Percentage of CPU allocation time • i.e. time spent in subtree of call tree • Displayed weight depends on state of node • Collapsed (including weight of descendants) • Expanded (without weight of descendants) • Displayed using • Color: makes it easy to identify hot spots (bottlenecks) • Numerical value: detailed comparison • [Figure: collapsed node “main” shows 100; expanded, “main” shows 10 with children “foo” 30 and “bar” 60]
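The collapsed versus expanded weights on this slide can be reproduced with a small C sketch. The Node type and the function names are hypothetical and are not taken from EXPERT; the numbers (main 10, foo 30, bar 60, total 100) are the ones shown in the slide's example tree.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    const char *name;
    int own;                  /* weight of the node itself, without descendants */
    int expanded;             /* 0 = collapsed, 1 = expanded */
    struct Node *child[4];
    int n_children;
} Node;

/* Weight of a node including all of its descendants. */
int inclusive_weight(const Node *n) {
    int i, w = n->own;
    for (i = 0; i < n->n_children; i++)
        w += inclusive_weight(n->child[i]);
    return w;
}

/* Collapsed nodes show the whole subtree, expanded nodes only themselves. */
int displayed_weight(const Node *n) {
    return n->expanded ? n->own : inclusive_weight(n);
}

int demo_tree(int expanded) {
    Node foo  = { "foo",  30, 0, { NULL }, 0 };
    Node bar  = { "bar",  60, 0, { NULL }, 0 };
    Node root = { "main", 10, 0, { NULL }, 2 };
    root.child[0] = &foo;
    root.child[1] = &bar;
    root.expanded = expanded;
    assert(inclusive_weight(&root) == 100);
    return displayed_weight(&root);
}
```

With this rule the numbers visible on screen always add up to the same total, whichever nodes the user expands.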
Performance Properties View • [Screenshot annotations] • Main Problem: Idle Threads • Fine: User code • Fine: OpenMP + MPI
Dynamic Call Tree View • [Screenshot annotations] • 1st Optimization Opportunity • 2nd Optimization Opportunity • 3rd Optimization Opportunity
Locations View • Supports locations up to Grid scale • Easily allows exploration of load balance problems on different levels • [ Of course, Idle Thread Problem only applies to slave threads ]