
Integrated MPI/OpenMP Performance Analysis


Presentation Transcript


  1. Integrated MPI/OpenMP Performance Analysis • KAI Software Lab, Intel Corporation & Pallas GmbH • Bob Kuhn, bob.kuhn@intel.com • Hans-Christian Hoppe, hoppe@pallas.com

  2. Outline • Why integrated MPI/OpenMP programming? • A performance tool for MPI/OpenMP programming (Phase 1) • Integrated performance analysis capability for ASCI Apps (Phase 2)

  3. Why Integrate MPI and OpenMP? • Hardware trends • Simple example – how is it done now? • An FEA example • ASCI examples

  4. Parallel Hardware Keeps Coming • Recent example: LLNL ASCI clusters (Linux NetworX via SGI Federal, HPCWire 8/31/01) • Parallel Capacity Resource (PCR) cluster: three clusters totaling 472 Pentium 4s, the largest with 252; theoretical peak 857 gigaFLOP/s • Parallel global file system cluster: 48 Pentium 4 processors, 1,024 clients/servers; delivers I/O rates of over 32 GB/s; fail-over and global lock manager; Linux open source

  5. [Diagram: programming models (OpenMP, MPI, MPI/OpenMP) and tools (debuggers, IDEs, performance analysis tools) positioned by effort, parallelism, and code performance]

  6. Cost-Effective Parallelism Long Term • Wealth of parallelism experience, from single-person codes to large teams

  7. ASCI Ultrascale Tools Project • PathForward project • RTS – Parallel System Performance • Ten goals in three areas: • Scalability – work with 10,000+ processors • Integration – hardware monitors, object orientation, and the runtime environment • Ease of use – dynamic instrumentation; be prescriptive, not just manage data

  8. Architecture for Ultrascale Performance • Guide – source instrumentation • Vampirtrace – MPI/OpenMP instrumentation • Vampir – MPI analysis • GuideView – OpenMP analysis • Tool flow: application source -> Guide (Guidetrace library) -> object files -> Vampirtrace library -> executable -> trace file -> Vampir and GuideView

  9. Phase One Goals – Integrated MPI/OpenMP • Integrated MPI/OpenMP tracing • Mode most compatible with ASCI systems • Whole-program profiling • Integrate the program profile with parallelism • Increased scalability of performance analysis • 1,000 processors

  10. Vampir – Integrated MPI/OpenMP • SWEEP3D run on 4 MPI tasks with 4 OpenMP threads each • Threaded activity during OpenMP regions • Timeline shows OpenMP regions with a glyph

  11. GuideView – Integrated MPI/OpenMP & Profile • SWEEP3D run on 4 MPI tasks, each with 4 OpenMP threads • All OpenMP regions for a process are summarized into one bar • Highlight (red arrow) shows the speedup curve for that set of threads • Thread view shows the balance between MPI tasks and threads

  12. GuideView – Integrated MPI/OpenMP & Profile • Profile allows comparison of MPI, OpenMP, and application activity, inclusive and exclusive • Sorting and filtering bring large amounts of information down to a manageable level

  13. Guide – Compiler Workhorse • Automatic subroutine entry and exit instrumentation for Fortran and C/C++ • New compiler options: • -WGtrace – link with the Vampirtrace library • -WGprof – subroutine entry/exit profiling • -WGprof_leafprune – minimum size of procedures to retain in the profile

  14. Vampirtrace – Profiling: Support for Pruning of Short Routines • Routines whose enter-to-exit time is below a threshold Δt are pruned from the trace; the calling routine (ROUTINE X in the slide's example) is marked as having its call-tree info summarized • A routine such as ROUTINE Z that has entered but not yet exited may still turn out to be shorter than Δt, so its events cannot yet be written • All events that have not been pruned are then written to the trace file
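To make the pruning idea concrete, here is a minimal sketch in C of a delta-t check; the frame_t type, prune_dt value, and on_exit_event() hook are illustrative inventions, not the actual Vampirtrace implementation:

      /* Minimal sketch of delta-t leaf pruning (illustrative only):
         a routine's events are held back until its exit is seen; if the
         enter-to-exit span is shorter than prune_dt, the events are
         dropped and the caller is marked as having a summarized call tree. */
      #include <stdio.h>

      typedef struct {
          const char *name;
          double      enter_time;
          int         summarized;   /* set when pruned children were dropped */
      } frame_t;

      static double prune_dt = 1.0e-4;   /* threshold delta-t, e.g. 100 microseconds */

      /* Called at routine exit; 'parent' may be NULL for the outermost routine. */
      static void on_exit_event(frame_t *self, frame_t *parent, double exit_time) {
          double span = exit_time - self->enter_time;
          if (span < prune_dt) {
              if (parent) parent->summarized = 1;   /* drop events, flag the caller */
          } else {
              printf("write ENTRY/EXIT for %s (%.6f s)%s\n", self->name, span,
                     self->summarized ? " [calltree info summarized]" : "");
          }
      }

      int main(void) {
          frame_t x = { "ROUTINE_X", 0.0,     0 };
          frame_t y = { "ROUTINE_Y", 0.00001, 0 };
          on_exit_event(&y, &x, 0.00002);   /* shorter than delta-t: pruned, X flagged */
          on_exit_event(&x, NULL, 0.5);     /* long enough: written to the trace */
          return 0;
      }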

  15. Scalability in Phase One • Timeline scaling to 256 tasks/nodes • Gathering the tasks in a node into a group • Filtering by nodes • Expanding each node • Message statistics by node

  16. Phase Two – Integrating Capabilities for ASCI Apps • Phase Two Goals – • Deployment to other platforms – • Compaq, CPlant, SGI • Thread-Safety • Scalability – • Grouping • Statistical Analysis • Integrated GuideView • Hardware performance monitors • Dynamic control of instrumentation • Environmental awareness

  17. Thread Safety • Collect data from each thread • Thread-safe Vampirtrace library • Per-thread profiling data (in the previous release, only the master thread logged data) • Improves the accuracy of the data • Value to users • Enhances integration between MPI and OpenMP • Enhances visibility into the functional balance between threads

  18. Scalability: Grouping • Up to the end of FY00: fixed hierarchy levels (system, nodes, CPUs) and fixed grouping of processes; e.g., impossible to reflect communicators • Need more levels: threads are a fourth group; systems have deeper hierarchies (30T) • Reduce the number of on-screen entities for scalability • [Diagram: whole system -> Node 1 ... Node n (quadboards) -> CPU 1 ... CPU c, with processes T_1 ... T_p and threads t_1 ... t_c]

  19. Default Grouping • By nodes • By processes • By master threads • All threads • Can be changed in the configuration file

  20. Scalability: Grouping • Filter Processes dialog with a Select Groups combo-box • Display of groups: by aggregation or by a representative • Grouping applies to timeline bars and counter streams

  21. Scalability by Grouping • Parallelism display showing all threads • Parallelism display showing only master threads, alternating between MPI and OpenMP parallelism

  22. Statistical Information Gathering • Collects basic statistics at runtime • Saves the statistics in an ASCII file • View the statistics in your favorite spreadsheet • Reduced overhead compared to tracing • [Diagram: the parallel executable produces either a big trace file or a small stats file; the stats file passes through a Perl filter into Excel or another spreadsheet]

  23. Statistical Information Gathering • Can work independently of tracing • Significantly lower overhead (memory, runtime) • Restriction: statistics cover the whole application run

  24. Statistical Information Gathering

  25. GuideView Integrated Inside Vampir • Creating an extension API in Vampir: extensions can insert menu items, include new displays, and access trace data and statistics • [Diagram: the Vampir GUI engine invokes the new GuideView control display, which accesses the trace data held in memory; both draw through the Motif graphics library]

  26. New GuideView – Whole Program View • Goals: improve MPI/OpenMP integration, improve scalability, integrate look and feel • Works like the old GuideView! • Load time – fast!

  27. New GuideView – Region View • Looks like the old Region view turned on its side! • Scalability test: 16 MPI tasks, 16 OpenMP threads, 300 parallel regions

  28. Hardware Performance Monitors • The user can call the HPM API in the source code • The user can define events in a config file for Guide instrumentation • HPM counter events are also logged from the Guidetrace and Vampirtrace libraries • The underlying HPM library is PAPI • [Diagram: application source and config file go through Guide/Guidetrace to object files, are linked with the Vampirtrace and PAPI libraries into the executable, and the resulting trace file is viewed in Vampir and GuideView]

  29. PAPI – Hardware Performance Monitors • Standardizes counter names across platforms (counters the hardware does not support cannot be provided) • Users define counter sets • The user could instrument by hand, but better: counters are instrumented at OpenMP constructs and subroutines

      int main(int argc, char **argv)
      {
          int set_id;
          int inner, outer, other;

          /* Create a new event set to measure L1 & L2 data cache misses. */
          set_id = VT_create_event_set("MySet");
          VT_add_event(set_id, PAPI_L1_DCM);
          VT_add_event(set_id, PAPI_L2_DCM);

          VT_symdef(outer, "OUTER", "USERSTATES");
          VT_symdef(inner, "INNER", "USERSTATES");
          VT_symdef(other, "OTHER", "USERSTATES");

          /* Activate the event set. */
          VT_change_hpm(set_id);

          /* Collect the events over two user-defined intervals. */
          VT_begin(outer);
          foo();
          VT_begin(inner);
          bar();
          VT_end(inner);
          foo();
          VT_end(outer);
      }

  30. Hardware Performance Example • MPI tasks on the timeline • Or per-MPI-task activity correlated in the same window • Floating-point instructions correlated, but in a different window

  31. Hardware Performance Can Be Rich • 4 x 4 SWEEP3D run showing L1 data cache misses and cycles stalled waiting for memory accesses

  32. Hardware Performance in GuideView • You can see the HPM data in all GuideView windows • L1 data cache misses and cycles stalled due to memory accesses in the per-MPI-task profile view

  33. Derived Hardware Counters • In this menu you can arithmetically combine measured counters into derived counters • Vampir and GuideView displays present the derived counters
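For example (a hypothetical derived counter, not one shown on the slide), an L1 data-cache miss rate could be defined by dividing the measured PAPI_L1_DCM counter (L1 data-cache misses) by PAPI_L1_DCA (L1 data-cache accesses).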

  34. Environmental Counters • Select rusage information like HPMs • Data appears in Vampir and GuideView like HPM data • Time-varying OS counters • A config variable sets the sampling frequency • Difficult to attribute to source code precisely
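As a rough sketch of the kind of per-process data rusage exposes, the standard POSIX getrusage() call is shown below; this only illustrates the counters involved, not how the tracing library samples them over time:

      #include <stdio.h>
      #include <sys/resource.h>

      int main(void) {
          struct rusage ru;
          if (getrusage(RUSAGE_SELF, &ru) == 0) {
              /* A few of the OS counters that could be sampled as they vary. */
              printf("max resident set size : %ld (kB on Linux)\n", ru.ru_maxrss);
              printf("major page faults     : %ld\n", ru.ru_majflt);
              printf("voluntary ctx switches: %ld\n", ru.ru_nvcsw);
          }
          return 0;
      }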

  35. Environmental Awareness • Type 1: collects IBM MPI information • Treated as a static (one-time) event in the trace file • Over 50 parameters

  36. Dynamic Control of Instrumentation • In the source, the user puts VT_confsync() calls • At runtime, TotalView is attached and a breakpoint is inserted • From process #0, the user adjusts several instrumentation settings • The VTconfigchanged flag is set and the breakpoint is exited • The trace file reflects the change after the next VT_confsync() • [Diagram: application source is compiled by Guide into object files and linked with the Vampirtrace library into the executable; TotalView attaches to the executable, and the resulting trace file is viewed in Vampir and GuideView]
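A minimal placement sketch, assuming VT_confsync() takes no arguments (the slide names the call but not its signature) and using a hypothetical do_timestep() routine for the application work:

      #include <mpi.h>
      #include <stdio.h>

      /* Assumed signature; the slide only names the VT_confsync() call. */
      extern void VT_confsync(void);

      /* Hypothetical stand-in for one step of application work. */
      static void do_timestep(int step) { printf("step %d\n", step); }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          for (int step = 0; step < 100; step++) {
              do_timestep(step);
              /* Instrumentation settings changed from the attached debugger
                 take effect for the tracing that follows this call. */
              VT_confsync();
          }
          MPI_Finalize();
          return 0;
      }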

  37. Dynamic Control of Instrumentation

  38. Structured Trace Files: Frames Manage Scalability • A section of the timeline • A set of processors • Messages or collectives • OpenMP regions • Instances of a subroutine

  39. Structured Trace Files Consist of Frames • Frames are defined in the source code:

      int VT_framedef(char *name, unsigned int type_mask, int *frame_handle)
      int VT_framestart(int frame_handle)
      int VT_framestop(int frame_handle)

  • type_mask defines the types of data collected: VT_FUNCTION, VT_REGION, VT_PAR_REGION, VT_OPENMP, VT_COUNTER, VT_MESSAGE, VT_COLL_OP, VT_COMMUNICATION, VT_ALL • Analysis of selected time frames will be available
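A usage sketch under the declarations above; the VT_MESSAGE value, the exchange_boundaries() routine, and the frame name are placeholders, and error handling is omitted:

      /* Declarations as transcribed from the slide; in practice they would
         come from the Vampirtrace header. */
      extern int VT_framedef(char *name, unsigned int type_mask, int *frame_handle);
      extern int VT_framestart(int frame_handle);
      extern int VT_framestop(int frame_handle);

      #define VT_MESSAGE 0x1u   /* placeholder value for the slide's constant */

      static void exchange_boundaries(void) { /* hypothetical MPI communication phase */ }

      void traced_phase(void) {
          int frame;
          /* Define a frame that collects only message events. */
          VT_framedef((char *)"boundary_exchange", VT_MESSAGE, &frame);

          VT_framestart(frame);
          exchange_boundaries();   /* events raised here land in this frame */
          VT_framestop(frame);
      }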

  40. Structured Trace Files: Rapid Access by Frames • 1) The structured trace file consists of an index file plus a set of frames • 2) Vampir thumbnail displays represent the frames • 3) Selecting a thumbnail displays that frame in Vampir

  41. Object Oriented Performance Analysis • How to avoid SOOX (Scalability Object Oriented eXplosion): instrument with an API • C++ templates and classes make it much easier • Can be used with or without source • Uses the TAU model • [Diagram: events such as MPI_Send, MPI_Recv, MPI_Finalize and functions Func A, X, Y, Z, Init are gathered by Informers (I_A ... I_D), which map through InformerMappings (ImQ, ImX, ImY, ImZ) onto VT activities]

  42. Example of OO Informers

      class Matrix {
      public:
          InformerMapping im;

          Matrix(int rows, int columns) {
              if (rows * columns > 500)
                  im.Rename("LargeMatrix");
              else
                  im.Rename("Matrix");
          }

          void invert() {
              Informer(im, "invert", 12, 15, "Example.C");
              #pragma omp parallel
              { .... }
              MPI_Send(...);
          }

          void compl() {
              Informer(im, "typeid(...)");
              ....
          }
      };

      int main(int argc, char **argv) {
          Matrix A(10,10), B(512,512), C(1000,1000);   // line 1
          B.im.Rename("MediumMatrix");                 // line 2
          A.invert();                                  // line 3
          B.compl();                                   // line 4
          C.invert();                                  // line 5
      }

  • Line 1 creates three Matrix instances: A (mapped to the "Matrix" bin), B (mapped to the "LargeMatrix" bin), and C (mapped to the "LargeMatrix" bin) • Line 2 remaps B to the "MediumMatrix" bin • Line 3: A.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the Matrix bin • Line 4: B.compl() is traced; entry and exit events are collected and associated with "Matrix:void compl(void)" in the MediumMatrix bin • Line 5: C.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the LargeMatrix bin

  43. Vampir OO Timeline Shows Informer Bins • InformerMappings: each bin is displayed as a Vampir activity • MPI is put into a separate activity with the same prefix • Events are renamed to a 'mangled name' of the form InformerMapping:Informer:NormalEventName

  44. Vampir OO Profile Shows Informer Bins • Time in classes: Queens • MPI time in class: Queens

  45. OO GuideView Shows Regions in Bins Time and counter data per thread by Bin

  46. Parallel Performance Engineering • ASCI Ultrascale Performance Tools • Scalability • Integration • Ease of Use • Read about what was presented • ftp://ftp.kai.com/private/Lab_notes_2001.doc.gz • Contact: seon.w.kim@intel.com • Thank you for your attention!
