Integrated MPI/OpenMP Performance Analysis
KAI Software Lab, Intel Corporation & Pallas GmbH
Bob Kuhn, bob.kuhn@intel.com
Hans-Christian Hoppe, hoppe@pallas.com
Outline
• Why integrated MPI/OpenMP programming?
• A performance tool for MPI/OpenMP programming (Phase 1)
• Integrated performance analysis capability for ASCI Apps (Phase 2)
Why Integrate MPI and OpenMP?
• Hardware trends
• Simple example – how is it done now?
• An FEA example
• ASCI examples
Parallel Hardware Keeps Coming
Recent example: LLNL ASCI clusters
• Parallel Capacity Resource (PCR) cluster – three clusters totaling 472 Pentium 4s, the largest with 252; theoretical peak 857 gigaFLOP/s; Linux NetworX via SGI Federal (HPCWire, 8/31/01)
• Parallel global file system cluster – 48 Pentium 4 processors in total; 1,024 clients/servers; delivers I/O rates of over 32 GB/s; fail-over and global lock manager; Linux open source; Linux NetworX via SGI Federal (HPCWire, 8/31/01)
[Diagram: programming effort vs. degree of parallelism, positioning debuggers/IDEs, code performance tools, OpenMP, MPI, and MPI/OpenMP performance analysis tools]
Cost-Effective Parallelism Long Term
• Wealth of parallelism experience, from single-person codes to large teams
ASCI Ultrascale Tools Project
• PathForward project
• RTS – Parallel System Performance
• Ten goals in three areas:
  – Scalability – work with 10,000+ processors
  – Integration – hardware monitors, object-oriented codes, and the runtime environment
  – Ease of use – dynamic instrumentation and prescriptive guidance, not just data management
Architecture for Ultrascale Performance
• Guide – source instrumentation
• Vampirtrace – MPI/OpenMP instrumentation
• Vampir – MPI analysis
• GuideView – OpenMP analysis
[Diagram: application source is compiled by Guide into object files, linked with the Guidetrace and Vampirtrace libraries into an executable; the executable produces a tracefile analyzed by Vampir and GuideView]
Phase One Goals – Integrated MPI/OpenMP
• Integrated MPI/OpenMP tracing
  – Mode most compatible with ASCI systems
• Whole-program profiling
  – Integrate the program profile with parallelism
• Increased scalability of performance analysis
  – 1,000 processors
Vampir – Integrated MPI/OpenMP
• SWEEP3D run on 4 MPI tasks with 4 OpenMP threads each
• Threaded activity during an OpenMP region
• Timeline shows OpenMP regions with a glyph
GuideView – Integrated MPI/OpenMP & Profile
• SWEEP3D run on 4 MPI tasks, each with 4 OpenMP threads
• All OpenMP regions for a process summarized to one bar
• Highlight (red arrow) shows the speedup curve for that set of threads
• Thread view shows the balance between MPI tasks and threads
GuideView – Integrated MPI/OpenMP & Profile
• The profile allows inclusive and exclusive comparison of MPI, OpenMP, and application activity
• Sorting and filtering reduce large amounts of information to a manageable level
Guide – Compiler Workhorse
• Compilation of OpenMP
• Automatic subroutine entry and exit instrumentation for Fortran and C/C++
• New compiler options:
  – -WGtrace – link with the Vampirtrace library
  – -WGprof – subroutine entry/exit profiling
  – -WGprof_leafprune – minimum size of procedures to retain in the profile
Vampirtrace – Profiling
• Support for pruning of short routines
• Entry/exit events are buffered until a routine's duration is known; events that survive pruning are then written to the tracefile
• A call subtree shorter than the threshold Δt is pruned, and its caller (ROUTINE X) is marked as having its calltree information summarized
• A still-open routine (ROUTINE Z) may yet turn out to be shorter than Δt, so its events cannot be written yet
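A toy sketch in C of the pruning rule described above (not the actual Vampirtrace implementation; the threshold value and event structure are invented for illustration):

  /* Once a routine's entry and exit times are both known, its call record is
     either written to the trace or dropped because its duration falls below
     the threshold dt, in which case the caller would be marked as summarized. */
  #include <stdio.h>

  #define PRUNE_DT 0.001                  /* threshold in seconds (assumed value) */

  struct call_event { const char *name; double t_entry, t_exit; };

  /* Decide whether one completed call may be written to the tracefile. */
  static int write_if_long_enough(const struct call_event *ev)
  {
      double duration = ev->t_exit - ev->t_entry;
      if (duration >= PRUNE_DT) {
          printf("TRACE  %-10s entry=%.6f exit=%.6f\n",
                 ev->name, ev->t_entry, ev->t_exit);
          return 1;                       /* event kept */
      }
      printf("PRUNED %-10s (%.6f s < dt) -- caller marked as summarized\n",
             ev->name, duration);
      return 0;                           /* event dropped */
  }

  int main(void)
  {
      struct call_event x = { "ROUTINE_X", 0.000, 0.050 };  /* long enough */
      struct call_event y = { "ROUTINE_Y", 0.010, 0.0102 }; /* too short   */
      write_if_long_enough(&x);
      write_if_long_enough(&y);
      return 0;
  }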
Scalability in Phase One
• Timeline scaling to 256 tasks/nodes
• Gathering the tasks on a node into a group
• Filtering by nodes
• Expanding each node
• Message statistics by node
Phase Two – Integrating Capabilities for ASCI Apps
• Deployment to other platforms – Compaq, CPlant, SGI
• Thread safety
• Scalability – grouping, statistical analysis
• Integrated GuideView
• Hardware performance monitors
• Dynamic control of instrumentation
• Environmental awareness
Thread Safety
• Collect data from each thread:
  – Thread-safe Vampirtrace library
  – Per-thread profiling data (in the previous release, only the master thread logged data)
  – Improves accuracy of the data
• Value to users:
  – Enhances integration between MPI and OpenMP
  – Enhances visibility into the functional balance between threads
Scalability: Grouping
• Up to the end of FY00:
  – Fixed hierarchy levels (system, nodes, CPUs)
  – Fixed grouping of processes (e.g., impossible to reflect communicators)
• Need more levels:
  – Threads are a fourth group
  – Systems with deeper hierarchies (30T)
• Reduce the number of on-screen entities for scalability
[Diagram: hierarchy from the whole system through nodes and quadboards down to CPUs and threads]
Default Grouping
• By nodes
• By processes
• By master threads
• All threads
• Can be changed in the configuration file
Scalability: Grouping
• Filter-processes dialog with a select-groups combo box
• Display of groups: by aggregation or by representative
• Grouping applies to timeline bars and counter streams
Scalability by Grouping
• Parallelism display showing all threads
• Parallelism display showing only master threads, alternating between MPI and OpenMP parallelism
Statistical Information Gathering
• Collects basic statistics at runtime
• Saves statistics in an ASCII file
• View statistics in your favorite spreadsheet
• Reduced overhead compared to tracing
[Diagram: the parallel executable writes either a big tracefile or a small statsfile; a Perl filter converts the statsfile for Excel and other tools]
Statistical Information Gathering
• Can work independently of tracing
• Significantly lower overhead (memory, runtime)
• Restriction: statistics cover the whole application run
GuideView Integrated Inside Vampir
• Creating an extension API in Vampir to:
  – Insert menu items
  – Include new displays
  – Access trace data & statistics
[Diagram: the Vampir GUI engine invokes the new GuideView control and display, which access trace data in memory; graphics through the Motif library]
New GuideView – Whole Program View
• Goals:
  – Improve MPI/OpenMP integration
  – Improve scalability
  – Integrate look and feel
• Works like the old GuideView!
• Load time – fast!
New GuideView – Region View
• Looks like the old Region view turned on its side!
• Scalability test: 16 MPI tasks, 16 OpenMP threads, 300 parallel regions
Hardware Performance Monitors
• The user can call the HPM API in the source code
• The user can define events in a config file for Guide instrumentation
• HPM counter events are also logged from the Guidetrace and Vampirtrace libraries
• The underlying HPM library is PAPI
[Diagram: application source and config file pass through Guide to object files; the executable is linked with the Guidetrace and Vampirtrace libraries and PAPI, producing a tracefile for Vampir and GuideView]
PAPI – Hardware Performance Monitors
• Standardizes counter names across platforms
• Users define counter sets
• The user could instrument by hand, but better: counters are instrumented at OpenMP constructs and subroutines
• Counters the platform does not provide cannot be supported

  int main(int argc, char **argv)
  {
      int set_id;
      int outer = 1, inner = 2, other = 3;   /* user-chosen state codes */

      /* Create a new event set to measure L1 & L2 data cache misses */
      set_id = VT_create_event_set("MySet");
      VT_add_event(set_id, PAPI_L1_DCM);
      VT_add_event(set_id, PAPI_L2_DCM);

      /* Define user states for the intervals of interest */
      VT_symdef(outer, "OUTER", "USERSTATES");
      VT_symdef(inner, "INNER", "USERSTATES");
      VT_symdef(other, "OTHER", "USERSTATES");

      /* Activate the event set */
      VT_change_hpm(set_id);

      /* Collect the events over two user-defined intervals */
      VT_begin(outer);
      foo();
      VT_begin(inner);
      bar();
      VT_end(inner);
      foo();
      VT_end(outer);
  }
Hardware Performance Example
• MPI tasks on the timeline
• Or, per-MPI-task activity correlated in the same window
• Floating-point instructions correlated, but in a different window
Hardware Performance Can Be Rich
• 4 x 4 SWEEP3D run showing L1 data cache misses and cycles stalled waiting for memory accesses
Hardware Performance in GuideView
• HPM data is visible on all GuideView windows
• L1 data cache misses and cycles stalled due to memory accesses in the per-MPI-task profile view
Derived Hardware Counters
• In this menu you can arithmetically combine measured counters into derived counters
• Vampir and GuideView displays present derived counters
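As a hypothetical example of such a derived counter (the counter names below are standard PAPI presets; this particular combination is an illustration, not one taken from the talk):

  L1 miss rate = PAPI_L1_DCM / PAPI_LD_INS

i.e., L1 data cache misses per load instruction, which is usually easier to compare across runs than raw miss counts.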
Environmental Counters
• Select rusage information like HPMs
• Data appears in Vampir and GuideView like HPM data
• Time-varying OS counters:
  – A config variable sets the sampling frequency
  – Difficult to attribute to source code precisely
Environmental Awareness
• Type 1: collects IBM MPI information
  – Treated as a static (one-time) event in the tracefile
  – Over 50 parameters
Dynamic Control of Instrumentation
• In the source, the user puts VT_confsync() calls
• At runtime, TotalView is attached and a breakpoint is inserted
• From process #0, the user adjusts several instrumentation settings
• The VTconfigchanged flag is set and the breakpoint is exited
• The tracefile reflects the change after the next VT_confsync()
[Diagram: application source compiled by Guide, linked with the Vampirtrace library into the executable; TotalView attaches to the executable; the tracefile feeds Vampir and GuideView]
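A minimal sketch of where such calls might sit in application code (the call name VT_confsync() is taken from the slide; its prototype, the surrounding loop, and compute_step() are assumptions for illustration):

  #include <mpi.h>

  /* Assumed prototype; the real declaration comes from the Vampirtrace headers. */
  void VT_confsync(void);

  /* Illustrative application work for one timestep. */
  static void compute_step(int step) { (void)step; }

  int main(int argc, char **argv)
  {
      int step, nsteps = 1000;
      MPI_Init(&argc, &argv);

      for (step = 0; step < nsteps; step++) {
          compute_step(step);
          /* Synchronization point for instrumentation settings: changes made
             interactively (e.g., through TotalView on process #0) take effect
             in the trace after this call. */
          VT_confsync();
      }

      MPI_Finalize();
      return 0;
  }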
Structured Trace Files – Frames Manage Scalability
A frame can cover:
• A section of the timeline
• A set of processors
• Messages or collectives
• OpenMP regions
• Instances of a subroutine
Structured Trace Files Consist of Frames
• Frames are defined in the source code:
  int VT_framedef(char *name, unsigned int type_mask, int *frame_handle)
  int VT_framestart(int frame_handle)
  int VT_framestop(int frame_handle)
• type_mask defines the types of data collected:
  VT_FUNCTION, VT_REGION, VT_PAR_REGION, VT_OPENMP, VT_COUNTER, VT_MESSAGE, VT_COLL_OP, VT_COMMUNICATION, VT_ALL
• Analysis of time frames will be available
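A minimal usage sketch of the frame calls above (the prototypes and mask constants are as reconstructed from the slide; the frame name, mask combination, and enclosed code are illustrative only):

  /* A frame around the solver phase, collecting only message and OpenMP data. */
  void trace_solver_phase(void)
  {
      int frame_handle;

      VT_framedef("solver_phase", VT_MESSAGE | VT_OPENMP, &frame_handle);
      VT_framestart(frame_handle);
      /* ... solver iterations of interest ... */
      VT_framestop(frame_handle);
  }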
Structured Trace Files – Rapid Access by Frames
1) A structured tracefile consists of an index file plus individual frames
2) Vampir thumbnail displays represent the frames
3) Selecting a thumbnail displays that frame in Vampir
Object Oriented Performance Analysis
• How to avoid SOOX (Scalability Object Oriented eXplosion) – instrument with an API
• C++ templates and classes make it much easier
• Can be used with or without source
• Uses the TAU model
[Diagram: events such as MPI_Send, MPI_Recv, MPI_Finalize, and application functions map to Informers, which in turn map to InformerMappings / VT activities]
Example of OO Informers

  class Matrix {
  public:
      InformerMapping im;
      Matrix(int rows, int columns) {
          if (rows * columns > 500)
              im.Rename("LargeMatrix");
          else
              im.Rename("Matrix");
      }
      void invert() {
          Informer(im, "invert", 12, 15, "Example.C");
          #pragma omp parallel
          { .... }
          MPI_Send(...);
      }
      void compl() {
          Informer(im, "typeid(...)");
          ....
      }
  };

  int main(int argc, char **argv) {
      Matrix A(10,10), B(512,512), C(1000,1000);  // line 1
      B.im.Rename("MediumMatrix");                // line 2
      A.invert();                                 // line 3
      B.compl();                                  // line 4
      C.invert();                                 // line 5
  }

• Line 1 creates three Matrix instances: A (mapped to the "Matrix" bin), B and C (mapped to the "LargeMatrix" bin)
• Line 2 remaps B to the "MediumMatrix" bin
• Line 3: A.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the Matrix bin
• Line 4: B.compl() is traced; entry and exit events are collected and associated with "Matrix:void compl(void)" in the MediumMatrix bin
• Line 5: C.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the LargeMatrix bin
Vampir OO Timeline Shows Informer Bins
• InformerMappings: each bin is displayed as a Vampir activity
• MPI is put into a separate activity with the same prefix
• Events are renamed with a 'mangled name': InformerMapping:Informer:NormalEventName
Vampir OO Profile Shows Informer Bins
• Time in classes: Queens
• MPI time in class: Queens
OO GuideView Shows Regions in Bins
• Time and counter data per thread, by bin
Parallel Performance Engineering
• ASCI Ultrascale Performance Tools: scalability, integration, ease of use
• Read about what was presented: ftp://ftp.kai.com/private/Lab_notes_2001.doc.gz
• Contact: seon.w.kim@intel.com
• Thank you for your attention!