Analysis Infrastructure for CQoS using TAU


Presentation Transcript


  1. Analysis Infrastructure for CQoS using TAU Sameer Shende, Allen D. Malony and Alan Morris {sameer, malony, amorris}@cs.uoregon.edu Department of Computer and Information Science Performance Research Laboratory, NeuroInformatics Center University of Oregon

  2. Acknowledgement • Jaideep Ray, SNL • Lois McInnes, ANL • David Bernholdt, ORNL • Boyana Norris, ANL • Robert Yelle, U. Oregon

  3. Outline • Motivation: CQoS • Instrumentation • Measurement • Analysis tools

  4. CQoS in GAMESS
     • Robert Yelle, PRL, U. Oregon ryelle@uoregon.edu
     • Calculate the energy of the thiophene molecule using different algorithms

     FINAL U-B3LYP ENERGY IS -552.9083139587 AFTER 21 ITERATIONS
     FINAL U-BLYP ENERGY IS -552.9861184848 AFTER 22 ITERATIONS
     FINAL UHF ENERGY IS -551.3483315053 AFTER 11 ITERATIONS
     FINAL U-SVWN ENERGY IS -550.2734639639 AFTER 22 ITERATIONS

  5. TAU Performance System Framework • Tuning and Analysis Utilities • Performance system framework for scalable parallel and distributed high-performance computing • Targets a general complex system computation model • nodes / contexts / threads • Multi-level: system / software / parallelism • Measurement and analysis abstraction • Integrated toolkit for performance instrumentation, measurement, analysis, and visualization • Portable, configurable performance profiling/tracing facility • Open software approach • University of Oregon, LANL, FZJ Germany • http://www.cs.uoregon.edu/research/paracomp/tau

  6. TAU Performance System Architecture [Diagram; visible label: event selection]

  7. Performance Evaluation Alternatives
     • Depthlimit profile
     • Callpath/callgraph profile
     • Parameter profile
     • Trace
     • Phase profile
     • Flat profile
     • Each alternative has:
       • one metric/counter
       • multiple counters
     [Figure axis: volume of performance data]

  8. Enhancements in TAU to support CQoS
     • Instrumentation
       • Runtime MPI wrapper interposition for CCA framework instrumentation
       • Automatic proxy component creation for classic and SIDL components
       • PDT v3.10 (coming, beta released) supports EDG v3.8 for better C/C++ parsing support (GNU extensions, BOOST, ASM statements)
     • Profile Measurement
       • Parameter based profiling to capture application data
       • Context events to capture the callpath together with user-defined event information
       • Support for memory profiling and memory leak detection
       • Timestamped profile snapshots (coming)
     • Analysis
       • Extensions to PerfDMF to support model storage
       • Application specific metadata
       • ParaProf extensions to display profile snapshots, parameter based profiles
       • PerfExplorer data mining framework
       • Web based access to performance database via a TAU portal
       • Ability to store images, share data, metadata

  9. TAU’s CCA Performance Component: Core API
     • Measurement port and interfaces
       • Timer
         • set name/type/group
         • start/stop
       • Phase
         • set name/type/group
         • start/stop
       • Control
         • enable/disable groups
       • Query
         • get timer names, get metric names, get user-defined event names
         • get timer data, get user-defined event data, dump data to disk
       • Event
         • set name, trigger event
       • Context Event (callpath of routines + user event information)
         • set name, trigger event
       • MemoryTracker and MemoryHeadroomTracker
         • enable interrupt tracking, track memory/headroom here, set interrupt interval
         • enable/disable tracking memory/headroom

  10. TAU’s CCA Interfaces • Performance evaluation using Performance component • Uses underlying TAU library for measurement • Timer, Phase, Event/ContextEvent, Control, Query, MemoryTracker/MemoryHeadroomTracker interfaces • Lightweight instrumentation option • Performance modeling using Mastermind component • Tracks per-invocation performance data • Associates performance data with application data • Method arguments logged with performance data • Callpath information • Helps us build performance models • Updated performance component 1.7.2 released Jan. ’07

  11. Timer and Phase Interfaces

     Timer Interface

     interface Timer {
       /* Start/stop the Timer */
       void start();
       void stop();
       /* Set/get the Timer name */
       void setName(in string name);
       string getName();
       /* Set/get Timer type information (e.g., signature of the routine) */
       void setType(in string name);
       string getType();
       /* Set/get the group name associated with the Timer */
       void setGroupName(in string name);
       string getGroupName();
       /* Set/get the group id associated with the Timer */
       void setGroupId(in long group);
       long getGroupId();
     }

     interface Measurement extends gov.cca.Port {
       /* Create a Timer */
       Timer createTimer();
       Timer createTimerWithName(in string name);
       Timer createTimerWithNameType(in string name, in string type);
       Timer createTimerWithNameTypeGroup(in string name, in string type, in string group);
       ...
     }

     Phase Interface

     interface Phase {
       /* Start/stop the Phase */
       void start();
       void stop();
       /* Set/get the Phase name */
       void setName(in string name);
       string getName();
       /* Set/get Phase type information (e.g., signature of the routine) */
       void setType(in string name);
       string getType();
       /* Set/get the group name associated with the Phase */
       void setGroupName(in string name);
       string getGroupName();
       /* Set/get the group id associated with the Phase */
       void setGroupId(in long group);
       long getGroupId();
     }

     interface Measurement extends gov.cca.Port {
       /* Create a Phase */
       Phase createPhase();
       Phase createPhaseWithName(in string name);
       Phase createPhaseWithNameType(in string name, in string type);
       Phase createPhaseWithNameTypeGroup(in string name, in string type, in string group);
       ...
     }
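As a rough illustration of how a timer with the start/stop and name accessors listed on this slide might behave, the C++ sketch below accumulates wall-clock time across start/stop pairs. The class name `SketchTimer`, the `elapsedSeconds()` and `calls()` accessors, and the use of `std::chrono` are assumptions made for this example; this is neither TAU's implementation nor the binding generated from the SIDL file.

```cpp
#include <chrono>
#include <string>
#include <utility>

// Hypothetical stand-in for the SIDL Timer interface; not part of TAU.
class SketchTimer {
public:
    explicit SketchTimer(std::string name = "") : name_(std::move(name)) {}

    void setName(const std::string& name) { name_ = name; }
    std::string getName() const { return name_; }

    // Begin a timed interval; a second start() before stop() is ignored.
    void start() {
        if (running_) return;
        running_ = true;
        begin_ = Clock::now();
    }

    // End the current interval and add it to the accumulated total.
    // A stop() without a matching start() is ignored.
    void stop() {
        if (!running_) return;
        total_ += Clock::now() - begin_;
        running_ = false;
        ++calls_;
    }

    double elapsedSeconds() const { return total_.count(); }
    long calls() const { return calls_; }

private:
    using Clock = std::chrono::steady_clock;
    std::string name_;
    bool running_ = false;
    long calls_ = 0;
    Clock::time_point begin_{};
    std::chrono::duration<double> total_{0.0};
};
```

A caller would wrap the region of interest in `timer.start()` and `timer.stop()` and read the totals afterwards; the type and group accessors from the slide are omitted here to keep the sketch short.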

  12. Measurement Proxy Component
     • Interpose a proxy component for each port
     • Inside the proxy
       • Make calls to Performance component for each invocation
     [Diagram labels: Driver (Go, IntegratorPort), IntegratorProxy (IntegratorPort provides, IntegratorPort uses, MeasurementPort), MidpointIntegrator (IntegratorPort), Performance component (MeasurementPort)]
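The interposition idea on this slide can be sketched in plain C++ without the CCA framework. In the sketch below, the proxy exposes the same port as the real integrator, forwards each call, and records the elapsed time per invocation. The types `IntegratorPort`, `MidpointIntegrator`, `MeasurementLog`, and `IntegratorProxy` are hypothetical stand-ins named after the diagram labels, and the integrand f(x) = x*x is an assumption; generated CCA proxies use framework ports rather than C++ references.

```cpp
#include <chrono>
#include <vector>

// Hypothetical port interface; the real IntegratorPort is defined by the application.
struct IntegratorPort {
    virtual ~IntegratorPort() = default;
    virtual double integrate(double lowBound, double upBound, int count) = 0;
};

// Midpoint rule applied to f(x) = x * x, standing in for the real integrator.
struct MidpointIntegrator : IntegratorPort {
    double integrate(double lowBound, double upBound, int count) override {
        if (count <= 0) return 0.0;
        const double h = (upBound - lowBound) / count;
        double sum = 0.0;
        for (int i = 0; i < count; ++i) {
            const double x = lowBound + (i + 0.5) * h;
            sum += x * x;
        }
        return sum * h;
    }
};

// Stand-in for the Performance component's measurement port.
struct MeasurementLog {
    std::vector<double> seconds;  // one entry per proxied invocation
};

// Proxy: provides IntegratorPort to the caller, uses the real IntegratorPort,
// and reports every invocation to the measurement log.
class IntegratorProxy : public IntegratorPort {
public:
    IntegratorProxy(IntegratorPort& target, MeasurementLog& log)
        : target_(target), log_(log) {}

    double integrate(double lowBound, double upBound, int count) override {
        const auto begin = std::chrono::steady_clock::now();
        const double result = target_.integrate(lowBound, upBound, count);
        const std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - begin;
        log_.seconds.push_back(dt.count());
        return result;
    }

private:
    IntegratorPort& target_;
    MeasurementLog& log_;
};
```

Because the proxy and the real component share one interface, the driver does not need to change: it is simply connected to the proxy instead of the integrator.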

  13. MasterMind Component
     • Idea: Create a performance model for the component by tracking performance per invocation
     • Uses Monitor Port
     • Outputs:
       • Times per invocation, e.g. the listing below
       • Component call path
       • Regular performance data (uses performance component)

     # integ_proxy::integrate(double, double, int)
     # MPI_TIME  Time  count  lowBound  upBound
       72420     336   10000  0         1
       407       449   1000   0         1
       364       540   100    0         1
       64838     844   10000  0         1
       381       945   1000   0         1
       332       1027  100    0         1

  14. Monitor Proxy Component
     • Same idea (from the user’s point of view)
     [Diagram labels: Driver (Go, IntegratorPort), Integrator Monitor Proxy (IntegratorPort provides, IntegratorPort uses, MonitorPort), MidpointIntegrator (IntegratorPort), MasterMind (MonitorPort, MeasurementPort), Performance component (MeasurementPort)]

  15. Tools Included with MasterMind Component
     • Tree pruner
       • Input:
         • Callgraph generated by Mastermind component
         • User specified rules
       • Output:
         • Pruned callgraph with insignificant nodes removed
     • Performance modeling library (brute force)
       • Tries all possible permutations of component instances
       • Input: performance model of each component
       • Selects optimal component assembly for the ensemble
     • Optimizer
       • Swaps one component instance with another
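The brute-force selection described on this slide can be sketched as an exhaustive search over one candidate implementation per component slot. The sketch below is only an illustration under stated assumptions: `Assembly`, `selectBestAssembly`, and the cost-model callback are hypothetical names, and the slides do not show the actual library interface or how its performance models are represented.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Result of the search: one implementation name per slot and its predicted cost.
struct Assembly {
    std::vector<std::string> chosen;
    double predictedSeconds = std::numeric_limits<double>::infinity();
};

// Enumerate every combination of one candidate per slot and keep the cheapest.
// `predict` is a caller-supplied model mapping an assembly to a predicted time.
template <typename CostModel>
Assembly selectBestAssembly(const std::vector<std::vector<std::string>>& slots,
                            CostModel predict) {
    Assembly best;
    for (const auto& candidates : slots)
        if (candidates.empty()) return best;  // no valid assembly exists

    std::vector<std::size_t> index(slots.size(), 0);  // odometer over candidates
    std::vector<std::string> current(slots.size());
    while (true) {
        for (std::size_t i = 0; i < slots.size(); ++i)
            current[i] = slots[i][index[i]];
        const double t = predict(current);
        if (t < best.predictedSeconds) {
            best.chosen = current;
            best.predictedSeconds = t;
        }
        // Advance the odometer; stop once every combination has been visited.
        std::size_t pos = 0;
        while (pos < slots.size() && ++index[pos] == slots[pos].size()) {
            index[pos] = 0;
            ++pos;
        }
        if (pos == slots.size()) break;
    }
    return best;
}
```

The search cost grows as the product of the candidate counts per slot, which is why the slide also lists an optimizer that swaps one component instance at a time.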

  16. TAU’s Proxy Generator for SIDL/Classic CCA
     • Generate regular measurement proxy or monitor (MasterMind) proxy
     • Arguments and options:
       -c <component name>                   Full name of the component
       -t <type name>                        Type of component
       -p <port name>                        Name of port to generate proxy for
       -d <pdbfile name>                     Name of pdb file created from cxxparse
       -h <header file>                      Header file for this port
       -n <proxy name>                       Name of the proxy component (default: base of component name + Proxy)
       -o <output filename>                  Name of output file (default: proxy.cc)
       -f <selective instrumentation file>   Use pre-generated selective instrumentation file
       -x <tag>                              Namespace tag
       -m                                    Generate MasterMind component proxy

  17. TAU’s Proxy Generator for Classic C++ Interface
     • Creating PDB Files:
         cxxparse <file.cpp> -I<dir> -D<flags>
     • Merging PDB Files:
         pdbmerge -o merged.pdb file1.pdb file2.pdb …
     • Invoking tau_pg (example):
         tau_pg -c integrators::ccaports::Integrator -t integrators.ccaports.Integrator -p IntegratorPort -d ParallelIntegrator_CCA.pdb -o Proxy.cc -h ports/Integrator_CCA.h -f select.dat

  18. What’s Going On Here? [Diagram labels: Application Component (four instances), Performance Component, TAU API, other API, runtime TAU performance data; callout: alternative implementations of performance component]

  19. Multi-Level Instrumentation
     • Inter-Component
       • Proxy components created automatically
       • Proxy interposed between caller and callee
     • Intra-Component
       • PDT based source instrumentation
       • Compiler scripts
         • mpif90 => tau_f90.sh
         • mpicxx => tau_cxx.sh
         • mpicc => tau_cc.sh
     • Framework level MPI instrumentation
       • Shared library MPI based CCAFFEINE framework
       • LD_PRELOAD based interposition of MPI wrapper
         • mpirun -np 4 ./ccafe-batch
         • mpirun -np 4 tau_load.sh ./ccafe-batch

  21. Parameter Based Profiling for CQoS
     • Idea: partition performance data for individual functions based on runtime parameters
     • Enable by configuring with -PROFILEPARAM
     • TAU call: TAU_PROFILE_PARAM1L(value, "name")
     • Simple example:

     void foo(long input) {
       TAU_PROFILE("foo", "", TAU_DEFAULT);
       TAU_PROFILE_PARAM1L(input, "input");
       ...
     }

  22. Parameter Based Profiling • 5 seconds spent in function “foo” becomes • 2 seconds for “foo [ <input> = <25> ]” • 1 second for “foo [ <input> = <5> ]” • … • Demonstrated in MPI wrapper library • Allows for partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node) • Can be extrapolated to infer specifics about the MPI subsystem and system as a whole
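The partitioning described on this slide can be pictured as keeping two tables: a flat total per routine and a total per (routine, parameter value) pair. The sketch below is a minimal model of that bookkeeping, assuming a hypothetical `ParamProfile` class; it does not show how TAU stores parameter profiles internally, and only the key format is copied from the slide.

```cpp
#include <map>
#include <string>

// Hypothetical accumulator illustrating parameter based partitioning.
class ParamProfile {
public:
    // Record one call of `routine` that took `seconds` with parameter `paramName` = `value`.
    void record(const std::string& routine, const std::string& paramName,
                long value, double seconds) {
        flat_[routine] += seconds;
        partitioned_[key(routine, paramName, value)] += seconds;
    }

    // Build a key in the style shown on the slide: foo [ <input> = <25> ]
    static std::string key(const std::string& routine,
                           const std::string& paramName, long value) {
        return routine + " [ <" + paramName + "> = <" +
               std::to_string(value) + "> ]";
    }

    // Total time for the routine regardless of parameter value.
    double flatSeconds(const std::string& routine) const {
        return lookup(flat_, routine);
    }

    // Time attributed to one parameter-specific key; 0 if never recorded.
    double partitionSeconds(const std::string& k) const {
        return lookup(partitioned_, k);
    }

private:
    static double lookup(const std::map<std::string, double>& m,
                         const std::string& k) {
        const auto it = m.find(k);
        return it == m.end() ? 0.0 : it->second;
    }

    std::map<std::string, double> flat_;
    std::map<std::string, double> partitioned_;
};
```

The partitioned entries for a routine always sum to its flat total, which is what allows the flat time for “foo” to be broken down by input value.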

  23. Workload Characterization
     • Simple example: send/receive messages whose sizes double each step (up to 32 MB)

     #include <stdio.h>
     #include <mpi.h>

     int buffer[8*1024*1024];

     int main(int argc, char **argv) {
       int rank, size, i, j;
       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       for (i = 0; i < 1000; i++)
         for (j = 1; j <= 8*1024*1024; j *= 2) {
           if (rank == 0) {
             /* rank 0 sends j ints to rank 1 */
             MPI_Send(buffer, j, MPI_INT, 1, 42, MPI_COMM_WORLD);
           } else if (rank == 1) {
             /* rank 1 receives; any additional ranks stay idle */
             MPI_Status status;
             MPI_Recv(buffer, j, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
           }
         }
       MPI_Finalize();
       return 0;
     }
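To make the range of message sizes in this example concrete, the short C++ sketch below lists the sizes in bytes that the inner loop produces. The helper name `messageSizesInBytes` is hypothetical, and the 4-byte element size used in the note that follows is an assumption about `int` on the target platform.

```cpp
#include <cstddef>
#include <vector>

// Message sizes visited by a loop of the form
//   for (j = 1; j <= maxCount; j *= 2)
// expressed in bytes for elements of `elementBytes` bytes each.
std::vector<std::size_t> messageSizesInBytes(std::size_t maxCount,
                                             std::size_t elementBytes) {
    std::vector<std::size_t> sizes;
    for (std::size_t j = 1; j <= maxCount; j *= 2) {
        sizes.push_back(j * elementBytes);
        if (j > maxCount / 2) break;  // the next doubling would exceed maxCount
    }
    return sizes;
}
```

With `maxCount = 8*1024*1024` and 4-byte elements this gives 24 distinct sizes, from 4 bytes to 32 MB, and the example sends each of them 1000 times.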

  24. Workload Characterization
     • Use tau_load.sh to instrument MPI routines (SGI Altix)
         % icc mpi.c -lmpi
         % mpirun -np 2 tau_load.sh -XrunTAU-icpc-mpi-pdt.so a.out
     [Figure panels: Intel MPI (SGI Altix); SGI MPI (SGI Altix)]

  25. Workload Characterization • Two different message sizes (~3.3MB and ~4K)

  26. Parameter Based Profiling: SIDL Interface

     package Performance version 1.7.2 {
       interface Timer {
         /* Start/stop the Timer */
         void start();
         void stop();
         /* Set Profile Parameter */
         void setParam1L(in long value, in string name);
         ...
       }

  27. PerfDMF: Performance Data Mgmt. Framework

  28. TAU Portal

  29. TAU Portal https://tau.nic.uoregon.edu

  30. TAU Portal: Application Specific Metadata Storage

  31. Performance Data Mining (PerfExplorer) • Performance knowledge discovery framework • Data mining analysis applied to parallel performance data • comparative, clustering, correlation, dimension reduction, … • Use the existing TAU infrastructure • TAU performance profiles, PerfDMF • Client-server based system architecture • Technology integration • Java API and toolkit for portability • PerfDMF • R-project/Omegahat, Octave/Matlab statistical analysis • WEKA data mining package • JFreeChart for visualization, vector output (EPS, SVG)

  32. Performance Data Mining (PerfExplorer)

  33. PerfExplorer - Interface Select analysis

  34. PerfExplorer - Relative Efficiency Plots

  35. PerfExplorer - Relative Efficiency by Routine

  36. PerfExplorer - Relative Speedup

  37. PerfExplorer - Timesteps Per Second

  38. Summary • Create component version of GAMESS, identify interfaces • Work with GAMESS and other application teams to apply TAU for inter- and intra-component instrumentation • Gather requirements for swapping components • Generate proxy components for applications, gather performance data, store results in performance database • Cross-experiment application performance characterization • Develop prototype for CQoS • http://www.cs.uoregon.edu/research/paracomp/tau/cca

  39. Support Acknowledgements • Department of Energy (DOE) • Office of Science contracts • University of Utah DOE ASCI Level 1 sub-contract • DOE ASC/NNSA Level 3 contract • LLNL, LANL, ANL contracts • NSF Software and Tools for High-End Computing Grant • Research Centre Juelich • John von Neumann Institute for Computing • Dr. Bernd Mohr
