Analysis Infrastructure for CQoS using TAU Sameer Shende, Allen D. Malony and Alan Morris {sameer, malony, amorris}@cs.uoregon.edu Department of Computer and Information Science Performance Research Laboratory, NeuroInformatics Center University of Oregon
Acknowledgement • Jaideep Ray, SNL • Lois McInnes, ANL • David Bernholdt, ORNL • Boyana Norris, ANL • Robert Yelle, U. Oregon
Outline • Motivation: CQoS • Instrumentation • Measurement • Analysis tools
CQoS in GAMESS
• Robert Yelle, PRL, U. Oregon, ryelle@uoregon.edu
• Calculate the energy of the thiophene molecule using different algorithms:
  FINAL U-B3LYP ENERGY IS -552.9083139587 AFTER 21 ITERATIONS
  FINAL U-BLYP ENERGY IS -552.9861184848 AFTER 22 ITERATIONS
  FINAL UHF ENERGY IS -551.3483315053 AFTER 11 ITERATIONS
  FINAL U-SVWN ENERGY IS -550.2734639639 AFTER 22 ITERATIONS
TAU Performance System Framework • Tuning and Analysis Utilities • Performance system framework for scalable parallel and distributed high-performance computing • Targets a general complex system computation model • nodes / contexts / threads • Multi-level: system / software / parallelism • Measurement and analysis abstraction • Integrated toolkit for performance instrumentation, measurement, analysis, and visualization • Portable, configurable performance profiling/tracing facility • Open software approach • University of Oregon, LANL, FZJ Germany • http://www.cs.uoregon.edu/research/paracomp/tau
TAU Performance System Architecture [Figure: TAU architecture diagram, including an "event selection" stage]
Performance Evaluation Alternatives • Depthlimit profile • Callpath/callgraph profile • Parameter profile • Trace • Phase profile • Flat profile • Each alternative can be collected with one metric/counter or with multiple counters [Figure: the alternatives arranged along an axis labeled "Volume of performance data"]
Enhancements in TAU to support CQoS
• Instrumentation
  • Runtime MPI wrapper interposition for CCA framework instrumentation
  • Automatic proxy component creation for classic and SIDL components
  • PDT v3.10 (coming, beta released) supports EDG v3.8 for better C/C++ parsing (GNU extensions, BOOST, ASM statements)
• Profile Measurement
  • Parameter based profiling to capture application data
  • Context events to capture the callpath together with user event information
  • Support for memory profiling and memory leak detection
  • Timestamped profile snapshots (coming)
• Analysis
  • Extensions to PerfDMF to support model storage
  • Application specific metadata
  • ParaProf extensions to display profile snapshots, parameter based profiles
  • PerfExplorer data mining framework
  • Web based access to performance database via a TAU portal
  • Ability to store images, share data, metadata
TAU’s CCA Performance Component: Core API • Measurement port and interfaces • Timer • set name/type/group • start/stop • Phase • set name/type/group • start/stop • Control • enable/disable groups • Query • get timer names, get metric names, get user-defined event names • get timer data, get user-defined event data, dump data to disk • Event • set name, trigger event • Context Event (callpath of routines + user event information) • set name, trigger event • MemoryTracker and MemoryHeadroomTracker • enable interrupt tracking, track memory/headroom here, set interrupt interval • enable/disable tracking memory/headroom
TAU’s CCA Interfaces • Performance evaluation using Performance component • Uses underlying TAU library for measurement • Timer, Phase, Event/ContextEvent, Control, Query, MemoryTracker/MemoryHeadroomTracker interfaces • Lightweight instrumentation option • Performance modeling using Mastermind component • Tracks per-invocation performance data • Associates performance data with application data • Method arguments logged with performance data • Callpath information • Helps us build performance models • Updated performance component 1.7.2 released Jan. ’07
Timer Interface

  interface Timer {
    /* Start/stop the Timer */
    void start();
    void stop();
    /* Set/get the Timer name */
    void setName(in string name);
    string getName();
    /* Set/get Timer type information (e.g., signature of the routine) */
    void setType(in string name);
    string getType();
    /* Set/get the group name associated with the Timer */
    void setGroupName(in string name);
    string getGroupName();
    /* Set/get the group id associated with the Timer */
    void setGroupId(in long group);
    long getGroupId();
  }

  interface Measurement extends gov.cca.Port {
    /* Create a Timer */
    Timer createTimer();
    Timer createTimerWithName(in string name);
    Timer createTimerWithNameType(in string name, in string type);
    Timer createTimerWithNameTypeGroup(in string name, in string type, in string group);
    ...
  }

Phase Interface

  interface Phase {
    /* Start/stop the Phase */
    void start();
    void stop();
    /* Set/get the Phase name */
    void setName(in string name);
    string getName();
    /* Set/get Phase type information (e.g., signature of the routine) */
    void setType(in string name);
    string getType();
    /* Set/get the group name associated with the Phase */
    void setGroupName(in string name);
    string getGroupName();
    /* Set/get the group id associated with the Phase */
    void setGroupId(in long group);
    long getGroupId();
  }

  interface Measurement extends gov.cca.Port {
    /* Create a Phase */
    Phase createPhase();
    Phase createPhaseWithName(in string name);
    Phase createPhaseWithNameType(in string name, in string type);
    Phase createPhaseWithNameTypeGroup(in string name, in string type, in string group);
    ...
  }
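To make the Timer semantics concrete, here is a minimal C++ sketch of a start/stop timer with a name and a group. It is a stand-in written for illustration only: the class name SketchTimer, the calls() and totalSeconds() accessors, and the use of std::chrono are assumptions, not the generated SIDL binding or TAU's implementation.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>

// Hypothetical stand-in for the Timer interface; not TAU's implementation.
class SketchTimer {
public:
    explicit SketchTimer(std::string name = "", std::string group = "DEFAULT")
        : name_(std::move(name)), group_(std::move(group)) {}

    void setName(const std::string& name) { name_ = name; }
    std::string getName() const { return name_; }
    void setGroupName(const std::string& group) { group_ = group; }
    std::string getGroupName() const { return group_; }

    // start() and stop() must be paired; each pair counts as one call.
    void start() {
        assert(!running_);
        begin_ = Clock::now();
        running_ = true;
    }
    void stop() {
        assert(running_);
        total_ += Clock::now() - begin_;
        ++calls_;
        running_ = false;
    }

    long calls() const { return calls_; }
    double totalSeconds() const {
        return std::chrono::duration<double>(total_).count();
    }

private:
    using Clock = std::chrono::steady_clock;
    std::string name_;
    std::string group_;
    Clock::time_point begin_{};
    Clock::duration total_{};  // accumulated time over all start/stop pairs
    long calls_ = 0;
    bool running_ = false;
};
```

A caller would construct one timer per routine, wrap the routine body in start() and stop(), and read the accumulated time afterwards.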
Measurement Proxy Component • Interpose a proxy component for each port • Inside the proxy, make calls to the Performance component for each invocation [Figure: component wiring diagram with the labels Driver, Go, IntegratorPort (provides/uses), IntegratorProxy, MidpointIntegrator, MeasurementPort, and Performance]
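The interposition idea can be sketched in plain C++ without the CCA framework. The proxy provides the same port as the real integrator, forwards every call, and measures around it. The method signature follows integrate(double, double, int) as it appears elsewhere in these slides; the class layout, the integrand passed to the constructor, and the std::chrono timing are assumptions made for this sketch (a real proxy would call the Performance component's MeasurementPort instead).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <utility>

// Port interface shared by the real component and its proxy.
class IntegratorPort {
public:
    virtual ~IntegratorPort() = default;
    virtual double integrate(double lowBound, double upBound, int count) = 0;
};

// Callee: midpoint rule over `count` equal subintervals.
class MidpointIntegrator : public IntegratorPort {
public:
    explicit MidpointIntegrator(std::function<double(double)> f)
        : f_(std::move(f)) {}
    double integrate(double lowBound, double upBound, int count) override {
        assert(count > 0);
        const double h = (upBound - lowBound) / count;
        double sum = 0.0;
        for (int i = 0; i < count; ++i) sum += f_(lowBound + (i + 0.5) * h);
        return sum * h;
    }
private:
    std::function<double(double)> f_;
};

// Proxy: same port, forwards each call and measures around it.
class IntegratorProxy : public IntegratorPort {
public:
    explicit IntegratorProxy(IntegratorPort& target) : target_(target) {}
    double integrate(double lowBound, double upBound, int count) override {
        const auto begin = std::chrono::steady_clock::now();
        const double result = target_.integrate(lowBound, upBound, count);
        elapsed_ += std::chrono::steady_clock::now() - begin;
        ++invocations_;
        return result;
    }
    long invocations() const { return invocations_; }
    double elapsedSeconds() const {
        return std::chrono::duration<double>(elapsed_).count();
    }
private:
    IntegratorPort& target_;
    std::chrono::steady_clock::duration elapsed_{};
    long invocations_ = 0;
};
```

Because the driver only sees an IntegratorPort, it can be connected to the proxy instead of the integrator without any change to either component.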
MasterMind Component
• Idea: Create a performance model for the component by tracking performance per invocation
• Uses Monitor Port
• Outputs:
  • Times per invocation (sample output below)
  • Component call path
  • Regular performance data (uses performance component)

  # integ_proxy::integrate(double, double, int)
  # MPI_TIME  Time  count  lowBound  upBound
    72420      336  10000  0         1
    407        449  1000   0         1
    364        540  100    0         1
    64838      844  10000  0         1
    381        945  1000   0         1
    332       1027  100    0         1
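As a first step from per-invocation records toward a performance model, the logged rows can be grouped by a method argument and averaged. The sketch below does this for the six sample rows of integ_proxy::integrate, grouping by count. The struct and function names are hypothetical, and the field names simply follow the column header of the sample output; the slides do not state the units or the exact meaning of each column.

```cpp
#include <map>
#include <utility>
#include <vector>

// One logged invocation; field names follow the sample output's column header.
struct InvocationRecord {
    double mpiTime;
    double time;
    int count;
    double lowBound;
    double upBound;
};

// The six sample rows listed for integ_proxy::integrate(double, double, int).
std::vector<InvocationRecord> slideRecords() {
    return {
        {72420, 336, 10000, 0, 1},
        {407, 449, 1000, 0, 1},
        {364, 540, 100, 0, 1},
        {64838, 844, 10000, 0, 1},
        {381, 945, 1000, 0, 1},
        {332, 1027, 100, 0, 1},
    };
}

// Mean of the MPI_TIME column for each distinct value of the count argument.
std::map<int, double> meanMpiTimeByCount(
        const std::vector<InvocationRecord>& records) {
    std::map<int, std::pair<double, int>> acc;  // count -> (sum, samples)
    for (const auto& r : records) {
        auto& a = acc[r.count];
        a.first += r.mpiTime;
        ++a.second;
    }
    std::map<int, double> mean;
    for (const auto& kv : acc) mean[kv.first] = kv.second.first / kv.second.second;
    return mean;
}
```

For the sample rows this gives 68629 for count = 10000, 394 for count = 1000, and 348 for count = 100, in whatever unit the MPI_TIME column uses.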
Monitor Proxy Component • Same idea (from the user’s point of view) [Figure: component wiring diagram with the labels Driver, Go, IntegratorPort (provides/uses), Integrator Monitor Proxy, MidpointIntegrator, MonitorPort, MasterMind, MeasurementPort, and Performance]
Tools Included with MasterMind Component • Tree pruner • Input: • Callgraph generated by Mastermind component • User specified rules • Output: • Pruned callgraph with insignificant nodes removed • Performance modeling library – brute force • Tries all possible permutations of component instances • Input: performance model of each component • Selects optimal component assembly for the ensemble • Optimizer • Swaps one component instance with another
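The brute-force selection can be sketched as an exhaustive walk over every combination of candidate implementations, keeping the one with the lowest modeled cost. Everything named here is hypothetical: bestAssembly, the choices vector (number of candidates per component slot), and the caller-supplied cost function standing in for the per-component performance models. It illustrates the search strategy only, not the library's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// choices[i]: number of candidate implementations for component slot i.
// cost(assembly): modeled cost when slot i uses candidate assembly[i].
// Returns the candidate indices of the cheapest assembly found.
std::vector<int> bestAssembly(
        const std::vector<int>& choices,
        const std::function<double(const std::vector<int>&)>& cost) {
    for (int c : choices) assert(c > 0);
    std::vector<int> current(choices.size(), 0);
    std::vector<int> best;
    double bestCost = std::numeric_limits<double>::infinity();
    while (true) {
        const double c = cost(current);
        if (c < bestCost) {
            bestCost = c;
            best = current;
        }
        // Advance a mixed-radix counter to the next combination.
        std::size_t i = 0;
        while (i < current.size() && ++current[i] == choices[i]) {
            current[i] = 0;
            ++i;
        }
        if (i == current.size()) break;  // wrapped around: all combinations seen
    }
    return best;
}
```

The number of evaluations is the product of the entries in choices, so this approach only suits small ensembles; the optimizer that swaps one instance at a time explores far fewer assemblies.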
TAU’s Proxy Generator for SIDL/Classic CCA
• Generate regular measurement proxy or monitor (MasterMind) proxy
• Arguments and options:
  -c <component name>                  Full name of the component
  -t <type name>                       Type of component
  -p <port name>                       Name of port to generate proxy for
  -d <pdbfile name>                    Name of pdb file created from cxxparse
  -h <header file>                     Header file for this port
  -n <proxy name>                      Name of the proxy component (default: base of component name + Proxy)
  -o <output filename>                 Name of output file (default: proxy.cc)
  -f <selective instrumentation file>  Use pre-generated selective instrumentation file
  -x <tag>                             Namespace tag
  -m                                   Generate MasterMind component proxy
TAU’s Proxy Generator for Classic C++ Interface
• Creating PDB files:
  cxxparse <file.cpp> -I<dir> -D<flags>
• Merging PDB files:
  pdbmerge -o merged.pdb file1.pdb file2.pdb …
• Invoking tau_pg (example):
  tau_pg -c integrators::ccaports::Integrator -t integrators.ccaports.Integrator -p IntegratorPort -d ParallelIntegrator_CCA.pdb -o Proxy.cc -h ports/Integrator_CCA.h -f select.dat
What’s Going On Here? [Figure: several application components call the Performance Component; alternative implementations of the performance component sit behind the TAU API or another API; the runtime produces TAU performance data]
Multi-Level Instrumentation
• Inter-Component
  • Proxy components created automatically
  • Proxy interposed between caller and callee
• Intra-Component
  • PDT based source instrumentation
  • Compiler scripts
    • mpif90 => tau_f90.sh
    • mpicxx => tau_cxx.sh
    • mpicc => tau_cc.sh
• Framework level MPI instrumentation
  • Shared library MPI based CCAFFEINE framework
  • LD_PRELOAD based interposition of MPI wrapper
    • mpirun -np 4 ./ccafe-batch
    • mpirun -np 4 tau_load.sh ./ccafe-batch
Parameter Based Profiling for CQoS
• Idea: partition performance data for individual functions based on runtime parameters
• Enable by configuring with -PROFILEPARAM
• TAU call: TAU_PROFILE_PARAM1L(value, "name")
• Simple example:

  void foo(long input) {
    TAU_PROFILE("foo", "", TAU_DEFAULT);
    TAU_PROFILE_PARAM1L(input, "input");
    ...
  }
Parameter Based Profiling • 5 seconds spent in function “foo” becomes • 2 seconds for “foo [ <input> = <25> ]” • 1 second for “foo [ <input> = <5> ]” • … • Demonstrated in MPI wrapper library • Allows for partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node) • Can be extrapolated to infer specifics about the MPI subsystem and system as a whole
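The partitioning can be pictured as keeping two tables: the usual flat entry per routine and an additional entry per routine and parameter value. The sketch below is a simplified model of that bookkeeping, not TAU's internals; ParamProfile, paramKey, and record are hypothetical names, and the key format mirrors the "foo [ <input> = <25> ]" labels shown above.

```cpp
#include <map>
#include <string>

// Builds a label such as: foo [ <input> = <25> ]
std::string paramKey(const std::string& routine, const std::string& param,
                     long value) {
    return routine + " [ <" + param + "> = <" + std::to_string(value) + "> ]";
}

// Hypothetical bookkeeping: flat time per routine plus time per parameter value.
class ParamProfile {
public:
    void record(const std::string& routine, const std::string& param,
                long value, double seconds) {
        flat_[routine] += seconds;
        byParam_[paramKey(routine, param, value)] += seconds;
    }
    // Total time for the routine across all parameter values.
    double flat(const std::string& routine) const {
        auto it = flat_.find(routine);
        return it == flat_.end() ? 0.0 : it->second;
    }
    // Time attributed to one parameter value of the routine.
    double partition(const std::string& routine, const std::string& param,
                     long value) const {
        auto it = byParam_.find(paramKey(routine, param, value));
        return it == byParam_.end() ? 0.0 : it->second;
    }
private:
    std::map<std::string, double> flat_;
    std::map<std::string, double> byParam_;
};
```

Recording 2 seconds for input 25, 1 second for input 5, and 2 seconds for some other input leaves the flat total for foo at 5 seconds while exposing how that time splits by parameter value.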
Workload Characterization
• Simple example: send/receive messages of doubling size (up to 32 MB)

  #include <stdio.h>
  #include <mpi.h>

  int buffer[8*1024*1024];

  /* Intended for 2 ranks: rank 0 sends, rank 1 receives. */
  int main(int argc, char **argv) {
    int rank, size, i, j;
    MPI_Init(&argc, &argv);
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    for (i=0;i<1000;i++)
      /* j is the element count; it doubles from 1 up to the buffer size */
      for (j=1;j<=8*1024*1024;j*=2) {
        if (rank == 0) {
          MPI_Send(buffer,j,MPI_INT,1,42,MPI_COMM_WORLD);
        } else {
          MPI_Status status;
          MPI_Recv(buffer,j,MPI_INT,0,42,MPI_COMM_WORLD,&status);
        }
      }
    MPI_Finalize();
    return 0;
  }
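The set of message sizes this loop produces can be worked out without running MPI. The short sketch below enumerates the element counts of the inner loop and converts them to bytes; the helper names are hypothetical, and the byte conversion assumes MPI_INT corresponds to the C int type on the platform.

```cpp
#include <cstddef>
#include <vector>

// Element counts sent by the inner loop: j = 1, 2, 4, ..., 8*1024*1024.
std::vector<std::size_t> messageCounts() {
    std::vector<std::size_t> counts;
    const std::size_t limit = 8u * 1024u * 1024u;
    for (std::size_t j = 1; j <= limit; j *= 2) counts.push_back(j);
    return counts;
}

// Bytes in a message of `count` MPI_INT elements (assumes MPI_INT == C int).
std::size_t messageBytes(std::size_t count) {
    return count * sizeof(int);
}
```

The loop visits 24 distinct counts, from 1 element up to 8*1024*1024 elements; with a 4-byte int the largest message is the full 32 MB buffer.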
Workload Characterization
• Use tau_load.sh to instrument MPI routines (SGI Altix)
  % icc mpi.c -lmpi
  % mpirun -np 2 tau_load.sh -XrunTAU-icpc-mpi-pdt.so a.out
[Figure: two profile panels, "Intel MPI (SGI Altix)" and "SGI MPI (SGI Altix)"]
Workload Characterization • Two different message sizes (~3.3MB and ~4K)
Parameter Based Profiling: SIDL Interface

  package Performance version 1.7.2 {
    interface Timer {
      /* Start/stop the Timer */
      void start();
      void stop();
      /* Set Profile Parameter */
      void setParam1L(in long value, in string name);
      ...
    }
  }
Performance Data Mining (PerfExplorer) • Performance knowledge discovery framework • Data mining analysis applied to parallel performance data • comparative, clustering, correlation, dimension reduction, … • Use the existing TAU infrastructure • TAU performance profiles, PerfDMF • Client-server based system architecture • Technology integration • Java API and toolkit for portability • PerfDMF • R-project/Omegahat, Octave/Matlab statistical analysis • WEKA data mining package • JFreeChart for visualization, vector output (EPS, SVG)
PerfExplorer - Interface [Figure: PerfExplorer screenshot with a "Select analysis" callout]
Summary • Create component version of GAMESS, identify interfaces • Work with GAMESS and other application teams to apply TAU for inter- and intra-component instrumentation • Gather requirements for swapping components • Generate proxy components for applications, gather performance data, store results in the performance database • Cross-experiment application performance characterization • Develop prototype for CQoS • http://www.cs.uoregon.edu/research/paracomp/tau/cca
Support Acknowledgements • Department of Energy (DOE) • Office of Science contracts • University of Utah DOE ASCI Level 1 sub-contract • DOE ASC/NNSA Level 3 contract • LLNL, LANL, ANL contracts • NSF Software and Tools for High-End Computing Grant • Research Centre Juelich • John von Neumann Institute for Computing • Dr. Bernd Mohr