TAU: A Framework for Parallel Performance Analysis • Allen D. Malony (malony@cs.uoregon.edu) • ParaDucks Research Group, Computer & Information Science Department, Computational Science Institute, University of Oregon
Outline • Goals and challenges • Targeted research areas • TAU (Tuning and Analysis Utilities) • computation model, architecture, toolkit framework • performance system technology • examples of TAU use • Tools associated with TAU • PDT (Program Database Toolkit) • distributed runtime monitoring • Future plans • Conclusions Ptools Annual Meeting
Goal and Challenges • Create robust (performance) technology for the analysis and tuning of parallel software and systems • Challenges • different scalable computing platforms • different programming languages and systems • common, portable framework for analysis • extensible, retargetable tool technology • complex set of requirements
Targeted Research Areas • Performance analysis for scalable parallel systems targeting multiple programming and system levels and the mapping between levels • Program code analysis for multiple languages enabling development of new source-based tools • Integration and interoperation support for building analysis tool frameworks and environments • Runtime tool interaction for dynamic applications
TAU (Tuning and Analysis Utilities) • Performance analysis framework for scalable parallel and distributed high-performance computing • Target a general parallel computation model • computer nodes • shared address space contexts • threads of execution • multi-level parallelism • Integrated toolkit for performance instrumentation, measurement, analysis, and visualization • portable performance profiling/tracing facility • open software approach
TAU Architecture [figure not reproduced]
TAU Instrumentation • Flexible, multiple instrumentation mechanisms • source code • manual • automatic using PDT (tau_instrumentor) • object code • pre-instrumented libraries • statically linked: MPI wrapper library using the MPI Profiling Interface (libTauMpi.a) • dynamically linked: Java instrumentation using JVMPI and TAU shared object dynamically loaded in VM • executable code • dynamic instrumentation using DyninstAPI (tau_run)
TAU Instrumentation (continued) • Common target measurement interface (TAU API) • C++ (object-based) instrumentation • macro-based, using constructor/destructor techniques • functions, classes, and templates • uniquely identify functions and templates • name and type signature (name registration) • static object creates performance entry • dynamic object receives static object pointer • runtime type identification for template instantiations • with C and Fortran instrumentation variants • Instrumentation optimization
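The constructor/destructor technique listed above can be sketched as follows. This is a minimal illustration, not the TAU API: `FunctionInfo`, `ScopedTimer`, and `SKETCH_PROFILE` are hypothetical names. A function-local static object registers the routine's name and type signature once, and a per-call automatic object receives a pointer to it, starts timing in its constructor, and records the result in its destructor.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is a sketch, not the TAU API.
// One FunctionInfo exists per instrumented routine (the "static object").
struct FunctionInfo {
    std::string name;       // registered routine name
    std::string signature;  // type signature, separates overloads and templates
    long calls = 0;
    std::chrono::steady_clock::duration inclusive{};

    FunctionInfo(std::string n, std::string sig)
        : name(std::move(n)), signature(std::move(sig)) {
        registry().push_back(this);  // name registration, done once
    }

    // Table of all registered routines.
    static std::vector<FunctionInfo*>& registry() {
        static std::vector<FunctionInfo*> entries;
        return entries;
    }
};

// The "dynamic object": created on every call, holds the static object's pointer.
class ScopedTimer {
public:
    explicit ScopedTimer(FunctionInfo* info)
        : info_(info), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        // Runs on every exit path, including early returns and exceptions.
        info_->inclusive += std::chrono::steady_clock::now() - start_;
        ++info_->calls;
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    FunctionInfo* info_;
    std::chrono::steady_clock::time_point start_;
};

// Macro form: a static entry plus an automatic timer in the enclosing scope.
#define SKETCH_PROFILE(name, sig)                \
    static FunctionInfo sketch_info_(name, sig); \
    ScopedTimer sketch_timer_(&sketch_info_)

// Example instrumented routine.
int work(int n) {
    SKETCH_PROFILE("work", "int (int)");
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += i;
    return sum;
}
```

Calling `work` twice leaves a single entry in `FunctionInfo::registry()` with `calls == 2`, because the static object is constructed only on the first call. The sketch is not thread-safe; a per-thread profile would be needed for the node/context/thread model described in these slides.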
TAU Measurement • Performance information • high resolution timer library (real-time clock) • generalized software counter library • hardware performance counters • PCL (Performance Counter Library) (ZAM, Germany) • PAPI (Performance API) (UTK, Ptools) • consistent, portable API • Organization • node, context, thread levels • profile groups for collective events (runtime selective) • mapping between software levels
TAU Measurement (continued) • Profiling • function-level, block-level, statement-level • supports user-defined events • TAU profile (function) database (PD) • function callstack • hardware counts instead of time • Tracing • profile-level events • interprocess communication events • timestamp synchronization • User-controlled configuration (configure)
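One way a function callstack supports profiling is by separating inclusive from exclusive cost. The sketch below is an assumption about how such bookkeeping can work, not TAU's implementation; `CallstackProfiler` and its members are hypothetical. The `now` argument is any monotonically increasing value, so a hardware counter reading could stand in for a timestamp.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Per-routine totals kept in the profile.
struct Totals {
    std::uint64_t inclusive = 0;  // cost including callees
    std::uint64_t exclusive = 0;  // cost excluding callees
    std::uint64_t calls = 0;
};

// Hypothetical sketch of callstack-based profiling (not TAU code).
class CallstackProfiler {
public:
    // Record entry into a routine at counter value `now`.
    void enter(const std::string& name, std::uint64_t now) {
        stack_.push_back({name, now, 0});
    }

    // Record exit from the routine on top of the stack at counter value `now`.
    void exit(std::uint64_t now) {
        Frame f = stack_.back();
        stack_.pop_back();
        const std::uint64_t inclusive = now - f.start;
        Totals& t = totals_[f.name];
        t.inclusive += inclusive;
        t.exclusive += inclusive - f.child_cost;
        ++t.calls;
        // Charge this call's inclusive cost to the caller as child cost.
        if (!stack_.empty()) stack_.back().child_cost += inclusive;
    }

    const Totals& totals(const std::string& name) const {
        return totals_.at(name);
    }

private:
    struct Frame {
        std::string name;
        std::uint64_t start;
        std::uint64_t child_cost;
    };
    std::vector<Frame> stack_;
    std::map<std::string, Totals> totals_;
};
```

For example, entering `main` at 0, entering `solve` at 10, leaving `solve` at 40, and leaving `main` at 100 gives `solve` an inclusive and exclusive cost of 30, and `main` an inclusive cost of 100 with an exclusive cost of 70. Recursive calls would double-count inclusive cost in this simple form.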
Timing of Multi-threaded Applications • Capture timing information on a per-thread basis • Two alternatives • wall clock time • works on all systems • user-level measurement • OS-maintained CPU time (e.g., Solaris, Linux) • thread virtual time measurement • TAU supports both alternatives • CPUTIME module profiles user+system time • % configure -pthread -CPUTIME
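The difference between the two alternatives can be seen with a small sketch. It assumes a POSIX system that provides `clock_gettime` with `CLOCK_THREAD_CPUTIME_ID`; the slides do not say which OS interface the CPUTIME module uses, so this is only an illustration with hypothetical helper names.

```cpp
#include <chrono>
#include <thread>
#include <time.h>  // POSIX clock_gettime, CLOCK_THREAD_CPUTIME_ID

// Wall-clock seconds from a monotonic clock (portable C++11).
double wall_seconds() {
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}

// CPU seconds (user + system) charged to the calling thread.
// Assumption: POSIX per-thread CPU-time clock is available.
double thread_cpu_seconds() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return static_cast<double>(ts.tv_sec) + ts.tv_nsec * 1e-9;
}

struct Sample {
    double wall;  // elapsed wall-clock seconds
    double cpu;   // CPU seconds consumed by this thread
};

// Measure a region in which the thread is blocked rather than computing.
Sample measure_sleep(int milliseconds) {
    const double w0 = wall_seconds();
    const double c0 = thread_cpu_seconds();
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
    return {wall_seconds() - w0, thread_cpu_seconds() - c0};
}
```

A sleeping thread accumulates wall-clock time but almost no CPU time, so `measure_sleep(50)` reports a `wall` value near 0.05 s and a much smaller `cpu` value. This is why the choice of clock matters when threads block or are descheduled.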
TAU Analysis • Profile analysis • pprof • parallel profiler with text-based display • racy • graphical interface to pprof • Trace analysis • trace merging and clock adjustment (if necessary) • trace format conversion (ALOG, SDDF, PV, Vampir) • Vampir • trace analysis and visualization tool (Pallas)
TAU Status • Usage • platforms • IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E, HP, Sun, Windows 95/98/NT, Alpha/Pentium Linux cluster • languages • C, C++, Fortran 77/90, HPF, pC++, HPC++, Java • communication libraries • MPI, PVM, Nexus, Tulip, ACLMPL • thread libraries • pthreads, Tulip, SMARTS, Java, Windows • compilers • KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray
TAU Status (continued) • application libraries • Blitz++, A++/P++, ACLVIS, PAWS • application frameworks • POOMA, POOMA-2, MC++, Conejo, PaRP • other projects • ACPC, University of Vienna: Aurora • UC Berkeley (Culler): Millennium, sensitivity analysis • KAI and Pallas • TAU profiling and tracing toolkit (Version 2.7) • LANL ACL Fall 1999 CD-ROM distributed at SC'99 • Extensive 70-page TAU User’s Guide • http://www.acl.lanl.gov/tau
TAU Examples • Instrumentation • C++ template profiling (PETE, Blitz++) • Java and MPI • PAPI • Measurement • mapping of asynchronous execution (SMARTS) • hybrid execution (Opus/HPF) • Analysis • SMARTS scheduling
C++ Template Instrumentation (Blitz++, PETE) • High-level objects • array classes • templates • Optimizations • array processing • expressions (PETE) • Relate performance data to high-level statement • Complexity of template evaluation • Array expressions
Standard Template Instrumentation Difficulties • Instantiated templates result in mangled identifiers • Standard profiling techniques and tools are deficient • integrated with proprietary compilers • specific system platforms and programming models • Uninterpretable routine names
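The mangled-identifier problem, and the runtime type identification approach mentioned earlier for template instantiations, can be illustrated with a short sketch. It assumes a GCC- or Clang-compatible toolchain, where `typeid(T).name()` returns a mangled string and `abi::__cxa_demangle` from `<cxxabi.h>` converts it to readable form. `Array`, `readable_name`, and `profile_name` are hypothetical names, not TAU or Blitz++ code.

```cpp
#include <cstdlib>
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <string>
#include <typeinfo>

// Hypothetical stand-in for a templated array class.
template <typename T>
struct Array {};

// Convert the RTTI name of a type to readable form, falling back to the
// raw (mangled) name if demangling fails.
std::string readable_name(const std::type_info& info) {
    int status = 0;
    char* demangled =
        abi::__cxa_demangle(info.name(), nullptr, nullptr, &status);
    std::string result =
        (status == 0 && demangled != nullptr) ? demangled : info.name();
    std::free(demangled);  // buffer was allocated with malloc; free(nullptr) is safe
    return result;
}

// Build a distinct profile entry name for each template instantiation.
template <typename T>
std::string profile_name(const std::string& routine) {
    return routine + " [" + readable_name(typeid(T)) + "]";
}
```

With this, `profile_name<Array<double>>("evaluate")` and `profile_name<Array<int>>("evaluate")` yield different, human-readable entry names, so each instantiation can be reported separately instead of under an uninterpretable mangled symbol.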
TAU Template Instrumentation and Profiling • Profile of expression types • Performance data presented with respect to high-level array expression types • Graphical pprof
Parallel Java Performance Instrumentation • Multi-language applications (Java, C, C++, Fortran) • Hybrid execution models (Java threads, MPI) • Java Virtual Machine Profiler Interface (JVMPI) • event instrumentation in JVM • profiler agent (libTAU.so) fields events • Java Native Interface (JNI) • invoke JVMPI control routines to control Java threads and access thread information • MPI profiling interface • “Performance Tools for Parallel Java Environments,” Java Workshop, ICS 2000, May 2000.
TAU Java Instrumentation Architecture [figure not reproduced; component labels: Java program, mpiJava package, TAU package, Thread API, JNI, JVMPI, Event notification, MPI profiling interface, TAU wrapper, Native MPI library, TAU, Profile DB]
Parallel Java Game of Life • mpiJava test case • 4 nodes, 28 threads • Node process grouping • Thread message pairing • Multi-level event grouping • Vampir display
TAU and PAPI: NAS Parallel LU Benchmark • SGI Power Onyx (4 processors, R10K), MPI • Floating point operations • Cross-node full / routine profiles • Full FP profile for each node • Percentage profile
TAU and PAPI: Matrix Multiply • Data cache miss comparison • “regular” vs. “strip-mining” execution • 512x512 • 32 KB (P), 2 MB (S) • Regular causes 4.5 times more misses
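The slide does not show the two matrix multiply versions that were compared. The sketch below is one common reading, offered as an assumption: a "regular" triple loop that walks one operand column-wise, and a strip-mined version that splits the `k` and `j` loops into strips so the working set of each strip stays in cache. The function names and the `strip` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n storage

// "Regular" version: the inner loop strides through b by n elements,
// touching a new cache line on nearly every access for large n.
Matrix multiply_regular(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Strip-mined version: k and j are processed in strips of `strip` elements,
// so a strip-by-strip piece of b is reused across all rows i.
Matrix multiply_strip_mined(const Matrix& a, const Matrix& b, std::size_t n,
                            std::size_t strip) {
    Matrix c(n * n, 0.0);
    for (std::size_t kk = 0; kk < n; kk += strip) {
        const std::size_t k_end = std::min(kk + strip, n);
        for (std::size_t jj = 0; jj < n; jj += strip) {
            const std::size_t j_end = std::min(jj + strip, n);
            for (std::size_t i = 0; i < n; ++i) {
                for (std::size_t k = kk; k < k_end; ++k) {
                    const double aik = a[i * n + k];
                    for (std::size_t j = jj; j < j_end; ++j) {
                        c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
    return c;
}
```

Both functions compute the same product; only the memory access order differs, which is what a data cache miss counter would expose. The strip size does not need to divide `n`.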
Asynchronous Performance Analysis (SMARTS) • Scalable Multithreaded Asynchronous Runtime System • user-level threads, light-weight virtual processors • macro-dataflow, asynchronous execution interleaving iterates from data-parallel statements • integrated with POOMA II • TAU measurement of asynchronous parallel execution • utilized the TAU mapping API • associate iterate performance with data parallel statement • evaluate different scheduling policies • “SMARTS: Exploiting Temporal Locality & Parallelism through Vertical Execution,” ICS '99, August 1999.
TAU Mapping of Asynchronous Execution • POOMA / SMARTS • Two threads executing • Without mapping • With mapping
With and without mapping (Thread 0) • Without mapping • Thread 0 blocks waiting for iterates • Iterates get lumped together • With mapping • Iterates distinguished
With and without mapping (Thread 1) • Without mapping • Array initialization performance lumped • Performance associated with ExpressionKernel object • With mapping • Iterate performance mapped to array statement • Array initialization performance correctly separated
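The effect of mapping shown in these comparisons can be sketched in a few lines. This is a simplified illustration with hypothetical names, not the TAU mapping API: each iterate carries a key identifying the data-parallel statement it came from, and the measured cost is credited either to the generic routine that executes iterates (no mapping) or to the originating statement (with mapping).

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One executed iterate. The statement field is the mapping key; in a real
// system it would be set when the iterate is created from the statement.
struct Iterate {
    std::string statement;  // originating data-parallel statement
    std::uint64_t cost;     // measured cost of this iterate, arbitrary units
};

using Profile = std::map<std::string, std::uint64_t>;

// Without mapping: all iterates are charged to the routine that runs them,
// so costs from different statements are lumped together.
Profile profile_without_mapping(const std::vector<Iterate>& executed) {
    Profile profile;
    for (const Iterate& it : executed) {
        profile["ExpressionKernel"] += it.cost;
    }
    return profile;
}

// With mapping: each iterate's cost is credited to its originating statement.
Profile profile_with_mapping(const std::vector<Iterate>& executed) {
    Profile profile;
    for (const Iterate& it : executed) {
        profile[it.statement] += it.cost;
    }
    return profile;
}
```

Given interleaved iterates from an initialization statement and an array expression, the unmapped profile has a single `ExpressionKernel` entry holding the total, while the mapped profile has one entry per statement. The totals are equal; only the attribution changes.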
TAU and Hybrid Execution in Opus/HPF • Fortran 77, Fortran 90, HPF • Vienna Fortran Compiling System • Opus / HPF • combined data (HPF) and task (Opus) parallelism • HPF compiler produces Fortran 90 modules • processes interoperate using Opus runtime system • producer / consumer model • MPI and pthreads • performance influence at multiple software levels
TAU Profiling of Opus/HPF Application • Multiple producers • Multiple consumers • Parallelism View
TAU Profiling of SMARTS • Iteration scheduling for two array expressions
SMARTS Tracing (SOR) – Vampir Visualization • SCVE scheduler used in Red/Black SOR running on 32 processors of SGI Origin 2000 • Asynchronous, overlapped parallelism
Program Database Toolkit (PDT) • Program code analysis framework for developing source-based tools • High-level interface to source code information • Integrated toolkit for source code parsing, database creation, and database query • commercial grade front end parsers • portable IL analyzer, database format, and access API • open software approach for tool development • Target and integrate multiple source languages • http://www.acl.lanl.gov/pdtoolkit
PDT Architecture and Tools [figure not reproduced]
PDT Summary • Program Database Toolkit (Version 1.1) • LANL ACL Fall 1999 CD-ROM distributed at SC'99 • EDG C++ Front End (Version 2.41.2) • C++ IL Analyzer and DUCTAPE library • tools: pdbmerge, pdbconv, pdbtree, pdbhtml • standard C++ system header files (KAI KCC 3.4c) • Fortran 90 IL Analyzer in progress • Automated TAU performance instrumentation • Program analysis support for SILOON (ACL CD) • “A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software,” submitted to SC ’00.
Distributed Monitoring Framework • Extend usability of TAU performance analysis • Access TAU performance data during execution • Framework model • each application context is a performance data server • monitor agent thread is created within each context • client processes attach to agents and request data • server thread synchronization for data consistency • pull mode of interaction • Distributed TAU performance data space • “A Runtime Monitoring Framework for the TAU Profiling System,” ISCOPE ’99, Nov. 1999.
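The pull mode of interaction with synchronized access can be sketched as below. This is a single-process illustration with hypothetical names (`ContextMonitor`, `record`, `pull`); the agent thread and the client/agent communication layer described in the framework model are omitted. Application threads update the profile under a lock, and a request on behalf of a client takes the same lock to copy out a consistent snapshot.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

using Snapshot = std::map<std::string, std::uint64_t>;

// Hypothetical per-context performance data server (not TAU code).
class ContextMonitor {
public:
    // Called by application threads as measurements are taken.
    void record(const std::string& event, std::uint64_t value) {
        std::lock_guard<std::mutex> lock(mutex_);
        profile_[event] += value;
    }

    // Called when a client requests data (pull mode). The lock ensures the
    // copy is not taken in the middle of an update.
    Snapshot pull() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return profile_;
    }

private:
    mutable std::mutex mutex_;  // guards profile_
    Snapshot profile_;
};
```

A snapshot is a copy, so it does not change when the application records more data afterwards; the client pulls again to see newer values. Nothing is sent unless a client asks, which keeps monitoring overhead proportional to the request rate.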
TAU Distributed Monitor Architecture • TAU profile database • Each context has a monitor agent • Client in separate thread directs agent • Pull model of interaction • Initial HPC++ implementation
Java Implementation of TAU Monitor • Motivations • more portable monitor middleware system (RMI) • more flexible and programmable server interface (JNI) • more robust client development (EJB, JDBC, Swing)
Future Plans • TAU • platforms: SGI Itanium, Sun Starfire, IBM Linux, ... • languages: Java (Java Grande), OpenMP • instrument: automatic (F90, Java), Dyninst • measurement: hardware counter, support PAPI • displays: “beyond bargraphs” performance views • performance database and technology • support for multiple runs • open API for analysis tool development • PDT • complete F90 and Java IL Analyzer • source browsers: function, class, template • tools for aiding in data marshalling and translation
Future Plans (continued) • Distributed monitoring framework • application and system monitoring • ACL Supermon and SGI Performance Co-Pilot • scalable SMP clusters and distributed systems • performance monitoring clients • Performance evaluation • numerical libraries and frameworks • scalable runtime systems • ASCI application developers (benchmark codes) • Investigate performance issues in Linux kernel • Investigate integration with CCA
Conclusions • Complex parallel computing environments require robust program analysis tools • portable, cross-platform, multi-level, integrated • able to bridge and reuse existing technology • technology savvy • TAU offers a robust performance technology framework for complex parallel computing systems • flexible instrumentation and measurement • extendable profile and trace performance analysis • integration with other performance technology • Opportunities exist for open performance technology
Open Performance Technology (OPT) • Performance problem is complex • diverse platforms, software development, applications • things evolve • History of incompatible and competing tools • instrumentation / measurement technology reinvention • lack of common, reusable software foundations • Need “value added” (open) approach • technology for high-level performance tool development • layered performance tool architecture • portable, flexible, programmable, integrative technology • Opportunity for Industry/National Labs/PACI sites