
The TAU Experience and Future Performance Tools: Evolution or Revolution?


Presentation Transcript


  1. The TAU Experience and Future Performance Tools: Evolution or Revolution? Allen D. Malony, Sameer Shende, Kevin Huck, Nick Chaimov, David Ozog Department of Computer and Information Science University of Oregon

  2. Outline • My summer influences • Perspective • Performance observation and performance engineering • Retrospective (circa 1991, 1998-2007, 2008-present) • The TAU experience • History • Application performance studies • Evolution • Extreme performance engineering • Revolution • Online performance dynamics • Performance as collective behavior

  3. Summer Influences – Blue Latitudes • Blue Latitudes, Tony Horwitz • Boldly Going Where Captain Cook Has Gone Before • “Two centuries after James Cook's epic voyages of discovery, Horwitz recaptures the Captain’s adventures and explores his embattled legacy in today’s Pacific.” • Captain James Cook • From impoverished farm boy to Britain’s greatest navigator • Helped create the global village we inhabit today • “Do just once what others say you can’t do, and you will never pay attention to their limitations again.”

  4. Summer Influences – The Jimi Hendrix Experience • 1960s American-English rock band • Widely recognized as hugely influential on rock ‘n roll, hard rock, and heavy metal • Power trio format: guitar, bass, drums • Encouraged extroverted playing • London’s Bag O’ Nails nightclub (1/11/1967) • Pete Townshend: “[Jimi] changed the whole sound of electric guitar and turned the rock world upside down.” • Eric Clapton: “I thought that was it, the game was up for all of us, we may as well pack it in.” “He played just about every style you could think of, and not in a flashy way. … and that was it … He walked off, and my life was never the same again”

  5. Perspective – Parallel Tools Experience • Performance has been the fundamental driving concern for parallel computing / high-performance computing • Performance observation, modeling, and engineering are key methodologies to deliver the potential of parallel, HPC machines • However, there has always been a strained relationship between performance tools and the parallel computing community • Performance considered necessary, but an afterthought • There are compelling reasons for performance methods and technologies to be integrated throughout parallel computing • To obtain, learn, and carry forward performance knowledge • To address increasing scalability and complexity issues • To bridge between programming semantics (computation model) and execution model operation

  6. Parallel Performance Engineering • Scalable, optimized applications deliver HPC promise • Optimization through performance engineering process • Understand performance complexity and inefficiencies • Tune application to run optimally on high-end machines • How to make the process more effective and productive? • What is the nature of the performance problem solving? • What is the performance technology to be applied? • Performance tool efforts have been focused on performance observation, analysis, problem diagnosis • Application development and optimization productivity • Programmability, reusability, portability, robustness • Performance technology part of larger programming system • Parallel systems evolution will change process, technology, use

  7. Retrospective (1991) – Performance Observability • Performance evaluation problems define the requirements for performance measurement and analysis methods • Performance observability is the ability to “accurately” capture, analyze, and present (collectively observe) information about parallel software and systems • Tools for performance observability must balance the need for performance data against the cost of obtaining it (environment complexity, performance intrusion) • Too little performance data makes analysis difficult • Too much data perturbs the measured system • Important to understand performance observability complexity and develop technology to address it

  8. Instrumentation Uncertainty Principle (Ph.D. thesis) • All performance measurements generate overhead • Overhead is the cost of performance measurement • Overhead causes (generates) performance intrusion • Intrusion is the dynamic performance effect of overhead • Intrusion causes (potentially) performance perturbation • Perturbation is the change in performance behavior • Measurement does not change “possible” executions !!! • Perturbation can lead to erroneous performance data (Why?) • Alteration of “probable” performance behavior (How?) • Instrumentation uncertainty principle • Any instrumentation perturbs the system state • Execution phenomena and instrumentation are coupled • Volume and accuracy are antithetical • Perturbation analysis is possible with deterministic execution • Otherwise, measurement must be considered part of the system

  9. Perturbation Analysis – Livermore Loops • Full: full instrumentation • Raw: uninstrumented • Model: perturbation compensation • Concurrent performance analysis requires more events! • Must model whole runtime software and possibly machine!

  10. Retrospective (1998-2007) – Performance Diagnosis Performance diagnosis is a process to detect and explain performance problems

  11. Performance Diagnosis Projects • APART – Automatic Performance Analysis – Real Tools • Problem specification and identification • Poirot – theory of performance diagnosis processes • Compare and analyze performance diagnosis systems • Use theory to create system that is automated / adaptable • Heuristic classification: match to characteristics • Heuristic search: look up solutions with problem knowledge • Problems: low-level feedback, lack of explanation power • Hercule – knowledge-based (model-based) diagnosis • Capture knowledge about performance problems • Capture knowledge about how to detect and explain them • Knowledge comes from parallel computational models • Associate computational models with performance models

  12. Communication Perturbation Analysis • Trace-based analysis of message passing communication • Removal of measurement intrusion • Modeling of communication perturbation and removal • Apply to “what if” prediction of application performance • Different communication speeds • Faster processing • Apply to “noise” analysis and overhead compensation • “Trace-based Parallel Performance Overhead Compensation,” HPCC 2005, with F. Wolf • “The Ghost in the Machine: Observing the Effects of Kernel Operation on Parallel Application Performance,” SC’07 • Perturbation analysis for removal of perturbation not caused by measurement system • Add more measurement to get higher fidelity analysis!

  13. Retrospective (2008-2012) – Performance Complexity • Performance tools have evolved incrementally to serve the dominant architectures and programming models • Reasonably stable, static parallel execution models • Allowed application-level observation focus • Observation requirements by first person measurement model: • Performance measurement can be made locally (per thread) • Performance data collected at the end of the execution • Post-mortem analysis and presentation of performance results • Offline performance engineering • Increasing performance complexity • Factors: core counts, hierarchical memory architecture, interconnection technology, heterogeneity, and scale • Focus on performance technology integration

  14. TAU Performance System® (http://tau.uoregon.edu) • Tuning and Analysis Utilities (20+ year project) • Performance problem solving framework for HPC • Integrated, scalable, flexible, portable • Target all parallel programming / execution paradigms • Integrated performance toolkit • Multi-level performance instrumentation • Flexible and configurable performance measurement • Widely-ported performance profiling / tracing system • Performance data management and data mining • Open source (BSD-style license) • Broad use in complex software, systems, applications

  15. TAU History
  1992-1995: Malony and Mohr work with Gannon on DARPA pC++ project. TAU is born. [parallel profiling, tracing, performance extrapolation]
  1995-1998: Shende works on Ph.D. research on performance mapping. TAU v1.0 released. [multiple languages, source analysis, automatic instrumentation]
  1998-2001: Significant effort in Fortran analysis and instrumentation, work with Mohr on POMP, Kojak tracing integration, focus on automated performance analysis. [performance diagnosis, source analysis, instrumentation]
  2002-2005: Focus on profiling analysis tools, measurement scalability, and perturbation compensation. [analysis, scalability, perturbation analysis, applications]
  2005-2007: More emphasis on tool integration, usability, and data presentation. TAU v2.0 released. [performance visualization, binary instrumentation, integration, performance diagnosis and modeling]
  2008-2011: Add performance database support, data mining, and rule-based analysis. Develop measurement/analysis for heterogeneous systems. Core measurement infrastructure integration (Score-P). [database, data mining, expert system, heterogeneous measurement, infrastructure integration]
  2012-present: Focus on exascale systems. Improve scalability. Add hybrid measurement support, extend heterogeneous and mixed-mode, develop user-level threading. Apply to petascale / exascale applications. [scale, autotuning, user-level]

  16. NWChem Performance Study • NWChem is a leading chemistry modeling code • NWChem relies on Global Arrays (GA) • Provides a global view of a physically distributed array • One-sided access to arbitrary patches of data • Developed as a library (fully interoperable with MPI) • Aggregate Remote Memory Copy Interface (ARMCI) • GA communication substrate for one-sided communication • Portable high-performance one-sided communication library • Rich set of remote memory access primitives • Would like to better understand the performance of representative workloads for NWChem on different platforms • Help to create use cases for one-sided programming models

  17. NWChem One-sided Communication and Scaling • Understand interplay between data-server and compute processes as a function of scaling • Data-server uses a separate thread • Large numerical computation per node at small scale can obscure the cost of maintaining passive-target progress • Larger scale decreases numerical work per node and increases the fragmentation of data, increasing messages • Vary #nodes, cores-per-node, and memory buffer pinning • Understand trade-off of core allocation • All to computation versus some to communication J. Hammond, S. Krishnamoorthy, S. Shende, N. Romero, A. Malony, “Performance Characterization of Global Address Space Applications: a Case Study with NWChem,” Concurrency and Computation: Practice and Experience, Vol 24, No. 2, pp. 135-154, 2012.

  18. NWChem Instrumentation • Source-based instrumentation of NWChem application • Developed an ARMCI interposition library (PARMCI) • Defines weak symbols and name-shifted PARMCI interface • Similar to PMPI for MPI • Developed a TAU PARMCI library • Interval events around interface routines • Atomic events capture communication size and destination • Wrapped external libraries • BLAS (DGEMM) • Need portable instrumentation for cross-platform runs • Systems • Fusion: Linux cluster, Argonne National Lab • Intrepid: IBM BG/P, Argonne National Lab (Note: Runs on Hopper and Mira will scale greater, but will possibly show similar effects.)

  19. Fusion Tests Comparing No Pinning vs. Pinning • Scaling on 24, 32, 48, 64, 96 and 128 nodes • Test on 8 cores (no separate data server thread) • With no pinning, ARMCI communication overhead increases dramatically and no scaling is observed • Pinning communication buffers shows dramatic effects • Relative communication overhead increases, but not dramatically • (Figures: DGEMM profiles)

  20. Intrepid Tests Comparing No Pinning vs. Pinning • Scaling on 64, 128, 256 and 512 nodes • Tests with interrupt or communication helper thread (CHT) • CHT requires a core to be allocated • ARMCI calls are barely noticeable • DAXPY calculation shows up more • CHT performs better in both SMP and DUAL modes

  21. Electronic Structure in Computational Chemistry • Parallel performance is determined by: • How the application is designed and developed • The nature and characteristics of the problem • Computational chemistry applications can exhibit: • Highly symmetric, diverse load (e.g., Benzene) • Asymmetric, unpredictable load (e.g., water clusters) • QM/MM: sheer large size (e.g., macro-molecules) • Load balance is crucially important for performance • (Figures: Benzene, Water Clusters, Macro-Molecules)

  22. NWChem Performance Analysis – NXTVAL • Focus on NXTVAL • Global atomic counter keeping track of tasks sent • Strong scaling experiment • 14 water molecules • aug-cc-PVDZ dataset • 124 nodes on ANL Fusion • 8 cores per node • NXTVAL significant % of mean inclusive time • Increasing per-call time • When arrival rate exceeds processing rate, buffering and flow control must be used • (Figure: NXTVAL flooding micro-benchmark)

  23. Evaluation of Inspector-Executor Algorithm • How to eliminate the overhead of the centralized load balance algorithm based on NXTVAL • Use an inspector-executor approach to assigning tasks • Assess task imbalance • Reassign • Use TAU to evaluate performance improvement with respect to NXTVAL, overhead, task balance D. Ozog, J. Hammond, J. Dinan, P. Balaji, S. Shende, A. Malony, “Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions,” to appear in International Conference on Parallel Processing, September 2013.

  24. Refinement from NXTVAL to Inspector/Executor • Original NXTVAL measured • Original NXTVAL reduced • Inspector/Executor 1st iteration overhead • Inspector/Executor subsequent iterations • (Figure: normalized iteration time)

  25. MPAS (Model for Prediction Across Scales) • MPAS is an unstructured grid approach to climate system modeling • Explore regional-scale climate change • MPAS supports both quasi-uniform and variable resolution meshing of the sphere • Quadrilaterals, triangles, or Voronoi tessellations • MPAS is a software framework for the rapid prototyping of single components of climate system models • Two SciDAC earth systems codes (dynamical cores) • MPAS-O (ocean model) designed for the simulation of the ocean system from time scales of months to millennia • CAM-SE (atmosphere model) • http://mpas-dev.github.io

  26. Multiscale and MPAS-O Domain Decomposition • Use multiscale methods for accurate, efficient, and scale-aware models of the earth system • MPAS-O uses a variable resolution irregular mesh of hexagonal grid cells • Cells assigned to MPI processes, grouped as “blocks” • Each cell has 1-40 vertical layers, depending on ocean depth • MPAS-O has demonstrated scaling limits when using MPI alone • Look at increasing concurrency • Developers currently adding OpenMP • Both split explicit and RK4 solvers

  27. MPAS-Ocean Performance Study • Integrate TAU into the MPAS build system • Evaluate the original MPI-only approach • Study the new MPI+OpenMP approach • Performance results • MPI block + OpenMP element decomposition • Reduces total instructions in computational regions • ~10% faster than MPI alone • Guided OpenMP thread schedule balances work across threads • ~6% faster than default • Weighted block decomposition using vertical elements (depth) could balance work across processes (~5% faster in some tests) • Overlapping communication and computation could reduce synchronization delays when exchanging halo regions (underway) • Evaluation is ongoing and includes porting to MIC platform

  28. MPI Scaling Study (Hopper) • Strong scaling • 192 to 16,800 processes • Poor scaling over 6,144 processes • Communication begins to dominate • Points to problem size being too small • Time profile by events • Only MPI events > 5% after 12K processes

  29. Benefits of RK4 solver and OpenMP Scheduling • Use modified RK4 solver versus original MPI solver • Test cases • 64 cores • 64 processes (MPI-only) • 32 processes x 2 threads • 16 processes x 4 threads • 96 cores • 96 processes (MPI-only) • 48 processes x 2 threads • 32 processes x 3 threads • 16 processes x 6 threads • OpenMP scheduling options

  30. Benefits of Guided Scheduling with MPAS-O RK4 • MPI + OpenMP experiment • 96 cores (16 processes by 6 threads per process) • Default scheduling generates a load imbalance • Threads arrive at boundary cell exchange at different times • Complete boundary cell exchange at different rates • Only the master thread performs the exchange • Guided scheduling provides better balance • (Figures: OpenMP Barrier and MPI_Wait times, Default vs. Guided)

  31. MPAS-O on BG/Q (Vesta) on 16K (256x64) • MPI_Wait() in master-thread halo exchanges • OpenMP worker threads executing solver routines, including time spent waiting on halo exchanges by master threads

  32. Use of Hybrid Profiling to Evaluate Vectorization • Use TAU’s support of general hybrid profiling to understand vectorization effects in MPAS-O • Computation and synchronization have strong correlation • Depth & color show effect of vectorization (faster, uniform) • Tightens communication waiting (height & width) • (Figures: non-vectorized vs. vectorized)

  33. IRMHD Performance on Argonne Intrepid and Mira • INCITE magnetohydrodynamics simulation to understand solar winds and coronal heating • First direct numerical simulations of Alfvén wave (AW) turbulence in extended solar atmosphere accounting for inhomogeneities • Team • University of New Hampshire (Jean Perez and Benjamin Chandran) • ALCF (Tim Williams) • University of Oregon (Sameer Shende) • IRMHD (Inhomogeneous Reduced Magnetohydrodynamics) • Fortran 90 and MPI • Excellent weak and strong scaling properties • Tested on Intrepid (BG/P) and Mira (BG/Q) • HPC Source article and ALCF news https://www.alcf.anl.gov/articles/furthering-understanding-coronal-heating-and-solar-wind-origin

  34. Communication Analysis (MPI_Send, MPI_Bcast) • IRMHD demonstrated performance behavior consistent with common synchronous communication bottlenecks • Significant time spent in MPI routines • Identify problems on a 2,048-core execution on BG/P • MPI_Send and MPI_Bcast took significant time • Suggest possible opportunities for overlapping computation and communication • Identified possible targets for computation improvements • (Figures: MPI_Barrier, MPI_Send)

  35. Effects of IRMHD Optimizations • Developed a non-blocking communication substrate • Deployed a more efficient implementation of the underlying FFT library • Overall execution time reduced from 528.18 core hours to 70.85 core hours (>7x improvement) for a 2,048-processor execution on Intrepid • Further improvement on Mira …

  36. Mira (BG/Q) Performance • Test with 32K MPI ranks • Load imbalance apparent • See imbalance reflected in MPI_Alltoall() performance • MPI_Barrier is not needed in certain regions • 6% speedup when removed • Oversubscribe nodes next …

  37. Oversubscribing Mira Nodes • Vary #MPI ranks on nodes (1024 nodes) • 16 ranks per node (16K) versus 32 ranks per node (32K) • Overall time improvement • 71.23% of original • More efficient barriers within node lead to performance improvement

  38. Performance Variability in CESM • Community Earth System Model (CESM) • Observed performance variability on ORNL Jaguar • Significant increase in execution time led to failed runs • End-to-end analysis methodology • Collect information from all production jobs • Modify job scripts • Problem/code provenance, system topology, system workload, process node/core mapping, job progress, time spent in queue, pre/post, total • Load in TAUdb, quantify nature of variability and impact

  39. Example Experiment Case Analysis • 4096 processor cores • 6 hour request • Target: 150 simulation days • 35 jobs • May 15 – Jun 29, 2012 • Two of which failed • (Figure: minimum vs. maximum execution time; high intra-run variability)

  40. Node Placement and MPI Placement Matter • Node placement can cause poor resource matching • Fastest: 0 unmatched (4:14:08 hh:mm:ss) • Slowest: 272 unmatched (5:35:50 hh:mm:ss) • Can cause more risk of contention with other jobs executing • MPI rank placement also matters • Greater distances between processes can result in larger overheads for communication

  41. Evolution • Increased performance complexity and scale forces the engineering process to be more intelligent and automated • Automate performance data analysis / mining / learning • Automated performance problem identification • Even with intelligent and application-specific tools, the decisions of what to analyze are difficult • Performance engineering tools and practice must incorporate a performance knowledge discovery process • Model-oriented knowledge • Computational semantics of the application • Symbolic models for algorithms • Performance models for system architectures / components • Application developers can be more directly involved in the performance engineering process

  42. Need for Whole Performance Evaluation • Extreme scale performance is an optimized orchestration • Application, processor, memory, network, I/O • Reductionist approaches to performance will be unable to support optimization and productivity objectives • Application-level only performance view is myopic • Interplay of hardware, software, and system components • Ultimately determines how performance is delivered • Performance should be evaluated in toto • Application and system components • Understand effects of performance interactions • Identify opportunities for optimization across levels • Need whole performance evaluation practice

  43. “Extreme” Performance Engineering • Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

  44. TAU Integration with Empirical Autotuning • Autotuning is a performance engineering process • Automated experimentation and performance testing • Guided optimization by (intelligent) search space exploration • Model-based (domain-specific) computational semantics • Autotuning components • Active Harmony autotuning system (Hollingsworth, UMD) • Software architecture for optimization and adaptation • CHiLL compiler framework (Hall, Utah) • CPU and GPU code transformations for optimization • Orio annotation-based autotuning (Norris, ANL) • Code transformation (C, Fortran, CUDA, OpenCL) with optimization • Goal is to integrate TAU with existing autotuning frameworks • Use TAU to gather performance data for autotuning/specialization • Store performance data with metadata for each experiment variant in performance database (TAUdb) • Use machine learning and data mining to increase the level of automation of autotuning and specialization

  45. Orio and TAU Integration (OpenCL, CUDA) • Measurement • Metric profiling • Metadata • TAUdb storage • Autotuning analysis • Machine learning • Optimization search • Specialization

  46. Revolution • The first person approach is problematic for exascale use • Highly concurrent and dynamic execution model • post-mortem analysis of low-level data prohibitive • Interactions with scarce and shared resources • introduces bottlenecks and queues on chip/node and between nodes • Multiple objectives (performance, energy, resilience, …) • Runtime adaptation to address dynamic variability • Third person measurement model (in addition) required • Focus is on system activity characterization at different levels • system resource usage is a primary concern • Measurements are analyzed relative to contributors • Online analysis and availability of performance allows introspective adaptation for objective evaluation and tuning

  47. A New “Performance” Observability • Exascale requires a fundamentally different “performance” observability paradigm • Designed specifically to support introspective adaptation • Reflective of computation model mapped to execution model • Aware of multiple objectives (“performance”) • system-level resource utilization data and analysis, energy consumption, and health information available online • Key parallel “performance” abstraction • Inherent state of exascale execution is dynamic • Embodies non-stationarity of “performance” • Constantly shaped by the adaptation of resources to meet computational needs and optimize execution objectives

  48. Needs Integration in Exascale Software Stack • Exascale observability framework can be specialized through top-down and bottom-up programming specific to application • Enables top-down application transformations to be optimized for runtime and system layers by feeding back dynamic information about HW/SW resources from bottom-up • Exascale programming methodology can be opened up to include observability awareness and adaptability • Programming system exposes alternatives to the exascale system • Parameters, algorithms, parallelism control, … • Runtime state awareness can be coupled with application knowledge for self-adaptive, closed-loop runtime tuning • Richer contextualization and attribution of performance state • Performance portability through dynamic adaptivity • Will likely require support in OS and runtime environment

  49. APEX – Applying TAU to ParalleX • ParalleX execution model • Highly concurrent • Asynchronous • Message driven • Global address space • OpenX • Integrated software stack for ParalleX • HPX runtime system • RIOS interface to OS • APEX autonomic performance environment • Cores interact through shared resources on chip (node) and between nodes • Top-down and bottom-up

  50. APEX Prototyping with TAU • Map HPX performance to APEX (TAU) • Track user-level threads • Reimplement Intel® ITT instrumentation of HPX runtime (thread scheduler) • Use RCR client to observe core use and power • (Figure: HPX main process thread, HPX “helper” OS threads, HPX thread scheduler, HPX user-level threads)
