Performance analysis and tuning of parallel/distributed applications 21-08-2007 Anna Morajko Anna.Morajko@uab.es
Introduction
Goals of our group
• Develop techniques and tools for application performance improvement
• Automatic analysis
• Automatic and dynamic tuning
• For parallel/distributed applications
• For dedicated and non-dedicated clusters as well as for grids
Team: 4 PhDs, 5 graduate students, 2 external PhDs
Introduction Main research projects • Monitoring, analysis, and tuning tools: • Kappa-Pi – post-mortem analysis • DMA – automatic analysis at run-time • MATE – automatic tuning at run-time • Performance models • For applications based on frameworks • For scientific libraries
Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work
Performance analysis and tuning
Post-mortem manual analysis
[Diagram: the user develops the application source; during execution, monitoring tools collect events and performance data (traces, profiles); visualization and manual performance analysis turn these metrics into identified problems/solutions; the user then applies modifications (tuning) to the source.]
Performance analysis and tuning
Post-mortem automatic analysis
[Diagram: the same cycle, with Kappa-Pi replacing manual analysis — monitoring tools collect events and performance data (traces, profiles) during execution; Kappa-Pi analyzes the metrics automatically and produces suggestions that the user applies as source modifications (tuning).]
Performance analysis and tuning
Run-time analysis
[Diagram: DMA (Dynamic Monitor and Analyzer) instruments the running application; monitoring tools deliver performance data and metrics at run-time, and the performance analysis reports problems and their root causes; the user applies modifications (tuning) to the source.]
Performance analysis and tuning
Dynamic tuning
[Diagram: MATE (Monitoring, Analysis and Tuning Environment) closes the loop at run-time — it instruments the running application, monitors events and performance data, analyzes them for problems/solutions, and applies modifications directly to the execution, without user intervention.]
Performance analysis and tuning
What can be tuned in an application?
• Application code
• Framework code (via its API)
• Library code (via its API)
• Operating system kernel (via the OS API)
• Hardware
Performance analysis and tuning
Knowledge representation
• Measure points: where the instrumentation must be inserted to provide measurements
• Performance model: determines the minimal execution time of the entire application; formulas and conditions for optimal behavior map measurements to optimal values
• Tuning points/actions/synchronization: what can be changed in the application, and when
– point: the element that may be changed
– action: the operation to invoke on a point
– synchronization: when a tuning action can be invoked to ensure application correctness
Performance analysis and tuning
[Diagram: MATE applied across the software stack (application code, framework, libraries, OS kernel, hardware) — measure points drive monitoring, the performance model drives analysis, and tuning points/actions/synchronization drive tuning of the running execution.]
Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work
Post-mortem performance analysis Kappa-Pi tool • Knowledge-Based Automatic Parallel Program Analyser for Performance Improvement • Trace analyzer that automatically detects bottlenecks, relates them to source code and provides actionable suggestions for a user • Version 1 • Rule-based inference engine • Hard coded performance problems • Focused on analysis of inefficiency intervals • Basic source code analysis
Post-mortem performance analysis
Kappa-Pi 2
• More recent development; works with PVM and MPI
• Declarative catalog of communication problems (XML)
• Problems declared as structured event patterns
• Detection based on decision trees
• Statistical classification of problem instances
• Suggestions guided by static source code analysis
Post-mortem performance analysis
Kappa-Pi 2: example
[Diagram: a decision tree over send (S) and receive (R) calls matched to events e1–e4 — the root node classifies a Late Sender & Blocked Sender pattern, intermediate nodes a Late Sender, and the leaf node the final Blocked Sender.]
• Suggestion produced: "[Late sender] situation found in tasks [0, 1] and [1, 2]; the sender primitive call can be initiated before (no data dependences) the current call on line [35] in file [ats_ls.c]"
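The late-sender pattern described above can be made concrete with a small sketch. This is an illustrative detector over matched send/receive event pairs, not Kappa-Pi's actual trace format: the field names (`task`, `t_start`, `line`, `src_file`) are invented for the example.

```python
# Minimal sketch of late-sender detection over matched send/receive events.
# Event field names (task, t_start, line, src_file) are illustrative,
# not Kappa-Pi's real trace schema.

def detect_late_senders(matched_pairs, threshold=0.0):
    """matched_pairs: list of (send_event, recv_event) for the same message.
    A 'late sender' occurs when the receive blocks before the send is
    issued; the wasted time is the gap between the two call starts."""
    findings = []
    for send, recv in matched_pairs:
        wait = send["t_start"] - recv["t_start"]
        if wait > threshold:
            findings.append({
                "problem": "late sender",
                "tasks": (send["task"], recv["task"]),
                "wasted": wait,
                "hint": f"send at line {send['line']} in {send['src_file']} "
                        "could be initiated earlier if data dependences allow",
            })
    return findings

pairs = [
    # receive at t=2.0 blocks until the send at t=5.0 -> late sender
    ({"task": 1, "t_start": 5.0, "line": 35, "src_file": "ats_ls.c"},
     {"task": 0, "t_start": 2.0}),
    # send posted before the receive -> no problem
    ({"task": 2, "t_start": 1.0, "line": 40, "src_file": "ats_ls.c"},
     {"task": 1, "t_start": 1.5}),
]
print(detect_late_senders(pairs))  # reports only the first pair
```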
Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work
Run-time performance analysis DMA: Dynamic Monitor and Analyzer • Primary objective • Develop a tool that is able to analyze the performance of parallel applications, detect bottlenecks and explain their reasons • Our approach • Dynamic on-the-fly analysis • Automatic extraction of application structure and patterns • Root-cause analysis based on causality paths • Declarative problem catalog for explanations (performance properties) • Tool primarily targeted to MPI-based parallel programs and communication problems
Automatic performance analysis DMA: Application structure and patterns • Task Activity Graph (TAG) • Constructed dynamically and incrementally at run-time for each task • TAG nodes reflect pre-selected activities and events (e.g. MPI function calls, relevant task events and selected loops) • Edges represent sequential flow of execution (happens-before relation) • Aggregated timers/counters provide holistic view of task behavior and allow for high-level bottleneck identification • Other metrics can be attached to nodes/edges
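The incremental TAG construction above can be sketched as follows. This is a toy model under assumed representations (dictionaries for nodes/edges), not DMA's actual data structures: each completed activity adds or updates a node, and the happens-before order within the task defines the edges.

```python
from collections import defaultdict

class TAG:
    """Toy Task Activity Graph: nodes are activities (e.g. MPI call sites),
    edges follow the sequential happens-before order within one task.
    Aggregated timers/counters give a compact view of task behaviour."""
    def __init__(self):
        self.prev = None
        self.node_time = defaultdict(float)   # total time per activity
        self.node_count = defaultdict(int)    # executions per activity
        self.edges = defaultdict(int)         # traversals per edge

    def record(self, activity, duration):
        # Called on-the-fly each time an instrumented activity completes;
        # a loop collapses into a cycle in the graph, not new nodes.
        self.node_time[activity] += duration
        self.node_count[activity] += 1
        if self.prev is not None:
            self.edges[(self.prev, activity)] += 1
        self.prev = activity

tag = TAG()
for _ in range(3):                 # a task looping over recv/compute/send
    tag.record("MPI_Recv", 0.4)
    tag.record("compute", 1.0)
    tag.record("MPI_Send", 0.1)

# Nine events collapse into 3 nodes with aggregated metrics; the most
# expensive node is an immediate high-level bottleneck candidate.
hotspot = max(tag.node_time, key=tag.node_time.get)
print(hotspot)  # -> compute
```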
Automatic performance analysis DMA: Global model (GTAG) • Individual TAGs connected by message edges (point-to-point, collective) provide a global application model • This can be done at run-time: the sender's call-path id is sent with every MPI message (by dynamically extending MPI data types) • The GTAG allows detection of communication bottlenecks (e.g. wait times such as late sender, late receiver) • Other classes of bottlenecks (CPU, I/O) are visible as edge latencies
Automatic performance analysis DMA: Causality paths • How can the causes of latencies be explained? For example: why is a sender late? • The cause of the waiting time between two nodes lies in the differences between their execution paths • We can track the activities that precede the problem on both nodes • But the TAG is too generic to identify these activities, as there may be many different execution paths • So we need to detect the executed paths that lead to problems — we call them causality paths • We consider two sources of causality: sequential flow and message flow
Automatic performance analysis DMA: Root-cause analysis • Identify bottlenecks with TAG • Profile edges for non-communication problems • Search for causes of communication latencies (nodes) by following message edges backwards (sender-receiver pairs) • Detect causality paths in sender to explain latencies in receiver • There might be multiple explanations • Work in progress
Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work
Performance models
Framework-based applications
• Frameworks help users develop parallel applications by providing APIs that hide low-level details (M/W, pipeline, divide and conquer)
• We can extract the knowledge embodied in a framework to define performance models for automatic and dynamic tuning (MATE): measure points, the performance model, and tuning points/actions/synchronization all attach to the framework layer of the software stack
• Examples: M/W framework (UAB), eSkel (the Edinburgh Skeleton Library), llc (La Laguna Compiler)
Performance models
Framework-based application example: M/W application
• Measure points: entry/exit of CompFunc(), message size, entry/exit of each iteration
• Performance model: problems addressed for tuning — load imbalance, improper number of workers
• Tuning point/action/sync: change the variable Nopt (the number of workers)
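The "improper number of workers" problem can be illustrated with a deliberately simplified cost model (this is not MATE's actual formula): if total work W is split over n workers and the master pays a coordination cost c per worker, then T(n) = W/n + c·n, which is minimized at n* = sqrt(W/c).

```python
import math

def mw_time(n, work, cost_per_worker):
    """Simplified master/worker model (illustrative, not MATE's exact
    formula): computation shrinks with n, master-side coordination
    grows linearly with n."""
    return work / n + cost_per_worker * n

def optimal_workers(work, cost_per_worker):
    # dT/dn = -W/n^2 + c = 0  ->  n* = sqrt(W/c); pick the better of
    # the two neighbouring integers, since workers come in whole units.
    n_star = math.sqrt(work / cost_per_worker)
    lo, hi = max(1, math.floor(n_star)), math.ceil(n_star)
    return min((lo, hi), key=lambda n: mw_time(n, work, cost_per_worker))

print(optimal_workers(1600.0, 1.0))  # -> 40
```

At run-time, a tool in MATE's position would re-estimate W and c from the measure points each iteration and adjust Nopt when the predicted optimum drifts from the current worker count.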
Performance models
Scientific libraries modeling
• Libraries help users solve domain-specific problems
• We use the knowledge learned from scientific programming libraries to define performance models for automatic and dynamic tuning (MATE); measure points, the performance model, and tuning points/actions/synchronization attach to the library layer of the software stack
• Example: the mathematical library PETSc (Portable, Extensible Toolkit for Scientific Computation)
Performance models
Scientific libraries modeling: PETSc
• Study of different combinations of matrix types, preconditioners, and solvers
[Diagram: PETSc layers by level of abstraction — application code and other higher-level solvers (PDE solvers) on top; TS (time stepping), SNES, and SLES mathematical algorithms below; KSP, PC, and Draw next; data structures (matrices, vectors, index sets) underneath; BLAS, LAPACK, and MPI at the bottom.]
Performance models
Scientific libraries modeling: PETSc
• PETSc supports different storage methods (dense, sparse, compressed, etc.); no single data structure is appropriate for all problems
• Solving a linear system involves a preconditioner (PC) and a Krylov subspace method (KSP)
• Different mathematical algorithms are available for the preconditioner (Jacobi, LU, etc.) and the KSP (GMRES, CG, etc.)
• The choice of storage method and algorithm can significantly affect performance
• We build a knowledge base that maps problems (classes of matrices, matrix sizes, and other variables) to the best configurations
• An input matrix is compared against the knowledge base with a K-Nearest-Neighbours algorithm to select the most similar problem and its configuration
• Work still in progress
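The nearest-neighbour lookup described above can be sketched as follows. The knowledge base entries, the feature vector (size, nonzero ratio, symmetry flag), and the configurations are all invented for illustration; a real version would normalize features and use many more of them.

```python
import math

# Hypothetical knowledge base mapping matrix features to the best
# (storage type, preconditioner, solver) combination; the entries and
# feature choice are invented for illustration, not measured data.
KNOWLEDGE_BASE = [
    # (size, nnz_ratio, symmetric) -> configuration
    ((1000,  0.001,  1), ("sparse",     "jacobi", "cg")),
    ((1000,  0.5,    0), ("dense",      "lu",     "gmres")),
    ((50000, 0.0005, 1), ("compressed", "icc",    "cg")),
]

def distance(a, b):
    # Plain Euclidean distance; real features would need normalization
    # so that matrix size does not dominate the other dimensions.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_config(features, k=1):
    """Return the configuration of the nearest known problem (k=1),
    or the k nearest configurations for a majority-style decision."""
    ranked = sorted(KNOWLEDGE_BASE, key=lambda e: distance(features, e[0]))
    return ranked[0][1] if k == 1 else [e[1] for e in ranked[:k]]

print(select_config((1200, 0.002, 1)))  # -> ('sparse', 'jacobi', 'cg')
```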
Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work
MATE
[Diagram: MATE's monitoring/analysis/tuning loop applied across the software stack — performance models exist at different levels: for any application (application code), for framework-based applications, for scientific libraries, and for specific applications.]
MATE
Tunlets specification
• Tunlets: the mechanism for specifying a performance model (measure points, performance model, tuning points) for an application, a framework, or a scientific library
• Originally, each tunlet is written as a C++ code library
[Diagram: the Analyzer loads one or more tunlets per application; each tunlet binds its measure points, performance model, and tuning points to the running application on behalf of the user.]
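The three ingredients a tunlet carries can be sketched as a small class. This is a hedged Python sketch of the measure/evaluate/tune cycle only — the interface, the names, and the 20% idle-time threshold are invented, not MATE's actual C++ tunlet API.

```python
class MWTunlet:
    """Illustrative sketch of a tunlet's three ingredients (interface
    and names are invented, not MATE's C++ API):
      - measure points: events the monitor must instrument,
      - performance model: evaluated over collected measurements,
      - tuning action: what to change, where, and when it is safe."""

    measure_points = ["Master::Iteration.entry", "Master::Iteration.exit",
                      "Worker::CompFunc.entry", "Worker::CompFunc.exit"]

    def evaluate(self, measurements):
        # Performance model: request more workers if they sit idle for
        # more than 20% of the iteration (threshold is illustrative).
        idle = measurements["iteration_time"] - measurements["busy_time"]
        if idle / measurements["iteration_time"] > 0.2:
            return {"point": "Nopt", "action": "set",
                    "value": measurements["current_workers"] + 1,
                    "sync": "at iteration boundary"}
        return None  # no tuning action needed

t = MWTunlet()
req = t.evaluate({"iteration_time": 10.0, "busy_time": 6.0,
                  "current_workers": 4})
print(req)  # a request to raise Nopt to 5 at a safe synchronization point
```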
MATE Abstractions Tunlets specification • Simplify tunlet development: • Tunlet Specification Language • Automatic Tunlet Generator
MATE
MATE scalability
• Scalability problems with centralized analysis: the volume of events gets too high
[Diagram: on each machine, an AC (Application Controller) and the DMLib instrument the PVM tasks (Task1 … TaskN under pvmd); all events flow to a single Analyzer machine, which sends tuning actions back.]
MATE
MATE scalability
• Distributed-hierarchical collecting-preprocessing approach
• Distribution of event collection
• Preprocessing of certain calculations: cumulative and comparative operations
• Cooperating processes:
– Global Analyzer: performance model evaluation
– Collector-Preprocessor (CP): event collection and preprocessing
[Diagram: ACs and the DMLib on each machine send events to per-group CPs; the CPs forward preprocessed data to the Global Analyzer (with its tunlets), which sends tuning actions back.]
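The collector-preprocessor idea above reduces to folding cumulative operations locally, so the Global Analyzer receives one summary instead of a stream of raw events. A minimal sketch, with an invented structure rather than MATE's actual CP protocol:

```python
class CollectorPreprocessor:
    """Sketch of per-group event preprocessing (structure is
    illustrative, not MATE's CP implementation): cumulative
    operations are folded locally, so the Global Analyzer receives
    one aggregated record instead of every raw event."""
    def __init__(self):
        self.sum = {}
        self.count = {}

    def collect(self, metric, value):
        # Raw events arrive from the DMLib on local machines.
        self.sum[metric] = self.sum.get(metric, 0.0) + value
        self.count[metric] = self.count.get(metric, 0) + 1

    def summary(self):
        # Forwarded periodically to the Global Analyzer.
        return {m: {"total": self.sum[m], "n": self.count[m]}
                for m in self.sum}

cp = CollectorPreprocessor()
for v in (0.2, 0.3, 0.5):        # three raw communication-time events ...
    cp.collect("comm_time", v)
print(cp.summary())               # ... become a single aggregated record
```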
MATE
G-MATE: MATE on Grid
• New requirements:
• Transparent process tracking and control — the AC should follow application processes to any cluster and communicate with the analyzer across clusters
• Lower inter-cluster event collection overhead — inter-cluster links generally have higher latency and lower bandwidth, so remote trace events should be aggregated into trace event abstractions to save bandwidth (Collector-Preprocessor)
MATE
G-MATE: Transparent process tracking
• Alternative 1: system service — a machine or cluster runs MATE as a daemon that detects the startup of new processes
• Alternative 2: application wrapping — the AC can be packaged together with the application binary
[Diagram: on job submission, the AC on a remote machine detects or creates each new task (via Dyninst), loads the DMLib into it, controls it, and subscribes to the Analyzer.]
MATE
G-MATE: Analyzer approaches
• Central analyzer: collects performance data at a central site
• Hierarchical analyzer: local analyzers process local data, generate abstract metrics, and send them to a global analyzer
• Distributed analyzer: each analyzer evaluates its local performance data and abstracted global data, cooperating with the rest of the analyzers
Outline • Performance analysis and tuning • Post-mortem analysis • Run-time analysis • Performance models • Framework-based applications • Scientific libraries • MATE • Future work
Future work • DMA: new methodology under development • Development of performance models: • for applications based on different frameworks — combine both strategies (load balancing and number of workers); replication/elimination of pipeline stages • for PETSc-based applications (under construction) • for black-box applications • for grid applications (under construction) • Development of tuning techniques
Future work • MATE’s analysis improvement • Performance model of the number of CPs • Hierarchy at the CPs level as the number of machines increases • Root cause analysis? • Instrumentation evaluation • Co-execution of tunlets • Finishing MATE for MPI applications
Thank you very much 21-08-2007 Anna Morajko Anna.Morajko@uab.es