CrossGrid Approach to Application Performance Measurement and Monitoring
Marian Bubak, Bartosz Baliś, Włodzimierz Funika, Roland Wismueller, Tomasz Arodź, Marcin Kurdziel, Marcin Radecki, Tomasz Szepieniec
Institute of Computer Science & ACC CYFRONET, AGH, Kraków, Poland
TUM Munich, Germany
Institute for Software Science, University of Vienna, Austria
www.eu-crossgrid.org
Outline
• Introduction
• Performance analysis of grid interactive applications
• G-PM tool
  • Architecture
  • Measurements
  • Example of a use case
• OCM-G
  • Motivation
  • Architecture
  • Functionality
  • Security
• Status
• Future work
Features of Interactive Grid Computing
• Run time application control
• Performance data on-line
• Possible effects of decisions
• Access to benchmark information
• Interpreting application’s behavior in a heterogeneous open system
• Access to infrastructure performance
• Information meaningful in the context of the application field
  • more application-specific performance data
• Need for on-line standard and user-defined metrics
Background
• 1995 OMIS 1.0
• 1997 OMIS 2.0
• 1997 OCM for PVM clusters
• 1997 OMIS Tools (Detop, Patop, …)
• 1997 Collaboration LRR TUM-ICS AGH
• 1999 Porting to MPI
• 2000 First proposal of OCM for Grids
G-PM Tool – Objectives
• Evaluation of grid application performance:
  • Providing a rich set of predefined measurements
  • Allowing for user-defined measurements
  • Allowing for probe-based measurements
  • Providing on-line performance measurement visualization
• Compliant with the OMIS 2.0 monitoring standard interface
G-PM Tool Architecture
• High Level Analysis Component: classes supporting user-defined measurements
• Common interface to predefined and user-defined measurements
• Performance Measurement Component: provides predefined measurement classes
• User Interface and Visual Component: measurement specification and performance visualization classes
• External interface to the monitoring tool OCM-G, based on OMIS (CG Task 3.3)
Standard Metrics (1)
• Wall clock/CPU time
  • Total
  • In communication:
    • Send, Receive, Collective, Barrier
  • In I/O:
    • Read, Write
• Data volume
  • communication
  • I/O
• Number of library calls
  • communication
  • I/O
Standard Metrics (2)
• Host metrics
  • CPU load
  • Available memory
• Network metrics
  • Load
  • Bandwidth
• Benchmark metrics
  • CPU, Network
User-defined Metrics
• Support for high level performance analysis
• Custom metrics
  • Defined on the basis of standard metrics
  • Providing higher level of abstraction
  • Programmed in dedicated specification language
• Probes
  • Special function calls inserted into source code by the programmer
  • Define events that can be used in definition of custom metrics
  • Provide a way of passing arguments to G-PM
Measurements Parametrisation
• Measurements can be restricted to specific:
  • Objects
    • Sites, hosts, processes, files
  • Partner objects
    • Sites, hosts, processes
  • Locations in source code
    • Modules, functions
• Time resolution
  • Integral, Mean value, Current value
• Virtual time
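One possible reading of the three time-resolution options, sketched on a series of cumulative counter readings. The function name `summarize`, the sample layout, and the interpretation of "mean" and "current" as rates are assumptions made for illustration, not G-PM definitions.

```python
from typing import Dict, List, Tuple

# Hypothetical sample format: (timestamp in seconds, cumulative counter value).
Sample = Tuple[float, float]

def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Derive integral, mean and current values from cumulative samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t0, v0 = samples[0]
    t_prev, v_prev = samples[-2]
    t1, v1 = samples[-1]
    return {
        "integral": v1 - v0,                       # total over the whole window
        "mean": (v1 - v0) / (t1 - t0),             # average rate over the window
        "current": (v1 - v_prev) / (t1 - t_prev),  # rate over the last interval
    }

readings = [(0.0, 0.0), (1.0, 100.0), (2.0, 300.0), (4.0, 400.0)]
stats = summarize(readings)
```

With these readings the integral is the total volume seen in the window, while the mean and current values differ because the rate dropped in the last interval.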
Types of Measurements
• Sampled measurements
  • Quantities that change continuously and can only be sampled at some intervals
  • Based on a direct query about an object by the OCM-G
  • Example: CPU time
• Function-based measurements
  • Quantities that change as a result of function calls, defined by the calls’ input and output parameters
  • Require library instrumentation
  • Based on counters/integrators
  • Provide a hierarchy of metrics: e.g. send volume
• User-defined measurements
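A function-based measurement can be pictured as a wrapper around a library call that updates counters and integrators from the call's parameters and duration. The sketch below is illustrative only: `raw_send`, `instrumented_send`, and the two small classes are hypothetical names, not part of the OCM-G.

```python
import time
from dataclasses import dataclass

@dataclass
class Counter:
    """Accumulates discrete amounts (calls, bytes)."""
    value: int = 0
    def add(self, amount: int = 1) -> None:
        self.value += amount

@dataclass
class Integrator:
    """Accumulates a continuous quantity, here elapsed time in seconds."""
    total: float = 0.0
    def add(self, amount: float) -> None:
        self.total += amount

calls = Counter()
send_volume = Counter()
send_time = Integrator()

def raw_send(payload: bytes) -> int:
    """Stand-in for a real communication library call (hypothetical)."""
    return len(payload)

def instrumented_send(payload: bytes) -> int:
    """Wrapper that measures the call without changing its result."""
    start = time.perf_counter()
    sent = raw_send(payload)
    send_time.add(time.perf_counter() - start)  # time spent in the call
    calls.add()                                 # number of library calls
    send_volume.add(sent)                       # data volume from the return value
    return sent

instrumented_send(b"abcd")
instrumented_send(b"0123456789")
```

Only the three accumulated values need to be reported, which is why this style of measurement does not require a full event trace.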
User-defined Measurements: Metrics
• Possible ways of defining a metric:
  • A metric defined by an existing metric, measured during part of an execution (e.g. between 2 probes)
  • A metric defined by a parameter of a probe
  • A metric derived from an existing set of metrics via aggregation or comparison
User-defined Measurements: Example
• Example of a new metric:

    IO_volume_for_interaction(Process[] processes, File[] files, Region[] regions, TimeInterval currTime) {
        volume[p][vt] = IO_volume(p, files, regions) AT end(p, vt)
                      - IO_volume(p, files, regions) AT begin(p, vt);
        globalVol[vt] = SUM(volume[p][vt] WHERE p IN processes);
        result = SUM(globalVol[vt] WHERE vt IN currTime);
        RETURN result;
    }

• Components of the metric definition:
  • Two probes: begin/end of a user interaction
  • Standard metric IO_volume for total disk I/O
  • Volume accumulated over space (p) and time (vt)
  • Optimization: distributed measurements
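The same aggregation can be sketched in Python to make the two summation steps explicit. This is not the G-PM specification language; the `readings` dictionary, which holds the cumulative IO_volume value at the begin and end probes, is a hypothetical data layout chosen for the example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical layout: (process, virtual time) ->
#   (IO_volume at the begin probe, IO_volume at the end probe)
ProbeReadings = Dict[Tuple[str, int], Tuple[int, int]]

def io_volume_for_interaction(
    readings: ProbeReadings,
    processes: Iterable[str],
    curr_time: Iterable[int],
) -> int:
    """Sum the I/O volume per interaction over processes, then over time."""
    processes = list(processes)
    result = 0
    for vt in curr_time:
        # globalVol[vt]: accumulate over space (processes)
        global_vol = sum(
            readings[(p, vt)][1] - readings[(p, vt)][0]
            for p in processes
            if (p, vt) in readings
        )
        # result: accumulate over time (virtual time steps)
        result += global_vol
    return result

readings = {
    ("p_1", 0): (0, 100),
    ("p_2", 0): (50, 80),
    ("p_1", 1): (100, 250),
    ("p_2", 1): (80, 90),
}
total = io_volume_for_interaction(readings, ["p_1", "p_2"], [0, 1])
```

Because the per-process differences are independent, each one could be computed where the process runs and only the partial sums combined, which is one way to read the "distributed measurements" optimization.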
Probes for User-defined Measurements
• High-level performance data
  • Particular, relevant events, e.g. start/end of user interaction
  • Associated events, e.g. start/end events in different processes – entry/exit from the same computation phase
  • Data computed within the application, e.g. residual value
• Instrumentation code inserted into application code
  • Probe – special function call
  • Additional parameters for application-specific data
  • The same probe for different metrics
  • Low overhead of inactive instrumentation
Measurement Definition Window
• Specifies what should be measured, e.g. Receive Volume
• Specifies where the measurement should be done, e.g. on which site, host, process, etc.
• Specifies which part of the code should be measured, e.g. a particular function
• In measurements that involve two processes, such as "traffic between process A and process B", specifies the second partner
Use Case: Description
• Medical simulation application with visualization kernel
• Simulation runs on a different site (server) than the visualization (client)
• Task
  • analyse the performance of simulation-to-visualization communication
Use Case: Code Instrumentation
• Programmer inserts three probes:
  • In the source code on the server:
    • Probe A: after the server asks the client to visualize a frame
    • Probe B: after the data is sent to the client
  • In the source code on the client:
    • Probe C: before the data is passed to the graphics engine
• Programmer recompiles the application
Use Case: New Metrics
• Three new custom metrics:
  • Generated frames/sec = 1 / (time between invocations of probe A)
  • Compression factor = (data passed to probe C) / (sent volume between execution of probe A and probe B)
  • Visualisation kernel processing time/frame = (time interval between execution of probe A and probe B)
• New metrics can be used in the same way as the built-in ones
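The three formulas above can be worked through on a handful of probe timestamps. This is a sketch with made-up numbers; the function names and the choice to average over several frames are assumptions, since the slide only gives the per-frame definitions.

```python
from statistics import mean
from typing import List

def frames_per_second(a_times: List[float]) -> float:
    """1 / (mean time between consecutive invocations of probe A)."""
    gaps = [later - earlier for earlier, later in zip(a_times, a_times[1:])]
    return 1.0 / mean(gaps)

def compression_factor(data_at_c: float, sent_volume: float) -> float:
    """(data passed to probe C) / (volume sent between probes A and B)."""
    return data_at_c / sent_volume

def processing_time_per_frame(a_times: List[float], b_times: List[float]) -> float:
    """Mean interval between probe A and the matching probe B."""
    return mean([b - a for a, b in zip(a_times, b_times)])

# Hypothetical timestamps (seconds) for three frames.
a_times = [0.0, 0.5, 1.0]
b_times = [0.125, 0.625, 1.125]

fps = frames_per_second(a_times)
factor = compression_factor(data_at_c=4096, sent_volume=1024)
per_frame = processing_time_per_frame(a_times, b_times)
```

Here probe A fires every half second, the client receives four times more data than was sent over the network, and each frame spends 0.125 s between probes A and B.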
Why OMIS / OCM-G?
• Long experience in OMIS monitoring
• 150k reusable lines of OCM code in existence since 1997-1999
• Existing OMIS Tools
  • Relatively easy to port due to universal interface
• Versatility of the approach
  • Monitoring services for different types of tools
  • Information / manipulation / event services
  • Extensibility
• Transparency for the user
• Portability, flexibility
From OCM to OCM-G
• Inherited from the OCM
  • Core monitoring concepts
  • 99% of monitoring functionality
  • Instrumentation techniques
• New in the OCM-G
  • Grid-enabled start-up
  • GSI security
  • Permanent service concept
  • Grid-specific services
  • Probes – support for user-defined arbitrary events
  • New objects – sites
OCM-G – Architecture
[Figure: OCM-G architecture. Legend: SM = Service Manager, LM = Local Monitor, AM = Application Module, AP = Application Process. Other labels: Tool (e.g. G-PM), OCM-G, site, node.]
Interfaces
• OMIS (On-line Monitoring Interface Specification)
• Target interface
  • /proc
  • ptrace
  • shared memory
[Figure labels: Tool, site, SM, node, LM, AP, AM]
OCM-G and GGF’s GMA (1)
[Figure: GMA components (Registry, Producers, Consumers, with register and discover operations) shown alongside OCM-G components (Tool, external information system, SM, LMs).]
OCM-G and GGF’s GMA (2)
• GMA defines two monitoring scenarios

GMA                                              OCM-G
Query / Response (one or more events returned)   Unconditional requests
Subscribe (event stream returned)                Conditional requests
Short Overview of OMIS
• Target system view
  • hierarchical set of objects
    • sites, nodes, processes, threads
  • objects identified by tokens, e.g. n_1, p_1, etc.
• Three types of services
  • Information
  • Manipulation
  • Event
OMIS Services
• Information services
  • obtain information on target system
  • e.g. node_get_info = obtain information on nodes in the target system
• Manipulation services
  • perform manipulations on the target system
  • e.g. thread_stop = stop specified threads
• Event services
  • detect events in the target system
  • e.g. thread_started_libcall = detect invocations of specified functions
• Information + manipulation services = actions
OMIS Requests
Services are combined into two types of monitoring requests:
• Unconditional requests
  • executed immediately and only once
• Conditional requests
  • execute actions whenever a specified event occurs
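The difference between the two request types can be sketched as "run now" versus "register for later". The `Monitor` class below is a hypothetical illustration of that behaviour only; it does not reproduce OMIS request syntax or the OCM-G API.

```python
from typing import Callable, Dict, List

class Monitor:
    """Toy monitor that distinguishes the two request types."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[Callable[[str], None]]] = {}

    def unconditional(self, action: Callable[[], object]) -> object:
        """Execute the action immediately and exactly once."""
        return action()

    def conditional(self, event: str, action: Callable[[str], None]) -> None:
        """Remember the action; it runs each time the event occurs."""
        self._subscriptions.setdefault(event, []).append(action)

    def raise_event(self, event: str, token: str) -> None:
        """Called when an event service detects an event on an object."""
        for action in self._subscriptions.get(event, []):
            action(token)

monitor = Monitor()

# Unconditional: one query, one response (the reply here is made up).
info = monitor.unconditional(lambda: {"n_1": "up"})

# Conditional: the action fires for every matching event.
log: List[str] = []
monitor.conditional("thread_started_libcall", log.append)
monitor.raise_event("thread_started_libcall", "p_1")
monitor.raise_event("thread_started_libcall", "p_2")
```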
Distribution of a Request
[Figure: a stop request travels from the Tool through SMs and LMs to application processes AP1-AP4 on node1-node3 in site1 and site2. Request labels in the figure, in order of appearance:]
  :thread_stop([a_1])
  :thread_stop([p_1,p_2,p_3])
  :thread_stop([p_1,p_2])
  :thread_stop([p_3])
  :thread_stop([p_4])
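The splitting of one request into per-site and per-node sub-requests can be sketched as a grouping step over the token list. The placement of processes on nodes and sites below is an assumption inferred from the request labels in the figure, and `split_request` is a hypothetical helper, not an OCM-G function.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed placement, inferred from the figure labels: token -> (site, node).
PLACEMENT: Dict[str, Tuple[str, str]] = {
    "p_1": ("site1", "node1"),
    "p_2": ("site1", "node1"),
    "p_3": ("site1", "node2"),
    "p_4": ("site2", "node3"),
}

def split_request(
    service: str,
    tokens: List[str],
    placement: Dict[str, Tuple[str, str]],
) -> Dict[str, Dict[str, str]]:
    """Group tokens by site, then by node, and build one sub-request per node."""
    per_site: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for token in tokens:
        site, node = placement[token]
        per_site[site][node].append(token)
    return {
        site: {
            node: f":{service}([{','.join(node_tokens)}])"
            for node, node_tokens in nodes.items()
        }
        for site, nodes in per_site.items()
    }

plan = split_request("thread_stop", ["p_1", "p_2", "p_3", "p_4"], PLACEMENT)
```

In this reading, each SM forwards to its LMs only the part of the request that concerns processes on their nodes.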
Transparency
• Preparation of an application for monitoring is straightforward
  • ocm mpicc -o ping ping.c
  • no need for manual source code instrumentation or automatic instrumentation tools
• Start-up of the OCM-G is entirely transparent
  • Application submitted to run as usual
  • mpirun -np 2 ping --ocmg-regcont --ocmg-appname "myapp"
• Tools can be attached to a running application at any time
Efficiency
• Selective instrumentation
  • Activated or deactivated on demand
• Buffering and preprocessing
  • Data stored in a local buffer
  • Counters and integrators used
  • Only summarized information sent to the OCM-G, not a full trace
• Evaluation
  • Monitoring overhead with an excessive number of events: ~4%
  • Zero overhead of inactive instrumentation
OCM-G Start-up Sequence
[Figure: start-up sequence on Site 1 involving application processes P1 and P2 on node1, LMs, an SM, a Tool and an external localization mechanism. Step labels: find LM, fork(), connect, find SM.]
Security Issues
• OCM-G components handle multiple users, tools and applications
• Authentication and authorization needed at two levels
  • Tool-SM – check if the user is allowed to manipulate objects
  • SM-LM – check that the request comes from the SM and that the user is authorized
Security – Solutions
• LMs are user-bound
  • Run as user processes
  • Security ensured by OS mechanisms
• Service Managers are permanent
  • Run as unprivileged processes (nobody)
  • User Grid Id checked internally (partial security)
  • Grid certificates for users, tools and SMs incorporated (ultimate security)
Status – Prototype Completed
• Typical metrics
• 90% of monitoring services implemented
• New Grid-enabled start-up mechanism
• Support for multiple applications and tools
• Works on one site
• Not yet a permanent Grid service
Integration of G-PM and OCM-G
• G-PM = Grid Performance Measurement tool
• OCM-G is the data source for G-PM
• Full integration easily achieved (OMIS!)
• Measurements
  • CPU usage
  • Delay of communication
  • Volume of data transfer
Future work
• Full set of measurements
• Integration with Grid info services
• Support for multiple sites
• Permanent Grid service
  • single instance of the OCM-G
  • support for multiple users
• Incorporate security based on GSI
• Support for dynamic application behavior
  • migration, creation