Monitoring Grid Services

Yin Chen, s0231189@sms.ed.ac.uk, June 2003

Contents • Issues of Monitoring • Project Proposal

Issues of Monitoring • What are the goals of Grid monitoring? • What are the characteristics of a Grid system? • What may need to be monitored? • What are the characteristics of monitoring data? • Related Work

What are the goals of Grid monitoring? • Propagate errors to users/management • Performance monitoring, to tune the application and use the Grid more efficiently • The question is not how to measure resources, but how to deliver information to end-users and system/Grid

What are the characteristics of a Grid system? • Complex distributed system => unexpectedly low performance is often observed • Where is the bottleneck? - application - operating system - disks - network adapters on either the sending or the receiving host - network switches, routers • Experience of the NetLogger group - 40% network, 40% application, 20% host problems - application: 50% client, 50% server process problems

What are the characteristics of a Grid system? (cont.) • Dynamic environment • World-wide distributed environment with - high latency - frequent faults - very heterogeneous resources

What may need to be monitored? • Disk space, processor speed, network bandwidth, CPU load, memory load, network load, network communication time, number of parallel streams, stripes, TCP/IP buffer size, and disk access time, which includes the time to copy data to or from the local hard disk on the server [2][3] • Some of this information is relatively static, while the rest is dynamic run-time information
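The split between relatively static and dynamic run-time information can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names `static_info` and `dynamic_info` and the choice of attributes are assumptions, and only Python standard-library calls are used.

```python
import os
import shutil
import time


def static_info(path="."):
    """Attributes that rarely change while the host is running."""
    return {
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": shutil.disk_usage(path).total,
    }


def dynamic_info(path="."):
    """Attributes that change at run time, so each reading is timestamped."""
    return {
        "timestamp": time.time(),
        "disk_free_bytes": shutil.disk_usage(path).free,
    }
```

Static values could be collected once and cached, while dynamic values would need to be re-read whenever a consumer asks for them.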

What are the characteristics of monitoring data? • Run-time monitoring data goes "old" quickly • Producers should be near the monitored entities • Data must be transported rapidly and efficiently from producer to consumer • Information should be made explicit, e.g. by timestamps • Updates are frequent • Performance information is often stochastic
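Because run-time data goes old quickly, a consumer needs a timestamp to judge whether a value is still usable. The sketch below shows one way to do that; the `Measurement` record, its field names and the `max_age_s` threshold are hypothetical and not taken from any of the systems discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    source: str       # entity that was measured (hypothetical field)
    metric: str       # e.g. "cpu_load"
    value: float
    timestamp: float  # seconds since the epoch, set by the producer


def is_fresh(m: Measurement, now: float, max_age_s: float) -> bool:
    """Return True if the measurement is no older than max_age_s seconds."""
    return now - m.timestamp <= max_age_s
```

A consumer would discard or re-request any measurement for which `is_fresh` returns False.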

Related Work • Monitoring and Discovery Service (MDS) • Grid Monitoring Architecture (GMA) • Relational Grid Monitoring Architecture (R-GMA) • Hawkeye • Globus Heartbeat Monitor (HBM) • Network Weather Service (NWS) • GridRM

MDS Architecture

GMA Architecture

R-GMA Architecture

Hawkeye Architecture

HBM Architecture

NWS Architecture

The Global Layer of GridRM

The Local GridRM Layer

Summary and Conclusion • A variety of systems exist for monitoring • Each system has its own strengths and weaknesses • Systems tend to use standard and open components • The GGF advocates the GMA architecture

Summary and Conclusion (cont.) • Similarities in architecture • At the lowest level, a sensor or other program generates a piece of data • Some systems allow data to be aggregated from a set of resources • At the resource level, the data from several information collectors is gathered into one component • Directory component • Decentralised hierarchical structure, which gives better fault tolerance • Differences in the use of push or pull mechanisms
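The shared layering described above (sensor, resource-level collector, directory) can be sketched in a few lines. The class names `Sensor`, `Collector` and `Directory` are illustrative assumptions and do not correspond to the API of MDS, R-GMA or any other system listed earlier.

```python
class Sensor:
    """Lowest level: produces one piece of data on demand."""

    def __init__(self, name, read):
        self.name = name
        self.read = read  # zero-argument callable returning the current value


class Collector:
    """Resource level: gathers the data of several sensors into one component."""

    def __init__(self, resource):
        self.resource = resource
        self.sensors = []

    def add(self, sensor):
        self.sensors.append(sensor)

    def collect(self):
        return {s.name: s.read() for s in self.sensors}


class Directory:
    """Lets consumers find the collector responsible for a resource."""

    def __init__(self):
        self._collectors = {}

    def register(self, collector):
        self._collectors[collector.resource] = collector

    def lookup(self, resource):
        return self._collectors.get(resource)
```

A consumer would first ask the directory for a resource and then read from the collector it returns, for example `directory.lookup("node1").collect()`.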

Project Proposal • Goal • Requirement • Architecture -- Pull Model • Specification • Implementation • Testing • Schedule

Goal • Realisation • Lightweight & Simple design • Reliability & Robustness

Architecture • What is the pull model? • The monitor sends requests to the service for information. This implies repeated queries of resource attributes over some time period at a specific frequency • In the push model, on the other hand, the service sends out notifications to a subscribed sink
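The difference between the two models can be sketched with a single in-process service. This is a minimal illustration under stated assumptions: the `Service` class, its `query`, `subscribe` and `set_load` methods, and the `pull` helper are all hypothetical names, and no network transport is modelled.

```python
import time


class Service:
    """A monitored service exposing one attribute, its load."""

    def __init__(self, load=0.0):
        self._load = load
        self._sinks = []

    def query(self):
        """Pull: the monitor asks, the service answers."""
        return self._load

    def subscribe(self, sink):
        """Push: register a callable that receives every new value."""
        self._sinks.append(sink)

    def set_load(self, load):
        """Change the attribute and notify all subscribed sinks."""
        self._load = load
        for sink in self._sinks:
            sink(load)


def pull(service, samples, interval_s):
    """Query the service `samples` times, waiting interval_s between queries."""
    readings = []
    for i in range(samples):
        readings.append(service.query())
        if i < samples - 1:
            time.sleep(interval_s)
    return readings
```

With pull, the monitor chooses the sampling frequency through `interval_s`; with push, the service decides when a sink hears about a change.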

Benefits of Pull • Less network traffic: collection is initiated only from the top • No time synchronisation problem: data is collected from all resources at the same time • The server can determine the size of the file, select the appropriate alternate server, and passively control the bandwidth and storage space • According to Globus, the "push" model generates a large amount of data and results in constant updates to the MDS • Standard LDAP databases are not designed to handle frequent updates
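One way to reason about the traffic claim is to count messages under a deliberately simple model. The assumptions here are not from the slides: every poll costs one request plus one response, every pushed change costs one notification, and all messages are the same size.

```python
def pull_messages(polls: int) -> int:
    """Each poll costs one request plus one response."""
    return 2 * polls


def push_messages(changes: int) -> int:
    """Each change costs one notification; no request is sent."""
    return changes


def cheaper_model(polls: int, changes: int) -> str:
    """Name the model that sends fewer messages over the same period."""
    pull, push = pull_messages(polls), push_messages(changes)
    if pull == push:
        return "equal"
    return "pull" if pull < push else "push"
```

Under this model, pull sends fewer messages only when the monitored values change more than twice as often as the monitor polls, for example `cheaper_model(10, 100)` versus `cheaper_model(10, 5)`.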

Benefits of Pull (cont.) • The pull model is based on distributing intelligence to the asset site, so that it becomes automated • Using machine-to-machine communications with connected sensors and autonomic computing, the asset performs self-diagnostics, maintains and repairs itself, re-routes energy flows, schedules non-routine maintenance and reports any out-of-the-ordinary activity that poses a security threat • IBM calls this autonomic computing: machine-to-machine communications take place to optimise the performance of computing and network resources

Problems of Pull • Must gather current measurements from all resources • A large real-time data volume may cause a bottleneck • May not be useful in fault detection: heartbeat events are valid only for a short time interval and should be delivered within that interval • May not be useful in dynamic sensor management • The push model is the most efficient in terms of bandwidth, as no requests are sent, only responses from the service
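The time constraint on heartbeat events can be illustrated with a small detector that drops heartbeats arriving after their validity window and flags sources that have gone silent. This is a sketch under assumptions: the `HeartbeatMonitor` class and its `timeout_s` parameter are hypothetical and are not the interface of the Globus Heartbeat Monitor.

```python
class HeartbeatMonitor:
    """Tracks the most recent valid heartbeat per source."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_seen = {}

    def beat(self, source, sent_at, received_at):
        """Record a heartbeat; reject it if it arrived after its validity window."""
        if received_at - sent_at > self.timeout_s:
            return False
        previous = self._last_seen.get(source, sent_at)
        self._last_seen[source] = max(sent_at, previous)
        return True

    def suspected_failed(self, now):
        """Sources whose last valid heartbeat is older than the timeout."""
        return sorted(
            s for s, t in self._last_seen.items() if now - t > self.timeout_s
        )
```

If the monitor only polls at a long interval, a failure can go unnoticed for up to one polling period, which is why short-lived heartbeat events fit a push style of delivery more naturally.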

Thanks
