Monitoring and Performance Measurement in Production Grid Environments
David Wallom
Overview
• Who uses monitoring?
• Aspects of performance measurement
• Tools for monitoring
• Adding a new service into a monitoring framework
Who are the consumers of monitoring?
• Grid/VO management
  • Responsible for designing and maintaining requirements
  • Verify fulfillment of SLAs by resource providers
• System administrators
  • Notified of problems
  • Need enough information to understand the context of a problem
• End users
  • View results and compare them with the problems they are having
  • Debug user account/environment issues
  • Advanced users: provide feedback to Grid/VO management
Monitoring from a user perspective
• Things that need to work for the Grid:
  • Can I log in?
  • Is my application (or are my applications) available on connected systems?
  • Can I get to my input data?
  • What credentials do I need?
  • Can I get the input data to the application?
  • How long will my application take to run?
  • …
Performance Measurement
• Depends on monitoring of:
  • Availability
  • Usage
Measuring Availability
• Test the following grid functionality:
  • User authorization
  • System information publishing
  • Data transfer to and from the system
  • Submission of tasks onto the system
• Measurement of other functionality
  • Type of system
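The availability checks listed above can be sketched as a set of probes run against one system, with availability taken as the fraction of runs in which each probe passed. This is a minimal sketch, not part of any of the tools discussed here: the check names, the `run_checks` and `availability` functions, and the idea that each probe is a plain callable returning True or False are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical check names mirroring the functionality listed on the slide.
CHECKS = (
    "authorization",
    "information_publishing",
    "data_transfer",
    "job_submission",
)

# A probe takes a host name and reports whether the function worked.
Probe = Callable[[str], bool]


def run_checks(host: str, probes: Dict[str, Probe]) -> Dict[str, bool]:
    """Run every check against one host.

    A probe that raises (or is missing) is recorded as a failure,
    so one broken probe does not stop the remaining checks.
    """
    results = {}
    for name in CHECKS:
        try:
            results[name] = bool(probes[name](host))
        except Exception:
            results[name] = False
    return results


def availability(history: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of recorded runs in which each check passed."""
    if not history:
        return {}
    passed = defaultdict(int)
    for run in history:
        for name, ok in run.items():
            passed[name] += int(ok)
    return {name: passed[name] / len(history) for name in CHECKS}
```

Real probes would wrap whatever client commands a given grid uses for login, information queries, file transfer and job submission; those are deliberately left out here.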
Measuring Usage
• Within each system we need to know:
  • Current load
    • e.g. queue lengths, number of running processes on an SMP system
  • Network connectivity
  • Total throughput rate for a submitted user job
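As a rough illustration of the usage quantities above, the sketch below records a load sample and derives simple per-job figures from timestamps. The class and function names are hypothetical, the queue length and process count are assumed to be supplied by the caller (for example from the local batch system), and "throughput" is interpreted here as completed jobs per hour, which is only one possible reading of the slide.

```python
import os
from dataclasses import dataclass
from typing import List


@dataclass
class UsageSample:
    queue_length: int        # jobs waiting in the batch queue
    running_processes: int   # processes running on the SMP system
    load_1min: float         # 1-minute load average of the local node


def sample_usage(queue_length: int, running_processes: int) -> UsageSample:
    """Combine caller-supplied queue figures with the local load average."""
    try:
        load = os.getloadavg()[0]   # not available on every platform
    except (OSError, AttributeError):
        load = float("nan")
    return UsageSample(queue_length, running_processes, load)


@dataclass
class JobTiming:
    submitted: float   # seconds since the epoch
    started: float
    finished: float

    @property
    def wait(self) -> float:
        """Time spent queued before the job started."""
        return self.started - self.submitted

    @property
    def run(self) -> float:
        """Time spent executing."""
        return self.finished - self.started


def jobs_per_hour(jobs: List[JobTiming]) -> float:
    """Completed jobs per hour between first submission and last finish."""
    if not jobs:
        return 0.0
    span = max(j.finished for j in jobs) - min(j.submitted for j in jobs)
    return len(jobs) * 3600 / span if span > 0 else 0.0
```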
Tools for monitoring availability
• System status
• Grid status
• Combined system and grid status monitoring
Ganglia
• Developed out of the HPC community
• Monitors worker nodes as well as system head nodes
• Sub-nodes can report to a master to create grid-wide monitoring
• Example:
  • http://oxgrid-vom.ierc.ox.ac.uk/ganglia/
Big Brother
• Designed to monitor individual systems
• Simple interface giving immediate feedback on overall system status
• Providers can be added for additional services, such as further processes to be monitored
• Historical trends can be difficult to examine, though
• Example:
  • http://cerb-mds.bris.ac.uk/bb/bb.html
Grid Interoperability Test Scripts
• Developed by Southampton e-Science Centre
• Tests each of the standard grid functions in series for a specified node
• A wrapper tests many systems in parallel
• Example of the results:
  • http://www.ngs.ac.uk/ops/gits/oxford/NationalGridService.html
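The "series per node, parallel across nodes" pattern described above can be sketched as follows. This is not the GITS code itself; `run_series`, `run_parallel` and the representation of a check as a (name, callable) pair are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# A check is a (name, function) pair; the function takes a node name.
Check = Tuple[str, Callable[[str], bool]]
Outcome = List[Tuple[str, bool]]


def run_series(node: str, checks: List[Check]) -> Outcome:
    """Run the checks one after another against a single node."""
    results = []
    for name, func in checks:
        try:
            ok = bool(func(node))
        except Exception:
            ok = False          # a crashing check counts as a failure
        results.append((name, ok))
    return results


def run_parallel(nodes: List[str], checks: List[Check],
                 max_workers: int = 8) -> Dict[str, Outcome]:
    """Wrapper: test many nodes at once, each with its own serial run."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_node = pool.map(lambda n: run_series(n, checks), nodes)
        return dict(zip(nodes, per_node))
```

Threads suit this case because the checks mostly wait on remote systems; keeping each node's checks serial preserves any ordering the checks depend on, such as authenticating before transferring data.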
INCA
• Developed by SDSC and TeraGrid
• Extensible framework for monitoring
• Tests the following as standard:
  • Static system information
  • Installed software versions
  • Network performance
  • Load on both the head node and the queue system, if available
• Additionally, the UK NGS has developed a plug-in for the GITS tests
• Example:
  • http://inca.grid-support.ac.uk/
Testing the behaviour of a Grid
1. Define a set of concrete requirements for connected systems
2. Write tests to verify the requirements
3. Periodically run the tests and collect data across all of the systems
4. Publish the data and archive it for reporting
5. Automate steps 3 and 4 to provide real-time system status information
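The cycle above (run tests, collect, publish, archive, repeat) can be sketched as a small loop. Everything here is an assumption for illustration: the function names, the JSON-lines archive file, and the rule that a system counts as healthy only when every requirement passes.

```python
import json
import time
from typing import Callable, Dict, List


def collect(systems: List[str],
            tests: Dict[str, Callable[[str], bool]]) -> List[dict]:
    """Step 3: run every requirement test against every system."""
    now = time.time()
    records = []
    for system in systems:
        for requirement, test in tests.items():
            try:
                ok = bool(test(system))
            except Exception:
                ok = False
            records.append({"time": now, "system": system,
                            "requirement": requirement, "ok": ok})
    return records


def archive(records: List[dict], path: str) -> None:
    """Step 4 (archive): append one JSON record per line for later reporting."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def current_status(records: List[dict]) -> Dict[str, bool]:
    """Step 4 (publish): a system is healthy only if all its tests passed."""
    status: Dict[str, bool] = {}
    for record in records:
        system = record["system"]
        status[system] = status.get(system, True) and record["ok"]
    return status


def monitor(systems, tests, path, interval_seconds, rounds):
    """Step 5: repeat collection and archiving on a fixed interval."""
    status: Dict[str, bool] = {}
    for i in range(rounds):
        records = collect(systems, tests)
        archive(records, path)
        status = current_status(records)
        if i + 1 < rounds:
            time.sleep(interval_seconds)
    return status
```

In production the loop would more likely be driven by a scheduler such as cron than by `time.sleep`, and the archive would sit behind whichever monitoring framework is in use.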
Connecting to existing production systems
• Determine monitoring requirements for the systems to be connected
• Write independent tests for the service being provided
• Write information providers to fit the tests into existing monitoring frameworks
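One way to keep tests independent of any particular framework is to have each test return a neutral result record and let small "information provider" functions translate it. The sketch below assumes this design; `CheckResult`, both output formats and the `publish` helper are hypothetical and do not correspond to the native formats of Ganglia, Big Brother or INCA.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    """Framework-neutral outcome of one test against one service."""
    system: str
    service: str
    ok: bool
    detail: str = ""


def as_status_line(result: CheckResult) -> str:
    """Hypothetical provider for a framework that reads one-line statuses."""
    colour = "green" if result.ok else "red"
    return f"{result.system} {result.service} {colour} {result.detail}".rstrip()


def as_json(result: CheckResult) -> str:
    """Hypothetical provider for a framework that ingests JSON documents."""
    return json.dumps(asdict(result), sort_keys=True)


# Registry of providers; supporting another framework means adding one entry.
PROVIDERS = {"status_line": as_status_line, "json": as_json}


def publish(result: CheckResult, frameworks: List[str]) -> Dict[str, str]:
    """Render one result for each requested framework."""
    return {name: PROVIDERS[name](result) for name in frameworks}
```

With this split, the test for a new service is written once, and only the thin provider layer changes per framework.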
Conclusions
• Monitoring must be based on a well-known set of requirements for admins (both VO and systems) and users
• Several products are available to provide monitoring frameworks, each of which can be extended beyond its initial capabilities
• Life would be a lot simpler if there were a standard monitoring schema that could be used to plug grid and system information into all monitoring frameworks!