GridPP Monitoring & Accounting Dave Kant CCLRC, e-Science Centre
Monitoring Overview • Overview • How Many Jobs on the Grid? • LCG/EGEE Monitoring System • Putting it all together for GridPP • Future Plans EGEE’03, April 2005 - 2
How Many Jobs on the Grid? • This question is a way to introduce the various tools in development for the LCG/EGEE Grid • There are several sources for estimating the number of jobs: • Information System • Accounting System • Resource Brokers
How Many Jobs on the Grid? • One source of information is the monitoring system based on R-GMA • Tools which gather information and use the R-GMA backbone for data collection: • GIIS Monitor • APEL • Site Functional Tests • Tools which create reports: • RB Logging & Bookkeeping data mining • Accounting
GIIS Monitor http://goc.grid.sinica.edu.tw/gstat/ • GIIS Monitor developed by GOC Taipei (Min Tsai) • Tool to display and check information published by the site GIIS • Sanity checks and fault detection of the information system every 5 minutes • Provides an instantaneous snapshot of the number of jobs
How Many Jobs on the Grid? • Another source of information is the accounting system which, like many sources, is not complete but covers most of the resources. • This is not the case for GridPP resources. • Accounting information is based on resource usage published by batch servers
How Many Jobs on the Grid? • The latest source is a data mining tool which can be used to examine RB Logging and Bookkeeping information (via R-GMA) at the user level. https://lxn1192.cern.ch:9443/~judit/job-monitor.cgi
How Many Jobs on the Grid? • A further source is based on the work of the EGEE QA Team • They monitor several, but not all, resource brokers on LCG and create reports of their usage. • http://egee-jra2.web.cern.ch/EGEE-JRA2/index.html • Statistics based on aggregated information • Job success and job throughput per VO and per RB • Grid efficiency (execution time vs waiting time)
How Many Jobs on the Grid?
How Many Jobs on the Grid? • Job duration: Dteam and LHCb jobs dominate and are relatively short-lived.
Site Functional Tests • Installation and configuration of a site is quite a complicated procedure. • When there is a new release, sites don’t upgrade at the same time. • Upgrades don’t always go smoothly. • Unexpected things happen (who turned off the power?) • Day-to-day problems; robustness of service under load? • The SFT framework consists of a number of tests which probe a site to determine its operational status. • Coverage includes all certified sites in the EGEE/LCG infrastructure, uncertified sites (for the internal certification process performed by ROCs), sites that are part of the gLite Pre-Production Service, and all other sites using LCG or gLite middleware
SFT • SFT runs every 3 hours and writes test results to a database using R-GMA • Site summaries and histories • SFT used by ROCs for certification • Grid–Ireland SFT
GridPP Monitoring Map http://map.gridpp.ac.uk/ • GPPMon is a lightweight test which sends a simple job to GridPP resources every hour. • Links hourly job submission test results to SFT, GSTAT, RSS feeds and accounting data
Future Plans for GPPMon • GPPMon (GridPP monitor) to be switched off • SFT2 runs every 3 hours and sites/ROCs can run these tests independently, so there is no real need for the GPPMon jobs. • Proposal is to link the GridPP monitoring map to the monitoring data in R-GMA and make use of changes to the grid middleware, e.g. support for longitude and latitude in the Glue Schema (LCG 2.6). • Google Map
Google Map http://goc03.grid-support.ac.uk/googlemaps/gridpp.html
Accounting Overview • This is a summary of the status of Accounting & Reporting following its deployment in LCG 2.6 • Overview • APEL Design • What’s New? • LCG Accounting (OSG, NorduGrid, EGEE) • Issues
Accounting Flow Diagram
Accounting Home Page http://goc.grid-support.ac.uk// • 107 sites publishing data (Sep 02 2005) • Over 3.3 million job records • ~100K records per week (period June 1st to mid Aug 2005)
What’s New? • Added GridPP View to the reporting interface • Requirements driven by GridPP • Global view of entire organisation • Tier-2 summaries • Detailed view at site level • CSV download of information • Toggle between normalised and un-normalised datasets • http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.html
GridPP Input • GridPP Metrics and Deployment Document (J. Coles) • Metric 10: Number of sites publishing accounting data at the end of the last quarter • Metric 11: KSI2K hours of CPU processing delivered (per VO) over the last quarter • We are looking for meaningful plots that allow important conclusions to be drawn without misleading people • Is Job Efficiency meaningful? • Sites treat their data in different ways: • At the Tier-1, wall clock times (WCT) are scaled because of the scheduler • At other sites, only system time is scaled • What about hyper-threading? • Perhaps we need to provide descriptive text against each plot to warn of such problems? • Spot potential problems in resource allocation • Identify trends
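Metric 11 can be illustrated with a short sketch. It assumes the usual normalisation, in which CPU hours are multiplied by the worker node's SpecInt2000 rating divided by 1000 to give KSI2K hours. The record layout (`vo`, `cpu_seconds`, `si2k`) and the sample values are hypothetical and are not taken from the APEL schema or from real accounting data.

```python
from collections import defaultdict


def ksi2k_hours(cpu_seconds, si2k_rating):
    """Convert raw CPU seconds to KSI2K hours for a node with the given SI2K rating."""
    return (cpu_seconds / 3600.0) * (si2k_rating / 1000.0)


def ksi2k_hours_per_vo(records):
    """Sum normalised CPU hours per VO over an iterable of job records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["vo"]] += ksi2k_hours(rec["cpu_seconds"], rec["si2k"])
    return dict(totals)


# Made-up job records, for illustration only (hypothetical field names).
records = [
    {"vo": "atlas", "cpu_seconds": 7200, "si2k": 1000},
    {"vo": "atlas", "cpu_seconds": 3600, "si2k": 500},
    {"vo": "lhcb", "cpu_seconds": 1800, "si2k": 2000},
]
totals = ksi2k_hours_per_vo(records)
```

Keeping the raw CPU seconds next to the rating is what makes a normalised / un-normalised toggle cheap: the un-normalised view simply skips the rating factor.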
GridPP View Screen Shots • Atlas dominates in Tier1 • Atlas and LHCb dominating • KSI2K delivered per Tier1/Tier2 per VO • Job Efficiency = CPUT/WCT (CPU time / wall clock time) • Why is Atlas efficiency at 60%? • Why is DZERO efficiency for MANHEP > 1?
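The efficiency definition on this slide, CPUT/WCT, can be sketched as a small check that also flags values above 1, which usually indicate that CPU time and wall clock time were not scaled consistently. The tuple layout, site names and numbers below are invented for illustration and do not reproduce the plotted data.

```python
def job_efficiency(cpu_time, wall_clock_time):
    """Return CPUT/WCT, or None when the wall clock time is not positive."""
    if wall_clock_time <= 0:
        return None
    return cpu_time / wall_clock_time


def flag_suspicious(summaries):
    """Return (site, vo, efficiency) entries whose efficiency exceeds 1."""
    flagged = []
    for site, vo, cpu_time, wall_clock_time in summaries:
        eff = job_efficiency(cpu_time, wall_clock_time)
        if eff is not None and eff > 1.0:
            flagged.append((site, vo, eff))
    return flagged


# Made-up per-site, per-VO totals in seconds (hypothetical values).
summaries = [
    ("SITE-A", "atlas", 600.0, 1000.0),
    ("SITE-B", "dzero", 1100.0, 1000.0),
]
flagged = flag_suspicious(summaries)
```

A check of this kind could feed the descriptive warning text suggested for each plot, rather than silently dropping the affected entries.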
Site View (Lancaster) • Breakdown of data per VO per month showing Njobs, CPUT, WCT and record history • Total CPU usage per VO • Gantt chart • NB: Gaps across all VOs are consistent with scheduled downtimes in GOCDB
APEL IN LCG 2.6 • New version with better documentation • APEL supports PBS and LSF • Consists of a number of components • Core module contains functionality common to all components • Plugin components provide log parsing functionality for PBS and LSF job managers.
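The core-plus-plugin split can be sketched as a registry of per-batch-system parsers. This is only an illustration of the idea, not APEL's actual code or interfaces: the function names and the `PARSERS` registry are hypothetical. The sample line follows the general shape of a PBS accounting "E" (job end) record, `date;type;job id;key=value ...`, and the sketch assumes that no value contains spaces.

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_pbs_end_record(line):
    """Parse one PBS accounting line; return a dict for 'E' records, else None."""
    parts = line.strip().split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    _timestamp, _record_type, job_id, message = parts
    # Assumption: fields are space-separated key=value pairs without embedded spaces.
    fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
    if "resources_used.cput" not in fields or "resources_used.walltime" not in fields:
        return None
    return {
        "job_id": job_id,
        "user": fields.get("user"),
        "group": fields.get("group"),
        "cpu_seconds": hms_to_seconds(fields["resources_used.cput"]),
        "wall_seconds": hms_to_seconds(fields["resources_used.walltime"]),
    }


# Hypothetical plugin registry: the core looks up a parser by batch system name.
PARSERS = {"pbs": parse_pbs_end_record}

# Made-up sample record, for illustration only.
sample = (
    "04/01/2005 12:00:00;E;1234.ce.example.org;"
    "user=alice group=atlas queue=short "
    "resources_used.cput=00:10:00 resources_used.walltime=00:20:00"
)
record = PARSERS["pbs"](sample)
```

An LSF parser would register under its own key and return the same dictionary shape, so the common code never needs to know which batch system produced a record.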
Accounting Dissemination • CERN Courier • LCG Computing Newsletter (slightly more technical) • AHM 2005 (more technical still)
APEL and gLite • Is APEL integrated in gLite? • Work currently in progress. • We have ported the APEL code into the gLite CVS repository but need to understand functional differences, e.g. WMS and use of Condor • What about its development plan? • Future unclear given presence of DGAS in gLite • Areas of possible development: • Condor (easy or complicated) • Reporting Tool (GridICE will most likely provide this)
LCG Accounting • Project involves combining results from all three infrastructures and presenting an aggregated view • Peer infrastructures in LCG: • Open Science Grid (Ruth Pordes, Philippe Canal, Matteo Melani) • NorduGrid (Per Oster) • EGEE • Currently, LHCView filters LHC VO data from EGEE accounting data.
Requirements • Combine results from all three infrastructures … • Ideally: distributed queries to multiple databases • Each peer manages an accounting database • LHC VO filtering provided through a web services interface • Initial implementation: centralised collection • Peers publish data into a global database • Web services or direct MySQL inserts • Common problem: different Grid infrastructures may use different schemas. GGF defines a schema, but it is quite flexible. May need “translators” to convert from one schema to another (these already exist).
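The centralised-collection idea with schema "translators" and LHC VO filtering can be sketched as follows. Both peer schemas, all field names and the sample records are hypothetical and stand in for whatever each infrastructure actually publishes; a plain list stands in for the global database.

```python
# The four LHC experiment VOs used for filtering.
LHC_VOS = {"alice", "atlas", "cms", "lhcb"}


def from_peer_a(rec):
    """Translate a record from hypothetical peer schema A to the common form."""
    return {
        "site": rec["SiteName"],
        "vo": rec["VO"].lower(),
        "cpu_hours": rec["CpuSeconds"] / 3600.0,
    }


def from_peer_b(rec):
    """Translate a record from hypothetical peer schema B to the common form."""
    return {
        "site": rec["cluster"],
        "vo": rec["project"].lower(),
        "cpu_hours": rec["cpu_hours"],
    }


# One translator per peer infrastructure (names are placeholders).
TRANSLATORS = {"peer_a": from_peer_a, "peer_b": from_peer_b}


def collect(published):
    """Translate (peer, record) pairs into the common form, as a global store would hold them."""
    return [TRANSLATORS[peer](rec) for peer, rec in published]


def lhc_view(records):
    """Keep only records belonging to LHC VOs."""
    return [rec for rec in records if rec["vo"] in LHC_VOS]


# Made-up published records, for illustration only.
published = [
    ("peer_a", {"SiteName": "SITE-A", "VO": "ATLAS", "CpuSeconds": 7200}),
    ("peer_b", {"cluster": "SITE-B", "project": "biomed", "cpu_hours": 3.0}),
    ("peer_b", {"cluster": "SITE-C", "project": "cms", "cpu_hours": 1.5}),
]
common = collect(published)
lhc_records = lhc_view(common)
```

In the distributed variant described as the ideal, the same translators would sit behind each peer's web service interface instead of in front of a single global database.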