
Service Level Metrics - Monitoring and Reporting

This presentation, given at the SC4 workshop in Mumbai, provides an overview of current metrics and their uses, work in progress, visualisation examples, and discussion points regarding service level metrics. It focuses on measuring the services delivered against MoU targets: data transfer rates, service availability, time to resolve problems, and resource availability.


Presentation Transcript


  1. Service Level Metrics - Monitoring and Reporting. SC4 Workshop, Mumbai: H. Renshall

  2. Talk Overview
  • Objectives
  • Current metrics and uses
  • Work in progress
  • Visualisation examples
  • Timescale
  • Discussion points
  • A synthesis of the work of teams at CERN/IT/GD, BARC, IN2P3 and ASGC
  • I have plundered slides from other talks, papers and screen dumps.

  3. Program of work objectives
  • Provide high-level (managerial) views of the current status and history of the Tier 0 and Tier 1 grid services
  • Keep this simple and provide a first result quickly (end of 1Q 2006)
  • Plan the views to be both qualitative and quantitative so they can be compared with the LCG site Memoranda of Understanding service targets for each site
  • Satisfy the recent LHCC LCG review observation: 'A set of Baseline Services provided by the LCG Project to the experiments has been developed since the previous comprehensive review. A measure of evaluating the performance and reliability of these services is being implemented.'

  4. The MoU Service Targets
  • These define the (high-level) services that must be provided by the different Tiers
  • They also define average availability targets and intervention / resolution times for downtime and degradation
  • These differ from Tier to Tier (less stringent as N increases) but refer to 'compound services', such as "acceptance of raw data from the Tier 0 during accelerator operation"
  • Thus they depend on the availability of specific components: managed storage, reliable file transfer service, database services, ...
  • An objective is to measure the services we deliver against the MoU targets:
    • Data transfer rates
    • Service availability and time to resolve problems
    • Resources available at a site (as well as measured usage)
  • Resources specified in the MoU are CPU capacity (in KSi2K), tbytes of disk storage, tbytes of tape storage and nominal WAN data rates (Mbytes/sec)

  5. MoU Minimum Levels of Service. Some of these imply weekend/overnight staff presence, or at least availability.

  6. Tier 0 Resource Plan from the MoU. Similar tables exist for all Tier 1 sites.

  7. Nominal MoU pp running data rates, CERN to Tier 1, per VO. These rates must be sustained to tape 24 hours a day, 100 days a year. Extra capacity is required to cater for backlogs / peaks.

  8. WLCG Component Service Level Definitions
  • Reduced defines the time between the start of the problem and the restoration of a reduced capacity service (i.e. >50%)
  • Degraded defines the time between the start of the problem and the restoration of a degraded capacity service (i.e. >80%)
  • Downtime defines the time between the start of a problem and the restoration of the service at minimal capacity (i.e. basic function but capacity < 50%)
  • Availability defines the sum of the time that the service is up compared with the total time during the calendar period for the service. Site-wide failures are not considered as part of the availability calculations. 99% means a service can be down up to 3.6 days a year in total; 98% means up to a week in total (see the arithmetic sketch below).
  • None means the service is running unattended
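
  A quick arithmetic sketch of these definitions, in Python (the function names and wrapper are illustrative additions, not part of the MoU text; only the thresholds and availability figures come from the slide above):

      # Illustrative only: thresholds and figures are those quoted on the slide.
      def restoration_level(capacity_fraction):
          """Which of the restoration thresholds does this capacity fraction meet?"""
          if capacity_fraction > 0.80:
              return "degraded service restored (>80% of capacity)"
          if capacity_fraction > 0.50:
              return "reduced service restored (>50% of capacity)"
          return "still in downtime (basic function only, capacity < 50%)"

      def max_downtime_days(availability, period_days=365.0):
          """Total downtime allowed over a calendar period for a given availability."""
          return (1.0 - availability) * period_days

      print(restoration_level(0.6))                  # reduced service restored
      print(f"{max_downtime_days(0.99):.2f} days")   # 3.65 days/year - the '3.6 days' above
      print(f"{max_downtime_days(0.98):.2f} days")   # 7.30 days/year - about a week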

  9. Tier0 Services

  10. Required Tier 1 Services. Many sites also run additional services, e.g. an RB.

  11. WLCG Service Monitoring framework
  • Service Availability Monitoring Environment (SAME): a uniform platform for monitoring all core services, based on SFT (Site Functional Tests) experience. Other data sources can be used.
  • Three main end users (and use cases):
    • Project management: overall metrics
    • Operators (local and Core Infrastructure Centres): alarms, detailed information for debugging, problem tracking
    • VO administrators: VO-specific SFT tests, VO resource usage
  • A lot of work has already been done:
    • SFT and GStat are monitoring CEs and site BDIIs and will feed data into SAME
    • GRIDVIEW is monitoring data transfer rates and job statistics
    • A SAME data schema has been established; it will share a common Oracle database with GRIDVIEW
    • Basic displays are in place (SFT reports, CIC-on-duty dashboard, GStat), but with inconsistent presentation and points of access
    • The basic framework for metric visualisation in GRIDVIEW is ready

  12. Current Site Functional Tests (SFT)
  • Submits a short-lived batch job to all sites to test various aspects of functionality: the site CE uses the local WMS to schedule it on an available local worker node
    • Security tests: CA certificates, CRLs, ... (work in progress)
    • Data management tests: basic operations using lcg-utils on the default Storage Element (SE) and a chosen "central" SE (usually at CERN - 3rd party replication tests)
    • Basic environment tests: software version, BrokerInfo, CSH scripts
    • VO environment tests: tag management, software installation directory, plus VO-specific job submission and tests (maintained by the VOs; example: LHCb DIRAC environment test)
    • Job submission - CE availability: can I submit a job and retrieve the output?
  • What is not covered? Only CEs and batch farms are really tested. Other services are tested only indirectly: SEs, RB, top-level BDII, RLS/LFC and the R-GMA registry/schema are tested by using default servers, whereas sites may have several.
  • Maintained by the Operations Team at CERN (plus several external developers) and run as a cron job on CERN LXPLUS (using an AFS-based LCG UI)
  • Jobs are submitted at least every 3 hours (plus on-demand resubmissions)
  • Tests have a criticality attribute. These can be defined by VOs for their tests using the FCR (Freedom of Choice for Resources) tool. The production SFT criticality levels are set by the dteam VO. Attributes are stored in the SAME database.
  • Results are displayed at https://lcg-sft.cern.ch:9443/sft/lastreport.cgi

  13. Description of current SFT tests (1 of 2)
  • If a test is defined as critical, a failure causes the site to be marked as bad on the SFT report page. If the test is not critical, only a warning message is displayed.
  • Job Submission
    • Symbolic name: sft-job. Critical: YES
    • Description: the result of test job submission and output retrieval. Succeeds only if the job finished successfully and the output was retrieved.
  • WN hostname
    • Symbolic name: sft-wn. Critical: NO
    • Description: host name of the Worker Node where the test job was executed. This test is for information only.
  • Software Version
    • Symbolic name: sft-softver. Critical: YES
    • Description: detect the version of software actually installed on the WN. The lcg-version command is used; if it is not available (very old versions of LCG) the test script checks only the version number of the GFAL-client RPM (see the sketch after this list).
  • CA RPMs Version (Certificate Authority RPMs)
    • Symbolic name: sft-caver. Critical: YES
    • Description: check the version of the CA RPMs installed on the WN and compare them with the reference ones. If for any reason the RPM check fails (e.g. another installation method was used), fall back to a physical file test (MD5 checksum comparison of all CA certificates against the reference list).
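
  As a concrete illustration of how a test like sft-softver might be implemented, the sketch below shells out to lcg-version and falls back to querying the GFAL-client RPM when that command is absent, as described above. It is a hypothetical wrapper for illustration only, not the actual SFT script; only the two commands it calls are taken from the slide.

      # Hypothetical sketch of the sft-softver logic described above; not the real SFT code.
      import subprocess

      def detect_lcg_version():
          """Return the middleware version installed on the WN, or None if undetectable."""
          try:
              out = subprocess.run(["lcg-version"], capture_output=True, text=True)
              if out.returncode == 0 and out.stdout.strip():
                  return out.stdout.strip()
          except FileNotFoundError:
              pass  # very old LCG releases do not ship lcg-version
          # Fall back to the version of the GFAL-client RPM.
          out = subprocess.run(["rpm", "-q", "--qf", "%{VERSION}", "GFAL-client"],
                               capture_output=True, text=True)
          return out.stdout.strip() if out.returncode == 0 else None

      if __name__ == "__main__":
          version = detect_lcg_version()
          print(f"OK {version}" if version else "FAILED: no version detected")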

  14. Description of current SFT tests (2 of 2)
  • BrokerInfo
    • Symbolic name: sft-brokerinfo. Critical: YES
    • Description: try to execute edg-brokerinfo -v getCE.
  • R-GMA client (Relational Grid Monitoring Architecture)
    • Symbolic name: sft-rgma. Critical: NO
    • Description: test the R-GMA client software configuration by executing the rgma-client-check utility script.
  • CSH test
    • Symbolic name: sft-csh. Critical: YES
    • Description: try to create and execute a very simple CSH script which dumps the environment to a file. Fails if the CSH script does not execute and the dump file is therefore missing.
  • Apel test (Accounting log parser and publisher)
    • Symbolic name: sft-apel. Critical: NO
    • Description: check whether Apel is publishing accounting data for the site by using the command: rgma -c "select count(*) from LcgRecords"

  15. SFT tests can be aggregates
  • Test: Replication management using LCG tools
    • Test symbolic name: sft-lcg-rm. Critical: YES
    • Description: this is a super-test that succeeds only if all of the following tests succeed (see the sketch after this list).
  • GFAL infosys
    • Description: check that the LCG_GFAL_INFOSYS variable is set and that the top-level BDII can be reached and queried.
  • lcg-cr to default SE
    • Description: copy and register a short text file to the default SE using the lcg-cr command.
  • lcg-cp default SE -> WN
    • Description: copy the file registered in test 8-2 back to the WN using the lcg-cp command.
  • lcg-rep default SE -> central SE
    • Description: replicate the file registered in test 8-2 to the chosen "central" SE using the lcg-rep command.
  • 3rd party lcg-cr -> central SE
    • Description: copy and register a short text file to the chosen "central" SE using the lcg-cr command.
  • 3rd party lcg-cp central SE -> WN
    • Description: copy the file registered in test 8-5 to the WN using the lcg-cp command.
  • 3rd party lcg-rep central SE -> default SE
    • Description: replicate the file registered in test 8-5 to the default SE using the lcg-rep command.
  • lcg-del from default SE
    • Description: delete the replicas of all the files registered in the previous tests using the lcg-del command.
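
  The super-test semantics - succeed only if every step in the chain succeeds - can be sketched as below. The lcg-* command names come from the slide; the argument lists, file names and the wrapper function are simplified placeholders rather than the real SFT invocation.

      # Simplified sketch of an aggregate (super-) test: all steps must succeed.
      import subprocess

      def run_step(name, cmd):
          """Run one sub-test command; return True on success."""
          try:
              result = subprocess.run(cmd, capture_output=True, text=True)
          except FileNotFoundError:
              print(f"{name}: FAILED (command not found)")
              return False
          print(f"{name}: {'OK' if result.returncode == 0 else 'FAILED'}")
          return result.returncode == 0

      def sft_lcg_rm(vo, test_file, default_se, central_se):
          # Placeholder arguments: the real tests pass GUIDs/SURLs between steps.
          steps = [
              ("lcg-cr to default SE",    ["lcg-cr", "--vo", vo, "-d", default_se,
                                           "-l", "lfn:/grid/test-file", f"file:{test_file}"]),
              ("lcg-cp default SE -> WN", ["lcg-cp", "--vo", vo, "lfn:/grid/test-file",
                                           "file:/tmp/test-copy"]),
              ("lcg-rep to central SE",   ["lcg-rep", "--vo", vo, "-d", central_se,
                                           "lfn:/grid/test-file"]),
              ("lcg-del from default SE", ["lcg-del", "--vo", vo, "-a", "lfn:/grid/test-file"]),
          ]
          # all() stops at the first failing step; the super-test then reports FAILED.
          return all(run_step(name, cmd) for name, cmd in steps)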

  16. SFT results display. A CIC operational view – not intended to be a management overview.

  17. GRIDVIEW: a visualisation tool for LCG
  • An umbrella visualisation tool which can provide a high-level view of various grid resources and functional aspects of the LCG
  • Displays a dashboard for the entire grid and provides a summary of the various metrics monitored by different tools at the various sites
  • To be used to notify fault situations to grid operations and user-defined events to VOs, by site and network administrators to view metrics for their sites, and by VO administrators to see resource availability and usage for their VOs
  • GRIDVIEW is currently monitoring gridftp data transfer rates (used extensively during SC3) and job statistics
  • The basic framework for metric visualisation, representing grid sites on a world map, is ready
  • We propose to extend GRIDVIEW to satisfy our service metrics requirements: start with simple service status displays of the services required at each Tier 0 and Tier 1 site, then extend to service quality metrics, including availability and down times, and quantitative metrics that allow comparison with the LCG site MoU values

  18. Block schematic of the GRIDVIEW architecture (courtesy of BARC). Components shown: LCG sites and monitoring tools, R-GMA, the Oracle database at CERN, the Archiver Module (R-GMA Cmr), the Presentation Module, the Data Analysis & Summarisation Module, and the GUI & Visualisation Module.

  19. GRIDVIEW gridftp display: data rates from CERN to Tier 1 sites

  20. GRIDVIEW visualisation: CPU status (red = busy, green = free). Simulated data.

  21. GRIDVIEW visualisation: fault status (green = OK, red = faulty; the height of the cylinder indicates the size of the site). Simulated data.

  22. Work in Progress
  • Implement in Oracle at CERN the SAME database schema for storage of SFT results, recently agreed between CERN and IN2P3
  • Obtain raw test results as below for the required services at a site, with different service-dependent frequencies (cron job, configurable):
    • each result is defined by: site, service node, VO, test name
    • each result gives: status, summary (values), details (log)
  • Tests under development or to be integrated into SAME:
    • SRM: not clearly defined yet, but should be OK/FAILED for the most typical operations
    • LFC: OK/FAILED for several tests such as (un)registering a file and creating/removing/listing a directory; probably also measurements of the time to perform each operation
    • FTS: not clearly defined yet. The existing gridftp rates from GRIDVIEW will also be used in quantity/quality measures.
    • CE: OK/FAILED for a number of tests in SFT (job submission, lcg-utils, rgma, software version, experiment-specific tests)

  23. Other SFT tests under development or to be integrated
  • RB:
    • "active" sensor (probe): OK/FAILED for a simple test job, plus the time to get from the "Submitted" to the "Scheduled" state (performance)
    • "passive" sensor: a summary from the real-job monitoring being developed in GridView. Can produce a quality metric from the job success rate, e.g. above 95% = normal, down to 80% = degraded, down to 50% = affected, down to 20% = severely affected, below 20% = failed (see the sketch after this list).
  • Top-level BDII:
    • OK/FAILED for the BDII (try to connect and make a query)
    • response time to execute a full query
    • generate a quality metric
  • Site BDII:
    • as for the top-level BDII
    • additionally, OK/FAILED for a number of sanity checks
  • MyProxy:
    • OK/FAILED for several simple tests: store proxy, renew proxy
    • possibly response time measurements
    • generate a quality metric
  • VOMS:
    • OK/FAILED for voms-proxy-init and maybe a few other operations
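
  The job-success-rate bands proposed for the RB passive sensor translate directly into a small classification function; a minimal sketch (the bands are those listed above, the function name is just an illustrative label):

      # Illustrative mapping of RB job success rate to a quality level (bands from the slide).
      def rb_quality(success_rate_percent):
          if success_rate_percent > 95:
              return "normal"
          if success_rate_percent > 80:
              return "degraded"
          if success_rate_percent > 50:
              return "affected"
          if success_rate_percent > 20:
              return "severely affected"
          return "failed"

      print(rb_quality(97))  # normal
      print(rb_quality(60))  # affected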

  24. SAME database schema: test definition table

  25. SAME database schema
  • Test data table: the data table is where all the results produced by tests and sensors are stored. It will also contain the results of the summary tests generated by the GRIDVIEW Summarisation Module. A hypothetical illustration follows.
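
  To make the shape of such a data table concrete, here is a hypothetical sketch using SQLite purely for illustration; the real SAME schema lives in Oracle and its exact table and column names are not reproduced here. The per-result fields are the ones listed on slide 22 (site, service node, VO, test name, status, summary, details).

      # Hypothetical illustration of a SAME-like test data table; NOT the actual schema.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("""
          CREATE TABLE test_data (
              site         TEXT NOT NULL,   -- site name (example value below is invented)
              service_node TEXT NOT NULL,   -- host running the tested service
              vo           TEXT NOT NULL,   -- VO on whose behalf the test ran
              test_name    TEXT NOT NULL,   -- e.g. 'sft-job'
              status       TEXT NOT NULL,   -- OK / FAILED / NOT KNOWN
              summary      TEXT,            -- short values, e.g. measured times
              details      TEXT,            -- full log output
              tested_at    TEXT NOT NULL    -- when the result was produced
          )""")
      conn.execute(
          "INSERT INTO test_data VALUES (?, ?, ?, ?, ?, ?, ?, datetime('now'))",
          ("EXAMPLE-SITE", "ce.example.org", "dteam", "sft-job", "OK", "job completed", ""),
      )
      print(conn.execute("SELECT site, test_name, status FROM test_data").fetchall())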

  26. Responsibilities for new tests
  • Extend the SFT tests to include all baseline Tier 0 and Tier 1 service components
  • Store the results in the SAME Oracle database
  • Coordination of sensors: Piotr Nyczyk (CERN)

  27. Proposed extensions to GRIDVIEW
  • Enhance the GRIDVIEW summarisation module to aggregate individual SFT results into service results
  • Start with a simple logical AND of the success/failure of all the critical tests of a given service to define the overall success/failure of that service. Store the status 'success / fail / not known' back in the database with a timestamp ('not known' allows for no result from some tests). A sketch of this rule follows below.
  • Also store quantitative results, from SFT tests and other sources, for the MoU-defined resources and other interesting metrics:
    • CPU capacity
    • Disk storage under SEs
    • Batch job numbers in the site queue, running jobs, success rate (per VO)
    • Down time of services failing SFT tests (including scheduled downtimes)
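
  A minimal sketch of the proposed aggregation rule - a logical AND over the critical test results, with 'not known' propagated when some results are missing (the names are illustrative; the statuses are those in the bullet above):

      # Illustrative aggregation of per-test results into a per-service status.
      def service_status(critical_results):
          """critical_results maps test name -> 'OK', 'FAILED' or None (no result)."""
          if any(r == "FAILED" for r in critical_results.values()):
              return "fail"
          if any(r is None for r in critical_results.values()):
              return "not known"   # some critical tests produced no result
          return "success"

      print(service_status({"sft-job": "OK", "sft-lcg-rm": "OK"}))      # success
      print(service_status({"sft-job": "OK", "sft-lcg-rm": None}))      # not known
      print(service_status({"sft-job": "FAILED", "sft-lcg-rm": "OK"}))  # fail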

  28. Further extensions to GRIDVIEW
  • Allow for dependencies:
    • Network failures to a site or within a site
    • Component masking: a failed component causes others to appear to fail their tests
    • Failure of part of the SFT service itself (based on 3 servers at CERN)
  • Generate quality rating metrics:
    • Use the percentage of the nominal MoU value for CPU capacity available, disk storage under SEs, average downtime per service per degradation level, and averaged annual availability per service in a date range to calculate quality levels
    • Have 4 to 5 quality levels for a simple dashboard visualisation:
      • Above 80% of the MoU target = acceptable, display green
      • 50% to 80% of the MoU target = degraded, display yellow
      • 20% to 50% of the MoU target = affected, display orange
      • 0% to 20% of the MoU target = failing, display red
  • Aggregate to a global site quality: take the lowest quality level, or perhaps an importance-weighted average. Use the last few months' average availability rather than the whole year. A sketch of this mapping follows below.
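
  A sketch of the proposed quality mapping and site-level roll-up (thresholds, labels and colours from the slide; the 'take the lowest level' rule is used here rather than the weighted-average alternative; all names are illustrative):

      # Illustrative quality rating against MoU targets (bands and colours from the slide).
      LEVELS = [(0.80, "acceptable", "green"),
                (0.50, "degraded",   "yellow"),
                (0.20, "affected",   "orange"),
                (0.00, "failing",    "red")]
      ORDER = ["acceptable", "degraded", "affected", "failing"]  # best -> worst

      def quality_level(measured, mou_target):
          fraction = measured / mou_target if mou_target else 0.0
          for threshold, label, colour in LEVELS:
              if fraction >= threshold:
                  return label, colour
          return "failing", "red"

      def site_quality(per_service):
          """per_service: list of (measured, MoU target) pairs; return the worst level."""
          labels = [quality_level(m, t)[0] for m, t in per_service]
          return max(labels, key=ORDER.index)  # worst = furthest down the ORDER list

      print(quality_level(900, 1000))                  # ('acceptable', 'green')
      print(site_quality([(900, 1000), (300, 1000)]))  # 'affected' - worst service is at 30%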

  29. Extend GRIDVIEW visualisation
  • Add visualisation of a per-site service status dashboard:
    • A screen for each site
    • A simple red/(orange/yellow/)green button per service (LFC per VO)
    • Click on a service button to display the service status history
    • Links to full SFT reports
    • Links to other GRIDVIEW displays (including VO breakdowns): gridftp data rates from CERN, advanced job monitoring
  • Display of quality metrics, instantaneous and with drill-down to history with time period selection:
    • Percentages of MoU targets
    • Mean time to repair for each service
    • Average availability per service
  • A screen showing the global site quality for all Tier 0 and Tier 1 sites, instantaneous and with drill-down to history with time period selection

  30. Prototype CPU availability display
  • These metrics show the availability of resources (currently only CPUs) and of the sites which provide them. Availability means that a site providing resources passes all critical tests. The current list of critical tests includes all critical tests from SFT plus several essential tests from GStat. The metrics are presented as cumulative bar charts for averaged metrics and as cumulative line plots for the history.
  • Plots show in red the number of CPUs or sites that were "unavailable", integrated over a given interval. For CPUs, the total number of CPUs published as being provided by a site is used if the site was failing any of the critical tests, and 0 if no critical tests failed.
  • Plots show in green the number of CPUs or sites that were "available", integrated over a given interval. For CPUs, the total number of CPUs published as being provided by a site is used if the site was passing all critical tests, and 0 if any critical test failed.
  • The metrics are presented for the whole grid and for individual regions. The daily value of the number of sites available is integrated over the day, so a site with some downtime contributes the corresponding fractional availability.
  • To convert to KSi2K we can use the worker node rating published by each CE (the grid middleware assumes homogeneity under a CE, so we need an average). Currently these values do not seem to be universally reliable. They could be compared with the post-job accounting data collected at RAL (assuming those are better calculated). A sketch of this integration follows below.
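
  A sketch of the integration described above - per-interval availability weighted by the published CPU counts, with a conversion to KSi2K via an assumed average per-CPU rating. All sample numbers and names below are invented for illustration.

      # Illustrative daily availability integration for one site (invented sample data).
      def integrate_cpu_availability(samples):
          """samples: list of (hours, cpus_published, all_critical_tests_passed)."""
          available   = sum(h * cpus for h, cpus, ok in samples if ok)
          unavailable = sum(h * cpus for h, cpus, ok in samples if not ok)
          total = available + unavailable
          return available / total if total else 0.0  # fractional availability

      def to_ksi2k(cpus, avg_cpu_rating_ksi2k):
          """Convert a CPU count to capacity using an average per-CPU KSi2K rating."""
          return cpus * avg_cpu_rating_ksi2k

      day = [(8, 400, True), (4, 400, False), (12, 400, True)]  # hours, CPUs, tests OK?
      print(f"availability: {integrate_cpu_availability(day):.2f}")  # 0.83
      print(f"capacity: {to_ksi2k(400, 1.5):.0f} KSi2K")             # assuming 1.5 KSi2K/CPU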

  31. Work in progress: prototype site CPU availability metric

  32. CERN FIO Group Service View Project
  • A project to provide a web-based tool that dynamically shows availability, basic information and/or statistics about IT services, as well as the dependencies between them
  • Targeted at local CERN services: CASTOR, Remedy (trouble tickets), AFS, backup and restore, production clusters
  • Service aggregation to understand dependencies, e.g. CVS depends on AFS, or a failed component makes other tests appear to fail
  • Will have quality thresholds; at least four levels are envisaged (fully available, degraded, affected, unavailable), e.g. 60-80% of batch nodes available is interpreted as a degraded batch service
  • Will include history displays and will calculate service availabilities and down times
  • Simple displays of service level, e.g. red, orange, yellow, green
  • GRIDVIEW summarisation could become an information provider
  • The projects are following each other's work, looking for possible synergies

  33. CERN FIO Group service level views planning

  34. Example visualisation from CERN AIS – multiple tests per service

  35. Example visualisation from CERN Windows services

  36. Spider plot visualisation. Normalised, quasi-independent metrics: all values reaching their targets would form a circle. Colour the plot by the quality level of the worst service.
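
  A minimal matplotlib sketch of such a spider (radar) plot, normalising each metric to its target so that on-target values trace the unit circle; the metric names and numbers are invented purely for illustration, and in practice the fill colour would be chosen from the worst service's quality level.

      # Illustrative spider plot: metrics normalised to their targets (invented numbers).
      import numpy as np
      import matplotlib.pyplot as plt

      metrics = {"CPU vs MoU": 0.9, "Disk vs MoU": 0.7, "Availability": 0.95,
                 "Transfer rate": 0.6, "Job success": 0.85}  # fraction of target reached

      labels = list(metrics)
      values = list(metrics.values())
      angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)

      # Close the polygon by repeating the first point.
      values = values + values[:1]
      angles = np.concatenate([angles, angles[:1]])

      ax = plt.subplot(polar=True)
      ax.plot(angles, values, color="tab:orange")           # colour = worst quality level
      ax.fill(angles, values, alpha=0.25, color="tab:orange")
      ax.plot(angles, [1.0] * len(angles), linestyle="--")  # the on-target contour
      ax.set_xticks(angles[:-1])
      ax.set_xticklabels(labels)
      ax.set_ylim(0, 1.1)
      plt.show()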

  37. Spider plot including some history

  38. OSG visualisation: global health spider plot. From http://osg.ccr.buffalo.edu/grid-dashboard.php on 3 Feb 2006.

  39. OSG visualisation: site resources

  40. Timescale
  • Updated framework and schema (SAME):
    • Available now for testing the integration of data sources
    • Integration of new SFT tests after CHEP06, so from the end of February as they become available
  • Displays:
    • GridView: we would like to produce a first version of the status dashboards in parallel with SFT test integration. Keep this lightweight in view of the FIO work. Planning a top-level Tier 0 screen and one for all Tier 1 sites, with drill-down to individual Tier 1 sites and their individual SFT test histories.
    • FIO tool: expected at the end of March; investigate its use as a back end
  • The goal is to have a first version of the dashboard displays available in production by the end of March

  41. Some discussion points
  • Can we improve the consistency of the various grid monitoring interfaces?
    • GRIDVIEW access point: http://gridview.cern.ch/GRIDVIEW/
    • Grid Operations Centre access point: http://goc.grid-support.ac.uk/gridsite/monitoring/ which includes links to:
      • GStat via http://goc.grid.sinica.edu.tw/gstat/
      • SFT reports via https://lcg-sft.cern.ch:9443/sft/lastreport.cgi
      • GridICE at http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
  • Who should use which monitoring displays, and for what?
  • Should we try to join the CERN FIO service view framework?
  • Should scheduled downtime be counted as non-available time?
  • How many levels of quality of service do we need?
  • What threshold values should define the quality of a service, e.g. for:
    • the time for a BDII to respond to an LDAP query
    • the time for an RB to matchmake (find a suitable CE)
    • the time for a simple batch job to complete
    • the time to register a new LFC entry
  • What fraction of the MoU targets is 'acceptable'?
  • How to properly handle a null result for a test – reuse the last value?
  • How to automate the calculation of the average KSi2K rating under a CE?
