Grid Operations SA1 Status Report

Maite Barroso, SA1 activity leader, CERN
EGEE-III First Review, 24-25 June 2009

SA1 Activity Overview
28 countries, 175 FTE
SA1 – Maite Barroso - EGEE-III First Review 24-25 June 2009

Grid Operations
Reliable, multi-VO, large-scale production infrastructure
Uninterrupted service
Operational processes, tools and documentation
Worldwide collaboration between ROCs and sites

Size of the infrastructure
[Figure: Number of EGEE-III certified sites]
[Figure: Number of EGEE-III certified sites per region]
Computing resources: 155 MSI2k at the end of January 2009, already more than the 124 MSI2k planned for the end of the project
Storage resources:
• Currently deployed information providers have known issues and report unreliable data
• Ongoing initiative, started by WLCG, to review and fix them
• Foreseen for Y2

Usage of the infrastructure (I)
[Figure: Monthly production normalized CPU time by VO]
[Figure: Monthly production normalized CPU time by ROC]
[Figure: Number of EGEE-III certified sites per region]
Steady increase in the usage of the grid resources by most VOs
Some of the larger VOs show considerable fluctuations, due to specific challenges
Substantial increase for some VOs: ATLAS, LHCb and CMS
Remarkable increase in the usage of the grid resources

Usage of the infrastructure (II)
Number of jobs
Steadily increasing until October 2008, stable since then
10 million jobs per month
370,000 jobs/day (188,000 last year, so roughly doubled since then)
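As a quick consistency check on the job figures, the daily rate can be converted into an approximate monthly total and compared with last year's rate. This is only a sketch: the 30-day month is an assumption made here for the conversion, not a figure from the report.

```python
# Figures quoted on the slide.
JOBS_PER_DAY_NOW = 370_000
JOBS_PER_DAY_LAST_YEAR = 188_000

# Assumption: a 30-day month, used only for this rough conversion.
DAYS_PER_MONTH = 30

# Approximate monthly total implied by the daily rate.
jobs_per_month = JOBS_PER_DAY_NOW * DAYS_PER_MONTH

# Year-on-year growth of the daily job rate.
growth_factor = JOBS_PER_DAY_NOW / JOBS_PER_DAY_LAST_YEAR

print(f"Approximate jobs per month: {jobs_per_month:,}")
print(f"Year-on-year growth factor: {growth_factor:.2f}")
```

Under that assumption the daily rate is of the same order as the quoted 10 million jobs per month, and the growth factor is close to 2.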

Usage of the infrastructure (III)
Data transfers
The bulk of the data transferred can be credited to the four LHC VOs
Peaks of data transfer activity in spring and summer 2008: WLCG service challenges and stress tests in preparation for the start of the operational phase of the LHC
Slow increase over the last months
Sustained data rates of more than 0.9 GB/s, with peaks up to 1 GB/s

Seed resources
Pool of compute and storage resources (with dedicated funding) made available to new VOs to ease the process of becoming a user of the EGEE e-Infrastructure
Resources (257 cores and 27 TB of disk space) allocated to 4 sites, with well-defined usage policies, up and running since January 2009

SLA Roll-out
SLAs facilitate the establishment of a partnership between infrastructure management structures and resource centres (sites) to provide a defined quality of service to the users of resources.
Slow but steady progress in all regions
127 sites out of 264 (48%) have signed the SLA:
• Some ROCs sign with the national grid organizations (UKI, Italy)
• Others consider the signature of the WLCG MoU equivalent (France)
Complete set of metrics defined
• Site availability/reliability is gathered automatically every month
• All others are gathered quarterly from different sources, some of them not automated
• Ongoing work at CESGA to provide an operations metrics portal collecting all metric results

Site Availability / Reliability
Availability and reliability targets are defined in the EGEE ROC-Site SLA (70% availability, 75% reliability)
Results published monthly as the EGEE League Table
• https://edms.cern.ch/document/963325/
Systematic review of results by ROCs and SA1 management
Since May 2008, steady, albeit irregular, improvement of overall site availability
Discovering limitations of weighting by CPU count due to server consolidation
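The slide gives the SLA targets but not the formulas behind them. The sketch below assumes commonly used definitions (availability as uptime over the period with known status, reliability additionally excluding scheduled downtime) and a CPU-count-weighted regional average. The formulas, site names and numbers are illustrative assumptions, not taken from the report.

```python
from dataclasses import dataclass

AVAILABILITY_TARGET = 0.70  # target from the EGEE ROC-Site SLA
RELIABILITY_TARGET = 0.75   # target from the EGEE ROC-Site SLA


@dataclass
class SiteMonth:
    """One month of monitoring results for a site (hypothetical data model)."""
    name: str
    hours_total: float
    hours_up: float
    hours_scheduled_down: float
    hours_unknown: float
    cpu_count: int

    def availability(self) -> float:
        # Assumed definition: uptime over the time whose status is known.
        return self.hours_up / (self.hours_total - self.hours_unknown)

    def reliability(self) -> float:
        # Assumed definition: scheduled downtime is not held against the site.
        known = self.hours_total - self.hours_unknown
        return self.hours_up / (known - self.hours_scheduled_down)


def cpu_weighted(sites, metric) -> float:
    """Average a per-site metric, weighting each site by its CPU count."""
    total_cpus = sum(s.cpu_count for s in sites)
    return sum(metric(s) * s.cpu_count for s in sites) / total_cpus


# Hypothetical sites, for illustration only.
site_a = SiteMonth("SITE-A", 720, 600, 48, 0, cpu_count=1000)
site_b = SiteMonth("SITE-B", 720, 432, 0, 0, cpu_count=200)

for site in (site_a, site_b):
    meets_sla = (site.availability() >= AVAILABILITY_TARGET
                 and site.reliability() >= RELIABILITY_TARGET)
    print(site.name, round(site.availability(), 3),
          round(site.reliability(), 3), meets_sla)

regional = cpu_weighted([site_a, site_b], SiteMonth.availability)
print("CPU-weighted availability:", round(regional, 3))
```

One possible reading of the weighting limitation: a site that consolidates onto fewer, more powerful servers keeps its capacity but loses weight in a CPU-count average, so the aggregate no longer reflects the resources actually delivered.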

Site Availability Improvements
[Figures: panels labelled May 2008, May 2008 and April 2009]
The figures show that regular monitoring of the SAM test results, and the associated follow-up activity, helped improve both overall and regional availability and reliability.

Site Availability evolution

Release and deployment management
Releases of new middleware must not disrupt the operational state of the production infrastructure:
• incremental updates of the middleware have proved to be effective
• there were nevertheless a few incidents affecting the production system during the deployment of some updates: post-mortems were carried out with SA3 for these incidents
• standard mechanism to roll back a middleware upgrade
• staged roll-out at selected sites, to detect critical incidents as early as possible
This goes in the direction of the future model that SA1 is putting in place, which includes staged roll-outs, fine-grained versioning of the grid services, and a reliable production repository
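The combination of staged roll-out and roll-back described above can be sketched as a small control loop: upgrade a few selected sites first, check them, and revert if any shows a problem before the rest of the infrastructure is touched. Everything here (function names, site names, the in-memory version table) is hypothetical and only illustrates the idea; it is not the SA1 procedure or tooling.

```python
from typing import Callable, Iterable


def staged_rollout(
    early_sites: Iterable[str],
    remaining_sites: Iterable[str],
    upgrade: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Upgrade early sites first; roll them back and stop on any failure."""
    upgraded = []
    for site in early_sites:
        upgrade(site)
        upgraded.append(site)
        if not healthy(site):
            # Critical incident detected early: revert every upgraded site.
            for done in reversed(upgraded):
                rollback(done)
            return False
    # Early sites look fine, so continue with the rest of the infrastructure.
    for site in remaining_sites:
        upgrade(site)
    return True


# Hypothetical in-memory stand-in for the middleware version at each site.
versions = {"site-a": "1.0", "site-b": "1.0", "site-c": "1.0"}


def upgrade(site: str) -> None:
    versions[site] = "1.1"


def rollback(site: str) -> None:
    versions[site] = "1.0"


# Simulate an update that breaks the early-adopter site.
failed_run_ok = staged_rollout(
    ["site-a"], ["site-b", "site-c"], upgrade, rollback,
    healthy=lambda site: site != "site-a",
)
versions_after_failure = dict(versions)

# Simulate a clean update.
clean_run_ok = staged_rollout(
    ["site-a"], ["site-b", "site-c"], upgrade, rollback,
    healthy=lambda site: True,
)
print(failed_run_ok, versions_after_failure)
print(clean_run_ok, versions)
```

In the failing run only the early site is ever upgraded, and it ends up back on the previous version, which is the property the staged approach is meant to provide.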

Pre-Production Service
Pilot services:
• New service: on-demand previews of new middleware functionalities for interested users
• 5 pilot services (WMS 3.1, Site Central Authorization Service (SCAS), CREAM CE, VOMS and SLC5 Worker Nodes)
• Very successful, valuable to the user and operations communities
• A community effort based on common interests can work, with a thin layer for planning, coordination and tracking
Deployment testbed:
• Due to improvements in certification, the focus is changing
• Many regions undertake their own rollout tests before wide-scale release
• Will evolve into a ‘staged rollout’ composed of representative sites from the regions that undertake the deployment of new certified software releases in a timely manner

Operational security
Day-to-day operations focused on reported security incidents and vulnerabilities
• None involved the middleware as an infection vector
• No significant impact on the infrastructure
Security "drills": Tier1s campaign in early 2009, with clear overall improvement from the sites
Cooperation with the OAT for most of the security monitoring
Collaboration with the NRENs identified as a priority by the ROCs
• Appropriate contact points identified and appointed on both sides
• Local and global cooperation being improved
Security training and dissemination
• Full-scale security training event organised at EGEE 08
Additional gLite-specific security recommendations published

Operational security
Software vulnerabilities
• 28 new security vulnerabilities handled by the team
• Comprehensive vulnerability handling process published
Joint Security Policy Group
• New mandate adopted
• Clarified the stakeholders of the group
• Confirmed the aim of preparing general policies for use on many Grids
• Four policy documents were approved:
  • Approval of Certification Authorities
  • Grid Security Traceability and Logging Policy
  • VO Operations Policy
  • Policy on Grid Multi-User Pilot Jobs
International Grid Trust Federation
• Significant progress was made on policies for operation of authorization services

Global Grid User Support
Regional support with central coordination
GGUS is the central integration platform, connected to other support structures (regional helpdesks, VO support infrastructures, etc.)
Users can choose to submit a support request to the central GGUS, to their Regional Operations Centre (ROC), or to their Virtual Organisation (VO) support service
Support procedures are continuously updated and improved. Best practices are shared between supporters and documented in a knowledge base for all grid-related problems and their solutions.

Global Grid User Support
Number of trouble tickets has been almost constant over time
Not particularly affected by the increasing size of the EGEE e-Infrastructure and the number of users
Most tickets belong to ENOC and CIC Support Units

Grid Operator on duty
Role of oversight and first-level support for the grid production infrastructure
• Critical activity in maintaining usability and stability of sites
• First-line support model based on a central group of operators on duty (COD) opening tickets to sites in case of grid monitoring alarms
Work in EGEE-III to define a new model, based on devolution to the regions
• First-line support done by each region, plus a common layer for procedures, tools and escalation
• New procedures and organizational scheme have been identified according to the requirements from existing COD teams, ROCs and sites, together with a migration work plan
• Four pilot federations have been identified: Central Europe, Northern Europe, Asia-Pacific and South West Europe
Expected advantages:
• Improvement in terms of number of tickets handled and response time
• Preparation for a sustainable infrastructure based on the distribution of responsibilities to federations
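The devolved model above can be pictured as a routing rule for monitoring alarms: regions that run their own first-line support handle their alarms, while a common layer defines how tickets are opened. The sketch below assumes that, during the pilot phase, alarms from non-pilot regions still go to the central COD; that transitional rule, the alarm fields and the site names are assumptions made for illustration only.

```python
# Pilot federations named on the slide.
PILOT_FEDERATIONS = {
    "Central Europe",
    "Northern Europe",
    "Asia-Pacific",
    "South West Europe",
}


def first_line_team(region: str) -> str:
    """Return the team that handles first-line support for a region."""
    if region in PILOT_FEDERATIONS:
        return f"{region} regional operators"
    # Assumption: regions outside the pilot keep the central COD model.
    return "central COD"


def open_tickets(alarms):
    """Turn monitoring alarms into tickets assigned to the right team."""
    return [
        {
            "site": alarm["site"],
            "test": alarm["test"],
            "assigned_to": first_line_team(alarm["region"]),
        }
        for alarm in alarms
    ]


# Hypothetical alarms, for illustration only.
alarms = [
    {"site": "SITE-X", "region": "Northern Europe", "test": "job-submit"},
    {"site": "SITE-Y", "region": "Italy", "test": "replica-management"},
]

tickets = open_tickets(alarms)
for ticket in tickets:
    print(ticket)
```

Keeping the routing rule in one shared function mirrors the idea of a common layer for procedures and escalation sitting above the regional teams.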

Grid Operations Automation
• Aims
  • Improve reliability and availability of sites via improved operational tools
  • Increase automation of the operations infrastructure
  • Prepare operational tools for use in an EGI/NGI structure
• Operations Automation Team (OAT) with representatives from ROCs, sites, all operations tools, and related infrastructure projects
• Strategy document at PM1 outlining the technical architecture to achieve these aims
• New regional operation monitoring and ticketing flows defined by the COD team and implemented by OAT tools
  • Nagios, Regional Dashboard, GGUS

Operations Automation Team
Focus:
• Site monitoring via Nagios, a commodity open-source monitoring framework
• Integration of operational tools via ActiveMQ, an open-source enterprise messaging system
Achievements:
• Providing sites with a ready-to-deploy Nagios monitoring solution, which configures itself automatically and includes a reference set of grid probes
• Nagios couples grid service monitoring with local fabric monitoring
• 120 sites monitored at site level
• 174 sites monitored at ROC level
Next steps:
• Phased release of updated operational tools to address the needs of a regional deployment

Regionalized operations tools
Architecture and design phase now finished
All tools have provided plans with functionality and milestones for delivery
A set of milestone deliverables, each of which gives complete functionality
• 3-month intervals, starting April 2009
If timescales slip, we can stop at any of the milestones and still have a functional solution
• Sacrificing functionality or distribution

Plans for Y2
Main goal is to transition to the operating model and infrastructure proposed by the EGI Blueprint, for all SA1 tasks, with no disturbance to the reliable EGEE production infrastructure
• Define which other tasks/roles will be regionalized, and make a plan to achieve it
• Finalize the regionalization for the tasks already identified (COD, user support)
• Finalize the operation tool developments necessary to enable regionalization, and deploy them transparently in production
• Revise the software release and deployment procedure so that it uses a ‘staged rollout’ instead of the Deployment Testbed in the current PPS

Summary
The EGEE infrastructure has continued to increase in size, scale, usage and reliability
Distribution and automation are the driving forces
Distribution: we are gradually evolving the operations model to move responsibility to the regions; this has an impact on effort, tools and procedures
• Intense program of work for Y2
• Preserving the collaboration is essential for this and for the future EGI/NGI model
Automation: devolving a complete solution for grid monitoring to sites/ROCs, together with a complete operations toolkit integrated through well-defined interfaces and messaging
