Final report of EGEE-II project review with detailed statistics on manpower, budget, services, test-beds, and workload. Highlights achievements, challenges faced, and successful interoperations with OSG.
SA1 Status Report: EGEE Grid Operations & Management
Maite Barroso, SA1 Activity Leader, IT Department, CERN
Final EU Review of EGEE-II, CERN, 8-9 July 2008
SA1 in Numbers
• Manpower: 61 partners, 29 countries, 228 FTE
• EGEE-II Budget
SA1 – Maite Barroso - EGEE-II Final EU Review – 8-9 July 2008
The EGEE Infrastructure
Test-beds & Services
• Production Service
• Pre-production service
• Certification test-beds (SA3)
Support Structures & Processes
• Operations Coordination Centre
• Regional Operations Centres
• Global Grid User Support
• EGEE Network Operations Centre (SA2)
• Operational Security Coordination Team
• Training infrastructure (NA4)
• Training activities (NA3)
Security & Policy Groups
• Joint Security Policy Group
• EUGridPMA (& IGTF)
• Grid Security Vulnerability Group
• Operations Advisory Group (+NA4)
Cores, Sites, ROCs
• 73,709 cores
• 255 sites (145 partner sites)
• 48 countries (33 partner countries)
Workload
No. jobs / month
• 188,000 jobs/day (98,000 jobs/day 1y ago)
• 54 million jobs in the 2nd year
• 150K per day sustained average
No. jobs / month, exc. HEP, Infra
• 17,000 jobs/day (13,000 jobs/day 1y ago)
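The workload figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the 365-day year is an assumption, since the slide does not state the exact length of the reporting period.

```python
# Figures quoted on the Workload slide.
jobs_per_day_now = 188_000
jobs_per_day_year_ago = 98_000
jobs_in_second_year = 54_000_000

# Assumption: the second project year is taken as 365 days.
DAYS_PER_YEAR = 365

# Year-on-year growth of the daily job rate.
growth_factor = jobs_per_day_now / jobs_per_day_year_ago

# Average daily rate implied by the yearly total.
average_per_day = jobs_in_second_year / DAYS_PER_YEAR

print(f"growth factor: {growth_factor:.2f}")
print(f"yearly average: {average_per_day:,.0f} jobs/day")
```

Under that assumption the yearly total works out to roughly 148K jobs/day, which is consistent with the "150K per day sustained average" quoted on the slide.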
CPU time delivered (CPU months)
• 33,700 CPU-month (14,000 CPU-month)
exc. HEP, Infra
• Peak of 5700 CPU-month (3600 CPU-month)
CCRC experience
WLCG Common Computing Readiness Challenges
Full-scale dress rehearsal for the accelerator run
• All experiments together
• Very demanding requirements, more than needed for the accelerator run in 2008
• Data transfers in excess of needed levels
• Workloads at the scale needed for data taking
• E.g. a single experiment, CMS, routinely submitted 100,000 jobs a day, with a peak of 200,000 in a day without problems, to the EGEE and OSG production grids
CCRC: Data transfer results
• All experiments exceeded required rates for extended periods, and simultaneously
• 1.3 GB/s target
• Well above 2 GB/s achievable
• All Tier 1s achieved (or exceeded) their target acceptance rates
CCRC experience
WLCG Common Computing Readiness Challenges
All this using the EGEE production infrastructure and operations
Reliable production service provided to WLCG
Sustainable service model: people were not in panic mode
Making use of interoperations with other grid infrastructures
• Site availability/reliability metrics, accounting, support, operations meetings
All this with no additional effort
No impact on daily operations
Interoperations
• Interoperation with OSG is day-to-day business
  • Permanent EGEE/OSG Interoperability Platform operated by SEE region
  • User support processes interconnected (GGUS GOC)
  • Accounting:
    • Data published from Gratia to EGEE APEL repository
    • Visualization through EGEE Accounting portal
  • Agreed site availability/reliability metrics, stored in EGEE repository and visualized with EGEE tools
• NDGF has interoperated with EGEE since Y2
  • Tests to probe the NDGF resources (arc-CEs) integrated in Service Availability Monitoring
  • All other operations components are there: accounting, resource registration (GOCDB)
  • Operations team from NDGF involved in the EGEE grid Operator on Duty rota
• Interoperation with NAREGI in progress
  • Discussions started
Accounting
[Diagram labels: OSG sites, Gratia, OSG accounting database; NDGF sites, SGAS, NDGF accounting database; INFN-Grid sites, DGAS, INFN-Grid accounting database; EGEE sites; central accounting database; summary database; EGEE accounting portal]
User support
GGUS: process and tool well established, accepted and used by the community
• “A problem is not a problem if a GGUS ticket was not open”
• Problem reporting, logging and traceability
VOs directly involved in shaping GGUS
Recent new features
• User is now involved in the final closure of a ticket
• New status to simplify the work of ROCs
• Extensive tests before the release
• New GGUS ticket submission form with help for describing the problem and other details
• Escalation reports
Grid Operations
Grid Operator on Duty
• Critical activity in maintaining usability and stability of sites
• NDGF operations team joined
• Portal for operations: https://cic.gridops.org
• Regional dashboard concept: first level support for the sites in the region
Continuous work on operations procedures
• Contribute to the establishment of regional grid infrastructures through related projects, now well beyond Europe
Solid set of operational tools provided for central operations teams
• Well suited to the present operational model, widely used
• Many are shared with other infrastructure projects
Job success rate
Present job success rate is between 80% and 95%
Main job failure reason is site misconfiguration
Two aspects to improve this:
In operations:
• Provide sites with tools to monitor and detect problems as soon as possible: grid monitoring and alarms at the sites
• Operations support and training for site managers, so they learn to solve the most common problems, with quick involvement from experts to solve new ones
• Measure and publish site reliability
In applications:
• Application-specific monitoring of the sites: application-specific tests, application dashboards
• Select “good sites” (the ones that successfully pass the application tests) from the application point of view; experience shows that this gets reliability close to 100%
• This is done automatically in big VOs (e.g. LHC) and manually for small ones
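The "good sites" selection described on this slide can be sketched as a simple filter over application test results. This is only an illustration under stated assumptions: the site names, the test outcomes and the function `select_good_sites` are hypothetical, and the rule "every application test must pass" is one possible reading of "successfully pass the application tests", not the procedure any particular VO uses.

```python
from typing import Dict, List


def select_good_sites(test_results: Dict[str, List[bool]]) -> List[str]:
    """Return the sites whose application tests all passed, sorted by name."""
    return sorted(
        site
        for site, outcomes in test_results.items()
        # A site with no recorded test outcomes is not treated as good.
        if outcomes and all(outcomes)
    )


# Hypothetical site names and test outcomes, for illustration only.
example_results = {
    "site-a": [True, True, True],
    "site-b": [True, False, True],
    "site-c": [],
    "site-d": [True, True],
}

good_sites = select_good_sites(example_results)
print(good_sites)
```

A VO would then submit jobs only to the returned sites, which is how the selection can raise the observed success rate even when overall site reliability is lower.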
Site reliability: early experience
CERN+Tier 1s
Formal reporting of Tier 2s since October 2007
• Number of sites reporting has increased from 89 to 116 in May 08
• Overall average: 75-80%, but top 50% (20%) of sites: 95% (98%)
• More than 70% of resources are at sites with >90% reliability
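The slide quotes two different kinds of figures: an average over sites and a share of resources. The sketch below shows how the two can differ for the same data. The site list is entirely hypothetical, and counting resources in cores is an assumption; the slide does not say how resources were measured.

```python
from typing import List, Tuple

# Each entry is (cores, reliability). Hypothetical values, for illustration only.
Site = Tuple[int, float]


def mean_site_reliability(sites: List[Site]) -> float:
    """Unweighted average reliability over sites."""
    return sum(reliability for _, reliability in sites) / len(sites)


def resource_share_above(sites: List[Site], threshold: float = 0.90) -> float:
    """Fraction of all cores located at sites with reliability above the threshold."""
    total_cores = sum(cores for cores, _ in sites)
    cores_above = sum(cores for cores, reliability in sites if reliability > threshold)
    return cores_above / total_cores


example_sites = [
    (4000, 0.97),
    (2500, 0.93),
    (1500, 0.85),
    (1000, 0.60),
    (1000, 0.92),
]

site_average = mean_site_reliability(example_sites)
share_above_90 = resource_share_above(example_sites)
print(f"average over sites: {site_average:.3f}")
print(f"share of cores at sites above 90%: {share_above_90:.2f}")
```

In this made-up example the per-site average is pulled down by small, unreliable sites, while most of the cores sit at reliable ones. That is the pattern the slide describes.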
Grid Operations Monitoring
Service Availability Monitoring (SAM):
• Provides monitoring of grid services from a user perspective
• Main source of monitoring information for site availability calculations
• All information stored and displayed centrally
Changes to move grid monitoring information to the sites
• As part of standard site monitoring, so it can raise alarms, etc.
• First phase: feed grid monitoring results to sites
• Later, a standard set of sensors to be run at the sites, which will push the information to a central repository
• Site status monitoring: after a survey, the most widely used are Nagios (open source) and Lemon
• Prototype based on the Nagios fabric monitoring system developed within the CE ROC
• Enables sites to receive instant notification in case of failures
• Provides them with results from global monitoring systems such as SAM and Network Monitoring
Site, regional, central monitoring
Network monitoring
[Figure labels: central probes (SAM), local probes]
VO Monitoring
• SAM widely used by LHC VOs, plugging in their own VO-specific SAM tests, to determine which sites are suitable
• Experiment dashboards extensively used by the LHC community
• VLMED VO (biomed) using the dashboard for a year now, others interested
• Dashboard framework also used in other areas:
  • Experiment specific: e.g. ATLAS production, CMS site availability
  • Interest in operational dashboards
  • SAM visualization
Evolution similar to grid operations monitoring:
• Feed VO monitoring results to the sites
• Common mechanism
GridMap
GridMap: high-level visualization of the grid availability
• Collaboration with industry: unfunded collaboration with EDS via CERN’s openlab project
• http://gridmap.cern.ch/gm
Display monitoring data in a way that operators can absorb, using advanced visualization techniques
• Visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
GridMap is a visualization tool for looking at Service Availability and Reliability
• Condenses all EGEE sites into a single view
• More important problems are visually more distinctive
Used in production by grid and operators
• Looking at other uses of the technique and technology
• E.g. showing number of jobs, data transfer rates between sites from a VO perspective
GridMap
Service Level Agreements
ROC-Site Service Level Description modeled on the service management recommendations of ITIL
• ~10 draft iterations, constructive input from both parties (ROCs and sites), latest version: April ’08
• Areas covered:
  • Hardware and connectivity criteria
  • Description of services covered
  • Service hours
  • Availability
  • Support
  • Service reporting and reviewing
“SLAs relate to the measurement, reporting and reviewing of service quality as delivered by IT to the business”
• Two ROCs have already signed SLDs with sites (South West Europe: 8, South East Europe: 2), others ongoing
• EGEE site availability metrics published since the start of 2008
Example Report
• Availability of a site over a given period is the fraction of time the site was UP
• Reliability of a site over a given period is the fraction of time the site was UP (Availability), divided by the scheduled availability
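The two definitions above can be written out as a short calculation. This is a minimal sketch, not the EGEE reporting code: the function names are hypothetical, and "scheduled availability" is assumed here to mean the fraction of the period not covered by scheduled downtime.

```python
def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the period during which the site was UP."""
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float, scheduled_down_hours: float) -> float:
    """Availability divided by the scheduled availability.

    Assumption: scheduled availability is the fraction of the period
    that was not scheduled downtime.
    """
    scheduled_availability = (total_hours - scheduled_down_hours) / total_hours
    return availability(up_hours, total_hours) / scheduled_availability


# Hypothetical 30-day month: 720 hours in total, 648 hours UP,
# 36 hours of scheduled downtime.
site_availability = availability(648, 720)
site_reliability = reliability(648, 720, 36)
print(f"availability: {site_availability:.3f}")
print(f"reliability:  {site_reliability:.3f}")
```

With these definitions a site is not penalised in its reliability figure for downtime it announced in advance, so reliability is never lower than availability.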
Operational Security
Operational Security Coordination Team (OSCT)
Incident response
• EGEE Incident Response procedure for the sites
• Security Service Challenge 3 “fire drills” (Tier1s)
  • Procedures generally understood
  • Difficult to apply restrictions (being followed up by the MWSG)
  • Lack of logging and traceability (being followed up by the MWSG)
  • A number of communication problems and site misconfigurations uncovered
Monitoring
• Central security tests (SAM) detected a number of insecure configurations
• Promote security tools usage to the sites (Nagios)
Training and dissemination
• Produced training material and recommendations (e.g. ISSEG project)
• Organised a security training event at EGEE 07 (successful)
Security Policies and CA
Grid Security Vulnerability Group (GSVG)
• Handling security vulnerabilities in gLite
• Assess the risks of discovered vulnerabilities and provide security advisories
• Published 25 advisories in the past year
JSPG
• New and reworked policies in the last year
• Aiming at making policies more generic and simple for wider adoption by other infrastructures
EUGridPMA and IGTF
• The European Policy Management Authority for Grid Authentication in e-Science
• Establish requirements and best practices for grid identity providers
• Enable a common trust domain applicable to authentication of end-entities
• Mature and successful collaboration, distributed activity
Sustainability
EGEE SA1 results:
• Reliable, multi-VO, large scale production infrastructure
• Uninterrupted service
• Operational processes, tools and documentation
• Worldwide collaboration between ROCs and sites
Built together with other national and international grid infrastructures
• Cooperation ensures geographical growth
WLCG relies heavily on the present EGEE operations service and is dependent on its future continuation.
• This is an assurance for the durability of the EGEE operations results.
To become more sustainable, in EGEE III we want to distribute the responsibility for daily operations and increase automation to reduce manpower
We are laying the groundwork for the migration to an NGI based model
Summary
Infrastructure has continued to increase in size, scale, usage and reliability
EGEE operations is able to cope with the increase without major changes in structure, processes or tools
• We have the right model
Interoperation is a fact, used in production
Distribution and automation are the keys to reducing the effort in the coming years
• Laying the groundwork for the migration to an NGI based model
Key documents
• DSA1.4: Assessment of production service status https://edms.cern.ch/document/726140
• DSA1.5: Grid Operations Cookbook https://edms.cern.ch/document/726257
• DSA1.6: Report on ROC progress and issues https://edms.cern.ch/document/726261
• DSA1.7: Assessment of production Grid infrastructure service status https://edms.cern.ch/document/726263
• Operations manual https://edms.cern.ch/document/840932
• EGEE ROC-Site SLD https://edms.cern.ch/document/860386
• EGEE Incident Response Procedure https://edms.cern.ch/document/867454
• Virtual Organisation Operations Policy https://edms.cern.ch/document/853968
• Grid Security Traceability and Logging Policy https://edms.cern.ch/document/428037
• Approval of Certification Authorities https://edms.cern.ch/document/428038
• Policy on Grid Multi-User Pilot Jobs https://edms.cern.ch/document/855383