
EGEE Operations

EGEE Operations. John Gordon, CCLRC-RAL International Symposium of Grid Computing 2005 Taipei, April 2005. Acknowledgements. The work behind this presentation and many of the slides are the work of others: Hélène Cordier, Gilles Matthieu, Pierre Girard IN2P3


Presentation Transcript


  1. EGEE Operations • John Gordon, CCLRC-RAL • International Symposium of Grid Computing 2005 • Taipei, April 2005

  2. Acknowledgements • The work behind this presentation and many of the slides are the work of others: • Hélène Cordier, Gilles Matthieu, Pierre Girard (IN2P3) • Piotr Nyczyk, Judit Novak, Laurence Field (CERN) • Dave Kant, Philippa Strange, Matt Thorpe (CCLRC-RAL)

  3. Outline • Operations: CIC Portal, GOCDB, Monitoring, CIC on Duty, Accounting • What’s New: BDII config, UI to GOCDB, Register for a VO, Historic Monitoring Data

  4. Objective: A day in SA1 Ops life • CIC • Definition • Operational tool • CIC Operations • CIC-on-duty • Definition • Procedure • Operations • Monitoring tools • Escalation Scenarios • Severity • Deadline Expiration • Next Steps • EGEE GLOSSARY • SA1: EGEE European Grid Support, Operation and Management activity • SA2: EGEE Network Resource Provision Activity • OMC: Operation Management Centre • CIC: Core Infrastructure Centre • ROC: Regional Operation Centre • RC: Resource Centre

  5. CIC • Operate essential grid services and act as Grid Operation Centre • Run Central Grid Services • one-per-grid services such as the top-level BDII, accounting, VO • catch-all services also run by others: RB, UI, BDII • Support VO Services • Objectives • Transparency • Information sharing • Troubleshooting in conjunction with ROCs

  6. CIC portal Overview • http://cic.in2p3.fr • a management tool for CIC objectives • an entry point for all EGEE actors for their operational needs

  7. SUPERVISION: CIC-on-duty tools • CIC-on-duty Operations • Monitoring • Ticketing/tracking • Reporting • CICs-on-duty check and report problems occurring on the EGEE grid. • Institutes involved: CERN – IN2P3 – INFN – RAL – Academia Sinica

  8. SUPERVISION: CIC-on-duty tools • Involved tools • [Diagram labels: Ticketing system (GGUS) • Interface (CIC Portal) • mails • Monitoring tools (SFT, GOC DB, Gstat, static data, …)]

  9. COMMUNICATION • Contacts lists • ROC Managers/Deputies • Stored in the GOC-DB, maintained by GOC and IN2P3 • CIC Managers/Deputies • Stored and maintained by IN2P3 • CICs-on-duty staff • Stored and maintained by IN2P3 • VO Managers / technical experts • Contact “collection” (ESC, OMC, GGUS...) • Maintained by OMC

  10. COMMUNICATION • EGEE BROADCAST • A link between different communities (e.g. sites – VOs...) • Technical or General Publications • Mails and News • Available from the GGUS Portal

  11. INFORMATION • Information on VOs • Resources: Storage and CPUs • Services

  12. INFORMATION • Information on sites • CEs and corresponding CPUs • SEs and corresponding storage capacities • Supported VOs - Queues • Configuration/middleware • Information on services • Resource Brokers • BDIIs • MyProxies • R-GMA • RLS

  13. GOCDB • The Grid Operations Centre Database • A database sitting behind a certificate-enabled web front-end (GridSite) • Anyone with a certificate from a recognised VO can read • Site Managers can update their site information • Region Managers can update all sites in their region and add new ones • Contains static information about grid sites, such as: • Name, location, phone contact • Hostnames of various grid nodes (not worker nodes) • Scheduled downtime • Contact details for staff • Used for configuration of monitoring, accounting, middleware, …
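
The access rules on this slide can be sketched as a small permission check. This is a minimal illustration only: the role labels, class names and function names are hypothetical and are not taken from GOCDB or GridSite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    role: str          # hypothetical labels: "reader", "site_manager", "region_manager"
    site: str = ""     # site managed by a site manager
    region: str = ""   # region managed by a region manager


@dataclass(frozen=True)
class Site:
    name: str
    region: str


def can_read(user, site):
    # Assumes the certificate check has already happened in the web front-end:
    # anyone with a certificate from a recognised VO can read.
    return True


def can_update(user, site):
    # Region managers can update every site in their own region.
    if user.role == "region_manager":
        return user.region == site.region
    # Site managers can update only their own site.
    if user.role == "site_manager":
        return user.site == site.name
    return False


def can_add_site(user, region):
    # Only region managers can add new sites, and only in their own region.
    return user.role == "region_manager" and user.region == region
```

Keeping the rules as plain functions makes the three levels (read, update own site, manage region) easy to compare against the slide.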

  14. Organisational Structures • Developed a tool to manage organisational structures • Modelled on the GridPP Tier1/2 structure • Materialised path encoding • Provide ROCs with a package to monitor the resources in the region • Tailored monitoring • Administrative roles for the coordinators in GOCDB
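
Materialised path encoding stores, for every node of a tree, the full path from the root, so that subtree queries become prefix matches. The sketch below assumes a slash-separated path format and uses invented node names; the actual GOCDB schema is not shown in the slides.

```python
# Hypothetical organisation tree stored as materialised paths.
# Each node keeps its full path from the root.
NODES = {
    "EGEE": "/EGEE",
    "UKI": "/EGEE/UKI",
    "Tier1": "/EGEE/UKI/Tier1",
    "Tier2": "/EGEE/UKI/Tier2",
    "SiteA": "/EGEE/UKI/Tier2/SiteA",  # illustrative site names
    "SiteB": "/EGEE/UKI/Tier2/SiteB",
    "France": "/EGEE/France",
}


def descendants(name):
    """Return every node strictly below `name`, ordered by path."""
    # The trailing slash stops a prefix such as /EGEE/UK matching /EGEE/UKI.
    prefix = NODES[name] + "/"
    below = [n for n, path in NODES.items() if path.startswith(prefix)]
    return sorted(below, key=lambda n: NODES[n])


def ancestors(name):
    """Return the chain of parents from the root down to the direct parent."""
    parts = NODES[name].strip("/").split("/")
    return parts[:-1]
```

With this encoding, "all resources in a region" is a single prefix query such as `descendants("UKI")`, which is what makes per-region monitoring views cheap to build.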

  15. Selection of Monitoring tools • GIIS Monitor • GIIS Monitor graphs • Sites Functional Tests • GOC Data Base • Scheduled Downtimes • Live Job Monitor • GridIce – VO view • GridIce – fabric view • Certificate Lifetime Monitor

  16. CIC-on-duty • CIC-on-duty agenda: weekly shifts • GGUS, ROC user support, mail • Monitor, diagnose troubles • Contact site administrators, ROC • Problem Tracking System • Follow-up • Weekly meetings • Quarterly meetings • Log file

  17. CIC-on-duty • CIC-on-duty Operations • Monitoring • Ticketing/tracking • Reporting • CICs-on-duty check and report problems occurring on the EGEE grid. • Institutes involved: CERN – IN2P3 – INFN – RAL – Academia Sinica

  18. SUPERVISION: CIC-on-duty tools • Involved tools • [Diagram labels: Ticketing system (GGUS) • Interface (CIC Portal) • mails • Monitoring tools (SFT, GOC DB, Gstat, static data, …)]

  19. CIC Operations • https://cic.in2p3.fr/index.php?id=cic&subid=cic_dash2 • CIC operational procedure • High-level abstraction of core tools results • Link all existing tools • monitoring • diagnosis • communication • mail • follow up log • Ops Procedure

  20. CIC on Duty Dashboard • Current status of active operational problems • List of sites for easy status update and raising tickets • List of sites failing various tests • List of open trouble tickets

  21. EGEE Operations

  22. EGEE Operations

  23. EGEE Operations

  24. GGUS • Global Grid User Support • was developed by FZK for user support tickets • now used by COD for handling tickets for operational problems • The front-end pages are: • a main page defined as the entry point, • a page showing details about a given site, • a page showing details about a given ticket and allowing it to be updated, • a page for creating a ticket and contacting the corresponding site. • Information displayed is: • list of all GGUS tickets, or a filtered subset (open, expired) • list of GGUS tickets for a given site • details or history for a given GGUS ticket • SFT results and current Gstat status for a given site • contact information for a given site • Possible actions through this interface are: • ask for details of SFT results for a given site • report a new problem for a given site (create a new GGUS ticket and contact the corresponding site) • update/escalate a given GGUS ticket • send the mail when setting “Action Taken” to “2nd mail to site admins”
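
The ticket views listed on this slide (all, open, expired, per site) amount to filters over a ticket list. The sketch below is illustrative only: the `Ticket` fields, the status labels and the reading of "expired" as "open with a deadline in the past" are assumptions, not the GGUS data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Ticket:
    ticket_id: int
    site: str
    status: str      # hypothetical labels: "open" or "closed"
    deadline: date   # assumed per-ticket response deadline


def filter_tickets(tickets, today, site=None, only_open=False, only_expired=False):
    """Return the tickets matching every requested filter."""
    result = []
    for t in tickets:
        if site is not None and t.site != site:
            continue
        if only_open and t.status != "open":
            continue
        # Assumption: a ticket is expired when it is still open past its deadline.
        if only_expired and not (t.status == "open" and t.deadline < today):
            continue
        result.append(t)
    return result
```

For example, `filter_tickets(tickets, date.today(), only_expired=True)` would give the list a CIC-on-duty operator needs when deciding which tickets to escalate.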

  25. EGEE Operations

  26. Escalation • If a site does not respond to the initial ticket or resolve the issue, the ticket is escalated: • They get a second mail • They get a phone call • Their Manager gets a phone call • Their Region gets alerted • The GDB is notified • Then what? No-one has gone this far • See future work for methods of removing sites.
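
The escalation ladder above can be written as an ordered list with a "next step" lookup. The step labels paraphrase the slide; the function name is hypothetical, and the deadlines between steps are not given in the slides, so none are modelled here.

```python
# Escalation steps in the order given on the slide.
ESCALATION_STEPS = [
    "initial ticket",
    "second mail to site admins",
    "phone call to site",
    "phone call to site manager",
    "alert to region",
    "notification to GDB",
]


def next_step(current):
    """Return the step that follows `current`, or None at the top of the ladder."""
    index = ESCALATION_STEPS.index(current)  # raises ValueError for unknown steps
    if index + 1 < len(ESCALATION_STEPS):
        return ESCALATION_STEPS[index + 1]
    # The slide leaves this case open: "Then what? No-one has gone this far".
    return None
```

Returning `None` at the last step mirrors the slide: what happens after GDB notification is left to future work.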

  27. Current state of operations • Four sites sharing duty in weekly turns • Procedures defined, constantly under review and in use • Monitoring tools and in-depth testing • Communication tool • Problem Tracking System

  28. EGEE Operations

  29. GOC Accounting Services • http://goc.grid-support.ac.uk/gridsite/accounting/index.html • On Demand Services to EGEE Community • Simple interface to customise views of data: VO, time frame and Region (default = EGEE) • BaseCpuSeconds views: each Site, per VO, per Month • each Region, per VO, per Month • aggregated across EGEE • Other distributions: Normalised CPU • # Jobs

  30. Accounting Reports on Demand • Web form to apply selection criteria on the data • Select date range • Select VOs (Default = All) • Aggregate data across an organisation structure (Default = All ROCs)
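
The report form (date range, VO selection, aggregation level) can be sketched as one aggregation function. The record layout, function name and all figures below are invented for illustration; they are not taken from the GOC accounting database.

```python
from collections import defaultdict

# Illustrative records only: (site, region, vo, month, cpu_seconds)
RECORDS = [
    ("SiteA", "UKI", "atlas", "2005-01", 1000),
    ("SiteA", "UKI", "cms", "2005-01", 500),
    ("SiteB", "UKI", "atlas", "2005-02", 2000),
    ("SiteC", "France", "atlas", "2005-01", 4000),
]


def summed_cpu(records, start, end, vos=None, regions=None, by="region"):
    """Sum CPU seconds per (site-or-region, VO, month) for months in [start, end].

    `vos=None` and `regions=None` mean "all", matching the form defaults.
    """
    key_index = {"site": 0, "region": 1}[by]
    totals = defaultdict(int)
    for rec in records:
        region, vo, month, cpu = rec[1], rec[2], rec[3], rec[4]
        if not (start <= month <= end):  # "YYYY-MM" strings sort chronologically
            continue
        if vos is not None and vo not in vos:
            continue
        if regions is not None and region not in regions:
            continue
        totals[(rec[key_index], vo, month)] += cpu
    return dict(totals)
```

Calling `summed_cpu(RECORDS, "2005-01", "2005-02", by="site")` gives the per-site, per-VO, per-month view; the default `by="region"` gives the per-region view.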

  31. [Figure labels: Summed CPU (seconds) consumed by resources in selected Region • VO Index • Selected Date Range]

  32. New Work • Security Contacts • Weekly reports • Volunteering to support a VO • User Interface to GOC Information • Managing the BDII

  33. INFORMATION • Security Contacts • Work with Operational Security Coordination (Ian Neilson, CERN – Romain Wartel, RAL) • RSS feed on security alerts will be displayed • Contacts used to populate security alert mailing lists

  34. CIC: Regional Operations • Regional Weekly Report (GDA) • Current status • Weekly reports are currently attached directly by ROCs to the agenda • Not easy to search in previous reports • Proposal: an online form • Automatically filled (scheduled downtimes, site status...) • Stored in a way that allows a search mechanism • Such a form is currently online for testing

  35. CIC: Regional Operations • New VO integration for a RC • Proposal for sites that want to support a new VO

  36. Reporting Tool Prototype • Organisational identities taken from GOCDB

  37. Reporting Tool Prototype

  38. Managing the Information System • Sites publish supported VOs. • VOs query the top level BDII. • VOs want to choose sites. • VO-specific BDIIs. • Do not work. • Inconsistent view of the Grid. • Solution • Use the same list of sites everywhere. • Use ldapmodify after the ldapadd • Download the LDIF from a web page. • The VOs can choose what sites to use. • Web page portal, using GridSite Security. • Can be used to do other modifications. • [Diagram labels: Top Level BDII • Site Level BDIIs • Site A Resources • Site B Resources • links labelled LDAP]

  39. Main Page

  40. VO Page

  41. Final List

  42. Generation of BDII configuration file via feedback into IS • [Diagram labels: Monitoring Services (GOC DB site info, Gstat data, Site Functional Tests, GOC hourly tests) • 100’s of Sites • Environments: Production, VO, GridPP, … • Total List of all sites • RGMA • GOC Bit • Sites pass core tests • White List • Black List • Trusted Sites • BDII • Adaptive Job Brokering] • Total List of all sites: derived from GOCDB (via RGMA) • GOC bit: sites which have opted out, e.g. for scheduled maintenance • White List: sites that failed one or more core tests but are well supported are put back in, e.g. a Tier1 site • Core tests: a subset of the site functional tests run by CERN every day • Black List: sites that are not trusted
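
The list definitions on this slide can be combined with set operations to give the trusted-site list fed to the BDII. This is a sketch under stated assumptions: the slide does not give the exact order of the filters, so here the white list only restores sites that have not opted out, and the black list is applied last. The function name and sample site names are hypothetical.

```python
def trusted_sites(all_sites, opted_out, passed_core_tests, white_list, black_list):
    """Derive the list of sites to publish in the BDII configuration."""
    # Start from the total list (from GOCDB) and drop sites with the GOC bit set.
    candidates = set(all_sites) - set(opted_out)
    # Keep the sites that pass the core tests.
    passing = candidates & set(passed_core_tests)
    # Assumption: white-listed sites are put back in even if they failed a test,
    # but not if they have opted out.
    restored = candidates & set(white_list)
    # Assumption: the black list always wins and is applied last.
    return sorted((passing | restored) - set(black_list))


# Illustrative input only.
example = trusted_sites(
    all_sites=["A", "B", "C", "D", "E"],
    opted_out=["B"],              # e.g. scheduled maintenance
    passed_core_tests=["A", "B", "E"],
    white_list=["C"],             # failed a core test but well supported
    black_list=["E"],             # not trusted
)
```

In this example only sites A and C remain: B has opted out, D failed the core tests without being white-listed, and E is black-listed.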

  43. BDII Summary • Consistent view of the Grid. • VOs can choose what sites to use. • and under what circumstances. • Increased service availability. • Failing sites automatically removed. • Higher success rate. • Extensibility. • Mechanism can be used for other purposes. • CIC/ROC management. • Scheduled/Unscheduled downtime etc.

  44. Summary and Conclusion • Tools for operations have been developed • and are now being used seriously to keep the service running • Link between communities • GOCs, Regional OCs, developers, supporters, sites • Link between activities • This is the result of a growing collaboration between various partners

  45. Thank You • To Academia Sinica • for their invitation and hospitality • To Simon and his staff • for their friendly reception and help
