EGEE Operations John Gordon, CCLRC-RAL International Symposium of Grid Computing 2005 Taipei, April 2005
Acknowledgements • The work behind this presentation and many of the slides are the work of others: • Hélène Cordier, Gilles Matthieu, Pierre Girard IN2P3 • Piotr Nyczyk, Judit Novak, Laurence Field CERN • Dave Kant, Philippa Strange, Matt Thorpe, CCLRC-RAL EGEE Operations
Outline • Operations • CIC Portal • GOCDB • Monitoring • CIC on Duty • Accounting • What’s New • BDII config • UI to GOCDB • Register for a VO • Historic Monitoring Data EGEE Operations
Objective: A day in SA1 Ops life • CIC • Definition • Operational tool • CIC Operations • CIC-on-duty • Definition • Procedure • Operations • Monitoring tools • Escalation Scenarios • Severity • Deadline Expiration • Next Steps • EGEE GLOSSARY • SA1: EGEE European Grid Support, Operation and Management activity • SA2: EGEE Network Resource Provision Activity • OMC: Operation Management Centre • CIC: Core Infrastructure Centre • ROC: Regional Operation Centre • RC: Resource Centre EGEE Operations
CIC • Operate essential grid services and act as Grid Operation Centre • Run Central Grid Services • once-per-grid services such as top BDII, accounting, VO • catch-all services also run by others: RB, UI, BDII • Support VO Services • Objectives • Transparency • Information sharing • Troubleshooting in conjunction with ROCs EGEE Operations
CIC portal Overview • http://cic.in2p3.fr • a management tool for CIC objectives • an entry point for all EGEE actors for their operational needs EGEE Operations
SUPERVISION: CIC-on-duty tools • CIC-on-duty Operations • Monitoring • Ticketing/tracking • Reporting • CICs-on-duty check and report problems occurring on the EGEE grid. • Institutes involved: CERN – IN2P3 – INFN – RAL – Academia Sinica EGEE Operations
SUPERVISION: CIC-on-duty tools • Involved tools • Ticketing system: GGUS • Interface: CIC Portal, mails • Monitoring tools: SFT, Gstat, … • Static data: GOC DB EGEE Operations
COMMUNICATION • Contacts lists • ROC Managers/Deputies • Stored in the GOC-DB, maintained by GOC and IN2P3 • CIC Managers/Deputies • Stored and maintained by IN2P3 • CICs-on-duty staff • Stored and maintained by IN2P3 • VO Managers / technical experts • Contact “collection” (ESC, OMC, GGUS...) • Maintained by OMC EGEE Operations
COMMUNICATION • EGEE BROADCAST • A link between different communities (e.g. sites – VOs...) • Technical or General Publications • Mails and News • Available from the GGUS Portal EGEE Operations
INFORMATION • Information on VOs • Resources: Storage and CPUs • Services EGEE Operations
INFORMATION • Information on sites • CEs and corresponding CPUs • SEs and corresponding storage capacities • Supported VOs - Queues • Configuration/middleware • Information on services • Resource Brokers • BDIIs • MyProxies • R-GMA • RLS EGEE Operations
GOCDB • The Grid Operations Centre Database • A database sitting behind a certificate-enabled web front-end (GridSite) • Anyone with a certificate from a recognised VO can read it • Site Managers can update their site information • Region Managers can update all sites in their region and add new ones • Contains static information about grid sites, such as: • Name, location, phone contact • Hostnames of various grid nodes (not worker nodes) • Scheduled downtime • Contact details for staff • Used for configuration of monitoring, accounting, middleware, … EGEE Operations
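The kind of static record described on the GOCDB slide can be sketched as a small data structure. This is only an illustration: the class names, field names and example values below are hypothetical and are not the real GOCDB schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Downtime:
    """One scheduled downtime window (hypothetical structure)."""
    start: datetime
    end: datetime
    reason: str = ""


@dataclass
class SiteRecord:
    """Static site information; field names are illustrative only."""
    name: str
    region: str
    location: str
    contact_phone: str
    # Service nodes such as CE or SE hosts, not worker nodes.
    node_hostnames: List[str] = field(default_factory=list)
    downtimes: List[Downtime] = field(default_factory=list)

    def in_scheduled_downtime(self, when: datetime) -> bool:
        """Return True if `when` falls inside any declared downtime window."""
        return any(d.start <= when < d.end for d in self.downtimes)


# Made-up example site with one downtime window.
site = SiteRecord(
    name="EXAMPLE-SITE",
    region="ExampleROC",
    location="Example City",
    contact_phone="+00 0000 0000",
    node_hostnames=["ce.example.org", "se.example.org"],
    downtimes=[
        Downtime(datetime(2005, 4, 25, 8), datetime(2005, 4, 25, 18), "upgrade")
    ],
)
```

A monitoring or accounting tool configured from such records could call `site.in_scheduled_downtime(...)` before raising an alarm against a site that has declared maintenance.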
Organisational Structures • Developed a tool to manage organisational structures. • Modelled on the GridPP Tier1/2 structure • Materialised Path Encoding • Provide ROCs with a package to monitor the resources in the region • Tailored Monitoring • Administrative roles to the coordinators in GOCDB EGEE Operations
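Materialised path encoding stores, for each node of a tree, the full path from the root, so that subtree and ancestor queries become simple string operations. The sketch below illustrates the general technique only. The node names and the dictionary layout are hypothetical, and nothing here is taken from the actual GOCDB implementation.

```python
from typing import Dict, List

# Hypothetical organisational tree: node name -> materialised path from root.
nodes: Dict[str, str] = {
    "EGEE": "/EGEE",
    "UKI": "/EGEE/UKI",
    "Tier1": "/EGEE/UKI/Tier1",
    "Tier2-A": "/EGEE/UKI/Tier2-A",
    "Site-X": "/EGEE/UKI/Tier2-A/Site-X",
    "France": "/EGEE/France",
}


def descendants(name: str) -> List[str]:
    """All nodes below `name`: their path starts with the node's path plus '/'."""
    prefix = nodes[name] + "/"
    return sorted(n for n, path in nodes.items() if path.startswith(prefix))


def ancestors(name: str) -> List[str]:
    """Ancestors of `name`, root first, read directly from its stored path."""
    return nodes[name].strip("/").split("/")[:-1]
```

With this encoding, "all resources in a region" is a single prefix match, which is the kind of query a regional monitoring view needs.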
Selection of Monitoring tools • GIIS Monitor • GIIS Monitor graphs • Sites Functional Tests • GOC Data Base • Scheduled Downtimes • Live Job Monitor • GridIce – VO view • GridIce – fabric view • Certificate Lifetime Monitor EGEE Operations
CIC-on-duty • CIC-on-duty agenda • Weekly shifts • GGUS, ROC user support, mail • Monitor, diagnose troubles • Contact site administrators, ROC • Problem Tracking System • Follow-up • Weekly meetings • Quarterly meetings • Log file EGEE Operations
CIC-on-duty • CIC-on-duty Operations • Monitoring • Ticketing/tracking • Reporting • CICs-on-duty check and report problems occurring on the EGEE grid. • Institutes involved: CERN – IN2P3 – INFN – RAL – Academia Sinica EGEE Operations
CIC Operations • https://cic.in2p3.fr/index.php?id=cic&subid=cic_dash2 • CIC operational procedure • High-level abstraction of core tools results • Link all existing tools • monitoring • diagnosis • communication • mail • follow-up log • Ops Procedure EGEE Operations
CIC on Duty Dashboard • Current status of active operational problems • List of sites for easy status update and raising tickets • List of sites failing various tests • List of open trouble tickets EGEE Operations
GGUS • Global Grid User Support • was developed by FZK for user support tickets • now used for handling tickets for operational problems by COD • The front-end pages are: • a main page defined as entry point, • a page showing details about a given site, • a page showing details about a given ticket and allowing it to be updated, • a page for creating a ticket and contacting the corresponding site. • Information displayed is: • list of all GGUS tickets, or just some of them using filters (open, expired) • list of GGUS tickets for a given site • details or history for a given GGUS ticket • SFT results and gstat current status for a given site • contact information for a given site • Possible actions through this interface are: • ask for details of SFT results for a given site • report a new problem for a given site (create a new GGUS ticket and contact the corresponding site) • update/escalate a given GGUS ticket • send the mail when setting “Action Taken” to “2nd mail to site admins” EGEE Operations
Escalation • If sites do not respond to the initial ticket or resolve the issue, it is escalated: • They get a second mail • They get a phone call • Their manager gets a phone call • Their region gets alerted • The GDB is notified • Then what? No one has gone this far • See future work for methods of removing sites. EGEE Operations
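The escalation ladder on this slide can be sketched as an ordered list with a helper that moves a ticket one step up. The step labels paraphrase the slide, and the function name and the rule of staying on the last step are assumptions, since the slide leaves open what happens after the GDB is notified.

```python
from typing import List

# Ordered escalation ladder, paraphrasing the slide (labels are illustrative).
ESCALATION_STEPS: List[str] = [
    "initial ticket",
    "2nd mail to site admins",
    "phone call to site",
    "phone call to site manager",
    "region alerted",
    "GDB notified",
]


def next_step(current: str) -> str:
    """Return the next escalation step for an unresolved ticket.

    Assumption: a ticket already at the last step stays there, because the
    slide does not define any action beyond notifying the GDB.
    """
    index = ESCALATION_STEPS.index(current)  # raises ValueError if unknown
    return ESCALATION_STEPS[min(index + 1, len(ESCALATION_STEPS) - 1)]
```

For example, `next_step("initial ticket")` gives the second mail, and repeated calls walk the ticket up to the GDB notification.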
Current state of operations • Four sites sharing duty in weekly turns • Procedures defined, constantly under review and in use • Monitoring tools and in-depth testing • Communication tool • Problem Tracking System EGEE Operations
GOC Accounting Services • http://goc.grid-support.ac.uk/gridsite/accounting/index.html • On-demand services to the EGEE community • Simple interface to customise views of data: VO, time frame and Region (default = EGEE) • BaseCpuSeconds aggregated across EGEE • Each Region, per VO, per Month • Each Site, per VO, per Month • Other distributions: Normalised CPU, # Jobs EGEE Operations
Accounting Reports on Demand • Web form to apply selection criteria on the data • Select date range • Select VOs (default = All) • Aggregate data across an organisation structure (default = All ROCs) EGEE Operations
[Figure: summed CPU (seconds) consumed by resources in the selected Region, by VO, over the selected date range] EGEE Operations
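The on-demand accounting views (per region, per VO, per month, with optional selection by VO, region and date range) amount to a filtered group-by sum. The following is a minimal sketch under that reading. The record layout, the function name and every value in the sample data are made up for illustration and do not come from the accounting service.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# Made-up usage records: (site, region, vo, month "YYYY-MM", cpu_seconds).
Record = Tuple[str, str, str, str, int]
records = [
    ("site-a", "ROC-1", "vo-a", "2005-03", 1200),
    ("site-b", "ROC-1", "vo-a", "2005-03", 800),
    ("site-b", "ROC-1", "vo-b", "2005-03", 500),
    ("site-c", "ROC-2", "vo-a", "2005-04", 300),
]


def summed_cpu(
    rows: Iterable[Record],
    vos: Optional[Set[str]] = None,      # None means all VOs
    regions: Optional[Set[str]] = None,  # None means all regions
    start: Optional[str] = None,         # inclusive "YYYY-MM"
    end: Optional[str] = None,           # inclusive "YYYY-MM"
) -> Dict[Tuple[str, str, str], int]:
    """Sum CPU seconds per (region, vo, month) after applying the selection."""
    totals: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for _site, region, vo, month, cpu in rows:
        if vos is not None and vo not in vos:
            continue
        if regions is not None and region not in regions:
            continue
        # "YYYY-MM" strings compare correctly in lexicographic order.
        if (start is not None and month < start) or (end is not None and month > end):
            continue
        totals[(region, vo, month)] += cpu
    return dict(totals)
```

Calling `summed_cpu(records)` with no selection corresponds to the defaults on the form (all VOs, all ROCs); passing `regions={"ROC-2"}` narrows the report to one region.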
New Work • Security Contacts • Weekly reports • Volunteering to support a VO • User Interface to GOC Information • Managing the BDII EGEE Operations
INFORMATION • Security Contacts • Work with Operational Security Coordination (Ian Neilson, CERN – Romain Wartel, RAL) • RSS feed on security alerts will be displayed • Contacts used to populate security alert mailing lists EGEE Operations
CIC: Regional Operations • Regional Weekly Report (GDA) • Current status • Weekly reports are currently attached directly by ROCs to the agenda • Previous reports are not easy to search • Proposal: an online form • Automatically filled (scheduled downtimes, sites status...) • Stored in a way that allows searching • Such a form is currently online for testing EGEE Operations
CIC: Regional Operations • New VO integration for a RC • Proposal for sites that want to support a new VO EGEE Operations
Reporting Tool Prototype Organisational Identities taken from GOCDB EGEE Operations
Reporting Tool Prototype EGEE Operations
Managing the Information System • Sites publish supported VOs. • VOs query the top-level BDII. • VOs want to choose sites. • VO-specific BDIIs: • Do not work. • Inconsistent view of the Grid. • Solution • Use the same list of sites everywhere. • Use ldapmodify after the ldapadd. • Download the LDIF from a web page. • The VOs can choose what sites to use. • Web page portal, using GridSite security. • Can be used to do other modifications. • [Diagram: Top Level BDII, Site Level BDIIs, Site A Resources and Site B Resources, linked by LDAP] EGEE Operations
Main Page EGEE Operations
VO Page EGEE Operations
Final List EGEE Operations
Adaptive Job Brokering • Generation of BDII configuration file via feedback into IS • [Diagram: Monitoring Services (GOC DB site info, Gstat data, Site Functional Tests, GOC hourly tests) over 100’s of sites; Total List of all sites (RGMA); GOC Bit; Sites pass core tests; White List; Black List; Trusted Sites; BDII; Environments: Production, VO, GridPP, …] • Total List of all sites is derived from GOCDB (via RGMA) • GOC bit: sites which have opted out, e.g. for scheduled maintenance • Core tests are a subset of the site functional tests run by CERN every day • White List: sites that failed one or more core tests but are well supported are put back in, e.g. a Tier1 site • Black List: sites that are not trusted EGEE Operations
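The filtering described on this slide can be read as a sequence of set operations on the total list of sites. The sketch below is one possible reading, not the actual generation script: the function name and site names are hypothetical, and it assumes that the opt-out (GOC bit) and the black list both take precedence over the white list.

```python
from typing import Iterable, List


def trusted_sites(
    all_sites: Iterable[str],          # total list, e.g. derived from GOCDB
    opted_out: Iterable[str],          # sites with the GOC bit set
    passed_core_tests: Iterable[str],  # sites passing the core tests
    white_list: Iterable[str],         # failing but well-supported sites
    black_list: Iterable[str],         # sites that are not trusted
) -> List[str]:
    """Return the sorted list of sites to publish in the BDII configuration.

    Assumed precedence: opted-out and black-listed sites are always removed,
    even if they also appear on the white list.
    """
    candidates = set(all_sites) - set(opted_out)
    passing = candidates & set(passed_core_tests)
    reinstated = candidates & set(white_list)
    return sorted((passing | reinstated) - set(black_list))


# Made-up example: five sites with one in each category.
example = trusted_sites(
    all_sites=["site-a", "site-b", "site-c", "site-d", "site-e"],
    opted_out=["site-e"],
    passed_core_tests=["site-a", "site-b", "site-e"],
    white_list=["site-c"],
    black_list=["site-b"],
)
```

Here `site-a` passes the tests, `site-c` is reinstated by the white list, `site-b` is dropped by the black list, `site-d` fails the tests, and `site-e` has opted out.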
BDII Summary • Consistent view of the Grid. • VOs can choose what sites to use. • and under what circumstances. • Increased service availability. • Failing sites automatically removed. • Higher success rate. • Extensibility. • Mechanism can be used for other purposes. • CIC/ROC management. • Scheduled/Unscheduled downtime etc. EGEE Operations
Summary and Conclusion • Tools for operations have been developed • and are now being used seriously to keep the service running • Link between communities • GOCs, Regional OCs, developers, supporters, sites • Link between activities • This is the result of a growing collaboration between various partners EGEE Operations
Thank You • To Academia Sinica • for their invitation and hospitality • To Simon and his staff • for their friendly reception and help EGEE Operations