Grid Operations in EGI / NGIs • Operations task-force: Maite Barroso, Jamie Shiers, Nick Thackray (CH), Sven Hermann (DE), Rolf Rumler (FR), Per Öster (FI), Fotis Karayannis (GR), Tiziana Ferrari (IT) + input from John Gordon (GB) and many others • A Draft Model for Discussion • Jamie.Shiers@cern.ch • Rome, 13-14 March 2008
Disclaimer • What is presented here is clearly a draft for discussion • It is based upon a number of assumptions, which will be explained next • Feedback on this proposal, not only from the NGIs and experts, but also from the Application Communities that they support, is clearly required and is actively solicited www.eu-egi.org
Key Assumptions • Timeline: • There are many applications using Grid resources in a production fashion today • Continuity must be provided to these communities – in other words the move to an “EGI world” must be non-disruptive and on time • Functionality: • Similarly, the key functionality provided in terms of operations must be maintained both during and after the transition phase
What is EGI Operations? • To answer this question, we need a much better idea of what “the EGI Grid” will be… Is it: • A large-scale, production Grid infrastructure – built on National Grids that interoperate seamlessly at many levels, offering reliable and predictable services to a wide range of applications, ranging from “mission critical” to prototyping and research? • A loosely coupled federation of NGIs with little or no cross-grid activity, heterogeneous and sometimes incompatible middleware stacks, no cross-grid accounting, no need for coordinated operations or management?
What is EGI Operations? • To answer this question, we need a much better idea of what “the EGI Grid” will be… Focus on: • A large-scale, production Grid infrastructure – built on National Grids that interoperate seamlessly at many levels, offering reliable and predictable services to a wide range of applications, ranging from “mission critical” to prototyping and research • A loosely coupled federation of NGIs with little or no cross-grid activity, heterogeneous and sometimes incompatible middleware stacks, no cross-grid accounting, no need for coordinated operations or management
And the EGI Added Value? • In order to be both attractive and maintainable, Grids need to have the following attributes: • Low cost of entry; • Low cost of ownership, both in terms of operations and of application and user support. • The basic principles of reliability and usability must be designed in from the start – adding them later is not consistent with the goal of low cost of ownership.
How is this achieved? • We should not forget one of the key features of the Grid – resilience to failure / scheduled downtime of individual components and / or sites • This significant advantage can only be realised through a sufficient degree of interoperability & interoperation • But it also gives individual NGIs much more freedom & flexibility!
EGI Operations Principles • Reliability of Grid services and SLAs; • Multi-level operation model; • EGI, NGI and ROC; • Multiple middleware stacks; • Planning, coordination and gathering of new requirements; • Cooperation; • Federation, interoperability and data aggregation.
Reliability / SLAs • A large-scale, production Grid infrastructure – built on National Grids that interoperate seamlessly at many levels, offering reliable and predictable services to a wide range of applications, ranging from “mission critical” to prototyping and research. • It is understood that it will be a long and continuous process to reach this goal, with additional NGIs and/or application communities joining at different times, with varying needs and different levels of “maturity” • In addition, sites of widely varying size, complexity and stage of maturity must clearly be taken into account • The EGI shall negotiate the minimal size and set of functions for an NGI to participate in a wider context, including the associated Service Level Agreements. • This includes the agreement and follow-up of the associated certification processes. In some cases, these requirements may be more stringent than those used within a given NGI. • Only a subset of sites participating within an NGI may satisfy the wider requirements at the EGI level.
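The point that EGI-level requirements may be stricter than those of an NGI, so that only a subset of an NGI's sites qualifies, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the slides give no thresholds, site names or data model, so the `Site` class, the two availability thresholds and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    ngi: str
    availability: float  # fraction of time the site passed its tests, 0.0 to 1.0


# Hypothetical thresholds: the source gives no numbers for either level.
NGI_MIN_AVAILABILITY = 0.70
EGI_MIN_AVAILABILITY = 0.90


def sites_meeting(sites, threshold):
    """Return the sites whose availability is at or above the threshold."""
    return [s for s in sites if s.availability >= threshold]


# Hypothetical sample data for a single NGI.
sites = [
    Site("site-a", "NGI-X", 0.95),
    Site("site-b", "NGI-X", 0.80),
    Site("site-c", "NGI-X", 0.60),
]

ngi_level = sites_meeting(sites, NGI_MIN_AVAILABILITY)
egi_level = sites_meeting(sites, EGI_MIN_AVAILABILITY)

print([s.name for s in ngi_level])  # ['site-a', 'site-b']
print([s.name for s in egi_level])  # ['site-a']
```

Because the EGI threshold is the higher of the two, the EGI-level list is always a subset of the NGI-level list. A real SLA would cover more than a single availability figure.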
Multi-level operation model • Highly centralized models – e.g. for monitoring – have been shown to be both intrusive and non-scalable • This suggests a move to a multi-level operations model • e.g. EGI / regional “cluster” / NGI … • Whilst building on the positive experience of today’s production Grids, these concerns must nevertheless be taken into account as part of the EGI / NGI architecture. • This includes designing and deploying for low cost of entry and ownership, whilst maintaining sufficient flexibility to meet the requirements of the application clusters. • The EGI shall foster agreement on the definition of the key operations infrastructure, its establishment and delivery. • Such functions are preferably located at one or more NGIs • to offer both resilience and scalability
EGI, NGI and ROC • The participation of NGIs in the operation of the European grid infrastructure requires a set of services to be operated in a coherent way. • Currently, within EGEE, this is guaranteed by the ROCs (Regional Operations Centres), which either span several countries (NGIs) or serve one country only. • The NGIs must ensure that the services are operated, either at the NGI level or by associating into ROC equivalents. • Regardless of the technical organization, all the NGIs need to be individually represented in an EGI operations board, where strategies and general problems are discussed.
Multiple middleware stacks • EGI operations will be responsible for guaranteeing support to all the adopted middleware stacks in collaboration with the operations staff from NGIs
Planning, coordination and gathering of new requirements • The EGI operations team is mainly responsible for operations planning and coordination of efforts by the various NGIs and other parties. • Also, EGI operations staff work towards a smooth evolution of tools and operational procedures according to any new requirements that are gathered
Cooperation • EGI and NGI operations cooperate to solve problems of common interest such as: • guidelines for robust services, • security best practices, • middleware security issues, • steering of new developments, • site maintenance, • intervention procedures, • incident response, • escalation procedures • and so forth. • For this reason, EGI promotes and coordinates meetings, workshops, EGI and NGI joint working groups, etc.
Federation, interoperability and data aggregation • EGI must federate a variety of operational aspects – some of which are implemented by NGIs and/or component sites. • Consistency of security procedures, user support, incident tracking, monitoring and accounting must be ensured. • EGI ensures interoperability of operational tools/infrastructures for security, monitoring, support, accounting, etc. • In order to aggregate usage information for VOs, users and NGIs, operational data collected by the NGIs, such as • monitoring information, • availability statistics, • accounting records, • need to be aggregated at the EGI level for SLA monitoring, in full respect of the relevant national legal constraints.
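As a rough illustration of aggregating NGI-collected accounting records at the EGI level, the following Python sketch sums usage per VO and per NGI. The record layout, field names, VO and NGI names, and the `aggregate_cpu_hours` helper are all hypothetical; the slides do not describe a record format or any aggregation tool, and legal constraints on sharing such data are not modelled here.

```python
from collections import defaultdict

# Hypothetical accounting records, as each NGI might publish them.
records = [
    {"ngi": "NGI-X", "vo": "vo-alpha", "cpu_hours": 120.0},
    {"ngi": "NGI-X", "vo": "vo-beta", "cpu_hours": 30.0},
    {"ngi": "NGI-Y", "vo": "vo-alpha", "cpu_hours": 80.0},
]


def aggregate_cpu_hours(records, key):
    """Sum cpu_hours over all records, grouped by the given field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cpu_hours"]
    return dict(totals)


per_vo = aggregate_cpu_hours(records, "vo")
per_ngi = aggregate_cpu_hours(records, "ngi")

print(per_vo)   # {'vo-alpha': 200.0, 'vo-beta': 30.0}
print(per_ngi)  # {'NGI-X': 150.0, 'NGI-Y': 80.0}
```

The same grouping could be applied to monitoring or availability data, provided the NGIs publish them in a common format, which is the interoperability requirement stated above.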
Core Operations Tasks • Regional Operations coordination; • Coordination and support for roll-out of middleware updates; • Grid security and incident response coordination; • Interoperations (OSG, EU related projects); • Weekly operations meetings and operations workshops; • Support from resident middleware service experts; • Middleware release support; • VO Membership Service; • Service Availability Monitoring; • User support coordination and the Global Grid User Support (GGUS); • Certification authority for various VOs; • Monitoring; • Pre-production coordination; • Triage of incoming problems and assignment of tickets to second-line support units
Operations Resources • Resource estimation from draft document for EGI_DS deliverable 5.1
EGI Transition Proposal • The EGEE Grid is currently used for large-scale production by a number of scientific VOs. • It will be unacceptable to them to have a disruptive transition to a different operational grid in EGI. • The two options are: • Define the EGI model very quickly to allow a smooth transition during EGEE III • Assume that Day One EGI Operations follow the EGEE model and any subsequent change is evolutionary • Given the experience in previous Grid projects, it is presumably too late for the first, so we propose a working assumption of the second. Adapted from proposal by John Gordon, hence focus on EGEE. Requirement for smooth and timely transition equally valid for other production Grids!
How to achieve this? • The EGEE Operational model has three levels: EGEE-wide, Regional, and National • Don’t forget we already have national duties like CA management. • The migration to EGI will involve a migration of duties down towards NGIs • The migration from Central to Regional has started in EGEE III • Our proposal is that responsibility for the balance between Regional and National be left to the group of NGIs that make up each existing Region. • They have the joint duty to continue the existing EGEE service in their region. • They have the freedom to deliver this any way they choose • At one extreme they may decide to continue with the existing ROC and organise its funding internally. • At the other they may decide to devolve everything to each NGI • More likely is some combination of the two, with some migration from the former to the latter over time. • Leave this to the regions. They can then progress independently as suits regional and national needs and priorities. EGI defines and monitors the operational service definition to ensure a seamless grid for the users. Adapted from proposal by John Gordon, hence focus on EGEE. Requirement for smooth and timely transition equally valid for other production Grids!
Key Issues • Non-disruptive & timely transition from current Operations scenarios to EGI+NGIs • Ensuring “value-for-money”: • Applications Communities; • NGIs; • Funding agencies; • must all be convinced that any money involved is spent not only well but also optimally!
Summary & Conclusions • Over many years we have built up production Grid infrastructures, much experience in their operations, as well as large and diverse application communities • Continuity is a key requirement for the future, as we move to a sustainable, long-term and multi-disciplinary e-infrastructure • Value for money is essential at all levels