LHC Computing Grid Deployment • Ian Bird, IT Division, CERN • LCG Deployment Area Manager • US LHC Software and Computing Review, January 14, 2003
Outline • LCG Phase I Deployment Goals • Deployment Organisation • LCG Deployment Plan • Middleware • Test beds and Services • Certification, Testing, Packaging, Distribution • Operations and Security • Support • Collaborative projects • Summary Ian.Bird@cern.ch
LCG-1 Deployment Goals • Production service for Data Challenges in 2H03 & 2004 • Initially focused on batch production work • Experience in close collaboration between the Regional Centres • Must have wide enough participation to understand the issues • Learn how to maintain and operate a global grid • Focus on a production-quality service • Robustness, fault-tolerance, predictability, and supportability take precedence; additional functionality gets prioritized • LCG should be integrated into the sites’ physics computing services – should not be something apart • This requires coordination between participating sites in: • Policies and collaborative agreements • Resource planning and scheduling • Operations and Support
Elements of an LCG Service • Middleware: • Testing and certification • Packaging, configuration, distribution and site validation • Support – problem determination and resolution; feedback to middleware developers • Operations: • Grid infrastructure services • Site fabrics run as production services • Operations centres – trouble and performance monitoring, problem resolution – 24x7 globally • Support: • Experiment integration – ensure optimal use of system • User support – call centres/helpdesk – global coverage; documentation; training
Deployment Organisation • Grid Deployment Board (GDB) – chair Mirco Mazzucato • Representatives from the experiments and from each country with an active Regional Centre taking part in the LCG Grid Service • Forges the agreements, takes the decisions, defines the standards and policies that are needed to set up and manage the LCG Global Grid Services • Coordinates the planning of resources for physics and computing data challenges • First task is the detailed definition of LCG-1 • includes defining the set of grid middleware tools to be deployed • LCG Deployment Group - coordinated by Deployment Area Manager • Teams at CERN and Regional Centres • Implement the elements of the LCG service: • Certification, testing etc; Operations; Support; Scheduling • Guided by agreements negotiated by GDB • Reports to Project Execution Board; SC2
GDB Status • 1st meeting in Milan – Oct 4, 2002; 4 meetings since • Set up Technical Working groups: • WG1: Define LCG-1 functionality and services • WG2: Define the schedule for rolling out the infrastructure and resources. Propose process & metrics to be used for allocation, accounting, and reporting. • WG3: Define a straightforward security and authentication model to be used in LCG-1, and identify the technical issues. Set up agreements to enable implementation. • WG4: Define operations procedures & responsibilities. Make agreements to ensure coordination of these activities. Define the requirements for a Grid Operations Centre to coordinate operational activities. • WG5: Propose a support model for LCG-1, including the scope of responsibilities for call centre/helpdesk (delayed until Feb 03). • Status: • LCG-1 on track to be defined by end Jan 03 – certainly in the essentials needed to progress – functionality and services, participating sites, resources and schedules, initial operational and security models • Working group final reports due around end Jan 03 – should indicate where LCG should focus effort, process for agreements, etc.
LCG Deployment Plan: Level 1 Milestones
LCG Phase I Timescale in a nutshell • LCG-1 must be defined – end Jan 03 • 2 major areas to be addressed by GDB working groups • Define LCG-1 in terms of required functionality and services • Deployment schedule • Set up distributed organisational structure • Resources and scheduling • Policies – security, authentication, etc. • Operational agreements and responsibilities • Support services • LCG-1 service must be in place – July 2003 • 6 months testing, integration, certification, packaging and deployment • Need to demonstrate performance – end 2003 • This should include adding current production services into LCG • Provide production service for data challenges in 2004 • LCG-3 follows LCG-1 by 1 year – provides “50% complexity” service in 2005.
Activities to Achieve: Initial Availability of First Global Service • Define LCG-1 in terms of • functionality, resources, operations, security, support • Deploy a series of evolving pilot services for testing, with increasing resources • First pilot service – Feb 1 2003 • Incremental deployment to Tier 1 and Tier 2 centres: • ~10 in 3 continents by June • Testing, certification, packaging and release of software • Certification, testing, release process defined – January 2003. • Packaging/configuration mechanism defined – March 2003. • Delivery of middleware software packages – March 1, 2003 • Iterative, incremental release cycle, with major functional releases: • V1.0 – June 1, 2003
Activities – cont. • Set up infrastructure and operational procedures • Certificate Authorities and VO management systems in place – May 2003 • Based on existing EU and US inter-operating systems • Resource accounting and reporting procedures set up – May 2003 • Security procedures defined, agreed and in place – June 2003 • Incident response and security management • Set up operations centre and help desk (call centre) • Identify operations and call centre locations – February 1, 2003 • In place by June 2003 • LCG-1 commissioning and acceptance • 30 day commissioning period with user productions and stress tests, including 7 day acceptance period
Activities to Achieve: LCG-1 Fully Operational • Define LCG-1 performance goals – July 2003 • In concert with experiments and their data challenge requirements, set performance goals in terms of capacity, throughput, reliability, etc. A GDB working group. • 10 Regional Centres participating – October 2003 • WG2 defines the implementation schedule – may be adjusted in July. • LXBATCH service merged into LCG-1 – October 2003 • All resources of LXBATCH will be grid-enabled and accessible as part of the LCG-1 service. This is a CERN activity but hopefully reflects what happens at other sites too. • Milestone release of middleware – October 2003 • V1.1 release with improved functionality – October 2003 • Review of service – November 2003 • The LCG-1 service level should be that required for the 2004 data challenges. Whether the target has been achieved will be determined and accepted in a review of the service by representatives from the experiments, the regional centres and LCG.
Middleware • Deployed middleware to be based on US and EU toolkits: • VDT • Globus, Condor, GLUE schema, EDG CA and VO tools, etc. • EDG • Resource broker • Reptor - WP2 (Data Management) – using RLS • WP4 will be used to manage CERN fabric (available for others) • VOMS • Monitoring tools • Initially based on work done by WorldGrid (iVDGL + DataTag) • Specifics (functionality, version, delivery) being firmed up now by GDB WG1 – final by beginning February. • This will provide the initial basic functionality and will evolve significantly • LCG will focus on building a robust service – changes in basic functionality driven by experiments • Deployment of these components involves not only obtaining the software but also agreeing the essential support and maintenance.
Testbeds and Services • The deployed systems will be in several versions and functions: • Certification testbeds – both local and distributed • Integration of middleware components, controlled changes, in-depth application testing • Prepare for release • Production service – deployed at Regional Centres • Development service – deployed at Regional Centres • Certification testbeds parallel Production and Development services – i.e. need to debug and stabilise production release in parallel with development • These are 2 prongs of the “Gordon trident” • The 3rd prong is the grid projects’ development systems • [Diagram labels: iVDGL, DataGrid, LCG; production grid, developers’ testbed, development testbed]
Timelines – LCG Phase 1 • [Timeline figure; time axis from Jan 03 to July 05 in half-year steps] • Milestones shown: LCG-1 Defined; LCG-1 Initial Service Available; LCG-1 Full Service Available; LCG-1 Fulfils Performance Goals; LCG-3 Fulfils Performance Goals; Computing TDR • Annotations: Incremental middleware releases; Incrementally add regional centres • Other labels: LCG-1 Testbed, LCG Services, Pilot-1, Pilot-2, LCG Certification & Test, LCG-3 • Data Challenges: ATLAS, LHCb, ALICE, CMS (labels: 5% DC04, 10% DC05, DC-2, 5%, 10%)
Certification and Testing • Will be an ongoing major activity of LCG • Part of what will make LCG a production-level service • Goals: • Certify/validate that middleware behaves as advertised and provides the required functionality (HEPCAL) • Stabilise and robustify middleware • Provide debugging, problem resolution and feedback to developers • Testing activities at all levels • Component/unit tests • Basic functional tests, including tests of distributed (grid) services • Application level tests – based on HEPCAL use-cases • Driven/implemented by the experiments – GAG set up by SC2 • Experiment beta-testing before release • Site configuration verification • JTB collaborative project - LCG, Trillium, EDG • Gather existing tests • Write/obtain missing tests
Certification & Testing: Testbeds • CERN testbed • Several “clusters” forming a local grid • Basic tests, basic grid functionality • Distributed testbed • CERN testbed + testbeds at a few other remote sites • Grid functionality • Application benchmarks and experiment beta testing • Needs several versions • Current production version – for reproducing and fixing problems • Development version • + OS versions …
Certification & Testing: Release Strategy • Small release cycles with incremental functionality, rather than major releases where many things change • Somewhat depends on technology suppliers and their responsiveness to LCG needs, since LCG is not in control of development • There will, however, be milestone functional releases in June and October 2003. • Continuous, evolutionary process • Each release goes through certification/test cycle • Only way to keep control of bugs • Goal is stability and robustness …
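The rule that each release goes through a certification/test cycle can be sketched as a simple gate. This is a minimal illustration only, not the LCG process itself: the level names in `TEST_LEVELS`, the `Candidate` class and the `certify` function are all hypothetical, loosely named after the testing activities listed in this deck.

```python
from dataclasses import dataclass, field

# Hypothetical test levels (assumption): unit, functional, application-level
# and experiment beta testing, in the order a candidate would pass them.
TEST_LEVELS = ("unit", "functional", "application", "beta")


@dataclass
class Candidate:
    """A candidate release and the outcome of each test level run so far."""
    version: str
    results: dict = field(default_factory=dict)  # level name -> True/False


def certify(candidate):
    """Return (certified, failed_levels).

    A level that has not been run counts as failed, so a release cannot be
    certified by skipping part of the cycle.
    """
    failed = [lvl for lvl in TEST_LEVELS if not candidate.results.get(lvl, False)]
    return (not failed, failed)


if __name__ == "__main__":
    full = Candidate("1.0", {lvl: True for lvl in TEST_LEVELS})
    partial = Candidate("1.0.1", {"unit": True})
    print(certify(full))
    print(certify(partial))
```

Treating a missing result as a failure is a deliberate choice in this sketch: with small, frequent releases it is easy to skip a level, and the gate should make that visible rather than let it pass silently.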
Packaging and distribution • Obviously a major issue for a deployment project • Joint activity started • Discussions between LCG, EDG, VDT, EDT, iVDGL, etc. • Have produced a draft discussion document • Will soon lead to a JTB joint project • Want to provide a tool that satisfies the needs of the participating sites • Interoperates with existing tools where appropriate and necessary • Does not force a solution on sites with established infrastructure • Provides a solution for sites with nothing • Configuration is an essential component • Essential to understand and validate correct site configuration • Effort will be devoted to providing configuration tools • Verification of correct configuration will be required before sites join LCG
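Verification of site configuration before a site joins could look something like the check below. It is a sketch under stated assumptions: the setting names in `REQUIRED_KEYS`, the `verify_site_config` function and the version strings are invented for illustration and do not describe any actual LCG, EDG or VDT configuration tool.

```python
# Hypothetical settings a site might have to declare (assumption, not an LCG schema).
REQUIRED_KEYS = (
    "site_name",
    "middleware_version",
    "information_service_url",
    "trusted_cas",
)


def verify_site_config(config, certified_versions):
    """Return a list of human-readable problems; an empty list means the site passes."""
    problems = []
    for key in REQUIRED_KEYS:
        # Missing keys and empty values are reported the same way.
        if not config.get(key):
            problems.append(f"missing or empty setting: {key}")
    version = config.get("middleware_version")
    if version and version not in certified_versions:
        problems.append(f"middleware version {version} is not a certified release")
    return problems


if __name__ == "__main__":
    site = {
        "site_name": "example-site",
        "middleware_version": "0.9",
        "information_service_url": "ldap://example.invalid:2135",
        "trusted_cas": [],
    }
    for problem in verify_site_config(site, certified_versions={"1.0"}):
        print(problem)
```

Returning every problem at once, rather than stopping at the first, suits the stated aim of helping sites understand and validate their configuration.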
LCG Operations • Responsible for operating and maintaining the grid infrastructure and associated services • Gateways, information services, resource broker etc. – i.e. grid specific services • Will be coordinated between teams at CERN and at Regional Centres • Responsible also for the VO infrastructure, Authentication and Authorisation services • Security operations – incident response etc. • Build Grid Operations Centre(s) • Performance and problem monitoring; troubleshooting and coordination with site operations, user support, network operations etc. • Leverage existing experience/ideas (WorldGrid – iVDGL, EDT, etc.) • Started discussions with DataTag about future developments • Indian group to provide development effort • LCG site to lead this (FZK? – not certain yet) • Once there is an activity lead, collaborative activities will be expanded • Assemble monitoring, reporting, performance, etc. tools • Start with what exists, understand what is missing and needed, and build from there
Grid Operation • [Diagram] Actors and services: User, Local user support, Local operation, Local site, Call Centre, Grid Operations Centre, Grid information service, Grid operations, Grid logging & bookkeeping, Virtual Organisation, Network Operations Centre • Flows: queries, monitoring & alarms, corrective actions
Security • GOAL: Do not want to make exceptions for LCG services – they must run integrated into a site infrastructure, and be subject to all usual security and good management procedures and policies • BUT: Initially, certain to need exceptions and compromises since until now most grid middleware has sidestepped security issues • THUS: We must have a sound security policy and an agreed plan that provides for these exceptions in the short term, but shows a clear path to reach the state that the sites require
User Support • Essential for a production service • Two aspects • Experiment integration/consultancy • Work directly with the experiments’ computing projects to ensure efficient use of LCG services, and optimum use of resources • Act as liaison to ensure experiment specific issues are resolved • User support • Helpdesk/call centre operation • Globally distributed – 24x7, ensure single point of contact for user • Collaborative and distributed operation • Documentation • Training, tutorials, etc.
Collaborative Projects • LCG is not a middleware development project and can only succeed by leveraging the existing and ongoing work of the various grid development projects – and (hopefully) becoming a focus for them. • There are many opportunities for common solutions, which are being actively pursued • HICB – JTB • GLUE • Schema definitions & interoperability work • New collaborative activities: • Validation and Test Suites • Distribution, Meta-Packaging, Configuration • Grid Operations Centre • DataTag, iVDGL, DTF, etc • Storage interfaces; e.g. SRM • Authentication, authorisation and security • Security managers are beginning to collaborate in the context of LCG • HEPiX/LCCWS as collaborative vehicle for RC managers, site coordinators • E.g. certification process for operating environments; upgrade procedures; configuration management; helpdesk tools, etc. • GGF – production grids area, etc.
Deployment Summary • Deploy middleware to support essential functionality, but goal is to evolve and incrementally add functionality • Added value is to robustify, support and make into a 24x7 production service • How? • Certification & test procedure – tight feedback to developers • must develop support agreements with grid projects to ensure this • Define missing functionality – require from providers • Provide documentation and training • Provide missing operational services • Provide a 24x7 Operations and Call Centre • Guarantee to respond • Single point of contact for a user • Make software easy to install – facilitate new centres joining • Deployment is a major activity of LCG • Encompasses all operational and practical aspects of a grid • There is a lot of work already done that must be leveraged • Many opportunities for synergy and collaboration