Deployment Summary GridPP11 15th September 2004 Jeremy Coles J.Coles@rl.ac.uk
Overview • Where are we now? • What is deployment all about anyway? • Who is doing it? • Planning and metrics • Issue 1: Communications • Issue 2: Fabric management
Where are we now? Who is flying the plane? Are the developers bailing out? We have paying passengers – do we know where we are going? … oh, and can we keep it working, navigate, land and offer a real service?
Who is flying the plane? … introducing the err…. DTEAM + site system administrators + …
Deployment Board
• Replaces GridPP1 Technical Board
• Mandate
  - Determine and oversee execution of tech plan
  - Report to PMB
  - Ensure GridPP-wide issues discussed/solved
  - Provide forum for tech info exchange
  - Oversee deployment and use of GridPP h/w
  - Tier1 – Tier2 coordination/liaison
  - Ensure integration of external tech developments
DB members (~18 people)
• Production Manager
• Tier1/A Manager
• 4 T2 Technical Coordinators
• HEP SYSMAN chair
• CERN T0/Deployment
• Applications Area Coordinator
• Middleware Area Coordinator
• Technical experts (invited by DB chair)
• UK NGS
• EGEE/Ireland
• DB chair
DB relations (diagram; node labels: LCG/EGEE/CERN T0, PMB, UK NGS, DB, UB, GridPP DTEAM, M/S/N, APPS, T1AB, T2B)
What must deployment address?
• Core infrastructure services
  - Resource brokers
  - Information services
  - Data management services
  - Virtual Organisation management
  - Replica Location Service
  - BDII
• Grid monitoring
  - Monitor operational performance
  - Monitor operational state
  - Problem resolution + operations support tools
• Middleware deployment
  - Required local validation of common middleware
  - Feedback issues to LCG/EGEE
  - Continuous upgrade
  - Mechanism(s)
• Resource induction
  - New site joining procedures
  - Provide support for middleware installation
  - Advise on operational procedures
• User support
  - Provide a support service for users (filter and distribute)
  - Monitor effectiveness of support
  - Provide training and induction courses
  - Documentation (and quality)
• Resource support
  - Respond to and coordinate resolution of fabric problems
  - Engage wider community to resolve new problems
Areas (2)
• Communication
  - Representation within experiments
  - Procedures and mechanisms within community
• Applications
  - Ensuring local VOs receive support and guidance
  - Participate in testing and validation exercises
• Network services
  - Network performance monitoring
  - Demand (aggregate traffic) vs supply (performance)
  - Resource allocation/reservation
• Components
  - Workload management
  - Data management
  - Storage management
  - Information services
• Inter-grid collaboration
  - Participate in discussions to work closer with other Grids
  - Ensure interoperability of infrastructure and services
• Service-level agreements
  - Monitor Tier-2 compliance with MoUs
  - Access policies
• Security
  - Certification authority
  - Implement and monitor policy
  - Incident response
  - Policy management
• Operations planning
  - Understand usage patterns
  - Capacity planning
  - Monitoring problems log
Navigation
• No clear plans within LCG for overall deployment – improving
• Some confusion about EGEE connections
• GridPP2 project plan is not complete and we have dependencies
• Currently developing in a “best guess” environment
• It is not always clear exactly where decisions get made
• What does the planning environment look like so far?
• There are already pressing issues to be addressed:
  - What is the UK stance regarding fabric management tools (LCFGng is being phased out)?
  - How are we going to measure deployment and operations success – metrics?
  - What is the communications plan, given that LCG-ROLLOUT has become a gossip column – support, news, problem reporting?
Are we communicating…?
Areas
• Grid news – no well-defined broadcast route, e.g. for middleware updates
• Site news – operational incidents on the Grid, site updates
• Support – user, deployment
• Problems – as found by daily tests or discovered by users
Issues
• LCG-ROLLOUT is overloaded!
• Lack of visibility about what is happening at sites – upgrades, site problems
• Problems may generate many queries
• No tracking for support or logging of queries
• … and therefore poor ability to search for other experiences
Options
• Set up a new news area based on RSS (new entries are placed in categories that people can register to receive updates from) – or just use the GOC pages?
• Establish a support desk for GridPP – but there are concerns about expertise
• DTEAM area & better documentation
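The RSS option above (entries placed in categories that people register for) could work roughly as in the sketch below. This is an illustration only, not an existing GridPP or GOC tool: the class names, category strings and reader addresses are all hypothetical, and a real service would publish RSS feeds rather than keep in-memory inboxes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NewsItem:
    category: str  # hypothetical category, e.g. "grid-news" or "site-news"
    title: str
    body: str


class NewsBoard:
    """Category-based news: readers only see categories they registered for."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # category -> set of readers
        self._inboxes = defaultdict(list)     # reader -> list of NewsItem

    def subscribe(self, reader, category):
        self._subscribers[category].add(reader)

    def publish(self, item):
        # Deliver only to readers registered for this item's category.
        recipients = sorted(self._subscribers.get(item.category, ()))
        for reader in recipients:
            self._inboxes[reader].append(item)
        return recipients

    def inbox(self, reader):
        return list(self._inboxes.get(reader, []))


board = NewsBoard()
board.subscribe("sysadmin@site-a", "grid-news")  # hypothetical reader
board.subscribe("user@experiment", "site-news")  # hypothetical reader
delivered_to = board.publish(
    NewsItem("grid-news", "Middleware update", "New release announced.")
)
print(delivered_to)
```

Under this arrangement a middleware announcement reaches only the readers who asked for that category, which is the filtering a single overloaded mailing list cannot provide.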
An example [LCG-Problems] mail list has 2 members!
Are we going up or down? Metrics Work in progress!
Metrics (2) Work in progress!
Maintenance
• Migration to SL3 is starting.
• Next public release of LCG supports SL3 WNs, certification complete.
• Service nodes remain at RH7.3 for now.
• LCFGng is not an option for SL3 nodes.
• LCG supports one install method for SL3.
  - Manual install technique (actually not very manual)
  - Can be built into any framework already in use
  - Kickstart and scripts, Cfengine, NPACI Rocks, Quattor, stateless Linux or even LCFG
• This release expected this month.
Quattor
• Community effort for Quattor installation of LCG2 nearing completion. 98% done.
• Quattor has similar architecture and concept to LCFG. LCFG effort not wasted.
• Advantages
  - CERN and the RAL Tier1/A will use Quattor for LCG, so support and self-help will be available for others.
  - LCG M/W will not be tied to or released with Quattor.
• Disadvantages
  - A lot to learn before any payback.
Steve’s 5 questions
• Should the UK use or at least favour one fabric management solution?
  - Yes – probably Quattor
• Once the SL3 port is available, is RH 7.3 still wanted anywhere?
  - Maybe on very few shared sites
• Is an OS other than SL3 needed for GridPP sites and users?
  - Need to ask experiments – perhaps if CERN upgrades soon
• Does any site have a conflict with proposed deployment of LCG into SL3?
  - No – most want to move off RH 7.3
• Is there a site to work with RAL learning Quattor?
  - Manchester?
Summary • LCG2 deployed. 1500+ CPUs • Smooth running. Easy and seamless deployments. Service quality • The DTEAM! • The plans (& metrics) are being developed – many dependencies • LCG-ROLLOUT needs to migrate to news & helpdesk services • LCFG will be phased out. Quattor on SLC3 is coming.