GridPP: Running a Production Grid
Stephen Burke, CLRC/RAL
On behalf of the GridPP Deployment & Operations Team
UK e-Science All-hands, Nottingham, 21st September 2006
Overview
• EGEE, LCG and GridPP
• Middleware
• Deployment & Operations
• Conclusions
EGEE
• Major EU Grid project: 2004-08 (in two phases)
  • Successor to the European DataGrid (EDG) project, 2001-04
• 32 countries, 91 partners, €37 million + matching funding
• Associated with several Grid projects outside Europe
• Expected to be succeeded by a permanent European e-infrastructure
• Supports many areas of e-science, but currently High Energy Physics is the major user
  • Biomedical research is also a pioneer
• Currently ~3000 users in 200 Virtual Organisations
• Currently 195 sites, 28689 CPUs, 18.4 PB of storage
  • Values taken from the information system, so beware of GIGO!
EGEE/LCG Google map
(W)LCG
• The computing services for the LHC (Large Hadron Collider) at CERN in Geneva are provided by the LHC Computing Grid (LCG) project
• LHC starts running in ~1 year
• Four experiments, all very large
• ~5000 users at 500 sites worldwide, 15-year lifetime
• Expect ~15 PB/year of data, plus similar volumes of simulated data
• Processing requirement is ~100,000 CPUs
• Must transfer ~100 MB/s per site, sustained for 15 years!
• Running a series of Service Challenges to ramp up to full scale
• LCG uses the EGEE infrastructure, but also the Open Science Grid (OSG) in the US and other Grid infrastructures
  • Hence WLCG = Worldwide LCG
Organisation
• EGEE sites are organised by region
  • GridPP is part of UK/Ireland, together with NGS + Grid Ireland
• Each region has a Regional Operations Centre (ROC) to look after the sites in the region
• Overall operations co-ordination rotates weekly between ROCs
• LCG divides sites into Tier 1/2/3, plus CERN as Tier 0
  • A function of size and quality of service (QoS)
  • Tier 1 needs >97% availability, max 24 hour response
  • Tier 2: 95% availability, 72 hour response
  • Tier 3 sites are local facilities with no specific targets
• ROC ≈ Tier 1: RAL is both
GridPP
• Grid for UK Particle Physics
• Two phases: 2001-04 and 2004-07, with a proposal for phase 3 to 2011
• Part of EGEE and LCG; working towards interoperability with NGS
• 20 sites, 4354 CPUs, 298 TB of storage
• Currently supports 33 VOs, including some non-PP
  • But not many non-PP VOs from the UK – any volunteers?
• For LCG, sites are grouped into four "virtual" Tier 2s, plus RAL as Tier 1
  • Grouping is largely administrative; the Grid sites remain separate
• Runs the UK-Ireland ROC (with NGS)
• Grid Operations Centre (GOC) @ RAL (with NGS)
  • Grid-wide configuration, monitoring and accounting repository/portal
  • Operations and User Support shifts (working hours only)
GridPP sites
Site services
• Basis is Globus (still GT2, GT4 soon) and Condor, as packaged in the Virtual Data Toolkit (VDT), also used by NGS
• EGEE/LCG/EDG middleware distribution is now under the gLite brand name
• Computing Element (CE): Globus gatekeeper + batch system + batch workers
  • In transition from Globus to Condor-C
• Storage Element (SE): Storage Resource Manager (SRM) + GridFTP + other data transports + storage system (disk-only or disk+tape)
  • Three SRM implementations in GridPP
• Berkeley Database Information Index (BDII): LDAP server publishing CE + SE + site + service information according to the GLUE schema (queried as sketched below)
• Relational Grid Monitoring Architecture (R-GMA) server: publishes GLUE schema, monitoring, accounting and user information
• VOBOX: container for VO-specific services (aka "edge services")
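To illustrate how the information a BDII publishes is typically consumed, here is a minimal sketch that lists the Computing Elements advertised under the GLUE schema. It assumes the python-ldap package and a reachable BDII; the hostname is a placeholder, and in practice the same query is often done with the ldapsearch command-line tool.

```python
# Minimal sketch: list Computing Elements published by a BDII via LDAP.
# Assumes the python-ldap package; the hostname below is a placeholder.
import ldap

BDII_URL = "ldap://bdii.example.org:2170"  # hypothetical top-level BDII
BASE_DN = "o=grid"                          # GLUE information is rooted here

conn = ldap.initialize(BDII_URL)
conn.simple_bind_s()  # BDIIs allow anonymous reads

# Ask for all GlueCE objects and a few attributes of interest
results = conn.search_s(
    BASE_DN,
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs"],
)

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    free = attrs.get("GlueCEStateFreeCPUs", [b"0"])[0].decode()
    waiting = attrs.get("GlueCEStateWaitingJobs", [b"0"])[0].decode()
    print(f"{ce}: {free} free CPUs, {waiting} waiting jobs")

conn.unbind_s()
```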
Core services
• Workload Management System (WMS), aka Resource Broker: accepts jobs, dispatches them to sites and manages their lifecycle (see the job description sketch below)
• Logging & Bookkeeping: primarily logs lifecycle events for jobs
• MyProxy: stores long-lived credentials
• LCG File Catalogue (LFC): maps logical file names to local names on SEs
• File Transfer Service (FTS): provides managed, reliable file transfers
• BDII: aggregates information from site BDIIs
• R-GMA schema/registry: stores table definitions and lists of producers/consumers
• VO Membership Service (VOMS) server: stores VO group/role assignments
• User Interface (UI): provides user client tools for the Grid services
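To make the job flow through these services concrete, the sketch below writes a minimal JDL (Job Description Language) file of the kind a user submits to the WMS from a UI. The executable, sandbox contents and requirement values are illustrative only; the actual submission is done with the client tools installed on the UI.

```python
# Minimal sketch of a JDL job description as submitted to the WMS from a UI.
# All values are illustrative; submission itself is done with the gLite/LCG
# client tools installed on the UI, not shown here.
jdl = """
Executable    = "analysis.sh";
Arguments     = "run42.cfg";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"analysis.sh", "run42.cfg"};
OutputSandbox = {"stdout.log", "stderr.log", "results.root"};
Requirements  = other.GlueCEPolicyMaxCPUTime > 720;
Rank          = other.GlueCEStateFreeCPUs;
"""

with open("analysis.jdl", "w") as f:
    f.write(jdl)

# The WMS matches Requirements/Rank against the GLUE attributes each CE
# publishes in the information system, then dispatches the job and tracks
# its lifecycle via the Logging & Bookkeeping service.
```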
Grid services
• Some extra services are needed to allow the Grid to be operated effectively
  • Mostly unique instances, not part of the gLite distribution
• Grid Operations Centre DataBase (GOCDB): stores information about each site, including contact details, status and a node list
  • Queried by other tools to generate configuration, monitoring etc.
• Accounting (APEL): publishes information about CPU and storage use
• Various monitoring tools, including:
  • gstat (Grid status): collects data from the information system and does sanity checks
  • Site Availability Monitoring (SAM): runs regular test jobs at every site, raises alerts and measures availability over time (sketched below)
  • GridView: collects and displays information about file transfers
  • Real Time Monitor: displays job movements and records statistics
• Freedom of Choice for Resources (FCR): allows the view of resources in a BDII to be filtered according to VO-specific criteria, e.g. SAM test failures
• Operations portal: aggregates monitoring and operational information, broadcast email tool, news, VO information, …
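The availability figure that SAM-style monitoring feeds into can be pictured with a small sketch: given timestamped pass/fail results for a site's critical tests, compute the fraction of sampled periods for which the site counted as available. The data layout and the simple "all critical tests pass" rule are invented for the example; the real SAM database and aggregation rules are more involved.

```python
# Illustrative sketch: derive a site availability figure from regular
# pass/fail test results, in the spirit of SAM. Data format and site name
# are assumptions for the example.
from collections import defaultdict

# (period, site, test, passed) - one record per test run
results = [
    (0, "SITE-A", "job-submit", True),
    (0, "SITE-A", "replica-mgmt", True),
    (1, "SITE-A", "job-submit", False),   # one failed sample
    (1, "SITE-A", "replica-mgmt", True),
    (2, "SITE-A", "job-submit", True),
    (2, "SITE-A", "replica-mgmt", True),
]

def availability(records, site, critical=("job-submit", "replica-mgmt")):
    """Fraction of sampled periods in which every critical test passed."""
    by_period = defaultdict(dict)
    for period, s, test, passed in records:
        if s == site and test in critical:
            by_period[period][test] = passed
    ok = sum(1 for tests in by_period.values()
             if all(tests.get(t, False) for t in critical))
    return ok / len(by_period) if by_period else 0.0

print(f"SITE-A availability: {availability(results, 'SITE-A'):.0%}")
# -> 67% for this toy sample (one of three periods failed a critical test)
```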
SAM monitoring
GridView
Middleware issues
• We need to operate a large production system with 24*7*365 availability
• Middleware development is usually done on small, controlled test systems, but the production system is much larger in many dimensions, more heterogeneous and not under any central control
• Much of the middleware is still immature, with a significant number of bugs, and developing rapidly
  • Documentation is sometimes lacking or out of date
• There are therefore a number of issues which must be managed by deployment and operational procedures, for example:
  • The rapid rate of change and occasional lack of backward compatibility require careful management of code deployment
  • Porting to new hardware, operating systems etc. can be time-consuming
  • Components are often developed in isolation, so integration of new components can take time
  • Configuration can be very complex, and only a small subset of possible configurations produces a working system
  • Fault tolerance, error reporting and logging are in need of improvement
  • Remote management and diagnostic tools are generally undeveloped
Configuration
• We have tried many installation & configuration tools over the years
• Configuration is complex, but system managers don't like complex tools!
• Most configuration flexibility needs to be "frozen" (illustrated in the sketch below)
  • Admins don't understand all the options anyway
  • Many configuration changes will break something
  • The more an admin has to type, the more chances for a mistake
• Current method preferred by most sites is YAIM (Yet Another Installation Method):
  • bash scripts
  • simple configuration of key parameters only
  • doesn't always have enough flexibility, but good enough for most cases
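The "freeze the flexibility" idea can be sketched as follows: the site admin supplies only a handful of key parameters, and everything else is derived from fixed defaults, so there are fewer opportunities for a typo to produce a plausible-looking but broken configuration. This is only an illustration of the approach, not YAIM itself, and the parameter names and derived values are assumptions for the example.

```python
# Illustration of the YAIM philosophy (not YAIM itself): a site supplies only
# a few key parameters and all other settings are derived from fixed policy.
# Parameter names and derived values are invented for the example.

SITE_PARAMS = {
    "SITE_NAME": "UKI-EXAMPLE",
    "CE_HOST":   "ce.example.ac.uk",
    "SE_HOST":   "se.example.ac.uk",
    "VOS":       ["dteam", "atlas", "lhcb"],
}

def derive_config(params):
    """Expand the key parameters into the full (frozen) configuration."""
    cfg = dict(params)
    # Everything below is fixed policy, not something admins are asked to edit.
    cfg["BDII_HOST"] = params["CE_HOST"]       # co-host the site BDII
    cfg["GRIDFTP_PORT"] = 2811                 # standard GridFTP port
    cfg["QUEUES"] = {vo: f"grid-{vo}" for vo in params["VOS"]}
    return cfg

if __name__ == "__main__":
    for key, value in derive_config(SITE_PARAMS).items():
        print(f"{key} = {value}")
```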
Release management
• There is a constant tension between the desire to upgrade to get new features and the desire to have a stable system
  • Need to be realistic about how long it takes to get new things into production
• We have so far had a few "big bang" releases per year, but these have some disadvantages
  • Anything which misses a release has to wait a long time, hence there is pressure to include untested code
  • Releases can be held up by problems in any area, hence are usually late
  • They involve a lot of work for system managers, so it may be several months before all sites upgrade
• We are now moving to incremental releases, updating each component as it completes integration and testing
  • Have to avoid dependencies between component upgrades
  • Releases go first to a 10%-scale pre-production Grid
  • Updates every couple of weeks
  • The system becomes more heterogeneous
  • Still some big bangs, e.g. a new OS
• Seems OK so far; time will tell!
VO support
• If sites are going to support a large number of VOs, the configuration has to be done in a standard way
  • Largely true, but not perfect: adding a VO needs changes in several areas
  • Configuration parameters for VOs should be available on the operations portal, although many VOs still need to add their data
• It needs to be possible to install VO-specific software, and maybe services, in a standard way
  • Software is ~OK: NFS-shared area, writeable by specific VO members, with publication in the information system
  • Services are still under discussion: concerns about security and support
• VOs often expect to have dedicated contacts at sites (and vice versa)
  • May be necessary in some cases but does not scale
  • Operations portal stores contacts, but site -> VO may not reach the right people; need contacts by area
• Not too bad, but still needs some work to find a good modus vivendi
Availability
• LCG requires high availability, but the intrinsic failure rate is high
• Most of the middleware does not deal gracefully with failures
  • Some failure modes can lead to "black holes"
• Must fix or mask failures via operational tools so users don't see them
• Several monitoring tools have been developed, including test jobs run regularly at sites
• On-duty operators look for problems and submit tickets to sites
  • Currently ~50 tickets per week (cf. 200 sites)
• The FCR tool allows sites failing specified tests to be made "invisible" (sketched below)
• New sites must be certified before they become visible
  • Persistently failing sites can be decertified
  • Sites can be removed temporarily for scheduled downtime
• Performance is monitored over time
• The situation has improved a lot, but we still have some way to go
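The effect of FCR can be pictured with a small sketch: given the sites in the information system and the latest results of the tests a VO has marked as critical, drop the failing sites from the view the VO's jobs are matched against. The site names, data structures and test names are invented; the real mechanism filters the resource view in the BDII, as described above.

```python
# Illustrative sketch of FCR-style filtering: remove sites that fail any of a
# VO's critical SAM tests from the resource view used for job matching.
# Site names, test names and data structures are invented for the example.

all_sites = ["SITE-A", "SITE-B", "SITE-C"]

# Latest SAM results per site: test name -> passed?
sam_results = {
    "SITE-A": {"job-submit": True,  "replica-mgmt": True},
    "SITE-B": {"job-submit": True,  "replica-mgmt": False},
    "SITE-C": {"job-submit": True,  "replica-mgmt": True},
}

def filtered_view(sites, results, critical_tests):
    """Sites passing every test the VO has marked as critical."""
    return [s for s in sites
            if all(results.get(s, {}).get(t, False) for t in critical_tests)]

# A VO that marks both tests as critical does not see the failing site:
print(filtered_view(all_sites, sam_results, ["job-submit", "replica-mgmt"]))
# -> ['SITE-A', 'SITE-C']
```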
Lessons learnt
• "Good enough" is not good enough
  • Grids are good at magnifying problems, so must try to fix everything
• Exceptions are the norm
  • 15,000 nodes * MTBF of 5 years = 8 failures a day
  • Also 15,000 ways to be misconfigured!
  • Something somewhere will always be broken
  • But middleware developers tend to assume that everything will work
• It needs a lot of manpower to keep a big system going
  • Bad error reporting can cost a lot of time, and reduce people's confidence
• Very few people understand how the whole system works, or even a large subset of it
  • Easy to do things which look reasonable but have a bad side-effect
• Communication between sites and users is an n*m problem
  • Need to collapse it to n+m
Summary
• LHC turns on in 1 year – we must focus on delivering a high QoS
• Grid middleware is still immature, developing rapidly and in many cases a fair way from production quality
• Experience is that new middleware developments take ~2 years to reach the production system, so LHC will start with what we have now
• The underlying failure rate is high – this will always be true with so many components, so middleware and operational procedures must allow for it
• We need procedures which can manage the underlying problems, and present users with a system which appears to work smoothly at all times
• Considerable progress has been made, but there is more to do
• GridPP is running a major part of the EGEE/LCG Grid, which is now a very large system operated as a high-quality service, 24*7*365
• We are living in interesting times!