
GridPP: Running a Production Grid

This presentation gives an overview of GridPP and its role within EGEE and LCG, the middleware it deploys, how the grid is operated in production, and the areas where GridPP has taken a grid-wide lead. It also covers the current status and future plans of GridPP.



Presentation Transcript


  1. GridPP: Running a Production Grid Stephen Burke, John Gordon CCLRC-RAL HEPiX, Fall 2006, JLab

  2. Overview • EGEE, LCG and GridPP • Middleware • Deployment & Operations • GridPP Specialities • Conclusions JLab HEPiX Fall 2006

  3. EGEE, LCG and GridPP

  4. EGEE • Major EU Grid project: 2004-08 (in two phases) • Successor to the European DataGrid (EDG) project, 2001-04 • 32 countries, 91 partners, €37 million + matching funding • Associated with several Grid projects outside Europe • Expected to be succeeded by a permanent European e-infrastructure • Supports many areas of e-science, but currently High Energy Physics is the major user • Biomedical research is also a pioneer application area • Currently ~3000 users in 200 Virtual Organisations • Currently 195 sites, 28689 CPUs, 18.4 PB of storage • Values taken from the information system – beware of GIGO! JLab HEPiX Fall 2006

  5. EGEE/LCG Google map JLab HEPiX Fall 2006

  6. (W)LCG • The computing services for the LHC (Large Hadron Collider) at CERN in Geneva are provided by the LHC Computing Grid (LCG) project • LHC starts running in ~1 year • Four experiments, all very large • ~5000 users at 500 sites worldwide, 15 year lifetime • Expect ~15 PB/year, plus similar volumes of simulated data • Processing requirement is ~100,000 CPUs • Must transfer ~100 Mbyte/sec/site – sustained for 15 years! • Running a series of Service Challenges to ramp up to full scale • LCG uses the EGEE infrastructure, but also the Open Science Grid (OSG) in the US and other Grid infrastructures • Hence WLCG = Worldwide LCG JLab HEPiX Fall 2006
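
A quick back-of-envelope check on the data-rate figures above (the count of ten Tier-1 centres and the factor for derived/simulated copies are assumptions for illustration, not numbers from the slide):

```python
# Rough arithmetic behind the LCG data-rate targets (illustrative only).
SECONDS_PER_YEAR = 365 * 24 * 3600

raw_pb_per_year = 15                                          # ~15 PB/year from the LHC
raw_rate_mb_s = raw_pb_per_year * 1e9 / SECONDS_PER_YEAR      # 1 PB = 1e9 MB
print(f"Average export rate: ~{raw_rate_mb_s:.0f} MB/s over a year")

# Assume ~10 Tier-1 centres sharing the data, with roughly a factor 2 for
# simulated and reprocessed copies: per-site rates of order 100 MB/s follow.
tier1_sites = 10
per_site_mb_s = raw_rate_mb_s * 2 / tier1_sites
print(f"Per Tier-1 site: very roughly {per_site_mb_s:.0f} MB/s sustained")
```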

  7. Organisation • EGEE sites are organised by region • GridPP is part of UK/Ireland • Also NGS + Grid Ireland • Each region has a Regional Operation Centre (ROC) to look after the sites in the region • Overall operations co-ordination rotates weekly between ROCs • LCG divides sites into Tier 1/2/3 • + CERN as Tier 0 • Function of size and QOS • Tier 1 needs >97% availability, max 24 hour response • Tier 2 95%/72 hours • Tier 3 are local facilities, no specific targets • ROC ≈ Tier 1: RAL is both, like CNAF, IN2P3, FZK JLab HEPiX Fall 2006
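
For concreteness, the Tier 1 and Tier 2 availability targets translate into yearly downtime budgets as follows (simple arithmetic, not from the slide):

```python
# Convert availability targets into a yearly downtime budget.
HOURS_PER_YEAR = 365 * 24

for tier, availability in (("Tier 1", 0.97), ("Tier 2", 0.95)):
    downtime = (1 - availability) * HOURS_PER_YEAR
    print(f"{tier}: {availability:.0%} availability allows "
          f"~{downtime:.0f} hours (~{downtime / 24:.1f} days) of downtime per year")
```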

  8. GridPP • Grid for UK Particle Physics • Two phases, 2001-04 and 2004-07, £36M • Proposal for phase 3 to 2011 • Part of EGEE and LCG • Working towards interoperability with the UK National Grid Service (NGS) • Exploitation of the LHC is the main UK driver • 20 sites, 4354 CPUs, 298 TB of storage • Currently supports 33 VOs, including some non-PP • But not many non-PP from the UK • For LCG, sites are grouped into four “virtual” Tier 2s • Plus RAL as the Tier 1 • Grouping is largely administrative; the Grid sites remain separate as the middleware doesn’t easily support federations or distributed sites • Runs the UK-Ireland ROC (with NGS) • Grid Operations Centre (GOC) @ RAL (with NGS) • Grid-wide configuration, monitoring and accounting repository/portal • Operations and User Support shifts (working hours only) JLab HEPiX Fall 2006

  9. GridPP sites JLab HEPiX Fall 2006

  10. Middleware

  11. Site services • Basis is Globus (still GT2, GT4 soon) and Condor, as packaged in the Virtual Data Toolkit (VDT) – also used by NGS • EGEE/LCG/EDG middleware distribution now under the gLite brand name • Computing Element (CE): Globus gatekeeper + batch system + batch workers • In transition from Globus to Condor-C • Storage Element (SE): Storage Resource Manager (SRM) + GridFTP + other data transports + storage system (disk-only or disk+tape) • Three SRM implementations in GridPP • Berkeley Database Information Index (BDII): LDAP server publishing CE + SE + site + service information according to the GLUE schema • Relational Grid Monitoring Architecture (R-GMA) server: publishing GLUE schema, monitoring, accounting, user information • VOBOX: Container for VO-specific services (aka “edge services”) JLab HEPiX Fall 2006
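
Since the BDII is a plain LDAP server, the published GLUE information can be inspected with any LDAP client. Below is a minimal sketch using the third-party ldap3 Python library; the hostname is a placeholder, and port 2170 with base "o=grid" follow the usual LCG convention (check your site's actual settings):

```python
# Query a BDII for Computing Element entries published in the GLUE 1.x schema.
# Hostname is a placeholder; port 2170 and base "o=grid" are the usual LCG defaults.
from ldap3 import ALL, Connection, Server

server = Server("site-bdii.example.ac.uk", port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)          # anonymous bind

conn.search(
    search_base="o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs"],
)
for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs, entry.GlueCEStateWaitingJobs)
```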

  12. Core services • Workload Management System (WMS), aka Resource Broker: accepts jobs, dispatches them to sites and manages their lifecycle • Logging & Bookkeeping: primarily logs lifecycle events for jobs • MyProxy: stores long-lived credentials • LCG File Catalogue (LFC): maps logical file names to local names on SEs • File Transfer Service (FTS): provides managed, reliable file transfers • BDII: aggregates information from site BDIIs • R-GMA schema/registry: stores table definitions and lists of producers/consumers • VO Membership Service (VOMS) server: stores VO group/role assignments • User Interface (UI): provides user client tools for the Grid services JLab HEPiX Fall 2006
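
For context, a user describes a job to the WMS in a small JDL file and submits it from the UI. The sketch below just writes an illustrative JDL; the attribute values are placeholders, and the submission client (edg-job-submit or glite-wms-job-submit) depends on the UI version, so it is only indicated in a comment:

```python
# Write a minimal, illustrative JDL job description for submission from a UI.
# Submit with the client appropriate to the middleware version on the UI
# (e.g. edg-job-submit or glite-wms-job-submit).
jdl = """\
Executable    = "/bin/hostname";
Arguments     = "-f";
StdOutput     = "stdout.txt";
StdError      = "stderr.txt";
OutputSandbox = {"stdout.txt", "stderr.txt"};
Requirements  = other.GlueCEStateStatus == "Production";
"""

with open("hello.jdl", "w") as f:
    f.write(jdl)
```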

  13. Grid services • Some extra services are needed to allow the Grid to be operated effectively • Mostly unique instances, not part of the gLite distribution • Grid Operations Centre DataBase (GOCDB): stores information about each site, including contact details, status and a node list • Queried by other tools to generate configuration, monitoring etc • Accounting (APEL): publishes information about CPU and storage use • Various monitoring tools, including: • gstat (Grid status) - collects data from the information system, does sanity checks • Site Availability Monitoring (SAM) - runs regular test jobs at every site, raises alerts and measures availability over time • GridView – collects and displays information about file transfers • Real Time Monitor – displays job movements, and records statistics • Freedom of Choice for Resources (FCR): allows the view of resources in a BDII to be filtered according to VO-specific criteria, e.g. SAM test failures • Operations portal: aggregates monitoring and operational information, broadcast email tool, news, VO information, … JLab HEPiX Fall 2006
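
The FCR idea is easy to illustrate: take the full resource list from the BDII and hide anything failing the VO's critical SAM tests. A toy sketch of that filtering logic (site names, test names and data structures are invented for the example):

```python
# Toy illustration of FCR-style filtering: hide sites failing critical SAM tests.
sam_results = {                       # invented example data: site -> {test: passed?}
    "SITE-A": {"job-submit": True,  "replica-mgmt": True},
    "SITE-B": {"job-submit": False, "replica-mgmt": True},
    "SITE-C": {"job-submit": True,  "replica-mgmt": False},
}
critical_tests = {"job-submit", "replica-mgmt"}   # chosen per VO

def visible(site: str) -> bool:
    """A site stays visible only if it passes every critical test."""
    results = sam_results.get(site, {})
    return all(results.get(test, False) for test in critical_tests)

print([site for site in sam_results if visible(site)])   # -> ['SITE-A']
```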

  14. SAM monitoring JLab HEPiX Fall 2006

  15. GridView JLab HEPiX Fall 2006

  16. Middleware issues • We need to operate a large production system with 24*7*365 availability • Middleware development is usually done on small, controlled test systems, but the production system is much larger in many dimensions, more heterogeneous and not under any central control • Much of the middleware is still immature, with a significant number of bugs, and developing rapidly • Documentation is sometimes lacking or out of date • There are therefore a number of issues which must be managed by deployment and operational procedures, for example: • The rapid rate of change and sometimes lack of backward compatibility requires careful management of code deployment • Porting to new hardware, operating systems etc can be time consuming • Components are often developed in isolation, so integration of new components can take time • Configuration can be very complex, and only a small subset of possible configurations produce a working system • Fault tolerance, error reporting and logging are in need of improvement • Remote management and diagnostic tools are generally undeveloped JLab HEPiX Fall 2006

  17. Deployment & Operations

  18. Configuration • We have tried many installation & configuration tools over the years • Configuration is complex, but system managers don’t like complex tools! • Most configuration flexibility needs to be “frozen” • Admins don’t understand all the options anyway • Many configuration changes will break something • The more an admin has to type, the more chances for a mistake • Current method preferred by most sites is YAIM (Yet Another Installation Method): • bash scripts • simple configuration of key parameters only • doesn’t always have enough flexibility, but good enough for most cases JLab HEPiX Fall 2006
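
YAIM is driven by a flat site-info.def file of shell variables that its scripts source. The sketch below generates a minimal illustrative file from Python; the variable names shown are typical of YAIM from this era but should be checked against the installed release:

```python
# Generate a minimal, illustrative YAIM-style site-info.def.
# Variable names are typical for YAIM of this period but may differ between releases.
site_config = {
    "SITE_NAME": "UKI-EXAMPLE-SITE",
    "CE_HOST": "ce.example.ac.uk",
    "BDII_HOST": "site-bdii.example.ac.uk",
    "MON_HOST": "mon.example.ac.uk",
    "VOS": "atlas cms lhcb dteam",
    "QUEUES": "short long",
}

with open("site-info.def", "w") as f:
    for key, value in site_config.items():
        f.write(f'{key}="{value}"\n')    # YAIM sources this as a bash fragment
```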

  19. Release management • There is a constant tension between the desire to upgrade to get new features, and the desire to have a stable system • Need to be realistic about how long it takes to get new things into production • We have so far had a few “big bang” releases per year, but these have some disadvantages • Anything which misses a release has to wait for a long time, hence there is pressure to include untested code • Releases can be held up by problems in any area, hence are usually late • They involve a lot of work for system managers, so it may be several months before all sites upgrade • We are now moving to incremental releases, updating each component as it completes integration and testing • Have to avoid dependencies between component upgrades • Releases go first to a 10%-scale pre-production Grid • Updates every couple of weeks • The system becomes more heterogeneous • Still some big bangs – e.g. a new OS • Seems OK so far – time will tell! JLab HEPiX Fall 2006

  20. VO support • If sites are going to support a large number of VOs the configuration has to be done in a standard way • Largely true, but not perfect: adding a VO needs changes in several areas • Configuration parameters for VOs should be available on the operations portal, although many VOs still need to add their data • It needs to be possible to install VO-specific software, and maybe services, in a standard way • Software is ~OK: NFS-shared area, writeable by specific VO members, with publication in the information system • Services still under discussion: concerns about security and support • VOs often expect to have dedicated contacts at sites (and vice versa) • May be necessary in some cases but does not scale • Operations portal stores contacts, but site -> VO may not reach the right people – need contacts by area • Not too bad, but still needs some work to find a good modus vivendi JLab HEPiX Fall 2006
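
On LCG worker nodes the shared software area is conventionally advertised to jobs through a VO_<NAME>_SW_DIR environment variable. A small sanity-check sketch (the VO name is just an example, and real software-tag publication goes through the information system rather than this script):

```python
# Illustrative sanity check of a VO software area on a worker node.
# "ATLAS" is only an example VO; the VO_<NAME>_SW_DIR convention is assumed.
import os

vo = "ATLAS"
sw_dir = os.environ.get(f"VO_{vo}_SW_DIR")

if sw_dir is None:
    print(f"No software area advertised for {vo}")
elif not os.path.isdir(sw_dir):
    print(f"{sw_dir} is advertised but does not exist")
else:
    writable = os.access(sw_dir, os.W_OK)   # only VO software managers should be able to write
    print(f"{vo} software area: {sw_dir} (writable by this account: {writable})")
```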

  21. Availability • LCG requires high availability, but the intrinsic failure rate is high • Most of the middleware does not deal gracefully with failures • Some failure modes can lead to “black holes” • Must fix/mask failures via operational tools so users don’t see them • Several monitoring tools have been developed, including test jobs run regularly at sites • On-duty operators look for problems, and submit tickets to sites • Currently ~ 50 tickets per week (c.f. 200 sites) • FCR tool allows sites failing specified tests to be made “invisible” • New sites must be certified before they become visible • Persistently failing sites can be decertified • Sites can be removed temporarily for scheduled downtime • Performance is monitored over time • The situation has improved a lot, but we still have some way to go JLab HEPiX Fall 2006

  22. GridPP Strengths

  23. GridPP Specifics • GridPP operates as part of the EGEE infrastructure – so does many things the same way as other countries/grids • GridPP has also taken the lead on many issues and provided the grid-wide solution • GridSite • R-GMA • APEL CPU and storage accounting • LCG Monitor • Storage Group • T1-T2 Transfers • RB analysis • GOCDB • Dissemination JLab HEPiX Fall 2006

  24. GridSite evolution • Started as web content management • www.gridpp.ac.uk • Authorisation via X.509 certificates, now VOMS, ACLs • “Library-ised” for reuse of GridSite components • EDG/LCG Logging & Bookkeeping; LCAS • GridSite CGI becomes GridSite Apache module • 3rd-party CGI/PHP on top of this: GOC etc • Web services like gLite WM Proxy on CEs • Storage is the current expansion area for GridSite JLab HEPiX Fall 2006

  25. JLab HEPiX Fall 2006

  26. R-GMA • A distributed monitoring and information-storing infrastructure, built around producers and consumers • SQL-like interface • Used for: • APEL Accounting • Job Monitoring • CMS Dashboard • RB Statistics • LCG Job Monitor • SFT Tests • Network Monitoring JLab HEPiX Fall 2006
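
The R-GMA programming model is essentially "publish a tuple, then query with SQL". The snippet below imitates that producer/consumer idea with an in-memory SQLite table purely for illustration; it does not use the real R-GMA client API:

```python
# Imitation of the R-GMA producer/consumer model with SQLite (illustration only).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE JobStatus (jobId TEXT, site TEXT, status TEXT)")

# Producer side: publish monitoring tuples.
db.executemany(
    "INSERT INTO JobStatus VALUES (?, ?, ?)",
    [("job-001", "RAL", "Running"), ("job-002", "RAL", "Done"), ("job-003", "IC", "Running")],
)

# Consumer side: the kind of SQL query a monitoring tool would issue.
query = "SELECT site, COUNT(*) FROM JobStatus WHERE status = 'Running' GROUP BY site"
for row in db.execute(query):
    print(row)
```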

  27. GOCDB • A database which holds configuration, contact and downtime information for each site • Supports multiple grids and national regional structures • Used by • Monitoring tools • Mailing lists • Maps JLab HEPiX Fall 2006

  28. APEL, Job Accounting Flow Diagram • [1] Build Job Accounting Records at site. • [2] Send Job Records to a central repository • [3] Data Aggregation JLab HEPiX Fall 2006

  29. [Accounting data-flow diagram] Job records come into the GOC via R-GMA, one record per Grid job (millions of records expected); an hourly SQL query against the R-GMA MON box feeds the accounting server, which consolidates the data into summary records refreshed every hour (at most ~100K records per year); the home page and on-demand accounting pages are built from SQL queries on the summary data, serving user queries and graphs. JLab HEPiX Fall 2006
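
The hourly consolidation step amounts to collapsing millions of per-job records into per-site, per-VO summaries. An illustrative query of the kind the accounting server might run (the table and column names are invented for the example and are not APEL's actual schema):

```python
# Illustrative consolidation of per-job accounting records into summary data.
# The schema is invented for the example and is not APEL's actual schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE JobRecords (site TEXT, vo TEXT, cpu_seconds INTEGER, month TEXT)")
db.executemany("INSERT INTO JobRecords VALUES (?, ?, ?, ?)", [
    ("SITE-A", "atlas", 7200, "2006-10"),
    ("SITE-A", "atlas", 3600, "2006-10"),
    ("SITE-B", "lhcb", 1800, "2006-10"),
])

summary = db.execute("""
    SELECT site, vo, month, COUNT(*) AS njobs, SUM(cpu_seconds) AS total_cpu
    FROM JobRecords GROUP BY site, vo, month
""")
for row in summary:
    print(row)
```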

  30. LHC View: Data Aggregation For VOs per Tier1, per Country JLab HEPiX Fall 2006

  31. Aggregation of Data for GridPP JLab HEPiX Fall 2006

  32. APEL Portal • A repository of accounting information • From APEL sensors • From DGAS HLR • Directly from site accounting databases • From other Grids (OSG) • Looking at standard methods of publishing • OGF RUS • A worldwide view for worldwide VOs • Also working on user level accounting and storage accounting JLab HEPiX Fall 2006

  33. Storage Group • A working group on all aspects of grid storage • Deployed dCache, DPM, Castor SRMs across UK • Tested deployment, interactions • Documentation, support, wiki • Development – GIPs, storage accounting JLab HEPiX Fall 2006

  34. Tier-1 to Tier-2 • GridPP has planned extensive data transfers between its sites • This has shaken out many problems with sites, their data storage, and networking, both local and national • UK data transfers >1000 Mb/s for 3 days • Peak transfer rate from RAL of >1.5 Gb/s • Need high data-rate transfers to/from RAL as a routine activity JLab HEPiX Fall 2006
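
As a quick cross-check, a sustained 1000 Mb/s transfer over three days moves roughly 30 TB (simple arithmetic, not a figure from the slide):

```python
# How much data does a sustained 1000 Mb/s transfer move in 3 days?
rate_mbit_s = 1000
days = 3

bytes_moved = rate_mbit_s / 8 * 1e6 * days * 24 * 3600   # Mb/s -> bytes, times duration
print(f"~{bytes_moved / 1e12:.0f} TB in {days} days")    # roughly 32 TB
```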

  35. LCG Monitor • An active display of jobs running, scheduled, ending on LCG sites • Uses information from Resource Brokers • Now with a 3D version, rotate, zoom JLab HEPiX Fall 2006

  36. Dissemination • If a taxi driver asks you what you do... mention the Grid by numbers • Or the BBC... and avian flu? JLab HEPiX Fall 2006

  37. Conclusions

  38. Lessons learnt • “Good enough” is not good enough • Grids are good at magnifying problems, so must try to fix everything • Exceptions are the norm • 15,000 nodes * MTBF of 5 years = 8 failures a day • Also 15,000 ways to be misconfigured! • Something somewhere will always be broken • But middleware developers tend to assume that everything will work • It needs a lot of manpower to keep a big system going • Bad error reporting can cost a lot of time • And reduce people’s confidence • Very few people understand how the whole system works • Or even a large subset of it • Easy to do things which look reasonable but have a bad side-effect • Communication between sites and users is an n*m problem • Need to collapse to n+m JLab HEPiX Fall 2006
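
The failure-rate figure quoted in the slide is just MTBF arithmetic, spelled out below:

```python
# "15,000 nodes * MTBF of 5 years = 8 failures a day", spelled out.
nodes = 15_000
mtbf_years = 5

failures_per_day = nodes / (mtbf_years * 365)
print(f"Expected node failures per day: ~{failures_per_day:.1f}")   # ~8.2
```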

  39. Summary • LHC turns on in 1 year – we must focus on delivering a high QOS • Grid middleware is still immature, developing rapidly and in many cases a fair way from production quality • Experience is that new middleware developments take ~ 2 years to reach the production system, so LHC will start with what we have now • The underlying failure rate is high – this will always be true with so many components, so middleware and operational procedures must allow for it • We need procedures which can manage the underlying problems, and present users with a system which appears to work smoothly at all times • Considerable progress has been made, but there is more to do • GridPP is running a major part of the EGEE/LCG Grid, which is now a very large system operated as a high-quality service, 24*7*365 • We are living in interesting times! JLab HEPiX Fall 2006
