This article provides an overview of GridPP, its collaboration with EGEE and LCG, the deployment and operations of the middleware, and its specialties in running a production grid. It also discusses the current status and future plans of GridPP.
GridPP: Running a Production Grid
Stephen Burke, John Gordon
CCLRC-RAL
HEPiX, Fall 2006, JLab
Overview
• EGEE, LCG and GridPP
• Middleware
• Deployment & Operations
• GridPP Specialities
• Conclusions
EGEE
• Major EU Grid project: 2004-08 (in two phases)
• Successor to the European DataGrid (EDG) project, 2001-04
• 32 countries, 91 partners, €37 million + matching funding
• Associated with several Grid projects outside Europe
• Expected to be succeeded by a permanent European e-infrastructure
• Supports many areas of e-science, but currently High Energy Physics is the major user
• Biomedical research is also a pioneer
• Currently ~3000 users in 200 Virtual Organisations
• Currently 195 sites, 28,689 CPUs, 18.4 PB of storage
• Values taken from the information system – beware of GIGO!
EGEE/LCG Google map
(W)LCG
• The computing services for the LHC (Large Hadron Collider) at CERN in Geneva are provided by the LHC Computing Grid (LCG) project
• LHC starts running in ~1 year
• Four experiments, all very large
• ~5000 users at 500 sites worldwide, 15-year lifetime
• Expect ~15 PB/year, plus similar volumes of simulated data
• Processing requirement is ~100,000 CPUs
• Must transfer ~100 Mbyte/s per site – sustained for 15 years! (worked out below)
• Running a series of Service Challenges to ramp up to full scale
• LCG uses the EGEE infrastructure, but also the Open Science Grid (OSG) in the US and other Grid infrastructures
• Hence WLCG = Worldwide LCG
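To put the sustained transfer requirement in perspective, a rough back-of-the-envelope figure using the numbers quoted above:

    100 Mbyte/s × ~3.15 × 10^7 s/year ≈ 3 PB per site per year

so a Tier 1 taking data at that rate accumulates of order 3 PB annually, the same order of magnitude as the ~15 PB/year being shared out from CERN.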
Organisation
• EGEE sites are organised by region
• GridPP is part of UK/Ireland
• Also NGS + Grid Ireland
• Each region has a Regional Operation Centre (ROC) to look after the sites in the region
• Overall operations co-ordination rotates weekly between ROCs
• LCG divides sites into Tier 1/2/3
• + CERN as Tier 0
• Function of size and QOS
• Tier 1 needs >97% availability, max 24-hour response
• Tier 2: 95% availability, 72-hour response (see the downtime arithmetic below)
• Tier 3 are local facilities, no specific targets
• ROC ≈ Tier 1: RAL is both, like CNAF, IN2P3, FZK
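Those availability targets translate into concrete downtime budgets; as a rough illustration:

    97% availability → at most ~0.03 × 8760 h ≈ 260 hours (~11 days) of downtime per year
    95% availability → at most ~0.05 × 8760 h ≈ 440 hours (~18 days) of downtime per year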
GridPP
• Grid for UK Particle Physics
• Two phases (2001-04 and 2004-07), £36M
• Proposal for phase 3 to 2011
• Part of EGEE and LCG
• Working towards interoperability with the UK National Grid Service (NGS)
• Exploitation of the LHC is the main UK driver
• 20 sites, 4354 CPUs, 298 TB of storage
• Currently supports 33 VOs, including some non-PP
• But not many non-PP from the UK
• For LCG, sites are grouped into four “virtual” Tier 2s
• Plus RAL as Tier 1
• Grouping is largely administrative; the Grid sites remain separate, as the middleware doesn’t easily support federations or distributed sites
• Runs the UK-Ireland ROC (with NGS)
• Grid Operations Centre (GOC) @ RAL (with NGS)
• Grid-wide configuration, monitoring and accounting repository/portal
• Operations and User Support shifts (working hours only)
GridPP sites
Site services
• Basis is Globus (still GT2, GT4 soon) and Condor, as packaged in the Virtual Data Toolkit (VDT) – also used by NGS
• The EGEE/LCG/EDG middleware distribution is now under the gLite brand name
• Computing Element (CE): Globus gatekeeper + batch system + batch workers
• In transition from Globus to Condor-C
• Storage Element (SE): Storage Resource Manager (SRM) + GridFTP + other data transports + storage system (disk-only or disk+tape)
• Three SRM implementations in GridPP
• Berkeley Database Information Index (BDII): LDAP server publishing CE + SE + site + service information according to the GLUE schema (example query below)
• Relational Grid Monitoring Architecture (R-GMA) server: publishes GLUE schema, monitoring, accounting and user information
• VOBOX: container for VO-specific services (aka “edge services”)
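As an illustration of how the information system is consumed: the BDII is a plain LDAP server publishing the GLUE schema, so it can be queried with standard LDAP tools. A minimal sketch – the hostname is a placeholder, and the exact base DN and attribute set depend on the GLUE schema version deployed:

    # Query a BDII (LDAP on port 2170) for the computing elements it publishes,
    # returning the CE identifier and the number of CPUs behind it.
    ldapsearch -x -H ldap://bdii.example.ac.uk:2170 -b "o=grid" \
        '(objectClass=GlueCE)' GlueCEUniqueID GlueCEInfoTotalCPUs

    # Similarly, list the storage areas and their published free space.
    ldapsearch -x -H ldap://bdii.example.ac.uk:2170 -b "o=grid" \
        '(objectClass=GlueSA)' GlueChunkKey GlueSAStateAvailableSpace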
Core services
• Workload Management System (WMS), aka Resource Broker: accepts jobs, dispatches them to sites and manages their lifecycle
• Logging & Bookkeeping: primarily logs lifecycle events for jobs
• MyProxy: stores long-lived credentials
• LCG File Catalogue (LFC): maps logical file names to local names on SEs
• File Transfer Service (FTS): provides managed, reliable file transfers
• BDII: aggregates information from site BDIIs
• R-GMA schema/registry: stores table definitions and lists of producers/consumers
• VO Membership Service (VOMS) server: stores VO group/role assignments
• User Interface (UI): provides user client tools for the Grid services (a typical workflow is sketched below)
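For orientation, a sketch of how a user typically exercises these services from a UI. The VO name, file names and job identifier are placeholders, and the exact command names differ between the LCG-2 (edg-*) and gLite (glite-*) tool sets, so treat this as illustrative rather than definitive:

    # 1. Create a proxy certificate carrying VO membership (VO name is a placeholder)
    voms-proxy-init --voms dteam

    # 2. Describe the job in a JDL file, e.g. hello.jdl:
    #      Executable    = "/bin/hostname";
    #      StdOutput     = "std.out";
    #      StdError      = "std.err";
    #      OutputSandbox = {"std.out", "std.err"};

    # 3. Submit to a Resource Broker / WMS, track it, and retrieve the output
    edg-job-submit hello.jdl          # prints a job identifier (an https URL)
    edg-job-status <job-id>
    edg-job-get-output <job-id>

    # 4. Browse the VO's logical namespace in the LFC (LFC_HOST assumed to be set)
    lfc-ls -l /grid/dteam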
Grid services
• Some extra services are needed to allow the Grid to be operated effectively
• Mostly unique instances, not part of the gLite distribution
• Grid Operations Centre DataBase (GOCDB): stores information about each site, including contact details, status and a node list
• Queried by other tools to generate configuration, monitoring etc.
• Accounting (APEL): publishes information about CPU and storage use
• Various monitoring tools, including:
• gstat (Grid status): collects data from the information system, does sanity checks
• Site Availability Monitoring (SAM): runs regular test jobs at every site, raises alerts and measures availability over time
• GridView: collects and displays information about file transfers
• Real Time Monitor: displays job movements, and records statistics
• Freedom of Choice for Resources (FCR): allows the view of resources in a BDII to be filtered according to VO-specific criteria, e.g. SAM test failures
• Operations portal: aggregates monitoring and operational information, broadcast email tool, news, VO information, …
SAM monitoring
GridView
Middleware issues
• We need to operate a large production system with 24*7*365 availability
• Middleware development is usually done on small, controlled test systems, but the production system is much larger in many dimensions, more heterogeneous and not under any central control
• Much of the middleware is still immature, with a significant number of bugs, and developing rapidly
• Documentation is sometimes lacking or out of date
• There are therefore a number of issues which must be managed by deployment and operational procedures, for example:
• The rapid rate of change and sometimes lack of backward compatibility requires careful management of code deployment
• Porting to new hardware, operating systems etc. can be time consuming
• Components are often developed in isolation, so integration of new components can take time
• Configuration can be very complex, and only a small subset of possible configurations produce a working system
• Fault tolerance, error reporting and logging are in need of improvement
• Remote management and diagnostic tools are generally undeveloped
Configuration
• We have tried many installation & configuration tools over the years
• Configuration is complex, but system managers don’t like complex tools!
• Most configuration flexibility needs to be “frozen”
• Admins don’t understand all the options anyway
• Many configuration changes will break something
• The more an admin has to type, the more chances for a mistake
• Current method preferred by most sites is YAIM (Yet Another Installation Method) – see the sketch below:
• bash scripts
• simple configuration of key parameters only
• doesn’t always have enough flexibility, but good enough for most cases
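A flavour of what YAIM configuration looks like: site-info.def is just a file of bash variables. The variable names and paths below are from memory of the LCG/gLite YAIM of this era and vary between releases, so this is a sketch rather than a reference:

    # site-info.def (fragment) – the single bash file most sites edit
    SITE_NAME=UKI-EXAMPLE-SITE
    CE_HOST=ce.example.ac.uk
    SE_HOST=se.example.ac.uk
    MON_HOST=mon.example.ac.uk                   # R-GMA MON box
    BDII_HOST=lcg-bdii.example.org               # top-level BDII to use
    VOS="ops dteam atlas cms lhcb"
    QUEUES="short long"
    WN_LIST=/opt/glite/yaim/etc/wn-list.conf     # worker node hostnames
    USERS_CONF=/opt/glite/yaim/etc/users.conf    # pool account definitions

    # Then configure a node type from that one file, e.g. a Computing Element:
    /opt/glite/yaim/scripts/configure_node site-info.def CE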
Release management
• There is a constant tension between the desire to upgrade to get new features, and the desire to have a stable system
• Need to be realistic about how long it takes to get new things into production
• We have so far had a few “big bang” releases per year, but these have some disadvantages
• Anything which misses a release has to wait a long time, hence there is pressure to include untested code
• Releases can be held up by problems in any area, hence are usually late
• They involve a lot of work for system managers, so it may be several months before all sites upgrade
• We are now moving to incremental releases, updating each component as it completes integration and testing
• Have to avoid dependencies between component upgrades
• Releases go first to a 10%-scale pre-production Grid
• Updates every couple of weeks
• The system becomes more heterogeneous
• Still some big bangs – e.g. a new OS
• Seems OK so far – time will tell!
VO support
• If sites are going to support a large number of VOs, the configuration has to be done in a standard way (see the example below)
• Largely true, but not perfect: adding a VO needs changes in several areas
• Configuration parameters for VOs should be available on the operations portal, although many VOs still need to add their data
• It needs to be possible to install VO-specific software, and maybe services, in a standard way
• Software is ~OK: NFS-shared area, writeable by specific VO members, with publication in the information system
• Services still under discussion: concerns about security and support
• VOs often expect to have dedicated contacts at sites (and vice versa)
• May be necessary in some cases but does not scale
• Operations portal stores contacts, but site -> VO may not reach the right people – need contacts by area
• Not too bad, but still needs some work to find a good modus vivendi
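As an example of the “standard way” of adding a VO, most of the per-VO settings are again just YAIM variables. The names below, for a hypothetical VO “myvo”, are indicative of the pattern only and are not an exact reference for any particular YAIM release:

    # site-info.def (fragment) – per-VO block for a hypothetical VO "myvo"
    VO_MYVO_SW_DIR=$VO_SW_DIR/myvo        # NFS-shared software area, writeable by the VO's software managers
    VO_MYVO_DEFAULT_SE=se.example.ac.uk   # default SE for the VO
    VO_MYVO_VOMS_SERVERS="vomss://voms.example.org:8443/voms/myvo"
    # ...plus an entry in VOS, and pool accounts / group mappings in users.conf and groups.conf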
Availability
• LCG requires high availability, but the intrinsic failure rate is high
• Most of the middleware does not deal gracefully with failures
• Some failure modes can lead to “black holes”
• Must fix/mask failures via operational tools so users don’t see them
• Several monitoring tools have been developed, including test jobs run regularly at sites
• On-duty operators look for problems, and submit tickets to sites
• Currently ~50 tickets per week (cf. 200 sites)
• The FCR tool allows sites failing specified tests to be made “invisible”
• New sites must be certified before they become visible
• Persistently failing sites can be decertified
• Sites can be removed temporarily for scheduled downtime
• Performance is monitored over time (a sketch of the availability calculation follows)
• The situation has improved a lot, but we still have some way to go
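A sketch of the kind of availability figure computed from the regular test results: given a log of (timestamp, site, test status) samples, availability over a period is simply the fraction of samples in which the site passed. The input file format here is invented, and the real SAM/GridView calculation also takes scheduled downtime into account:

    # sam-results.csv lines look like (illustrative format):  2006-10-09T12:00,RAL-LCG2,ok
    # Print the fraction of samples in which each site passed its tests:
    awk -F, '{ total[$2]++; if ($3 == "ok") pass[$2]++ }
             END { for (s in total) printf "%-20s %5.1f%%\n", s, 100 * pass[s] / total[s] }' sam-results.csv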
GridPP Specifics
• GridPP operates as part of the EGEE infrastructure – so it does many things the same way as other countries/grids
• GridPP has also taken the lead on many issues and provided the grid-wide solution:
• GridSite
• R-GMA
• APEL CPU and storage accounting
• LCG Monitor
• Storage Group
• T1-T2 transfers
• RB analysis
• GOCDB
• Dissemination
GridSite evolution
• Started as web content management
• www.gridpp.ac.uk
• Authorisation via X.509 certificates, now VOMS, ACLs
• “Library-ised”, for reuse of GridSite components
• EDG/LCG Logging & Bookkeeping; LCAS
• GridSite CGI becomes the GridSite Apache module
• 3rd-party CGI/PHP on top of this: GOC etc.
• Web services like the gLite WM Proxy on CEs
• Storage is the current expansion area for GridSite
R-GMA
• A distributed monitoring and information-storing infrastructure, built around producers and consumers
• SQL-like interface (illustrated below)
• Used for:
• APEL accounting
• Job monitoring
• CMS Dashboard
• RB statistics
• LCG Job Monitor
• SFT tests
• Network monitoring
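To illustrate the SQL-like interface: a producer publishes tuples into a named table, and consumers retrieve them with SELECT statements (as latest-value, history or continuous queries). The table and column names below are purely illustrative, not the real R-GMA schema:

    -- Producer side: publish a tuple into a named table
    INSERT INTO ServiceStatus (Site, Service, Status, MeasurementTime)
      VALUES ('RAL-LCG2', 'CE', 'ok', '2006-10-09 12:00:00');

    -- Consumer side: a query over everything published Grid-wide
    SELECT Site, Service, Status FROM ServiceStatus WHERE Status <> 'ok';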
GOCDB
• A database which holds configuration, contact and downtime information for each site
• Supports multiple grids and national/regional structures
• Used by:
• Monitoring tools
• Mailing lists
• Maps
APEL, Job Accounting Flow Diagram
• [1] Build job accounting records at the site
• [2] Send job records to a central repository
• [3] Data aggregation
APEL data flow (diagram summary)
• Job records arrive at the GOC via R-GMA: one record per Grid job, with millions of records expected
• An SQL query from the R-GMA MON box to the accounting server runs once per hour to consolidate the data
• Summary data are refreshed every hour (at most ~100K summary records per year)
• User queries, graphs, the home page and on-demand accounting pages are built from SQL queries against the summary data
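The hourly consolidation step is essentially an SQL aggregation from the per-job table into a much smaller summary table. A minimal sketch with illustrative table and column names (not the actual APEL schema):

    -- Roll millions of per-job records up into per-site, per-VO, per-month summaries
    INSERT INTO JobSummary (Site, VO, Month, NJobs, CpuHours)
    SELECT Site, VO, DATE_FORMAT(EndTime, '%Y-%m'), COUNT(*), SUM(CpuSeconds) / 3600
    FROM   JobRecord
    GROUP  BY Site, VO, DATE_FORMAT(EndTime, '%Y-%m');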
LHC view: data aggregation for VOs, per Tier 1 and per country
Aggregation of data for GridPP
APEL Portal
• A repository of accounting information:
• From APEL sensors
• From the DGAS HLR
• Directly from site accounting databases
• From other Grids (OSG)
• Looking at standard methods of publishing
• OGF RUS
• A worldwide view for worldwide VOs
• Also working on user-level accounting and storage accounting
Storage Group
• A working group on all aspects of grid storage
• Deployed dCache, DPM and CASTOR SRMs across the UK (example usage below)
• Tested deployment, interactions
• Documentation, support, wiki
• Development – GIPs, storage accounting
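For illustration, users (and tests) normally exercise an SRM-fronted SE through the lcg_utils layer rather than speaking SRM directly. A hedged sketch, with placeholder host and VO names, and assuming the usual environment (LFC and information system endpoints) is already configured on the UI:

    # Copy a local file to an SRM-managed SE and register it in the file catalogue
    lcg-cr --vo dteam -d dcache.example.ac.uk -l lfn:/grid/dteam/test/1kB.dat file:///tmp/1kB.dat

    # List replicas of the logical file, then copy one back
    lcg-lr --vo dteam lfn:/grid/dteam/test/1kB.dat
    lcg-cp --vo dteam lfn:/grid/dteam/test/1kB.dat file:///tmp/1kB.copy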
Tier-1 to Tier-2
• GridPP has planned extensive data transfers between its sites
• This has shaken out many problems with sites, their data storage, and networking, both local and national
• UK data transfers of >1000 Mb/s sustained for 3 days
• Peak transfer rate from RAL of >1.5 Gb/s
• Need high-data-rate transfers to/from RAL as a routine activity (see the volumes this implies, below)
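For scale, the quoted network rates correspond roughly to the following daily volumes:

    1000 Mb/s ≈ 125 Mbyte/s ≈ 10.8 TB/day, i.e. more than 30 TB moved over the 3-day exercise
    1.5 Gb/s  ≈ 190 Mbyte/s ≈ 16 TB/day at the peak rate from RAL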
LCG Monitor
• An active display of jobs running, scheduled and ending on LCG sites
• Uses information from Resource Brokers
• Now with a 3D version: rotate, zoom
Dissemination
• If a taxi driver asks you what you do… mention the Grid by numbers
• Or the BBC… and avian flu?
Lessons learnt
• “Good enough” is not good enough
• Grids are good at magnifying problems, so must try to fix everything
• Exceptions are the norm
• 15,000 nodes × MTBF of 5 years = 8 failures a day
• Also 15,000 ways to be misconfigured!
• Something somewhere will always be broken
• But middleware developers tend to assume that everything will work
• It needs a lot of manpower to keep a big system going
• Bad error reporting can cost a lot of time
• And reduce people’s confidence
• Very few people understand how the whole system works
• Or even a large subset of it
• Easy to do things which look reasonable but have a bad side-effect
• Communication between sites and users is an n*m problem
• Need to collapse to n+m
Summary
• LHC turns on in 1 year – we must focus on delivering a high QOS
• Grid middleware is still immature, developing rapidly and in many cases a fair way from production quality
• Experience is that new middleware developments take ~2 years to reach the production system, so LHC will start with what we have now
• The underlying failure rate is high – this will always be true with so many components, so middleware and operational procedures must allow for it
• We need procedures which can manage the underlying problems, and present users with a system which appears to work smoothly at all times
• Considerable progress has been made, but there is more to do
• GridPP is running a major part of the EGEE/LCG Grid, which is now a very large system operated as a high-quality service, 24*7*365
• We are living in interesting times!