
WLCG – Worldwide LHC Computing Grid



Presentation Transcript


  1. WLCG – Worldwide LHC Computing Grid Status Report LHCC Open Session, September 25th 2007

  2. Agenda • State of Readiness of WLCG Service • Definitions of “Readiness” • What is being done to correct any problems • How we will track our progress • Some general words on the project itself (wrt service…) • Conclusions

  3. What is Grid Computing? • Today there are many definitions of Grid computing • The definitive one is given by Ian Foster [1] in his article "What is the Grid? A Three Point Checklist" [2] • The three points of this checklist are: • Computing resources are not administered centrally; • Open standards are used; • Non-trivial quality of service is achieved.

  4. WLCG depends on two major science grid infrastructures: EGEE – Enabling Grids for E-sciencE, and OSG – the US Open Science Grid

  5. Is WLCG a Grid? • I contend that it satisfies the first two criteria: • The two major Grids on which it is based are clearly separate management domains and … • At least a workable degree of de-facto standards is needed for successful production services to be offered. • But what of the Service? “A Grid allows its constituent resources to be used in a coordinated fashion to deliver various qualities of service, relating for example to response time, throughput, availability, and security, and/or co-allocation of multiple resource types to meet complex user demands, so that the utility of the combined system is significantly greater than that of the sum of its parts.”

  6. WLCG – The Collaboration = 4 Experiments + 11 Tier-1 Centres + Tier-2 Centres • Tier-0 – the accelerator centre: data acquisition & initial processing; long-term data curation; distribution of data → Tier-1 centres • Tier-1 – "online" to the data acquisition process → high availability: managed mass storage – grid-enabled data service; data-heavy analysis; national, regional support. Sites: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF/SARA (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY) • Tier-2 – 112 centres in 53 federations in 26 countries: end-user (physicist, research group) analysis – where the discoveries are made; simulation

  7. Baseline Services The Basic Baseline Services – from the TDR (2005) • Storage Element • CASTOR, dCache, DPM (with SRM 1.1) • StoRM added in 2007 • SRM 2.2 – spec agreed in May 2006 – being deployed now • Basic transfer tools – GridFTP, … • File Transfer Service (FTS) • LCG File Catalog (LFC) • LCG data mgt tools – lcg-utils • POSIX I/O – Grid File Access Library (GFAL) • Synchronised databases T0 ↔ T1s – 3D project • Information System • Compute Elements • Globus/Condor-C • web services (CREAM) • gLite Workload Management – in production at CERN • VO Management System (VOMS) • VO Boxes • Application software installation • Job Monitoring Tools
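
As a concrete illustration of the data management tools listed above (lcg-utils on top of GFAL and SRM), here is a minimal Python sketch that drives an lcg-utils copy from a script. It assumes lcg-cp is installed and a valid grid proxy exists; the VO name and the storage URL are made-up examples, not real endpoints.

```python
# Minimal sketch: drive an lcg-utils copy from Python via subprocess.
# Assumes lcg-cp is on the PATH and a valid grid proxy exists; the VO
# name and the SURL/destination below are made-up examples.
import subprocess

def lcg_copy(source_surl, dest_url, vo="dteam", timeout=600):
    """Copy a grid file with lcg-cp and raise on failure."""
    cmd = ["lcg-cp", "--vo", vo, source_surl, dest_url]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(f"lcg-cp failed: {result.stderr.strip()}")
    return dest_url

# Hypothetical usage:
# lcg_copy("srm://se.example.org/dpm/example.org/home/dteam/test.root",
#          "file:///tmp/test.root")
```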

  8. WLCG – The Requirements • Resource requirements, e.g. ramp-up in Tier-N CPU, disk, tape and network • Look at the Computing TDRs; • Look at the resources pledged by the sites (MoU etc.); • Look at the plans submitted by the sites regarding acquisition, installation and commissioning; • Measure what is currently (and historically) available; signal anomalies. • Functional requirements, in terms of services and service levels, including operations, problem resolution and support • Implicit / explicit requirements in Computing Models; • Agreements from Baseline Services Working Group and Task Forces; • Service Level definitions in MoU; • Measure what is currently (and historically) delivered; signal anomalies. • Data transfer rates – the TierX → TierY matrix (see the sketch below) • Understand Use Cases; • Measure …
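
The "TierX → TierY matrix" above is essentially a table of required versus measured transfer rates, with anomalies flagged when a link under-performs. The sketch below shows that comparison in Python; all site names and rates are hypothetical placeholders, not WLCG figures.

```python
# Illustrative sketch only: compare required vs. measured TierX -> TierY
# rates and flag anomalies. All names and numbers are hypothetical.
required_mb_s = {
    ("T0-CERN", "T1-EXAMPLE-A"): 150.0,   # hypothetical nominal rate
    ("T0-CERN", "T1-EXAMPLE-B"): 100.0,
    ("T1-EXAMPLE-A", "T2-EXAMPLE-X"): 20.0,
}

measured_mb_s = {
    ("T0-CERN", "T1-EXAMPLE-A"): 155.0,
    ("T0-CERN", "T1-EXAMPLE-B"): 60.0,    # below target -> anomaly
    ("T1-EXAMPLE-A", "T2-EXAMPLE-X"): 19.5,
}

def signal_anomalies(required, measured, tolerance=0.9):
    """Report links whose measured rate falls below tolerance * required."""
    for link, target in sorted(required.items()):
        actual = measured.get(link, 0.0)
        if actual < tolerance * target:
            src, dst = link
            print(f"ANOMALY {src} -> {dst}: {actual:.1f} MB/s "
                  f"(target {target:.1f} MB/s)")

signal_anomalies(required_mb_s, measured_mb_s)
```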

  9. Reminder – one of the conclusions from the plenary talk at CHEP’04 by Fabiola Gianotti

  10. WLCG Commissioning Schedule • Still an ambitious programme ahead • Timely testing of the full data chain from DAQ to Tier-2 was a major item from the last Comprehensive Review • DAQ → Tier-0 still largely untested LHCC Comprehensive Review – September 2006

  11. The State of the Services, or Are We Ready? John Gordon, STFC-RAL WLCG Workshop, Victoria, September 2007

  12. Are we ready for what? • Data-taking in summer 2008? • No • Dress rehearsals in 2007 • Probably yes • But still lots to do

  13. Ready according to whom? • Middleware Developers • Grid Operations • Sites • Experiments • Unlikely to get the same answers from each • I take the view of Grid Operations • And get the view of sites and experiments from this workshop

  14. So What Are The Services? • Services: WN, CE, SE, RB/WMS, 3D, LFC, VOMS, FTS 2.0, SRM 2.2 • Other issues: SL4, Job Priorities, Pilot Jobs / glexec on WN

  15. Component Service Readiness

  16. Summary of Service Readiness

  17. Study of SRM 2.2 specification • In September 2006 there were very different interpretations of the spec • 6 releases of the SRM v2.2 specification document: July, September, December 2006 and January (2x), April 2007 • Study of the spec (state/activity diagrams) revealed many unspecified behaviours • A list of about 50 open issues was compiled in September 2006 • The last 30 points were discussed and agreed during the WLCG Workshop in January 2007; other major points were deferred to SRM 3.0 • The study of the specifications, the discussions and the testing of the open issues have helped ensure coherence in the protocol definition and consistency between SRM implementations. https://twiki.cern.ch/twiki/bin/view/SRMDev/IssuesInTheSpecifications Flavia Donno, CHEP 2007 – Victoria, CANADA
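
The state/activity-diagram study mentioned above can be pictured as checking request histories against a table of allowed transitions. The sketch below does this for a deliberately simplified, partly hypothetical subset of SRM-like states; it is not the normative SRM v2.2 state machine.

```python
# Simplified, illustrative SRM-like request state machine.
# The state names and allowed transitions are a reduced sketch,
# not the normative SRM v2.2 behaviour.
ALLOWED = {
    "SRM_REQUEST_QUEUED":     {"SRM_REQUEST_INPROGRESS", "SRM_ABORTED"},
    "SRM_REQUEST_INPROGRESS": {"SRM_SUCCESS", "SRM_FAILURE", "SRM_ABORTED"},
    "SRM_SUCCESS":            set(),
    "SRM_FAILURE":            set(),
    "SRM_ABORTED":            set(),
}

def check_history(states):
    """Return the first illegal transition in a request history, if any."""
    for current, nxt in zip(states, states[1:]):
        if nxt not in ALLOWED.get(current, set()):
            return (current, nxt)
    return None

print(check_history(["SRM_REQUEST_QUEUED", "SRM_REQUEST_INPROGRESS", "SRM_SUCCESS"]))  # None
print(check_history(["SRM_SUCCESS", "SRM_REQUEST_INPROGRESS"]))  # illegal transition
```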

  18. SRM v2.2 implementations • CASTOR2 : developed by CERN and RAL. SRM v2.2 support in v.2.1.4. • dCache : developed by DESY and FNAL. SRM 2.2 support in v1.8. • DPM: developed by CERN. SRM v2.2 support in v1.6.5 in production. • StoRM : developed by INFN and ICTP. SRM v2.2 interface for many filesystems: GPFS, Lustre, XFS and POSIX generic filesystem. SRM v2.2 support in v1.3.15. • BeStMan : developed by LBNL. SRM v2.2 support in v2.2.0.0. Flavia Donno, CHEP 2007 – Victoria, CANADA

  19. SRM v2.2 – GDB Review • All implementations delivered for testing • Bugs, inconsistencies, configuration issues • Critical issues identified • Experiment testing one month late • Good progress but not there yet. • This was the status < 1 month ago – much of the detail has since changed (for the better…) • We are now agreeing concrete production deployment dates by named site • Experiment testing is still required – as many problems as possible should be fixed before production – some will only be found and fixed later

  20. SRM v2.2 Production Deployment • Details of SRM v2.2 production at CERN are now being finalised. The plan is for one 'endpoint' per LHC experiment, plus a public one for the rest (as for CASTOR2) • Target: before end October 2007 • Tier-1s running CASTOR: wait at least one month after CERN • SRM v2.2 will be deployed at FZK during the week of November 5 with experts on site. Other major Tier-1s (and some Tier-2s – e.g. DESY) will follow, up to the end of 2007 • Remaining sites – including those that source dCache through OSG – will be upgraded by end-February 2008 • DPM is already available for Tier-2s; StoRM also for INFN(+) sites • CCRC'08 (described later) is foreseen to run on SRM v2.2 (Feb + May)

  21. SRM v2.2 – Summary • The 'SRM saga' has clearly been much longer than desirable – and at times fraught • Special thanks are due to all those involved, for their hard work over an extended period and (in anticipation) for the delivery of successful production services • When the time is right, it would make sense to learn from this exercise… • Large-scale collaboration is part of our world: • What did we do well? What could we improve?

  22. FTS status • FTS infrastructure generally runs well • Deployed at CERN and T1 sites • Sites understand the software • Most problems ironed out in the last 2 years’ service challenges and experiment data challenges • Remainder of the problems are understood with experiments and sites and there is a plan to address them • Still problems with distributed service operations • SRMs / gridFTP / FTS / experiment frameworks: many layers and many places where things break • We use FTS as our tool to debug and understand the distributed ‘service’

  23. FTS 2.0 improvements • Service level and performance: • Improved service administration tools • Improved service monitoring capabilities • Placeholders for future functionality (minimise impact of planned future service upgrades) • More efficient DB model • Service resilience analysis made – it’s understood how to deploy FTS to maximise its availability • Functionality: • Better security model (same model as WMS) • SRM 2.2 support • Release timescales for FTS 2.0: • Released for experiment stress-test on pilot mid-March • Deployed on CERN production service mid-June • Released to T1 sites at start August 2007

  24. FTS: Current plans • Continue making FTS monitoring data available • It measures every transfer in the system: lots of useful information for improving the overall service • Done within the context of the WLCG monitoring WG • More operational improvements (requested by T1 sites + CERN) • Better match the FTS channel topology to experiment computing models (e.g. CMS) • "Cloud" channels (illustrated in the sketch below) • Closer integration of FTS with experiment computing frameworks • FCR blacklists • Asynchronous notifications of finished jobs • Cross-integration with experiment monitoring • The stress-test prototyping model used for the FTS 2.0 pilot was very useful! • Plan to use the same model to prototype these new features
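
To make the channel-topology point concrete, here is a toy model of FTS-style channel matching: a transfer is assigned to a dedicated site-pair channel if one exists, otherwise to a broader catch-all ("cloud") channel. This is an illustrative model of the idea, not FTS code, and the channel and site names are hypothetical.

```python
# Toy model of FTS-style channel matching (illustrative only, not FTS code).
# A transfer goes to a dedicated site-pair channel if one exists, otherwise
# to a broader catch-all ("cloud") channel. Names are hypothetical.
channels = {
    ("CERN", "T1-EXAMPLE"): "CERN-T1EXAMPLE",   # dedicated channel
    ("*", "T1-EXAMPLE"):    "STAR-T1EXAMPLE",   # catch-all into the site
    ("*", "*"):             "CLOUD-ANY",        # "cloud" channel
}

def match_channel(source_site, dest_site):
    """Pick the most specific channel for a (source, dest) pair."""
    for key in [(source_site, dest_site), ("*", dest_site),
                (source_site, "*"), ("*", "*")]:
        if key in channels:
            return channels[key]
    return None

print(match_channel("CERN", "T1-EXAMPLE"))          # CERN-T1EXAMPLE
print(match_channel("T2-SOMEWHERE", "T1-EXAMPLE"))  # STAR-T1EXAMPLE
```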

  25. Multi-Replica Setup • LFC Production Replica • The LHCb computing model foresees 6 LFC read-only replicas at Tier-1s: CNAF, GRIDKA, IN2P3, PIC, RAL, SARA • At the moment: • One LFC replica in production at CNAF: frontend and backend deployed • LFC replica backends connected to CERN via Streams, but LFC frontends not yet set up at GRIDKA, IN2P3, PIC, RAL • LFC database replica not yet deployed at SARA – replica setup still to be done Barbara Martelli, CHEP 2007
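
The read-only replica model above implies a simple client-side policy: reads go to a local replica where one is deployed, writes always go to the CERN master. The sketch below illustrates that policy with hypothetical endpoint names; it is not LHCb's actual configuration.

```python
# Illustrative replica-selection policy for an LFC-like catalogue.
# Reads prefer a local read-only replica, writes always go to the master.
# Host names are hypothetical examples.
MASTER = "lfc-master.cern.ch"
READ_REPLICAS = {
    "CNAF": "lfc-ro.cnaf.example",     # frontend deployed
    # GRIDKA, IN2P3, PIC, RAL: backend replicated, frontend not yet deployed
    # SARA: database replica not yet deployed
}

def pick_endpoint(site, operation):
    """Return the catalogue endpoint to use for a given site and operation."""
    if operation == "write":
        return MASTER                       # all writes go to the CERN master
    return READ_REPLICAS.get(site, MASTER)  # fall back to master if no replica

print(pick_endpoint("CNAF", "read"))   # local read-only replica
print(pick_endpoint("RAL", "read"))    # falls back to CERN master
print(pick_endpoint("CNAF", "write"))  # CERN master
```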

  26. All Ten ATLAS Tier-1 Sites in Production Operation • Leveraging the 3D Project infrastructure, ATLAS Conditions DB worldwide replication is now in production with real data (from detector commissioning) and data from MC simulations: • Snapshot of real-time monitoring of 3D operations on EGEE Dashboard: • Using 3D infrastructure ATLAS is running a ‘mini Calibration Data Challenge’: • regular conditions data updates on Online RAC, testing propagation to Offline RAC and further to ten Tier-1s • since April ~2500 runs, 110 GB of COOL data replicated to Tier-1s at rates of 1.9 GB/day These successful ATLAS operations contributed to a major WLCG milestone Hans von der Schmitt Alexandre Vaniachine

  27. Database Operations Monitoring is in Place • ATLAS database applications require a robust operational infrastructure for data replication between online and offline at Tier-0, and for the distribution of the offline data to Tier-1 and Tier-2 computing centers • Monitoring is critical to accomplish that: Conditions DB file replication monitoring (credits: Alexei Klimentov); ATLAS service monitoring (credits: Florbela Viegas) Hans von der Schmitt Alexandre Vaniachine

  28. ATLAS Targets – Scalability Tests at T1s: CNAF, Bologna & CC IN2P3, Lyon • ATLAS Oracle scalability tests indicate that the WLCG 3D capacities in deployment for ATLAS are in the ballpark of what ATLAS requires • Various realistic conditions data workload combinations were used Alexandre Vaniachine

  29. In the Ballpark • We estimate that ATLAS daily reconstruction and/or analysis job rates will be in the range of 100,000 to 1,000,000 jobs/day • Current ATLAS production finishes up to 55,000 jobs/day • For each of the ten Tier-1 centers that corresponds to rates of 400 to 4,000 jobs/hour • For the many Tier-1s pledging ~5% of capacity (vs. 1/10th of the capacity) that corresponds to rates of 200 to 2,000 jobs/hour (worked out below) • Most of these will be analysis or simulation jobs, which do not need so much Oracle Conditions DB access • Thus, our results from the initial scalability tests are promising • We got initial confirmation that the ATLAS capacities request to WLCG (3-node clusters at all Tier-1s) is close to what will be needed for reprocessing in the first year of ATLAS operations Richard Hawkings Alexandre Vaniachine
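
The per-Tier-1 rates quoted above follow directly from the daily totals; a short worked calculation reproducing the slide's numbers:

```python
# Worked arithmetic behind the quoted per-Tier-1 job rates.
daily_jobs_low, daily_jobs_high = 100_000, 1_000_000  # expected ATLAS jobs/day
n_tier1 = 10

# Even 1/10 share: each Tier-1 takes a tenth of the daily jobs.
per_t1_low = daily_jobs_low / n_tier1 / 24    # ~417 jobs/hour
per_t1_high = daily_jobs_high / n_tier1 / 24  # ~4167 jobs/hour
print(f"1/10 share: {per_t1_low:.0f} - {per_t1_high:.0f} jobs/hour")  # slide rounds to 400 - 4,000

# A Tier-1 pledging ~5% of capacity sees roughly half of that.
print(f"~5% pledge: {per_t1_low/2:.0f} - {per_t1_high/2:.0f} jobs/hour")  # ~200 - 2,000
```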

  30. Frontier Performance – Overview [architecture diagram: the online HLT filter farm at Point 5 with HLT FroNTier server(s), squid hierarchy and ORCON, connected over the wide area network to the offline Tier-0 farm, the offline FroNTier launchpad, squids and ORCOFF]

  31. Specific Challenges • HLT (High Level Trigger) • Startup time for Cal/Ali < 10 seconds • Simultaneous • Uses a hierarchy of squid caches • Tier-0 (prompt reconstruction) • Startup time for conditions load < 1% of total job time • Usually staggered • DNS round robin should scale to 8 squids (* worst case scenario) Frontier Performance
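
The "DNS round robin should scale to 8 squids" point can be illustrated with a toy model that rotates client requests over the squid pool; the host names below are hypothetical, and in reality the rotation is done by the DNS resolver rather than application code.

```python
# Toy illustration of DNS round-robin spreading Frontier/squid load.
# Host names are hypothetical; in production the rotation is done by DNS.
from collections import Counter
from itertools import cycle

squids = [f"squid{i:02d}.example.cern.ch" for i in range(1, 9)]  # pool of 8
rotation = cycle(squids)

def next_squid():
    """Return the next squid in round-robin order, as a resolver would."""
    return next(rotation)

# 80 Tier-0 job startups spread evenly: each squid serves 10 of them.
load = Counter(next_squid() for _ in range(80))
print(load.most_common(3))
```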

  32. Support Infrastructure • Support levels (escalation path): HelpDesk / GGUS → Operator → Service Manager On Duty → SysAdmin → CASTOR Service Expert → CASTOR Developer

  33. Number of calls per week (plus ~10 calls via support lists, direct e-mails, phone calls, …) • HelpDesk / GGUS: 127 • Operator: 18 • Service Manager On Duty: 6 • SysAdmin: 5 • CASTOR Service Expert: 0.5 • CASTOR Developer

  34. Experiment Readiness Panel (Sep 1)

  35. Conclusions on m/w & Services • iSGTW Feature – "ATLAS: the data chain works": This month particle physics experiment ATLAS went "end-to-end" for the first time. … And this month, for the first time, ATLAS proved that this data distribution – from the LHC to physicists across the globe – will be possible. • Middleware & services: initial goals were over-ambitious – but we now have basic functionality, tools, services • SRM 2.2 is late – and storage management is hard • Experiments have to live with the functionality that we have now • Usage: experiments are running large numbers of jobs – despite their (justified) complaints • And transferring large amounts of data – though not always to where they want it • ATLAS has taken cosmic data from the detector to analysis at Tier-2s • End-users are beginning to run analysis jobs – but sites need to understand much better how analysis will be done during the first couple of years, and what the implications are for data

  36. The Worldwide LHC Computing Grid – How will we address problems?

  37. Scalability: • 5-6 x needed for resource capacity, number of jobs • 2-3 x needed for data transfer • Live for now with the functionality we have • Need to understand better how analysis will be done Reliability: • Not yet good enough • Data transfer is still the most worrying – despite many years of planning and testing • Many errors → complicated recovery procedures • Many sources of error – storage systems, site operations, experiment data management systems, databases, grid middleware and services, networks, … • Hard to get to the roots of the problems

  38. WLCG Tier0 Service Review • Concentrates on Grid services – needs to be extended to include "Critical Services" as seen by the experiments (CMS link) • This is the viewpoint that counts • Which also includes non-Grid services, e.g. AFS etc. (Indico? Phones??) • Shows varying degrees of robustness to glitches and common service interventions • Some clear areas for improvement • Establishes a clear baseline on which we can build – using a small number of well-understood techniques – to provide services addressing experiments' needs in terms of reliability • Load-balanced servers; DB clusters; m/w support for these! • To be continued… • Extend to Tier-1s and major Tier-2s… • See November workshop on Service Reliability Issues • Presentation to OB on October 8

  39. CMS Critical Services

  40. Monitoring BOF • The progress made by the WLCG monitoring working groups was reported & discussed • The System Management Working Group is concentrating on improving support for site administrators in their everyday work • Better documentation, sharing of information, improved help for troubleshooting • Information is made available via the twiki page: http://www.sysadmin.hep.ac.uk/ • The Grid Service Monitoring Working Group has been working on a Nagios-based prototype for monitoring of Grid services at the local sites; this work is progressing well • There was a discussion about the calculation of site availability based on the results of SAM tests; experiments expressed their concern that site availability does not take experiment-specific tests into account (see the sketch below) • The System Analysis Working Group reported progress on the monitoring of jobs submitted via condor_g Julia Andreeva
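
The availability discussion boils down to a ratio of passed tests over a time window; the sketch below shows that calculation and why adding experiment-specific tests changes the number. Test names and outcomes are hypothetical, and real SAM aggregation also weights by service criticality and averages over time.

```python
# Illustrative site-availability calculation from SAM-style test results.
# Test names and outcomes are hypothetical; real SAM aggregation also
# weights by service criticality and averages over time.
ops_tests = {"CE-job-submit": True, "SE-put-get": True, "BDII-query": True}
experiment_tests = {"VO-sw-install": False, "VO-data-access": True}

def availability(results):
    """Fraction of passed tests in the window."""
    return sum(results.values()) / len(results)

merged = {**ops_tests, **experiment_tests}
print(f"ops-only availability:               {availability(ops_tests):.2f}")  # 1.00
print(f"including experiment-specific tests: {availability(merged):.2f}")     # 0.80
```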

  41. In a Nutshell… (proposal!!!)

  42. CCRC'08 – Motivation and Goals • What if: LHC is operating and the experiments take data? All experiments want to use the computing infrastructure simultaneously? The data rates and volumes to be handled at the Tier-0, Tier-1 and Tier-2 centers are the sum of ALICE, ATLAS, CMS and LHCb, as specified in the experiments' computing models? • Each experiment has done data challenges, computing challenges, tests, dress rehearsals, … at a schedule defined by the experiment. This will stop once LHC starts to operate: we will no longer be the masters of our own schedule. We need to prepare for this … together • A combined challenge by all experiments should be used to demonstrate the readiness of the WLCG computing infrastructure before the start of data taking, at a scale comparable to the data taking in 2008 • This should be done well in advance of the start of data taking in order to identify flaws and bottlenecks and allow time to fix them • We must do this challenge as a WLCG collaboration: centers and experiments WLCG Workshop: Common VO Challenge

  43. CCRC'08 – Proposed Scope (CMS) • Test data transfers at 2008 scale: experiment site to CERN mass storage; CERN to Tier-1 centers; Tier-1 to Tier-1 centers; Tier-1 to Tier-2 centers; Tier-2 to Tier-2 centers • Test storage-to-storage transfers at 2008 scale: required functionality; required performance • Test data access at Tier-0 and Tier-1 at 2008 scale: CPU loads should be simulated in case this impacts data distribution and access • Tests should be run concurrently • CMS proposes to use artificial data, which can be deleted after the challenge WLCG Workshop: Common VO Challenge

  44. Constraints & Preconditions • Mass storage systems are prepared: SRM 2.2 deployed at all participating sites; CASTOR, dCache and other data management systems installed with the appropriate version • Data transfers are commissioned – for CMS only commissioned links can be used • Participating centers have 2008 capacity WLCG Workshop: Common VO Challenge

  45. CCRC'08 – Proposed Schedule • Duration of challenge: 4 weeks • Based on the current CMS schedule there is a window of opportunity during February 2008 • In March a full detector cosmics run is scheduled, with all components and magnetic field – this is the first time with the final detector geometry • Document performance and lessons learned within 4 weeks WLCG Workshop: Common VO Challenge

  46. CCRC'08 – Proposed Organization • Coordination: (1 + 4 + nT1) • WLCG overall coordination (1): maintains the overall schedule; coordinates the definition of goals and metrics; coordinates regular preparation meetings; during CCRC'08 coordinates operations meetings with experiments and sites; coordinates the overall success evaluation • Each experiment (4): coordinates the definition of the experiment's goals and metrics; coordinates the experiment's preparations, including applications for load driving (certified and tested before the challenge); during CCRC'08 coordinates the experiment's operations; coordinates the experiment's success evaluation • Each Tier-1 (nT1): coordinates the Tier-1 preparation and participation; ensures the readiness of the centre at the defined schedule; contributes to the summary document WLCG Workshop: Common VO Challenge

  47. CCRC – Summary • The need for a Common Computing Readiness Challenge has been clearly stated by ATLAS & CMS • Ideally, ALICE & LHCb should also participate at full nominal 2008 pp rates • The goals & requirements – such as production SRM v2.2 – are common • Two slots have been proposed: Feb & May '08 • Given the goals & importance of this challenge, we foresee using both slots • Feb: pre-challenge; ensure pre-conditions are met; identify potentially problematic areas • Can be <100% successful • May: THE challenge • Must succeed! • Need to prepare carefully – which means thorough testing of all components and successive integration prior to the full challenge • In addition to the technical requirements, we must ensure adequate {carbon, silicon} resources are available throughout these periods • Neither of these slots is optimal in this respect, but when is? • Need to understand how to provide production coverage at all times! Must be pragmatic – focus on what can (realistically) be expected to work!

  48. {Service, Site, Experiment} Readiness – Summary • Significant progress has been made in the last year on both 'residual services' and the complete WLCG 'service stack' • We need to make similar (or greater) progress in the coming year on site / experiment readiness! • We have shown that we can do it – but it's hard work and requires a concentrated effort • e.g. service challenges; residual service delivery, … • Data movement (management) continues to be the key area of concern • SRM v2.2 production deployment is now (finally) imminent!

  49. Are we getting there? Slowly! • After so many years – the beams are now on the horizon & we can all focus on the contribution that we can make to extracting the physics • Need continuous testing from now until first beams • Driven by experiments with realistic scenarios, good monitoring and measurements, and the pro-active participation of sites, developers and storage experts

  50. Conclusions • The “Residual Services” identified at last year’s LHCC Comprehensive Review have (largely) been delivered – or are in the final stages of preparation • We are now in a much better position wrt monitoring, accounting and reporting – running the service • Data taking with cosmics, Dress Rehearsals and CCRC’08 will further shake-down the service • Ramping up in reliability, throughput and capacity are key priorities • We are on target – but not ahead – for first pp collisions • A busy – but rewarding – year ahead!
