The WLCG Service Starts Here… SC4 Production == WLCG Pilot --- Jamie Shiers IT-GD Group Meeting, July 7th 2006
My (CERN) Background • Started at CERN in CO group back in 1984 • Started at CERN as student in 1978… • Since then, we’ve had the following major accelerator startups: • pp collider at CERN; • LEP; • FNAL collider runs I & II; • SLC at SLAC; • (others too…) • Enjoy the calm, relaxing environment you currently enjoy.. • (The quiet before the storm…)
The Worldwide LHC Computing Grid • Purpose • Develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments • Ensure the computing service • … and common application libraries and tools • Phase I – 2002-05 - Development & planning • Phase II – 2006-2008 – Deployment & commissioning of the initial services The solution!
Overview • SC4 Phases: • Throughput Phase (April) • May was reserved for gLite 3.0 upgrades • Service Phase (June – September inclusive) • Experiment production activities / requirements • WLCG Production Service • In principle October on… • ATLAS CSC / CMS CSA06 start early / mid September • Some comments on Tier2 workshop • Much more complete review at Wednesday’s GDB WLCG Service Challenges: Overview and Outlook
CHEP 92 – the Birth of OO in HEP? • Wide-ranging discussions on the future of s/w development in HEP • A number of proposals presented leading to (DRDC/LCRB/LCB): • RD41 – MOOSE [ Kors Bos ] • The applicability of OO to offline particle physics code • RD44 – GEANT4 [ Simone Giani ] • Produce a global object-oriented analysis and design of an improved GEANT simulation toolkit for HEP • RD45 – A Persistent Object Manager for HEP [ JDS ] • (and later also LHC++ (subsequently ANAPHE)) [ JDS ] • ROOT [ René ] • Started working on LHC Computing full-time!
LCG Service Deadlines (timeline 2006 – 2008) • Pilot Service – stable service from 1 June 06, i.e. we have already taken off! • LCG Service in operation – 1 Oct 06; over the following six months ramp up to full operational capacity & performance • LCG service commissioned – 1 Apr 07, ~6 months prior to first collisions • Timeline markers: cosmics, first physics, full physics run • Updated LHC schedule coming…
The LHC Machine • Some clear indications regarding LHC startup schedule and operation are now available • Press release issued two weeks ago • Comparing our (SC) actual status with ‘the plan’, we are arguably one year late! • Some sites cheerfully claim two… • We were supposed to test all offline Use Cases of experiments during SC3 production phase (Sep 2005) • We still have an awful lot of work to do • Not the time to relax!
Press Release - Extract • CERN confirms LHC start-up for 2007 • Geneva, 23 June 2006. First collisions in the … LHC … in November 2007 said … Lyn Evans at the 137th meeting of the CERN Council ... • A two month run in 2007, with beams colliding at an energy of 0.9 TeV will allow the LHC accelerator and detector teams to run-in their equipment ready for a full 14 TeV energy run to start in Spring 2008 • Service Challenge ’07? • The schedule announced today ensures the fastest route to a high-energy physics run with substantial quantities of data in 2008, while optimising the commissioning schedules for both the accelerator and the detectors that will study its particle collisions. It foresees closing the LHC’s 27 km ring in August 2007 for equipment commissioning. Two months of running, starting in November 2007, will allow the accelerator and detector teams to test their equipment with low-energy beams. After a winter shutdown in which commissioning will continue without beam, the high-energy run will begin. Data collection will continue until a pre-determined amount of data has been accumulated, allowing the experimental collaborations to announce their first results.
LHC Commissioning Expect to be characterised by: • Poorly understood detectors, calibration, software, triggers etc. • Lower than design luminosity & energy (~injection energy) • Most likely no AOD or TAG from first pass – but ESD will be larger? • Possible large impact on Tier2s – RAW and ESD samples to Tier2s? • The pressure will be on to produce some results as soon as possible! • There will not be sufficient resources at CERN to handle the load • We need a fully functional distributed system - ENTER THE GRID • There are many Use Cases we did not yet clearly identify • Nor indeed test --- this remains to be done in the coming months!
Breakdown of a normal year (from Chamonix XIV) • 7-8 service upgrade slots? • ~140-160 days for physics per year • Not forgetting ion and TOTEM operation • Leaves ~100-120 days for proton luminosity running? • Efficiency for physics 50%? • ~50 days ~ 1200 h ~ 4 × 10^6 s of proton luminosity running / year (R. Bailey, Chamonix XV, January 2006)
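The arithmetic behind the last figure on this slide can be checked with a short sketch. It assumes that the 50% efficiency applies directly to the scheduled days; the function name `effective_running` is illustrative and not from the talk.

```python
# Rough check of the estimate of proton luminosity running per year.
# Assumption: efficiency simply scales the number of scheduled days.
HOURS_PER_DAY = 24
SECONDS_PER_HOUR = 3600


def effective_running(days_scheduled, efficiency):
    """Return (days, hours, seconds) of effective luminosity running."""
    days = days_scheduled * efficiency
    hours = days * HOURS_PER_DAY
    seconds = hours * SECONDS_PER_HOUR
    return days, hours, seconds


if __name__ == "__main__":
    # The slide quotes ~100-120 scheduled days at ~50% efficiency.
    for scheduled in (100, 120):
        days, hours, seconds = effective_running(scheduled, 0.5)
        print(f"{scheduled} days scheduled: {days:.0f} days, "
              f"{hours:.0f} h, {seconds:.2e} s")
```

With 100 scheduled days this gives 50 days, 1200 h and about 4.3 × 10^6 s, consistent with the rounded figures quoted on the slide.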
Startup physics (ALICE) Can publish two papers 1-2 weeks after LHC startup • Multiplicity paper: • Introduction • Detector system • - Pixel (& TPC) • Analysis method • Presentation of data • - dN/dη and mult. distribution (√s dependence) • Theoretical interpretation • - ln²(s) scaling?, saturation, multi-parton inter… • Summary • pT paper outline: • Introduction • Detector system • - TPC, ITS • Analysis method • Presentation of data • - pT spectra and pT-multiplicity correlation • Theoretical interpretation • - soft vs hard, mini-jet production… • Summary
LCG Service Model (Les Robertson) Tier0 – the accelerator centre (that’s us) • Data acquisition & initial processing • Long-term data curation • Distribution of data to Tier1s • This is where FTS comes in… Tier1 – “online” to the data acquisition process, high availability • Managed Mass Storage – grid-enabled data service • Data intensive analysis • National, regional support • Continual reprocessing activity (or is that continuous?) Tier1 centres: Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Didcot); US – FermiLab (Illinois), Brookhaven (NY); Canada – Triumf (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF (Amsterdam) Tier2 – ~100 centres in ~40 countries • Simulation • End-user analysis – batch and interactive
CPU Disk Tape
SC4 T0-T1: Results • Target: sustained disk – disk transfers at 1.6GB/s out of CERN at full nominal rates for ~10 days • Result: just managed this rate on Good Sunday (1/10) [Plot annotations: Easter w/e; target 10 day period]
Easter Sunday: > 1.6GB/s including DESY GridView reports 1614.5MB/s as daily average
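To put these rates in context, the sketch below converts a sustained transfer rate into a total data volume. It assumes decimal units (1 GB = 1000 MB, 1 TB = 10^6 MB), which the slides do not specify, and the name `volume_tb` is purely illustrative.

```python
# Convert a sustained transfer rate into a total data volume.
# Assumption: decimal units, i.e. 1 TB = 10**6 MB.
SECONDS_PER_DAY = 24 * 3600


def volume_tb(rate_mb_per_s, days):
    """Data moved in TB at a constant rate (MB/s) over a number of days."""
    return rate_mb_per_s * SECONDS_PER_DAY * days / 1e6


if __name__ == "__main__":
    # SC4 target: 1.6 GB/s out of CERN, sustained for ~10 days.
    print(f"Target, 10 days: {volume_tb(1600, 10):.1f} TB")
    # Daily average reported by GridView for Easter Sunday.
    print(f"Easter Sunday:   {volume_tb(1614.5, 1):.1f} TB")
```

Under these assumptions the 10-day target corresponds to roughly 1.4 PB moved out of CERN, and the Easter Sunday average to just under 140 TB in one day.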
Service Challenges - Reminder • Purpose • Understand what it takes to operate a real grid service – run for weeks/months at a time (not just limited to experiment Data Challenges) • Trigger and verify Tier-1 & large Tier-2 planning and deployment – tested with realistic usage patterns • Get the essential grid services ramped up to target levels of reliability, availability, scalability, end-to-end performance • Four progressive steps from October 2004 thru September 2006 • End 2004 – SC1 – data transfer to subset of Tier-1s • Spring 2005 – SC2 – include mass storage, all Tier-1s, some Tier-2s • 2nd half 2005 – SC3 – Tier-1s, >20 Tier-2s – first set of baseline services • Jun-Sep 2006 – SC4 – pilot service • Autumn 2006 – LHC service in continuous operation – ready for data taking in 2007
SC4 – Executive Summary We have shown that we can drive transfers at full nominal rates to: • Most sites simultaneously; • All sites in groups (modulo network constraints – PIC); • At the target nominal rate of 1.6GB/s expected in pp running In addition, several sites exceeded the disk – tape transfer targets • There is no reason to believe that we cannot drive all sites at or above nominal rates for sustained periods. But • There are still major operational issues to resolve – and most importantly – a full end-to-end demo under realistic conditions
Experiment Plans for SC4 • All 4 LHC experiments will run major production exercises during WLCG pilot / SC4 Service Phase • These will test all aspects of the respective Computing Models plus stress Site Readiness to run (collectively) full production services • These plans have been assembled from the material presented at the Mumbai workshop, with follow-up by Harry Renshall with each experiment, together with input from Bernd Panzer (T0) and the Pre-production team, and summarised on the SC4 planning page. • We have also held a number of meetings with representatives from all experiments to confirm that we have all the necessary input (all activities: PPS, SC, Tier0, …) and to spot possible clashes in schedules and / or resource requirements. (See “LCG Resource Scheduling Meetings” under LCG Service Coordination Meetings). • The conclusions of these meetings have been presented to the weekly operations meetings and the WLCG Management Board in written form (documents, presentations) • See the SC4 Combined Action List for more information…
Summary of Experiment Plans • All experiments will carry out major validations of both their offline software and the service infrastructure during the next 6 months • There are significant concerns about the state-of-readiness (of everything…) – not to mention manpower at ~all sites + in experiments • I personally am considerably worried – seemingly simple issues, such as setting up LFC/FTS services, publishing SRM end-points etc. have taken O(1 year) to be resolved (across all sites). • and [still] don’t even mention basic operational procedures • (Some big improvements here recently…) • And all this despite heroic efforts across the board • But – oh dear – your planet has just been blown up by the Vogons [ So long and thanks for all the fish ]
ATLAS SC plans/requirements • Running now till 7 July to demonstrate the complete ATLAS DAQ and first pass processing with distribution of raw and processed data to Tier 1 sites at the full nominal rates. Will also include data flow to some Tier2 sites and full usage of the ATLAS Distributed Data Management system, DQ2. Raw data to go to tape, processed to disk only. Sites to delete from disk and tape • After summer investigate scenarios of recovery from failing Tier 1 sites and deploy cleanup of pools at Tier 0. • Later, test distributed production, analysis and reprocessing. • DQ2 has a central role with respect to ATLAS Grid tools • ATLAS will install local DQ2 catalogues and services at Tier 1 centres • ATLAS define a region of a Tier 1 and well network connected sites that will depend on the Tier 1 DQ2 catalogue. • Expect such (volunteer) Tier 2 to join SC when T0/T1 runs stably • ATLAS will delete DQ2 catalogue entries • Require VO box per Tier 0 and Tier 1 – done • Require LFC server per Tier 1 – done, must be monitored • Require FTS server and validated channels per Tier 0 and Tier 1 – close • Require ‘durable’ MSS disk area at Tier 1 – few sites have it. To be followed up by ATLAS and SC team. • ATLAS would like their T1 sites to attend (VRVS) their weekly (Wed at 14.00) SC review meeting during this running phase. No commitments were made.
ALICE SC Plans • Validation of the LCG/gLite workload management services: ongoing • Stability of the services is fundamental for the entire duration of the exercise • Validation of the data transfer and storage services • 2nd phase: end July/August T0 to T1 (recyclable tape) at 300 MB/sec • The stability and support of the services have to be assured during and beyond these throughput tests • Validation of the ALICE distributed reconstruction and calibration model: August/September reconstruction at Tier 1 • Integration of all Grid resources within one single – interfaces to different Grids (LCG, OSG, NDGF) will be done by ALICE • End-user data analysis: September/October
CMS SC Plans/Requirements • In September/October run CSA06, a 50 million event exercise to test the workflow and dataflow associated with the data handling and data access model of CMS • Now till end June • Continue to try to improve file transfer efficiency. Low rates and many errors now. • Attempt to hit 25k batch jobs per day and increase the number and reliability of sites aiming to obtain 90% efficiency for job completion • In July • Demonstrate CMS analysis submitter in bulk mode with the gLite RB • In July and August • 25M events per month with the production systems • Second half of July participate in multi-experiment FTS Tier-0 to Tier-1 transfers at 150 MB/sec out of CERN • Continue through August with transfers • Requirements: • Improve Tier-1 to Tier-2 transfers and the reliability of the FTS channels. • CMS are exercising the channels available to them, but there are still issues with site preparation and reliability • the majority of sites are responsive, but there is a lot of work for this summer • Require to deploy the LCG-3D infrastructure • From late June deploy Frontier for SQUID caches • All participating sites should be able to complete the CMS workflow and metrics (as defined in the CSA06 documentation)
LHCb SC Plans/Requirements • Will start DC06 challenge at beginning of July using LCG production services and run till end August: • Distribution of raw data from CERN to Tier 1s at 23 MB/sec • Reconstruction/stripping at Tier 0 and Tier 1 • DST distribution to CERN and Tier 1s • Job prioritisation will be dealt with by LHCb but it is important jobs are not delayed by other VO activities • Preproduction for this is ongoing with 125 TB of MC data at CERN • Production will go on throughout the year for an LHCb physics book due in 2007 • Require SRM 1.1 based SEs separated for disk and MSS at all Tier 1 as agreed in Mumbai and FTS channels for all CERN-T1s • Data access directly from SE to ROOT/POOL (not just GridFTP/srmcp). For NIKHEF/SARA (firewall issue) this could perhaps be done via GFAL. • Require VO boxes at Tier 1 – so far at CERN, IN2P3, PIC and RAL. Need CNAF, NIKHEF and GridKa • Require central LFC catalogue at CERN and read-only copy at certain T1 (currently setting up at CNAF) • DC06-2 in Oct/Nov requires T1s to run COOL and 3D database services
Experiment Summary • All experiments will be ramping up their activity between now and first collisions • The period of ‘one experiment having priority’ – as was done in SC3 and for ATLAS until this weekend – is over • It is full, concurrent production from now on!
Workshop Feedback • >160 people registered and (a few more) participated! • This is very large for a workshop – about same as Mumbai • Some comments related directly to this (~40 replies received so far) • Requests for more: • Tutorials, particularly “hands-on” • Direct Tier2 involvement • Feedback sessions, planning concrete actions etc. • Active help from Tier2s in preparing / defining future events would be much appreciated • Please not just the usual suspects… • See also Duncan Rand’s talk to GridPP16 • Some slides included below
Tutorial Rating – 10=best
Workshop Rating
Workshop Comments “Very very inspiring” “Hope to do it again soon” “Tutorials were very useful” “The organisation was excellent” “Discussions were very enlightening” “Information collected together in one place” • Many positive comments on all sessions of the workshop and tutorials • Possibility to discuss with other sites and the developers also much appreciated • Sessions which some liked least others liked most! • I hope that the people who didn’t reply also feel the same!
Workshop Summary • Workshops have been well attended and received • Feedback will help guide future events • Need to improve on Tier1+Tier2 involvement • Preparing agenda / chairing sessions / giving talks etc. • Strong demand for more tutorials • Hands-on where possible / appropriate • Thanks to everyone for their contribution to both workshop and tutorials!
The Service Challenge programme this year must show that we can run reliable services • Grid reliability is the product of many components – middleware, grid operations, computer centres, … • Target for September (Too modest? Too ambitious?) • 90% site availability • 90% user job success • Requires a major effort by everyone to monitor, measure, debug • First data will arrive next year • NOT an option to get things going later
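The remark that grid reliability is the product of many components can be illustrated with a small sketch. It assumes that components fail independently and that a task needs all of them to work; neither assumption is stated in the talk, and the function names are hypothetical.

```python
# Illustration: end-to-end reliability as a product of component reliabilities.
# Assumptions (not from the talk): failures are independent, and a task
# succeeds only if every component it depends on is working.
from math import prod


def combined_reliability(components):
    """Probability that all components work, given each one's reliability."""
    return prod(components)


def required_per_component(target, n_components):
    """Reliability each of n equal components needs to reach the overall target."""
    return target ** (1.0 / n_components)


if __name__ == "__main__":
    # Two stages at 90% each (e.g. site availability, then job success).
    print(f"Two stages at 90%: {combined_reliability([0.9, 0.9]):.2f}")
    # Per-component reliability needed for 90% overall across 5 components.
    print(f"Needed for 90% over 5 components: {required_per_component(0.9, 5):.3f}")
```

Under these assumptions, two independent 90% stages give about 81% overall, and reaching 90% across five components would need each one to be close to 98% reliable, which is one way to read the "Too modest? Too ambitious?" question.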