Results of the LHCb experiment Data Challenge 2004
Joël Closier, CERN / LHCb
CHEP'04
The LHCb DC04 team
• DIRAC
  • Andrei Tsaregorodtsev, Vincent Garonne, Ian Stokes-Rees
• Production management
  • Joel Closier, Ricardo Graciani (LCG), Johan Blouw, Andrew Pickford … and the LHCb site managers
• LHCb Bookkeeping, Monitoring & accounting
  • Markus Frank, Carmine Cioffi, Manuel Sanchez, Ruben Vizcaya
• LCG-LHCb liaison
  • Flavia Donno, Roberto Santinelli
• The LCG-GDA team
  • Ian Bird, Laurence Field, Maarten Litmaath, Markus Schulz, David Smith, Zdenek Sekera, Marco Serra…
Result of LHCb DC04
Outline
• Aims of the LHCb Data Challenge 2004
• Production model
• Performance of DC'04
• Lessons from DC'04
• Conclusions
LHCb DC'04 aims
• Main goal: gather information to be used for writing the LHCb computing Technical Design Report
• Robustness test of the LHCb software and production system
  • Using software as realistic as possible in terms of performance
• Test of the LHCb distributed computing model
  • Including distributed analyses
  • Realistic test of the analysis environment, which needs realistic analyses
• Incorporation of the LCG application area software into the LHCb production environment
• Use of LCG resources (at least 50% of the production capacity)
• 3 phases
  • Production: MC simulation and reconstruction
  • Stripping: event pre-selection
  • Analysis
LHCb DC04 aims (cont'd)
• Physics goals
  • HLT studies, consolidating efficiencies
  • Background/signal studies, consolidating background estimates and background properties
• Requires a quantitative increase in the number of signal and background events compared to DC03:
  • 30 × 10⁶ signal events
  • 15 × 10⁶ specific background events
  • 125 × 10⁶ background events (B inclusive + minimum bias, ratio 1:1.8)
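The event targets above can be turned into a small worked calculation. This is only a sketch: it assumes that "ratio 1:1.8" means B inclusive : minimum bias (the slide does not say which way round), and the helper `split_by_ratio` is a name introduced here for illustration.

```python
# Event targets quoted on the slide (in events).
SIGNAL = 30e6
SPECIFIC_BACKGROUND = 15e6
GENERIC_BACKGROUND = 125e6  # B inclusive + minimum bias

# Assumption: "ratio 1:1.8" is read as B inclusive : minimum bias.
RATIO_B_INCLUSIVE = 1.0
RATIO_MIN_BIAS = 1.8


def split_by_ratio(total, part_a, part_b):
    """Split `total` into two shares proportional to part_a : part_b."""
    whole = part_a + part_b
    return total * part_a / whole, total * part_b / whole


b_inclusive, min_bias = split_by_ratio(
    GENERIC_BACKGROUND, RATIO_B_INCLUSIVE, RATIO_MIN_BIAS
)
total_target = SIGNAL + SPECIFIC_BACKGROUND + GENERIC_BACKGROUND

print(f"B inclusive : {b_inclusive / 1e6:.1f} M events")
print(f"minimum bias: {min_bias / 1e6:.1f} M events")
print(f"total target: {total_target / 1e6:.0f} M events")
```

Under this reading the generic background splits into roughly 44.6 M B-inclusive and 80.4 M minimum-bias events, and the three targets add up to 170 M events.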
Production
• Production done with the DIRAC system
  • Track 4 - Distributed Computing Services: id 377
• DIRAC is deployed to each site participating in DC'04
• Central services supporting the Data Challenge
  • Production database
  • Workload Management System
  • Monitoring, Accounting
  • Bookkeeping, ALIEN File Catalog
• Technologies used by the production services
  • C++, Python, XML-RPC
  • Oracle and MySQL databases
Non-LCG site
• DIRAC deployment (CE)
• DIRAC JobAgent:
  • Check CE status
  • Request a DIRAC task (JDL)
  • Install LHCb software if needed
  • Submit the job to the local batch system
  • Execute task: check steps
  • Upload results
• DIRAC TransferAgent
LCG site
• Input SandBox: small bash script (~50 lines)
  • Check environment: site, hostname, CPU, memory, disk space…
  • Install DIRAC: download DIRAC tarball (~1 MB), deploy DIRAC on WN
  • Execute the job: request a DIRAC task (LHCb simulation job)
  • Execute task: check steps
  • Upload results
• Retrieval of SandBox
• Analysis of retrieved Output SandBox
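The LCG-site sequence above (check environment, install DIRAC, request a task, execute its steps, upload results) can be sketched as a toy model. This is not the DIRAC code: `Task`, `PilotAgent` and all method names are hypothetical, and the central Workload Management System is replaced by an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A hypothetical DIRAC task: a name plus an ordered list of steps."""
    name: str
    steps: list


@dataclass
class PilotAgent:
    """Toy model of the agent run on an LCG worker node (illustrative names)."""
    task_queue: list  # stands in for the central Workload Management System
    installed: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def check_environment(self):
        # The real script inspects site, hostname, CPU, memory, disk space...
        self.log.append("check environment")
        return True

    def install_dirac(self):
        # Stands in for downloading and unpacking the DIRAC tarball.
        self.installed.add("DIRAC")
        self.log.append("install DIRAC")

    def request_task(self):
        # Pull model: the agent asks for work; nothing is pushed to it.
        return self.task_queue.pop(0) if self.task_queue else None

    def run(self):
        if not self.check_environment():
            return None
        self.install_dirac()
        task = self.request_task()
        if task is None:
            self.log.append("no task available")
            return None
        for step in task.steps:
            self.log.append(f"execute {step}")
        self.log.append("upload results")
        return task.name
```

A usage example: `PilotAgent(task_queue=[Task("sim-001", ["simulation", "reconstruction"])]).run()` returns `"sim-001"`, while an agent started against an empty queue returns `None` without executing anything.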
Strategy
• Test sites:
  • Each site is tested with special and production-like jobs.
• Enable site:
  • DIRAC Workload Management System.
• Always keep jobs in the queues
• DIRAC: run the local agent continuously:
  • Via cron jobs
  • Via runsv
  • Via daemon
• LCG: submit jobs continuously:
  • Via cron job on the User Interface
PS: LCG is considered as a site from the DIRAC point of view
Data Storage
• All output of the reconstruction phase (DST) is sent to CERN (as Tier0)
• Intermediate files are not kept.
• DSTs are also stored at one of our 5 Tier1 centres
  • CNAF (Italy)
  • Karlsruhe (Germany)
  • Lyon (France)
  • PIC (Spain)
  • RAL (United Kingdom)
DC'04 performance
Phase 1 results
[Plot: produced events over time. Annotations: 186 M produced events; Phase 1 completed; 3-5 × 10⁶/day; LCG restarted; LCG paused; LCG in action; 1.8 × 10⁶/day; DIRAC alone]
Daily performance
[Plot: daily production. Annotation: 5 million/day]
Sites involved
• 20 DIRAC sites
• Used resources from non-LHCb countries, e.g. Hungary produced ~2M events
• 43 LCG sites (8 also DIRAC sites)
Simultaneous jobs (a snapshot)
TIER storage
DIRAC-LCG: events share
• 50% of events were produced using LCG
DIRAC-LCG: CPU share
• 376 CPU·years
• May: 88% : 12%, 11% of DC'04
• Jun: 78% : 22%, 25% of DC'04
• Jul: 75% : 25%, 22% of DC'04
• Aug: 26% : 74%, 42% of DC'04
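The monthly figures above can be combined into an overall CPU share by weighting each month with its fraction of the total DC'04 CPU. The sketch below assumes that each pair is ordered DIRAC : LCG, following the slide title; the slide itself does not label the two numbers.

```python
# Monthly figures from the slide:
# (DIRAC share, LCG share, month's fraction of all DC'04 CPU).
# Assumption: each "a% : b%" pair is ordered DIRAC : LCG.
MONTHS = {
    "May": (0.88, 0.12, 0.11),
    "Jun": (0.78, 0.22, 0.25),
    "Jul": (0.75, 0.25, 0.22),
    "Aug": (0.26, 0.74, 0.42),
}
TOTAL_CPU_YEARS = 376  # total quoted on the slide

# The monthly fractions should add up to the whole data challenge.
total_weight = sum(weight for _, _, weight in MONTHS.values())

# Weighted average of the monthly shares.
dirac_share = sum(d * w for d, _, w in MONTHS.values()) / total_weight
lcg_share = sum(l * w for _, l, w in MONTHS.values()) / total_weight

print(f"months cover  : {total_weight:.0%} of DC'04 CPU")
print(f"DIRAC share   : {dirac_share:.1%}")
print(f"LCG share     : {lcg_share:.1%}")
print(f"LCG CPU years : {lcg_share * TOTAL_CPU_YEARS:.0f}")
```

With that ordering the four months cover 100% of the CPU, and the weighted LCG share of CPU comes out at about 43%, dominated by August.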
LCG performance
• 211 k jobs submitted to LCG
• After running:
  • 113 k done (successful)
  • 34 k aborted
• LCG efficiency: 61%
DC'04 lessons
Lessons learnt: DIRAC
• The concept of light, customizable and simple-to-deploy agents proved to be very effective
  • Easy update procedure: bug fixes to DIRAC tools propagate quickly
  • Application software installation triggered by a running job
• Most of the central services were running on the same machine
  • Too many processes, high loads
• Improve server availability
• Improve error handling and reporting
Lessons learnt: LCG
• Improve the Output SandBox upload/retrieval mechanism:
  • Should also be available for failed and aborted jobs.
• Improve the reliability of CE status collection methods (timestamps?).
• Add intelligence on the CE or RB to detect and avoid large numbers of aborted jobs on start-up:
  • Prevent a misconfigured site from becoming a black hole.
• Need to collect LCG log info, and a tool to navigate it (including different JobIDs).
• Need a way to limit the CPU (and wall-clock) time:
  • The LCG wrapper must issue appropriate signals to the user job to allow graceful termination.
• How-to manuals:
  • Clear instructions to site managers on the procedure to shut down a site (for maintenance and/or upgrade).
• Problems with site configurations (LCG config, firewalls, gridFTP servers…)
Conclusions
• LHCb DC'04 Phase 1 is over.
• The production target was achieved:
  • 186 M events in 424 CPU years.
  • ~50% on LCG resources (75-80% in the last weeks).
• LHCb strategy successful:
  • Submitting "empty" DIRAC agents to LCG has proven to be very flexible, allowing a success rate above that of LCG alone.
• Much room for improvement, both on DIRAC and LCG
  • DIRAC needs to improve the reliability of the servers:
    • big step already made during DC.
  • LCG needs to improve the single-job efficiency:
    • ~40% aborted jobs, ~10% did the work but failed from the LCG viewpoint.
  • In both cases extra protection against external failures (network, unexpected shutdowns…) must be built in.
• Success due to dedicated support from the LCG team and DIRAC site managers
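The production total above implies an average CPU cost per event, worked out below. It is a rough sketch only: it assumes a CPU year of 365.25 days, and the slides do not state which reference CPU the "CPU years" are normalised to.

```python
# Figures quoted in the conclusions.
EVENTS = 186e6
CPU_YEARS = 424

# Assumption: one CPU year = 365.25 days of continuous running.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

cpu_seconds_per_event = CPU_YEARS * SECONDS_PER_YEAR / EVENTS

print(f"average cost: {cpu_seconds_per_event:.0f} CPU seconds per event")
```

This gives an average of about 72 CPU seconds per produced event, taken over all event types and all sites.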
Other links
• CHEP04 talks:
  • File-Metadata Management System for the LHCb Experiment
    (Track 4 - Distributed Computing Services) id 392, 27-Sep-2004 17:30
  • DIRAC Workload Management System
    (Track 5 - Distributed Computing Systems and Experiences) id 365, 29-Sep-2004 10:00
  • Grid Information and Monitoring System using XML-RPC and Instant Messaging for DIRAC
    (Track 4 - Distributed Computing Services) id 368, 29-Sep-2004 10:00
  • DIRAC - The Distributed MC Production and Analysis for LHCb
    (Track 4 - Distributed Computing Services) id 377, 30-Sep-2004 18:10
  • A Lightweight Monitoring and Accounting System for LHCb DC04 Production
    (Track 4 - Distributed Computing Services) id 388, 30-Sep-2004 17:30