ATLAS and Grid Computing RWL Jones GridPP 13 5th July 2005
ATLAS Computing Timeline (timeline runs 2003-2007; NOW = July 2005)
• POOL/SEAL release (done)
• ATLAS release 7 (with POOL persistency) (done)
• LCG-1 deployment (done)
• ATLAS complete Geant4 validation (done)
• ATLAS release 8 (done)
• DC2 Phase 1: simulation production (done)
• DC2 Phase 2: intensive reconstruction (the real challenge!) LATE!
• Combined test beams (barrel wedge) (done)
• Computing Model paper (done)
• Computing Memorandum of Understanding (done)
• ATLAS Computing TDR and LCG TDR (in progress)
• Computing System Commissioning
• Physics Readiness Report
• Start cosmic ray run
• GO!
Commissioning takes priority!
Computing TDR structure
• The TDR describes the whole Software & Computing Project as defined within the ATLAS organization:
  • major activity areas within the S&C Project
  • liaisons to other ATLAS projects
Massive productions on 3 Grids (3)
• July-September 2004: DC2 Geant4 simulation (long jobs)
  • 40% on the LCG/EGEE Grid, 30% on Grid3 and 30% on NorduGrid
• February-May 2005: Rome production
  • 70% on the LCG/EGEE Grid, 25% on Grid3, 5% on NorduGrid
• LCG/EGEE Grid resources were always difficult to saturate by "traditional" means
• A new approach (Lexor-CondorG) used Condor-G to submit directly to the sites (see the sketch below)
  • in this way the job rate was doubled on the same total available resources
  • much more efficient usage of the CPU resources
  • the same approach is now being evaluated for Grid3/OSG job submission, which also suffered from job-rate problems
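The Lexor-CondorG approach bypassed the LCG resource broker by submitting with Condor-G straight to the site gatekeepers. Below is a minimal sketch of what such a direct submission can look like; the gatekeeper address, script names and attribute choices are placeholders, not the actual Lexor configuration, and the exact submit-file attributes depend on the Condor version installed.

```python
#!/usr/bin/env python
"""Illustrative sketch only: direct Condor-G submission to a site gatekeeper,
in the spirit of the Lexor-CondorG approach (bypassing the resource broker).
Host names, file names and attribute choices are placeholders."""

import subprocess
import tempfile

# A hypothetical Globus gatekeeper / jobmanager at a target site.
GATEKEEPER = "ce.some-site.example/jobmanager-pbs"

SUBMIT_TEMPLATE = """\
universe      = grid
grid_resource = gt2 {gatekeeper}
executable    = run_atlas_transform.sh
arguments     = {args}
output        = job_{jobid}.out
error         = job_{jobid}.err
log           = job_{jobid}.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
"""

def submit_job(jobid, args):
    """Write a Condor-G submit description and hand it to condor_submit."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_TEMPLATE.format(gatekeeper=GATEKEEPER, jobid=jobid, args=args))
        submit_file = f.name
    # Requires a local Condor-G installation; condor_submit is the standard client.
    subprocess.check_call(["condor_submit", submit_file])

if __name__ == "__main__":
    submit_job(1, "--dataset rome.simul.001 --events 50")
```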
Massive productions on 3 Grids (4)
[Chart: "ATLAS Rome Production - Number of Jobs" per site; 573,315 jobs at 84 sites in 22 countries]
• 73 data sets containing 6.1M events simulated and reconstructed (without pile-up)
• Total simulated data: 8.5M events
• Pile-up done later (for 1.3M events, done up to last week)
Experience with LCG-2 Operations
• Support for our productions from the CERN-IT-EIS team was excellent
• Other LCG/EGEE structures were effectively invisible (GOC, ROCs, GGUS etc.)
  • no communication line between the experiments and the Grid Operations Centres
  • operational trouble information always came through the EIS group
• Sites scheduled major upgrades or downtimes during our productions
  • no concept of "service" among the service providers yet!
  • many sites consider themselves part of a test structure set up (and funded) by EGEE
  • but we consider the LCG Grid an operational service for us!
• Many sites do not have the concept of "permanent disk storage" in a Storage Element
  • if they change something in their filing system, our catalogue has to be updated!
Second ProdSys development cycle
• The experience with DC2 and the Rome production taught us that we had to re-think at least some of the ProdSys components
• The ProdSys review defined the way forward:
  • Frederic Brochu was one of the reviewers
  • Keep the global ProdSys architecture (system decomposition)
  • Replace or re-work all individual components to address the identified shortcomings of the Grid middleware:
    • reliability and fault tolerance first of all (a simple illustration follows below)
  • Re-design the Distributed Data Management system to avoid single points of failure and scaling problems
• Work is now underway
  • target is the end of the summer for integration tests
  • ready for LCG Service Challenge 3 from October onwards
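Reliability and fault tolerance are the main drivers of the re-work. Purely as an illustration (not the actual ProdSys code), a supervisor-style component can mask transient Grid failures by retrying a submission a bounded number of times with back-off before declaring the job failed:

```python
"""Toy sketch of the fault-tolerance pattern discussed above; the exception
type and callable are invented for illustration."""

import random
import time

class TransientGridError(Exception):
    """Stands in for recoverable failures (timeouts, temporarily missing SEs, ...)."""

def run_with_retries(submit, job, max_attempts=3, base_delay=60):
    """Retry a job submission with exponential back-off.
    `submit` is any callable that raises TransientGridError on recoverable failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(job)
        except TransientGridError:
            if attempt == max_attempts:
                raise  # give up: report the job as failed to the supervisor
            delay = base_delay * 2 ** (attempt - 1) * (0.5 + random.random())
            time.sleep(delay)  # back off before resubmitting, possibly elsewhere
```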
Distributed Data Management
• Accessing distributed data on the Grid is not a simple task
  • Several central DBs are needed to hold dataset information
  • "Local" catalogues hold information on local data storage
• The new DDM system is under test this summer
• It will be used for all ATLAS data from October on (LCG Service Challenge 3)
• Affects GridPP effort
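The layering described above, central catalogues that resolve a dataset name into its constituent files, and per-site "local" catalogues that map each file to a replica on local storage, can be sketched roughly as follows. Class and method names are invented for illustration; they are not the interfaces of the real ATLAS DDM implementation.

```python
"""Rough sketch of the two-level catalogue lookup described above."""

class DatasetCatalogue:
    """Central catalogue: dataset name -> list of file identifiers (e.g. GUIDs)."""
    def __init__(self):
        self._datasets = {}          # dataset name -> [file GUIDs]

    def register(self, dataset, guids):
        self._datasets[dataset] = list(guids)

    def files_in(self, dataset):
        return self._datasets[dataset]

class LocalReplicaCatalogue:
    """Per-site catalogue: file identifier -> physical replica on local storage."""
    def __init__(self, site):
        self.site = site
        self._replicas = {}          # GUID -> storage URL at this site

    def add_replica(self, guid, surl):
        self._replicas[guid] = surl

    def locate(self, guid):
        return self._replicas.get(guid)   # None if the site holds no replica

def resolve(dataset, central, local):
    """Dataset name -> locally available replicas: the pattern a job would follow."""
    return [local.locate(g) for g in central.files_in(dataset) if local.locate(g)]
```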
Computing Operations
• The Computing Operations organization is likely to change:
  • Grid Tools
  • Grid operations:
    • Tier-0 operations
    • re-processing of real and simulated data at Tier-1s
    • data distribution and placement
  • Software distribution and installation
  • Site and software installation validation and monitoring
  • Coordination of Service Challenges in 2005-2006
  • User Support
• Proposal to use Frederic Brochu in front-line triage
  • Credited contribution
  • Contingent on Distributed Analysis planning
Software Installation
• Software installation continues to be a challenge
• Rapid roll-out of releases to the Grid is important for the ATLAS UK eScience goals (3.1.4)
• Vital for user code in distributed analysis
• Grigori Rybkine (50/50 GridPP/ATLAS eScience):
  • Working towards 3.1.5: kit installation and package management in distributed analysis
  • Package manager implementation supports tarballs and locally-built code (see the sketch below)
  • Essential support role
• 3.1.5 progressing well; 3.1.4 may see some delays because it depends on external effort on nightly deployable packages
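As a rough illustration of what the tarball side of kit installation involves (not the actual package manager), installing a release distributed as a tarball amounts to fetching it, unpacking it into a site-local area, and recording what was installed so user code can be layered on top. The URL, paths and manifest format below are placeholders, not the real ATLAS distribution-kit layout.

```python
"""Toy sketch of tarball-based kit installation."""

import json
import os
import tarfile
import urllib.request

INSTALL_AREA = os.path.expanduser("~/atlas_kits")          # site-local install area
MANIFEST     = os.path.join(INSTALL_AREA, "installed.json")

def install_release(name, tarball_url):
    """Fetch a release tarball, unpack it, and record it in a local manifest."""
    os.makedirs(INSTALL_AREA, exist_ok=True)
    local_tar = os.path.join(INSTALL_AREA, name + ".tar.gz")
    urllib.request.urlretrieve(tarball_url, local_tar)

    with tarfile.open(local_tar) as tar:
        tar.extractall(os.path.join(INSTALL_AREA, name))

    installed = {}
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            installed = json.load(f)
    installed[name] = tarball_url
    with open(MANIFEST, "w") as f:
        json.dump(installed, f, indent=2)

# e.g. install_release("10.0.4", "http://example.org/kits/AtlasRelease-10.0.4.tar.gz")
```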
Current plans for EGEE/gLite
• Ready to test new components as soon as they are released from the internal certification process
  • we assume the LCG Baseline Services
• So far we have only seen the File Transfer Service (FTS) and the LCG File Catalogue (LFC)
  • both are being actively tested by our DDM group
  • FTS will be field-tested by Service Challenge 3, starting in July
  • LFC is in our plan for the new DDM (summer deployment)
• We have not really seen the new Workload Management System nor the new Computing Element
  • some informal ATLAS access to pre-release versions
• As soon as the performance is acceptable we will ask to have them deployed
  • this is NOT a blank check!
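For concreteness, field-testing the two baseline services seen so far boils down to registering and looking up files in the LFC and handing source/destination pairs to the FTS. The sketch below shells out to the client tools of that era; the endpoints, VO paths and file names are invented, and option names varied between middleware versions, so treat it as a sketch of the workflow rather than a tested recipe.

```python
"""Illustrative only: exercising the LFC and FTS with their command-line
clients.  Endpoints and file names are placeholders, and exact option names
are assumptions that may differ between middleware releases."""

import os
import subprocess

os.environ["LFC_HOST"] = "lfc.example.cern.ch"      # placeholder catalogue host

def register_file(local_path, storage_element, lfn):
    """Copy a local file to a Storage Element and register it in the LFC."""
    subprocess.check_call([
        "lcg-cr", "--vo", "atlas",
        "-d", storage_element,
        "-l", "lfn:" + lfn,
        "file://" + os.path.abspath(local_path),
    ])

def list_catalogue(directory="/grid/atlas"):
    """List a directory in the LFC namespace."""
    subprocess.check_call(["lfc-ls", "-l", directory])

def submit_transfer(fts_endpoint, source_surl, dest_surl):
    """Hand a single source/destination pair to the File Transfer Service."""
    subprocess.check_call([
        "glite-transfer-submit", "-s", fts_endpoint, source_surl, dest_surl,
    ])
```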
Distributed Analysis System
• ATLAS and GANGA work is now focused on Distributed Analysis
• LCG RTAG 11 in 2003 did not produce a common analysis system project as hoped. ATLAS therefore planned to combine the strengths of various existing prototypes:
  • GANGA provides a Grid front-end for Gaudi/Athena jobs (a flavour of the interface is sketched below)
  • DIAL provides fast, quasi-interactive access to large local clusters
  • The ATLAS Production System interfaces to the 3 Grid flavours
• Alvin Tan:
  • Work on the job-building GUI and the Job Options Editor well received
  • Wish from LBL to merge JOE with the Job Options Tracer project
  • Monitoring work also well received – prototypes perform well
• Frederic Brochu:
  • Provided a beta version of new job submission from GANGA directly to the Production System
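To give a flavour of the GANGA front-end role mentioned above: in GANGA's Python interface a user describes an Athena job once and chooses where it runs by swapping the backend. The snippet below is a simplified sketch along those lines; the class names follow GANGA conventions, but the attribute details are assumptions and are not guaranteed to match any specific release.

```python
# To be typed at the GANGA prompt, where Job, Athena, LCG and Local are
# provided by GANGA itself; attribute names below are assumptions.
j = Job()
j.application = Athena()                    # Gaudi/Athena application handler
j.application.option_file = 'AnalysisSkeleton_jobOptions.py'  # hypothetical job options
j.backend = LCG()                           # run on the LCG/EGEE Grid
# j.backend = Local()                       # ...or the same job on the local machine
j.submit()
```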
Distributed Analysis System (2)
• Currently reviewing this activity to define a baseline for the development of the start-up Distributed Analysis System
• All this has to work together with the DDM system described earlier
• Decide a baseline "now", so we can have a testable system by this autumn
• The outcome of the review may change GridPP plans
Conclusions
• ATLAS is (finally) getting effective throughput from LCG
• The UK effort is making an important contribution
• Distributed Analysis continues to pose a big challenge
• ATLAS is taking the right management approach
• GridPP effort will have to be responsive