ATLAS Computing Status, early results and issues
James Shank
DOSAR Workshop, April 2010, Johannesburg
• 2010 Collision startup
• ATLAS Computing model and resource estimates
• Highlights of the recent ATLAS Distributed Computing meeting at BNL
• Computing issues in the near future
The ATLAS Computing Model
• Latest round of revision driven by the Computing Resources Scrutiny Group (CRSG)
  • A committee of the LHCC, reporting to CERN management
• Strong pressure from some funding agencies to reduce computing resource pledges, since the LHC is delayed
• This forced us to revise our data replication policy
  • No full AOD set at the T1s in the initial distribution
  • Can be revised by "thermodynamic" (demand-driven) data distribution; a sketch of the idea follows this slide
• Discussion still ongoing; will culminate in the Resource Review Board meeting
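The "thermodynamic" idea is that the number of replicas of a dataset tracks measured demand rather than following a fixed placement plan. Below is a minimal sketch of that logic, assuming a hypothetical popularity metric and replica catalogue; the names, thresholds, and site-selection rule are illustrative, not the actual ATLAS DDM interfaces.

```python
# Sketch of popularity-driven ("thermodynamic") replica management.
# All names and thresholds here are illustrative assumptions, not the
# real ATLAS DDM API.

from dataclasses import dataclass, field


@dataclass
class DatasetState:
    name: str
    accesses_last_30d: int = 0                   # hypothetical popularity metric
    replicas: set = field(default_factory=set)   # Tier-1 sites holding a copy


def target_replica_count(accesses: int) -> int:
    """Map popularity to a desired number of Tier-1 replicas."""
    if accesses == 0:
        return 1                              # custodial copy only
    if accesses < 100:
        return 2
    return min(2 + accesses // 500, 6)        # cap so hot data cannot eat all disk


def rebalance(ds: DatasetState, all_t1s: list) -> None:
    """Add or drop replicas so the count tracks demand."""
    want = target_replica_count(ds.accesses_last_30d)
    while len(ds.replicas) < want:
        candidates = [s for s in all_t1s if s not in ds.replicas]
        if not candidates:
            break
        ds.replicas.add(candidates[0])        # a real system would weight by free space
    while len(ds.replicas) > max(want, 1):
        ds.replicas.pop()                     # a real system would protect the custodial copy


if __name__ == "__main__":
    t1s = ["BNL", "TRIUMF", "RAL", "CCIN2P3", "FZK", "CNAF"]
    hot = DatasetState("data10_7TeV.AOD.hot", accesses_last_30d=1200, replicas={"BNL"})
    rebalance(hot, t1s)
    print(hot.name, "->", sorted(hot.replicas))
```

The real policy must also respect custodial copies and per-site disk; the cap in target_replica_count just illustrates that popular data cannot consume unlimited space.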
The ATLAS Computing Model (2)
PanDA Usage for User Analysis Jobs
PanDA Usage for User Analysis Jobs (2)
PanDA Usage for User Analysis Jobs (3)
ATLAS Distributed Computing
• Workshop held last week at BNL
• http://indico.cern.ch/conferenceDisplay.py?confId=84669
• Some highlights in the following slides…
Overview
• The PanDA-based production system is ~3 years old (pull model sketched after this slide)
• The core system has remained unchanged for the past ~2 years
  • Many bug fixes, feature requests, and tuning
• Some recent major changes
  • PanDA: servers moved to CERN, migration to Oracle, pilot changes
  • AKTR: bulk task submission, job error management
  • Monitoring: DaTRI
  • Info system: AGIS
Kaushik De
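PanDA's defining feature is the pull model: pilot jobs start on worker nodes and fetch their payloads from the central server, rather than having work pushed to sites. A minimal sketch of that loop follows; the server URL, endpoint names, and JSON fields are assumptions for illustration, not the real pilot code.

```python
# Minimal sketch of PanDA's pull model: a pilot asks the central server for
# work, runs it, and reports the outcome. The endpoint URL and the JSON
# fields used here are hypothetical, not the real PanDA protocol.

import json
import subprocess
import time
import urllib.request

PANDA_SERVER = "https://pandaserver.example.org"  # placeholder host


def get_job(site):
    """Ask the server for a job matched to this site; None if the queue is empty."""
    url = f"{PANDA_SERVER}/getJob?site={site}"
    with urllib.request.urlopen(url, timeout=60) as resp:
        job = json.load(resp)
    return job or None


def report(job_id, state):
    """Tell the server how the job ended (hypothetical endpoint)."""
    urllib.request.urlopen(f"{PANDA_SERVER}/updateJob?id={job_id}&state={state}", timeout=60)


def pilot_loop(site):
    """Run until the batch slot expires; a real pilot also enforces wall-time limits."""
    while True:
        job = get_job(site)
        if job is None:
            time.sleep(60)  # back off while no work is queued
            continue
        result = subprocess.run(job["command"], shell=True)
        report(job["id"], "finished" if result.returncode == 0 else "failed")
```

The payoff of this design is that the server never needs to know site internals: any resource that can start a pilot can contribute.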
Goals for 2010-11
• Entering the LHC operations period
  • Stability of the production system is very important
  • Changes should be operations driven: bug fixes, tuning…
• Production system support team
  • We expect an increased support load with LHC data
  • Alas, the support team is shrinking
  • Also, fewer experts with deep knowledge of the system
  • Need more automation to keep functioning smoothly (see the watchdog sketch after this slide)
  • Need better documentation of procedures, errors, checklists
• Some big software updates still needed
  • Motivation: new feature requests, better automation
  • These changes must be tested outside the running production system
Kaushik De
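One concrete shape automation can take is a watchdog that flags jobs stuck in a transient state past a time limit, turning a manual shifter checklist into a short, machine-generated list. A minimal sketch, with invented job fields, state names, and limits:

```python
# Sketch of an operations watchdog: flag jobs stuck too long in transient
# states so shifters act on a short, automated list instead of raw monitors.
# Job fields, state names, and time limits are illustrative assumptions.

import time
from dataclasses import dataclass

# Hypothetical maximum time (seconds) a job may sit in each transient state.
STATE_LIMITS = {
    "starting": 2 * 3600,
    "holding": 6 * 3600,
    "transferring": 12 * 3600,
}


@dataclass
class Job:
    job_id: int
    state: str
    state_since: float  # unix timestamp of the last state change


def stuck_jobs(jobs, now=None):
    """Return (job, seconds_over_limit) for every job past its state limit."""
    now = now or time.time()
    flagged = []
    for job in jobs:
        limit = STATE_LIMITS.get(job.state)
        if limit is not None and now - job.state_since > limit:
            flagged.append((job, now - job.state_since - limit))
    return flagged


if __name__ == "__main__":
    now = time.time()
    jobs = [Job(1, "holding", now - 8 * 3600), Job(2, "running", now - 24 * 3600)]
    for job, over in stuck_jobs(jobs, now):
        print(f"job {job.job_id} stuck in '{job.state}' ({over / 3600:.1f} h over limit)")
```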
Somewhat Random Wishlist
• Pilot Factory integration
• glexec implementation
• SchedConfigDB automation
• AGIS evolution
• Error code database, to improve task completion
• Software installation integrated with PanDA
• Documentation, documentation, documentation…
• Automation, automation, automation…
• Relaxing the cloud constraint
• Generic merge trf: automatic merging of _sub outputs by PanDA (see the sketch after this slide)
• Implement shorter time limits for holding, starting…
• Debug the 'tail' of task completion
• "Additional Production" implementation
Kaushik De
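For the generic merge trf, the idea is a single transform that gathers a task's _sub output files and merges them into one file, so PanDA can schedule the merge automatically. Here is a minimal sketch that shells out to ROOT's standard hadd merger; the file-naming pattern and command-line interface are assumptions for illustration, not the eventual trf.

```python
# Sketch of a generic merge transform ("trf"): gather a task's _sub output
# files and merge them into one file with ROOT's hadd utility.
# The naming convention and CLI shown here are illustrative assumptions.

import glob
import subprocess
import sys


def merge_subs(pattern, output):
    """Merge every file matching `pattern` (e.g. '*.AOD._sub*.root') into `output`."""
    inputs = sorted(glob.glob(pattern))
    if not inputs:
        sys.exit(f"no _sub files match {pattern!r}")
    # hadd is ROOT's standard histogram/tree merger; -f overwrites the target.
    subprocess.run(["hadd", "-f", output] + inputs, check=True)
    print(f"merged {len(inputs)} _sub files into {output}")


if __name__ == "__main__":
    # e.g.: python merge_trf.py 'user.output._sub*.root' user.output.merged.root
    merge_subs(sys.argv[1], sys.argv[2])
```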
New ProdSys?
• Do we need to rewrite ProdSys from scratch?
• Eventually, all software systems become too old
• Some components could benefit from a rewrite
• Plan for a major upgrade in 2012?
Kaushik De
DDM: current status and possible future plans
Simone Campana, on behalf of the DDM team
FAQs
• Are there known limitations of the SW we are using now?
• What SW products shall we use after 2011?
• How will the existing SW evolve?
• Do we need a revolution in any of the existing products, or will it be evolution?
• Do we need better integration of ADC SW with non-ADC SW?
• What are our dependencies on grid SW?
• For a long time our slogan was 'Operation needs are driving SW development'
• How will we address increasing user requests?
• What level of support do we expect from the Facilities?
• How can we automate our SW?
Storage Management
In response to concerns expressed by the LHC experiments:
• "The LHC experiment managements have expressed concern over the performance and scalability of access to data, particularly for their analysis use cases."
• "…focused on setting the scope and goals for work that would address these issues with a tentative timescale of 2013 for large scale use."
• "user access to data and the resulting system should hide the details of the back end mass storage systems and their implementations."
• Work areas:
  • Data Archives and Storage Cloud
  • Data Access Layer (a sketch of the idea follows this slide)
  • Output Datasets
  • Global home directory facilities
  • Catalogues
  • Authorization mechanisms
• Workshop in early June for next steps…
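One way to picture the Data Access Layer goal of hiding the back-end mass storage is a thin interface with interchangeable, site-chosen backends. The sketch below invents all class and method names; it is not the proposed design, just the shape of the abstraction.

```python
# Sketch of a data access layer that hides the mass storage backend.
# Class names, protocols, and methods are invented for illustration only.

from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """What any backend must provide; users never touch these directly."""

    @abstractmethod
    def open_url(self, lfn: str) -> str:
        """Turn a logical file name into a protocol-specific access URL."""


class PosixBackend(StorageBackend):
    def __init__(self, mount: str):
        self.mount = mount

    def open_url(self, lfn: str) -> str:
        return f"file://{self.mount}/{lfn}"


class XrootdBackend(StorageBackend):
    def __init__(self, redirector: str):
        self.redirector = redirector

    def open_url(self, lfn: str) -> str:
        return f"root://{self.redirector}//{lfn}"


class DataAccessLayer:
    """Users ask for a logical name; the site picks the backend."""

    def __init__(self, backend: StorageBackend):
        self.backend = backend

    def access(self, lfn: str) -> str:
        return self.backend.open_url(lfn)


if __name__ == "__main__":
    dal = DataAccessLayer(XrootdBackend("redirector.example.org"))
    print(dal.access("data10_7TeV/AOD/somefile.root"))
```

Whatever the final design, the point the slide makes is that analysis code should depend only on the top interface, never on which mass storage system a site runs.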
Conclusions
• ATLAS Distributed Computing can survive early data-taking
• The long runs in 2010 and 2011 will put a strain on many systems
• Still many questions about the future
  • Evolution or revolution?
Other stuff…