ATLAS Computing
XXI International Symposium on Nuclear Electronics & Computing
Varna, Bulgaria, 10-17 September 2007
Alexandre Vaniachine
Invited talk for the ATLAS Collaboration
Outline
• ATLAS Computing Model
• Distributed computing facilities:
  • Tier-0 plus Grids: EGEE, OSG, and NDGF
• Components:
  • Production system (jobs)
  • Data management (files)
  • Databases to keep track of jobs, files, etc.
• Status:
  • Transition from development to operations
  • Commissioning at the pit and Tier-0 with cosmics data
  • Commissioning with simulation data by grid operations teams
  • Distributed analysis of these data by physicists
  • The M4 cosmics run in August validated all these separate operations
Credits
• I wish to thank the Symposium organizers for their invitation and for their hospitality
• This overview is biased by my own opinions
• CHEP07 last week simplified my task
• I wish to thank my ATLAS collaborators for their contributions
  • I have added references to their CHEP07 contributions
• All ATLAS CHEP contributions will be published as ATLAS Computing Notes and serve as a foundation for the collaboration paper on ATLAS Computing, to be prepared towards the end of 2007
  • which will provide updates to the ATLAS Computing TDR: http://atlas-proj-computing-tdr.web.cern.ch/atlas-proj-computing-tdr/Html/Computing-TDR.htm
ATLAS Multi-Grid Infrastructure
• ATLAS computing operates uniformly on three Grids with different interfaces
• Focus is shifting to physics analysis performance:
  • testing, integration, validation, deployment, documentation
  • then operations!
ATLAS Computing Model: Roles of Distributed Tiers [Farbin, id 83]
• 40+ sites worldwide
• Reprocessing of full data a few months after data taking, as soon as improved calibration and alignment constants are available
• Managed tape access: RAW, ESD
• Disk access: AOD, fraction of ESD
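To make the placement policy concrete, here is a minimal sketch of the storage classes described above (illustrative only; real placement is managed by the DDM system, and all names here are hypothetical):

```python
# Toy model of Tier-1 data placement for the main event formats.
# "fraction-on-disk" marks formats only partially resident on disk.
TIER1_PLACEMENT = {
    "RAW": {"tape"},                      # managed tape access
    "ESD": {"tape", "fraction-on-disk"},  # full ESD on tape, a fraction on disk
    "AOD": {"disk"},                      # full copy on disk for analysis
}

def on_disk(fmt: str) -> bool:
    """True if a Tier-1 keeps at least part of this format on disk."""
    return any("disk" in loc for loc in TIER1_PLACEMENT.get(fmt, set()))

print(on_disk("RAW"), on_disk("ESD"), on_disk("AOD"))  # False True True
```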
Ongoing Transition from Development to Operations
• To facilitate operational efficiency, ATLAS computing separates development activities from operations
• No more "clever developments": pragmatic addition of vital residual services to support operations
  • Such as monitoring tools (critical to operations)
• ATLAS CHEP07 contributions covered in detail both development reports and operational experience
Keeping Track of Jobs and Files Needs Databases
• To achieve robust operations on the grids, ATLAS splits data processing tasks over petabytes of event data into smaller units: jobs and files
• Job: the unit of data processing workflow management in grid computing, managed by the ATLAS Production System
  • ATLAS job configuration and job completion are stored in prodDB
  • ATLAS jobs are grouped in Tasks (AKTR DB for physics tasks)
• File: the unit of data management in grid computing, managed by the ATLAS Distributed Data Management (DDM) System
  • ATLAS files are grouped in Datasets (DDM Central Catalogs)
  • AMI, the ATLAS Metadata Information DB, answers "Where is my dataset?"
• Database Operations will be covered in a separate talk later in this session
• A minimal sketch of this bookkeeping idea follows below
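The sketch below illustrates the tasks→jobs and datasets→files bookkeeping with a deliberately simplified relational schema (hypothetical table and column names, not the actual prodDB or DDM layout):

```python
import sqlite3

# Hypothetical, simplified schema: tasks group jobs (production system side),
# datasets group files (DDM side).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tasks    (task_id INTEGER PRIMARY KEY, physics_channel TEXT);
CREATE TABLE jobs     (job_id INTEGER PRIMARY KEY,
                       task_id INTEGER REFERENCES tasks,
                       status TEXT);  -- e.g. 'defined', 'running', 'done'
CREATE TABLE datasets (dataset_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE files    (lfn TEXT PRIMARY KEY,  -- logical file name
                       dataset_id INTEGER REFERENCES datasets,
                       job_id INTEGER REFERENCES jobs);  -- producing job
""")

db.execute("INSERT INTO tasks VALUES (1, 'single-muon simulation')")
db.execute("INSERT INTO jobs VALUES (42, 1, 'done')")
db.execute("INSERT INTO datasets VALUES (7, 'csc.singlemu.simul')")
db.execute("INSERT INTO files VALUES ('csc.singlemu.simul._0001.pool.root', 7, 42)")

# "Where is my dataset?" becomes a catalog lookup over this structure:
for (lfn,) in db.execute("""SELECT f.lfn FROM files f
                            JOIN jobs j ON j.job_id = f.job_id
                            WHERE j.status = 'done'"""):
    print(lfn)
```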
ATLAS Multi-Grid Operations Architecture and Results
• Leveraging the underlying database infrastructure, the ATLAS Production System and ATLAS DDM successfully manage the simulation workflow on three production grids: EGEE, OSG, and NDGF
• Statistics of success are taken from the ATLAS production database
Leveraging Growing Resources on Three Grids
• Latest snapshot of ATLAS resources [Yuri Smirnov, CHEP talk 184]
CERN to Tier-1 Transfer Rates
• ATLAS has the largest nominal CERN to Tier-1 transfer rate
• Tests this spring reached ~75% of the nominal target
• Successful use of all ten ATLAS Tier-1 centers
Validating the Computing Model with Realistic Data Rates
• M3 Cosmics Run (mid-July)
  • Cosmics produced about 100 TB in 2 weeks
  • Stressed the offline chain by running at 4 times the nominal rate
  • LAr 32-sample test
• M4 Cosmics Run: August 23 – September 3
  • Metrics for success:
    • full-rate Tier-0 processing
    • data exported to 5 of 10 Tier-1s and stored
    • for 2 of the 5 Tier-1s, exports to at least two Tier-2s
    • quasi real-time analysis in at least one Tier-2
    • reprocessing in September in at least one Tier-1
• M5 Cosmics Run scheduled for October 16-23
• M6 Cosmics Run will run from the end of December until real data taking
  • Incremental goals, reprocessing between runs
  • Will run close to the nominal rate
  • Maybe ~420 TB by the start of the run, plus Monte Carlo
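As a back-of-the-envelope check of the M3 numbers above, a small worked example (assuming 1 TB = 10^12 bytes and 14 days of continuous running; treating the M3 average as a rough proxy for the nominal rate is an extra assumption):

```python
# Average throughput implied by ~100 TB of cosmics data in 2 weeks.
volume_bytes = 100e12          # 100 TB, assuming 1 TB = 1e12 bytes
duration_s = 14 * 24 * 3600    # two weeks of continuous running

avg_rate_mb_s = volume_bytes / duration_s / 1e6
print(f"M3 average rate: {avg_rate_mb_s:.0f} MB/s")          # ~83 MB/s

# The offline chain was stressed at 4x the nominal rate; with the
# M3 average as a stand-in for nominal (an assumption), that is:
print(f"4x stress-test rate: {4 * avg_rate_mb_s:.0f} MB/s")  # ~331 MB/s
```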
Validating the Computing Model with Realistic Data Rates
• M4 Cosmics Run: August 23 – September 3
• Raw data distribution in real time from online to Tier-0 and to all Tier-1s
• The full chain worked with all ten Tier-1s at the target rate
[Plot: throughput in MB/s from Tier-0 to all Tier-1s; daily rates ramped up through the run, reaching the expected maximum on the last day]
Real-time M4 Data Analysis
• Tracks reconstructed in the muon chambers and in the TRT [event displays]
• Analysis done simultaneously at European and US Tier-1/Tier-2 sites
An Important Milestone
• Metrics for success:
  • Full-rate Tier-0 processing: OK
  • Data exported to 5 of 10 Tier-1s and stored: OK, and did more!
  • For 2 of the 5 Tier-1s, exports to at least 2 Tier-2s: OK
  • Quasi real-time analysis in at least 1 Tier-2: OK, and did more!
  • Reprocessing in September in at least 1 Tier-1: in preparation
• Last week, for the first time, ATLAS demonstrated mastery of the whole data chain: from the measurement of a real cosmic-ray muon in the detector to near real-time analysis at sites in Europe and the US, with all steps in between
ATLAS Event Data Model
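The event data model itself is presented as a diagram in the original slides. As a rough orientation, the sketch below encodes approximate nominal per-event sizes commonly quoted in the Computing TDR era (indicative values only, an assumption rather than figures from this talk):

```python
# Approximate nominal per-event sizes in the ATLAS event data model
# (indicative Computing TDR era values -- an assumption, not from this talk).
EVENT_SIZE_MB = {
    "RAW": 1.6,    # bytestream from the detector
    "ESD": 0.5,    # Event Summary Data: reconstruction output
    "AOD": 0.1,    # Analysis Object Data: physics objects
    "TAG": 0.001,  # event-level metadata for fast selection
}

def sample_size_tb(n_events, fmt):
    """Toy estimate of dataset size in TB for n_events of a given format."""
    return n_events * EVENT_SIZE_MB[fmt] / 1e6

print(f"{sample_size_tb(10_000_000, 'AOD'):.1f} TB")  # 10M AOD events ~ 1.0 TB
```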
ATLAS Full Dress Rehearsal
• Simulated events injected into the TDAQ
  • Realistic physics mix in bytestream format, including luminosity blocks
  • Real data file and dataset sizes, trigger tables, data streaming
• Tier-0/Tier-1 data quality, express line, calibration running
  • Use of the Conditions DB
• Tier-0 reconstruction: ESD, AOD, TAG, DPD
• Data exports to Tier-1s and Tier-2s
• Remote analysis
  • at the Tier-1s:
    • Reprocessing from RAW → ESD, AOD, DPD, TAG
    • Remake AOD from ESD
    • Group-based analysis → DPD
  • at the Tier-2s and Tier-3s:
    • ROOT-based analysis
    • Trigger-aware analysis with Conditions and Trigger DBs
    • No MC truth, user analysis
    • MC/reconstruction production in parallel
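A schematic sketch of the processing chain the FDR exercises (hypothetical function names standing in for the real Athena transformations):

```python
# Hypothetical stand-ins for the reconstruction steps named above
# (the real FDR runs Athena transformations): RAW -> ESD -> AOD -> TAG/DPD.
def reconstruct(raw):     return f"ESD({raw})"
def make_aod(esd):        return f"AOD({esd})"
def make_tag(aod):        return f"TAG({aod})"
def make_dpd(aod, group): return f"DPD[{group}]({aod})"

def tier0_first_pass(raw_file):
    """First-pass Tier-0 processing producing ESD, AOD, TAG and a group DPD."""
    esd = reconstruct(raw_file)
    aod = make_aod(esd)
    return esd, aod, make_tag(aod), make_dpd(aod, "top")

def tier1_reprocess(raw_file):
    """Reprocessing at a Tier-1 repeats the chain from RAW with improved constants."""
    return tier0_first_pass(raw_file)

print(tier0_first_pass("fdr1.0001.RAW"))
```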
FDR Schedule
• Round 1
  • Data streaming tests: DONE
  • Sept/Oct 07: Data preparation STARTS SOON
  • End Oct 07: Tier-0 operations tests
  • Nov 07 - Feb 08: Reprocess at Tier-1, make group DPDs
• Round 2 (assuming new G4)
  • Dec 07 - Jan 08: New data production for final round
  • Feb 08: Data preparation for final round
  • Mar 08: Reconstruction of final round (assuming SRM v2.2)
  • Apr 08: DPD production at Tier-1s
  • Apr 08: More simulated data production in preparation for first data
  • May 08: final FDR
• First-pass production should be validated by year-end
  • Reprocessing will be validated months later
  • Analysis roles will be validated
Ramping Up Computing Resources for LHC Data Taking
• Change of the LHC schedule makes little change to the resource profile
• Recall that the early data is for calibration and commissioning
  • This is needed whether it comes from collisions or cosmics
ATLAS Analysis Model [Farbin, id 83]
• Basic principle: smaller data can be read faster
  • Skimming: keep interesting events
  • Thinning: keep interesting objects in events
  • Slimming: keep interesting info in objects
  • Reduction: build higher-level data
• Derived Physics Data (DPD)
  • Share the schema with objects in the AOD/ESD
  • Can be analyzed interactively
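A toy illustration of these four operations on a hypothetical in-memory event list (not the actual AOD/DPD machinery; all names are invented for illustration):

```python
# Toy events: each event is a dict of collections; each object a dict of fields.
events = [
    {"electrons": [{"pt": 45.0, "eta": 0.3, "calo_cells": 187},
                   {"pt": 8.0,  "eta": 2.1, "calo_cells": 92}]},
    {"electrons": [{"pt": 12.0, "eta": 1.4, "calo_cells": 140}]},
]

def skim(evts, keep_event):
    """Skimming: keep interesting events."""
    return [e for e in evts if keep_event(e)]

def thin(evts, keep_obj):
    """Thinning: keep interesting objects inside each event."""
    return [{coll: [o for o in objs if keep_obj(o)] for coll, objs in e.items()}
            for e in evts]

def slim(evts, fields):
    """Slimming: keep interesting fields inside each object."""
    return [{coll: [{f: o[f] for f in fields} for o in objs]
             for coll, objs in e.items()} for e in evts]

# Reduction: build higher-level derived data, here the leading-electron pT
# of events that pass a skim -- a stand-in for making a DPD.
dpd = [max(o["pt"] for o in e["electrons"])
       for e in skim(events, lambda e: any(o["pt"] > 20 for o in e["electrons"]))]
print(dpd)  # [45.0]
```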
Analysis: Grid Tools and Experiences
• On the EGEE and NDGF infrastructures, ATLAS uses direct submission to the middleware via GANGA
  • EGEE: LCG RB and gLite WMS
  • NDGF: ARC middleware
• On OSG: the PanDA system
  • Pilot-based system (see the sketch below)
  • Also available at some EGEE sites
• Many users have been exposed to the grid
  • Work is getting done
• A simple user interface is essential to simplify the usage
  • But experts are required to understand problems
  • Sometimes users have the impression that they are debugging the grid
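For readers unfamiliar with the pilot model that PanDA uses, the sketch below shows the general pattern (all names are hypothetical; this is not the PanDA or GANGA API): a pilot occupies a worker node first, checks that the environment is sane, and only then pulls a real job from the central queue, so late binding keeps user jobs away from broken nodes.

```python
import queue

# Stand-in for a central PanDA-like server queue of job definitions.
task_queue = queue.Queue()
task_queue.put({"job_id": 1, "transform": "athena", "inputs": ["AOD.0001.root"]})

def environment_ok(site):
    """Placeholder for the sanity checks a pilot runs on the worker node."""
    return True

def run(job, site):
    """Placeholder for actually executing the pulled job."""
    return (job["job_id"], site, "done")

def pilot(site):
    """A pilot lands on a worker node first, validates the environment,
    and only then pulls a real job, so broken nodes fail cheaply."""
    if not environment_ok(site):
        return None                    # no user job wasted on a bad node
    try:
        job = task_queue.get_nowait()  # late binding: job chosen at the last moment
    except queue.Empty:
        return None
    return run(job, site)

print(pilot("ANALY_EXAMPLE"))  # (1, 'ANALY_EXAMPLE', 'done')
```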
Conclusions
• ATLAS computing is addressing unprecedented challenges
  • we are in the final stages of mastering how to handle those challenges
• The ATLAS experiment has mastered a complex multi-grid computing infrastructure at a scale close to that expected for running conditions
  • Resource utilization for simulated event production
  • Transfers from CERN
• A coordinated shift from development to operations/services is happening in the final year of preparation
  • An increase in scale is expected in facility infrastructure, and correspondingly in the ability to use new capacities effectively
• User analysis activities are ramping up