The first year of LHC physics analysis using the GRID: Prospects from ATLAS
Davide Costanzo, University of Sheffield, davide.costanzo@cern.ch
ATLAS Computing
• Grid based, multi-tier computing model:
  • Tier-0 at CERN: first-step processing (within 24 hours), storage of raw data, first-pass calibration
  • Tier-1: about 10 worldwide; reprocessing, data storage (real data and simulation), …
  • Tier-2: regional facilities; storage of Analysis Object Data, simulation, …
  • Tier-3: small clusters, users' desktops
• 3 different "flavors" of grid middleware:
  • LCG in Europe, Canada and the Far East
  • OSG in the US
  • NorduGrid in Scandinavia and a few other countries
Event Processing Data Flow (size/event)
• Detector output (bytestream object view) and simulation output → Raw Data Objects (RDO), ~3 MBytes/event
• Detector Reconstruction (tracks, segments, calorimeter towers, …) → Event Summary Data (ESD), 500 KBytes/event (target for stable data taking)
• Combined Reconstruction (analysis objects: Electron, Photon, Muon, TrackParticle, …) → Analysis Object Data (AOD), 100 KBytes/event
• User Analysis runs on the AOD
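As a rough illustration (the event count below is invented, only the per-event sizes come from the slide), the format sizes translate into total disk volumes like this:

# Illustrative only: per-event sizes quoted above, applied to a
# hypothetical sample of 10 million events.
SIZES_KB = {"RDO": 3000, "ESD": 500, "AOD": 100}  # KBytes per event

def total_size_tb(n_events: int, fmt: str) -> float:
    """Total size of n_events in a given format, in TBytes (1 TB = 1e9 KB)."""
    return n_events * SIZES_KB[fmt] / 1e9

if __name__ == "__main__":
    n = 10_000_000
    for fmt in SIZES_KB:
        print(f"{fmt}: {total_size_tb(n, fmt):.1f} TB for {n:,} events")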
Simplified ATLAS Analysis
• Ideal Scenario:
  • Read AOD and create ntuple
  • Loop over ntuple and make histograms
  • Use ROOT, make plots
  • Go to ICHEP (or another conference)
• Realistic Scenario:
  • Customization in the AOD building stage
  • Different analyses have different needs
• Start-up Scenario:
  • Iterations needed on some data samples to improve Detector Reconstruction
• Distributed event processing (on the Grid):
  • Data sets "scattered" across several grid systems
  • Need distributed analysis
• Iteration frequencies noted on the slide: a few times/week, several times/day, once a month(?)
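A minimal PyROOT sketch of the "loop over the ntuple and make histograms" step of the ideal scenario; the file, tree and branch names are hypothetical:

import ROOT  # PyROOT; file, tree and branch names below are assumptions

# Loop over an ntuple made from AODs and fill a histogram
f = ROOT.TFile.Open("my_aod_ntuple.root")
tree = f.Get("analysis")                        # assumed TTree name
h_pt = ROOT.TH1F("h_pt", "Electron p_{T};p_{T} [GeV];Events", 100, 0.0, 200.0)

for event in tree:                              # PyROOT makes TTrees iterable
    h_pt.Fill(event.el_pt)                      # assumed scalar branch name

c = ROOT.TCanvas("c", "c")
h_pt.Draw()
c.SaveAs("el_pt.png")                           # plot for the conference talk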
ATLAS and the Grid: Past experience
• 2002-3 Data Challenge 1:
  • Contribution from about 50 sites. First use of the grid
  • Prototype distributed data management system
• 2004 Data Challenge 2:
  • Full use of the grid
  • ATLAS middleware not fully ready
  • Long delays, simulation data not accessible
  • Physics validation not possible; events not used for physics analysis
• 2005 "Rome Physics Workshop" and combined test beam:
  • Centralized job definition
  • First users' exposure to the Grid (delivered ~10M validated events)
  • Learned the pros and cons of Distributed Data Management (DDM)
ATLAS and the Grid: Present (and Future)
• 2006 Computing System Commissioning (CSC) and Calibration Data Challenge:
  • Use subsequent bug-fix sw releases to ramp up the system (validation)
  • Access (distributed) database data (e.g. calibration data)
  • Decentralize job definition
  • Test the distributed analysis system
• 2006-7 Collection of about 25 physics notes:
  • Use events produced for CSC
  • Concentrate on techniques to estimate Standard Model backgrounds
  • Prepare physicists for the LHC challenge
• 2006 and beyond: data taking
  • ATLAS is already taking cosmic data
  • Collider data taking is about to start
  • Exciting physics is around the corner
ATLAS Distributed Data Management
• ATLAS reviewed all of its Grid systems during the first half of 2005
• A new Distributed Data Management system (DDM) was designed around:
  • A hierarchical definition of datasets
  • Central dataset catalogues
  • Data-blocks as units of file storage and replication
  • Distributed file catalogues
  • Automatic data transfer mechanisms using distributed services (dataset subscription system, sketched below)
• The DDM system allows the implementation of the basic ATLAS Computing Model concepts, as described in the Computing Technical Design Report (June 2005):
  • Distribution of raw and reconstructed data from CERN to the Tier-1s
  • Distribution of AODs (Analysis Object Data) to Tier-2 centres for analysis
  • Storage of simulated data (produced by Tier-2s) at Tier-1 centres for further distribution and/or processing
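A minimal sketch of the dataset-subscription idea, under the assumption that data-blocks are the unit of replication; this is a toy illustration, not the real DQ2 client API, and the dataset/site names are invented:

# Toy bookkeeping behind a dataset subscription: a site subscribes to a
# dataset, and a transfer service copies whichever data-blocks are missing.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    blocks: list[str]                 # data-blocks: the unit of replication

@dataclass
class Site:
    name: str
    blocks: set[str] = field(default_factory=set)

def subscribe(site: Site, dataset: Dataset) -> list[str]:
    """Return the data-blocks a transfer service would need to copy
    so that `site` holds a complete replica of `dataset`."""
    return [b for b in dataset.blocks if b not in site.blocks]

aod = Dataset("csc11.005300.AOD", ["block_0001", "block_0002", "block_0003"])
tier2 = Site("UKI-SHEF", blocks={"block_0001"})
print(subscribe(tier2, aod))          # -> ['block_0002', 'block_0003']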
ATLAS Data Management Model
• Tier-1s send AOD data to Tier-2s
• Tier-2s produce simulated data and send it to Tier-1s
• In an ideal world (perfect network communication hardware and software) we would not need to define default Tier-1/Tier-2 associations
• In practice, it turns out to be convenient (robust) to partition the Grid so that there are default (not compulsory) data paths between Tier-1s and Tier-2s
• In this model, a number of data management services are installed only at Tier-1s and act also on their "associated" Tier-2s
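A toy illustration of the default-data-path idea (all site names are invented): each Tier-2 has a preferred Tier-1, used unless an explicit override is given.

# Toy model of default (not compulsory) Tier-1/Tier-2 associations.
from typing import Optional

DEFAULT_TIER1 = {
    "UKI-SHEF": "RAL",
    "UKI-OXF": "RAL",
    "MWT2": "BNL",
}

def upload_target(tier2: str, override: Optional[str] = None) -> str:
    """Pick the Tier-1 that should receive simulated data from a Tier-2.
    The default data path is used unless an explicit override is given."""
    return override or DEFAULT_TIER1[tier2]

print(upload_target("UKI-SHEF"))            # -> RAL (default data path)
print(upload_target("UKI-SHEF", "BNL"))     # -> BNL (association is not compulsory)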
Job Management: Productions
• Once data are distributed in the correct way, we can rework the distributed production system to optimise job distribution by sending jobs to the data, or as close as possible to them (see the sketch below)
• This was not the case previously, as jobs were sent to free CPUs and had to copy the input file(s) to the local worker node from wherever in the world the data happened to be
• Next: make better use of the task and dataset concepts:
  • A "task" acts on a dataset and produces more datasets
  • Use bulk submission functionality to send all jobs of a given task to the location of their input datasets
  • Minimize the dependence on file transfers and the waiting time before execution
  • Collect output files belonging to the same dataset at the same SE and transfer them asynchronously to their final locations
ATLAS Production System (2006), schematic: the Eowyn supervisor takes tasks and jobs from prodDB, uses the DQ2 Data Management System (DMS), and drives Python executors for each Grid flavour: Lexor and Lexor-CG on EGEE, Dulcinea on NorduGrid, PanDA on OSG, and T0MS on LSF at the Tier-0.
Job Management: Analysis
• A system based on a central database (job queue) is good for scheduled productions (as it allows proper priority settings), but too heavy for user tasks such as analysis
• Lacking a global way to submit jobs, a few tools have been developed in the meantime to submit Grid jobs:
  • LJSF (Lightweight Job Submission Framework) can submit ATLAS jobs to the LCG/EGEE Grid
  • Pathena (a parallel version of the ATLAS sw framework, Athena) can generate ATLAS jobs that act on a dataset and submit them to PanDA on the OSG Grid
• The ATLAS baseline tool to help users submit Grid jobs is Ganga (see the sketch below):
  • First Ganga tutorial given to ATLAS 3 weeks ago
  • Ganga and pathena are integrated to submit jobs to different grids
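A schematic sketch in the style of a Ganga job script, to be run inside the Ganga Python shell; the ATLAS-specific application, dataset and splitter classes and the option names shown here are illustrative assumptions, and the actual GangaAtlas plugin interfaces may differ:

# Illustrative Ganga-style job definition (names are assumptions, not the
# verified GangaAtlas API).
j = Job()
j.application = Athena()                          # run the ATLAS Athena framework
j.application.option_file = "MyAnalysis_jobOptions.py"
j.inputdata = DQ2Dataset()                        # dataset known to the DDM/DQ2 catalogues
j.inputdata.dataset = "csc11.005300.recon.AOD"    # made-up dataset name
j.splitter = AthenaSplitterJob(numsubjobs=20)     # split into parallel sub-jobs
j.backend = LCG()                                 # or a PanDA/NorduGrid backend for the other grids
j.submit()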
ATLAS Analysis Work Model
1. Job preparation. Local system (shell): prepare JobOptions, run Athena (interactive or batch), get output
2. Medium-scale testing. Local system (Ganga): prepare JobOptions, find dataset from DDM, generate & submit jobs. Grid: run Athena. Local system (Ganga): job book-keeping, get output
3. Large-scale running. Local system (Ganga): prepare JobOptions, find dataset from DDM, generate & submit jobs. ProdSys: run Athena on the Grid, store output on the Grid. Local system (Ganga): job book-keeping, access output from the Grid, merge results
Distributed analysis use cases
• Statistics analyses (e.g. W mass) on datasets of several million events:
  • All data files may not be kept on a local disk
  • Jobs are sent to the AODs on the grid to make ntuples for analysis
  • Parallel processing required
• Select a few interesting (candidate) events to analyze (e.g. H→4ℓ; sketched below):
  • Information on the AODs may not be enough
  • ESD files are accessed to make a loose selection and copy candidate events to a local disk
• These use cases will be exercised in the coming Computing System Commissioning tests
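A PyROOT sketch of the candidate-skim use case: apply a loose selection to an ESD-derived tree and copy the candidate events to a small local file. The file, tree and branch names and the cut values are hypothetical:

import ROOT  # PyROOT; names and cuts below are assumptions

fin = ROOT.TFile.Open("esd_ntuple.root")
tree = fin.Get("analysis")                       # assumed TTree name

fout = ROOT.TFile("h4l_candidates.root", "RECREATE")
# Loose H->4l-style selection on assumed branches: at least 4 leptons,
# leading lepton pT above 20 GeV.
skim = tree.CopyTree("n_leptons >= 4 && lep_pt[0] > 20.0")
print("kept", skim.GetEntries(), "candidate events out of", tree.GetEntries())
fout.Write()
fout.Close()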
From managed production to Distributed Analysis
• Centrally managed production is now "routine work":
  • Request a dataset from a physics group convener
  • Physics groups collect requests
  • Physics coordination keeps track of all requests and passes them to the computing operations team
  • Pros: well organized, uniform software used, well documented
  • Cons: bureaucratic! Takes time to get what you need…
• Delegate the definition of jobs to physics and combined-performance working groups:
  • Removes a management layer
  • Still requires central organization to avoid duplication of effort
  • Accounting and priorities?
• Job definition/submission for every ATLAS user:
  • Pros: you get what you want
  • Cons: no uniformity, some duplication of effort
Resource Management
• In order to provide a usable global system, a few more pieces must work as well:
  • Accounting at user and group level
  • Fair shares (job priorities) for workload management (see the sketch below)
  • Storage quotas for data management
• Define ~25 groups and ~3 roles in VOMS:
  • Perhaps they are not trivial to implement
  • Perhaps they will force a re-thinking of some of the current implementations
• In any case we cannot advertise a system that is a "free for all" (no job priorities, no storage quotas), so we need these features "now"
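A toy sketch of what a fair-share priority could look like; the group names, nominal shares and usage numbers are all invented:

# Toy fair-share calculation: boost the priority of groups that have consumed
# less CPU than their nominal share, and deprioritize those over it.
NOMINAL_SHARE = {"higgs": 0.30, "susy": 0.30, "standard_model": 0.40}

def priorities(cpu_used: dict[str, float]) -> dict[str, float]:
    """Return a priority factor per group: >1 means under its share, <1 over it."""
    total = sum(cpu_used.values()) or 1.0
    return {
        group: NOMINAL_SHARE[group] / max(cpu_used.get(group, 0.0) / total, 1e-6)
        for group in NOMINAL_SHARE
    }

print(priorities({"higgs": 500.0, "susy": 100.0, "standard_model": 400.0}))
# higgs has used half the CPU but its share is 30% -> priority factor < 1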
Conclusions
• ATLAS is currently using Grid resources for MC-based studies and for real data from the combined test-beam and cosmic rays
• A user community is emerging
• Continue to review critical components to make sure we have everything we need:
  • We now need stability and reliability more than new functionality
  • New components may be welcome in production, if they are shown to provide better performance than existing ones, but only after thorough testing in pre-production service instances
• The challenge of data taking is still in front of us!
  • Simulation exercises can teach us several lessons, but they are just the beginning of the story…