270 likes | 286 Views
The ATLAS Strategy for Distributed Analysis on several Grid Infrastructures. D. Liko, IT/PSS for the ATLAS Distributed Analysis Community. Overview. Distributed Analysis in ATLAS Grids, Computing Model The ATLAS Strategy Production system Direct submission Common Aspects Datamanagement
E N D
The ATLAS Strategy for Distributed Analysis on several Grid Infrastructures D. Liko, IT/PSS for the ATLAS Distributed Analysis Community
Overview • Distributed Analysis in ATLAS • Grids, Computing Model • The ATLAS Strategy • Production system • Direct submission • Common Aspects • Datamanagement • Transformation • GUI • Initial experiences • Production system on LCG • PANDA in OSG • GANGA
ATLAS Grid Infrastructure • Three grids • LCG • OSG • Nordugrid • Significant resources, but different middleware • Teams working on solutions are typically associated to a grid and its middleware • In principle ATLAS resources are available to all ATLAS users • Interest by our users to use their local systems in priority • Not only a central system, flexibility concerning middleware Poster 181: Prototype of the Swiss ATLAS Computing Infrastructure
Distributed Analysis • At this point emphasis on batch model to implement the ATLAS Computing model • Interactive solutions are difficult to realize on top of the current middleware layer • We expect our users to send large batches of short jobs to optimize their turnaround • Scalability • Data Access • Analysis in parallel to production • Job Priorities
ATLAS Computing Model • Data for analysis will be available distributed on all Tier-1 and Tier-2 centers • AOD & ESD • T1 & T2 are open for analysis jobs • The computing model foresees 50 % of grid resources to be allocated for analysis • Users will send jobs to the data and extract relevant data • typically NTuples or similar
Requirements • Data for a year of data taking • AOD – 150 TB • ESD • Scalability • Last year up to 10000 jobs per day for production (job duration up to 24 hours) • Grid and our needs will grow • We expect that our analysis users will run much shorter jobs • Job delivery capacity of the order of 106 jobs per day • Peak capacity • Involves several grids • Longer jobs can reduce this number (but might not always be practical) • Job Priorities • Today we need short queues • In the future we need to steer the resource consumption of our physics and detector groups based on VOMS groups
ATLAS Strategy • Production system • Seamless access to all ATLAS grid resources • Direct submission to GRID • LCG • LCG/gLite Resource Broker • CondorG • OSG • PANDA • Nordugrid • ARC Middleware
Dulcinea Dulcinea CE Dulcinea Dulcinea Lexor CondorG CE ProdDB ATLAS Prodsys Dulcinea PANDA Dulcinea Dulcinea RB CG RB RB CE
Production System • Provides a layer on top of the middleware • Increases the robustness by the system • Retrials and fallback mechanism both for workload and data management • Our grid experience is captured in the executors • Jobs can be run in all systems • Redesign based on the experiences of last year • New Supervisor - Eowyn • New Executors • Connects to new Data Management • Adaptation for Distributed Analysis • Configurable user jobs • Access control based on X509 Certificates • Graphical User Interface ATCOM Presentation 110: ATLAS Experience on Large Scale Production on the Grid
LCG • Resource Broker • Scalability • Reliability • Throughput • New gLite Resource Broker • Bulk submission • Many other enhancements • Studied in ATLAS LCG/EGEE Taskforce • Special setup in Milano & Bolongna • gLite – 2-way Intel Xeon 2.8 CPU (with hyper-threading), 3 GByte memory • LCG – 2-way Intel Xeon 2.4 CPU (without hyper-threading), 2 GByte memory • Both are using the same BDII (52 CE in total) • Several bug fixes and optimization • Steady collaboration with the developers
LCG vs gLite Resource Broker • Bulk submission much faster • Sandbox handling better and faster • Now the match making is the limiting factor • Strong effect from ranking
CondorG • Conceptually similar to LCG RB, but different architecture • Scaling by increasing the number of schedulers • No logging & bookkeeping, but a scheduler keeps tracks of the job • Used in parallel during DC2 & Rome production and increased our use of grid resources • Submission via the Production System, but also direct submission is imaginable Presentation 401: A Grid of Grids using CondorG
Last years experience • Adding CondorG based executor in the production system helped us to increase the number of jobs on LCG
PANDA • New prodsys executor for OSG • Pilot jobs • Resource Brokering • Close integration with DDM • Operational in the production since December Presentation 347: PANDA: Production and Distributed Analysis System for ATLAS
PANDA • Direct submission • Regional production • Analysis jobs • Key features for analysis • Analysis Transformations • Job-chaining • Easy job-submission • Monitoring • DDM end-user tool • Transformation repository
ARC Middleware • Standalone ARC client software – 13 MB Installation • CE has extended functionality • Input files can be staged and are cached • Output files can be staged • Controlled by XRSL, an extended version of globus RSL • Brokering is part of the submission in the client software • Job delivery rates of 30 to 50 per min have been reported • Logging & bookkeeping on the site • Currently about 5000 CPUs, 800 available for ATLAS
Common Aspects • Data management • Transformations • GUI
ATLAS Data Management • Based on Datasets • PoolFileCatalog API is used to hide grid differences • On LCG, LFC acts as local replica catalog • Aims to provide uniform access to data on all grids • FTS is used to transfer data between the sites • Evidently Data management is a central aspect of Distributed Analysis • PANDA is closely integrated with DDM and operational • LCG instance was closely coupled with SC3 • Right now we run a smaller instance for test purposes • Final production version will be based on new middleware for SC4 (FPS) Presentation 75: A Scalable Distributed Data Management System for ATLAS
Transformations • Common transformations is a fundamental aspect of the ATLAS strategy • Overall no homogeneous system …. but a common transformation system allows to run the same job on all supported systems • All system should support them • In the end the user can adapt easily to a new submission system, if he does not need to adapt his jobs • Separation of functionality in grid dependant wrappers and grid independent execution scripts. • A set of parameters is used to configure the specific job options • A new implementation in terms of python is under way
GANGA – The GUI for the Grid • Common project with LHCb • Plugins allow define applications • Currently: Athena and Gaudi, ADA (DIAL) • And backends • Currently: Fork, LSF, PBS, Condor, LCG, gLite, DIAL and DIRAC Presentation 318: GANGA – A Grid User Interface
New version 4 Job splitting GUI Work on plugins to various system is ongoing GANGA latest development
Initial experiences • PANDA on OSG • Analysis with the Production System • GANGA
PANDA on OSG • pathena • Lightweight submission interface to PANDA • DIAL • System submits analysis jobs to PANDA to get acces to grid resources • First users are working on the system Presentation 38: DIAL: Distributed Interactive Analysis of Large Datasets
Distributed Analysis using Prodsys • Currently based on CondorG • Lexor based system on its way • GUI ATCOM • Central team operates the executor as a service • Several analysis were ported to the system • Selected users are testing it Poster 264: Distributed Analysis with the ATLAS Production System
GANGA • Most relevant • Athena application • LCG backend • Evaluated by several users • Simulation & Analysis • Faster submission necessary • Prodsys/PANDA/gLite/CondorG • Feedback • All based on the CLI • New GUI will be presented soon
Summary • Systems have been exposed to selected users • Positive feedback • Direct contact to the experts still essential • For this year – power users and grid experts … • Main issues • Data distribution → New DDM • Scalability → New Prodsys/PANDA/gLite/CondorG • Analysis in parallel to Production → Job Priorities
As of today Distributed Analysis in ATLAS is still work in progress (the detector too) The expected data volume require us to perform analysis on the grid Important pieces are coming into place We will verify Distributed Analysis according to the ATLAS Computing Model in the context of SC4 Conclusions