Overview of the distributed analysis activities during SC4: the ATLAS computing model, workload management and data management for efficient processing of ATLAS data; the AOD and ESD analysis models, TAG-based analysis and the Tier-2 requirements in the Computing Model; the challenges and strategies for analyzing data locally and the distributed management of data between the grids.
ATLAS Analysis • Dietrich Liko
Credits • GANGA Team • PANDA DA Team • ProdSys DA Team • ATLAS EGEE/LCG Taskforce • EGEE Job Priorities WG • DDM Team • DDM Operations Team
Overview • ATLAS Computing Model • AOD & ESD analysis • TAG based analysis • Data Management • DDM and Operation • Workload Management • EGEE, OSG & Nordugrid • Job Priorities • Distributed Analysis Tools • GANGA • pathena • Other tools: Prodsys, LJSF • Distributed Analysis activities during SC4 • Tier-2 Site requirements
Tier-2 in the Computing Model • Tier-2 centers have an important role • Calibration • Simulation • Analysis • Tier-2 centers provide analysis capacity for the physics and detector groups • In general, chaotic access patterns • Typically a Tier-2 center will host … • Full TAG samples • One third of the full AOD sample • Selected RAW and ESD data • Data will be distributed according to guidelines given by the physics groups
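To put these samples in perspective, here is a rough back-of-envelope estimate of the disk a Tier-2 would need for analysis data. Only the ~150 TB/year AOD figure quoted later in this talk comes from the slides; the TAG and RAW/ESD volumes below are illustrative assumptions.

# Rough, illustrative Tier-2 disk estimate for analysis data.
# Only the ~150 TB/year AOD figure comes from this talk; the other
# numbers are placeholder assumptions to make the arithmetic concrete.
aod_total_tb = 150.0          # expected AOD volume for one year of LHC running
aod_fraction = 1.0 / 3.0      # a Tier-2 hosts about one third of the full AOD sample
tag_total_tb = 1.0            # assumption: the full TAG sample is of order 1 TB
selected_raw_esd_tb = 10.0    # assumption: selected RAW and ESD samples

tier2_disk_tb = aod_total_tb * aod_fraction + tag_total_tb + selected_raw_esd_tb
print("Illustrative Tier-2 analysis disk: ~%.0f TB" % tier2_disk_tb)   # ~61 TB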
Analysis models • For efficient processing it is necessary to analyze data locally • Remote access to data is discouraged • To analyze the full AOD and ESD data it is necessary to locate the data and send the jobs to the relevant sites (see the sketch below) • TAG data at Tier-2 will be file based • Analysis uses the same pattern as AOD analysis
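The "send the jobs to the data" pattern can be summarised in a few lines of Python. The helper functions are hypothetical stand-ins for the dataset location catalogue and the grid submission systems discussed later, not a real API.

# Conceptual sketch of data-driven job brokering (hypothetical helpers,
# not an actual DQ2 or workload-management API).

def locate_dataset(dataset_name):
    """Ask the dataset location catalogue which sites hold a complete replica."""
    raise NotImplementedError   # would query the DQ2 dataset location catalogue

def submit_athena_job(site, dataset_name, job_options):
    """Submit an Athena job to a site that already hosts the data."""
    raise NotImplementedError   # would go through the gLite RB, PANDA or ARC

def run_analysis(dataset_name, job_options):
    sites = locate_dataset(dataset_name)
    if not sites:
        raise RuntimeError("no complete replica found; remote access is discouraged")
    # one (set of) sub-job(s) per site that hosts the data
    return [submit_athena_job(site, dataset_name, job_options) for site in sites]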
AOD Analysis • The assumption is that users will perform some data reduction and generate Athena-aware ntuples (AANT) • There are also other possibilities • This development is steered by the Physics Analysis Tools (PAT) group • AANT are then analyzed locally
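For illustration, an AOD-to-AANT jobOptions file might look roughly like the skeleton below. The package and property names are quoted from memory and are release-dependent, so treat them as assumptions; the real recipes are maintained by the PAT group.

# Illustrative skeleton of an AOD -> AANT jobOptions file (names are
# approximate and release-dependent; the official recipe is maintained
# by the Physics Analysis Tools group).

import AthenaPoolCnvSvc.ReadAthenaPool                 # enable POOL/ROOT input
from AthenaCommon.AppMgr import ServiceMgr, theApp

ServiceMgr.EventSelector.InputCollections = [ "AOD.pool.root" ]   # local AOD file(s)
theApp.EvtMax = -1                                     # run over all events

# Book the Athena-aware ntuple output via THistSvc (assumed syntax).
from GaudiSvc.GaudiSvcConf import THistSvc
ServiceMgr += THistSvc()
ServiceMgr.THistSvc.Output = [ "AANT DATAFILE='AnalysisSkeleton.aan.root' OPT='RECREATE'" ]

# The user's analysis algorithm(s) would be added here, filling the
# ntuple with the reduced quantities that are then analyzed locally.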
Aims for SC4 & CSC • The main data to be analyzed are the AOD and ESD data from the CSC production • The total amount of data is still small … a few 100 GB • We aim first at a full distribution of the data to all interested sites • Perform tests of the computing model by analyzing these data and measuring the relevant indicators • Reliability • Scalability • Throughput • Simply answer the questions … • How long does it take to analyze the expected 150 TB of AOD data corresponding to one year of LHC running? • And what happens if several of you try to do it at the same time?
In the following … • I will discuss the technical aspects necessary to achieve these goals • ATLAS runs on three grids with different middleware • LCG/EGEE • OSG • Nordugrid • Data Management is shared between the grids • But there are grid-specific aspects • Workload Management is grid-specific
Data Management • I will review only some aspects related to Distributed Analysis • The ATLAS DDM is based on Don Quijote 2 (DQ2) • See the tutorial session on Thursday for more details • Two major tasks (sketched below) • Data registration • Datasets are used to increase scalability • The Tier-1s provide the necessary services • Data replication • Based on the gLite File Transfer Service (FTS) • Fallback to SRM or gridftp is possible • Subscriptions are used to manage the actual file transfer
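The two tasks can be pictured with a small conceptual sketch; the class and method names below are invented for illustration and do not correspond to the real DQ2 client API, which is covered in the Thursday tutorial.

# Conceptual view of the two DDM tasks (invented names, not the DQ2 API).

class DDMSketch:
    def register_dataset(self, dataset_name, files):
        """Data registration: files are grouped into a dataset to gain
        scalability; the catalogue services are hosted at the Tier-1s."""
        pass

    def subscribe(self, dataset_name, destination_site):
        """Data replication: a subscription asks the destination's site
        services to pull the dataset, normally via FTS, with SRM or
        gridftp as a fallback."""
        pass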
How to access data on a Tier-2 • [Diagram: the dataset catalog (http) and the local replica catalogue (LRC, lrc protocol) are contacted through the VOBOX site services; FTS moves data from Tier-0/Tier-1 to the Tier-2 SE; jobs on the CE read from the SE via rfio, dcap, gridftp or nfs]
To distribute the data for analysis … • Real data • Data recorded and processed at CERN (Tier-0) • Data distribution via the Tier-1s • Reprocessing • Reprocessing at the Tier-1s • Distribution via the other Tier-1s • Simulation • Simulation at the Tier-1s and associated Tier-2s • Collection of the data at the Tier-1 • Distribution via the other Tier-1s
For analysis … • The challenge is not the amount of data, but the management of the overlapping flow patterns • For SC4 we have a simpler aim … • Obtain an equal distribution of the currently available simulated data • Data from the Tier-0 exercise is not useful for analysis • We will distribute only useful data
Grid-specific aspects • OSG • DDM fully in production since January • Site services also at the Tier-2 centers • LCG/EGEE • Only dataset registration in production • The new deployment model addresses this issue • Migration to the new version 0.2.10 is on the way • Nordugrid • Up to now only dataset registration • The final choice of file catalog is still open
New version 0.2.10 • Many essential features for Distributed Analysis • Support for the ATLAS Tier structure • Fallback from FTS to SRM and gridftp • Support for disk areas • Parallel operation of production and SC4 • And many more … • Deployment is on the way • We hope to see it in production very soon • OSG/Panda has to move to it asap • We should stay with this version until autumn • The success of Distributed Analysis during SC4 depends crucially on the success of this version
Local throughput • In the ideal case an Athena job has a data throughput of about 2 MB/s • The limit is given by the persistency technology • StoreGate-POOL-ROOT • Up to 50% of the capacity of a site is dedicated to analysis • We plan to access data locally via the native protocols (rfio, dcap, root, etc.) • The local network configuration should take that into account (see the worked estimate below)
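Taking the 2 MB/s per-job figure at face value, a quick worked estimate answers the 150 TB question raised earlier; the concurrency levels are illustrative assumptions and all scheduling overheads and failures are ignored.

# Back-of-envelope: how long does it take to read 150 TB of AOD
# at ~2 MB/s per Athena job? (Overheads and failures ignored.)

aod_tb = 150.0
per_job_mb_per_s = 2.0

total_mb = aod_tb * 1.0e6                              # 1 TB ~ 10^6 MB
single_job_days = total_mb / per_job_mb_per_s / 86400.0
print("one job: ~%.0f days" % single_job_days)         # ~870 days

for n_jobs in (100, 1000, 5000):                       # illustrative concurrency
    hours = single_job_days * 24.0 / n_jobs
    print("%5d concurrent jobs: ~%.0f h" % (n_jobs, hours))
# ~208 h with 100 jobs, ~21 h with 1000 jobs, ~4 h with 5000 jobs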
Data Management Summary • Data distribution is essential for Distributed Analysis • DQ2 0.2.10 has the required features • There is a lot of work ahead of us to control and validate the data distribution • The local configuration is determined by the Athena I/O
Workload management • Different middleware • Different teams • Different submission tools • Different submission tools are confusing to our users … • We aim at a common ATLAS UI following the ideas of the pathena tool (see later) • But … the priority for Distributed Analysis in the context of SC4 is to solve the technical problems within each grid infrastructure
LCG/EGEE job submission • [Diagram: the gLite UI submits to the gLite Resource Broker, which dispatches the jobs to the sites using the BDII information system and the dataset location catalog]
Advantages of the gLite RB • Bulk submission • Increased performance and scalability • Improved sandbox handling • Shallow resubmission • If you want to use a Resource Broker for Distributed Analysis, you ultimately want to use the gLite RB • Status • Being pushed into deployment with gLite 3.0 • Does not yet have the same maturity as the LCG RB • Turning the gLite RB into production quality evidently has a high priority
ATLAS EGEE/LCG Taskforce measurements • gLite RB (bulk submission): 0.3 sec/job submission, 0.6 sec/job matchmaking • LCG RB (multiple threads): 2 sec/job submission • Job submission is not the limit any more (there are other limits …)
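To make the measured per-job times tangible, here is a simple extrapolation to a typical analysis task split into 1000 sub-jobs; it counts submission and matchmaking time only, nothing else.

# Simple extrapolation of the taskforce numbers to a 1000 sub-job task.
n_jobs = 1000
glite_s = n_jobs * (0.3 + 0.6)      # gLite RB: bulk submission + matchmaking
lcg_s   = n_jobs * 2.0              # LCG RB: multi-threaded submission
print("gLite RB: ~%.0f min, LCG RB: ~%.0f min" % (glite_s / 60.0, lcg_s / 60.0))
# ~15 min vs ~33 min: submission itself is no longer the bottleneck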
Plan B: Prodsys + CondorG • [Diagram: jobs defined in the ProdDB are picked up by the CondorG executor; the CondorG negotiator dispatches them to the sites using the BDII and the dataset location catalog]
Advantages • Profits from the experience of the production • Proven track record from the ongoing production • Better control over the implementation • CondorG by itself is also part of the Resource Broker • Performance • ~1 sec/job • Status • Coupled to the evolution of the production system • Work on GANGA integration has started
OSG • On OSG ATLAS uses PANDA for workload management • The concept is similar to DIRAC and AliEn • Fully integrated with the DDM • Status • In production since January • Work to optimize the support for analysis users is ongoing
Advantages for DA • Integration with DDM • All data already available • Jobs start only when the data is available • Late job binding due to pilot jobs • Addresses grid inefficiencies • Fast response for user jobs • Integration of production and analysis activities
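The late-binding idea behind the pilot model can be sketched in a few lines; this is a conceptual illustration only, not the actual PANDA pilot code or its server protocol.

# Conceptual pilot-job loop (not the real PANDA implementation).

def ask_server_for_job(queue):
    """Contact the central task queue; return a job description or None."""
    raise NotImplementedError   # real pilots talk to the PANDA server over HTTP(S)

def run_payload(job):
    """Set up the ATLAS release and run the user's Athena payload."""
    raise NotImplementedError

def pilot(queue="analysis"):
    # The pilot itself is what gets scheduled on the grid; only once it is
    # running on a worker node does it pull a real job, so user jobs are
    # bound late and a grid-level failure costs a pilot, not a user job.
    job = ask_server_for_job(queue)
    if job is None:
        return              # nothing queued: exit quietly
    run_payload(job)        # status and outputs are reported back afterwards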
Nordugrid • ARC middleware • Compact user interface • 14 MB vs … • The new version has very good job submission performance • 0.5 to 1 sec/job • Status • Open questions: support for all ATLAS users, DDM integration • Analysis capability seems to be coupled with the planned NG Tier-1
ARC job submission • [Diagram: the ARC UI submits directly to the sites, using the Nordugrid information system and the RLS file catalog]
WMS Summary • EGEE/LCG • gLite Resource Broker • Prodsys & CondorG • OSG • PANDA • Nordugrid • ARC • Different systems – different problems • All job submission systems need work to optimize user analysis
Job Priorities • Different infrastructures – different problems • EGEE • Job Priority WG • OSG/Panda • Separate cloud of analysis pilots • Nordugrid • Typically a site has several queues
EGEE Job Priority WG • TCG working group • ATLAS & CMS • LHCb & Diligent as observers • JRA1 (developers) • SA3 (deployment) • Several sites (NIKHEF, CNAF)
ATLAS requirements • Split the site resources into several shares • Production • Long jobs • Short jobs • Other jobs • Objectives • Production should not be pushed out of a site • Analysis jobs should be able to bypass production jobs • Local fairshare
Proposed solution • [Diagram: VOMS roles mapped to batch shares at the CE] • Role=Production: Production 70% • Long jobs: 20% • Short jobs: 9% • Role=Software: Software 1%
Status • Based on VOMS roles • Role=Production • Role=Software • No new middleware • A patch to the WMS has to be backported • Test installation • NIKHEF, CNAF • TCG & WLCG MB have agreed to the proposed solution • We are planning the move to the preproduction service • Move to the production sites in the near future • In the future • Support for physics groups • Dynamic settings • Requires new middleware (such as GPBOX)
PANDA • Increase the number of analysis pilots • Fast pickup of user jobs • The first job can start in a few seconds • Several techniques are being studied • Multitasking pilots • Analysis queues
Job Priorities Summary • EGEE/LCG • New site configuration • OSG/Panda • Addressed by PANDA internal developments
DA User Tools • pathena • PANDA tool for Distributed Analysis • Close to the Physics Analysis Tools group (PAT) • GANGA • Common project between LHCb & ATLAS • Used for Distributed Analysis on LCG
pathena • Developed in close collaboration between PAT and PANDA • Local job: athena ttbar_jobOptions.py • Grid job: pathena ttbar_jobOptions.py --inDS csc11.005100.ttbar.recon.AOD… --split 10
GANGA • Framework for job submission • Based on plugins for • Backends • LCG, gLite, CondorG, LSF, PBS • Applications • Athena, Executable • GPI abstraction layer • Python Command Line Interface (CLIP) • GUI
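For illustration, a GANGA CLIP session could look roughly like the sketch below; the plugin and attribute names (Athena, DQ2Dataset, LCG, option_file, AthenaSplitterJob) are quoted from memory, are version-dependent and should be treated as assumptions.

# Rough sketch of a GANGA CLIP session (attribute and plugin names are
# assumptions and version-dependent; to be typed at the ganga prompt).

j = Job()
j.application = Athena()
j.application.option_file = 'ttbar_jobOptions.py'        # same jobOptions as a local run

j.inputdata = DQ2Dataset()
j.inputdata.dataset = 'csc11.005100.ttbar.recon.AOD...'  # dataset name elided as on the pathena slide

j.backend = LCG()                     # could equally be gLite, CondorG, LSF or PBS
j.splitter = AthenaSplitterJob()      # assumed name of the splitter plugin
j.submit()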
Other Tools • LJSF (Light Job submission framework) • Used for ATLAS software installations • Runs ATLAS transformations on the grid • No integration with DDM yet • DA with Prodsys • Special analysis transformations • Work to interface with GANGA has started
DA Tools Summary • Main tools • pathena with PANDA/OSG • GANGA with the Resource Broker/LCG • Integration with Athena, as demonstrated by pathena, is a clear advantage • The GANGA plug-in mechanism in principle allows a single common interface • The priority for the GANGA team is to deliver a robust solution on LCG first
Distributed Analysis in SC4 • Data distribution • Ongoing activity with the DDM operations team • Site configuration • We will move soon to the preproduction service • In a few weeks we will then move to the production sites • Exercising the job priority model • Analysis in parallel to production • Scaling tests of the computing infrastructure • Measurement of the turnaround time for the analysis of large datasets (see the bookkeeping sketch below)
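The scaling indicators can be extracted from simple job bookkeeping; a minimal sketch, assuming job records with submit/finish timestamps and a success flag (the field names are invented for illustration):

# Minimal sketch for computing SC4 indicators from job bookkeeping
# records (field names are invented for illustration).

def summarize(jobs):
    """jobs: iterable of dicts with 'submitted', 'finished' (unix time) and 'ok' (bool)."""
    done = [j for j in jobs if j['finished'] is not None]
    ok = [j for j in done if j['ok']]
    reliability = float(len(ok)) / len(done) if done else 0.0
    turnaround_h = [(j['finished'] - j['submitted']) / 3600.0 for j in ok]
    wall_s = (max(j['finished'] for j in ok) - min(j['submitted'] for j in ok)) if ok else None
    return {
        'reliability': reliability,
        'mean_turnaround_h': sum(turnaround_h) / len(turnaround_h) if turnaround_h else None,
        'throughput_jobs_per_h': len(ok) * 3600.0 / wall_s if wall_s else None,
    }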
SC4 Timescale • We plan to perform DA tests in August and then again later in autumn • The aim is to quantify the current characteristics of the Distributed Analysis systems • Scalability • Reliability • Throughput • Simply answer the questions … • How long does it take to analyze the expected 150 TB of AOD data corresponding to one year of LHC running? • And what happens if several of you try to do it at the same time?
Tier-2 Site Requirements • Configuration of the batch system to support the job priority model • gLite 3.0 • Analysis and production in parallel • Data availability • Connection to DDM • Disk area • Sufficient local throughput