
ATLAS Analysis Use

Explore the distributed analysis activities during SC4 and review the computing model, workload management, and data management for efficient processing of ATLAS data. Learn about the AOD and ESD analysis models, TAG-based analysis, and the Tier-2 center requirements in the Computing Model. Understand the challenges and strategies for analyzing data locally and the distributed management of data between grids in the ATLAS project.


Presentation Transcript


  1. ATLAS Analysis Use Dietrich Liko

  2. Credits • GANGA Team • PANDA DA Team • ProdSys DA Team • ATLAS EGEE/LCG Taskforce • EGEE Job Priorities WG • DDM Team • DDM Operations Team

  3. Overview • ATLAS Computing Model • AOD & ESD analysis • TAG based analysis • Data Management • DDM and Operation • Workload Management • EGEE, OSG & Nordugrid • Job Priorities • Distributed Analysis Tools • GANGA • pathena • Other tools: Prodsys, LJSF • Distributed Analysis activities during SC4 • Tier-2 Site requirements

  4. Tier-2 in the Computing Model • Tier-2 centers have an important role • Calibration • Simulation • Analysis • Tier-2 centers provide analysis capacity for the physics and detector groups • In general chaotic access patterns • Typically a Tier-2 center will host … • Full TAG samples • One third of the full AOD sample • Selected RAW and ESD data • Data will be distributed according to guidelines given by the physics groups

  5. Analysis models • For efficient processing it is necessary to analyze data locally • Remote access to data is discouraged • To analyze the full AOD and ESD data it is necessary to locate the data and send the jobs to the relevant sites • TAG data at Tier-2 will be file based • Analysis uses the same pattern as AOD analysis

  6. AOD Analysis • The assumption is that users will perform some data reduction and generate Athena-aware ntuples (AANT) • There are also other possibilities • This development is steered by the Physics Analysis Tools group (PAT) • AANT are then analyzed locally
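
To make the local step concrete, here is a minimal PyROOT sketch of looping over such an ntuple; the file name, tree name (CollectionTree) and branch name (JetN) are placeholders for whatever the reduction job actually writes out, not part of any ATLAS tool.

    import ROOT

    # Hypothetical AANT file produced by an Athena data-reduction job.
    f = ROOT.TFile.Open("user.analysis.AANT.root")
    tree = f.Get("CollectionTree")          # assumed tree name, check your ntuple

    h = ROOT.TH1F("h_njet", "Jet multiplicity", 20, 0.0, 20.0)
    for event in tree:
        h.Fill(event.JetN)                  # JetN is a placeholder branch name

    c = ROOT.TCanvas()
    h.Draw()
    c.SaveAs("njet.png")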

  7. Aims for SC4 & CSC • The main data to be analyzed are the AOD and ESD data from the CSC production • The total amount of data is still small … a few hundred GB • We aim first at a full distribution of the data to all interested sites • Perform tests of the computing model by analyzing these data and measuring the relevant indicators • Reliability • Scalability • Throughput • Simply answer the questions … • How long does it take to analyze the expected 150 TB of AOD data corresponding to one year of running of the LHC? • And what happens if several of you try to do it at the same time?

  8. In the following …. • I will discuss the technical aspects necessary to achieve these goals • ATLAS has three grids with different middleware • LCG/EGEE • OSG • Nordugrid • Data Management is shared between the grids • But there are grid specific aspects • Workload Management is grid specific

  9. Data Management • I will review only some aspects related to Distributed Analysis • The ATLAS DDM is based on Don Quijote 2 (DQ2) • See the tutorial session on Thursday for more details • Two major tasks • Data registration • Datasets are used to increase the scalability • Tier-1s are providing the necessary services • Data replication • Based on the gLite File Transfer Service (FTS) • Fallback to SRM or gridftp possible • Subscriptions are used to manage the actual file transfers
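
As a purely illustrative aid (the names below are invented for the example and are not the DQ2 client API), the dataset/subscription bookkeeping idea can be sketched like this:

    # Illustrative bookkeeping only -- these dicts and names are not the DQ2 API.

    # A dataset groups many files under one name, so the central catalogs scale
    # with datasets rather than with individual files.
    dataset = {
        "name": "some.simulation.AOD.v1",                        # hypothetical name
        "files": ["AOD._00001.pool.root", "AOD._00002.pool.root"],
    }

    # A subscription says "this site wants a complete replica of that dataset";
    # the site services then drive the file transfers until the copy is complete.
    subscriptions = [
        {"dataset": dataset["name"], "destination": "SOME-TIER2-SE"},
    ]

    def missing_files(dataset, files_at_destination):
        """Files the destination still needs to complete its replica."""
        have = set(files_at_destination)
        return [f for f in dataset["files"] if f not in have]

    print(missing_files(dataset, ["AOD._00001.pool.root"]))
    # -> ['AOD._00002.pool.root']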

  10. How to access data on a Tier-2 • [Diagram: data flow from Tier-0 through the Tier-1 to the Tier-2 SE via FTS; the dataset catalog is queried via http and the local replica catalog (LRC) on the VOBOX via the lrc protocol; the Tier-2 CE reads from the local SE via rfio, dcap, gridftp or nfs]
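
A minimal sketch of the access pattern the diagram implies, assuming the replica lookup is hidden behind a hypothetical helper and the file is then opened with the site's native protocol through ROOT (the dcap URL is a made-up example):

    import ROOT

    def local_replica(lfn):
        """Hypothetical stand-in for the LRC lookup: map a logical file name to a
        site-local access URL (dcap, rfio, ...).  In reality the URL comes from
        the site's replica catalog; it is hard-coded here."""
        return "dcap://dcache.example.org:22125/pnfs/example.org/atlas/" + lfn

    url = local_replica("AOD._00001.pool.root")
    f = ROOT.TFile.Open(url)        # ROOT selects the protocol plugin from the URL
    if f and not f.IsZombie():
        print("opened", url)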

  11. To distribute the data for analysis … • Real data • Data recorded and processed at CERN (Tier-0) • Data distribution via Tier-1 • Reprocessing • Reprocessing at Tier-1 • Distribution via other Tier-1 • Simulation • Simulation at Tier-1 and associated Tier-2 • Collection of data at Tier-1 • Distribution via other Tier-1

  12. For analysis … • The challenge is not the amount of data, but the management of the overlapping flow patterns • For SC4 we have a simpler aim … • Obtain an equal distribution of the currently available simulated data • Data from the Tier-0 exercise is not useful for analysis • We will distribute only useful data

  13. Grid specific aspects • OSG • DDM fully in production since January • Site services also at Tier-2 centers • LCG/EGEE • Only dataset registration in production • New deployment model addresses this issue • Migration to the new version 0.2.10 is under way • Nordugrid • Up to now only dataset registration • Final choice of file catalog still open

  14. New version 0.2.10 • Many essential features for Distributed Analysis • Support for the ATLAS Tier structure • Fallback from FTS to SRM and gridftp • Support for disk areas • Parallel operation of production and SC4 • And many more … • Deployment is under way • We hope to see it in production very soon • OSG/Panda has to move to it asap • We should stay with this version until autumn • The success of Distributed Analysis during SC4 depends crucially on the success of this version
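
The FTS-to-SRM/gridftp fallback can be sketched as follows; this is the general idea only, not DQ2 code, and the client commands shown (glite-transfer-submit, srmcp, globus-url-copy) are assumptions about what a given UI or worker node provides:

    import subprocess

    def copy_with_fallback(src, dst):
        """Try the preferred transfer client first, then fall back (sketch only).
        The real FTS client is asynchronous (it returns a transfer ID); it is
        treated as a simple copy here to keep the example short."""
        attempts = [
            ["glite-transfer-submit", src, dst],    # FTS client, assumed installed
            ["srmcp", src, dst],                    # SRM client
            ["globus-url-copy", src, dst],          # plain gridftp
        ]
        for cmd in attempts:
            try:
                subprocess.check_call(cmd)
                return cmd[0]                       # report which layer succeeded
            except (OSError, subprocess.CalledProcessError):
                continue                            # try the next tool
        raise RuntimeError("all transfer methods failed for " + src)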

  15. Local throughput • An Athena job has in the ideal case a data throughput of about 2 MB/s • The limit is given by the persistency technology • StoreGate-POOL-ROOT • Up to 50% of the capacity of a site is dedicated to analysis • We plan to access data locally via the native protocols (rfio, dcap, root, etc.) • The local network configuration should take that into account
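
Combining the ~2 MB/s per-job figure with the 150 TB AOD sample quoted on slide 7 gives a back-of-the-envelope feeling for the scale (ideal case, no inefficiencies):

    TB = 1024 ** 4                      # bytes
    MB = 1024 ** 2

    aod_sample   = 150 * TB             # one nominal year of AOD (slide 7)
    per_job_rate = 2 * MB               # ~2 MB/s per Athena job

    for n_jobs in (100, 500, 1000):
        seconds = aod_sample / float(n_jobs * per_job_rate)
        print("%5d parallel jobs: ~%.1f days" % (n_jobs, seconds / 86400.0))

    # 1000 jobs sustain ~2 GB/s in aggregate, i.e. roughly a day for 150 TB in
    # the ideal case -- the inefficiencies on top of that are what the SC4
    # tests should measure.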

  16. Data Management Summary • Data Distribution is essential for Distributed Analysis • DQ2 0.2.10 has the required features • There is a lot of work ahead of us to control and validate the data distribution • Local configuration is determined by Athena I/O

  17. Workload management • Different middleware • Different teams • Different submission tools • Different submission tools are confusing to our users … • We aim to provide a common ATLAS UI following the ideas of the pathena tool (see later) • But … the priority for Distributed Analysis in the context of SC4 is to solve the technical problems within each grid infrastructure

  18. LCG/EGEE • [Diagram: jobs go from the gLite UI to the gLite Resource Broker, which dispatches them to the sites using the dataset location catalog and the BDII information system]

  19. Advantages of the gLite RB • Bulk submission • Increased performance and scalability • Improved Sandbox handling • Shallow resubmission • If you want to use a Resource Broker for Distributed Analysis, you ultimately want to use the gLite RB • Status • Being pushed into deployment with gLite 3.0 • Does not yet have the same maturity as the LCG RB • Turning the gLite RB into production quality evidently has a high priority
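
For orientation, a sketch of a minimal job description for Resource Broker submission, generated from Python; the executable, sandbox contents and software tag are placeholders, and the submission command mentioned in the comment may differ between middleware versions:

    # Write a minimal JDL file for submission through the Resource Broker.
    # Executable, sandbox contents and the software tag are placeholders.
    jdl = '''
    Executable    = "run_athena.sh";
    Arguments     = "ttbar_jobOptions.py";
    StdOutput     = "stdout.log";
    StdError      = "stderr.log";
    InputSandbox  = {"run_athena.sh", "ttbar_jobOptions.py"};
    OutputSandbox = {"stdout.log", "stderr.log", "AANT.root"};
    Requirements  = Member("VO-atlas-release-11.0.4",
                           other.GlueHostApplicationSoftwareRunTimeEnvironment);
    '''

    with open("analysis.jdl", "w") as f:
        f.write(jdl)

    # Submission then happens from the gLite UI, e.g.
    #   glite-wms-job-submit -a analysis.jdl
    # (the exact client command and options depend on the middleware version).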

  20. ATLAS EGEE/LCG Taskforce • Submission performance measurements: • gLite RB (bulk submission): 0.3 sec/job submission, 0.6 sec/job matchmaking • LCG RB (multiple threads): 2 sec/job submission • Job submission is not the limit any more (there are other limits …)

  21. Plan B: Prodsys + CondorG • [Diagram: the CondorG executor takes jobs from the production database (ProdDB) and submits them to the sites through the CondorG negotiator, consulting the dataset location catalog and the BDII]

  22. Advantages • Profits from the experience of the production • Proven record from the ongoing production • Better control over the implementation • CondorG by itself is also part of the Resource Broker • Performance • ~1 sec/job • Status • Coupled to the evolution of the production system • Work on GANGA integration has started

  23. OSG • On OSG ATLAS uses PANDA for Workload Management • The concept is similar to DIRAC and AliEn • Fully integrated with the DDM • Status • In production since January • Work to optimize the support for analysis users is ongoing

  24. PANDA Architecture

  25. Advantages for DA • Integration with DDM • All data already available • Jobs start only when the data is available • Late job binding due to pilot jobs • Addresses grid inefficiencies • Fast response for user jobs • Integration of production and analysis activities
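
The late-binding idea behind the pilots can be illustrated with a toy sketch (conceptual only, not the PANDA implementation): the pilot already occupies a batch slot and only then asks for a concrete user job, so a free pilot can pick up new work within seconds.

    import queue
    import time

    # A central queue of user jobs; in PANDA this role is played by the central
    # server, which the pilots contact over HTTP.
    job_queue = queue.Queue()
    job_queue.put({"id": 1, "cmd": "athena ttbar_jobOptions.py"})

    def pilot(poll_interval=1.0, max_idle_polls=3):
        """A pilot already occupies a worker-node slot; it binds to a concrete
        user job only when one is available (late binding)."""
        idle = 0
        while idle < max_idle_polls:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                idle += 1
                time.sleep(poll_interval)       # keep the slot warm, poll again
                continue
            idle = 0
            print("running job", job["id"], ":", job["cmd"])
            # ... execute the payload, stage out results, report back ...

    pilot(poll_interval=0.1)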

  26. Nordugrid • ARC middleware • Compact User Interface • 14 MB vs … • New version has very good job submission performance • 0.5 to 1 sec/job • Status • Open questions: support for all ATLAS users, DDM integration • Analysis capability seems to be coupled to the planned NG Tier-1

  27. ARC Job submission • [Diagram: the ARC UI submits jobs directly to the sites, using the RLS file catalog and the Nordugrid information system]

  28. WMS Summary • EGEE/LCG • gLite Resource Broker • Prodsys & CondorG • OSG • PANDA • Nordugrid • ARC • Different systems – different problems • All job submission systems need work to optimize user analysis

  29. Job Priorities • Different infrastructures – different problems • EGEE • Job Priority WG • OSG/Panda • Separate cloud of analysis pilots • Nordugrid • Typically a site has several queues

  30. EGEE Job Priority WG • TCG working group • ATLAS & CMS • LHCb & Diligent as observers • JRA1 (developers) • SA3 (deployment) • Several sites (NIKHEF, CNAF)

  31. ATLAS requirements • Split site resources into several shares • Production • Long jobs • Short jobs • Other jobs • Objectives • Production should not be pushed out of a site • Analysis jobs should bypass production jobs • Local fairshare

  32. Proposed solution • Shares on the site CEs: • Production (Role=Production): 70% • Long analysis queue: 20% • Short analysis queue: 9% • Software installation (Role=Software): 1%
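
As an illustration of the mapping (not an actual scheduler configuration), a small sketch that routes a job to one of these shares from the VOMS role in its proxy; the FQAN strings follow the slide's Role=Production / Role=Software convention:

    # The shares from this slide, in percent of the site capacity.
    SHARES = {"production": 70, "long": 20, "short": 9, "software": 1}

    def target_share(fqan, long_job=True):
        """Pick the share from the primary VOMS FQAN of the job's proxy."""
        if "Role=Production" in fqan:
            return "production"
        if "Role=Software" in fqan:
            return "software"
        return "long" if long_job else "short"

    print(target_share("/atlas/Role=Production"))      # -> production
    print(target_share("/atlas", long_job=False))      # -> short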

  33. Status • Based on VOMS Roles • Role=Production • Role=Software • No new middleware • A patch to the WMS has to be back-ported • Test Installation • NIKHEF, CNAF • TCG & WLCG MB have agreed to the proposed solution • We are planning the move to the preproduction service • Move to the production sites in the near future • In the future • Support for physics groups • Dynamic settings • Requires new middleware (such as GPBOX)

  34. PANDA • Increase the number of analysis pilots • Fast pickup of user jobs • The first job can start in a few seconds • Several techniques are being studied • Multitasking pilots • Analysis queues

  35. Job Priorities Summary • EGEE/LCG • New site configuration • OSG/Panda • Addressed by PANDA internal developments

  36. DA User Tools • pathena • PANDA tool for Distributed Analysis • Close to the Physics Analysis Tools group (PAT) • GANGA • Common project between LHCb & ATLAS • Used for Distributed Analysis on LCG

  37. pathena • Developed in close collaboration between PAT and PANDA • Local job: athena ttbar_jobOptions.py • Grid job: pathena ttbar_jobOptions.py --inDS csc11.005100.ttbar.recon.AOD… --split 10

  38. GANGA • Framework for job submission • Based on plugins for • Backends • LCG, gLite, CondorG, LSF, PBS • Applications • Athena, Executable • GPI abstraction layer • Python Command Line Interface (CLIP) • GUI
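
A brief sketch of what a job definition looks like from the GANGA Python interface; it assumes GANGA with the ATLAS plugins is available, and the exact attribute names (e.g. option_file on the Athena application) are indicative and version dependent:

    # To be typed at the ganga prompt (GANGA with the ATLAS plugins installed).
    j = Job()
    j.application = Athena()
    j.application.option_file = ['ttbar_jobOptions.py']   # attribute name indicative
    j.backend = LCG()
    j.submit()

    # Swapping the backend plugin retargets the same job, e.g. for a local test:
    #   j.backend = Local()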

  39. GANGA GUI

  40. Other Tools • LJSF (Light Job submission framework) • Used for ATLAS software installations • Runs ATLAS transformations on the grid • No integration with DDM yet • DA with Prodsys • Special analysis transformations • Work to interface with GANGA has started

  41. DA Tools Summary • Main tools • pathena with PANDA/OSG • GANGA with Resource Broker/LCG • Integration with Athena as demonstrated by pathena is a clear advantage • The GANGA plug-in mechanism allows, in principle, a single common interface • The priority for the GANGA team is to deliver a robust solution on LCG first

  42. Distributed Analysis in SC4 • Data distribution • Ongoing activity with the DDM operations team • Site configuration • We will move soon to the preproduction service • In a few weeks we will then move to the production sites • Exercising the job priority model • Analysis in parallel to production • Scaling tests of the computing infrastructure • Measurement of the turnaround time for analysis of large datasets

  43. SC4 Timescale • We plan to perform DA tests in August and then later in autumn • The aim is to quantify the current characteristics of the Distributed Analysis systems • Scalability • Reliability • Throughput • Simply answer the questions … • How long does it take to analyze the expected 150 TB of AOD data corresponding to one year of running of the LHC? • And what happens if several of you try to do it at the same time?

  44. Tier-2 Site Requirements • Configuration of the batch system to support the job priority model • gLite 3.0 • Analysis and production in parallel • Data availability • Connect to DDM • Disk area • Sufficient local throughput
