1 / 23

Grid Production Experience in the ATLAS Experiment

Grid Production Experience in the ATLAS Experiment. Horst Severini University of Oklahoma Kaushik De University of Texas at Arlington D0-SAR Workshop, LaTech April 7, 2004. ATLAS Data Challenges. Original Goals (Nov 15, 2001)

lidia
Download Presentation

Grid Production Experience in the ATLAS Experiment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Grid Production Experience in the ATLAS Experiment Horst Severini University of Oklahoma Kaushik De University of Texas at Arlington D0-SAR Workshop, LaTech April 7, 2004

  2. ATLAS Data Challenges • Original Goals (Nov 15, 2001) • Test computing model, its software, its data model, and to ensure the correctness of the technical choices to be made • Data Challenges should be executed at the prototype Tier centres • Data challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU • Current Status • Goals are evolving as we gain experience • Computing TDR ~end of 2004 • DC’s are ~yearly sequence of increasing scale & complexity • DC0 and DC1 (completed) • DC2 (2004), DC3, and DC4 planned • Grid deployment and testing is major part of DC’s LaTech D0SAR Meeting

  3. ATLAS DC1: July 2002-April 2003Goals : Produce the data needed for the HLT TDR Get as many ATLAS institutes involved as possibleWorldwide collaborative activityParticipation : 56 Institutes • Australia • Austria • Canada • CERN • China • Czech Republic • Denmark * • France • Germany • Greece • Israel • Italy • Japan • Norway * • Poland • Russia • Spain • Sweden * • Taiwan • UK • USA * • * using Grid LaTech D0SAR Meeting

  4. Process No. of events CPU Time CPU-days (400 SI2k) Volume of data kSI2k.months TB Simulation Physics evt. 107 415 30000 23 Simulation Single part. 3x107 125 9600 2 Lumi02 Pile-up 4x106 22 1650 14 Lumi10 Pile-up 2.8x106 78 6000 21 Reconstruction 4x106 50 3750 Reconstruction + Lvl1/2 2.5x106 (84) (6300) Total 690 (+84) 51000 (+6300) 60 DC1 Statistics (G. Poulard, July 2003) LaTech D0SAR Meeting

  5. U.S. ATLAS DC1 Data Production • Year long process, Summer 2002-2003 • Played 2nd largest role in ATLAS DC1 • Exercised both farm and grid based production • 10 U.S. sites participating • Tier 1: BNL, Tier 2 prototypes: BU, IU/UC, Grid Testbed sites: ANL, LBNL, UM, OU, SMU, UTA (UNM & UTPA will join for DC2) • Generated ~2 million fully simulated, piled-up and reconstructed events • U.S. was largest grid-based DC1 data producer in ATLAS • Data used for HLT TDR, Athens physics workshop, reconstruction software tests... LaTech D0SAR Meeting

  6. BNL - U.S. Tier 1, 2000 nodes, 5% for ATLAS, 10 TB, HPSS through Magda LBNL - pdsf cluster, 400 nodes, 5% for ATLAS (more if idle ~10-15% used), 1TB Boston U. - prototype Tier 2, 64 nodes Indiana U. - prototype Tier 2, 64 nodes UT Arlington - new 200 cpu’s, 50 TB Oklahoma U. - OSCER facility U. Michigan - test nodes ANL - test nodes, JAZZ cluster SMU - 6 production nodes UNM - Los Lobos cluster U. Chicago - test nodes U.S. ATLAS Grid Testbed LaTech D0SAR Meeting

  7. U.S. Production Summary • Exercised both farm and grid based production • Valuable large scale grid based production experience * Total ~30 CPU YEARS delivered to DC1 from U.S. * Total produced file size ~20TB on HPSS tape system, ~10TB on disk. * Black - majority grid produced, Blue - majority farm produced LaTech D0SAR Meeting

  8. DC1 Production Systems • Local batch systems - bulk of production • GRAT - grid scripts, generated ~50k files produced in U.S. • NorduGrid - grid system, ~10k files in Nordic countries • AtCom - GUI, ~10k files at CERN (mostly batch) • GCE - Chimera based, ~1k files produced • GRAPPA - interactive GUI for individual user • EDG/LCG - test files only • + systems I forgot… • More systems coming for DC2 • Windmill • GANGA • DIAL LaTech D0SAR Meeting

  9. GRAT Software • GRid Applications Toolkit • developed by KD, Horst Severini, Mark Sosebee, and students • Based on Globus, Magda & MySQL • Shell & Python scripts, modular design • Rapid development platform • Quickly develop packages as needed by DC • Physics simulation (GEANT/ATLSIM) • Pileup production & data management • Reconstruction • Test grid middleware, test grid performance • Modules can be easily enhanced or replaced, e.g. EDG resource broker, Chimera, replica catalogue… (in progress) LaTech D0SAR Meeting

  10. Prod. (UTA) MAGDA (BNL) Replica (local) 9 8 scratch 2 4 7 1,4,5,10 DC1 Remote Gatekeeper 5 Batch Execution 6 3 Param (CERN) GRAT Execution Model 1. Resource Discovery 2. Partition Selection 3. Job Creation 4. Pre-staging 5. Batch Submission 6. Job Parameterization 7. Simulation 8. Post-staging 9. Cataloging 10. Monitoring LaTech D0SAR Meeting

  11. U.S. Middleware Evolution Globus Used for 95% of DC1 production Condor-G Used successfully for simulation Used successfully for simulation (complex pile-up workflow not yet) DAGMan Tested for simulation, used for all grid-based reconstruction Chimera LCG LaTech D0SAR Meeting

  12. DC1 Production Experience • Grid paradigm works, using Globus • Opportunistic use of existing resources, run anywhere, from anywhere, by anyone... • Successfully exercised grid middleware with increasingly complex tasks • Simulation: create physics data from pre-defined parameters and input files, CPU intensive • Pile-up: mix ~2500 min-bias data files into physics simulation files, data intensive • Reconstruction: data intensive, multiple passes • Data tracking: multiple steps, one -> many -> many more mappings LaTech D0SAR Meeting

  13. New Production System for DC2 • Goals • Automated data production system for all ATLAS facilities • Common database for all production - Oracle currently • Common supervisor run by all facilities/managers - Windmill • Common data management system - Don Quichote • Executors developed by middleware experts (Capone, LCG, NorduGrid, batch systems, CanadaGrid...) • Final verification of data done by supervisor LaTech D0SAR Meeting

  14. Windmill - Supervisor • Supervisor development/U.S. DC production team • UTA: Kaushik De, Mark Sosebee, Nurcan Ozturk + students • BNL: Wensheng Deng, Rich Baker • OU: Horst Severini • ANL: Ed May • Windmill web page • http://www-hep.uta.edu/windmill • Windmill status • version 0.5 released February 23 • includes complete library of xml messages between agents • includes sample executors for local, pbs and web services • can run on any Linux machine with Python 2.2 • development continuing - Oracle production DB, DMS, new schema LaTech D0SAR Meeting

  15. XML switch (Jabber Server) XMPP (XML) XMPP (XML) Web server SOAP supervisor agent executor agent Windmill Messaging • All messaging is XML based • Agents communicate using Jabber (open chat) protocol • Agents have same command line interface - GUI in future • Agents & web server can run at same or different locations • Executor accesses grid directly and/or thru web services LaTech D0SAR Meeting

  16. Jabber Clients Jabber Clients Jabber Server XMPP Intelligent Agents • Supervisor/executor are intelligent communication agents • uses Jabber open source instant messaging framework • Jabber server routes XMPP messages - acts as XML data switch • reliable p2p asynchronous message delivery through firewalls • built in support for dynamic ‘directory’, ‘discovery’, ‘presence’ • extensible - we can add monitoring, debugging agents easily • provides ‘chat’ capability for free - collaboration among operators • Jabber grid proxy under development (LBNL - Agarwal) LaTech D0SAR Meeting

  17. Core Windmill Libraries • interact.py - command line interface library • agents.py - common intelligent agent library • xmlkit.py - xml creation (generic) and parsing library • messages.py - xml message creation (specific) • proddb.py - production database methods for oracle, mysql, local, dummy, and possibly other options • supervise.py - supervisor methods to drive production • execute.py - executor methods to run facilities LaTech D0SAR Meeting

  18. Capone Executor • Various executors are being developed • Capone - U.S. VDT executor by U. of Chicago and Argonne • Lexor - LCG executor mostly by Italian groups • NorduGrid, batch (Munich), Canadian, Australian(?) • Capone is based on GCE (Grid Computing Environment) • (VDT Client/Server, Chimera, Pegasus, Condor, Globus) • Status: • Python module • Process “thread” for each job • Archive of managed jobs • Job management • Grid monitoring • Aware of key parameters (e.g. available CPUs, jobs running) LaTech D0SAR Meeting

  19. Message protocols Web Service Jabber Translation Windmill ADA CPE Grid Stub DonQuixote Capone Architecture • Message interface • Web Service • Jabber • Translation level • Windmill • CPE (Capone Process Engine) • Processes • Grid • Stub • DonQuixote from Marco Mambelli LaTech D0SAR Meeting

  20. Windmill Screenshots LaTech D0SAR Meeting

  21. LaTech D0SAR Meeting

  22. Web Services Example LaTech D0SAR Meeting

  23. Conclusion • Data Challenges are important for ATLAS software and computing infrastructure readiness • Grids will be the default testbed for DC2 • U.S. playing a major role in DC2 planning & production • 12 U.S. sites ready to participate in DC2 • Major U.S. role in production software development • Test of new grid production system imminent • Physics analysis will be emphasis of DC2 - new experience • Stay tuned LaTech D0SAR Meeting

More Related