Grid Production Experience in the ATLAS Experiment
Horst Severini, University of Oklahoma
Kaushik De, University of Texas at Arlington
D0-SAR Workshop, LaTech, April 7, 2004
ATLAS Data Challenges
• Original goals (Nov 15, 2001)
  • Test the computing model, its software, and its data model, and ensure the correctness of the technical choices to be made
  • Data Challenges should be executed at the prototype Tier centres
  • Data Challenges will be used as input for a Computing Technical Design Report, due by the end of 2003 (?), and for preparing a MoU
• Current status
  • Goals are evolving as we gain experience
  • Computing TDR ~end of 2004
  • DCs are a ~yearly sequence of increasing scale & complexity
  • DC0 and DC1 (completed)
  • DC2 (2004), DC3, and DC4 planned
  • Grid deployment and testing is a major part of the DCs
ATLAS DC1: July 2002 - April 2003
• Goals:
  • Produce the data needed for the HLT TDR
  • Get as many ATLAS institutes involved as possible
  • Worldwide collaborative activity
• Participation: 56 institutes
  • Australia, Austria, Canada, CERN, China, Czech Republic, Denmark*, France, Germany, Greece, Israel, Italy, Japan, Norway*, Poland, Russia, Spain, Sweden*, Taiwan, UK, USA*
  • * using Grid
DC1 Statistics (G. Poulard, July 2003)

Process                   | No. of events | CPU time (kSI2k.months) | CPU-days (400 SI2k) | Data volume (TB)
Simulation, physics evt.  | 10^7          | 415                     | 30000               | 23
Simulation, single part.  | 3x10^7        | 125                     | 9600                | 2
Lumi02 pile-up            | 4x10^6        | 22                      | 1650                | 14
Lumi10 pile-up            | 2.8x10^6      | 78                      | 6000                | 21
Reconstruction            | 4x10^6        | 50                      | 3750                |
Reconstruction + Lvl1/2   | 2.5x10^6      | (84)                    | (6300)              |
Total                     |               | 690 (+84)               | 51000 (+6300)       | 60
U.S. ATLAS DC1 Data Production
• Year-long process, Summer 2002-2003
• Played 2nd largest role in ATLAS DC1
• Exercised both farm and grid based production
• 10 U.S. sites participating
  • Tier 1: BNL; Tier 2 prototypes: BU, IU/UC; Grid Testbed sites: ANL, LBNL, UM, OU, SMU, UTA (UNM & UTPA will join for DC2)
• Generated ~2 million fully simulated, piled-up and reconstructed events
• U.S. was the largest grid-based DC1 data producer in ATLAS
• Data used for HLT TDR, Athens physics workshop, reconstruction software tests, ...
U.S. ATLAS Grid Testbed
• BNL - U.S. Tier 1, 2000 nodes, 5% for ATLAS, 10 TB, HPSS through Magda
• LBNL - pdsf cluster, 400 nodes, 5% for ATLAS (more if idle, ~10-15% used), 1 TB
• Boston U. - prototype Tier 2, 64 nodes
• Indiana U. - prototype Tier 2, 64 nodes
• UT Arlington - new 200 CPUs, 50 TB
• Oklahoma U. - OSCER facility
• U. Michigan - test nodes
• ANL - test nodes, JAZZ cluster
• SMU - 6 production nodes
• UNM - Los Lobos cluster
• U. Chicago - test nodes
U.S. Production Summary
• Exercised both farm and grid based production
• Valuable large-scale grid-based production experience
• Total ~30 CPU-years delivered to DC1 from the U.S.
• Total produced file size ~20 TB on the HPSS tape system, ~10 TB on disk
• [Per-dataset summary table not reproduced; black entries were majority grid produced, blue entries majority farm produced]
DC1 Production Systems
• Local batch systems - bulk of production
• GRAT - grid scripts, ~50k files produced in the U.S.
• NorduGrid - grid system, ~10k files in Nordic countries
• AtCom - GUI, ~10k files at CERN (mostly batch)
• GCE - Chimera based, ~1k files produced
• GRAPPA - interactive GUI for individual users
• EDG/LCG - test files only
• + systems I forgot...
• More systems coming for DC2
  • Windmill
  • GANGA
  • DIAL
GRAT Software
• GRid Applications Toolkit
  • developed by KD, Horst Severini, Mark Sosebee, and students
• Based on Globus, Magda & MySQL
• Shell & Python scripts, modular design
• Rapid development platform
  • Quickly develop packages as needed by DC
    • Physics simulation (GEANT/ATLSIM)
    • Pile-up production & data management
    • Reconstruction
  • Test grid middleware, test grid performance
• Modules can be easily enhanced or replaced, e.g. EDG resource broker, Chimera, replica catalogue... (in progress)
GRAT Execution Model
[Diagram: production DB (UTA), MAGDA (BNL), parameter DB (CERN), local replica catalogue and scratch area, and the remote gatekeeper / batch execution site, connected by the numbered steps below]
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-staging
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-staging
9. Cataloging
10. Monitoring
(A schematic sketch of this flow follows below.)
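For illustration only, a minimal Python sketch of how the ten steps above chain together; the function names, site names, and file names are hypothetical and do not come from the actual GRAT scripts.

```python
# Hypothetical sketch of a GRAT-style execution flow (not the actual GRAT code).
# Each numbered step maps to one replaceable function, illustrating the toolkit's
# modular design: any step (e.g. replica catalogue or resource broker) could be
# swapped out without touching the others.

def discover_resources():
    """Step 1: query testbed sites for free CPUs (e.g. via Globus information services)."""
    return [{"site": "UTA", "gatekeeper": "gk.uta.example", "free_cpus": 40}]

def select_partition(prod_db):
    """Step 2: pick an unassigned input partition from the production DB."""
    return prod_db.pop(0)

def create_job(partition, site):
    """Step 3: build the job description (inputs, outputs, target site)."""
    return {"partition": partition, "site": site,
            "inputs": [f"dc1.{partition}.evgen.zebra"],
            "outputs": [f"dc1.{partition}.simul.zebra"]}

def prestage(job):
    """Step 4: copy input files from the data catalogue to the site scratch area."""
    print("staging in:", job["inputs"])

def submit(job):
    """Steps 5-7: hand the parameterized job to the remote gatekeeper / batch system."""
    print("submitting to", job["site"]["gatekeeper"])

def poststage_and_catalog(job):
    """Steps 8-9: copy outputs back and register them in the replica catalogue."""
    print("registering:", job["outputs"])

def run_once(prod_db):
    """One pass of the pipeline; step 10 (monitoring) would watch these jobs."""
    site = discover_resources()[0]
    job = create_job(select_partition(prod_db), site)
    prestage(job)
    submit(job)
    poststage_and_catalog(job)

if __name__ == "__main__":
    run_once(prod_db=[1001, 1002])
```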
U.S. Middleware Evolution
• Globus - used for 95% of DC1 production
• Condor-G - used successfully for simulation
• DAGMan - used successfully for simulation (complex pile-up workflow not yet)
• Chimera - tested for simulation, used for all grid-based reconstruction
• LCG
DC1 Production Experience
• Grid paradigm works, using Globus
  • Opportunistic use of existing resources; run anywhere, from anywhere, by anyone...
• Successfully exercised grid middleware with increasingly complex tasks
  • Simulation: create physics data from pre-defined parameters and input files, CPU intensive
  • Pile-up: mix ~2500 min-bias data files into physics simulation files, data intensive
  • Reconstruction: data intensive, multiple passes
  • Data tracking: multiple steps, one -> many -> many more mappings (illustrated in the sketch below)
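The "one -> many -> many more" bookkeeping can be pictured as a provenance tree. A small illustrative sketch, with invented file names and counts rather than the real DC1 catalogue contents:

```python
# Illustration of the DC1-style fan-out that data tracking has to follow:
# one generator file -> many simulated partitions -> many more piled-up outputs.
# File names and counts are made up for illustration.
from collections import defaultdict

provenance = defaultdict(list)          # parent file -> list of derived files

generator_file = "dc1.002000.evgen.root"
for part in range(3):                   # one -> many: simulation partitions
    sim = f"dc1.002000.simul.{part:04d}.zebra"
    provenance[generator_file].append(sim)
    for lumi in ("lumi02", "lumi10"):   # many -> many more: pile-up flavours
        provenance[sim].append(f"dc1.002000.{lumi}.{part:04d}.zebra")

def descendants(name):
    """All files derived (directly or indirectly) from `name`."""
    out = []
    for child in provenance.get(name, []):
        out.append(child)
        out.extend(descendants(child))
    return out

print(len(descendants(generator_file)), "derived files to track")   # -> 9
```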
New Production System for DC2
• Goals
  • Automated data production system for all ATLAS facilities
  • Common database for all production - currently Oracle
  • Common supervisor run by all facilities/managers - Windmill (division of labour sketched below)
  • Common data management system - Don Quichote
  • Executors developed by middleware experts (Capone, LCG, NorduGrid, batch systems, CanadaGrid...)
  • Final verification of data done by the supervisor
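A minimal sketch of the supervisor/executor split described above; the class and method names are invented for illustration and are not the actual Windmill or executor interfaces.

```python
# Hypothetical sketch of the DC2 production-system split: one supervisor pulls
# job definitions from the common production database, hands them to whatever
# executor a facility provides, verifies the outputs, and records them as done.
# Class and method names are invented for illustration.

class ProductionDB:
    """Stand-in for the common (Oracle) production database."""
    def __init__(self, jobs):
        self.todo = list(jobs)
        self.done = []
    def next_job(self):
        return self.todo.pop(0) if self.todo else None
    def mark_done(self, job):
        self.done.append(job)

class LocalExecutor:
    """Stand-in for a facility executor (Capone, Lexor, NorduGrid, batch, ...)."""
    def run(self, job):
        # A real executor would translate the job into grid/batch submissions.
        return {"job": job, "outputs": [f"{job}.out"], "status": "finished"}

def supervise(db, executor):
    """Supervisor loop: dispatch, collect, verify, record."""
    while (job := db.next_job()) is not None:
        result = executor.run(job)
        if result["status"] == "finished" and result["outputs"]:   # final verification
            db.mark_done(job)

db = ProductionDB(jobs=["dc2.simul.0001", "dc2.simul.0002"])
supervise(db, LocalExecutor())
print(db.done)
```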
Windmill - Supervisor
• Supervisor development / U.S. DC production team
  • UTA: Kaushik De, Mark Sosebee, Nurcan Ozturk + students
  • BNL: Wensheng Deng, Rich Baker
  • OU: Horst Severini
  • ANL: Ed May
• Windmill web page
  • http://www-hep.uta.edu/windmill
• Windmill status
  • version 0.5 released February 23
  • includes complete library of XML messages between agents
  • includes sample executors for local, PBS and web services
  • can run on any Linux machine with Python 2.2
  • development continuing - Oracle production DB, DMS, new schema
Windmill Messaging
[Diagram: the supervisor agent and the executor agent exchange XMPP (XML) messages through a Jabber server acting as an XML switch; the executor also reaches a web server via SOAP]
• All messaging is XML based
• Agents communicate using the Jabber (open chat) protocol
• Agents have the same command line interface - GUI in future
• Agents & web server can run at the same or different locations
• Executor accesses the grid directly and/or through web services
Intelligent Agents
[Diagram: Jabber clients exchanging XMPP messages through a central Jabber server]
• Supervisor/executor are intelligent communication agents (a minimal agent example follows below)
  • use the Jabber open source instant messaging framework
  • Jabber server routes XMPP messages - acts as an XML data switch
  • reliable p2p asynchronous message delivery through firewalls
  • built-in support for dynamic 'directory', 'discovery', 'presence'
  • extensible - we can add monitoring, debugging agents easily
  • provides 'chat' capability for free - collaboration among operators
• Jabber grid proxy under development (LBNL - Agarwal)
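For flavour only: a minimal Jabber agent written with the modern slixmpp library, which did not exist in 2004 (Windmill ran on Python 2.2-era Jabber clients); the JID, server, and password are placeholders.

```python
# Minimal Jabber "agent" that acknowledges every chat message it receives.
# Uses the present-day slixmpp library purely to illustrate the agent pattern;
# this is not Windmill code. JID and password below are placeholders.
from slixmpp import ClientXMPP

class EchoAgent(ClientXMPP):
    def __init__(self, jid, password):
        super().__init__(jid, password)
        self.add_event_handler("session_start", self.on_start)
        self.add_event_handler("message", self.on_message)

    async def on_start(self, event):
        self.send_presence()            # announce 'presence' to the Jabber server
        await self.get_roster()

    def on_message(self, msg):
        if msg["type"] in ("chat", "normal"):
            # In Windmill this would be an XML production message to act on;
            # here we simply acknowledge it.
            msg.reply("ack: %s" % msg["body"]).send()

if __name__ == "__main__":
    agent = EchoAgent("executor@jabber.example.org", "secret")
    agent.connect()
    agent.process(forever=True)
```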
Core Windmill Libraries
• interact.py - command line interface library
• agents.py - common intelligent agent library
• xmlkit.py - XML creation (generic) and parsing library
• messages.py - XML message creation (specific)
• proddb.py - production database methods for Oracle, MySQL, local, dummy, and possibly other options
• supervise.py - supervisor methods to drive production
• execute.py - executor methods to run facilities
(An illustrative XML build/parse example follows below.)
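To give a flavour of what an xmlkit.py / messages.py layer does, here is a small sketch that builds and parses one supervisor-to-executor message with the Python standard library; the <executeJobs> schema is invented for illustration and is not the actual Windmill message vocabulary.

```python
# Hypothetical example of building and parsing a supervisor->executor XML
# message with the standard library. The <executeJobs> schema is made up and
# is not the real Windmill message set.
import xml.etree.ElementTree as ET

def build_execute_message(jobs):
    msg = ET.Element("message", {"from": "supervisor", "to": "executor"})
    body = ET.SubElement(msg, "executeJobs")
    for name in jobs:
        ET.SubElement(body, "job", {"name": name})
    return ET.tostring(msg, encoding="unicode")

def parse_execute_message(text):
    root = ET.fromstring(text)
    return [job.get("name") for job in root.iter("job")]

wire = build_execute_message(["dc2.simul.0001", "dc2.simul.0002"])
print(wire)
print(parse_execute_message(wire))
```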
Capone Executor
• Various executors are being developed
  • Capone - U.S. VDT executor, by U. of Chicago and Argonne
  • Lexor - LCG executor, mostly by Italian groups
  • NorduGrid, batch (Munich), Canadian, Australian(?)
• Capone is based on GCE (Grid Computing Environment)
  • VDT Client/Server, Chimera, Pegasus, Condor, Globus
• Status:
  • Python module
  • Process "thread" for each job (see the sketch below)
  • Archive of managed jobs
  • Job management
  • Grid monitoring
  • Aware of key parameters (e.g. available CPUs, jobs running)
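A minimal sketch of the thread-per-job pattern noted in the status list, using Python's standard threading module; the job names and the "archive" structure are stand-ins, not Capone internals.

```python
# Hypothetical thread-per-job executor in the spirit described above: each job
# gets its own worker thread, and finished jobs are kept in an archive of
# managed jobs. Names and structures are stand-ins, not Capone internals.
import threading
import time

archive = []                       # archive of managed jobs
archive_lock = threading.Lock()

def run_job(name):
    """Worker: pretend to submit and track one grid job, then record the outcome."""
    time.sleep(0.1)                # placeholder for submission + monitoring
    with archive_lock:
        archive.append({"job": name, "status": "finished"})

jobs = ["dc2.simul.0001", "dc2.simul.0002", "dc2.simul.0003"]
threads = [threading.Thread(target=run_job, args=(j,)) for j in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(archive), "jobs archived")   # executor stays aware of jobs running/finished
```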
Capone Architecture (from Marco Mambelli)
[Layer diagram; also shows ADA alongside the translation level]
• Message interface
  • Web Service
  • Jabber
• Translation level
  • Windmill
  • CPE (Capone Process Engine)
• Processes
  • Grid
  • Stub
  • DonQuixote
Windmill Screenshots
[screenshots not reproduced]

Web Services Example
[screenshot not reproduced]
Conclusion
• Data Challenges are important for ATLAS software and computing infrastructure readiness
• Grids will be the default testbed for DC2
• U.S. playing a major role in DC2 planning & production
• 12 U.S. sites ready to participate in DC2
• Major U.S. role in production software development
• Test of new grid production system imminent
• Physics analysis will be the emphasis of DC2 - new experience
• Stay tuned