IFIC Instituto de Física Corpuscular, CSIC-Universitat de València, Spain
ATLAS Data Challenge 2: A Massive Monte Carlo Production on the Grid
Santiago González de la Hoz (Santiago.Gonzalez@ific.uv.es), on behalf of the ATLAS DC2 Collaboration
EGC 2005, Amsterdam, 14/02/2005
Overview
• Introduction
  • ATLAS experiment
  • Data Challenge program
• ATLAS production system
• DC2 production phases
• The 3 Grid flavours (LCG, Grid3 and NorduGrid)
• ATLAS DC2 production
• Distributed analysis system
• Conclusions
Introduction: LHC/CERN
[Aerial photograph of the LHC ring at CERN near Geneva, with Mont Blanc (4810 m) in the background]
The challenge of LHC computing
• Storage: raw recording rate of 0.1–1 GByte/s, accumulating 5–8 PetaBytes/year, with ~10 PetaBytes of disk
• Processing: equivalent to ~200,000 of today's fastest PCs
Introduction: ATLAS
• A detector for the study of high-energy proton–proton collisions.
• The offline computing will have to deal with an output event rate of 100 Hz, i.e. ~10⁹ events per year with an average event size of 1 MByte.
• The researchers are spread all over the world.
Introduction: Data Challenges
• Scope and goals:
  • In 2002 ATLAS computing planned a first series of Data Challenges (DCs) in order to validate its:
    • Computing Model
    • Software
    • Data Model
• The major features of DC1 were:
  • the development and deployment of the software required for the production of large event samples;
  • the production of those samples, involving institutions worldwide.
• The ATLAS collaboration decided to perform DC2 (and, in the future, DC3) using the Grid middleware developed in several Grid projects (Grid flavours):
  • the LHC Computing Grid project (LCG), to which CERN is committed
  • Grid3
  • NorduGrid
ATLAS production system
• In order to handle the task of ATLAS DC2, an automated production system was designed.
• The ATLAS production system consists of 4 components:
  • the production database, which contains abstract job definitions;
  • the windmill supervisor, which reads the production database for job definitions and presents them to the different Grid executors in an easy-to-parse XML format;
  • the executors, one for each Grid flavour, which receive the job definitions in XML format and convert them to the job description language of that particular Grid;
  • Don Quijote, the ATLAS Data Management System, which moves files from their temporary output locations to their final destination on some Storage Element and registers the files in the Replica Location Service of that Grid.
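A minimal sketch of how the four components hand work to one another, assuming hypothetical Python objects and method names (the real supervisor was windmill and the executors were lexor, dulcinea and capone; nothing below is taken from their code, only the division of work described on the slide):

```python
# Illustrative sketch of the DC2 production-system data flow.
# prod_db, executor and dq are hypothetical stand-ins for the production
# database, one Grid executor and the Don Quijote data management system.
import xml.etree.ElementTree as ET


def supervisor_cycle(prod_db, executor, dq):
    """One supervisor iteration: fetch job definitions, dispatch, register output."""
    for job_xml in prod_db.fetch_pending_jobs():      # abstract job definitions
        job = ET.fromstring(job_xml)                  # easy-to-parse XML format
        grid_job = executor.translate(job)            # to that Grid's job description language
        handle = executor.submit(grid_job)            # run on LCG, Grid3 or NorduGrid
        for output_file in executor.wait_for_outputs(handle):
            dq.move_to_final_se(output_file)          # temporary location -> final Storage Element
            dq.register_in_rls(output_file)           # register in that Grid's Replica Location Service
        prod_db.mark_done(job.get("name"))
```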
DC2 production phases
• Task flow for DC2 data: event generation (Pythia, HepMC events + MCTruth) → detector simulation (Geant4, hits + MCTruth) → digitization and pile-up (digits/RDO) → event mixing and byte-stream production (byte-stream raw digits) → reconstruction (ESD).
• Persistency: Athena-POOL.
• Volume of data for 10⁷ events: of the order of 5 TB to 30 TB per sample type (physics events, minimum-bias events, piled-up events, mixed events, mixed events with pile-up).
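The same task flow written out as a simple ordered structure. Tool names and data formats are taken from the slide; assigning the digitization, mixing and reconstruction steps to the Athena framework is an assumption based only on the Athena-POOL persistency mentioned above, and the quoted per-sample volumes are not encoded here.

```python
# DC2 task flow for one event sample, as described on the slide.
# Each stage consumes the previous stage's output format.
DC2_PIPELINE = [
    ("event generation",     "Pythia", None,                     "HepMC events + MCTruth"),
    ("detector simulation",  "Geant4", "HepMC events",           "Hits + MCTruth"),
    ("digitization/pile-up", "Athena", "Hits (+ min. bias)",     "Digits (RDO) + MCTruth"),
    ("event mixing",         "Athena", "Digits (RDO)",           "Byte-stream raw digits"),
    ("reconstruction",       "Athena", "Byte-stream raw digits", "ESD"),
]


def describe(pipeline):
    """Print the chain of stages with their input and output formats."""
    for stage, tool, inp, out in pipeline:
        print(f"{stage:22s} [{tool}] : {inp or 'generator configuration'} -> {out}")


describe(DC2_PIPELINE)
```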
DC2 production phases
• ATLAS DC2 started in July 2004 and finished the simulation part at the end of September 2004.
• 10 million events (100,000 jobs) were generated and simulated using the three Grid flavours.
• The Grid technologies have provided the tools to generate large Monte Carlo simulation samples.
• The digitization and pile-up part was completed in December; the pile-up was done on a sub-sample of 2 million events.
• The event mixing and byte-stream production are ongoing.
The 3 Grid flavours
• LCG (http://lcg.web.cern.ch/LCG/)
  • The job of the LHC Computing Grid project (LCG) is to prepare the computing infrastructure for the simulation, processing and analysis of LHC data for all four LHC collaborations. This includes both the common infrastructure of libraries, tools and frameworks required to support the physics application software, and the development and deployment of the computing services needed to store and process the data, providing batch and interactive facilities for the worldwide community of physicists involved in the LHC.
• NorduGrid (http://www.nordugrid.org/)
  • The aim of the NorduGrid collaboration is to deliver a robust, scalable, portable and fully featured solution for a global computational and data Grid system. NorduGrid develops and deploys a set of tools and services, the so-called ARC middleware, which is free software.
• Grid3 (http://www.ivdgl.org/grid2003/)
  • The Grid3 collaboration has deployed an international Data Grid with dozens of sites and thousands of processors. The facility is operated jointly by the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS.
• Grid3 and NorduGrid take similar approaches, using the same foundation (Globus) as LCG but with slightly different middleware.
The 3 Grid flavours: LCG
• This infrastructure has been operating since 2003.
• The resources used (computational and storage) are installed at a large number of Regional Computing Centers, interconnected by fast networks.
• 82 sites in 22 countries (numbers evolving very fast)
• 6558 TB of storage
• ~7269 CPUs (shared)
The 3 Grid flavours: NorduGrid
• NorduGrid is a research collaboration established mainly across the Nordic countries, but it includes sites from other countries as well.
• It contributed a significant part of DC1 (using the Grid, in 2002).
• It supports production on non-RedHat 7.3 platforms.
• 11 countries, 40+ sites, ~4000 CPUs
• ~30 TB of storage
The 3 Grid flavours: Grid3 (September 2004)
• 30 sites, multi-VO
• shared resources
• ~3000 CPUs (shared)
• The deployed infrastructure has been in operation since November 2003.
• At this moment it is running 3 HEP and 2 biology applications.
• Over 100 users are authorized to run on Grid3.
ATLAS DC2 production on LCG, Grid3 and NorduGrid
[Plot: total number of validated Geant4 simulation jobs per day on the three Grids]
Typical job distribution on LCG, Grid3 and NorduGrid
[Charts of the DC2 job distribution for each of the three Grids]
Distributed Analysis system: ADA
• The physicists want to use the Grid to perform the analysis of the data as well.
• The ADA (ATLAS Distributed Analysis) project aims at putting together all the software components needed to facilitate end-user analysis:
  • DIAL: defines the job components (dataset, task, application, etc.). Together with LSF or Condor it provides "interactivity" (a low response time).
  • ATPROD: the production system, to be used for small-scale production.
  • ARDA: the analysis system to be interfaced to the EGEE middleware.
• [Diagram: the ADA architecture]
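Conceptually, an analysis request in this model combines the components DIAL defines (a dataset, an application and a task). The snippet below is purely illustrative pseudostructure under that assumption; none of the class or field names come from the real DIAL API.

```python
# Hypothetical illustration of the job components named on the slide
# (dataset, application, task); all names and fields are invented for clarity.
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str          # e.g. a DC2 ESD sample
    files: list        # logical file names making up the sample


@dataclass
class Application:
    name: str          # e.g. the analysis framework to run
    version: str


@dataclass
class Task:
    script: str        # user analysis code applied to each file
    output: str        # histogram/ntuple file to return to the user


def submit(dataset: Dataset, app: Application, task: Task, scheduler: str = "LSF"):
    """A DIAL-like service (backed by LSF or Condor, per the slide) would split the
    dataset, run the task on each part and merge the results with low latency."""
    ...
```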
Lessons learned from DC2
• Main problems:
  • The production system was still under development during the DC2 phase.
  • The beta status of the Grid services caused trouble while the system was in operation; for example, the Globus RLS, the Resource Broker and the information system were unstable in the initial phase.
  • Especially on LCG, there was a lack of a uniform monitoring system.
  • Mis-configuration of sites and site-stability problems.
• Main achievements:
  • An automatic production system making use of the Grid infrastructure.
  • 6 TB (out of 30 TB) of data were moved among the different Grid flavours using the Don Quijote servers.
  • 235,000 jobs were submitted by the production system.
  • 250,000 logical files were produced, with 2,500–3,500 jobs per day distributed over the three Grid flavours.
Conclusions
• The generation and simulation of events for ATLAS DC2 have been completed using the 3 flavours of Grid technology.
• They have been proven to be usable in a coherent way for a real production, and this is a major achievement.
• This exercise has taught us that all the elements involved (Grid middleware, production system, deployment and monitoring tools) need improvements.
• Between the start of DC2 in July 2004 and the end of September 2004 (corresponding to the Geant4 simulation phase), the automatic production system submitted 235,000 jobs, which consumed ~1.5 million SI2k months of CPU and produced more than 30 TB of physics data.
• ATLAS is also pursuing a model for distributed analysis, which would improve the productivity of end users by profiting from the resources available on the Grid.
Backup Slides
Supervisor-executors: Jabber communication pathway
[Diagram: the windmill supervisors talk to the executors (1. lexor, 2. dulcinea, 3. capone, 4. legacy) over a Jabber communication pathway, exchanging the messages numJobsWanted, executeJobs, getExecutorData, getStatus, fixJob and killJob; the executors submit jobs to the execution sites (Grid), while Windmill also connects to the production database (jobs database) and to Don Quijote (file catalog).]
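The six messages in the diagram suggest a simple polling protocol between the supervisor and each executor. The sketch below is an assumption about how such a loop could look: only the message names come from the slide, the transport shown is a plain method call rather than the real Jabber pathway, and the status fields are invented.

```python
# Hypothetical supervisor loop around the six Windmill messages shown above.
# 'executor' stands for lexor, dulcinea, capone or the legacy executor.
import time


def supervise(executor, prod_db, poll_interval=60):
    while True:
        n = executor.numJobsWanted()              # how many jobs this Grid can take now
        if n > 0:
            jobs = prod_db.fetch_pending_jobs(limit=n)
            executor.executeJobs(jobs)            # hand over XML job definitions
        for status in executor.getStatus():       # poll jobs already running
            if status.failed:
                executor.fixJob(status.job_id)    # repair / resubmit
            elif status.stuck:
                executor.killJob(status.job_id)
        executor.getExecutorData()                # executor-specific bookkeeping
        time.sleep(poll_interval)
```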
NorduGrid: ARC features
• ARC is based on the Globus Toolkit with the core services replaced.
• It currently uses Globus Toolkit 2.
• Alternative/extended Grid services:
  • a Grid Manager that
    • checks user credentials and authorization,
    • handles jobs locally on clusters (interfaces to the LRMS),
    • does stage-in and stage-out of files;
  • a lightweight User Interface with a built-in resource broker;
  • an Information System based on MDS with a NorduGrid schema;
  • the xRSL job description language (extended Globus RSL);
  • a Grid Monitor.
• Simple, stable and non-invasive.
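To illustrate the xRSL job description mentioned above, the sketch below renders a few typical attributes (executable, arguments, inputFiles, stdout, cpuTime) into xRSL-style text. The attribute set is an assumption based on common NorduGrid examples, not a verbatim DC2 job.

```python
# Illustrative only: build an xRSL-like job description string.
# xRSL is Globus RSL extended by NorduGrid; the attributes used here are
# assumptions for the sake of the example, not copied from a real DC2 job.
def to_xrsl(executable, arguments, input_files, stdout="out.log", cpu_minutes=1440):
    parts = [
        f'(executable="{executable}")',
        f'(arguments="{arguments}")',
        "(inputFiles=" + "".join(f'("{name}" "")' for name in input_files) + ")",
        f'(stdout="{stdout}")',
        f'(cpuTime="{cpu_minutes}")',
    ]
    return "&" + "".join(parts)


print(to_xrsl("dc2.sim.sh", "job042", ["dc2.sim.sh"]))
```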
LCG software
• LCG-2 core packages:
  • VDT (Globus 2, Condor)
  • EDG WP1 (Resource Broker, job submission tools)
  • EDG WP2 (replica management tools) + lcg tools
    • one central RMC and LRC for each VO, located at CERN, with an ORACLE backend
  • several bits from other WPs (configuration objects, information providers, packaging, ...)
  • GLUE 1.1 (information schema) + a few essential LCG extensions
  • MDS-based Information System with significant LCG enhancements (replacements, simplified; see poster)
  • a mechanism for application (experiment) software distribution
• Almost all components have gone through some re-engineering:
  • robustness
  • scalability
  • efficiency
  • adaptation to local fabrics
• The services are now quite stable, and the performance and scalability have been significantly improved (within the limits of the current architecture).
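For comparison with the xRSL example above, a job handed to the EDG WP1 Resource Broker is described in Classad-style JDL. The fragment below is a generic illustration with invented values (file names, CPU-time requirement); the submission command mentioned in the comment is the standard LCG-2 user-interface client, not an ATLAS-specific tool.

```python
# Generic, illustrative LCG-2 style JDL written out from Python; values are made up.
JDL = """\
Executable    = "dc2.sim.sh";
Arguments     = "job042";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"dc2.sim.sh"};
OutputSandbox = {"stdout.log", "stderr.log"};
Requirements  = other.GlueCEPolicyMaxCPUTime >= 1440;
"""

with open("dc2_job.jdl", "w") as jdl_file:
    jdl_file.write(JDL)

# The file would then be submitted to the Resource Broker, e.g. with the
# LCG-2 user interface command:  edg-job-submit dc2_job.jdl
```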
Grid3 software
• Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT):
  • GRAM, GridFTP, MDS, RLS, VDS
• ... equipped with VO and multi-VO security, monitoring and operations services
• ... allowing federation with other Grids where possible, e.g. the CERN LHC Computing Grid (LCG)
  • USATLAS: GriPhyN VDS execution on LCG sites
  • USCMS: storage element interoperability (SRM/dCache)
• Delivering the US LHC Data Challenges
ATLAS DC2 (CPU)
Typical job distribution on LCG
Typical job distribution on Grid3
Job distribution on NorduGrid