
Analysis of the ATLAS Rome Production Experience on the LCG Computing Grid


Presentation Transcript


1. Analysis of the ATLAS Rome Production Experience on the LCG Computing Grid. Simone Campana, CERN/INFN. EGEE User Forum, CERN (Switzerland), March 1st – 3rd 2006

2. Outline
• The ATLAS Experiment
• The Computing Model and the Data Challenges
• The LCG Computing Grid
  • Overview and architecture
  • The ATLAS Production System
• The Rome Production on LCG
  • Report and figures from the production
  • Achievements with respect to DC2
  • Outstanding issues and possible improvements
• Conclusions

3. The ATLAS Experiment (ATLAS: A Toroidal LHC ApparatuS). Figure: view of the LHC ring at CERN and of the ATLAS detector.

4. The ATLAS Computing Model
Figure: data flow between the online filter farm, the Tier-0 reconstruction farm, the Tier-1 re-reconstruction and analysis farms, and the Tier-2 Monte Carlo and analysis farms (RAW, ESD, AOD and simulated data; selected ESD/AOD to Tier-2s).
• ATLAS computing can NOT rely on a SINGLE computer center model
  • The amount of required resources is too large
  • For the 1st year of data taking: 50.6 MSI2k of CPU, 16.9 PB of tape storage, 25.4 PB of disk storage
• ATLAS decided to embrace the GRID paradigm
  • High level of decentralization
• Sites are organized in a multi-tier structure
  • Hierarchical model
  • Tiers are defined by their ROLE in ATLAS computing
• Tier-0 at CERN
  • Records RAW data
  • Distributes a second copy to the Tier-1s
  • Calibration and first-pass reconstruction
• Tier-1 centers
  • Manage permanent storage – RAW, simulated and processed data
  • Capacity for reprocessing and bulk analysis
• Tier-2 centers
  • Monte Carlo event simulation
  • End-user analysis
• In Grid terminology, ATLAS is a Virtual Organization

5. Data Challenges
• Data Challenge: validation of the Computing and Data Model and a test of the complete software suite
  • Full simulation and reprocessing of data as if coming from the detector
  • Same software and computing infrastructure to be employed for data taking
• ATLAS ran two major Data Challenges
  • DC1 in 2002–2003 (with direct access to local resources + NorduGrid, see later)
  • DC2 in July – December 2004 (completely in a GRID environment)
• Large-scale production in January – June 2005
  • Informally called the “Rome Production”
  • Provided data for physics studies for the ATLAS Rome Workshop in June 2005
  • Can be considered fully equivalent to a real Data Challenge
    • Same methodology
    • Large number of events produced
  • Offered a unique opportunity to test improvements in the production framework, the Grid middleware and the reconstruction software
• ATLAS resources span three different Grids: LCG, NorduGrid and OSG
• In this talk I will present the “Rome Production” experience on the LHC Computing Grid infrastructure

6. The LCG Infrastructure (May 2005): 140 Grid sites in 34 countries, about 12,000 CPUs and 8 PB of storage.

7. LCG architecture
• The Workload Management System is responsible for the management and monitoring of jobs
  • A set of services running on the Resource Broker machine
  • Matches job requirements to the available resources
  • Schedules the job for execution on an appropriate Computing Element
  • Tracks the job status
  • Allows the user to retrieve the job output
• Each Computing Element is the front-end to a local batch system
  • Manages a pool of Worker Nodes where the job is eventually executed
• The Logging and Bookkeeping service
  • Keeps the state information of a job
  • Allows the user to query its status
• Limited user credentials (proxies) can be automatically renewed through a Proxy Service
(A job-submission sketch follows below.)
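To make the job-submission path concrete, here is a minimal sketch of how a production job could be described and handed to the LCG-2 workload management tools. The JDL attribute values (executable, software tag, task name) are hypothetical, the commands are invoked without error handling, and this is illustrative rather than the actual Lexor submission code.

```python
import subprocess
import tempfile

# Minimal JDL for a hypothetical ATLAS simulation job. The Resource Broker
# matches the Requirements expression against the Information System and
# dispatches the job to a suitable Computing Element.
jdl = """
Executable    = "run_atlas_sim.sh";
Arguments     = "rome.simul.example 100";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"run_atlas_sim.sh"};
OutputSandbox = {"stdout.log", "stderr.log"};
Requirements  = Member("VO-atlas-release-10.0.1",
                       other.GlueHostApplicationSoftwareRunTimeEnvironment);
Rank          = other.GlueCEStateFreeCPUs;
"""

with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
    f.write(jdl)
    jdl_file = f.name

# Submit through the LCG-2 user interface; the returned job identifier can
# later be passed to edg-job-status / edg-job-get-output.
subprocess.run(["edg-job-submit", "--vo", "atlas", jdl_file], check=True)
```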

8. LCG architecture
The Data Management System
• Allows the user to move files in and out of the Grid, replicate files among different Storage Elements and locate files
• Files are stored in Storage Elements
  • Disk-only or with a tape back-end
• A number of protocols allow data transfer
  • GridFTP is the most commonly used
• Files are registered in a central catalogue, the Replica Location Service
  • Keeps information about file locations and about some file metadata
The Information System
• Provides information about the Grid resources and their status
• Information is generated on every service and published by the GRIS
• Propagated in a hierarchical structure
  • A GIIS at every site
  • The BDII as central collector
(A data-management example follows below.)
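As an illustration of the data-management client tools mentioned above, the sketch below copies a file to a Storage Element, registers it in the catalogue and later retrieves a replica. The SE host, logical file name and local paths are made up, and real production code wraps such calls in far more error handling.

```python
import subprocess

def run(cmd):
    """Run an LCG data-management command and return its standard output."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Copy a local output file to a Storage Element (via GridFTP) and register it
# in the replica catalogue under a logical file name (LFN). Names are made up.
guid = run(["lcg-cr", "--vo", "atlas",
            "-d", "se.example-site.org",
            "-l", "lfn:rome.recon.aod.example.pool.root",
            "file:///tmp/recon.aod.example.pool.root"])

# List the physical replicas registered for that logical file name.
print(run(["lcg-lr", "--vo", "atlas", "lfn:rome.recon.aod.example.pool.root"]))

# A job on a Worker Node can later fetch the file from a replica.
run(["lcg-cp", "--vo", "atlas",
     "lfn:rome.recon.aod.example.pool.root",
     "file:///tmp/input.pool.root"])
```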

9. LCG architecture
• Accounting
  • Logs resource usage and traces user jobs
• Monitoring services
  • Visualize and record the status of LCG resources
  • Different systems in place: R-GMA, GridICE, ...

10. The ATLAS production system
• An ATLAS central database (ProdDB) holds Grid-neutral information about jobs.
• A “supervisor” agent distributes jobs to Grid-specific agents called “executors”, follows up their status, and validates them in case of success or flags them for resubmission.
• The executors offer an interface to the underlying Grid middleware; the LCG executor, Lexor, provides an interface to the native LCG WMS.
• File upload/download relies on Grid-specific client tools; the ATLAS Data Management System (Don Quijote) ensures high-level data management across the different Grids.
• Job monitoring is performed through Grid-specific tools. In LCG, information collected from the production database and the GridICE server is merged and published through an interactive web interface.
Figure: the production-system architecture, with the ProdDB and Don Quijote on top, one supervisor per executor, and executors for LCG, GRID3, NorduGrid and local batch resources, each Grid with its own replica catalogue (RLS). (A simplified supervisor loop is sketched below.)
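The division of labour between the supervisor and the executors can be pictured with the simplified loop below. The class and method names are invented for illustration and are not the real interfaces of the ATLAS production system.

```python
# Illustrative only: a much-simplified supervisor/executor interplay.

class ProdDB:
    """Grid-neutral job records held in the central production database."""
    def fetch_pending(self, n): ...
    def mark_done(self, job): ...
    def mark_failed(self, job): ...

class LCGExecutor:
    """Wraps the Grid-specific middleware (here: the LCG WMS)."""
    def submit(self, job): ...
    def status(self, handle): ...
    def output_ok(self, handle): ...

def supervisor_cycle(db: ProdDB, executor: LCGExecutor, batch_size=50):
    handles = {}
    # 1. Pull Grid-neutral job definitions from the production database and
    #    hand them to the executor for Grid-specific submission.
    for job in db.fetch_pending(batch_size):
        handles[executor.submit(job)] = job
    # 2. Follow up the job states; validate successful jobs and flag failed
    #    ones so that they can be resubmitted in a later cycle.
    for handle, job in handles.items():
        state = executor.status(handle)
        if state == "Done":
            if executor.output_ok(handle):
                db.mark_done(job)
            else:
                db.mark_failed(job)
        elif state == "Aborted":
            db.mark_failed(job)
```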

11. Task Flow for ATLAS production
Figure: the production task flow. Generated events (HepMC plus MC truth, e.g. from Pythia) go through Geant4 detector simulation (hits plus MC truth), digitization with optional pile-up (RDO digits), event mixing and byte-stream conversion, and finally reconstruction (ESD). The diagram quotes data volumes for 10^7 events of roughly 5 TB, 20 TB, 20 TB, 30 TB and ~5 TB at the various stages (event generation with minimum-bias events, detector simulation, digitization with pile-up, mixing/byte stream, reconstruction).

12. But in fact …
Figure: the reduced chain used for the Rome production: event generation (Pythia, HepMC events plus MC truth), Geant4 detector simulation (hits plus MC truth), digitization with optional pile-up (RDO digits), and reconstruction (ESD, AOD).
• Only part of the full chain was used/tested for the Rome Production
  • No ByteStream conversion
  • No event mixing
• Reconstruction was performed on digitized events and only partially on piled-up events
(A compact summary of the reduced chain follows below.)
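A compact way to summarise the reduced chain is the list below; the stage labels are descriptive only and do not correspond to the actual ATLAS transformation names.

```python
# Stages exercised for the Rome production (byte-stream conversion and event
# mixing were left out; reconstruction ran on digitized and, partially, on
# piled-up events). Labels are illustrative.
ROME_CHAIN = [
    ("generation",     "Pythia",                              "HepMC events + MC truth"),
    ("simulation",     "Geant4",                              "hits + MC truth"),
    ("digitization",   "ATLAS digitization (optional pile-up)", "RDO digits"),
    ("reconstruction", "ATLAS reconstruction",                "ESD and AOD"),
]

for stage, tool, output in ROME_CHAIN:
    print(f"{stage:15s} -> {tool}: produces {output}")
```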

13. Rome Production experience on LCG
• On average, 8 concurrent instances of Lexor were active on the native LCG-2 system.
• Four people controlled the production process
  • Checking for job failures
  • Interacting with the middleware developers and the LCG Experiment Integration Support team
• The production for the Rome workshop consisted of:
  • A total of 380k jobs submitted to the native LCG-2 WMS
    • 109k simulation jobs
    • 106k digitization jobs
    • 125k reconstruction jobs
    • 40k pile-up jobs
  • A total of 1.4M files stored in LCG Storage Elements
    • Corresponding to about 45 TB of data
• This is a clear improvement with respect to DC2, where 91.5k jobs in total ran on LCG-2 and no reconstruction was performed.

14. Rome Production experience on LCG. Figure: number of jobs per day during Data Challenge 2 and during the Rome Production.

15. Rome Production experience on LCG
• Jobs were distributed to 45 different computing resources.
• The share of jobs at each site was generally proportional to the size of its cluster, indicating an overall good job distribution.
• No single site ran a large majority of the jobs
  • The site with the largest number of CPU resources (CERN) contributed about 11% of the ATLAS production
  • Other major sites each ran between 5% and 8% of the jobs
• This is an achievement toward a more robust and fault-tolerant system that does not rely on a small number of large computing centers.
Figure: the percentage of ATLAS jobs run at each LCG site.

16. Improvements: the Information System
• An unstable Information System can affect production in many respects
  • Jobs might not match the full set of resources and flood a restricted number of sites
    • Causing overload of some site services while leaving other available resources unused
  • Data-management commands might fail to transfer input and output files
    • Wasting a large amount of CPU cycles
    • Causing overhead for the submission system and the production team
• For the Rome production, several aspects were improved (a query example is sketched below)
  • Fixes in the BDII software
  • BDII deployed as a load-balanced service
    • Multiple back-ends behind a DNS switch
  • This reduced the single-point-of-failure effect for both job submission and data management during job execution.
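A load-balanced BDII is queried exactly like a single instance, since the DNS alias hides the multiple back-ends. The sketch below issues a standard LDAP query for Computing Elements that support the ATLAS VO; the BDII alias is hypothetical and the GLUE attribute filter may differ slightly between schema versions.

```python
import subprocess

# Query the top-level BDII (LDAP on port 2170, base mds-vo-name=local,o=grid)
# for Computing Elements advertising support for the ATLAS VO.
# "lcg-bdii.example.org" stands for the load-balanced DNS alias.
query = subprocess.run(
    ["ldapsearch", "-x", "-LLL",
     "-H", "ldap://lcg-bdii.example.org:2170",
     "-b", "mds-vo-name=local,o=grid",
     "(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:atlas))",
     "GlueCEUniqueID", "GlueCEStateFreeCPUs"],
    capture_output=True, text=True)

print(query.stdout)
```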

17. Improvements: Site Configuration
• Site misconfiguration was the main source of failures during DC2
  • No procedure was in place during DC2; problems were treated on a case-by-case basis
  • This is unsustainable in the long term, since the LCG infrastructure counts a very large number of resources, grows very rapidly and is widely distributed
• Many sites started careful monitoring of the number of job failures
  • And developed automatic tools to identify problematic nodes
• The LCG Operations team developed a series of automatic tools for site sanity checks
  • See next slide

18. Improvements: Site Configuration
• The GIIS monitor
  • Checks the consistency of the information published by each site in the Information System
  • Almost in real time
• The Site Functional Tests
  • Run every day at every site
  • Test the correct configuration of the Worker Nodes and the interaction with Grid services
  • Can now include VO-specific tests and allow a VO-specific view

19. Improvements: Site Configuration
• Freedom of Choosing Resources (FCR)
  • Allows the user (VO) to exclude mis-configured resources from the BDII
  • Generally based on the Site Functional Test results
  • Resources can be whitelisted or blacklisted
  • Storage and computing resources of the same site can be excluded separately (a toy filter is sketched below)
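Conceptually, FCR amounts to filtering the published resources through white and black lists before a VO's jobs and data operations see them. The sketch below is a toy version of that filtering, not the real FCR implementation.

```python
# Illustrative black/white listing of resources, in the spirit of the Freedom
# of Choosing Resources mechanism (not the real implementation).

def filter_resources(resources, whitelist=None, blacklist=()):
    """Keep a resource if it is whitelisted (when a whitelist is given) and
    not blacklisted. Computing and storage entries are independent, so a
    site's SE can be excluded while its CE stays in the VO's view."""
    selected = []
    for res in resources:  # e.g. {"site": "CERN", "type": "CE", "id": "ce01.example.ch"}
        if res["id"] in blacklist:
            continue
        if whitelist is not None and res["id"] not in whitelist:
            continue
        selected.append(res)
    return selected
```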

20. Improvements: WMS and others
• The LCG Workload Management System is highly automated
  • Designed to reduce human intervention to the minimum
  • Consists of a complex set of services interacting with external components
• This complexity caused a certain unreliability of the WMS during DC2
• The system became more robust before the Rome production
  • Several bug fixes and optimizations in the WMS workflow
• The heterogeneous and dynamic nature of a Grid environment implies a certain level of unreliability
  • The ATLAS application was improved to cope with such unreliability
• The production team and the LCG operation and support teams gathered a lot of experience during DC2 and benefited from it at the time of the Rome Production

21. Issues (and possible improvements)
Table: breakdown of job failures by cause.
• The failure rate is still quite high (~48%)
• Different failures imply different amounts of resource loss
• The most serious cause is clearly Data Management

22. Data Management: issues
• NO reliable file-transfer service was in place during the Rome production
  • Data movement was performed with the LCG DM client tools
• The LCG DM tools did not provide timeout and retry capabilities
  • WORKAROUND: a timeout and possible retry was implemented in Lexor at some point (see the sketch below)
• The LCG DM tools do not always ensure consistency between files in the SE and entries in the catalogue
  • e.g. if the catalogue is down or unreachable, or the operation is killed prematurely
  • WORKAROUND: manual clean-up was needed
• Data access on mass-storage systems was very problematic
  • Data need to be moved (staged) from tape to disk before being accessed
  • The middleware could not ensure the existence/persistency of data on disk
  • WORKAROUND: manual pre-staging of files was carried out by the production team
• The ATLAS strategy for file distribution must be (re)thought
  • Output files were chaotically spread over 143 different Storage Elements
  • No replication scheme for frequently accessed files was in place
  • This complicates the analysis of the reconstructed samples and the production itself
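The timeout-and-retry workaround amounts to wrapping each data-management call in a watchdog. The sketch below shows the general idea; it is not the code that was actually added to Lexor, and the command in the usage comment is only an example.

```python
import subprocess
import time

def transfer_with_retry(cmd, timeout=600, retries=3, backoff=60):
    """Run a data-management command, killing it if it hangs longer than
    `timeout` seconds and retrying a bounded number of times before
    giving up."""
    for attempt in range(1, retries + 1):
        try:
            subprocess.run(cmd, check=True, timeout=timeout)
            return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            if attempt < retries:
                time.sleep(backoff)  # give transient problems a chance to clear
    return False

# Hypothetical usage inside a job wrapper:
# transfer_with_retry(["lcg-cp", "--vo", "atlas", "lfn:some.input", "file:///tmp/in"])
```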

23. Data Management: improvements
• Timeout and retry capabilities were introduced natively in the LCG DM tools
  • The tools were also improved to guarantee atomic operations
• A new catalogue, the LCG File Catalogue (LFC), has been developed
  • More stable, easier problem tracking, better performance and reliability
• The Storage Resource Manager (SRM) interface was introduced as a front-end to every SE
  • Agreed between the experiments and the middleware developers
  • Standardizes storage access and management
  • Offers more functionality for MSS access
• A reliable File Transfer Service (FTS) was developed within the EGEE project
  • It is a SERVICE
  • Allows files to be replicated between SEs in a reliable way
  • Built on top of GridFTP and SRM
  • Capable of dealing with data transfers from/to MSS
(Example client usage is sketched below.)
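For illustration, the sketch below shows how the new components could be exercised from a user interface or VO box: listing a directory in the LFC and asking FTS to replicate a file between two SRM-fronted Storage Elements. All host names, paths and the FTS endpoint are hypothetical, and the exact CLI options may differ between middleware releases.

```python
import os
import subprocess

# Browse the ATLAS namespace in the LCG File Catalogue. The LFC client
# commands take the catalogue host from the LFC_HOST environment variable;
# the host and path below are made up.
env = dict(os.environ, LFC_HOST="lfc.example-tier1.org")
subprocess.run(["lfc-ls", "-l", "/grid/atlas/rome"], env=env, check=True)

# Ask the File Transfer Service to replicate a file between two SRM-fronted
# Storage Elements. The FTS endpoint and SURLs are hypothetical; the service
# queues the transfer and retries it on behalf of the user.
subprocess.run(
    ["glite-transfer-submit",
     "-s", "https://fts.example-tier0.org:8443/fts",
     "srm://se.example-tier0.org/atlas/rome/file.pool.root",
     "srm://se.example-tier1.org/atlas/rome/file.pool.root"],
    check=True)
```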

24. Data Management: improvements
• FTS and SRM SEs have been intensively tested during Service Challenge 3 (ongoing)
  • The throughput exercise started in July 2005 and is continuing at a low rate even now
  • Data distribution from the CERN T0 to several T1s
• A number of issues have been addressed
  • Many already fixed
  • Others to be fixed by Service Challenge 4 (April 2006)
• In general, very positive feedback from the experiments

25. Strategy for file distribution
• A new Distributed ATLAS Data Management (DDM) system is already in place
  • Enforces the concept of a “logical dataset”: a collection of files moved and located as a single entity
  • Dataset subscription model: a site declares its interest in holding a dataset, and ATLAS agents trigger the migration of its files (see the sketch below)
  • Integrated with LFC, FTS and SRM
• Fully tested and employed during SC3
  • Data throughput from CERN to the ATLAS Tier-1s
  • The target (80 MB/s sustained for a week) was fully achieved
• The ATLAS DDM is now being integrated with the ATLAS Production System
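The subscription model can be illustrated with a toy data structure: a dataset is a named collection of files, a site subscribes to it, and a site-local agent keeps triggering transfers until the dataset is complete at that site. This is a conceptual sketch, not the real ATLAS DDM (DQ2) API.

```python
# Toy illustration of the dataset subscription idea (not the real DDM API).

class Dataset:
    """A logical dataset: a collection of files moved and located as a unit."""
    def __init__(self, name, files):
        self.name = name
        self.files = set(files)          # logical file names

class Site:
    def __init__(self, name):
        self.name = name
        self.replicas = set()            # files already held locally
        self.subscriptions = []          # datasets this site wants complete

    def subscribe(self, dataset):
        """Declare the site's interest in holding the full dataset."""
        self.subscriptions.append(dataset)

    def agent_cycle(self, transfer):
        """Site-local agent: trigger the transfer of files still missing."""
        for ds in self.subscriptions:
            for lfn in ds.files - self.replicas:
                transfer(lfn, self.name)  # e.g. an FTS job behind the scenes
                self.replicas.add(lfn)
```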

26. The Workload Management: issues
• The performance of the WMS for job submission and handling
  • Generally acceptable under normal conditions
  • … but degrades under stress
  • WORKAROUND: several Resource Brokers dedicated to ATLAS, with different hardware configurations, were deployed
• The EGEE project will provide an enhanced WMS
  • Possibility of bulk submission, bulk matching and bulk queries
  • Improved communication with the Computing Elements at sites
  • Expected improvement of job submission speed and job dispatching
  • Some preliminary tests show promising results
  • Several issues still have to be clarified

27. Monitoring: issues and improvements
• Lack of VO-specific information about jobs at the sites
  • GridICE sensors are deployed at every site, but not correctly configured everywhere
  • Partial information, difficult to interpret
  • Queries to the ATLAS Production Database could cause an excessive load
• The error diagnostics should be improved
  • Currently performed by parsing executor log files and querying the database (a toy classifier is sketched below)
  • Should be formalized in proper tools
• Real-time job output inspection would have been helpful
  • Especially to investigate the causes of hanging jobs
• An ATLAS team is building a global job monitoring system
  • Based on the current tools
  • Possibly integrating new components (R-GMA, etc.)
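In the absence of dedicated tools, error diagnostics of this kind boils down to scanning executor log files for known failure signatures. The sketch below shows the idea; the regular expressions are invented examples, not the patterns actually used by the production team.

```python
import re
from collections import Counter

# Hypothetical failure signatures that might appear in executor log files;
# the real patterns depend on the middleware and application versions.
SIGNATURES = {
    "data management": re.compile(r"lcg-c[rp].*(timed out|No such file)"),
    "proxy expired":   re.compile(r"proxy.*expired", re.IGNORECASE),
    "site/WN problem": re.compile(r"(No space left on device|command not found)"),
    "application":     re.compile(r"Athena.*(FATAL|core dumped)"),
}

def classify_failures(log_paths):
    """Return a count of failure categories found across executor logs."""
    counts = Counter()
    for path in log_paths:
        with open(path, errors="replace") as log:
            text = log.read()
        for reason, pattern in SIGNATURES.items():
            if pattern.search(text):
                counts[reason] += 1
                break
        else:
            counts["unknown"] += 1
    return counts
```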

28. Conclusions
• The Rome Production on the LCG infrastructure was an overall successful exercise
  • It exercised the ATLAS production system
  • Contributed to testing the ATLAS Computing and Data model
  • Stress-tested the LCG infrastructure
  • … and produced a lot of simulated data for the physicists!!!
• This success must be seen as the consequence of several improvements
  • In the Grid middleware
  • In the ATLAS components
  • In the LCG operations
• Still, several components need improvement
  • Both in terms of reliability and of performance
  • Production still requires a lot of human attention
• The issues have been reported to the relevant parties and a lot of work has been done since the Rome Production
  • Preliminary tests show promising improvements
  • They will be evaluated fully in Service Challenge 4 (April 2006)
