
ATLAS Computing Model

Learn about the ATLAS Computing Model for data management and performance testing. Understand the challenges faced and successes achieved in distributing data and running Monte Carlo production.

Presentation Transcript


  1. ATLAS Computing Model • Ghita Rahal, CC-IN2P3 • Tutorial Atlas CC, Lyon, 5-3-2007

  2. Guidelines • Atlas Computing Model • Atlas Data management • Atlas tests • MC Production Atlas Tutorial

  3. ATLAS Tier-1 Data Flow (2008) [diagram] • Elements shown: Tier-0, Tier 1 (processor farm, disk buffer, disk storage, mass storage), other Tier-1s, Tier-2s cloud • RAW: 1.6 MB/event, 320 MB/s • ESD: 1 MB/event, 200 MB/s • AOD: 0.1 MB/event, 20 MB/s • Other rates labelled on the diagram: 43 MB/s, 20 MB/s, 51+20 MB/s • Reprocessing(s) a month later and beyond
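As a rough cross-check of the RAW, ESD and AOD figures on the data-flow slide, dividing each quoted rate by the quoted per-event size gives the implied event rate; all three come out at about 200 events per second. The Python sketch below is only this arithmetic. The function names and the decimal-terabyte convention are illustrative assumptions, not part of the talk, and the other rates on the slide (43, 20 and 51+20 MB/s) are left out because the transcript does not say which link they label.

```python
# Per-event size (MB) and aggregate rate (MB/s) as quoted on the slide.
FORMATS = {
    "RAW": (1.6, 320.0),
    "ESD": (1.0, 200.0),
    "AOD": (0.1, 20.0),
}


def implied_event_rate(size_mb, rate_mb_s):
    """Events per second implied by a data rate and a per-event size."""
    return rate_mb_s / size_mb


def daily_volume_tb(rate_mb_s, seconds=86400):
    """Volume in decimal TB if the rate were sustained for `seconds`.

    Assumption: 1 TB = 1e6 MB and a full 24 h of continuous running.
    """
    return rate_mb_s * seconds / 1e6


if __name__ == "__main__":
    for name, (size_mb, rate_mb_s) in FORMATS.items():
        events = implied_event_rate(size_mb, rate_mb_s)
        volume = daily_volume_tb(rate_mb_s)
        print(f"{name}: {events:.0f} events/s, {volume:.2f} TB/day if sustained")
```

The daily volume is an upper bound for illustration only, since it assumes the quoted rate is held for a full day without interruption.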

  4. Data Flow: Monte Carlo Production and User's Analysis [diagram] • Elements shown: Tier 1 (processors, user's analysis, disk buffer, disk storage, mass storage), other Tier-1s, Tier-2s cloud • Data types shown: MC HITS, RDO, ESD, AOD, MC AOD

  5. Situation in 2006-2007 • The previous slides show how ATLAS is expected to run once LHC data start flowing. • Before LHC running, i.e. before Nov 2007: • Monte Carlo is processed at Tier-1s and Tier-2s • All data production and distribution tests are exercised with MC

  6. Guidelines • How to apply the ATLAS Computing Model? • ATLAS Data Management

  7. ATLAS Data Management • ATLAS uses 3 grids (LCG, OSG and NorduGrid), each with its own services; this requires an ATLAS layer over the Grid middleware • ATLAS model of computing and data distribution: • Storage capacity spread across the T1 sites • Different storage systems with different access technologies • Computing power distributed over all Tiers (1, 2, 3) to produce MC and process data • The data distribution tool must: • Allow high-performance and reliable data movement • Include information about data location and replication • Support multiple grid flavours

  8. ATLAS Distribution Tool: DDM • See Stephane's talk

  9. Guidelines • Exercising the Model and preparing for real data: ATLAS tests

  10. Tests at the Lyon Tier-1 • CSC (Computer System Commissioning): setup of tests and milestones • Performance and functional tests of data transfers, T0 to T1 and T1 to T2 • June-July 2006, September-October 2006 • Continuing in 2007 (March 2007…) • Goal: a stable and efficient data distribution system • New in 2007: CDR (Computing Dress Rehearsal) to exercise the full ATLAS data model

  11. Performance tests T0=>T1, July 2006 • The goal was almost reached for a few hours • Problems on various sides (availability of the sites and of the services, access to the catalogs, …)

  12. ATLAS performance tests T1=>T2 (July 2006) • ATLAS: continuous transfers from T1 to T2 sites, initiated by the Tier-1

  13. ATLAS performance tests T1=>T2, July 2006 • Transfers to 7 sites, T2 and non-T2, simultaneously • Some bandwidth limitations for simultaneous transfers

  14. Performance tests T0=>T1, October 2006 • Overall weaker throughput due to simultaneous multi-VO tests • Some drops are understood (Castor), but most are not

  15. Multi-VO tests • Two-day tests involving multiple VOs • Data generated at Tier-0 according to each experiment's transfer rate • Transfer to all sites

  16. Multi-VO tests • ALICE, ATLAS and CMS transfers to the Lyon Tier-1 • Nominal transfer rates reached after a few improvements…

  17. [Figure; labels: BAD, OK]

  18. Problems: identified or not • Many improvements during 2006, with an increase in the scale of the overall tests and of the fixes in the last quarter of 2006. But stable running has not yet been achieved, for a variety of reasons: • Persistent and transient site failures • Frequent FTS transfer failures: a big problem when multiple VOs run • LFC server hangs and failures: solved • Hardware upgrades for the VO box at some sites: fixed • Memory leaks and other overflow conditions in the DDM tool when running for long periods: fixed • Throughput per stream per site seems to vary heavily (some streams are very slow): not understood

  19. Problems but also successes • Large file sizes always lead to much more stable running: still not fully understood • Unstable data generation (Castor configuration…): significant downtimes and problems maintaining a constant stream for Tier-1 export • Monitoring: • Missing automated alarms • Missing clear view of errors per site • Missing overall success metrics per dataset • Lack of manpower! • BUT despite this list, many successes at the end of 2006, and a very responsive and committed Lyon T1 and cloud
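The monitoring gaps listed on this slide (per-site error view, per-dataset success metrics, automated alarms) can be illustrated with a small sketch. It is a minimal Python example under stated assumptions: the record layout, the dataset and site names, and the 80% alarm threshold are all hypothetical and do not describe the DDM tool or any monitoring actually used by ATLAS.

```python
from collections import defaultdict

# Hypothetical transfer log: (dataset, destination site, succeeded?).
# Names and values are invented for illustration only.
TRANSFERS = [
    ("dataset_A", "SITE_1", True),
    ("dataset_A", "SITE_2", False),
    ("dataset_A", "SITE_2", True),
    ("dataset_B", "SITE_1", True),
    ("dataset_B", "SITE_2", False),
]


def success_rates(transfers, key_index):
    """Fraction of successful transfers, grouped by one field of the record.

    key_index 0 groups by dataset, key_index 1 groups by site.
    """
    totals = defaultdict(int)
    succeeded = defaultdict(int)
    for record in transfers:
        key = record[key_index]
        totals[key] += 1
        if record[2]:
            succeeded[key] += 1
    return {key: succeeded[key] / totals[key] for key in totals}


def alarms(rates, threshold=0.8):
    """Keys whose success rate falls below the (assumed) alarm threshold."""
    return sorted(key for key, rate in rates.items() if rate < threshold)


if __name__ == "__main__":
    per_site = success_rates(TRANSFERS, 1)
    per_dataset = success_rates(TRANSFERS, 0)
    print("per site:", per_site)
    print("per dataset:", per_dataset)
    print("sites in alarm:", alarms(per_site))
```

Grouping the same records by site and by dataset gives the two views the slide reports as missing, and the threshold check stands in for an automated alarm.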

  20. Guidelines • Exercising the Model and preparing for real data: Monte Carlo production

  21. Monte-Carlo Production in Lyon • Autumn 2006: executor installed in Lyon to distribute the production jobs within the Lyon cloud • Production shift organization • Priorities set up to boost production jobs, based on the role in the certificate • Running: 291 (max: 955) • Queued: 443 (max: 1101) • Production rate: 81% (max: 100%) • Result: impressive increase in the efficiency of data production in the Lyon cloud

  22. AOD Replication: pre-testing [transfer matrix, FROM / TO] • Legend: Data Transfer tested; Data Transfer failed; Data Transfer not tested in progress

  23. Monte-Carlo Production in Lyon Cloud • 16% of LCG for 2006 • 22% for October-November

  24. Monte-Carlo Production in Lyon Cloud

  25. Monte-Carlo Production in Lyon • Still a lot of room for improvement in performance • Failure rate too high at or before job start, or due to site/middleware issues (no loss of CPU) • Failures at output: registration problems, SRM, etc.

  26. Summary • The main baselines of the ATLAS Computing Model are established, but work on improvements continues. • 2006: decisive transition to operation mode, with continuous production of high-statistics MC samples • Successful tests of data distribution, in agreement with CSC (Computer System Commissioning) • Bottlenecks and problems still lie ahead, but most are identified and solutions are being worked on. • Improvements expected in reliability, stability and monitoring. • The Lyon site is very actively progressing towards full readiness for first data at the end of 2007
