Learn about the ATLAS Computing Model for data management and performance testing. Understand the challenges faced and successes achieved in distributing data and running Monte Carlo production.
ATLAS Computing Model
Ghita Rahal, CC-IN2P3
Tutorial Atlas CC, Lyon, 5-3-2007
Guidelines
• Atlas Computing Model
• Atlas Data management
• Atlas tests
• MC Production
Atlas Tutorial
ATLAS Tier-1 Data Flow (2008)
[Diagram. Elements shown: Tier-0, Tier 1 (disk buffer, processors' farm, disk storage, mass storage), other Tier-1s, Tier-2s cloud. Data types shown: RAW, ESD, AOD. Reprocessing(s): a month later+….]
• RAW: 1.6 MB/event, 320 MB/s
• ESD: 1 MB/event, 200 MB/s
• AOD: 0.1 MB/event, 20 MB/s
• Other rates labelled on the diagram: 43 MB/s, 20 MB/s, 51+20 MB/s
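The per-event sizes and bandwidths on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the event rate is not stated on the slide but derived by dividing bandwidth by event size, and the daily volume assumes continuous running at a constant rate, which is an assumption made here.

```python
# Event sizes (MB/event) and rates (MB/s) as quoted on the slide.
FORMATS = {
    "RAW": {"size_mb": 1.6, "rate_mb_s": 320.0},
    "ESD": {"size_mb": 1.0, "rate_mb_s": 200.0},
    "AOD": {"size_mb": 0.1, "rate_mb_s": 20.0},
}


def implied_event_rate(size_mb, rate_mb_s):
    """Events per second implied by a bandwidth and an event size."""
    return rate_mb_s / size_mb


def daily_volume_tb(rate_mb_s, seconds=86_400):
    """Volume in TB (1 TB = 10**6 MB) at a constant rate.

    Continuous running over the whole day is an assumption of this sketch.
    """
    return rate_mb_s * seconds / 1e6


# Derived event rate for each format (not a number taken from the slide).
event_rates = {name: implied_event_rate(**f) for name, f in FORMATS.items()}

if __name__ == "__main__":
    for name, f in FORMATS.items():
        print(name, round(event_rates[name], 3), "events/s,",
              round(daily_volume_tb(f["rate_mb_s"]), 3), "TB/day")
```

All three formats work out to the same event rate, about 200 events per second, so the quoted sizes and bandwidths are mutually consistent.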
Data Flow: Monte Carlo Production and User’s Analysis
[Diagram. Elements shown: Tier 1 (disk buffer, processors, disk storage, mass storage), other Tier-1s, Tier-2s cloud, user’s analysis. Data types shown: MC HITS, RDO, ESD, AOD, MC AOD.]
Situation in 2006-2007
• Previous slides: how Atlas is expected to run once LHC data is flowing.
• Before LHC running, i.e. before Nov 2007:
  • Monte-Carlo is processed at Tier-1s and Tier-2s
  • All data production and distribution tests are exercised with MC
Guidelines
How to apply the ATLAS Computing Model?
Atlas Data management
ATLAS Data Management
• Atlas uses 3 grids: LCG, OSG and NorduGrid, each with its own services
  Requires an ATLAS layer over the Grid middleware
• Atlas model of computing and data distribution:
  • Storage capacity spread over the T1 sites
  • Different storage systems with different access technologies
  • Computing power distributed over all Tiers (1, 2, 3) to produce MC and process data
• The tool to distribute the data must:
  • Allow high-performance and reliable data movement
  • Include information about data location and replication
  • Support multiple grid flavours
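To make the requirements on the distribution tool concrete, here is a minimal sketch of a catalog that records where each dataset is replicated and on which grid. It is not the DDM design or its API: the class, method, dataset and site names are all hypothetical, and only the three grid names come from the slide.

```python
from dataclasses import dataclass

# Grid flavours named on the slide.
SUPPORTED_GRIDS = {"LCG", "OSG", "NorduGrid"}


@dataclass(frozen=True)
class Replica:
    site: str  # hypothetical site name
    grid: str  # one of SUPPORTED_GRIDS


class ReplicaCatalog:
    """Toy in-memory catalog: dataset name -> list of replicas (hypothetical)."""

    def __init__(self):
        self._replicas = {}

    def register(self, dataset, site, grid):
        """Record that a copy of `dataset` exists at `site` on `grid`."""
        if grid not in SUPPORTED_GRIDS:
            raise ValueError(f"unsupported grid flavour: {grid}")
        self._replicas.setdefault(dataset, []).append(Replica(site, grid))

    def locate(self, dataset, grid=None):
        """Return the sites holding `dataset`, optionally for one grid only."""
        return [r.site for r in self._replicas.get(dataset, [])
                if grid is None or r.grid == grid]


# Usage with made-up names.
catalog = ReplicaCatalog()
catalog.register("mc.example.AOD", "SITE_A", "LCG")
catalog.register("mc.example.AOD", "SITE_B", "OSG")
lcg_sites = catalog.locate("mc.example.AOD", grid="LCG")
```

Keeping the grid flavour on each replica lets a single lookup serve all three grids, which is the point of the ATLAS layer over the middleware.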
ATLAS Distribution tool: DDM
• Stephane’s talk
Guidelines
Exercising the Model and preparing for real Data:
Atlas Tests
Tests TIER-1 Lyon
• CSC (Computer System Commissioning): setup of tests and milestones.
• Performance and functional tests of data transfers T0 to T1 and T1 to T2
  • June-July 2006, September-October 2006
  • Ongoing in 2007 (March 2007…)
Goal: get a stable and efficient system of data distribution.
• New in 2007: CDR (Computing Dress Rehearsal) to exercise the full Atlas Data Model
Performance tests T0=>T1, July 2006
• Almost reached the goal for a few hours
• Problems from various sides (availability of the sites, of the services, access to the catalogs…)
Atlas Performance tests T1=>T2, July 2006
• ATLAS: continuous transfer from T1 to T2 sites, initiated by the Tier 1
Atlas Performance tests T1=>T2, July 2006
• Transfers to 7 sites, T2 and non-T2, simultaneously
• Some bandwidth limitations for simultaneous transfers
Performance tests T0=>T1, October 2006
• Overall weaker throughput due to simultaneous multi-VO tests
• Some drops understood (Castor), but most are not
Multi-VO tests
• Two-day tests involving multiple VOs
• Generate data at Tier-0 according to the transfer rate of each experiment
• Transfer to all sites
Multi-VO tests
• Transfer Alice-Atlas-CMS to the LYON Tier-1
• Reached nominal transfer rates after a few improvements…
Problems: identified or not
Many improvements during 2006, with an increase in the scale of the overall tests and many fixes in the last quarter of 2006. But stable running has not yet been achieved, for a variety of reasons:
• Persistent and transient site failures
• Frequent failures of FTS transfers: a big problem when multiple VOs run
• LFC server hanging and failures: solved
• Hardware upgrade of the VO BOX on some sites: fixed
• Memory leaks and other overflow conditions in the DDM tool when running for long periods of time: fixed
• Throughput per stream per site seems to vary heavily (and some streams are very slow): not understood
Problems but also successes
• Large file sizes always lead to much more stable running: still not totally understood
• Unstable data generation (Castor configuration…): significant downtimes and problems maintaining a constant stream for Tier-1 export
• Monitoring:
  • Missing automated alarms
  • Missing clear view of errors, per site
  • Missing overall success metrics per dataset
• Lack of manpower!
• BUT despite this list, many successes at the end of 2006, and very responsive and committed behaviour of the Lyon T1 and cloud.
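The three monitoring gaps listed above (alarms, per-site error view, per-dataset success metric) can be illustrated with a small aggregation over transfer records. This is a sketch under stated assumptions, not the monitoring actually deployed: the record layout, the field names, the error labels and the 90% alarm threshold are all invented for the example.

```python
from collections import Counter, defaultdict


def summarize(transfers):
    """Aggregate transfer records into per-site errors and per-dataset success.

    Each record is a dict with hypothetical keys "dataset", "site" and
    "error" (None when the transfer succeeded).
    """
    errors_by_site = defaultdict(Counter)
    totals = Counter()
    succeeded = Counter()
    for t in transfers:
        totals[t["dataset"]] += 1
        if t["error"] is None:
            succeeded[t["dataset"]] += 1
        else:
            errors_by_site[t["site"]][t["error"]] += 1
    success_by_dataset = {d: succeeded[d] / totals[d] for d in totals}
    return dict(errors_by_site), success_by_dataset


def alarms(success_by_dataset, threshold=0.9):
    """Datasets whose success fraction is below the (assumed) threshold."""
    return sorted(d for d, s in success_by_dataset.items() if s < threshold)


# Made-up sample records.
sample = [
    {"dataset": "ds1", "site": "SITE_A", "error": None},
    {"dataset": "ds1", "site": "SITE_B", "error": "timeout"},
    {"dataset": "ds2", "site": "SITE_A", "error": None},
    {"dataset": "ds2", "site": "SITE_B", "error": None},
]
errors, success = summarize(sample)
flagged = alarms(success)
```

With the sample records, `ds1` has one failed transfer out of two and is flagged, while `ds2` is not.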
Guidelines
Exercising the Model and preparing for real Data:
Monte Carlo production
Monte-Carlo Production in Lyon
• Autumn 2006: executor installed in Lyon to distribute the production jobs within the Lyon Cloud.
• Production shift organization
• Setup of priorities to boost production jobs, based on the role in the certificate
• Running: 291 (max: 955)
• Queued: 443 (max: 1101)
• Production rate: 81% (max: 100%)
Result: impressive increase in the efficiency of data production in the Lyon Cloud.
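The idea of boosting production jobs according to the role carried by the submitter's certificate can be sketched as a priority queue. This is an illustration only and not how the Lyon batch system was configured: the role name, the priority values and the class are assumptions, and real schedulers typically also apply fair-share rules that are ignored here.

```python
import heapq
from itertools import count

# Lower value is served first. Role name and values are hypothetical.
ROLE_PRIORITY = {"production": 0}
DEFAULT_PRIORITY = 10


class JobQueue:
    """Toy queue that serves jobs by role priority, then submission order."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps first-in, first-out per role

    def submit(self, job_id, role=None):
        priority = ROLE_PRIORITY.get(role, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (priority, next(self._order), job_id))

    def next_job(self):
        """Pop and return the id of the next job to run."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Usage with made-up job ids.
queue = JobQueue()
queue.submit("user-1")
queue.submit("prod-1", role="production")
queue.submit("user-2")
queue.submit("prod-2", role="production")
served = [queue.next_job() for _ in range(len(queue))]
```

In this toy queue the two production jobs are served before the user jobs, and jobs with the same role keep their submission order.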
AOD Replication: pre-testing
[Matrix of transfers, TO versus FROM. Legend text as extracted: Data Transfer tested, Data Transfer failed, Data Transfer not tested in progress.]
Monte-Carlo Production in Lyon Cloud
• 16% of LCG for 2006
• 22% for October-November
Monte-Carlo Production in Lyon Cloud
Monte-Carlo Production in Lyon
• Still much room for improvement in performance
• Failure rate too high at or before the start of jobs, or due to site/middleware issues (no loss of CPU)
• Failures at output: registration problems, SRM, etc.
Summary
• Main baselines of the Atlas Computing Model are established, but work on improvements continues.
• 2006: decisive transition to operation mode, with continuous production of high-statistics MC samples
• Successful tests of data distribution, in agreement with CSC (Computer System Commissioning)
• Bottlenecks and problems still lie ahead, but most are identified and work on solutions is under way.
• Improvements expected in reliability, stability and monitoring.
• The Lyon site is progressing very actively towards full readiness for first data at the end of 2007