
Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge



  1. Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge
D. Bonacorsi (on behalf of INFN-CNAF Tier-1 staff and the CMS experiment)
ACAT 2005 - X Int. Workshop on Advanced Computing & Analysis Techniques in Physics Research
May 22nd-27th, 2005 - DESY, Zeuthen, Germany

  2. Outline
• The past
  • the CMS operational environment during the Data Challenge
  • focus on INFN-CNAF Tier-1 resources and set-up
• The present
  • lessons learned from the challenge
• The future
  • … try to apply what we (think we) learned…

  3. The INFN-CNAF Tier-1
• Located at the INFN-CNAF centre in Bologna (Italy)
  • computing facility for the INFN HENP community
  • one of the main nodes of the GARR network
• Multi-experiment Tier-1
  • LHC experiments + AMS, Argo, BaBar, CDF, Magic, Virgo, …
  • evolution: dynamic sharing of resources among the involved experiments
• CNAF is a relevant Italian site from a Grid perspective
  • participating in the LCG, EGEE and INFN-GRID projects
  • support to R&D activities, developing/testing prototypes and components
  • "traditional" (non-Grid) access to resources is also granted, but it is more manpower-consuming

  4. Tier-1 resources and services
• computing power
  • CPU farms totalling ~1300 kSI2k + a few dozen servers
  • dual-processor boxes [320 @ 0.8-2.4 GHz, 350 @ 3 GHz], hyper-threading activated
• storage
  • on-line data access (disks)
    • IDE, SCSI, FC; 4 NAS systems [~60 TB], 2 SAN systems [~225 TB]
  • custodial task on MSS (tapes in the CASTOR HSM system)
    • STK L180 library - overall ~18 TB
    • STK 5500 library - 6 LTO-2 drives [~240 TB] + 2 9940B drives [~136 TB] (more to be installed)
• networking
  • Tier-1 LAN
    • rack Fast Ethernet switches with 2x Gbps uplinks to the core switch (disk servers via GE to the core)
    • upgrade foreseen: rack Gigabit switches
  • 1 Gbps Tier-1 link to the WAN (+1 Gbps dedicated to the Service Challenge)
    • will become 10 Gbps [Q3 2005]
• more:
  • infrastructure (electric power, UPS, etc.)
  • system administration, database services administration, etc.
  • support to experiment-specific activities
  • coordination with the Tier-0, other Tier-1's, and Tier-n's (n>1)

  5. The CMS Data Challenge: what and how
• Validate the CMS computing model on a sufficient number of Tier-0/1/2's → large-scale test of the computing/analysis models
• CMS Pre-Challenge Production (PCP) [Generation → Simulation → Digitization]
  • up to digitization (needed as input for the DC)
  • mainly non-Grid productions…
  • …but also Grid prototypes (CMS/LCG-0, LCG-1, Grid3)
  • ~70M Monte Carlo events produced (20M with Geant4), 750K jobs run, 3500 kSI2000 months, 80 TB of data
• CMS Data Challenge (DC04) [Reconstruction → Analysis]
  • reconstruction and analysis on CMS data sustained over 2 months at 5% of the LHC rate at full luminosity = 25% of the start-up luminosity
  • sustain a 25 Hz reconstruction rate in the Tier-0 farm
  • register data and metadata in a world-readable catalogue
  • distribute reconstructed data from the Tier-0 to the Tier-1/2's
  • analyze reconstructed data at the Tier-1/2's as they arrive
  • monitor/archive information on resources and processes
• not a CPU challenge: aimed at demonstrating the feasibility of the full chain
(A back-of-envelope check of the numbers quoted above is sketched below.)
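The following is a rough consistency check derived only from the figures quoted on this slide (70M events, 750K jobs, 80 TB, 25 Hz); it is a back-of-envelope sketch, not official CMS accounting.

```python
# Back-of-envelope numbers derived from the PCP/DC04 figures on this slide.

TB = 1e12           # bytes, decimal convention assumed
PCP_EVENTS = 70e6   # ~70M Monte Carlo events
PCP_JOBS = 750e3    # ~750K jobs
PCP_DATA = 80 * TB  # ~80 TB produced

avg_event_size_MB = PCP_DATA / PCP_EVENTS / 1e6
avg_events_per_job = PCP_EVENTS / PCP_JOBS

RECO_RATE_HZ = 25                      # target Tier-0 reconstruction rate
events_per_day = RECO_RATE_HZ * 86400  # if sustained around the clock

print(f"average PCP event size   ~ {avg_event_size_MB:.1f} MB")
print(f"average events per job   ~ {avg_events_per_job:.0f}")
print(f"events/day at 25 Hz      ~ {events_per_day / 1e6:.1f} M")
```

These give roughly 1.1 MB per event, ~90 events per job and ~2.2M reconstructed events per day at the 25 Hz target.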

  6. PCP set-up: a hybrid model
[workflow diagram] A Physics Group asks for a new dataset; the Production Manager defines assignments in RefDB; a Site Manager starts an assignment via McRunjob. Jobs run either on a local batch manager (shell scripts, job-level tracking in the BOSS DB) or on the Grid: LCG-x via JDL and the LCG (Grid) scheduler, Grid3 via DAGMan (MOP) with the Chimera VDL Virtual Data Catalogue and a planner. Dataset and job metadata live in RefDB/RLS; the components push data or pull information between each other.

  7. PCP Grid-based prototypes
• Strong INFN contribution to the crucial PCP production, in both "traditional" and Grid-based production; INFN/CMS share of the CMS production steps:
    Generation        13 %
    Simulation        14 %
    ooHitformatting   21 %
    Digitisation      18 %
• constant work of integration in CMS between the CMS software and production tools and the evolving EDG-X → LCG-Y middleware, in several phases:
  • CMS "Stress Test" with EDG < 1.4, then: PCP on the CMS/LCG-0 testbed, PCP on LCG-1, … towards DC04 with LCG-2
• CMS-LCG "virtual" Regional Center (EU-CMS: submit to the LCG scheduler)
  • 0.5 Mevts Generation ["heavy" Pythia] (~2000 jobs, ~8 hours* each, ~10 kSI2000 months)
  • ~2.1 Mevts Simulation [CMSIM+OSCAR] (~8500 jobs, ~10 hours* each, ~130 kSI2000 months), ~2 TB of data
    • CMSIM: ~1.5 Mevts on CMS/LCG-0; OSCAR: ~0.6 Mevts on LCG-1
  (* on a PIII 1 GHz)

  8. Global DC04 layout and workflow
• Hierarchy of Regional Centres and data distribution chains
• 3 distinct scenarios deployed and tested
[layout diagram] At the Tier-0: a fake on-line process, ORCA RECO jobs, RefDB, the IB and GDB buffers, the Export Buffers (EBs), the TMDB and the POOL RLS catalogue, with Castor MSS behind. T0 and T1 data distribution agents feed the Tier-1 Castor-SEs, MSS and disk-SEs; Tier-2 disk-SEs serve ORCA jobs submitted by physicists through the LCG-2 services.

  9. INFN-specific DC04 workflow
Basic issues addressed at the T1:
• data movement T0 → T1
• data custodial task: interface to the MSS
• data movement T1 → T2 for "real-time analysis"
[workflow diagram] Data flow from the disk-SE Export Buffer to the CNAF T1 Castor SE, driven by the Transfer Management DB: the TRA-Agent handles T0 → T1 transfers, the SAFE-Agent follows migration to the LTO-2 tape library, and the REP-Agent replicates data to the T1 disk-SE and to the Legnaro T2 disk-SE; the agents query/update the transfer DB and a local MySQL database. (A minimal agent-loop sketch is given below.)
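The sketch below illustrates the polling-agent pattern used by the TRA/SAFE/REP agents: query a transfer-management database for files in a given state, act on them, update their state. Table names, column names and states are invented for illustration; DC04 used MySQL, while sqlite3 is used here only to keep the example self-contained.

```python
# Minimal sketch of a DC04-style transfer agent loop (illustrative only).
import sqlite3
import time

def run_agent(db_path, from_state, to_state, action, poll_seconds=60):
    """Generic agent loop: pick files in `from_state`, apply `action`,
    then mark them as `to_state` (failed files are retried on the next pass)."""
    db = sqlite3.connect(db_path)
    while True:
        rows = db.execute(
            "SELECT guid, pfn FROM transfer_queue WHERE state = ?",
            (from_state,),
        ).fetchall()
        for guid, pfn in rows:
            try:
                action(pfn)  # e.g. copy to the Castor SE, or trigger replication
                db.execute(
                    "UPDATE transfer_queue SET state = ? WHERE guid = ?",
                    (to_state, guid),
                )
                db.commit()
            except Exception as err:
                print(f"agent: {pfn} failed ({err}), will retry")
        time.sleep(poll_seconds)

# Hypothetical wiring mirroring the TRA/SAFE/REP roles:
#   run_agent("tmdb.sqlite", "exported", "at_T1", copy_from_export_buffer)
#   run_agent("tmdb.sqlite", "at_T1", "safe", wait_for_tape_migration)
#   run_agent("tmdb.sqlite", "safe", "replicated", replicate_to_disk_SEs)
```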

  10. An example: data flow during just one day of DC04 (April 19th)
[monitoring plots] CNAF T1 Castor SE: ethernet I/O (input from the SE Export Buffer), TCP connections and RAM usage; CNAF T1 disk-SE: ethernet I/O (input from the Castor SE); Legnaro T2 disk-SE: ethernet I/O (input from the Castor SE).

  11. DC04 outcome (grand summary + focus on INFN T1)
• reconstruction/data-transfer/analysis may run at 25 Hz
• automatic registration and distribution of data, key role of the TMDB
  • it was the embryonic PhEDEx!
  • supported a (reasonable) variety of different data transfer tools and set-ups
  • Tier-1's: different performances, related to operational choices
  • SRB, the LCG Replica Manager and SRM were investigated: see the CHEP04 talk
  • INFN T1: good performance of the LCG-2 chain (PIC T1 also)
• register all data and metadata (POOL) in a world-readable catalogue
  • RLS: good as a global file catalogue, bad as a global metadata catalogue
• analyze the reconstructed data at the Tier-1's as data arrive
  • LCG components: dedicated bdII + RB; UIs, CEs + WNs at CNAF and PIC
  • real-time analysis at the Tier-2's was demonstrated to be possible
  • ~15k jobs submitted
  • the time window between reco data availability and the start of analysis jobs can be kept reasonably low (i.e. ~20 mins)
• reduce the number of files (i.e. increase <#events>/<#file>) - see the per-file overhead sketch below
  • more efficient use of bandwidth
  • reduced command overhead
• address the scalability of MSS systems (!)
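Why "increase <#events>/<#file>" helps: a rough model of effective transfer bandwidth when every file pays a fixed per-file cost (catalogue lookups, transfer session set-up, tape positioning). The link rate and overhead below are illustrative assumptions, not DC04 measurements.

```python
# Effective throughput vs. file size under a fixed per-file overhead (toy model).

def effective_rate_MBps(file_size_MB, link_MBps=40.0, per_file_overhead_s=5.0):
    """Average throughput including a fixed per-file cost (assumed values)."""
    transfer_s = file_size_MB / link_MBps
    return file_size_MB / (transfer_s + per_file_overhead_s)

for size in (1, 10, 100, 1000):  # MB
    print(f"{size:>5} MB files -> ~{effective_rate_MBps(size):5.1f} MB/s effective")
# Small files leave the link mostly idle; large files approach the nominal 40 MB/s.
```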

  12. Learn from the DC04 lessons…
• Some general considerations may apply:
  • although a DC is experiment-specific, maybe its conclusions are not
  • an "experiment-specific" problem is better addressed if conceived as a "shared" one in a shared Tier-1
  • an experiment DC just provides hints, real work gives insight → crucial role of the experiments at the Tier-1
• find weaknesses of the CASTOR MSS system in particular operating conditions
• stress-test the new LSF farm with official CMS production jobs
• test DNS-based load-balancing by serving data for production and/or analysis from CMS disk servers (a minimal client-side sketch follows this list)
• test new components, newly installed/upgraded Grid tools, etc.
• find bottlenecks and scalability problems in DB services
• give feedback on monitoring and accounting activities
• …
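A minimal illustration of DNS-based load balancing as a client sees it: a single alias resolves to several disk servers, and successive lookups/choices spread the load across them. The alias name below is hypothetical; this is a generic sketch, not the CNAF configuration.

```python
# Client-side view of a DNS round-robin balanced disk-server pool (sketch).
import random
import socket

DISKSERVER_ALIAS = "cmsdisk.example.cnaf.infn.it"  # hypothetical balanced alias

def pick_diskserver(alias=DISKSERVER_ALIAS):
    """Resolve the alias and pick one of the returned A records at random."""
    hostname, aliases, addresses = socket.gethostbyname_ex(alias)
    return random.choice(addresses)

if __name__ == "__main__":
    # Each client call may land on a different server, so production and
    # analysis reads are spread across the pool of disk servers.
    print("using disk server:", pick_diskserver())
```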

  13. T1 today: farming - what changed since DC04?
• Migration in progress:
  • OS: RH 7.3 → SLC 3.0.4
  • middleware: upgrade to LCG 2.4.0
  • installation/management of WNs/servers: lcfgng → Quattor, LCG-Quattor integration
  • batch scheduler: Torque+Maui → LSF v6.0
    • queues for production/analysis
    • management of the Grid interfacing
[batch monitoring plot: running/pending jobs, total number of jobs, max number of slots]
• Analysis
  • "controlled" and "fake" (DC04) vs. "unpredictable" and "real" (now)
  • the T1 provides one full LCG site + 2 dedicated RBs/bdII + support to CRAB users
• Interoperability: always an issue, even harder in a transition period
  • dealing with ~2-3 sub-farms in use by ~10 experiments (in production)
  • resource-use optimization: still to be achieved → see [N. De Filippis, session II, day 3]

  14. T1 today: storage - what changed since DC04? see [P.P. Ricci, session II, day 3]
• Storage issues (1/2): disks
  • driven by the requirements of LHC data processing at the Tier-1
    • i.e. simultaneous access to ~PBs of data from ~1000 nodes at high rate
  • the main focus is on robust, load-balanced, redundant solutions to guarantee efficient and stable data access to distributed users
    • namely: "make both software and data accessible from jobs running on WNs"
  • remote access (gridftp) and local access (rfiod, xrootd, GPFS) services, AFS/NFS to share the experiments' software on WNs, filesystem tests, specific problem solving in analysts' daily operations, CNAF participation in SC2/3, etc.
  • a SAN approach with a parallel filesystem on top looks promising
• Storage issues (2/2): tapes
  • CMS DC04 helped to focus on some problems:
    • LTO-2 drives not efficiently used by the experiments in production at the T1
    • performance degradation increases as file size decreases
    • hangs on locate/fskip after ~100 non-sequential reads
    • not-full tapes are labelled 'RDONLY' after only 50-100 GB written
    • CASTOR performance increases with clever pre-staging of files
    • some reliability achieved only with sequential/pre-staged reading
  • solutions?
    • from the HSM software side: fix coming with CASTOR v2 (Q2 2005)?
    • from the HSM hardware side: test 9940B drives in production (see PIC T1)
    • from the experiment side: explore possible solutions (see the file-merging sketch below)
      • e.g. file merging when coupling the PhEDEx tool to the CMS production system
      • e.g. a pure-disk buffer in front of the MSS, disentangled from CASTOR
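The "file merging" idea in its simplest possible form: pack many small output files into a few large archives before migration to tape, so the drives see large sequential objects. This is only an illustration of the principle (paths and sizes are invented); the actual CMS solution coupled merging to PhEDEx and the production system.

```python
# Group many small files into ~2 GB tar archives before tape migration (sketch).
import os
import tarfile

def merge_small_files(src_dir, dst_dir, target_size=2 * 1024**3):
    """Group files from src_dir into tar archives of roughly target_size bytes."""
    os.makedirs(dst_dir, exist_ok=True)
    batch, batch_size, archive_nb = [], 0, 0
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        batch.append(path)
        batch_size += os.path.getsize(path)
        if batch_size >= target_size:
            _write_archive(batch, dst_dir, archive_nb)
            batch, batch_size, archive_nb = [], 0, archive_nb + 1
    if batch:
        _write_archive(batch, dst_dir, archive_nb)

def _write_archive(paths, dst_dir, index):
    archive = os.path.join(dst_dir, f"merged_{index:04d}.tar")
    with tarfile.open(archive, "w") as tar:  # no compression: keep it tape-friendly
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))
```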

  15. Current CMS set-up at the Tier-1
[layout diagram] CMS activities at the Tier-1 shown as a logical grouping of resources: worker nodes managed by LSF, accessed both locally (gw/UI, local CMS production, operations control, resource management) and remotely through the Grid.it / LCG layer (CE, UI) for shared Grid production/analysis. Storage is split into a Castor disk buffer in front of the Castor MSS, an Import-Export Buffer, analysis disks, core production disks and an overflow area, each exposed through an SE; PhEDEx agents control data placement between these areas.

  16. PhEDEx in CMS
• PhEDEx (Physics Experiment Data Export) used by CMS
  • overall infrastructure for data transfer management in CMS
  • allocation and transfers of CMS physics data among Tier-0/1/2's
  • different datasets move on bidirectional routes among Regional Centres
  • data should reside on SEs (e.g. gsiftp or srm protocols)
• components:
  • TMDB (inherited from DC04): files, topology, subscriptions...
  • a coherent set of software agents, loosely coupled, inter-operating and communicating through the TMDB blackboard
    • e.g. agents for data allocation (based on site data subscriptions), file import/export, migration to MSS, routing (based on the implemented topologies), monitoring, etc. (see the subscription-allocation sketch below)
• INFN T1 involved mainly in data transfer and in production/analysis
• born, and growing fast…
  • >70 TB known to PhEDEx, >150 TB total replicated
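A toy version of subscription-based data allocation as described above: given which sites subscribe to which datasets, decide where each newly produced block should be queued for transfer. Dataset, block and site names are invented; the real PhEDEx logic (routing, priorities, retries) is far richer.

```python
# Subscription-driven allocation of new data blocks to destination nodes (sketch).

# hypothetical site subscriptions: dataset -> list of destination nodes
SUBSCRIPTIONS = {
    "dc04_ttbar_DST": ["T1_CNAF", "T2_Legnaro"],
    "dc04_jets_DST": ["T1_CNAF"],
}

def allocate(new_blocks):
    """Turn (dataset, block) pairs into per-destination transfer tasks."""
    tasks = []
    for dataset, block in new_blocks:
        for destination in SUBSCRIPTIONS.get(dataset, []):
            tasks.append({"dataset": dataset, "block": block, "to": destination})
    return tasks

if __name__ == "__main__":
    produced = [("dc04_ttbar_DST", "block_0001"), ("dc04_jets_DST", "block_0001")]
    for task in allocate(produced):
        print(task)  # these entries would be written to the TMDB for the transfer agents
```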

  17. PhEDEx transfer rates T0 → INFN T1
[monitoring plots] CNAF T1 disk-server I/O (weekly and daily views) and the transfer rate out of the CERN Tier-0.

  18. PhEDEx at INFN
• INFN-CNAF is a T1 'node' in PhEDEx
  • the CMS DC04 experience was crucial to start up PhEDEx in INFN
  • the CNAF node has been operational since the beginning
• First phase (Q3/4 2004):
  • agent code development + focus on operations: T0 → T1 transfers
  • >1 TB/day T0 → T1 demonstrated feasible
  • … but the aim is not to achieve peaks, it is to sustain such rates in normal operations
• Second phase (Q1 2005):
  • PhEDEx deployment in INFN to Tier-n, n>1:
    • "distributed" topology scenario
    • Tier-n agents run at the remote sites, not at the T1: know-how required, T1 support
    • already operational at Legnaro, Pisa, Bari, Bologna
  • an example: data flow to the T2's in daily operations (here: a test with ~2000 files, 90 GB, with no optimization): ~450 Mbps CNAF T1 → LNL T2, ~205 Mbps CNAF T1 → Pisa T2 (see the throughput arithmetic below)
• Third phase (Q>1 2005):
  • many issues, e.g. stability of the service, dynamic routing, coupling PhEDEx to the CMS official production system, PhEDEx involvement in SC3 phase II, etc.
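Rough throughput arithmetic for the T1 → T2 test quoted above (~2000 files, 90 GB, ~450 Mbps to Legnaro and ~205 Mbps to Pisa); decimal units assumed, just a sanity check of what such rates mean in practice.

```python
# What ~450/205 Mbps mean for the 90 GB / 2000 file test quoted on this slide.

volume_GB = 90
n_files = 2000

for site, rate_mbps in (("LNL T2", 450), ("Pisa T2", 205)):
    rate_MBps = rate_mbps / 8            # Mbit/s -> MB/s
    transfer_minutes = volume_GB * 1000 / rate_MBps / 60
    daily_TB = rate_MBps * 86400 / 1e6   # if sustained for 24 hours
    print(f"{site}: ~{rate_MBps:.0f} MB/s, 90 GB in ~{transfer_minutes:.0f} min,"
          f" ~{daily_TB:.1f} TB/day if sustained")

print(f"average file size ~ {volume_GB * 1000 / n_files:.0f} MB")
```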

  19. CMS Monte Carlo productions
• The CMS production system is evolving into a permanent effort
• strong contribution of INFN T1 to CMS productions
  • 252 'assignments' in PCP-DC04, for all production steps [both local and Grid]
  • plenty of assignments (simulation only) now running on LCG (Italy+Spain)
    • CNAF support for 'direct' submitters + backup SEs provided for Spain
  • currently, digitization/DST production runs efficiently locally (mostly at the T1)
    • the produced data are then injected into the CMS data distribution infrastructure
  • future of T1 productions: rounds of "scheduled" reprocessing
• DST production at INFN T1: ~12.9 Mevts assigned, ~11.8 Mevts produced

  20. Coming next: Service Challenge (SC3)
• data transfer and data serving in real use-cases
  • review the existing infrastructure/tools and give them a boost
  • details of the challenge are currently under definition
• Two phases:
  • Jul 05: SC3 "throughput" phase
    • Tier-0/1/2 simultaneous import/export, MSS involved
    • move real files, store them on real hardware
  • >Sep 05: SC3 "service" phase
    • small-scale replica of the overall system
    • modest throughput, the main focus is on testing in a fairly complete environment, with all the crucial components
    • room for experiment-specific tests and inputs
• Goals
  • test crucial components, push them to production quality, and measure
  • towards the next production service
• INFN T1 participated in SC2, and is joining SC3

  21. Conclusions
• The INFN-CNAF T1 is quite young but ramping up towards stable production-quality services
  • optimized use of resources + interfaces to the Grid
  • policy/HR to support the experiments at the Tier-1
• the Tier-1 actively participated in CMS DC04
  • good hints: identified bottlenecks in managing resources, scalability, …
• Learn the lessons: overall revision of the CMS set-up at the T1
  • involves both Grid and non-Grid access
  • first results are encouraging, success of daily operations
  • local/Grid productions + distributed analysis are running…
• Go ahead:
  • a long path…
  • next step on it: preparation for SC3, also with CMS applications

  22. Back-up slides

  23. PhEDEx transfer rates T0 → INFN T1 (back-up)
[monitoring plots] CNAF T1 disk-server I/O (weekly and daily views) and the transfer rate out of the CERN Tier-0.

  24. PhEDEx transfer rates T0 → INFN T1 (back-up)
[monitoring plots] CNAF T1 disk-server I/O (weekly and daily views) and the transfer rate out of the CERN Tier-0.

  25. CNAF "autopsy" of DC04: lethal injuries only
• Agents drain data from the SE Export Buffer down to the CNAF/PIC T1's, landing directly on a Castor SE buffer
  • it turned out that in DC04 these files were many and small
  • so: for any file on the Castor SE filesystem, a tape migration is foreseen with a given policy, regardless of file size/number → this strongly affected data transfer at the CNAF T1 (the MSS below being the STK tape library with LTO-2 tapes)
• Castor stager scalability issues
  • many small files (mostly 500 B - 50 kB) → stager db: bad performance of the stager database above 300-400k entries (may need more RAM?)
  • CNAF's fast set-up of an additional stager during DC04: basically worked
  • the REP-Agent was cloned to transparently continue replication to the disk-SEs
• LTO-2 tape library issues
  • high number of segments on tape → bad tape read/write performance, LTO-2 SCSI errors, repositioning failures, slow migration to tape with delays in the TMDB "SAFE" labelling, inefficient tape space usage
• A-posteriori solutions: consider a disk-based Import Buffer in front of the MSS… [see next slide]

  26. CNAF "autopsy" of DC04: non-lethal injuries
• minor (?) Castor/tape-library issues
  • Castor filename length (more info: Castor ticket CT196717)
  • ext3 file-system corruption on a partition of the old stager
  • tapes blocked in the library
• several crashes/hangs of the TRA-Agent (rate: ~3 times per week) → created from time to time some backlogs, which were nevertheless quickly recovered
  • post-mortem analysis in progress
• experience with the Replica Manager interface
  • e.g. files of size 0 created at the destination when trying to replicate from the Castor SE data that are temporarily not accessible because of stager (or other) problems on the Castor side
  • needs further tests to achieve reproducibility, and then Savannah reports
• Globus-MDS Information System instabilities (rate: ~once per week) → some temporary stops of data transfer (i.e. 'no SE found' means 'no replicas')
• RLS instabilities (rate: ~once per week) → some temporary stops of data transfer (cannot list replicas nor (de)register files)
• SCSI driver problems on the CNAF disk-SE (rate: just once, but it affected the fake analysis) → disks mounted but no I/O: under investigation
• constant and painful debugging…

  27. CMS DC04: number and sizes of files
• DC04 data time window: 51 (+3) days, March 11th - May 3rd
[plots] Number and sizes of transferred files; around May 1st-2nd: >3k files for >750 GB. Global CNAF network activity: ~340 Mbps (>42 MB/s) sustained for ~5 hours (the maximum was 383.8 Mbps). (The consistency of these numbers is checked below.)
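A quick consistency check of the figures on this slide: ~340 Mbps (~42 MB/s) sustained for ~5 hours should indeed move of the order of the quoted >750 GB in >3k files. Decimal units assumed.

```python
# Does 340 Mbps for 5 hours match ">3k files for >750 GB"? (back-of-envelope)

rate_MBps = 340 / 8          # ~42.5 MB/s
hours = 5
volume_GB = rate_MBps * hours * 3600 / 1000
print(f"~{volume_GB:.0f} GB moved in {hours} h at {rate_MBps:.1f} MB/s")

n_files = 3000
print(f"average file size ~ {volume_GB * 1000 / n_files:.0f} MB")
```

The result, roughly 765 GB, is indeed consistent with the >750 GB quoted on the slide.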

  28. Description of RLS usage
[workflow diagram] Components: the POOL RLS catalogue (with a CNAF RLS replica kept via ORACLE mirroring), the local POOL catalogue, the TMDB, the XML Publication Agent, the Configuration agent, the EB agents (RM/SRM/SRB), the Tier-1 Transfer agent (Replica Manager, SRB/GMCAT), the LCG Resource Broker and the ORCA analysis job. Steps: 1. register files; 2. find the Tier-1 location (based on metadata); 3. copy/delete files to/from the export buffers; 4. copy files to the Tier-1's; 5. submit the analysis job; 6. process the DST and register private data.
Specific client tools: POOL CLI, Replica Manager CLI, C++ LRC API based programs, LRC Java API tools (SRB/GMCAT), Resource Broker.

  29. Tier-0 in DC04
Architecture built on:
• Systems
  • LSF batch system
    • 3 racks, 44 nodes each, dedicated: 264 CPUs in total
    • dual P-IV Xeon 2.4 GHz, 1 GB memory, 100baseT
    • dedicated cmsdc04 batch queue, 500 RUN slots
  • Disk servers:
    • DC04-dedicated stager, with 2 pools: IB and GDB, 10 + 4 TB
  • Export Buffers
    • EB-SRM (4 servers, 4.2 TB total)
    • EB-SRB (4 servers, 4.2 TB total)
    • EB-SE (3 servers, 3.1 TB total)
• Databases
  • RLS (Replica Location Service)
  • TMDB (Transfer Management DB)
• Transfer steering
  • agents steering data transfers, on a dedicated node (close monitoring…)
• Monitoring services
[diagram: Tier-0 data distribution agents, EB, GDB, IB, ORCA RECO jobs, RefDB, TMDB, fake on-line process, POOL RLS catalogue, Castor]

  30. CMS production tools
• CMS production tools (OCTOPUS)
  • RefDB
    • contains the production requests, with all the parameters needed to produce the dataset and the details about the production process
  • McRunJob
    • evolution of IMPALA: more modular (plug-in approach)
    • tool/framework for job preparation and job submission
  • BOSS
    • real-time job-dependent parameter tracking: the running job's standard output/error are intercepted and the filtered information is stored in the BOSS database; the remote updator is based on MySQL, but a remote updator based on R-GMA is being developed (a minimal filtering sketch is given below)
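A minimal sketch of the output-interception idea behind BOSS: run the job, scan its standard output line by line with user-defined regular expressions, and store the extracted parameters in a database. Everything here (the regexes, the table layout, the use of sqlite3 instead of MySQL) is an illustrative assumption, not the actual BOSS implementation.

```python
# BOSS-like real-time filtering of a job's stdout into a database (sketch).
import re
import sqlite3
import subprocess

FILTERS = {  # hypothetical parameters to track: name -> regex with one group
    "events_processed": re.compile(r"processed event\s+(\d+)"),
    "run_number": re.compile(r"run number\s*=\s*(\d+)"),
}

def run_and_track(cmd, job_id, db_path="boss_like.sqlite"):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS job_params (job_id, name, value)")
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:              # intercept the job's standard output
        for name, regex in FILTERS.items():
            match = regex.search(line)
            if match:
                db.execute("INSERT INTO job_params VALUES (?, ?, ?)",
                           (job_id, name, match.group(1)))
                db.commit()
    return proc.wait()

# Example (hypothetical job script): run_and_track(["./orca_reco.sh"], "job-0042")
```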
