Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge. D. Bonacorsi (on behalf of INFN-CNAF Tier-1 staff and the CMS experiment). ACAT 2005, X Int. Workshop on Advanced Computing & Analysis Techniques in Physics Research, May 22nd-27th, 2005, DESY, Zeuthen, Germany
Outline • The past • CMS operational environment during the Data Challenge • focus on INFN-CNAF Tier-1 resources and set-up • The present • lessons learned from the challenge • The future • … try to apply what we (think we) learned…
The INFN-CNAF Tier-1 • Located at the INFN-CNAF centre in Bologna (Italy) • computing facility for the INFN HENP community • one of the main nodes of the GARR network • Multi-experiment Tier-1 • LHC experiments + AMS, Argo, BaBar, CDF, Magic, Virgo, … • evolution: dynamic sharing of resources among the involved experiments • CNAF is a relevant Italian site from a Grid perspective • participating in the LCG, EGEE, INFN-GRID projects • support to R&D activities, development/testing of prototypes/components • “traditional” (non-Grid) access to resources is also granted, but is more manpower-consuming
Tier-1 resources and services • computing power • CPU farms for ~1300 kSI2k + a few dozen servers • dual-processor boxes [320 @ 0.8-2.4 GHz, 350 @ 3 GHz], hyper-threading activated • storage • on-line data access (disks) • IDE, SCSI, FC; 4 NAS systems [~60 TB], 2 SAN systems [~225 TB] • custodial storage on MSS (tapes in the Castor HSM system) • Stk L180 library - overall ~18 TB • Stk 5500 library - 6 LTO-2 drives [~240 TB] + 2 9940B drives [~136 TB] (more to be installed) • networking • T1 LAN • rack FE switches with 2×Gbps uplinks to the core switch (disk servers via GE to the core) • upgrade to rack Gb switches foreseen • 1 Gbps T1 link to the WAN (+1 Gbps dedicated to the Service Challenge) • will be 10 Gbps [Q3 2005] • More: • infrastructure (electric power, UPS, etc.) • system administration, database services administration, etc. • support to experiment-specific activities • coordination with the Tier-0, other Tier-1's, and Tier-n's (n>1)
The CMS Data Challenge: what and how • Validate the CMS computing model on a sufficient number of Tier-0/1/2's → large-scale test of the computing/analysis models • CMS Pre-Challenge Production (PCP): Generation → Simulation → Digitization • up to digitization (needed as input for the DC) • mainly non-Grid productions… • …but also Grid prototypes (CMS/LCG-0, LCG-1, Grid3) • ~70M Monte Carlo events (20M with Geant4) produced, 750K jobs run, 3500 KSI2000 months, 80 TB of data • CMS Data Challenge (DC04): Reconstruction → Analysis • reconstruction and analysis on CMS data sustained over 2 months at 5% of the LHC rate at full luminosity (25% of the start-up luminosity) • sustain a 25 Hz reconstruction rate in the Tier-0 farm • register data and metadata to a world-readable catalogue • distribute reconstructed data from the Tier-0 to the Tier-1/2's • analyze reconstructed data at the Tier-1/2's as they arrive • monitor/archive information on resources and processes • not a CPU challenge: aimed at demonstrating the feasibility of the full chain
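A quick order-of-magnitude check (illustrative arithmetic, not a figure from the slides), assuming the 25 Hz rate is sustained continuously over the two months:

```python
# Order-of-magnitude check of the DC04 target (illustrative only).
reco_rate_hz = 25              # target reconstruction rate in the Tier-0 farm
days = 60                      # "sustained over 2 months"
events = reco_rate_hz * 86400 * days
print(f"~{events / 1e6:.0f}M events at 25 Hz over {days} days")
# The corresponding data volume to distribute depends on the per-event
# reconstructed-data size, which is not quoted on this slide.
```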
PCP set-up: a hybrid model [workflow diagram] A Physics Group asks for a new dataset; the Production Manager defines assignments in RefDB; a Site Manager starts an assignment with McRunjob (data-level queries to RefDB), which prepares the jobs as shell scripts for either a Local Batch Manager (job-level queries and job metadata in the BOSS DB) or the Grid: LCG-x via the Grid (LCG) Scheduler (JDL), or Grid3 via DAGMan/MOP (DAGs), with the Chimera VDL Virtual Data Catalogue and a Planner; dataset metadata are registered in RLS.
PCP grid-based prototypes • Strong INFN contribution to the crucial PCP production, both “traditional” and Grid-based • INFN/CMS share per CMS production step: Generation 13%, Simulation 14%, ooHitformatting 21%, Digitisation 18% • constant integration work in CMS between the CMS software/production tools and the evolving EDG-X / LCG-Y middleware, in several phases: CMS “Stress Test” with EDG <1.4, then PCP on the CMS/LCG-0 testbed, PCP on LCG-1, … towards DC04 with LCG-2 • CMS-LCG “virtual” Regional Center (EU-CMS: submit to the LCG scheduler): 0.5 Mevts Generation [“heavy” pythia] (~2000 jobs of ~8 hours* each, ~10 KSI2000 months); ~2.1 Mevts Simulation [CMSIM+OSCAR] (~8500 jobs of ~10 hours* each, ~130 KSI2000 months); ~2 TB data; CMSIM: ~1.5 Mevts on CMS/LCG-0, OSCAR: ~0.6 Mevts on LCG-1 (* on a PIII 1 GHz)
Global DC04 layout and workflow [diagram]: hierarchy of Regional Centers and data distribution chains; 3 distinct scenarios deployed and tested. Tier-0: fake on-line process, RefDB, ORCA RECO jobs, IB and GDB buffers, TMDB, Castor, T0 data distribution agents feeding the Export Buffers (EBs). Tier-1's: Castor-SE and disk-SE in front of Castor MSS, T1 data distribution agents, ORCA jobs. Tier-2's: disk-SE and ORCA jobs run by physicists. File and metadata registration in the POOL RLS catalogue; LCG-2 services used throughout.
INFN-specific DC04 workflow [diagram: SE-EB Export Buffer, Transfer Management DB, CNAF T1 and Legnaro T2 storage elements] • Basic issues addressed at the T1 (agent pattern sketched below): • data movement T0→T1: the TRA-Agent drains data from the Export Buffer disk-SE to the T1 Castor SE, with local MySQL bookkeeping • data custodial task, i.e. the interface to MSS: the SAFE-Agent tracks migration to the LTO-2 tape library and updates the TMDB • data movement T1→T2 for “real-time analysis”: the REP-Agent replicates data to the T1 disk-SE and to the Legnaro T2 disk-SE
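A minimal sketch of the agent pattern used in this chain (not the actual DC04 code): an agent periodically polls a bookkeeping database for files in a given state, acts on them, and advances their state. Table and column names and the `is_migrated_to_tape()` check are hypothetical; DC04 used a MySQL TMDB rather than the sqlite3 stand-in used here to keep the example self-contained.

```python
import sqlite3
import time

def is_migrated_to_tape(pfn):
    """Placeholder for the real check against the Castor name server/stager."""
    return True  # assume migrated, for the sake of the sketch

def safe_agent(db_path="tmdb_standin.db", poll_seconds=60):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS files (pfn TEXT PRIMARY KEY, state TEXT)")
    while True:
        # Files already copied to the Castor SE but not yet declared safe
        rows = db.execute("SELECT pfn FROM files WHERE state = 'at_castor_se'").fetchall()
        for (pfn,) in rows:
            if is_migrated_to_tape(pfn):
                db.execute("UPDATE files SET state = 'SAFE' WHERE pfn = ?", (pfn,))
        db.commit()
        time.sleep(poll_seconds)

if __name__ == "__main__":
    safe_agent()
```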
An example: data flow during just one day of DC04 (April 19th) [monitoring plots]: CNAF T1 Castor SE ethernet I/O (input from the SE-EB); CNAF T1 Castor SE TCP connections and RAM memory; CNAF T1 disk-SE ethernet I/O (input from the Castor SE); Legnaro T2 disk-SE ethernet I/O (input from the Castor SE).
DC04 outcome (grand summary + focus on the INFN T1) • reconstruction/data-transfer/analysis can run at 25 Hz • automatic registration and distribution of data, key role of the TMDB • it was the embryonic PhEDEx! • support for a (reasonable) variety of different data transfer tools and set-ups • Tier-1's showed different performance, related to operational choices • SRB, the LCG Replica Manager and SRM were investigated: see the CHEP04 talk • INFN T1: good performance of the LCG-2 chain (PIC T1 also) • all data and metadata (POOL) registered to a world-readable catalogue • RLS: good as a global file catalogue, bad as a global metadata catalogue • analysis of the reconstructed data at the Tier-1's as data arrive • LCG components: dedicated bdII+RB; UIs, CEs+WNs at CNAF and PIC • real-time analysis at the Tier-2's was demonstrated to be possible • ~15k jobs submitted • the time window between reco data availability and the start of analysis jobs can be kept reasonably low (i.e. ~20 mins) • reduce the number of files, i.e. increase <#events>/<#files> (see the sketch below) • more efficient use of bandwidth • reduce command overhead • address the scalability of MSS systems (!)
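A toy model (not from the slides) of why increasing <#events>/<#files> helps: with a fixed per-file command/latency overhead, the effective bandwidth drops sharply as files get smaller. All numbers below are illustrative assumptions.

```python
# Illustrative only: effective throughput vs. file size with per-file overhead.
LINK_MBPS = 1000.0          # assumed 1 Gbps link
PER_FILE_OVERHEAD_S = 5.0   # assumed per-file cost (catalogue ops, transfer setup, ...)

def effective_mbps(file_size_mb, dataset_gb=100.0):
    n_files = dataset_gb * 1024 / file_size_mb
    transfer_s = dataset_gb * 1024 * 8 / LINK_MBPS
    total_s = transfer_s + n_files * PER_FILE_OVERHEAD_S
    return dataset_gb * 1024 * 8 / total_s

for size_mb in (1, 10, 100, 1000):
    print(f"{size_mb:5d} MB/file -> ~{effective_mbps(size_mb):6.0f} Mbps effective")
```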
Learn from the DC04 lessons… • Some general considerations may apply: • although a DC is experiment-specific, maybe its conclusions are not • an “experiment-specific” problem is better addressed if conceived as a “shared” one in a shared Tier-1 • an experiment DC just provides hints, real work gives insight → crucial role of the experiments at the Tier-1: • find weaknesses of the CASTOR MSS system in particular operating conditions • stress-test the new LSF farm with official CMS production jobs • test DNS-based load-balancing by serving data for production and/or analysis from CMS disk-servers (see the sketch below) • test new components, newly installed/upgraded Grid tools, etc… • find bottlenecks and scalability problems in DB services • give feedback on monitoring and accounting activities • …
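DNS-based load balancing here means publishing one alias with multiple A records, one per disk server, so clients spread across them. A small check of how such an alias resolves (the hostname below is purely hypothetical):

```python
import socket
from collections import Counter

ALIAS = "cms-diskserv.example.cnaf.infn.it"  # hypothetical load-balanced alias

def resolve_alias(alias, trials=20):
    """Resolve the alias repeatedly and count which address comes back first."""
    seen = Counter()
    for _ in range(trials):
        try:
            _, _, addresses = socket.gethostbyname_ex(alias)
        except OSError:
            print("alias does not resolve (expected: the name above is made up)")
            return
        if addresses:
            seen[addresses[0]] += 1   # round-robin rotates the record order
    for addr, count in seen.items():
        print(f"{addr}: first answer {count}/{trials} times")

if __name__ == "__main__":
    resolve_alias(ALIAS)
```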
T1 today: farming. What changed since DC04? • Migration in progress: • OS: RH 7.3 → SLC 3.0.4 • middleware: upgrade to LCG 2.4.0 • WN/server installation and management: lcfgng → Quattor, plus LCG-Quattor integration • batch scheduler: Torque+Maui → LSF v6.0, with separate queues for production/analysis (see the sketch below) and managed Grid interfacing [farm monitoring plot: running/pending jobs vs. total nb. of jobs and max nb. of slots] • Analysis • “controlled” and “fake” (DC04) vs. “unpredictable” and “real” (now) • the T1 provides one full LCG site + 2 dedicated RBs/bdII + support to CRAB users • Interoperability: always an issue, even harder in a transition period • dealing with ~2-3 sub-farms in use by ~10 experiments (in production) • resource use optimization: still to be achieved; see [N. De Filippis, session II, day 3]
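With LSF and separate production/analysis queues, local submission boils down to choosing the right queue at `bsub` time; a minimal sketch (the queue names are hypothetical, not the actual CNAF configuration):

```python
import subprocess

# Hypothetical queue names; the real CNAF LSF queues may differ.
QUEUES = {"production": "cms_prod", "analysis": "cms_anal"}

def submit(kind, command, job_name):
    """Submit a job to the LSF queue associated with this kind of work."""
    cmd = ["bsub", "-q", QUEUES[kind], "-J", job_name, command]
    subprocess.run(cmd, check=True)

# Example usage (requires a working LSF installation):
# submit("production", "./run_digi.sh", "digi_0042")
# submit("analysis", "./run_orca_analysis.sh", "anal_user01")
```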
T1 today: storage. What changed since DC04? See [P.P. Ricci, session II, day 3] • Storage issues (1/2): disks • driven by the requirements of LHC data processing at the Tier-1, i.e. simultaneous access to ~PBs of data from ~1000 nodes at high rate • the main focus is on robust, load-balanced, redundant solutions to grant proficient and stable data access to distributed users • namely: “make both sw and data accessible from jobs running on WNs” • remote access (gridftp) and local access (rfiod, xrootd, GPFS) services, afs/nfs to share the experiments' sw on the WNs, filesystem tests, specific problem solving in the analysts' daily operations, CNAF participation in SC2/3, etc. • a SAN approach with a parallel filesystem on top looks promising • Storage issues (2/2): tapes • CMS DC04 helped to focus on some problems: • LTO-2 drives not used efficiently by the experiments in production at the T1 • performance degradation increases as file size decreases • hangs on locate/fskip after ~100 non-sequential reads • not-full tapes are labelled ‘RDONLY’ after only 50-100 GB written • CASTOR performance improves with clever pre-staging of files (see the sketch below) • some reliability achieved only on sequential/pre-staged reading • solutions? • from the HSM sw side: fix coming with CASTOR v2 (Q2 2005)? • from the HSM hw side: test 9940B drives in production (see PIC T1) • from the experiment side: explore possible solutions • e.g. file merging when coupling the PhEDEx tool to the CMS production system • e.g. a pure-disk buffer in front of the MSS, disentangled from CASTOR
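A minimal sketch of the “clever pre-staging” idea: request the files of the next processing batch from tape ahead of time, and only dispatch jobs once their inputs are on disk. The `stage_request()` and `is_on_disk()` helpers are hypothetical placeholders for the actual CASTOR stager commands.

```python
import time

def stage_request(files):
    """Hypothetical wrapper around the CASTOR stager: ask for files to be
    recalled from tape to the disk pool (non-blocking)."""
    for f in files:
        print(f"pre-stage requested: {f}")

def is_on_disk(f):
    """Hypothetical check of the stager status for a single file."""
    return True

def process_in_batches(all_files, batch_size=50, poll_seconds=120):
    """Pre-stage the next batch while the current one is being processed,
    so tape reads stay sequential instead of random per-job recalls."""
    batches = [all_files[i:i + batch_size] for i in range(0, len(all_files), batch_size)]
    if batches:
        stage_request(batches[0])
    for i, batch in enumerate(batches):
        while not all(is_on_disk(f) for f in batch):
            time.sleep(poll_seconds)
        if i + 1 < len(batches):
            stage_request(batches[i + 1])     # overlap staging with processing
        for f in batch:
            pass  # run the reconstruction/analysis job on f here
```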
Current CMS set-up at the Tier-1 [diagram]. CMS activities at the Tier-1, logically grouped: local activities (CMS local production, resource management, operations control) access the resources via a gw/UI, while shared Grid production/analysis comes in remotely through the Grid.it / LCG layer (UI, CE, LSF-managed WNs). Storage: Castor MSS behind a Castor disk buffer SE, an Import-Export Buffer SE run by the PhEDEx agents, an analysis-disk SE, a core-production-disk SE, plus an overflow area.
PhEDEx in CMS • PhEDEx (Physics Experiment Data Export) is used by CMS as the overall infrastructure for data transfer management • allocation and transfer of CMS physics data among the Tier-0/1/2's • different datasets move on bidirectional routes among the Regional Centers • data should reside on SEs (e.g. gsiftp or srm protocols) • components: • the TMDB from DC04: files, topology, subscriptions… • a coherent set of sw agents, loosely coupled, inter-operating and communicating through the TMDB blackboard (see the sketch below) • e.g. agents for data allocation (based on site data subscriptions), file import/export, migration to MSS, routing (based on the implemented topologies), monitoring, etc… • INFN T1 involvement: mainly data transfer, plus production/analysis • born, and growing fast… >70 TB known to PhEDEx, >150 TB total replicated
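A minimal sketch of the blackboard idea behind the PhEDEx agents: allocation decisions are derived from site subscriptions to datasets, and each loosely coupled agent only reads and writes rows in the shared TMDB rather than talking to the other agents directly. The dataset names, sites and routing table below are purely illustrative.

```python
# Illustrative sketch of subscription-based file allocation (not PhEDEx code).

# Hypothetical topology: which site pulls from which upstream node.
UPSTREAM = {"CNAF_T1": "CERN_T0", "LNL_T2": "CNAF_T1", "Pisa_T2": "CNAF_T1"}

# Hypothetical site subscriptions to datasets.
SUBSCRIPTIONS = {
    "CNAF_T1": {"DST_ttbar", "DST_bbbar"},
    "LNL_T2":  {"DST_ttbar"},
}

def route(site):
    """Chain of hops from the Tier-0 down to the destination site."""
    hops = [site]
    while hops[-1] in UPSTREAM:
        hops.append(UPSTREAM[hops[-1]])
    return list(reversed(hops))

def allocate(new_file, dataset):
    """Decide, from subscriptions alone, where a newly produced file must go."""
    tasks = []
    for site, datasets in SUBSCRIPTIONS.items():
        if dataset in datasets:
            tasks.append((new_file, site, route(site)))
    return tasks   # an allocation agent would insert these as TMDB rows

for f, site, hops in allocate("dst_0001.root", "DST_ttbar"):
    print(f"{f} -> {site} via {' -> '.join(hops)}")
```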
PhEDEx transfer rates T0→INFN T1 [plots: CNAF T1 diskserver I/O (weekly and daily); rate out of the CERN Tier-0]
PhEDEx at INFN • INFN-CNAF is a T1 ‘node’ in PhEDEx • the CMS DC04 experience was crucial to start up PhEDEx in INFN • the CNAF node has been operational since the beginning • First phase (Q3/4 2004): • agent code development + focus on operations: T0→T1 transfers • >1 TB/day T0→T1 demonstrated feasible • … but the aim is not to achieve peaks, but to sustain them in normal operations • Second phase (Q1 2005): • PhEDEx deployment in INFN to Tier-n's, n>1: • “distributed” topology scenario • Tier-n agents run at the remote sites, not at the T1: know-how required, T1 support • already operational at Legnaro, Pisa, Bari, Bologna • An example: data flow to the T2's in daily operations (here: a test with ~2000 files, 90 GB, with no optimization, sanity-checked below): CNAF T1 → LNL T2 ~450 Mbps; CNAF T1 → Pisa T2 ~205 Mbps • Third phase (Q>1 2005): • many issues, e.g. stability of the service, dynamic routing, coupling PhEDEx to the CMS official production system, PhEDEx involvement in SC3 phase II, etc…
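A quick sanity check (illustrative arithmetic only) of the T1→T2 test quoted above: ~90 GB in ~2000 files at the observed rates.

```python
# Sanity check of the quoted T1->T2 test: ~2000 files, ~90 GB, no optimization.
dataset_gb = 90.0
n_files = 2000

for link, mbps in (("CNAF T1 -> LNL T2", 450.0), ("CNAF T1 -> Pisa T2", 205.0)):
    minutes = dataset_gb * 1024 * 8 / mbps / 60
    avg_file_mb = dataset_gb * 1024 / n_files
    print(f"{link}: ~{minutes:.0f} min at {mbps:.0f} Mbps "
          f"(avg file size ~{avg_file_mb:.0f} MB)")
```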
CMS MonteCarlo productions • the CMS production system is evolving into a permanent effort • strong contribution of the INFN T1 to CMS productions • 252 ‘assignments’ in PCP-DC04, for all production steps [both local and Grid] • plenty of assignments (simulation only) now running on LCG (Italy+Spain) • CNAF support for ‘direct’ submitters + backup SEs provided for Spain • currently, digitization/DST production runs efficiently locally (mostly at the T1) • the produced data are then injected into the CMS data distribution infrastructure • future of T1 productions: rounds of “scheduled” reprocessing • DST production at the INFN T1: ~12.9 Mevts assigned, ~11.8 Mevts produced
coming next: the Service Challenge (SC3) • data transfer and data serving in real use-cases • review the existing infrastructure/tools and give them a boost • details of the challenge are currently under definition • Two phases: • Jul 2005: SC3 “throughput” phase • Tier-0/1/2 simultaneous import/export, MSS involved • move real files, store them on real hw • from Sep 2005: SC3 “service” phase • small-scale replica of the overall system • modest throughput; the main focus is on testing in a fairly complete environment, with all the crucial components • room for experiment-specific tests and inputs • Goals • test the crucial components, push them to production quality, and measure • towards the next production service • the INFN T1 participated in SC2, and is joining SC3
Conclusions • The INFN-CNAF T1 is quite young but ramping up towards stable, production-quality services • optimized use of resources + interfaces to the Grid • policy/HR to support the experiments at the Tier-1 • the Tier-1 actively participated in CMS DC04 • good hints: identified bottlenecks in managing resources, scalability, … • Learn the lessons: overall revision of the CMS set-up at the T1 • involves both Grid and non-Grid access • first results are encouraging, success of daily operations • local/Grid productions + distributed analysis are running… • Go ahead: • a long path… • next step on it: preparation for SC3, also with CMS applications
CNAF “autopsy” of DC04: lethal injuries only • In DC04, agents drained data from the SE-EB down to the CNAF/PIC T1's, landing directly on a Castor SE buffer; these files turned out to be many and small (mostly 500 B - 50 kB) • for any file on the Castor SE filesystem, a tape migration is foreseen with a given policy, regardless of size/number → this strongly affected data transfer at the CNAF T1 (the MSS below being the STK tape library with LTO-2 tapes) • Castor stager scalability issues: many small files → poor performance of the stager db above ~300-400k entries (may need more RAM?) • CNAF's fast set-up of an additional stager during DC04 basically worked; the REP-Agent was cloned to transparently continue replication to the disk-SEs • LTO-2 tape library issues: high nb. of segments on tape → bad tape read/write performance, LTO-2 SCSI errors, repositioning failures, slow migration to tape and delays in the TMDB “SAFE” labelling, inefficient tape space usage • A-posteriori solution: consider a disk-based Import Buffer in front of the MSS (a sketch of such a buffer policy follows)… [see next slide]
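A minimal sketch of the “disk buffer in front of the MSS” idea: small files are accumulated (or merged) on disk and only migrated to tape once a size threshold is reached, so the tape drives see few large sequential writes instead of many tiny ones. The threshold and helper names are illustrative assumptions, not the actual CNAF/CASTOR setup.

```python
# Illustrative buffer-before-tape policy (not the actual CNAF/CASTOR setup).
MIGRATION_THRESHOLD_GB = 1.0   # assumed: flush to tape only above this volume

class ImportBuffer:
    def __init__(self):
        self.pending = []          # (filename, size_gb) waiting on disk

    def add(self, filename, size_gb):
        self.pending.append((filename, size_gb))
        if sum(s for _, s in self.pending) >= MIGRATION_THRESHOLD_GB:
            self.flush_to_tape()

    def flush_to_tape(self):
        # In a real setup this would merge the small files (or write them
        # back-to-back) and hand one large stream to the tape migrator.
        total = sum(s for _, s in self.pending)
        print(f"migrating {len(self.pending)} files, {total:.1f} GB, in one pass")
        self.pending = []

buf = ImportBuffer()
for i in range(45000):
    buf.add(f"evtcoll_{i:05d}.root", 50e3 / 1e9)   # many ~50 kB files, as in DC04
```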
CNAF “autopsy” of DC04: non-lethal injuries • minor (?) Castor/tape-library issues • Castor filename length (more info: Castor ticket CT196717) • ext3 file-system corruption on a partition of the old stager • tapes blocked in the library • several crashes/hangs of the TRA-Agent (rate: ~3 times per week) → created occasional backlogs, nevertheless fast to recover; post-mortem analysis in progress • experience with the Replica Manager interface: e.g. files of size 0 created at the destination when trying to replicate from the Castor SE data which are temporarily not accessible due to stager (or other) problems on the Castor side; further tests needed to achieve reproducibility, then Savannah reports • Globus-MDS Information System instabilities (rate: ~once per week) → some temporary stops of data transfer (i.e. ‘no SE found’ means ‘no replicas’) • RLS instabilities (rate: ~once per week) → some temporary stops of data transfer (could neither list replicas nor (de)register files) • SCSI driver problems on a CNAF disk-SE (rate: just once, but it affected the fake analysis) → disks mounted but no I/O: under investigation • constant and painful debugging…
CMS DC04: number and sizes of files [plots]. DC04 data time window: 51 (+3) days, March 11th - May 3rd. May 1st - May 2nd: >3k files for >750 GB; global CNAF network activity ~340 Mbps (>42 MB/s) sustained for ~5 hours (max was 383.8 Mbps).
Description of RLS usage [diagram]. The POOL RLS catalogue at CERN (with a CNAF RLS replica kept via ORACLE mirroring) is used as follows: 1. the XML Publication Agent registers the files; 2. the Configuration agent finds the Tier-1 location (based on metadata); 3. the EB agents copy/delete files to/from the export buffers (RM/SRM/SRB); 4. the Tier-1 Transfer agents copy the files to the Tier-1's (Replica Manager, SRB/GMCAT; local POOL catalogue and TMDB involved); 5. analysis jobs are submitted via the LCG Resource Broker; 6. the ORCA Analysis Job processes the DST and registers private data. Specific client tools: POOL CLI, Replica Manager CLI, C++ LRC API based programs, LRC java API tools (SRB/GMCAT), Resource Broker.
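The same six steps as a compact sketch; every function below is a hypothetical wrapper (no claim is made that these match the real POOL/LRC/Replica Manager client calls), only the order and the catalogue-centric structure follow the diagram.

```python
# Hypothetical wrappers around the DC04 catalogue/transfer clients,
# illustrating the order of operations only.

def register_file(rls, guid, pfn, metadata):
    """Step 1: the XML Publication Agent registers a new file (+metadata) in RLS."""
    rls.setdefault(guid, {"pfns": [], "meta": metadata})["pfns"].append(pfn)

def find_destination_tier1(rls, guid, assignment_map):
    """Step 2: the Configuration agent picks a Tier-1 from the file metadata."""
    return assignment_map[rls[guid]["meta"]["dataset"]]

def copy_to_export_buffer(rls, guid, eb_pfn):
    """Step 3: EB agents copy the file to an export buffer and register the replica."""
    rls[guid]["pfns"].append(eb_pfn)

def copy_to_tier1(rls, guid, t1_pfn):
    """Step 4: the Tier-1 transfer agent replicates the file and registers it."""
    rls[guid]["pfns"].append(t1_pfn)

# Steps 5-6 (job submission via the Resource Broker, DST processing by ORCA)
# would consult the same catalogue to locate input replicas near the chosen site.

rls = {}
register_file(rls, "guid-0001", "castor:/cern/dst_0001.root", {"dataset": "DST_ttbar"})
dest = find_destination_tier1(rls, "guid-0001", {"DST_ttbar": "CNAF_T1"})
copy_to_export_buffer(rls, "guid-0001", "srm://eb.cern.ch/dst_0001.root")
copy_to_tier1(rls, "guid-0001", f"castor:/{dest.lower()}/dst_0001.root")
print(dest, rls["guid-0001"]["pfns"])
```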
Tier-0 in DC04 [diagram: fake on-line process, RefDB, ORCA RECO jobs, IB and GDB pools, TMDB, Castor, POOL RLS catalogue, T0 data distribution agents, EB]. Architecture built on: • Systems • LSF batch system • 3 racks, 44 nodes each, dedicated: 264 CPUs in total • dual P-IV Xeon 2.4 GHz, 1 GB memory, 100baseT • dedicated cmsdc04 batch queue, 500 RUN slots • Disk servers • DC04-dedicated stager, with 2 pools • 2 pools: IB and GDB, 10 + 4 TB • Export Buffers • EB-SRM (4 servers, 4.2 TB total) • EB-SRB (4 servers, 4.2 TB total) • EB-SE (3 servers, 3.1 TB total) • Databases • RLS (Replica Location Service) • TMDB (Transfer Management DB) • Transfer steering • agents steering the data transfers run on a dedicated node (close monitoring..) • Monitoring services
CMS Production tools • CMS production tools (OCTOPUS) • RefDB • contains the production requests, with all the parameters needed to produce the dataset and the details about the production process • McRunjob • evolution of IMPALA: more modular (plug-in approach) • tool/framework for job preparation and job submission • BOSS • real-time job-dependent parameter tracking: the running job's standard output/error are intercepted, and the filtered information is stored in the BOSS database (see the sketch below) • the remote updator is based on MySQL, but a remote updator based on R-GMA is being developed
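A minimal sketch of the BOSS idea: run the real job as a child process, filter its stdout/stderr in real time with user-defined patterns, and store the extracted parameters in a database. The regex, table layout and sqlite3 stand-in (BOSS actually uses MySQL) are assumptions made to keep the example self-contained.

```python
import re
import sqlite3
import subprocess
import sys

# Hypothetical filter: extract "processed N events" progress lines from the job output.
PROGRESS = re.compile(r"processed\s+(\d+)\s+events")

def run_and_track(job_id, command, db_path="boss_standin.db"):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS job_info (job_id TEXT, events INTEGER)")
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:                    # intercept the job's output in real time
        match = PROGRESS.search(line)
        if match:
            db.execute("INSERT INTO job_info VALUES (?, ?)",
                       (job_id, int(match.group(1))))
            db.commit()
    return proc.wait()

# Example: a fake job that prints progress lines.
if __name__ == "__main__":
    fake_job = [sys.executable, "-c",
                "print('processed 100 events'); print('processed 200 events')"]
    run_and_track("job_0001", fake_job)
```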