CCRC’08 Review from a DM perspective Alberto Pace (With slides from T.Bell, F.Donno, D.Duelmann, M.Kasemann, J.Shiers, …)
Before the main topic
• Safety reminder
• The computer center has different safety requirements than normal offices
  • This is why authorization is needed to enter!
  • This is why there are safety courses!
• Noise above the level acceptable for long-term work
• Wind above the level acceptable for long-term work
• False floor, 1 meter deep!
• No differential power switch!!
• In case of accident, call the fire brigade
CCRC’08
• Wiki site: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCommonComputingReadinessChallenges
• Ongoing challenge with all 4 experiments
DB service – some observations
• In general: DB load still dominated by activities that did not scale up significantly during CCRC
  • Load changes caused by CCRC on monitoring, work-flow and production systems were smaller than, e.g., fluctuations between software releases
  • The major contribution scaling with reconstruction jobs is not yet visible at CERN and Tier 1 sites
• Exception: ATLAS reprocessing at BNL, TRIUMF and NDGF
  • Increased dCache load on calibration files (POOL) introduced a bottleneck
  • Consequence: extremely long (idle) database connections on the conditions database
  • CORAL failover between T1 sites worked (see the sketch below)
  • DB session limits increased, session sniping added, dCache pool for calibration files added
• DB service ran smoothly and without major disruptions
  • As usual, several node reboots; minor impact thanks to the cluster architecture
  • A 2h Streams intervention (downstream capture) was scheduled during CCRC in agreement with the experiments and service coordination
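The CORAL failover mentioned above is, at its core, a "try the next replica" loop. Below is a minimal, self-contained sketch of that pattern; the replica descriptors and the connect() stub are hypothetical and do not reflect the actual CORAL interface or any real CERN/Tier-1 connection strings.

```python
# Sketch of a replica-failover pattern, similar in spirit to CORAL failover
# between Tier-1 conditions database replicas. Names below are illustrative.

from typing import Callable, Sequence

def connect_with_failover(replicas: Sequence[str],
                          connect: Callable[[str], object],
                          retries_per_replica: int = 1):
    """Try each replica in order; return the first successful connection."""
    last_error = None
    for replica in replicas:
        for _attempt in range(retries_per_replica):
            try:
                return connect(replica)           # success: stop here
            except ConnectionError as err:        # network-style failure: try the next replica
                last_error = err
    raise ConnectionError(f"all replicas failed, last error: {last_error}")

if __name__ == "__main__":
    # Dummy connect function standing in for a real database driver.
    def fake_connect(descriptor: str) -> str:
        if "t1_broken" in descriptor:
            raise ConnectionError(f"{descriptor} unreachable")
        return f"session on {descriptor}"

    print(connect_with_failover(["conditions@t1_broken", "conditions@t1_backup"],
                                fake_connect))
```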
February Summary http://gridview.cern.ch/
Total performance to tape
• ALICE and LHCb are running Castor 2.1.4 without policies, so around a 100% improvement in write performance is expected with 2.1.6
• With simulated file sizes, ATLAS write rates have improved to 30 MB/s
• The focus on file size and policies has shown some improvements in write performance
• Read efficiency remains low and dominates drive utilisation, due to the low number of files read per mount and to non-production users
Tape usage is read-dominated
• Random reads dominate drive time (90% reading)
• Writing is under the control of Castor policies
• Reading is much more difficult to improve from the Castor side (see the sketch below)
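To see why the number of files read per mount matters so much, here is a rough back-of-the-envelope calculation. All constants (mount time, per-file positioning time, file size, drive speed) are illustrative assumptions, not measured CCRC'08 values: the fixed mount and per-file positioning overheads swamp the transfer time when only a few files are read per mount.

```python
# Illustrative estimate of effective tape read rate vs files read per mount.
# All numeric constants are assumptions, not CCRC'08 measurements.

def effective_read_rate(files_per_mount: int,
                        file_size_gb: float = 1.5,
                        drive_rate_mb_s: float = 120.0,
                        mount_s: float = 120.0,
                        seek_s: float = 60.0) -> float:
    """Average MB/s of drive time, including mount and per-file positioning overhead."""
    data_mb = files_per_mount * file_size_gb * 1024
    transfer_s = data_mb / drive_rate_mb_s
    overhead_s = mount_s + files_per_mount * seek_s
    return data_mb / (transfer_s + overhead_s)

for n in (1, 5, 20, 100):
    print(f"{n:3d} files/mount -> {effective_read_rate(n):6.1f} MB/s effective")
```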
Production vs Users
• Data retrieved for the CCRC period for CMS
• CMS production runs under cmsprod and phedex (25% of the total)
• Requests for tape recalls are dominated by non-production users
• Equivalent data for ATLAS shows production requests < 5%
Options
• Do nothing
  • Hope things work out OK
• Tape prioritization in Castor
  • Complete the minimum implementation of VDQM2 and tape queue prioritization
  • A new long-term strategy may be necessary
• Dedicate resources
  • Fragmentation risks
• Hardware investment
  • Purchase 50 tape drives and tape servers
  • Cost is 15K CHF/drive and 6K CHF/tape server, 1050 kCHF in total (see the check below)
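A quick check of the cost figure from the per-unit prices given above; the assumption of one tape server per drive is not stated on the slide but reproduces the quoted total.

```python
# Cost check for the hardware-investment option (one tape server per drive assumed).
drives = 50
cost_per_drive_kchf = 15
cost_per_server_kchf = 6

total_kchf = drives * (cost_per_drive_kchf + cost_per_server_kchf)
print(total_kchf, "kCHF")   # -> 1050 kCHF
```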
Castor
• Invalid checksum value returned by the CASTOR gridftp2 server (reported by CMS on 05/02)
  • FIXED in 1.13-11 (07/02)
• Gsiftp TURLs returned by CASTOR are relative (reported by S2 and CMS on 06/02)
  • FIXED in 1.13-11 (07/02)
• Unable to map request to space for policy TRANSFER_WAN (reported by CMS on 07/02)
  • FIXED in 1.13-13 (08/02)
• The srmDaemon attempts to free an unallocated pointer and crashes (reported by CNAF)
  • FIXED in 1.13-14 (14/02)
• Some of the databases at CERN have shown an index to be missing (found by S2)
  • FIXED in 1.3.10-1 (15/02)
• Insufficient user privileges to make a request of type StagePutDoneRequest in service class 'atldata' (reported by S2 and ATLAS on 19/02)
  • PutDone executed by and allowed for (root,root); to be fixed
  • Workaround provided on 23/02
Castor
• Missing access control on spaces based on VOMS groups and roles (reported by ATLAS/LHCb on 19/02)
  • Followed up by the Storage Solution WG
• Could not get user information: VOMS credential ops does not match grid mapping dteam (reported by S2 and CNAF on 21/02)
  • Not yet understood
• Error creating statement, Oracle code: 12154 ORA-12154: TNS:could not resolve the connect identifier specified (reported by S2 and CNAF on 12/02)
  • Not yet understood
  • It happens at service startup; a restart cures the problem
• Server unresponsive at RAL? Space token ATLASDATADISK does not exist (reported by S2 and ATLAS on 28/02)
  • Number of threads increased from 100 to 150 (28/02)
Castor Summary
• 10 software problems reported, no major problems
• 6 problems fixed (in 2-3 days on average)
• Developers and operations people were very responsive
DPM
• Default ACLs on directories do not work (reported by ATLAS on 13/02)
  • FIXED in 1.6.7-4 (certified)
• Slow file removal (reported by ATLAS on 22/02)
  • ext3 filesystems are much slower than xfs for delete operations: 2048 files of 1.5 GB removed in 90 minutes, against 5 seconds with xfs (tests performed on 25/02; a measurement sketch follows below)
• DPM 1.6.10 is being certified and will be the release available for CCRC’08 in May
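For illustration, a minimal sketch of the kind of bulk-delete timing behind the ext3 vs xfs comparison above. The directory path and the reduced file size are assumptions chosen to keep the sketch cheap to run; this is not the actual test script used during CCRC'08, which used 2048 files of 1.5 GB each.

```python
# Bulk-delete timing sketch: create NUM_FILES files on the filesystem under test,
# then time their removal. ext3 unlink cost grows with the number of allocated
# blocks, so larger files widen the gap versus xfs.

import os
import time

TEST_DIR = "/tmp/dpm_delete_test"     # hypothetical path; place it on the filesystem under test
NUM_FILES = 2048                      # matches the reported test
FILE_SIZE = 16 * 1024 * 1024          # 16 MB per file here; reduce further if space is tight

def create_files() -> None:
    os.makedirs(TEST_DIR, exist_ok=True)
    payload = b"\0" * (1024 * 1024)   # write real (non-sparse) data, 1 MB at a time
    for i in range(NUM_FILES):
        with open(os.path.join(TEST_DIR, f"file_{i:04d}.dat"), "wb") as f:
            for _ in range(FILE_SIZE // len(payload)):
                f.write(payload)

def time_removal() -> float:
    start = time.monotonic()
    for name in os.listdir(TEST_DIR):
        os.remove(os.path.join(TEST_DIR, name))
    return time.monotonic() - start

if __name__ == "__main__":
    create_files()
    print(f"removed {NUM_FILES} files in {time_removal():.1f} s")
```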
Conclusion
• CCRC’08 is a success so far
• All DM software and tools have been able to scale to the challenge and beyond
• Everything is well under control in both the database and data management areas
• Strategic directions remain where investigations and major improvements or simplifications need discussion:
  • Improve efficiency for analysis
  • The tape area in general
  • Service for the online database, piquet service for support
  • Synergies between DM tools and Castor
  • Job scheduling in Castor, improved/common database schema for Grid DM tools and Castor
  • ...