SRM CCRC-08 and Beyond Shaun de Witt CASTOR Face-to-Face
Introduction • Problems in 1.3-X • And what we are doing about them • Positives • Setups • Recommendations • Future Developments • Release Procedures
Problems - Database • Deadlocks • Observed at CERN and ASGC (CNAF too?) • Not seen at RAL – cause of the difference unclear • Two types (loosely) • Daemon/daemon deadlocks • Server/daemon deadlocks • Startup problems • Too many connections • ORA-00600 errors
Daemon/Daemon deadlocks • Found ‘accidentally’ at CERN • Caused by multiple back-end daemons talking to the same database • Leads to database deadlocks in GC • In 2.7, GC has moved into the database as a stored procedure • Could be ported to 1.3, but not planned
Server/Daemon deadlocks • Caused by use of the CASTOR fillObj() API • When filling subrequests, multiple calls can lead to two threads blocking one another • Daemon and server both need to check status and possibly modify subrequest info • Proposed solution is to take a lock on the parent request (sketched below) • This would stop the deadlocks • But could lead to lengthy locks
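A minimal sketch of the proposed request-level locking, in Python with cx_Oracle; the table and column names here are assumptions for illustration, not the actual CASTOR schema. The idea is that both the server and the daemon lock the parent request row first, so they serialise there instead of deadlocking on individual subrequest rows:

```python
import cx_Oracle

def update_subrequest(conn, request_id, subrequest_id, new_status):
    """Serialise server/daemon access to a request's subrequests."""
    cur = conn.cursor()
    # Lock the parent request first: whichever of the server or daemon
    # arrives second now waits here, rather than deadlocking below.
    cur.execute("SELECT id FROM Request WHERE id = :rid FOR UPDATE",
                rid=request_id)
    cur.execute("UPDATE SubRequest SET status = :st WHERE id = :sid",
                st=new_status, sid=subrequest_id)
    conn.commit()  # the commit releases the request-level lock
```

The cost, as noted above, is that the request-level lock is held for the whole transaction, which is where the "lengthy locks" concern comes from.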
Problems - Database • Start up problems • Seen often at CNAF, infrequently at RAL • TNS ‘no listener’ error • Need to check logs at startup • No solution at the moment • Restarting cures the problem • Could add monitoring to watch for this error (sketched below)
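One possible shape for that monitoring, as a Python sketch; the log path and the alerting hook are assumptions a site would replace with its own conventions:

```python
#!/usr/bin/env python
"""Watch the SRM log for the TNS 'no listener' error seen at startup."""
import time

LOGFILE = "/var/log/srm/srm2.log"  # hypothetical path

def alert(line):
    # Hook for site monitoring (Nagios, email, ...); here we just print.
    print("SRM database listener problem: " + line.strip())

def watch(logfile=LOGFILE):
    with open(logfile) as f:
        f.seek(0, 2)  # start tailing from the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            if "TNS" in line and "no listener" in line.lower():
                alert(line)

if __name__ == "__main__":
    watch()
```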
Problems - Database • Too many connections • Seen at CERN • Partly down to configuration • Many SRMs talking to the same database instance • Two solutions • More database hardware, so fewer SRMs on the same instance • But expensive • Reduce threads on the server and daemon • May cause TCP timeout errors under load (server) or cause put/get requests to be processed too slowly (daemon) • More on configuration later
Problems - Database • ORA-00600 (internal error) problems • Seen at RAL and CERN • Oracle internal error • Renders the SRM unusable • Fix available from Oracle • RAL has not seen it since applying the fix • Gordon Brown at RAL can provide details
Problems - Network • Intermittent CGSI errors • Terminal CGSI errors • SRM ‘lock-ups’
Problems - Network • Intermittent CGSI-gSOAP errors • CGSI-gSOAP errors reported in logs and to the client • Seen 2-10 times per hour (at RAL) • Correlated in time between front-ends • Both get an error at about the same time • Cause is unclear • No solution at the moment • Affects < 0.1% of requests at RAL
Problems - Network • Terminal CGSI-gSOAP errors • All threads end up returning CGSI-gSOAP errors • Can affect only one of the front ends • Cause unknown • Does not seem correlated with load or request type • No solution at the moment • ASGC’s site report suggests a possible correlation with database deadlocks(?) • Need monitoring to detect this in the log file (sketched below) • Restart of the affected front end normally clears the problem • New version of the CGSI plug-in available, but not yet tested
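Since a restart normally clears this terminal failure mode, the log monitoring could go one step further and bounce the front end automatically. A sketch, where the log path, error pattern, threshold, and service name are all assumptions:

```python
"""Restart the front end when every thread appears to return CGSI-gSOAP
errors (the 'terminal' failure mode described above)."""
import subprocess
import time

LOGFILE = "/var/log/srm/srm2.log"  # hypothetical path
THRESHOLD = 50  # consecutive CGSI-gSOAP errors before restarting

def monitor():
    consecutive = 0
    with open(LOGFILE) as f:
        f.seek(0, 2)  # tail from the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            if "CGSI-gSOAP" in line:
                consecutive += 1
                if consecutive >= THRESHOLD:
                    # every thread looks wedged: bounce this front end
                    subprocess.call(["service", "srm2", "restart"])
                    consecutive = 0
            else:
                consecutive = 0  # any healthy log line resets the count

if __name__ == "__main__":
    monitor()
```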
Problems - Network • SRM becomes unresponsive • Debugging indicates all threads stuck in recv() • Cause unknown • May have been the cause of the ATLAS ‘blackouts’ during the first CCRC • New releases include recv() and send() timeouts • Should stop this (illustrated below) • Two new configurable parameters in srm2.conf
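A conceptual Python illustration of why the timeouts help (this is not the SRM's actual gSOAP code): a blocking recv() on a silent peer never returns, whereas a socket with a timeout hands control back to the thread so it can fail the request and move on.

```python
import socket

# Imitate a stalled peer: a connection on which no data ever arrives.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
client = socket.create_connection(server.getsockname())
conn, _ = server.accept()

conn.settimeout(2)  # analogue of SOAPRECVTIMEOUT (60 s suggested later)
try:
    conn.recv(4096)  # without a timeout this would block forever
except socket.timeout:
    print("recv() timed out; the thread is freed instead of hanging")
```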
Problems - Other • Interactions with CASTOR • Behaviour when CASTOR is slow • Needless RFIO calls loading job slots • Bulk Removal requests • Use of MSG field in DLF
Problems - Other • Behaviour when CASTOR becomes slow • See the error “Too many threads busy with CASTOR” • Can block new requests coming in • But a useful diagnostic of CASTOR problems • Solution is to decrease STAGERTIMEOUT in srm2.conf • The default of 900 secs is too long • Most clients give up after 180 secs • No ‘hard and fast’ rule about what it should be • Somewhere between 60 and 180 secs is the best guess • Pin time • An implementation ‘miscommunication’ meant too heavy a weight was applied • Fixed in 1.3-27 • Also reduce Pin Lifetime in srm2.conf
Problems - Other • Needless RFIO calls • Identified by CERN • Take up job slots on CASTOR • Time out after 60 seconds • Issued on all GETs without a space token • Introduced when support for multiple default spaces was added • Fix already in CVS (sketched below) • For release 2.7 • Duplicates the code path used when a space token is provided • Could be backported to 1.3
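A rough Python sketch of the control flow behind the fix; every name here is a hypothetical placeholder for illustration, not the actual CASTOR SRM code. The point is that a GET without a space token resolves the VO's default space directly, instead of probing via an RFIO call that ties up a CASTOR job slot for up to 60 seconds:

```python
from collections import namedtuple

Request = namedtuple("Request", "surl vo space_token")

def lookup_space(token):
    """Placeholder: map a space token to its service class."""
    return "space-for-" + token

def default_space_for(vo):
    """Placeholder: the VO's default space, taken from configuration."""
    return "default-space-" + vo

def stage_get(surl, space):
    """Placeholder: schedule the CASTOR access in the chosen space."""
    return ("GET", surl, space)

def handle_get(request):
    if request.space_token:
        # Existing path: the token identifies the target space directly.
        space = lookup_space(request.space_token)
    else:
        # The fix mirrors that path for token-less GETs: resolve the
        # default space from configuration instead of issuing an RFIO
        # probe that occupies a job slot until its 60 s timeout.
        space = default_space_for(request.vo)
    return stage_get(request.surl, space)
```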
Problems - Other • Bulk removal requests • Sometimes produce CGSI-gSOAP errors for large numbers of files (>50) • But the deletion does work – a problem on send()? • May be load related • On one day, 4/6 tests with 100 files produced this error • The next day, 0/6 tests with 1000 files produced it • Some discussion about dropping the stager_rm call and just doing an nsrm • May help speed up processing • But would leave more work for the CASTOR cleaning daemon
Problems - Other • Lots of MSG fields left blank • A problem for monitoring • Addressed in 2.7 • Will not be backported • Occasional crashes • Traced to use of strtok() (rather than the thread-safe strtok_r()) • Fixed in 1.3-27
Positives • Request rate • At RAL on one CMS front end with 50 threads: • 21K requests/hr • Distribution of request types not known • Processing speed • Again using CMS at RAL • Daemon running 10/5 threads • PUT requests processed in 1-5 seconds • Same for GET requests without tape recall
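For scale, 21K requests/hr works out to roughly 5.8 requests/s across the front end, or about one request per server thread every 8-9 seconds at 50 threads.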
Positives • Front end quite stable • At RAL, few interventions required
Setups • Different sites have different hardware setups • Hope you can fill the gaps…!
RAL Setup • Diagram: SRM-CMS, SRM-ATLAS, SRM-LHCb and SRM-ALICE front ends, all against a 3-node RAC
CERN Setup • Diagram: srm-cms, srm-alice, srm-dteam and srm-ops against a shared database (shared-db) on a single machine; srm-atlas and srm-lhcb against dedicated databases (atlas-db, lhcb-db)
CNAF Setup • Diagram: srm-cms against cms-db and srm-shared against shared-db, on a single machine
ASGC Setup • Diagram: a single srm front end, with srm-db, castor-db and dlf-db on a 3-node RAC
Useful Configuration Parameters • Based on your setup, you will need to tune some or all of the following parameters (example below): • SERVERTHREADS • CASTORTHREADS • REQTHREADS • POLLTHREADS • COPYTHREADS • The more SRMs on a single database instance, the fewer threads should be assigned to each SRM • Need to balance request and processing rates on the daemon and server • SOAPBACKLOG • SOAPRECVTIMEOUT • SOAPSENDTIMEOUT • Number of queued SOAP requests, and the timeouts applied to recv() and send() • Best ‘guesstimates’ for these are 100, 60 and 60 • TIMEOUT • Stager timeout in castor.conf • Best ‘guesstimate’ 60-180 seconds • PINTIME • Keep low
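As an illustration only, the settings above might look something like this in srm2.conf; the key/value syntax is an assumption, the thread counts echo the RAL CMS figures quoted earlier, and the timeout values sit inside the ‘guesstimate’ ranges from this talk:

```
# Thread pools: scale these down when several SRMs share one DB instance
SERVERTHREADS    50
CASTORTHREADS    10
REQTHREADS       10
POLLTHREADS      5
COPYTHREADS      5

# SOAP backlog and socket timeouts (guesstimates: 100, 60, 60)
SOAPBACKLOG      100
SOAPRECVTIMEOUT  60
SOAPSENDTIMEOUT  60

# Stager timeout: keep well under the ~180 s at which clients give up
STAGERTIMEOUT    120

# Pin lifetime: keep low (the exact value is a site choice)
PINTIME          300
```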
Future Developments • Move to SLC4 • Move to CASTOR clients 2.1.7 • New MoU
Move to SLC4 • URGENT • No support for SLC3 • Support effort for SL3 is dwindling • Have built and tested one version • In the 1.3 series • All new development (2.7-X) is on SLC4 • No new development in the 1.3 series
Move to 2.1.7 clients • URGENT • Addresses a security vulnerability relating to proxy certificates • Much better error messaging • Fewer ‘unknown error’ messages • 2.1.3 clients no longer supported or developed • Since this requires a schema change, releases in this series will be 2.7-X
New MoU • Major new features: • srmPurgeFromSpace • Used to remove disk copies from a space • Initial implementation will only remove files that are also on tape • VOMS-based security • This will be implemented in CASTOR but may need changes to the SRM/CASTOR interface
Future Development Summary • New features will go into 2.7-X or later releases • 2.7-X releases only on SLC4 • Is a port of 1.3-X to SLC4 required? • Especially given the security hole in 1.3 • Will require 2.1.7 clients installed on the SRM nodes • Timescale? • End June. A tall order!
Release Procedures • Following problems just after CCRC: • The SRM seemed to pass all tests • But the daemon failed immediately in production (at CERN and RAL) • Brought about by a ‘simple’ change which only affected recalls when no space token was passed • Clear need for additional tests before release • The public s2 suite is not enough
Pre-Release Procedures • (Re)developing a shell test tool which will be delivered with the SRM (sketched below) • To include basic tests of all SRM functions • Will include testing of tape recalls where possible (i.e. not on a Disk1Tape0-only system) • New tests added when we find missing cases • Will require the tester to have a certificate (i.e. cannot be run as root) • Looking at running the FULL s2 test suite • This includes tests of a number of invalid requests • Not normally run since VERY time consuming
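The planned tool is a shell script; purely to illustrate its shape, here is a Python sketch of that kind of driver. The client command names and the endpoint are placeholders, and the tester's grid certificate is assumed to be in place, per the note above:

```python
#!/usr/bin/env python
"""Sketch of a release test driver: run a list of SRM client commands
against an endpoint and report pass/fail per SRM function."""
import subprocess
import sys

ENDPOINT = "srm://srm-test.example.org:8443/srm/managerv2"  # hypothetical

# Each entry: (test name, command line). A real tool would cover every
# SRM function, including tape recalls where the setup allows it.
TESTS = [
    ("ping", ["srm-client", "ping", ENDPOINT]),  # placeholder CLI
    ("put",  ["srm-client", "put", "local.dat", ENDPOINT + "/test/f1"]),
    ("get",  ["srm-client", "get", ENDPOINT + "/test/f1", "back.dat"]),
    ("rm",   ["srm-client", "rm", ENDPOINT + "/test/f1"]),
]

def main():
    failures = 0
    for name, cmd in TESTS:
        rc = subprocess.call(cmd)
        print("%-6s %s" % (name, "OK" if rc == 0 else "FAILED (rc=%d)" % rc))
        failures += (rc != 0)
    sys.exit(1 if failures else 0)  # non-zero exit for release gating

if __name__ == "__main__":
    main()
```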
Pre-Release Procedures • As now, s2 tests will be run over one week to try to ensure stability • The remaining problem is stress testing • No dedicated stress tests exist • But stress is what is most likely to expose database problems • Could develop simple ones (see the sketch below) • But would they be realistic enough?
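One such simple stress test, as a Python sketch: fire concurrent requests from a thread pool and record the error rate. The client command is the same placeholder as in the harness above; the request mix, volume, and concurrency are assumptions a site would tune (and the realism caveat above still applies):

```python
"""Simple concurrency stress driver for an SRM endpoint."""
from concurrent.futures import ThreadPoolExecutor
import subprocess

ENDPOINT = "srm://srm-test.example.org:8443/srm/managerv2"  # hypothetical
CONCURRENCY = 50  # mirrors the 50 server threads quoted earlier
REQUESTS = 2000

def one_request(i):
    # A realistic test would mix puts, gets, ls and removes; this pings.
    return subprocess.call(["srm-client", "ping", ENDPOINT])

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, range(REQUESTS)))

errors = sum(1 for rc in results if rc != 0)
print("%d/%d requests failed" % (errors, REQUESTS))
```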