CASTOR 2.1.9 Upgrade, Testing and Issues Shaun de Witt GRIDPP-25 23 August 2010
Agenda • Testing • What we Planned, what we did and what the VOs are doing • Results • Issues • Rollout Plan • The Future
Planned Testing • Original Plan • Test database Upgrade Procedure • Functional Test 2.1.7/8/9 • Stress Test 2.1.7/8/9 • 10K reads (1 file in, multiple reads) (rfio+gridFTP) • 10K writes (multiple files in) (rfio+gsiftp) • 10K disk-to-disk (d-2-d) copies (1 file in, multiple reads) (rfio) • 20K read/write (rfio+gridFTP), 10K mixed tests • 10K stager_qry (database test) • 5 file sizes (100 MB to 2 GB)
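As a rough illustration of the size of the original plan, the stress-test list above can be written down as data and counted. This is only a sketch: the `StressTest` class and the helper names are hypothetical, "gsiftp" is treated as the gridFTP protocol, and the slide does not say whether each test was to be repeated per protocol or per file size, so only the per-test operation counts are modelled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class StressTest:
    """One planned stress test (hypothetical structure, for illustration)."""
    name: str
    operations: int
    protocols: tuple  # empty where the slide names no protocol


# Operation counts and protocols as listed on the slide.
PLANNED_TESTS = [
    StressTest("read", 10_000, ("rfio", "gridFTP")),
    StressTest("write", 10_000, ("rfio", "gridFTP")),
    StressTest("disk-to-disk", 10_000, ("rfio",)),
    StressTest("read/write", 20_000, ("rfio", "gridFTP")),
    StressTest("mixed", 10_000, ()),
    StressTest("stager_qry", 10_000, ()),
]

VERSIONS = ("2.1.7", "2.1.8", "2.1.9")


def operations_per_version(tests):
    """Total operations for one CASTOR version, ignoring any repetition
    per protocol or per file size (the slide leaves that unspecified)."""
    return sum(t.operations for t in tests)


def plan(tests, versions):
    """Every (version, test name) pair in the original plan."""
    return [(v, t.name) for v, t in product(versions, tests)]


if __name__ == "__main__":
    print(operations_per_version(PLANNED_TESTS))
    print(len(plan(PLANNED_TESTS, VERSIONS)))
```

Even under this minimal reading, each version needs tens of thousands of transfers, and multiplying by protocols or by five file sizes only increases that.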
Required Changes • Move to ‘local’ nameserver • Required to allow rolling updates • Nameserver schema cannot be upgraded until all instances are at 2.1.9 • Move from SLC4 to SL4 • Support for SLC4 ends this year • SL4 supported until 2012 • Change of disk servers part way through testing
Actual Testing • (*) Indicates a schema-only upgrade; the RPMs remained at the previous version • (†) Move from SLC4 to SL4 after stress testing
Actual Stress Testing • Original plan for fix would have taken too long • Moved to fixed-duration testing (24 hr limit) • Reduced number of file sizes from 5 to 2 • 100 MB and 2 GB • No mixed tests
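The switch from a fixed number of operations to a fixed-duration run can be sketched as a loop bounded by a deadline instead of a counter. This is a minimal sketch and not the scripts actually used at RAL: `run_fixed_duration` and its `operation` callback are hypothetical names, and a real test would call an rfio or gridFTP transfer where the callback is.

```python
import time

# The 24-hour limit mentioned on the slide, in seconds.
LIMIT_SECONDS = 24 * 60 * 60


def run_fixed_duration(operation, limit_seconds, clock=time.monotonic):
    """Call operation() repeatedly until limit_seconds have elapsed.

    Returns (completed, failed). The clock is injectable so the loop
    can be exercised without waiting in real time.
    """
    deadline = clock() + limit_seconds
    completed = 0
    failed = 0
    while clock() < deadline:
        try:
            operation()
            completed += 1
        except Exception:
            # Count the failure and keep going; a stress test should
            # not stop at the first failed transfer.
            failed += 1
    return completed, failed


def make_fake_clock():
    """Clock that advances one 'second' each time it is read."""
    ticks = iter(range(1_000_000))
    return lambda: next(ticks)


if __name__ == "__main__":
    # Dry run with a no-op operation and a fake clock.
    print(run_fixed_duration(lambda: None, 5, clock=make_fake_clock()))
```

With a fixed duration the number of completed operations becomes a result, not an input, which is one reason throughput figures from such runs are harder to compare with the originally planned fixed-count tests.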
Results • All 2.1.8 Functional Tests pass • Most 2.1.9 tests pass • With some modifications to scripts • Including xrootd! • Some fail because they require a CERN-specific setup • Stable under stress testing • Changes made performance metrics less useful • Overall impression is no significant change
Issues (on Testing) • Limit on clients • More stress on client machines than CASTOR • Unable to test extreme LSF queues • VO testing includes stress (HammerCloud) tests • Functional tests done with ‘matching’ client version • Some basic testing also done with older client versions (2.1.7) against later stager versions • VOs using 2.1.7 clients
Issues (on CASTOR) • Remarkably few... • DLF not registering file id • Fixed by CERN; we need a custom version of DLF.py • No 32-bit xroot RPMs available • Produced for us, but not fully supported • gridFTP external (used at RAL) does not support checksumming • Some database cleanup needed before upgrade
Issues (VO Testing) • Some misconfigured disk servers • Problems with xrootd for ALICE • Disk servers need firewall ports opening.
Issues (in 2.1.9-6) • Known issues affecting 2.1.9-6 • Rare checksum bug affecting gridFTP internal • Fixed in 2.1.9-8 • Can get file inconsistencies during repack if file is overwritten • Very unlikely (fixed in 2.1.9-7) • Xrootd manager core dumps at CERN • Under investigation • Problem with multiple tape copies on file update
Change Control • Whole testing and rollout plan has been through extensive change review • Four separate reviews, some done independently of CASTOR team • Included review of Update Process • Provided useful input for additional tests, highlighted limitations, and identified impacted systems • Proposed regular reviews during upgrades • Detailed update plan under development
Rollout Plan • High level docs available for some time now: • https://www.gridpp.ac.uk/wiki/RAL_Tier1_Upgrade_Plan • Three downtimes • Schedule to be agreed with VOs • Proposed schedule sent to VOs • Likely LHCb will be guinea pigs • ALICE before Heavy Ion run
Schedule (draft) • Rolling move to local nameserver starting 13/9 • Main update: • LHCb: 27/9 • GEN (ALICE): 25/10 • ATLAS: 8/11 • CMS: 22/11 • Revert to central nameserver after Christmas
The Future • More CASTOR/SRM upgrades • 2.1.9-8 to address known issues • SRM 2.9: more performant, safer against DoS • Move to SL5 • Probably next year; no RPMs available yet • CASTOR gridFTP ‘internal’ • More use of xrootd • More stable database infrastructure (Q1 2011?)
Facilities Instance • Provide CASTOR instance for STFC facilities • Provides (proven) massively scalable “back end” storage component of a deeper data management architectural stack • CASTOR for STFC facilities: production system to be deployed ~ Dec 2010 • STFC friendly users currently experimenting with CASTOR • Users expected to interface to CASTOR via “Storage-D” (high-performance data management pipeline) • E-Science aiming for a common architecture for “big data management”: • CASTOR: back-end data storage • Storage-D: middleware • ICAT: file and metadata catalogue • TopCat: multi-user web access • Can eventually wind down the sterling (but obscure) “ADS” service (very limited expertise, non-Linux operating system, unknown code in many parts) • Exploits current (and future) skill set of the group
Summary • New CASTOR was stable under stress testing • And VO testing – so far • Performance not impacted – probably • Very useful getting experiments on-board for testing. • ‘Ready’ for deployment