
CASTOR 2.1.9 Upgrade, Testing and Issues




  1. CASTOR 2.1.9 Upgrade, Testing and Issues Shaun de Witt GRIDPP-25 23 August 2010

  2. Agenda • Testing • What we Planned, what we did and what the VOs are doing • Results • Issues • Rollout Plan • The Future

  3. Planned Testing • Original Plan • Test database upgrade procedure • Functional test of 2.1.7/8/9 • Stress test of 2.1.7/8/9 • 10K reads (1 file in, multiple reads) (rfio+gridFTP) • 10K writes (multiple files in) (rfio+gsiftp) • 10K disk-to-disk copies (1 file in, multiple reads) (rfio) • 20K read/write (rfio+gridFTP), 10K mixed tests • 10K stager_qry (database test) • 5 file sizes (100 MB–2 GB)
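The plan above amounts to a simple client-side driver that loops over CASTOR client commands. Below is a minimal Python sketch of such a driver, assuming the standard CASTOR client tools (rfcp, stager_qry) are installed and configured on the worker node; the namespace path, file names and operation count are placeholders, not the actual RAL test setup.

```python
# Minimal sketch of a stress-test driver in the spirit of the plan above.
# The CASTOR namespace path and local file names are placeholders; rfcp and
# stager_qry are assumed to be installed and configured on the client node.
import subprocess
import time

N_OPS = 10_000                                 # e.g. 10K reads or 10K queries
CASTOR_DIR = "/castor/example.ac.uk/stress"    # placeholder namespace path
LOCAL_FILE = "/tmp/testfile_100MB"             # pre-generated test file

def write_test(i):
    """One write: copy a local file into CASTOR over RFIO."""
    return subprocess.run(["rfcp", LOCAL_FILE, f"{CASTOR_DIR}/write_{i}"]).returncode

def read_test(i):
    """One read: copy the single pre-staged source file back out over RFIO."""
    return subprocess.run(["rfcp", f"{CASTOR_DIR}/read_source", f"/tmp/read_{i}"]).returncode

def query_test(_i):
    """One stager_qry call, exercising the stager database."""
    return subprocess.run(["stager_qry", "-M", f"{CASTOR_DIR}/read_source"]).returncode

if __name__ == "__main__":
    start = time.time()
    failures = sum(1 for i in range(N_OPS) if read_test(i) != 0)
    print(f"{N_OPS} reads, {failures} failures, {time.time() - start:.0f}s elapsed")
```

The gridFTP variants of the same loop would substitute a gridFTP client transfer (e.g. globus-url-copy) for the rfcp call.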

  4. Required Changes • Move to ‘local’ nameserver • Required to allow rolling updates • Nameserver schema cannot be upgraded until all instances are at 2.1.9 • Move from SLC4 to SL4 • Support for SLC4 ends this year • SL4 supported until 2012 • Change of disk servers part way through testing

  5. Actual Testing • (*) Indicates a schema-only upgrade; the RPMs remained at the previous version • (†) Moved from SLC4 to SL4 after stress testing

  6. Actual Stress Testing • Original fixed-count plan would have taken too long • Moved to fixed-duration testing (24 hr limit) • Reduced number of file sizes from 5 to 2 • 100 MB and 2 GB • No mixed tests
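The fixed-duration run differs from the fixed-count plan only in the loop bound. Continuing the hypothetical driver sketched earlier, the snippet below caps the test at a wall-clock limit rather than a number of operations.

```python
# Minimal sketch of the fixed-duration variant: run operations until a
# wall-clock limit is reached instead of for a fixed count. The callable
# passed in would be one of the hypothetical helpers from the earlier sketch.
import time

DURATION_LIMIT = 24 * 3600            # 24-hour cap, as in the revised plan

def run_for_duration(op, limit=DURATION_LIMIT):
    """Call op(i) repeatedly until the time limit expires; count failures."""
    start = time.time()
    done = failed = 0
    while time.time() - start < limit:
        if op(done) != 0:
            failed += 1
        done += 1
    return done, failed
```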

  7. Results • All 2.1.8 functional tests pass • Most 2.1.9 tests pass • With some modifications to scripts • Including xrootd! • Some fail because they require a CERN-specific setup • Stable under stress testing • Changes made the performance metrics less useful • Overall impression is no significant change

  8. Issues (on Testing) • Limit on the number of clients • More stress on client machines than on CASTOR • Unable to test extreme LSF queues • VO testing includes stress (HammerCloud) tests • Functional tests done with ‘matching’ client version • Some basic testing also done with older client versions (2.1.7) against later stager versions • VOs are using 2.1.7 clients

  9. Issues (on CASTOR) • Remarkably few... • DLF not registering the file id • Fixed by CERN – we need a custom version of DLF.py • No 32-bit xroot RPMs available • Produced for us, but not fully supported • External gridFTP (used at RAL) does not support checksumming • Some database cleanup needed before upgrade
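Because the externally maintained gridFTP server does not record checksums, a transfer can still be cross-checked on the client side. The sketch below compares a locally computed ADLER32 with the value stored in the CASTOR name server; reading it back with "nsls --checksum", and the position of the field in its output, are assumptions that may differ between CASTOR releases.

```python
# Sketch: client-side checksum cross-check after a gridFTP transfer, for the
# case where the server itself does not record a checksum. The local ADLER32
# is computed with zlib; the "nsls --checksum" call and its output format are
# assumptions about the CASTOR client tools, not verified against a release.
import subprocess
import zlib

def local_adler32(path, chunk=1 << 20):
    """Compute the ADLER32 of a local file, hex-encoded."""
    value = 1
    with open(path, "rb") as f:
        while data := f.read(chunk):
            value = zlib.adler32(data, value)
    return format(value & 0xFFFFFFFF, "08x")

def castor_adler32(castor_path):
    """Ask the name server for the stored checksum (assumed option and format)."""
    out = subprocess.run(["nsls", "--checksum", castor_path],
                         capture_output=True, text=True).stdout
    return out.split()[1] if out else ""   # field position is an assumption

def verify(local_path, castor_path):
    return local_adler32(local_path) == castor_adler32(castor_path)
```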

  10. Issues (VO Testing) • Some misconfigured disk servers • Problems with xrootd for ALICE • Disk servers need firewall ports opening.

  11. Issues (in 2.1.9-6) • Known issues affecting 2.1.9-6 • Rare checksum bug affecting gridFTP internal • Fixed in 2.1.9-8 • Can get file inconsistencies during repack if file is overwritten • Very unlikely (fixed in 2.1.9-7) • Xrootd manager core dumps at CERN • Under investigation • Problem with multiple tape copies on file update

  12. Change Control • Whole testing and rollout plan has been extensively change reviewed • Four separate reviews, some done independently of the CASTOR team • Included a review of the update process • Provided useful input for additional tests, highlighted limitations and identified impacted systems • Proposed regular reviews during upgrades • Detailed update plan under development

  13. Rollout Plan • High-level docs have been available for some time: • https://www.gridpp.ac.uk/wiki/RAL_Tier1_Upgrade_Plan • Three downtimes • Schedule to be agreed with the VOs • Proposed schedule sent to the VOs • Likely LHCb will be the guinea pigs • ALICE before the Heavy Ion run

  14. Schedule (draft) • Rolling move to local nameserver starting 13/9 • Main update: • LHCb: 27/9 • GEN (ALICE): 25/10 • ATLAS: 8/11 • CMS: 22/11 • Revert to central nameserver post-Xmas

  15. The Future • More CASTOR/SRM upgrades • 2.1.9-8 to address known issues • 2.9 SRM is more performant and safer against DoS • Move to SL5 • Probably next year; no RPMs available yet • CASTOR gridFTP ‘internal’ • More use of xrootd • More stable database infrastructure (Q1 2011?)

  16. Facilities Instance • Provide a CASTOR instance for STFC facilities • Provides a (proven) massively scalable “back end” storage component of a deeper data management architectural stack • CASTOR for STFC facilities: production system to be deployed ~Dec 2010 • STFC friendly users currently experimenting with CASTOR • Users expected to interface to CASTOR via “Storage-D” (high-performance data management pipeline) • E-Science aiming for a common architecture for “big data management”: • CASTOR – back-end data storage • Storage-D – middleware • ICAT – file and metadata catalogue • TopCat – multi-user web access • Can eventually wind down the sterling (but obscure) “ADS” service (very limited expertise, non-Linux operating system, unknown code in many parts) • Exploits current (and future) skill set of the group

  17. Summary • New CASTOR was stable under stress testing • And VO testing – so far • Performance not impacted – probably • Very useful getting experiments on-board for testing. • ‘Ready’ for deployment

  18. Results (Stress Tests, 100MB)

  19. Results (Stress Tests, 2GB)
