Repack and Tape Label Options
Tim Bell, Charles Curran, Gordon Lee
June 27th 2008

The Bulk Repack Problem
• IBM and Sun have new drives coming
  • Aim for production at CERN in January
  • Higher capacity (1TB per tape)
  • Faster drives (up to 160MBytes/s)
• Require repacking to avoid buying new media and robot slots
• Current dataset
  • 104 million files
  • 15PB storage
  • 39000 tapes

Why are we copying?
• The purchase cost avoided is the additional media and slots that would be required if we wrote at new densities but did not copy and recycle old tapes
• Adds up to a saving of 3.2M CHF
• With higher density and repack, media requirements for 2009 are covered
• Without higher density, 2 new 10,000 slot robots would be required in 2009

Per-VO file sizes
• Some improvements in file sizes from LHC experiments over the past 6 months but no major revolution expected
• Current average is 154MBytes per file

Per Tape Distribution
• Long tail up to 154,000 files per tape
• Only 25% of tapes have average file size >1 GByte
• Projected year end 2008 based on LHC usage

Castor tape formats
[Diagram: tape layout for three Castor files A, B, C]
Castor: A B C
AUL: H M A M T M H M B M T M H M C T M
NL: A M B M C M

File size and performance
• AUL shows 7.3 seconds overhead per file
• NL shows 3.3 seconds overhead per file
• Tests using low level tape to tape copy are covered by read/cksum/write
• Figures confirmed by running repack2 and Castor to aul and nl tapes
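The per-file overheads above can be turned into a rough effective-throughput estimate. The sketch below assumes a simple additive model (a fixed overhead per file plus streaming time at a constant drive rate). The model, the function name and the 80 MBytes/s streaming rate are illustrative assumptions, not results from the tests.

```python
# Measured per-file overheads from the slide (seconds per file).
OVERHEAD_S = {"aul": 7.3, "nl": 3.3}


def effective_rate_mb_s(file_size_mb, label, drive_rate_mb_s=80.0):
    """Effective MBytes/s for one file under an ASSUMED additive model:
    time = fixed per-file overhead + size / streaming rate."""
    seconds = OVERHEAD_S[label] + file_size_mb / drive_rate_mb_s
    return file_size_mb / seconds


if __name__ == "__main__":
    # 154 MB is the current average file size quoted in the slides;
    # the larger sizes are arbitrary illustration points.
    for size_mb in (154, 1000, 10000):
        for label in ("aul", "nl"):
            rate = effective_rate_mb_s(size_mb, label)
            print(f"{size_mb:>6} MB  {label:>3}  {rate:5.1f} MB/s")
```

Under these assumptions a 154 MByte file is written at roughly 17 MBytes/s to aul and 29 MBytes/s to nl, which illustrates why the label format matters most for small files.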

Repack in a year
• This is the number of drives which would need to be dedicated to complete the repack within 1 year.
• The write performance varies with different output label types
• Includes projected data to year end 2008
• Drive costs around 35K CHF over 3 years
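The shape of the "drives for one year" calculation can be sketched as follows. This is a minimal estimate, assuming the same additive per-file model (7.3 s for aul, 3.3 s for nl), an 80 MBytes/s streaming rate, drives that are busy 100% of the time, and only the current dataset of 104 million files and 15PB (taken as decimal petabytes). It counts output drives only and does not include the projected data to year end 2008, so it is not expected to reproduce the presentation's own figures.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def drives_for_one_year(n_files, total_bytes, overhead_s, rate_bytes_s):
    """Output drives needed to write everything within one year, ASSUMING
    a fixed overhead per file plus streaming time, with no idle time."""
    drive_seconds = n_files * overhead_s + total_bytes / rate_bytes_s
    return math.ceil(drive_seconds / SECONDS_PER_YEAR)


if __name__ == "__main__":
    n_files = 104_000_000   # current dataset, from the slides
    total_bytes = 15e15     # 15PB, taken here as decimal petabytes
    for label, overhead_s in (("aul", 7.3), ("nl", 3.3)):
        n = drives_for_one_year(n_files, total_bytes, overhead_s, 80e6)
        print(f"{label}: {n} drives")
```

The per-file overhead term dominates the total for both label types in this model, which is consistent with the label format being the main lever on drive count.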

Ignore worst cases
• Determine drive requirements if we ignore the projected 6000 tapes with >10000 files
• Leave worst cases in the robot unpacked (i.e. cost of 0.5M CHF for 3000 more tapes/slots/robots)

Repack using 20 drives
• Approach is to take easy tapes with large files first
• Repack using aul tapes would take over 3 years to complete
• Max80 figures reflect the performance if the engine is able to sustain reading at 80MBytes/s; Max50 and Max25 likewise for 50MBytes/s and 25MBytes/s
• The ‘to migrate’ queue would be around 400,000 files at the end of processing if 20 drives are used.

IL – Internal Label Format
• New format of data on tape to reduce the number of file marks
• Stores data located by block offset rather than file sequence number
• Tape mark only at the end of the migration stream rather than end of each file
• Simple prototype copy program has produced 85MBytes/s. Full drive speed can be achieved if shared buffers are used.
• This label format is new and is therefore not currently supported by Castor

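IL tape format
[Diagram: tape layout for three Castor files A, B, C]
Castor: A B C
AUL: H M A M T M H M B M T M H M C T M
NL: A M B M C M
IL: M (file data precedes the single tape mark at the end of the stream)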
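The difference between the three layouts can be sketched in code. The example below counts tape marks per migration stream and builds a block-offset index of the kind an IL-style format would need. It assumes, reading the diagrams, three tape marks per file for aul (after header, data and trailer) and one per file for nl. The `ILEntry` type, `build_il_index` function and the 256 KiB block size are hypothetical names and values chosen for illustration, not part of Castor.

```python
from dataclasses import dataclass

# Tape marks written per Castor file, as read off the layout diagrams
# (ASSUMPTION: aul = header, data and trailer each followed by a mark).
MARKS_PER_FILE = {"aul": 3, "nl": 1}


def tape_marks(n_files, label):
    """Tape marks needed to write n_files in one migration stream."""
    if label == "il":
        return 1 if n_files > 0 else 0  # single mark at end of stream
    return n_files * MARKS_PER_FILE[label]


@dataclass
class ILEntry:
    """Hypothetical index record locating one file by block offset."""
    name: str
    block_offset: int  # first block of the file within the stream
    n_blocks: int


def build_il_index(files, block_size=256 * 1024):
    """Assign consecutive block offsets to (name, size_bytes) pairs."""
    index, offset = [], 0
    for name, size in files:
        n_blocks = -(-size // block_size)  # ceiling division
        index.append(ILEntry(name, offset, n_blocks))
        offset += n_blocks
    return index


if __name__ == "__main__":
    for label in ("aul", "nl", "il"):
        print(label, tape_marks(1000, label))
    for entry in build_il_index([("A", 300_000), ("B", 1), ("C", 900_000)]):
        print(entry)
```

Locating a file by block offset means the per-file position must be recorded somewhere outside the tape marks, which is why name server fields for the block offset appear later as an open question.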

Intermediate Conclusion
• Given the file sizes and drives currently being used, the label format is the limiting factor for performance
• The engine used for copying is a secondary performance factor. This factor becomes more important for label formats or file sizes which support higher speeds such as 50MB/s or more.
• Scanning tapes at full drive speed can be used to validate a complete repack commit to the name server

Option A – bulk repack
• Need a new low level label format using block addressing to write many Castor files without tape marks
• Develop a new low level repack program which writes out in il format using direct tape to tape copy with two tape drives on a tape server
• Enhance Castor to support reading il format in the short term
• Writing il format requires modifications to rtcpd/rtcpclientd, as current writing is file-by-file and il requires a full stream. This is unlikely before the clustering implementation, which will itself require rework in this area, is done. Until then, continue to write new data in aul format.

Option B - clustering
• Architecture task force recommended clustering related data onto tape.
• One possible implementation of this would be to merge many related Castor files into a single large file when migrating to tape, which is then recalled as a unit.
• Start using the repack2 engine at maximum speed and aul tapes on tapes with large files until clustering is available
• Once clustering is available, repack many tapes in parallel to allow related files to be grouped together on tape for more efficient recall.
• Need at least 30 disk servers for production repack service class to ensure reasonable clustering and drive performance.
• Cluster implementation needs to be architected and implemented, with policies defined and deployed, at the very latest by end 1Q 2009 to avoid delays in the repacking process.

Option C – tape to tape copy
• Develop a new low level repack program which is able to write nl tape format output using direct tape to tape copy with two tape drives on a tape server
• Write in nl format and partial re-scan of tapes on completion to validate contents
• 80% of tapes (giving 14PB additional space) can be completed in 1 year with 25 drives which may be sufficient for 2009 data

Costing
• Option A – bulk repack
  • Development for
    • Bulk repack tool
    • Support of new label format for read in Castor
    • Name server fields for block offset?
  • 22 drives for 1 year
• Option B – start repack2/aul then clustering
  • Development for
    • 2nd level disk hierarchy
    • Legacy cluster definitions
  • Hardware
    • 33 disk servers @ 8K CHF / disk server dedicated for one year
    • Fat tape servers purchase required?
  • 33 drives for 1 year
• Option C – copy only good cases to nl
  • Development for
    • Bulk repack tool
  • 25 drives for 1 year
  • Purchase 3000 additional slots (0.5M CHF)
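A rough hardware-only comparison of the three options can be sketched from the unit prices quoted in the slides. This is an illustration under stated assumptions, not a costing from the presentation: it assumes the 35K CHF drive cost is spread evenly over its 3 years, it leaves out development effort entirely, and it ignores the open question of fat tape servers. The function name is hypothetical.

```python
DRIVE_CHF_3Y = 35_000       # per drive over 3 years, from the slides
DISK_SERVER_CHF = 8_000     # per disk server, from the slides
EXTRA_SLOTS_CHF = 500_000   # 3000 additional slots, from the slides


def hardware_cost_chf(drives, disk_servers=0, extra_slots=False):
    """One-year hardware cost, ASSUMING the drive cost is spread evenly
    over its 3-year life. Development effort is not included."""
    cost = drives * DRIVE_CHF_3Y / 3
    cost += disk_servers * DISK_SERVER_CHF
    if extra_slots:
        cost += EXTRA_SLOTS_CHF
    return cost


OPTIONS = {
    "A - bulk repack": hardware_cost_chf(22),
    "B - repack2/aul then clustering": hardware_cost_chf(33, disk_servers=33),
    "C - copy good cases to nl": hardware_cost_chf(25, extra_slots=True),
}

if __name__ == "__main__":
    for name, chf in OPTIONS.items():
        print(f"{name}: {chf / 1000:.0f}K CHF")
```

Because development effort is excluded, these numbers say nothing about the relative total cost of the options. They only show how the hardware lines on the slide combine.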

Points
• What tool for repack 2010/11?
  • Must repack all of the 50PB data in 2011 to new media
  • 10Gbit/s ethernet and drives at 160MBytes/s
  • Do we still need a low level tool anyway even if clustering can be used?
  • Can we avoid the repack2 restrictions on number of concurrent files being processed and submitted to the stager?
• What risk with new tape IL format?
  • Complete testing before EOY 2008
  • Nameserver/stager changes for block offset
• What risk with nl format?
  • If tapes are appended to, tape drive malfunction may overwrite data
  • Write to the tapes once only, scan and then commit to reduce nl risk
  • Test recovery program based on name server checksums
• What risk with new bulk tool?
  • How can we test it? Scanning tool is also required for validation
• What risk for clustering deliverable?
  • Architecture, will multiple user files per tape file be selected?
  • Additional hardware for disk layer / fat tape layer
  • Define experiment and legacy clusters
  • Schedule is critical for repack success .. Emergency orders for tape capacity

Points (contd)
• How many drives can we spare?
  • Need to get underway during low data recording periods
  • Further drive purchase? Use old drives for reading?
  • More drives means more load on the stager as queues are longer
• Can we reduce read mounting in the future by repack/clustering?
  • Use repack as a rebalancing tool by reading in several tapes and re-clustering
  • What is the access frequency for older LHC data?
  • Is the disk layer large enough to be able to effectively cluster on repack?
• What are the relative efforts?
  • Developing new clustering solutions? Needs to be done anyway but the repack requirements may bring time pressure
  • Investment to tune repack2 to get the necessary throughput and robustness will need to continue and will occupy substantial development resources
  • The low level tool would require scripting and a method to track outstanding work similar to that used for repack-1

Conclusion
• ?

Backup Slides

What is in an AUL label

What is in AUL / UHL 1?

Repack using 20 drives
• Full extended timeline showing aul, max25 to completion

Performance for large files

Tape-to-Tape repack?
[Diagram: CERN repack, Stager, Disk Server]
• Tape-to-tape copy rather than copying through the stager avoids network bottleneck
• Initial tests indicate that the tape writing overheads are larger for our typical files

Tests to scale repack 2
• 3 disk servers
  • 3 tape drives in
  • 3 tape drives out
  • File size of 2GB+
  • Elapsed 3h for 1500GB, 46MBytes/s
  • Around 60MBytes/s during steady state
• 6 disk servers
  • 3 tape drives in
  • 3 tape drives out
  • File size of 500MB+
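The 46MBytes/s figure can be checked with a short calculation. The sketch below assumes decimal units (1 GB = 1000 MB) and assumes the quoted rate is an average per output drive across the three drives. That per-drive reading is an interpretation, not something the slide states. The function name is hypothetical.

```python
def rate_mb_s(gigabytes, hours, streams=1):
    """Average MBytes/s per stream (decimal units: 1 GB = 1000 MB)."""
    return gigabytes * 1000 / (hours * 3600) / streams


if __name__ == "__main__":
    aggregate = rate_mb_s(1500, 3)
    per_drive = rate_mb_s(1500, 3, streams=3)
    print(f"aggregate: {aggregate:.1f} MB/s, per drive: {per_drive:.1f} MB/s")
```

With these assumptions the aggregate rate is about 139 MBytes/s, or about 46 MBytes/s per drive, which matches the figure on the slide.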

Tests to scale repack 2
• 3 disk servers
  • 1 tape drive in
  • 1 tape drive out
  • Reaches Gigabit ethernet wire speeds

c2public small files
• Migrated 400,000 files in 18 days
• Two drives
• Two disk servers
• Using a mixture of nl and aul tapes on IBM drives
• Corresponds to a file / drive every 8 seconds
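The "file / drive every 8 seconds" figure follows directly from the numbers above. A minimal check, assuming both drives were in use for the full 18 days (the function name is hypothetical):

```python
def seconds_per_file_per_drive(n_files, days, drives):
    """Drive-seconds spent per file, ASSUMING all drives ran the whole time."""
    drive_seconds = days * 24 * 3600 * drives
    return drive_seconds / n_files


if __name__ == "__main__":
    s = seconds_per_file_per_drive(400_000, 18, 2)
    print(f"{s:.2f} seconds per file per drive")
```

This gives about 7.8 seconds per file per drive, close to the 7.3 second aul overhead measured in the low level tests.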

File size and performance

Additional Information
• Repack Options
  • https://twiki.cern.ch/twiki/bin/view/FIOgroup/TapeBulkRepack
• Repack Performance Analysis
  • http://it-div-ds.web.cern.ch/it-div-ds/HO/repack_challenge.html
• Label Options
  • https://twiki.cern.ch/twiki/bin/view/FIOgroup/TapeLabelOptions
