CERN Data Services Update
HEPiX 2004 / NeSC Edinburgh
Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith
Outline
• Data Services Drivers
• Disk Service
  • Migration to Quattor / LEMON
  • Future directions
• Tape Service
  • Media migration
  • Future directions
• Grid Data Services
Data Flows
• Tier-0 / Tier-1 for the LHC
• Data Challenges:
  • CMS DC04 (finished) +80 TB; PCP05 (autumn) +170 TB
  • ALICE ongoing +137 TB
  • LHCb ramping up +40 TB
  • ATLAS ramping up +60 TB
• Fixed Target Programme:
  • NA48 at 80 MB/s +200 TB
  • COMPASS at 70 MB/s (peak 120) +625 TB
  • nToF at 45 MB/s +180 TB
  • NA60 at 15 MB/s +60 TB
  • Testbeams at 1-5 MB/s (x 5)
• Analysis…
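Summing the sustained fixed-target rates above gives a rough feel for the aggregate ingest load (an estimate derived from the slide's figures, not a number from the talk):

  80 + 70 + 45 + 15 + 5 x (1 to 5) ≈ 215 to 235 MB/s sustained, before LHC data challenges and analysis traffic.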
Disk Server Functions
Generations
• 0th: Jumbos
• 1st & 2nd: 4U
• 3rd & 4th: 8U
Warranties
Disk Servers: Jan 2004
• 370 EIDE disk servers
  • Commodity storage in a box
  • 544 TB of disk capacity
  • 6700 spinning disks
• Storage configuration
  • HW RAID-1 mirrored for "maximum reliability"
  • ext2 file systems
• Operating systems
  • RH 6.1, 6.2, 7.2, 7.3, RHES
  • 13 different kernels
• Application uniformity: CASTOR SW
Quattor-ising
• Motivation: scale
  • Uniformity; manageability; automation
• Configuration description (into CDB)
  • HW and SW; nodes and services
• Reinstallation
  • Production machines: minimal service interruption!
• Eliminate peculiarities from CASTOR nodes
  • MySQL, web servers
  • Refocus root control
• Quiescing a disk server ≠ draining a batch node!
• Gigabit card gymnastics
• (ext2 -> ext3)
• Complete (except 10 RH6 boxes for Objectivity)
LEMON-ising
• MSA everywhere
  • Linux box monitoring and alarms
  • Automatic HW static checks
• Adding
  • CASTOR server-specific service monitoring
• HW monitoring
  • lm_sensors (see tape section)
  • smartmontools
    • smartd deployment
    • Kernel issues; firmware bugs; through 3ware controller
    • smartctl auto checks; predictive monitoring (a sketch follows below)
  • IPMI investigations, especially remote access
    • Remote reset / power-on / power-off
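Not from the original deck: a minimal sketch of the kind of predictive check a smartmontools deployment enables, assuming smartctl is installed. The device paths and the reallocated-sector criterion are illustrative.

```python
import subprocess

# Hypothetical device list; a real deployment would enumerate the
# 3ware-attached disks (smartctl supports -d 3ware,N for that controller).
DISKS = ["/dev/sda", "/dev/sdb"]

def smart_health(device):
    """Return True if 'smartctl -H' reports overall health PASSED."""
    out = subprocess.run(["smartctl", "-H", device],
                         capture_output=True, text=True).stdout
    return "PASSED" in out

def reallocated_sectors(device):
    """Parse the Reallocated_Sector_Ct raw value from 'smartctl -A'."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])
    return None

for disk in DISKS:
    healthy = smart_health(disk)
    realloc = reallocated_sectors(disk)
    # A rising reallocated-sector count is a common early-failure signal.
    if not healthy or (realloc is not None and realloc > 0):
        print(f"ALARM: {disk} health={healthy} reallocated={realloc}")
```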
Disk Replacement
• Failure rate unacceptably high
  • Western Digital, type DUA; head instabilities
• 10 months to be believed
• 4 weeks to execute
• 1224 disks exchanged (out of 6700)
  • And the cages
Disk Storage Futures
• EIDE commodity storage in a box
  • Production systems: HW RAID-1 / ext3
  • Pilots (15 production systems): HW RAID-5 + SW RAID-0 / XFS (see Jan Iven's talk next; sketch below)
• New tenders out…
  • 30 TB SATA in a box
  • 30 TB external SATA disk arrays
• New CASTOR stager (see Olof's talk)
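Not from the slides: a minimal sketch of how such a pilot volume could be assembled, assuming mdadm and xfsprogs are available. Device names and the mount point are hypothetical.

```python
import subprocess

# Hypothetical devices: /dev/sda and /dev/sdb are assumed to be the two
# hardware RAID-5 arrays exported by the controller.
HW_RAID5_ARRAYS = ["/dev/sda", "/dev/sdb"]
MD_DEVICE = "/dev/md0"

# Stripe a software RAID-0 across the two hardware RAID-5 volumes.
subprocess.run(["mdadm", "--create", MD_DEVICE,
                "--level=0", "--raid-devices=2", *HW_RAID5_ARRAYS],
               check=True)

# Format with XFS and mount as the staging area (path illustrative).
subprocess.run(["mkfs.xfs", MD_DEVICE], check=True)
subprocess.run(["mount", MD_DEVICE, "/srv/castor"], check=True)
```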
Tape Service
• 70 tape servers (Linux)
• (Mostly) single FibreChannel-attached drives
• 2 symmetric robotic installations
  • 5 x STK 9310 silos in each
(Charts: Drives / Media)
Tape Server Temperatures
• lm_sensors package
  • General SMBus access and hardware monitoring
• Used to access:
  • LM87 chip: fan speeds, voltages, int/ext temperatures
  • ADM1023 chip: int/ext temperatures
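As an illustration (not from the deck): lm_sensors ships a `sensors` command whose `-u` flag prints raw "feature_input: value" lines for every detected chip, which makes the LM87/ADM1023 readings easy to scrape. The alarm threshold below is invented for the example.

```python
import subprocess

# Query lm_sensors in raw mode; '-u' emits machine-readable
# "feature_input: value" lines for every detected chip (LM87, ADM1023, ...).
raw = subprocess.run(["sensors", "-u"], capture_output=True, text=True).stdout

TEMP_ALARM_C = 45.0  # illustrative threshold, not from the talk

for line in raw.splitlines():
    line = line.strip()
    if line.startswith("temp") and "_input:" in line:
        name, value = line.split(":")
        celsius = float(value)
        if celsius > TEMP_ALARM_C:
            print(f"ALARM {name} = {celsius:.1f} C")
    elif line.startswith("fan") and "_input:" in line:
        name, value = line.split(":")
        print(f"{name} = {float(value):.0f} RPM")
```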
Media Migration
• To 9940B (mainly from 9940A)
  • 200 GB: extra capacity avoids unnecessary acquisitions
  • Better performance, though hard to benefit in normal chaotic mode
  • Reduced errors; fewer interventions
• 1-2% of A tapes cannot be read (or only extremely slowly) on B drives
• Have not been able to return all A-drives
Tape Service Developments
• Removing tails…
  • Tracking of all tape errors (18 months)
  • Retiring of problematic media
  • Proactive retiring of heavily used media (>5000 mounts)
  • Repack on new media
• Checksums (see the sketch below)
  • Populated when writing to tape
  • Verified when loading back to disk
  • 22% coverage already after a few weeks
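The slide does not name the checksum algorithm; CASTOR later used Adler-32, so the sketch below uses it purely to illustrate the populate-on-write / verify-on-read cycle. The file path is hypothetical.

```python
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Stream a file and return its Adler-32 checksum."""
    checksum = 1  # Adler-32 seed value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

# Populate: compute the checksum as the file goes to tape, and store it
# with the tape-segment metadata (storage step not shown).
stored = adler32_of_file("/data/run12345.raw")   # hypothetical file

# Verify: after recalling the file from tape to disk, recompute and compare.
recalled = adler32_of_file("/data/run12345.raw")
assert recalled == stored, "checksum mismatch: recalled copy is corrupt"
```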
Water-Cooled Tapes!
• Plumbing error!
• 5000 tapes disabled for a few days
  • 550 superficially wet
  • 152 seriously wet; visually inspected
Tape Storage Futures
• Commodity drive studies
  • LTO-2 (collaboratively with CASPUR/Valencia)
• Test and evaluate high-end drives
  • IBM 3592
  • STK NGD
• Other STK offerings
  • SL8500 robotics and silos
  • Indigo: managed storage, tape virtualisation
GRID Data Management
• GridFTP + SRM servers (former setup)
  • Standalone / experiment-dedicated
  • Hard to intervene; not scalable
• New load-balanced 6-node service
  • castorgrid.cern.ch
  • SRM modifications to operate behind the load balancer (a lookup sketch follows below)
• GridFTP standalone client
• Retire ftp and bbftp access to CASTOR
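The slides don't detail the balancing mechanism; CERN's load-balanced aliases are DNS-based, so a simple way to observe the member nodes behind the alias is to repeat the lookup. Port 2811 is the standard GridFTP control port; the repeat count is arbitrary.

```python
import socket

# Resolve the load-balanced alias from the slide; a DNS-based balancer
# typically returns a rotating subset of the 6 member nodes per lookup.
ALIAS = "castorgrid.cern.ch"

addresses = set()
for _ in range(20):  # repeat lookups to observe the rotation
    for info in socket.getaddrinfo(ALIAS, 2811, proto=socket.IPPROTO_TCP):
        addresses.add(info[4][0])

print(f"{ALIAS} currently maps to {len(addresses)} address(es):")
for addr in sorted(addresses):
    print(" ", addr)
```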