Managing managed storage: CERN Disk Server operations
HEPiX 2004 / BNL
Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith
Outline • What are our “Data Services”? • Disk server hardware @ CERN • Management tools • What’s next? Jan.van.Eldik@cern.ch FIO/DS
A lot of hardware • Disk storage • 350 “storage in a box” Linux diskservers • 6700 disks • 550 TeraBytes of raw disk space • Tape storage • 2 robotic installations, each with 5 STK 9310 silos • 50 9940B drives, 14000 tapes, 2.8 PB • 20 9840 drives, 8000 tapes, 160 TB
Many applications • 200 CASTOR! • 40 Oracle • 20 CDR • 10 AFS scratch • dCache, LHC@home, … • LCG, OpenLab, EGEE, data challenges • 40 in repair/spare A very heterogeneous environment! And very dynamic too
Players • Many teams involved: • Application owners / Users • Service managers • System administrators team • Suppliers • Software often not redundant… need to minimize downtime! • …so the hardware should be!
“Storage in a box” 13 different hardware configurations: • 8 – 26 IDE disks, hot-swappable trays • 2 – 4 3-Ware RAID controllers • 2 CPUs • 2 – 3 power supplies • GigE network card Should be redundant…
Hardware interventions • 55 interventions since Sep 1 • disk replacements (70%) • trays, cables, fans, PSUs • 33% involve (un)scheduled downtime • Older hardware harder to maintain • One supplier out of business • Incidents to spice up life…
Disk replacement [chart: % broken mirrors per month, Dec-03 – Sep-04, annotated “1224 disks replaced”, “Jumbo servers”, “Christmas”] • 10 months before case agreed: head instabilities • 4 weeks to execute • 1224 disks exchanged (=18%); and the cages as well
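The 18% figure can be cross-checked against the fleet size quoted earlier in the talk (6700 disks); a quick sanity check:

```python
# Cross-check of the slide's figures: 1224 exchanged disks
# against the ~6700-disk fleet quoted on the hardware slide.
exchanged = 1224
fleet = 6700
print(f"{exchanged / fleet:.0%} of all disks exchanged")  # → 18%
```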
65 Jumbos • 1 – 1.5 TB raw disk space • 3-Ware 6800 controllers • 600 MHz PIII • No PXE • Becoming hard to maintain • Many still under warranty • Make good mini-bars!
175 4U servers • 4U (5U) rack mounted • 1 – 1.5 TB • 2 * 3-Ware 7000 series; currently upgrading firmware • 2 * 1 GHz PIIIs • No PXE (yet) • Various maintenance issues
115 8U servers • 8U rack mounted • 2 – 2.5 TB • 3 – 4 * 3-Ware 7500(6)-8 • 2 * 2.4 GHz Xeon • Well controlled, well maintained, well behaved, after disk replacements
Diskserver evolution
That was then… • HW RAID1 • Ext2 filesystems, many of them • 13 different kernels! RedHat 6.1/6.2, 7.2/7.3, 2.1ES • Need for automation + standardization: ELFms tool suite • Quattor – installation + configuration • LEMON – performance + exception monitoring • LEAF – Hardware and State Management
…this is now • RedHat 7.3, preparing for SLC3 • Oracle: RHEL 2.1, preparing RHEL 3 (kernel has old 3-Ware driver) • HW RAID5 + hot spare disk • Up to 50% more usable space • On 3-Ware 7000 controller with up-to-date firmware • SW RAID0 + XFS • Improved performance expected (iozone benchmark) • Old XFS version • Improved kernel / elevator tuning
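The "up to 50% more usable space" claim follows from simple capacity arithmetic. A sketch, assuming an 8-disk server with 250 GB disks (the disk count and size are illustrative, not taken from the slides):

```python
def usable_raid1(disks: int, size_tb: float) -> float:
    # RAID1 mirrors disks in pairs: half of the raw capacity is usable
    return disks // 2 * size_tb

def usable_raid5_hot_spare(disks: int, size_tb: float) -> float:
    # One disk held back as hot spare; RAID5 over the rest loses one to parity
    return (disks - 2) * size_tb

r1 = usable_raid1(8, 0.25)            # 1.0 TB usable
r5 = usable_raid5_hot_spare(8, 0.25)  # 1.5 TB usable
print(f"RAID1: {r1} TB, RAID5 + spare: {r5} TB, gain: {r5 / r1 - 1:.0%}")
```

With more disks per array the relative gain grows further, since RAID5 always sacrifices one disk to parity regardless of array size, while RAID1 sacrifices half.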
Updating the toolbox • SMART – to predict disk failure; daily and weekly self-tests, on every disk • IPMI v1.5 • HW monitoring and event control • Power control, resets • Lm_sensors – temperature monitoring • Hardware and software specific • All data flows into the Lemon repository
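A sensor along these lines might extract a SMART attribute from smartctl output before pushing it into the Lemon repository. A minimal sketch, not the actual CERN sensor: the sample text mimics the `smartctl -A` attribute table, and the reallocated-sector threshold is an assumption chosen for illustration.

```python
# Hypothetical sketch of a SMART-based failure predictor: flag disks whose
# reallocated-sector count exceeds a threshold. SAMPLE imitates the
# attribute table printed by `smartctl -A`; column positions are assumed.

SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
194 Temperature_Celsius     0x0022   045   050   000    Old_age   Always       -       45
"""

def smart_attribute(report: str, name: str) -> int:
    """Return the raw value (last column) of a named SMART attribute."""
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == name:
            return int(fields[9])
    raise KeyError(name)

def disk_suspect(report: str, max_reallocated: int = 10) -> bool:
    # Growing reallocated-sector counts are a common early-failure signal
    return smart_attribute(report, "Reallocated_Sector_Ct") > max_reallocated

print(disk_suspect(SAMPLE))  # → True: 12 reallocated sectors exceeds the threshold
```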
Wintertime?
This is now • Quattorized + Lemonized • Rely on Operator and SysAdmin teams • Operated in same way as PC farms • Getting more out of suppliers: BIOS upgrade necessary for PXE enabling • BTW: most applies to tapeservers as well
What’s next? • New hardware • 360 TB “SATA in a box”, 2 different suppliers • 140 TB FC attached external SATA disk arrays • New software • SLC3, RHEL 3 • New CASTOR stager • New challenges • Oracle SAN setup • Alice data challenge
Conclusions A lot of work has been done to • Stabilize Hardware and Software • Automate + hand over basic operations • Integrate into standard workflows • Get more out of available hardware Achieved pro-active data management
Useful links “Standing on the shoulders of giants”
• Tim Smith, CHEP 2004: http://indico.cern.ch/contributionDisplay.py?contribId=374&sessionId=10&confId=0
• Helge Meinhard, CHEP 2004: http://indico.cern.ch/contributionDisplay.py?contribId=325&sessionId=10&confId=0
• Peter Kelemen, CERN IT “After C5”: http://cern.ch/Peter.Kelemen/talk/2004/C5/diskserver
• Jan Iven, HEPiX 2004 Edinburgh: http://hepwww.rl.ac.uk/hepix/nesc/iven.pdf