
Managing managed storage CERN Disk Server operations






Presentation Transcript


  1. Managing managed storage: CERN Disk Server operations HEPiX 2004 / BNL Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith

  2. Outline • Which are our “Data Services”? • Disk server hardware @ CERN • Management tools • What’s next? Jan.van.Eldik@cern.ch FIO/DS

  3. A lot of hardware • Disk storage • 350 “storage in a box” Linux diskservers • 6700 disks • 550 TeraBytes of raw disk space • Tape storage • 2 robotic installations, each with 5 STK 9310 silos • 50 9940B drives, 14000 tapes, 2.8 PB • 20 9840 drives, 8000 tapes, 160 TB Jan.van.Eldik@cern.ch FIO/DS

  4. Many applications • 200 CASTOR! • 40 Oracle • 20 CDR • 10 AFS scratch • dCache, LHC@home, … • LCG, OpenLab, EGEE, data challenges • 40 in repair/spare A very heterogeneous environment! And very dynamic too Jan.van.Eldik@cern.ch FIO/DS

  5. Players • Many teams involved: • Application owners / Users • Service managers • System administrators team • Suppliers • Software often not redundant… need to minimize downtime! • …so the hardware should be! Jan.van.Eldik@cern.ch FIO/DS

  6. “Storage in a box” 13 different hardware configurations: • 8 – 26 IDE disks, hot-swappable trays • 2 – 4 3-Ware RAID controllers • 2 CPUs • 2 – 3 power supplies • GigE network card Should be redundant… Jan.van.Eldik@cern.ch FIO/DS

  7. Hardware interventions • 55 interventions since Sep 1 • disk replacements (70%) • trays, cables, fans, PSUs • 33% involve (un)scheduled downtime • Older hardware harder to maintain • One supplier out of business • Incidents to spice up life… Jan.van.Eldik@cern.ch FIO/DS

  8. Disk replacement [Chart: % broken mirrors per month, Dec-03 through Sep-04, peaking around Christmas; Jumbo servers called out] • 10 months before the case was agreed: head instabilities • 4 weeks to execute • 1224 disks exchanged (= 18% of the fleet); and the cages as well Jan.van.Eldik@cern.ch FIO/DS
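The replacement figures on this slide are internally consistent with the fleet size quoted on slide 3; a quick sanity check (a sketch, not from the talk itself):

```python
# 1224 disks were exchanged, stated to be 18% of the fleet;
# slide 3 quotes roughly 6700 disks in total.
replaced = 1224
stated_fraction = 0.18

implied_fleet = replaced / stated_fraction
print(round(implied_fleet))  # 6800, consistent with the ~6700 disks on slide 3
```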

  9. 65 Jumbos • 1 – 1.5 TB raw disk space • 3-Ware 6800 controllers • 600 MHz PIII • No PXE • Becoming hard to maintain • Many still under warranty • Make good mini-bars! Jan.van.Eldik@cern.ch FIO/DS

  10. 175 4U servers • 4U (5U) rack mounted • 1 – 1.5 TB • 2 * 3-Ware 7000 series; currently upgrading firmware • 2 * 1 GHz PIIIs • No PXE (yet) • Various maintenance issues Jan.van.Eldik@cern.ch FIO/DS

  11. 115 8U servers • 8U rack mounted • 2 – 2.5 TB • 3 – 4 * 3-Ware 7500(6)-8 • 2 * 2.4 GHz Xeon • Well controlled, well maintained, well behaved, after disk replacements Jan.van.Eldik@cern.ch FIO/DS

  12. Diskserver evolution Jan.van.Eldik@cern.ch FIO/DS

  13. That was then… • HW RAID1 • Ext2 filesystems, many of them • 13 different kernels! RedHat 6.1/6.2, 7.2/7.3, 2.1ES • Need for automation + standardization: the ELFms tool suite • Quattor – installation + configuration • LEMON – performance + exception monitoring • LEAF – Hardware and State Management Jan.van.Eldik@cern.ch FIO/DS

  14. …this is now • RedHat 7.3, preparing for SLC3 • Oracle: RHEL 2.1, preparing RHEL 3 (kernel has an old 3-Ware driver) • HW RAID5 + hot spare disk • Up to 50% more usable space • On 3-Ware 7000 controllers with up-to-date firmware • SW RAID0 + XFS • Improved performance expected (iozone benchmark) • Old XFS version • Improved kernel / elevator tuning Jan.van.Eldik@cern.ch FIO/DS
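The "up to 50% more usable space" claim follows from simple RAID arithmetic. A minimal sketch, assuming an illustrative 24-disk server split over three controllers (the actual per-server disk counts on slide 6 range from 8 to 26):

```python
def usable_raid1(disks):
    """HW RAID1: every disk is mirrored, so half the disks hold user data."""
    return disks // 2

def usable_raid5_hotspare(disks):
    """One RAID5 array per controller: 1 disk of parity plus 1 hot spare."""
    return disks - 2

# Hypothetical 24-disk server: 3 controllers driving 8 disks each.
arrays = [8, 8, 8]
r1 = sum(usable_raid1(a) for a in arrays)           # 12 disks usable
r5 = sum(usable_raid5_hotspare(a) for a in arrays)  # 18 disks usable
print(f"{100 * (r5 - r1) / r1:.0f}% more usable space")  # 50% more
```

The gain grows with array width, which is why the slide hedges with "up to": wider RAID5 arrays pay the one-disk parity cost over more disks.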

  15. Updating the toolbox • SMART – to predict disk failure; daily and weekly self-tests, on every disk • IPMI v1.5 • HW monitoring and event control • Power control, resets • Lm_sensors – temperature monitoring • Hardware and software specific • All data flows into the Lemon repository Jan.van.Eldik@cern.ch FIO/DS
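A minimal sketch of the kind of SMART check that such monitoring feeds into a repository like Lemon: flag any attribute whose normalised VALUE has dropped to its failure THRESH. The sample `smartctl -A` output and the function name are illustrative assumptions, not taken from the talk:

```python
# Illustrative smartctl -A attribute table (not from a real CERN disk server).
SAMPLE_SMARTCTL = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   030   030   036    Pre-fail  Always       -       1224
194 Temperature_Celsius     0x0022   045   050   000    Old_age   Always       -       45
"""

def failing_attributes(smartctl_output):
    """Return names of SMART attributes at or below their failure threshold."""
    failing = []
    for line in smartctl_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        if thresh > 0 and value <= thresh:  # THRESH 0 means "informational only"
            failing.append(name)
    return failing

print(failing_attributes(SAMPLE_SMARTCTL))  # ['Reallocated_Sector_Ct']
```

In practice the self-tests mentioned on the slide would be scheduled (e.g. via smartd) and only the resulting alarms forwarded to the central monitoring.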

  16. Wintertime? Jan.van.Eldik@cern.ch FIO/DS

  17. This is now • Quattorized + Lemonized • Rely on Operator and SysAdmin teams • Operated in same way as PC farms • Getting more out of suppliers: BIOS upgrade necessary for PXE enabling BTW: most applies to tapeservers as well Jan.van.Eldik@cern.ch FIO/DS

  18. What’s next? • New hardware • 360 TB “SATA in a box”, 2 different suppliers • 140 TB FC attached external SATA disk arrays • New software • SLC3, RHEL 3 • New CASTOR stager • New challenges • Oracle SAN setup • Alice data challenge Jan.van.Eldik@cern.ch FIO/DS

  19. Conclusions A lot of work has been done to • Stabilize Hardware and Software • Automate + hand over basic operations • Integrate into standard work flows • Get more out of available hardware Achieved pro-active data management Jan.van.Eldik@cern.ch FIO/DS

  20. Useful links “Standing on the shoulders of giants” • Tim Smith, CHEP 2004: http://indico.cern.ch/contributionDisplay.py?contribId=374&sessionId=10&confId=0 • Helge Meinhard, CHEP 2004: http://indico.cern.ch/contributionDisplay.py?contribId=325&sessionId=10&confId=0 • Peter Kelemen, CERN IT “After C5”: http://cern.ch/Peter.Kelemen/talk/2004/C5/diskserver • Jan Iven, HEPiX 2004 Edinburgh: http://hepwww.rl.ac.uk/hepix/nesc/iven.pdf Jan.van.Eldik@cern.ch FIO/DS
