Measurements of Hardware Reliability in the Fermilab Farms HEPiX/HEPNT, Oct 25, 2002

Measurements of Hardware Reliability in the Fermilab Farms
HEPiX/HEPNT, Oct 25, 2002
S. Timm
Fermilab Computing Division, Operating Systems Support Dept, Scientific Computing Support Group
timm@fnal.gov

Introduction
• Four groups of Linux nodes have made it through a three-year life cycle (186 machines).
• All from commodity “white box” vendors.
• Our goal: to measure the hardware failure rate and calculate total cost of ownership.

Burn-in and Service
• All nodes are given a 30-day burn-in:
  • CPU test with seti@home
  • Disk test with bonnie
  • Network test with nettest
• Failures during the burn-in period are the vendor’s problem to fix (parts and labor).
• After the burn-in period there is a 3-year warranty on parts; Fermilab covers the labor through on-site service provider Decision One.
• Lemon law: any node down for 5 straight days or 5 separate instances must be completely replaced.
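The lemon-law rule can be written as a small predicate. This is only a sketch: the function name and the record format (a list of outage durations in days, one entry per separate downtime instance) are assumptions for illustration, not something the slides specify.

```python
def lemon_law_applies(outage_days, max_straight_days=5, max_instances=5):
    """Return True if a node qualifies for complete replacement.

    outage_days: outage durations in days, one entry per separate
    downtime instance (hypothetical record format).
    """
    # Rule 1: a single outage lasting 5 straight days or more.
    long_outage = any(days >= max_straight_days for days in outage_days)
    # Rule 2: 5 or more separate downtime instances.
    too_many_instances = len(outage_days) >= max_instances
    return long_outage or too_many_instances


print(lemon_law_applies([1, 2, 1]))        # three short outages
print(lemon_law_applies([6]))              # one outage of 6 straight days
print(lemon_law_applies([1, 1, 1, 1, 1]))  # five separate instances
```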

Definition of Hardware Fault
• Failure of hardware that makes the machine unusable.
• Hardware changed out during the burn-in period doesn’t count.
• Fan replacements (routine maintenance) don’t count.
• Sometimes we replaced a disk and it didn’t solve the problem; that is counted.
• Multiple service calls in the same incident count as a single hardware fault.
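The counting rules above could be applied to a service-call log roughly as follows. The `ServiceCall` record and its fields are hypothetical; the slides do not describe how calls were actually logged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceCall:
    node: str             # hypothetical node name
    incident: int         # calls sharing (node, incident) are one incident
    part: str             # e.g. "system disk", "fan"
    during_burn_in: bool


def count_hardware_faults(calls):
    """Count faults: skip burn-in and fan calls, merge calls per incident."""
    incidents = set()
    for call in calls:
        if call.during_burn_in:
            continue  # burn-in swap-outs don't count
        if call.part == "fan":
            continue  # routine maintenance doesn't count
        incidents.add((call.node, call.incident))
    return len(incidents)


calls = [
    ServiceCall("node01", 1, "system disk", False),
    ServiceCall("node01", 1, "motherboard", False),  # same incident, disk swap didn't fix it
    ServiceCall("node02", 1, "fan", False),
    ServiceCall("node03", 1, "power supply", True),
    ServiceCall("node04", 1, "memory", False),
]
print(count_hardware_faults(calls))  # 2
```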

Infant Mortality
• The routine hardware call counts don’t include swap-outs during the burn-in period.
• We expect and are prepared for initial quality problems.
• During install and burn-in, we have demanded and received total swap-outs of:
  • Motherboards (2 different times)
  • Cases (once)
  • Racks (once)
  • Power supplies (twice)
  • System disks (twice in the same group of nodes)

IDE/DMA errors
• ServerWorks LE chipset had a broken IDE implementation.
• Observed in the following Pentium III boards: Tyan 2510, 2518, Intel STL2, SCB2, Supermicro 370DLE, ASUS CUR-DLS; basically anything for sale in 2001 (Tyan 2518 the best of a bad lot).
• Hardware fault observed in both Windows and Linux and with a hardware logic analyzer: the chipset thought DMA was still on even though the drive had finished its transfer.
• System most sensitive when trying to write to the system disk and swap at the same time.

IDE/DMA errors, cont’d
• Behavior varied by disk drive:
  • Seagate drives: file corruption
  • Western Digital drives: occasional system hangs
  • IBM drives: OK (up to the 2.4.9 kernel)
• Vendor did 2 complete system disk swaps, first WD, then IBM.
• Problem reappears with the 2.4.18 kernel, whose new “feature” shuts down the drive and halts the machine if one of these errors occurs.
• Most IDE/DMA errors are not counted in the error summary below.

CPU Power: Fermi Cycles
• CPU clock speed numbers are not consistent between Intel PIII, Intel Xeon (P4), and AMD Athlon MP.
• SPEC CPU2000 numbers don’t exist far enough back for historical comparison.
• We define PIII 1 GHz = 1000 Fermi Cycles.
• The compilers that Fermilab is tied to can’t deliver the full performance promised by SPEC CPU2000 numbers; the AMD MP1800+ is faster than the Xeon 2.0 GHz.
• Performance is measured by the real performance of our applications on the systems.
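One plausible way to turn application timings into Fermi Cycles is to scale against the 1 GHz PIII reference. The slides give only the definition (PIII 1 GHz = 1000) and say that real application performance is used; the inverse-runtime scaling and the timings below are assumptions for illustration.

```python
REFERENCE_FERMI_CYCLES = 1000  # from the slide: PIII 1 GHz = 1000 Fermi Cycles


def fermi_cycles(reference_seconds, candidate_seconds):
    """Rate a machine by how fast it runs the same job as the reference PIII.

    Assumes the rating scales inversely with application run time.
    """
    return REFERENCE_FERMI_CYCLES * reference_seconds / candidate_seconds


# Made-up timings: the job takes 600 s on the reference and 300 s on the candidate.
print(fermi_cycles(600.0, 300.0))  # 2000.0
```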

Farms Buying History [slide content not preserved in transcript]

First Linux farm
• 36 nodes, ran from 1998 to 2001.
• 32 hardware failures: 25 system disks, six power supplies, one memory.
• These nodes had only one disk, used for system, staging, swap, everything, and swapped heavily due to low memory.
• Failures correlated with power outages.
• Rate: 0.024 failures/machine-month.

Mini-tower farms, 1999
• 150 nodes, organized into 3 farms of 50: CDF, D0, Fixed Target.
• Bought Sep 1999, just out of warranty now in Sep 2002.
• 140 nodes still in the farm; statistics are based on them.
• 3 disks in each: one system and 2 data.

Mini-tower farms, cont’d.
• Fixed Target: 50 nodes, only 5 service calls over 3 years.
  • 1 memory problem, 1 bad data disk, 3 bad motherboards (one caused by a failed BIOS upgrade).
• CDF: 50 nodes, 19 service calls over 3 years.
  • 5 system disk, 2 power supply, 9 data disk, 2 motherboard, 1 CPU.
• D0: 40 nodes, 18 service calls over 3 years.
  • 9 system disk, 2 power supply, 3 data disk, 3 motherboard, 1 network card.


Analysis
• Four different failure rates:
  • Old farm: 0.024 failures/machine-month
  • FT farm: 0.0028 +/- 0.0012 failures/machine-month
  • CDF: 0.0083 +/- 0.0021 failures/machine-month
  • D0: 0.0130 +/- 0.0044 failures/machine-month
• Statistical analysis shows the distributions are not consistent with each other, and also not Poisson.
• CDF and D0 are identical hardware in the same computer room.
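The Fixed Target figure can be reproduced under a simple counting assumption: rate = failures / machine-months, with a square-root-of-N uncertainty on the failure count. The slides do not state how the uncertainties were obtained, so the error formula is an assumption, as is the 36-month exposure (Sep 1999 to Sep 2002). The CDF and D0 figures may rest on fault counts that differ from the service-call counts listed earlier, so they are not reproduced here.

```python
import math


def failure_rate(failures, machines, months):
    """Return (rate, sigma) in failures per machine-month.

    sigma uses a sqrt(N) counting error, which is an assumption.
    """
    exposure = machines * months          # machine-months
    rate = failures / exposure
    sigma = math.sqrt(failures) / exposure
    return rate, sigma


# Fixed Target farm: 5 service calls, 50 nodes, 36 months assumed.
rate, sigma = failure_rate(5, 50, 36)
print(f"{rate:.4f} +/- {sigma:.4f}")  # 0.0028 +/- 0.0012
```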

Analysis continued
• Failure rate could depend on any of the following:
  • Frequency of use (D0 farm typically loaded > 98%, others less)
  • Vigilance of system administrators in finding and addressing hardware errors
  • Phase of moon
  • Dependability of hardware
  • Cooling efficiency

Residual value
• Latest farm purchase got us 2 Fermi cycles per dollar.
• Residual value of the 140 nodes bought in 1999 is $70K; they could be replaced with 40 of the nodes we are buying today.
• Cost of electricity = 180 W * 150 machines * 26280 hrs * $0.047/kWh = $33.3K
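The electricity estimate is straightforward to check, and the residual-value bullet implies per-node ratings that can be derived the same way. The per-node Fermi Cycle figures below are derived here, assuming the $70K residual value is priced at the latest rate of 2 Fermi cycles per dollar; the slides do not state them directly.

```python
WATTS_PER_NODE = 180
NODES = 150
HOURS = 26280             # 3 years x 8760 hours
DOLLARS_PER_KWH = 0.047

energy_kwh = WATTS_PER_NODE * NODES * HOURS / 1000
electricity_cost = energy_kwh * DOLLARS_PER_KWH
print(f"{energy_kwh:.0f} kWh, ${electricity_cost:,.0f}")  # 709560 kWh, $33,349

# Residual value, assuming it is priced at 2 Fermi cycles per dollar.
residual_cycles = 70_000 * 2                 # total Fermi cycles in the 1999 farms
cycles_per_old_node = residual_cycles / 140  # nodes bought in 1999
cycles_per_new_node = residual_cycles / 40   # nodes being bought today
print(cycles_per_old_node, cycles_per_new_node)  # 1000.0 3500.0
```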

Total Cost of Ownership
• Depreciation: $339K
• Maintenance: $20K (estimate)
• Electricity: $33K (estimate)
• Memory upgrades: $23K
• Total: $415K
• Personnel: 2 FTE * 3 years; how much?
• (doesn’t count developer time, user time)
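A short sketch that re-adds the listed items and shows each item's share of the total. The per-node-per-year figure is a derived illustration that assumes all 150 purchased nodes over the 3-year period; personnel cost is left out because the slide leaves it open.

```python
# Cost items from the slide, in thousands of dollars.
costs_k = {
    "Depreciation": 339,
    "Maintenance": 20,
    "Electricity": 33,
    "Memory upgrades": 23,
}

total_k = sum(costs_k.values())
for item, cost in costs_k.items():
    print(f"{item:16s} ${cost:4d}K  {cost / total_k:6.1%}")
print(f"{'Total':16s} ${total_k:4d}K")

# Assumption: 150 nodes over 3 years, personnel excluded.
per_node_year = total_k * 1000 / (150 * 3)
print(f"${per_node_year:.0f} per node per year")
```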

Lessons Learned
• Hitech has been out of business for more than a year.
• Decision One was still able to get replacement parts from component vendors, at least for processors and disk drives.
• Decision One identified a replacement motherboard, since the initial one isn’t manufactured anymore.
• Conclusion: we can survive if a vendor doesn’t stay in business for the length of the 3-year warranty.

Cost forecast for 2U units
• Maintenance costs will be higher:
  • We have already racked up $10K of maintenance in 1.5 years of deployment on 64 CDF nodes, for example.
  • Dominated by memory upgrades and disk swaps.

2U Intel boards
• 50 2U nodes, D0, bought Sep ’00.
  • 9 power supplies replaced during burn-in.
  • Since then: 1 system disk, 2 power supplies, 6 memory, 4 data disk, 6 motherboard, 1 network card.
  • Four nodes have been to the shop > 3 times.
  • 0.016 failures/machine-month
• 23 nodes for CDF bought Jan ’01.
  • 1 system disk, 11 power supplies, 1 data disk, 1 network card so far.
  • 0.031 failures/machine-month

2U Supermicro boards
• 64 nodes for CDF bought Jun ’01.
  • 10 system disks, 2 data disks, 3 motherboards, 1 floppy, 2 batteries.
  • (not to mention total swap of system disks twice)
  • 0.010 failures/machine-month
• 40 nodes for FT bought Jun ’01.
  • Only 1 problem so far, memory.
  • 0.002 failures/machine-month
• Identical hardware in the 2 groups, but the failure rate differs by a factor of five!

2U Tyan boards
• 32 bought for D0, arrived Dec 28, 2001 (after being sent back for new motherboards and cases).
  • 3 hardware calls so far, all system disks.
  • 0.003 failures/machine-month
• 16 bought for KTeV, arrived March ’02.
  • 1 hardware call so far, data disk.
  • 0.009 failures/machine-month
• 32 bought for CDF, arrived April ’02.
  • 2 hardware calls so far: system disk, CPU.

SUMMARY [slide content not preserved in transcript]

Hardware errors by type [slide content not preserved in transcript]

Conclusions thus far
• We now format a disk and check for bad blocks before placing a service call to replace it; this can often rescue a disk.
• At the moment, software-related hangs are a much greater problem than hardware errors and more time-consuming to diagnose.
• With 750 machines and 0.01 failures/machine-month we can expect 8 hardware failures/month.
• GRAND TOTAL: 10692 machine-months so far, 0.0122 failures per machine-month.
• Machines currently running are averaging 0.0105 failures per machine-month.
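The arithmetic behind the monthly forecast and the grand total can be checked in a few lines. The implied total failure count is derived here from the quoted rate and exposure; the slides do not state it.

```python
machines = 750
rate_per_machine_month = 0.01

# Expected hardware failures per month across the whole farm.
expected_per_month = machines * rate_per_machine_month
print(expected_per_month)  # 7.5, which the slide rounds up to 8

# Failure count implied by the grand total (derived, not quoted).
machine_months = 10692
overall_rate = 0.0122
implied_failures = machine_months * overall_rate
print(round(implied_failures))  # 130
```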

Cluster errors over time [slide content not preserved in transcript]
