Failure Trends in a Large Disk Drive Population • Authors: Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso • Presented by Vinuthna & Arjun
Motivation • 90% of all new information is stored on magnetic media • Most of that data is stored on hard disk drives (HDDs) • Study failure patterns and the key factors that affect drive lifetime • Analyze the correlation between failures and the parameters believed to affect HDD lifetime • Why? To enable better design and maintenance of storage systems
Previous studies • Mostly accelerated aging experiments, which are poor predictors of failures in the field • Moderate population sizes • Statistics drawn from returned units in warranty databases • No insight into what actually happened to a drive during operation
Our study • Large study examining hard drives in Google’s infrastructure: more than 100,000 disk drives • The population is large, and the study examines it in depth and detail from an end user's point of view • Why? Manufacturers say the failure rate is below 2%, but end users experience much higher failure rates • Some studies report that for 20-30% of drives that fail in the field, the manufacturer finds no problem
SYSTEM HEALTH INFRASTRUCTURE • Collection layer: collects data from each server and dumps it into a repository • Storage is built on Bigtable, which is built on GFS; data lives in 2D cells with a third dimension for time versions • The database holds a complete history of environment, error, configuration, and repair events • A lightweight daemon runs on each machine and reports information to the collectors • Large-scale analysis is done with MapReduce • The framework handles the distributed computation, so the user focuses on the analysis algorithm
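The storage and analysis layers described on this slide can be illustrated with a small in-memory sketch. It is not Bigtable or MapReduce: `VersionedTable`, `map_reduce`, and the column name `temperature_c` are hypothetical names chosen for illustration. The sketch only shows the idea of 2D cells with timestamped versions and an analysis written as a map step plus a reduce step.

```python
from collections import defaultdict


class VersionedTable:
    """Toy stand-in for a table of (row, column) cells with time versions."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept sorted by timestamp
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0])

    def latest(self, row, column):
        versions = self._cells.get((row, column))
        return versions[-1][1] if versions else None

    def history(self, row, column):
        return list(self._cells.get((row, column), []))

    def rows(self):
        return sorted({row for row, _ in self._cells})


def map_reduce(records, mapper, reducer):
    """Single-process sketch: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical data: one row per drive, temperature readings at times 1 and 2.
table = VersionedTable()
table.put("drive-001", "temperature_c", 1, 31)
table.put("drive-001", "temperature_c", 2, 35)
table.put("drive-002", "temperature_c", 1, 42)


def temperature_bucket(row):
    # Emit (10-degree bucket, 1) for the latest reading of each drive.
    reading = table.latest(row, "temperature_c")
    yield (reading // 10 * 10, 1)


drives_per_bucket = map_reduce(
    table.rows(), temperature_bucket, lambda key, values: sum(values)
)
```

The user-written parts are only `temperature_bucket` and the summing reducer; grouping and iteration are handled by the helper, which is the division of labor the slide describes.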
Some other info • Data collected over nine months • Mix of HDDs of different ages, manufacturers, and models • Failure information mined from repair databases going back up to 5 years • We monitor temperature, activity levels, and SMART parameters • Results are not affected by the population mix
Results • Utilization • Previous notion – high duty cycles affect disk drives negatively
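Utilization vs. AFR (annualized failure rate) • "More utilization, more failures" holds only for the infant mortality stage and the end-of-life stage • After the first year, the AFR of high-utilization drives is only moderately higher than that of low-utilization drives • How is this possible? One explanation is survival of the fittest: drives that survive infant mortality are the robust ones. Another is that the previously reported correlation was based on accelerated life tests, which mainly model early-life failures; the same early-life trend is seen here • Conclusion: utilization has a much weaker correlation with failure than previously assumed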
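The slides do not define how AFR is computed. A minimal sketch, assuming the common definition of failures per drive-year of exposure expressed as a percentage, is shown here. The function names, the record layout, and the sample numbers are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def annualized_failure_rate(failures, drive_days):
    """AFR in percent, assuming AFR = failures / drive-years of exposure."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


def afr_by_group(records):
    """records: iterable of (group, failed, days_observed), one per drive."""
    totals = defaultdict(lambda: [0, 0.0])  # group -> [failures, drive-days]
    for group, failed, days_observed in records:
        totals[group][0] += 1 if failed else 0
        totals[group][1] += days_observed
    return {
        group: annualized_failure_rate(failures, days)
        for group, (failures, days) in totals.items()
    }


# Made-up example: 20 failures over 365,000 drive-days (1,000 drive-years).
example_afr = annualized_failure_rate(20, 365_000)
```

Grouping drives by utilization level and age before calling `afr_by_group` gives the kind of per-bucket comparison the slide summarizes.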
Temperature • Previous belief: a temperature change of 15 °C can double the failure rate • PDF: failures do not increase with temperature; in fact, lower temperatures may be associated with higher failure rates • Age vs. AFR: flat failure rate for mid-range temperatures, modest increase for low temperatures • High temperature is not associated with a high failure rate, except for older drives • Conclusion: within a moderate temperature range, temperature is not a strong factor in the failure rate
SMART Data Analysis • Some signals more relevant to disk failures • Parameters • Scan errors • Reallocation counts • Offline Reallocations • Probational counts • Miscellaneous signals
Scan errors • Errors reported when drives scan the disk surface in the background • Indicative of surface defects • Consistent impact on AFR • After their first scan error, drives are 39 times more likely to fail than drives with no scan errors
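A statement such as "N times more likely to fail" compares the failure probability of two groups of drives. A minimal sketch of that ratio follows; the function name and the counts in the example are made up for illustration and are not the study's data.

```python
def relative_failure_likelihood(failed_with, total_with, failed_without, total_without):
    """Ratio of failure probability: drives with the signal vs. drives without it."""
    if total_with <= 0 or total_without <= 0:
        raise ValueError("both groups must contain at least one drive")
    p_with = failed_with / total_with
    p_without = failed_without / total_without
    if p_without == 0:
        raise ValueError("no failures in the comparison group; ratio is undefined")
    return p_with / p_without


# Made-up counts: 30 of 200 drives with the signal failed,
# 10 of 1,000 drives without it failed.
ratio = relative_failure_likelihood(30, 200, 10, 1000)
```

With these hypothetical counts the failure probabilities are 15% and 1%, so the ratio is roughly 15.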
Reallocation Counts • Represent the number of times a faulty sector has been remapped to a new physical sector • Consistent impact on AFR • Drives with reallocations are 14 times more likely to fail than drives without
Offline reallocations • A subset of reallocation counts • Sectors reallocated after being found faulty during background scrubbing • Survival probability is worse than for total reallocations • Drives with offline reallocations are 21 times more likely to fail than drives without
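Probational counts • Sectors are put on ‘probation’ until they either fail permanently or continue to work without problems • Drives with probational counts are 16 times more likely to fail than drives without • The critical threshold is a count of 1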
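A threshold of 1 means a drive is treated as at risk as soon as the counter becomes non-zero. The sketch below applies that rule to the four SMART counters covered on these slides and measures what fraction of failed drives the rule would have flagged. The field names, the dictionary layout, and the sample drives are hypothetical.

```python
# Hypothetical field names for the four counters discussed on the slides.
CRITICAL_SIGNALS = (
    "scan_errors",
    "reallocations",
    "offline_reallocations",
    "probational_counts",
)


def is_at_risk(smart, threshold=1):
    """Flag a drive when any critical counter reaches the threshold."""
    return any(smart.get(name, 0) >= threshold for name in CRITICAL_SIGNALS)


def flagged_fraction_of_failures(drives):
    """Fraction of failed drives that the threshold rule flagged."""
    failed = [drive for drive in drives if drive["failed"]]
    if not failed:
        return 0.0
    flagged = sum(1 for drive in failed if is_at_risk(drive["smart"]))
    return flagged / len(failed)


# Made-up sample: two failed drives, only one of which showed any signal.
sample_drives = [
    {"failed": True, "smart": {"scan_errors": 2}},
    {"failed": True, "smart": {}},
    {"failed": False, "smart": {"reallocations": 1}},
    {"failed": False, "smart": {}},
]
coverage = flagged_fraction_of_failures(sample_drives)
```

A rule like this can only catch failures that were preceded by a non-zero counter, so drives that fail with clean counters are missed regardless of the threshold.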
Miscellaneous signals • Seek errors • CRC errors • Power cycles • Calibration retries • Spin retries • Power-on hours • Vibration
Conclusion • Larger population size than in previous studies • No consistent pattern of higher failure rates at high temperatures or high utilization levels • Some SMART parameters are well correlated with failure probability • Prediction models based only on SMART parameters are limited in accuracy