Fingerprinting the Datacenter • Marcel Flores • Shih-Chi Chen
Motivation • Large datacenters often encounter large and complex crises • Crises take the form of performance dipping below SLAs • They are often complex and difficult to diagnose • They can be costly to operators
Approach • Quantify the state of the datacenter in a compact manner • The compact state can be compared to past crises • Allows for easy identification and diagnosis of crises
Fingerprints • Tracks quantiles for each metric • Determines hot/normal/cold status for each metric • Includes only relevant metrics • Uses a similarity metric for comparison
Fingerprint - details • Track quantiles of each metric • Quantiles are resistant to outliers • Measure the 25%, 50%, and 95% quantiles • Determine whether each measurement is Hot (>98th percentile), Cold (<2nd percentile), or Normal
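The quantile tracking and hot/cold labelling above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the linear-interpolation quantile, the `+1/0/-1` encoding, and the layout of `history` (past values of each tracked quantile, per metric) are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

QUANTILES = (0.25, 0.50, 0.95)   # quantiles named on the slide
HOT, COLD = 0.98, 0.02           # hot/cold percentiles named on the slide


def quantile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty sequence, 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(per_server: Sequence[float]) -> List[float]:
    """Collapse one metric's per-server readings into the tracked quantiles."""
    return [quantile(per_server, q) for q in QUANTILES]


def discretize(value: float, history: Sequence[float]) -> int:
    """Return +1 (hot), -1 (cold) or 0 (normal) relative to past values."""
    if value > quantile(history, HOT):
        return 1
    if value < quantile(history, COLD):
        return -1
    return 0


def fingerprint_epoch(readings: Dict[str, Sequence[float]],
                      history: Dict[str, Sequence[Sequence[float]]]) -> List[int]:
    """Build one epoch's vector: three hot/normal/cold entries per metric.

    history[metric][i] holds past values of the i-th tracked quantile
    (an assumed layout for this sketch).
    """
    vector: List[int] = []
    for metric in sorted(readings):
        summary = summarize(readings[metric])
        for i, value in enumerate(summary):
            vector.append(discretize(value, history[metric][i]))
    return vector
```

Summarizing each metric by quantiles across servers keeps the vector size independent of the number of machines, which is what makes the representation compact.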
Relevant Metrics • Select metrics via feature selection and classification • Technique from statistical machine learning • Eliminates noise from the fingerprints
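The slide does not say which feature-selection technique is used, so the sketch below swaps in a deliberately simple stand-in: it ranks each metric by the standardized gap between its crisis-time and normal-time means and keeps the top `k`. The names `relevance` and `select_relevant`, the scoring rule, and the input layout are all hypothetical, chosen only to show what "keeping the relevant metrics" means in practice.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def relevance(crisis: Sequence[float], normal: Sequence[float]) -> float:
    """Standardized gap between crisis-time and normal-time means.

    A univariate stand-in score, not the classifier-based selection
    mentioned on the slide.
    """
    spread = pstdev(list(crisis) + list(normal))
    if spread == 0:
        return 0.0  # a constant metric carries no signal
    return abs(mean(crisis) - mean(normal)) / spread


def select_relevant(crisis_data: Dict[str, Sequence[float]],
                    normal_data: Dict[str, Sequence[float]],
                    k: int) -> List[str]:
    """Return the k metric names with the highest relevance score."""
    scores = {m: relevance(crisis_data[m], normal_data[m]) for m in crisis_data}
    # Sort by descending score, breaking ties by name for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))[:k]
```

Metrics that behave the same in and out of a crisis score near zero and drop out, which is the noise-elimination effect the slide describes.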
Identification • Define a similarity metric • Allows comparison between current state fingerprint and known crisis fingerprints • Identification Threshold determines when two fingerprints are considered the same
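A minimal sketch of the identification step follows. The slide does not name the similarity metric, so Euclidean distance is assumed here; the function names, the crisis labels in the test data, and the convention of returning `None` for an unknown crisis are also illustrative assumptions.

```python
import math
from typing import Dict, Optional, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two fingerprints (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(current: Sequence[float],
             known: Dict[str, Sequence[float]],
             threshold: float) -> Optional[str]:
    """Return the label of the nearest known crisis, or None if unknown.

    Two fingerprints count as the same crisis only when their distance
    is at most the identification threshold.
    """
    best_label, best_dist = None, math.inf
    for label, fingerprint in known.items():
        d = distance(current, fingerprint)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None
```

A tighter threshold labels more crises as unknown, while a looser one risks matching a new crisis to an old one, so the choice of threshold trades off the two kinds of error.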
Evaluation • Used data gathered from a live production datacenter consisting of hundreds of servers • 240 days of data • About 100 metrics per server
Evaluation Criteria • Discrimination: when are two crises different? • Identification Stability: when does it provide a consistent suggestion? • Identification Accuracy: when does it provide the correct label?
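One way to make the stability criterion concrete is sketched below. The definition is an assumption for illustration: a sequence of suggestions made during one crisis is treated as stable if it consists of zero or more "unknown" answers (represented as `None`) followed only by a single repeated label. The slide itself does not spell out the rule.

```python
from typing import Optional, Sequence


def is_stable(labels: Sequence[Optional[str]]) -> bool:
    """Check that suggestions never change once a crisis has been named."""
    committed: Optional[str] = None
    for label in labels:
        if label is None:
            if committed is not None:
                return False  # fell back to unknown after naming a crisis
        elif committed is None:
            committed = label  # first concrete suggestion
        elif label != committed:
            return False  # switched to a different crisis
    return True
```

Under this assumed definition, `[None, None, "A", "A"]` is stable, while `[None, "A", "B"]` and `["A", None, "A"]` are not.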
Offline • Uses all known data • Attempts to recall the crises that it saw • Provides a baseline • What is the best possible result if the method knew everything? • Dominates existing methods, with near-perfect results
Quasi-Online • More realistic, but still computes the thresholds offline • Doesn’t know the future • Accuracy of 85% on both known and unknown crises
Online • Everything is computed online, on the fly • Including the Identification Threshold • Both accuracies reach 80% (with 10 seeding crises) • 78% known, 74% unknown (with 2 seeding crises) • Does well even with a smaller seeding set!
A note on Thresholds • Hot/Cold thresholds were selected arbitrarily • Ran evaluations with varied values from other statistical methods • These showed reduced discriminative power (95%, down from 99%) • Why mess with what works?