Fingerprinting the Datacenter

Fingerprinting the Datacenter • Marcel Flores • Shih-Chi Chen

Motivation • Large datacenters regularly encounter performance crises • A crisis shows up as performance dipping below the SLA • Crises are often complex and difficult to diagnose • They can be costly to operators

Approach • Quantify the state of the datacenter in a compact form • The current state can then be compared to past crises • Allows for easy identification and diagnosis of crises

Fingerprints • Tracks quantiles for each metric • Determines hot/normal/cold status for each metric • Includes only relevant metrics • Uses a similarity metric for comparison

Fingerprint - details • Track quantiles of each metric • Quantiles are resistant to outliers • Measure the 25th, 50th, and 95th percentiles • Classify each measurement as Hot (>98th percentile), Cold (<2nd percentile), or Normal
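The quantile tracking and hot/normal/cold step can be illustrated with a minimal Python sketch. The slides do not say what population the quantiles are taken over or what the 2nd/98th percentile thresholds are measured against; this sketch assumes quantiles are taken across servers and thresholds come from a history of past values. The function names `percentile`, `summarize`, and `discretize` are hypothetical.

```python
from statistics import quantiles


def percentile(values, p):
    """Return the p-th percentile (1 <= p <= 99) of values."""
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[p - 1]


def summarize(per_server_values):
    """Summarize one metric by its 25th, 50th, and 95th percentiles.

    Assumption: the quantiles are taken over the per-server readings.
    """
    return [percentile(per_server_values, p) for p in (25, 50, 95)]


def discretize(value, history, cold_pct=2, hot_pct=98):
    """Map a value to +1 (hot), -1 (cold), or 0 (normal).

    Assumption: hot/cold thresholds are percentiles of past values.
    """
    if value > percentile(history, hot_pct):
        return 1
    if value < percentile(history, cold_pct):
        return -1
    return 0


if __name__ == "__main__":
    history = list(range(101))          # illustrative past values 0..100
    print(summarize(history))
    print([discretize(v, history) for v in (1, 50, 99)])
```

A fingerprint in this sketch would be the list of discretized values for every tracked quantile of every relevant metric.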

Relevant Metrics • Select metrics via feature selection and classification • Technique from statistical machine learning • Eliminates noise from the fingerprints

Identification • Define a similarity metric • Allows comparison between current state fingerprint and known crisis fingerprints • Identification Threshold determines when two fingerprints are considered the same
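As a rough illustration of identification, the sketch below compares a current fingerprint with stored crisis fingerprints and returns a label only when the closest match falls within the identification threshold. The slides do not name the similarity metric; Euclidean distance is used here purely as an assumption, and the crisis labels, fingerprint values, and threshold are made up for the example.

```python
import math


def identify(current, known_crises, threshold):
    """Return the label of the closest known crisis, or None if nothing is close enough.

    current      -- fingerprint as a list of -1/0/+1 values
    known_crises -- dict mapping a crisis label to its fingerprint
    threshold    -- identification threshold on the distance (assumed Euclidean)
    """
    best_label, best_distance = None, math.inf
    for label, fingerprint in known_crises.items():
        d = math.dist(current, fingerprint)
        if d < best_distance:
            best_label, best_distance = label, d
    # Beyond the threshold, the crisis is treated as previously unseen.
    return best_label if best_distance <= threshold else None


if __name__ == "__main__":
    known = {
        "db_overload": [1, 1, 0, -1],     # hypothetical labels and fingerprints
        "config_error": [-1, 0, 0, 1],
    }
    print(identify([1, 1, 0, 0], known, threshold=1.5))
    print(identify([0, -1, 1, 0], known, threshold=1.0))
```

Returning `None` rather than the nearest label is what lets the operator distinguish a recurring crisis from a new one.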

Evaluation • Used data gathered from a production datacenter consisting of hundreds of servers • Data covers 240 days • About 100 metrics per server

Evaluation Criteria • Discrimination: when are two crises different? • Identification Stability: when does it provide a consistent suggestion? • Identification Accuracy: when does it provide the correct label?

Offline • Uses all known data • Attempts to recall the crises that it saw • Provides a baseline: what is the best possible result (if it knew everything)? • Dominates existing methods; near perfect

Quasi-Online • More realistic, but still computes the thresholds offline • Doesn’t know the future • Achieves 85% accuracy for both known and unknown crises
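The two accuracy figures can be computed separately, as in the sketch below. It assumes that "known accuracy" is the fraction of previously seen crises given the correct label, and "unknown accuracy" is the fraction of new crises correctly reported as unknown; the slides do not define the terms, so this reading and the `accuracies` helper are assumptions.

```python
def accuracies(results):
    """Compute (known_accuracy, unknown_accuracy) from identification results.

    results -- list of (true_label, predicted_label) pairs, where None stands
               for a crisis that has not been seen before / was reported as unknown.
    Either value is None when there are no crises of that kind.
    """
    known = [(t, p) for t, p in results if t is not None]
    unknown = [(t, p) for t, p in results if t is None]
    known_acc = sum(t == p for t, p in known) / len(known) if known else None
    unknown_acc = sum(p is None for _, p in unknown) / len(unknown) if unknown else None
    return known_acc, unknown_acc


if __name__ == "__main__":
    # Illustrative results only, not data from the evaluation.
    results = [("A", "A"), ("A", "B"), ("B", "B"), ("B", None),
               (None, None), (None, "A")]
    print(accuracies(results))
```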

Online • Everything is computed online, on the fly • Including the Identification Threshold • Both accuracies reach 80% (with 10 seeding crises) • 78% known, 74% unknown (with 2 seeding crises) • Does well even with a smaller seeding set!

A note on Thresholds • Hot/Cold thresholds were selected arbitrarily • Ran evaluations with alternative values derived from other statistical methods • These showed reduced discriminative power (95%, down from 99%) • Why mess with what works?
