Scalable Monitoring & Autonomous Management of Cloud Environments

Idit Keidar, Technion

Executive Summary • Goal: Scalable Monitoring and Autonomous Management of Cloud Environments • Approach: Distributed Local Computations • Combine theory and experimental work • Task 1: Robust aggregation • Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership • Task 3 (long term): Local and adaptive self-organization

Autonomous Self* Clouds • Complex autonomous decision making • Collaboratively computing functions [Figure: nodes exchanging messages: “The Haifa data center is too hot!” “They’re going to turn on the sprinklers - need to back up.” “Let’s reduce power.”]

Centralized Solutions Don’t Cut It • Load • Communication costs • Delays • Fault-tolerance

Classical Dist. Solutions Don’t Cut It • Global agreement before any output • Repeated invocations to adapt to changes • High latency, high load • By the time synchronization is done, input may have changed … the result is irrelevant • Frequent changes -> inconsistent snapshots • Synchronization typically relies on leader • difficult and costly to maintain

Locality to the Rescue! • Nodes make local decisions based on communication with some proximate nodes • rather than the entire network • Infinitely scalable • Fast, low overhead, low power

What is Locality? • Worst case view • Interesting problems have (a few) inherently global instances • Average case view • Requires an a priori distribution of the inputs • Our approach: be “as local as possible” • E.g., Veracity Radius of distributed aggregation [BKLSW’06]: how far does a node need to look in order to know the globally correct result?

Task 1: Distributed Clustering for Robust Aggregation Years 1-2 With Ittay Eyal and Raphael Rom

Clouds Need Monitoring • Load balancing storage/computation • Need to know load distributions • Ensuring a certain replication level • Need to know number of failures per object • Discovering problems – detecting anomalies • Isolated outliers (malfunctioning node) • Anomalous clusters • All nodes running some OS version are overloaded due to attack • Overheating area

Aggregation Needs • Robustness to data errors • Ignore erroneous reports (outliers) • See Amazon S3’s recent crash caused by corrupt data being gossiped • Data is multi-dimensional • Physical location X Heat: Where is there a fire? • Cluster group X Load: Overloaded clusters? • Software version X Performance: What software are perturbed nodes running?

Solution Requirements • Decentralized, tolerating crashes • Scalable, low cost • Clouds run 100,000s of machines • Machines are busy doing real work • Dynamic: deal with churn, value changes • All nodes learn the outcome • Data used for self-configuring/self-managing systems, so all nodes need to know the outcome in order to take appropriate actions

Proposed Approach • Gossip-based diffusion • Crash robust, scalable • Constant size synopses represent data distribution as set of Gaussian clusters [Figure: samples taken and the estimated distribution, shown as Gaussian 1 and Gaussian 2]

Merging Synopses • Gossiping nodes exchange synopses, merge them to improve accuracy [Figure: synopsis + synopsis = merged synopsis]
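
The slides do not spell out the merge rule, so the following is only a minimal sketch of one way two constant-size synopses of Gaussian clusters could be combined. It assumes one-dimensional data, a moment-matching merge of two clusters, and a greedy "closest means" rule for shrinking back to k clusters. The names `Gaussian`, `merge`, `compress`, and `merge_synopses` are hypothetical, and how weights are split between gossip partners is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    weight: float  # amount of data summarized by this cluster
    mean: float
    var: float     # population variance of the summarized data


def merge(a: Gaussian, b: Gaussian) -> Gaussian:
    """Moment-matching merge: the result has the mean and variance of the union."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    # Law of total variance: within-cluster spread plus spread of the means.
    var = (a.weight * (a.var + (a.mean - mean) ** 2)
           + b.weight * (b.var + (b.mean - mean) ** 2)) / w
    return Gaussian(w, mean, var)


def compress(clusters, k):
    """Assumed heuristic: merge the two clusters with the closest means until k remain."""
    clusters = sorted(clusters, key=lambda g: g.mean)
    while len(clusters) > k:
        i = min(range(len(clusters) - 1),
                key=lambda j: clusters[j + 1].mean - clusters[j].mean)
        clusters[i:i + 2] = [merge(clusters[i], clusters[i + 1])]
    return clusters


def merge_synopses(s1, s2, k=2):
    """Combine two synopses and keep the result at constant size k."""
    return compress(list(s1) + list(s2), k)


if __name__ == "__main__":
    mine = [Gaussian(1, 0.0, 0.0), Gaussian(1, 100.0, 0.0)]  # 100.0 is an outlier
    theirs = [Gaussian(1, 2.0, 0.0)]
    for g in merge_synopses(mine, theirs, k=2):
        print(g)
```

In this toy run the two nearby readings collapse into one cluster while the outlier keeps a cluster of its own, which is the kind of separation robust aggregation relies on.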

Preliminary Results - Robustness [Figure panels: Sample Distribution; Regular Aggregation; Robust Aggregation; No crashes; With crashes]

Estimating Distributions - Pareto [Figure panels: PDF; CDF]

Estimating Distributions - Uniform [Figure panels: PDF; CDF]

Multi-Dimensional Distributions [Figure panels: Samples Taken; Aggregated Synopsis]

Key Challenges • Test with real data • Analyze convergence properties • Understand locality • Deal with changing inputs

Task 2: Fault- & Loss-Tolerant Gossip-Based Membership: Formal Analysis Years 1-2 With Maxim Gurevich

Why Membership? • Each node needs to know some live nodes • In a dynamically changing system (churn) • Gossip partners • Random choices make gossip protocols work • Unstructured overlay networks • E.g., among super-peers • Random links provide robustness, expansion • Gathering statistics • Probe random nodes

Desirable Properties • Each node has a local view (set of node ids) • Small views, e.g., logarithmic • Load balance of representation in views • Uniform sample: In every node’s view, all other nodes appear with equal probability • Spatial independence: No correlation among views of different nodes • Temporal independence: fast decay of correlation with past views
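
As a rough illustration of the load-balance property, the sketch below counts how often each node id appears across all views (its in-degree) and reports the gap between the most and least represented nodes. The representation of views as a dictionary of lists and the names `in_degrees` and `imbalance` are assumptions made for this example, not part of the slides.

```python
from collections import Counter


def in_degrees(views):
    """Count how many view entries point at each node (its in-degree)."""
    counts = Counter({node: 0 for node in views})  # keep nodes nobody points at
    for view in views.values():
        counts.update(view)
    return dict(counts)


def imbalance(views):
    """Gap between the most and least represented node; 0 means perfect balance."""
    degrees = in_degrees(views).values()
    return max(degrees) - min(degrees)


if __name__ == "__main__":
    balanced = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
    skewed = {"a": ["b", "b"], "b": ["a", "c"], "c": ["b", "a"]}
    print(in_degrees(balanced), imbalance(balanced))
    print(in_degrees(skewed), imbalance(skewed))
```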

Existing Work • Many protocols studied only empirically • Achieve good load balance • Induce spatial dependence • No bound on temporal dependence • A few analyzed theoretically • Uniformity, load balance, spatial independence • Unrealistic assumptions • Atomic actions with bi-directional communication • No churn, failures, or message loss • No bounds on temporal dependence

Our Goal • Bridge “Theory” and “Practice” • A practical protocol • Working despite message loss, churn, failures • No complex bookkeeping for atomic actions • Formally prove the 5 desirable properties • Should perfectly hold in good circumstances • Quantify how much they degrade due to adverse conditions – message loss, churn, etc.

Send & Forget Membership • No bi-directional communication • Overcomes message loss • Simple • Amenable to formal analysis [Figure: nodes u, v, w; views before and after a u -> v message, after loss, and after dup]
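
The slide gives only the outline of the protocol, so the following is a hedged sketch of what a single one-way step might look like, not the protocol's actual definition. It assumes that u picks two ids v and w from its view, sends [u, w] to v, and drops both ids without waiting for a reply, and that u keeps its copies (a "dup") when its view is at or below a threshold. The function name, the `dup_threshold` and `loss_prob` parameters, and the unbounded receiver view are all assumptions.

```python
import random


def send_and_forget_step(views, u, rng, dup_threshold=2, loss_prob=0.0):
    """Run one hypothetical send-and-forget action initiated by node u.

    views maps each node id to its local view (a list of node ids).
    Returns True if the message was delivered, False otherwise.
    """
    view = views[u]
    if len(view) < 2:
        return False  # not enough ids to build a message
    v, w = rng.sample(view, 2)  # two entries picked from u's view
    if len(view) > dup_threshold:
        # "Forget": u drops both ids right away, with no acknowledgement.
        view.remove(v)
        view.remove(w)
    # Otherwise "dup": u keeps its copies, so ids lost earlier are replenished.
    if rng.random() < loss_prob:
        return False  # message lost; neither side waits or retries
    views[v].extend([u, w])  # v learns about u and w
    return True


if __name__ == "__main__":
    rng = random.Random(1)
    views = {"u": ["v", "w", "v", "w"], "v": ["u", "w"], "w": ["u", "v"]}
    delivered = send_and_forget_step(views, "u", rng)
    print(delivered, views)
```

Under these assumptions a delivered message only moves ids around, a lost message removes two ids from the system, and a dup adds two, which is why the dup rate has to be tuned against the loss rate.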

Challenges • Setting parameters • View size, how often to dup? • Proving all 5 desirable properties w/out loss • Markov Analysis 1: In-degree distribution • Markov Analysis 2: Markov Chain of all reachable global states • stationary probability, mixing, membership properties • Quantify impact of loss, churn, failures • Bound dependencies, degree imbalance
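
To make the "stationary probability" step concrete, here is a minimal sketch that computes the stationary distribution of a small Markov chain by power iteration. The two-state transition matrix is a made-up toy and is not the protocol's real state space, which covers all reachable global states; the function name `stationary` and its parameters are likewise assumptions.

```python
def stationary(P, iters=1000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of the chain: nxt = pi * P
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


if __name__ == "__main__":
    # Toy chain: state 0 is sticky, state 1 leaves half of the time.
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    print(stationary(P))
```

For this toy chain the balance condition 0.1 * pi[0] = 0.5 * pi[1] gives pi = (5/6, 1/6). The number of iterations needed to get close to that limit is a crude stand-in for the mixing time mentioned on the slide.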

Task 3: Local and Adaptive Self-Organization and Topology Maintenance Years 2-3

Decisions, Decisions • Making autonomous decisions based on some function computation • E.g., optimization function for topology maintenance • Devise local distributed computations for these • Challenge 1: Prove instance-based locality • Challenge 2: Test with real data

Summary (Repeated) • Goal: Scalable Monitoring and Autonomous Management of Cloud Environments • Approach: Distributed Local Computations • Combine theory and experimental work • Task 1: Robust aggregation • Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership • Task 3 (long term): Local and adaptive self-organization
