AVMON: Optimal & Scalable Discovery of Consistent Availability Monitoring Overlays for Distributed Systems

Ramsés Morales and Indranil Gupta, Dept. of Computer Science, University of Illinois at Urbana-Champaign

Outline • Introduction • Design Goals • Existing Solutions and Deficiencies • AVMON • Final Observation

Introduction • Research Problem • The selection and discovery of a consistent availability monitoring overlay • Why? • Large-scale distributed applications: replication, multicast... • Churn: rapid and continuous arrival, departure, failure, birth, and death of nodes • Availability-aware strategies • Availability monitoring service: maintains long-term availability information

Introduction (cont.) • Challenges • Selfish nodes • Colluding nodes • Distributed application challenges • The availability monitoring problem consists of two sub-problems • The selection and discovery of the availability monitoring overlay • Availability history maintenance

Design Goals • Consistency • Monitoring relationships should be consistent • Maintain long-term availability history • Avoid transferring histories due to churn • Avoid monitoring set pollution • Verifiability • Correctly verify the monitoring relationship between any two nodes • Load Balancing • Scalability

Design Goals (cont.) • Randomness • A node's monitoring set should be picked uniformly at random: • In an identically distributed fashion: reduces pollution, helps in load balancing • Independently of one another: avoids groups of nodes being together in several monitoring sets • Discoverability • Any node should be able to quickly find its monitoring and target sets • Any node should be able to locate at least a constant number of members of any other node's monitoring set

Existing Solutions • Self-reporting • Selfish lying nodes • Central availability monitor • Scalability • Load balancing • DHT-based • Consistency • Verifiability • Randomness

AVMON • System Model • Distributed system • Nodes may leave, fail, rejoin, be born, die • Death is silent and not explicit • Communication is reliable and timely • Nodes have persistent storage • Stable system size

AVMON • Overview • Hash-based implementation of the monitoring relationship • Any node can execute the hash-based function to determine if two nodes have a monitoring relationship • Consistent, random, and verifiable • Guarantees that each node has O(K) nodes in its monitoring set • The system maintains a coarse overlay to discover monitoring relationships between nodes and inform the relevant nodes
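The hash-based check on this slide can be illustrated with a small sketch. The slide does not give the hash function or the threshold, so the following is an assumption: SHA-256 over the ordered ID pair, normalized to [0, 1), compared against K/N, which would give about K monitors per node on average. The names `pair_hash` and `monitors` are hypothetical.

```python
import hashlib


def pair_hash(monitor_id: str, target_id: str) -> float:
    """Map an ordered pair of node IDs to a value in [0, 1)."""
    digest = hashlib.sha256(f"{monitor_id}|{target_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def monitors(monitor_id: str, target_id: str, k: int, n: int) -> bool:
    """True if monitor_id should monitor target_id.

    Assumption: the relationship holds when the pair hash is below k / n.
    Any node can recompute this, so the result is consistent and verifiable.
    """
    if monitor_id == target_id:
        return False
    return pair_hash(monitor_id, target_id) < k / n


if __name__ == "__main__":
    n, k = 1000, 10
    ids = [f"node-{i}" for i in range(n)]
    found = [m for m in ids if monitors(m, "node-0", k, n)]
    print(len(found), "monitors selected for node-0")
```

Because the outcome depends only on the two IDs and the public parameters, a third node can check a claimed relationship without contacting either party.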

AVMON • Monitor discovery via coarse view • Monitors are discovered via each node maintaining a coarse view • The coarse view has at most CVS entries • The goal is to have the coarse view be random • This is done by two sub-protocols

AVMON • Joining sub-protocol • Freshly joining & rejoining • Tell CVS other nodes about X • Node X sends a JOIN message to a random node Y • The JOIN message carries ID(X) & an integer weight C • When Y receives a JOIN message s.t. C ≠ 0 • It adds X to its coarse view if it is not there already and if there is room for it • It decrements C • It forwards two JOIN messages on behalf of X to random nodes in Y's coarse view, each carrying weight C/2
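The JOIN spreading described on this slide can be sketched as a small in-memory simulation. Several details are assumptions not stated on the slide: the initial weight is taken to be CVS, the decrement is applied whether or not X was added, the remaining weight is split as floor and ceiling halves, and X itself is never chosen as a forwarding target. The function name `join` and the `views` dictionary are hypothetical.

```python
import random
from collections import deque


def join(views, cvs, x, first_contact, rng):
    """Spread JOIN(x, c) starting at first_contact.

    views maps each node ID to its coarse view (a list of peer IDs).
    """
    pending = deque([(first_contact, cvs)])  # assumed initial weight C = CVS
    while pending:
        y, c = pending.popleft()
        if c == 0:
            continue  # a JOIN with zero weight is dropped
        view = views[y]
        if x != y and x not in view and len(view) < cvs:
            view.append(x)  # room available and X not yet known
        c -= 1  # assumed to happen even if X was not added
        candidates = [p for p in view if p != x]
        if c == 0 or not candidates:
            continue
        # Two forwarded messages whose weights sum to the remaining weight.
        for share in (c // 2, c - c // 2):
            pending.append((rng.choice(candidates), share))


if __name__ == "__main__":
    rng = random.Random(42)
    names = [f"n{i}" for i in range(20)]
    views = {n: rng.sample([m for m in names if m != n], 3) for n in names}
    join(views, 8, "x", "n0", rng)
    print(sorted(n for n in names if "x" in views[n]))
```

Under these assumptions the total weight in flight shrinks by one for every message processed, so the spread terminates and at most CVS nodes learn about X.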

AVMON • Coarse view maintenance and discovery • Executed once every protocol period • This protocol has three tasks • Eliminate nodes that left the system • Shuffle new random entries into the coarse view • Discover monitoring relationships along the way • Node X randomly picks a node Z in its coarse view and pings it • Unresponsive nodes are removed • Verify monitoring relationships between pairs of nodes (U, V) from X's and Z's coarse views • Pairs that satisfy the relationship are informed by a NOTIFY message • X selects new coarse view at random from the union of X's and Z's coarse views
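One protocol period of the maintenance sub-protocol might look like the sketch below. It is an illustration under assumptions: a ping is modelled as a membership test against a hypothetical `alive` set, the pairs checked are drawn from the union of the two coarse views, X is excluded from its own new view, and `related(u, v)` stands in for the hash-based condition "u monitors v".

```python
import random


def maintenance_round(x, views, alive, cvs, related, rng):
    """Run one protocol period at node x.

    Returns the (monitor, target) pairs that would be sent a NOTIFY.
    """
    if not views[x]:
        return []
    z = rng.choice(views[x])
    if z not in alive:
        views[x].remove(z)  # unresponsive node is dropped from the view
        return []
    pool = set(views[x]) | set(views[z])
    ordered = sorted(pool)
    # Check every ordered pair from the combined views.
    notifications = [(u, v) for u in ordered for v in ordered
                     if u != v and related(u, v)]
    # Shuffle: new view drawn at random from the union of both views.
    candidates = sorted(pool - {x})
    views[x] = rng.sample(candidates, min(cvs, len(candidates)))
    return notifications


if __name__ == "__main__":
    rng = random.Random(7)
    views = {"a": ["b"], "b": ["c", "d"], "c": [], "d": []}
    out = maintenance_round("a", views, {"a", "b", "c", "d"}, 2,
                            lambda u, v: (u, v) == ("c", "d"), rng)
    print(out, views["a"])
```

Sorting the pool before sampling keeps the run reproducible for a fixed seed; a real node would not need that.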

AVMON • Using the monitoring overlay • Node X receives a NOTIFY(U, X) • X checks if U is already in its monitoring set • If not, X rechecks the hash-consistent condition • If the condition is true, X adds U to its monitoring set • Node X receives a NOTIFY(X, U) • X checks if U should be in its target set • Node X monitors the availability of its target set by sending monitoring pings every monitoring period, and it maintains the availability history of those nodes • Node X is responsible for reporting a subset of its monitoring set to any node Y that requires it
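The NOTIFY handling and the per-period monitoring pings can be sketched as follows. The class `Node`, its method names, and the list-of-booleans history are hypothetical; `related(monitor, target)` stands in for the hash-consistent condition, and `is_up` stands in for a real ping.

```python
class Node:
    def __init__(self, node_id, related):
        self.id = node_id
        self.related = related       # related(monitor, target) -> bool
        self.monitoring_set = set()  # nodes that monitor this node
        self.target_set = set()      # nodes this node monitors
        self.history = {}            # target -> list of ping outcomes

    def on_notify(self, monitor, target):
        """Handle NOTIFY(monitor, target), rechecking the condition locally."""
        if target == self.id and monitor not in self.monitoring_set:
            if self.related(monitor, self.id):
                self.monitoring_set.add(monitor)
        elif monitor == self.id and target not in self.target_set:
            if self.related(self.id, target):
                self.target_set.add(target)
                self.history[target] = []

    def monitoring_period(self, is_up):
        """Ping every target once and record the outcome."""
        for t in self.target_set:
            self.history[t].append(bool(is_up(t)))

    def availability(self, target):
        """Fraction of recorded pings that were answered, or None."""
        samples = self.history.get(target, [])
        return sum(samples) / len(samples) if samples else None

    def report_monitors(self, limit):
        """Report up to `limit` members of this node's monitoring set."""
        return sorted(self.monitoring_set)[:limit]
```

Rechecking the condition on receipt means a forged NOTIFY cannot add a node to either set.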

AVMON • Forgetful pinging • An optimization for cleaning monitoring and target sets of nodes that may never join the system again • These nodes cannot be deleted • Instead, if a node in X's target set is unresponsive for longer than a threshold, reduce its monitoring ping frequency • Collusion-resilience • If X has O(N/log(N)) colluding nodes, it is probabilistically impossible for them to pollute X's monitoring set • Even for up to O(N/log(N)) colluding relationships
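The forgetful pinging idea can be illustrated with a minimal sketch. The slide does not say how the ping frequency is reduced, so the rule here is an assumption: once a target has been silent for more than `threshold` periods, it is pinged only every `slow_factor`-th period. The class and parameter names are hypothetical.

```python
class ForgetfulPinger:
    def __init__(self, threshold, slow_factor):
        self.threshold = threshold      # silent periods tolerated at full rate
        self.slow_factor = slow_factor  # assumed reduced rate: every k-th period
        self.last_seen = {}             # target -> period of last response

    def track(self, target, period):
        """Start tracking a target; it stays in the set and is never deleted."""
        self.last_seen.setdefault(target, period)

    def should_ping(self, target, period):
        """Ping at full rate until the silence exceeds the threshold."""
        silence = period - self.last_seen[target]
        if silence <= self.threshold:
            return True
        return period % self.slow_factor == 0

    def record_response(self, target, period):
        """A response restores the full ping rate."""
        self.last_seen[target] = period
```

Keeping the target in the set while pinging it less often preserves its history if it ever rejoins.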

Final Observation • In the joining sub-protocol, how does X know about Y? • Effect of coarse view maintenance and discovery on history transfers

Thank You
