AVMON: Optimal & Scalable Discovery of Consistent Availability Monitoring Overlays for Distributed Systems

This study presents AVMON, a system for selecting and discovering consistent availability monitoring overlays in large-scale distributed systems. The system aims to maintain long-term availability information while addressing challenges such as churn, selfish and colluding nodes, and scalability. AVMON achieves consistency, verifiability, load balancing, randomness, and discoverability through its hash-based implementation and coarse view maintenance protocols.

Presentation Transcript


  1. AVMON: Optimal and Scalable Discovery of Consistent Availability Monitoring Overlays for Distributed Systems • Ramsés Morales and Indranil Gupta • Dept. of Computer Science, University of Illinois at Urbana-Champaign

  2. Outline • Introduction • Design Goals • Existing Solutions and Deficiencies • AVMON • Final Observation

  3. Introduction • Research Problem • The selection and discovery of a consistent availability monitoring overlay • Why? • Large-scale distributed applications: replication, multicast... • Churn: rapid and continuous arrival, departure, failure, birth, and death of nodes • Availability-aware strategies • Availability monitoring service: maintain long-term availability information

  4. Introduction (cont.) • Challenges • Selfish nodes • Colluding nodes • Distributed application challenges • The availability monitoring problem consists of two sub-problems • The selection and discovery of the availability monitoring overlay • Availability history maintenance

  5. Design Goals • Consistency • Monitoring relationships should be consistent • Maintain long-term availability history • Avoid transferring histories due to churn • Avoid monitoring set pollution • Verifiability • Correctly verify the monitoring relationship between any two nodes • Load Balancing • Scalability

  6. Design Goals (cont.) • Randomness • A node's monitoring set should be picked uniformly at random: • In an identically distributed fashion: reduces pollution, helps in load balancing • Independently of one another: avoids groups of nodes being together in several monitoring sets • Discoverability • Any node should be able to quickly find its monitoring and target sets • Any node should be able to locate at least a constant number of any other node's monitoring sets

  7. Existing Solutions • Self-reporting • Selfish lying nodes • Central availability monitor • Scalability • Load balancing • DHT-based • Consistency • Verifiability • Randomness

  8. AVMON • System Model • Distributed system • Nodes may leave, fail, rejoin, be born, die • Death is silent and not explicit • Communication is reliable and timely • Nodes have persistent storage • Stable system size

  9. AVMON • Overview • Hash-based implementation of the monitoring relationship • Any node can execute the hash-based function to determine if two nodes have a monitoring relationship • Consistent, random, and verifiable • Guarantees that each node has O(K) nodes in its monitoring set • The system maintains a coarse overlay to discover monitoring relationships between nodes and inform the relevant nodes
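
The hash-based check on the slide above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the exact condition, so the rule used here ("Y monitors X when a hash of the pair, mapped to [0, 1), is at most K/N"), the use of SHA-256, and the names `monitors`, `K`, and `N` are all hypothetical choices for the example.

```python
import hashlib

def monitors(monitor_id: str, target_id: str, k: int, n: int) -> bool:
    """Return True if monitor_id should monitor target_id.

    Assumed rule (not taken from the slides): hash the ordered pair,
    map it to [0, 1), and compare against the threshold K/N.
    """
    digest = hashlib.sha256(f"{monitor_id}|{target_id}".encode()).digest()
    # Use the first 8 bytes of the digest as a number in [0, 1).
    h = int.from_bytes(digest[:8], "big") / 2**64
    return h <= k / n

# Hypothetical node identifiers; any stable string ID would do.
nodes = [f"10.0.0.{i}:4000" for i in range(200)]
K, N = 8, len(nodes)
x = nodes[0]

# Any node can recompute this list, so the relationship is verifiable.
monitoring_set = [y for y in nodes if y != x and monitors(y, x, K, N)]
print(len(monitoring_set), "nodes monitor", x)
```

Because the check depends only on the two identifiers and on K and N, every node that evaluates it gets the same answer, and a node cannot choose its own monitors. Under the assumed rule, each of the other N-1 nodes falls into X's monitoring set with probability K/N, so the expected set size is about K.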

  10. AVMON • Monitor discovery via coarse view • Monitors are discovered via each node maintaining a coarse view • The coarse view has at most CVS entries • The goal is to have the coarse view be random • This is done by two sub-protocols

  11. AVMON • Joining sub-protocol • Freshly joining & rejoining • Tell CVS other nodes about X • Node X sends a JOIN message to a random node Y • The JOIN message carries ID(X) and an integer weight C • When Y receives a JOIN message s.t. C ≠ 0 • It adds X to its coarse view if X is not there already and there is room for it • It decrements C • It forwards two JOIN messages on behalf of X to random nodes in Y's coarse view, each carrying weight C/2
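
The JOIN spreading described on the slide above can be simulated with a short sketch. Several details are assumptions made for the example and are not stated in the slides: the initial weight is taken to be CVS, the weight is decremented on every hop, and the remaining weight is split into integer halves (floor and ceiling). The function name `join` and the in-memory `views` dictionary are hypothetical stand-ins for real message passing.

```python
import random
from collections import deque

def join(views, cvs, x, first_contact, rng):
    """Spread JOIN(x, weight) starting at first_contact.

    views maps node id -> set of ids (that node's coarse view).
    Returns how many coarse views added x.
    """
    added = 0
    queue = deque([(first_contact, cvs)])  # assumption: initial weight C = CVS
    while queue:
        y, weight = queue.popleft()
        if weight <= 0:
            continue  # a JOIN with weight 0 is dropped
        view = views[y]
        if y != x and x not in view and len(view) < cvs:
            view.add(x)
            added += 1
        weight -= 1  # assumption: decremented on every hop
        targets = [z for z in view if z != x]
        if not targets:
            continue
        # Forward two JOINs, splitting the remaining weight in integer halves.
        for share in (weight // 2, weight - weight // 2):
            if share > 0:
                queue.append((rng.choice(targets), share))
    return added

rng = random.Random(1)
n, cvs = 50, 8
# Existing nodes start with small random views so there is room for x.
views = {i: set(rng.sample(range(n), 4)) - {i} for i in range(n)}
x = n                 # the joining node
views[x] = set()
added = join(views, cvs, x, rng.choice(range(n)), rng)
print("views that now contain x:", added)
```

Since each processed message consumes one unit of weight, at most CVS nodes can add X, which matches the "tell CVS other nodes about X" goal on the slide.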

  12. AVMON • Coarse view maintenance and discovery • Executed once every protocol period • This protocol has three tasks • Eliminate nodes that left the system • Shuffle new random entries into the coarse view • Discover monitoring relationships along the way • Node X randomly picks a node Z in its coarse view and pings it • Unresponsive nodes are removed • Verify monitoring relationships between pairs of nodes (U, V) from X's and Z's coarse views • Pairs that satisfy the relationship are informed by a NOTIFY message • X selects new coarse view at random from the union of X's and Z's coarse views
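
One protocol period of the maintenance step on the slide above might look like the sketch below. It is a single-process illustration, not the authors' implementation: the `alive` set stands in for real pings, NOTIFY messages are returned as a list instead of being sent, the pair check reuses an assumed "hash of the pair at most K/N" rule, and X and Z themselves are included in the checked pairs. All names are hypothetical.

```python
import hashlib
import random

def monitors(u, v, k, n):
    """Assumed rule: u monitors v iff a hash of the pair is at most K/N."""
    digest = hashlib.sha256(f"{u}|{v}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 <= k / n

def maintenance_round(x, views, alive, cvs, k, n, rng):
    """Run one protocol period at node x; return the NOTIFY pairs (u, v)."""
    if not views[x]:
        return []
    z = rng.choice(sorted(views[x]))
    if z not in alive:
        views[x].discard(z)  # ping went unanswered: drop the entry
        return []
    # Check every ordered pair drawn from both coarse views (plus x and z).
    pool = views[x] | views[z] | {x, z}
    notifications = [(u, v) for u in sorted(pool) for v in sorted(pool)
                     if u != v and monitors(u, v, k, n)]
    # Shuffle: pick the new coarse view at random from the union.
    candidates = sorted(pool - {x})
    views[x] = set(rng.sample(candidates, min(cvs, len(candidates))))
    return notifications

rng = random.Random(7)
n, cvs, k = 40, 6, 4
views = {i: set(rng.sample([j for j in range(n) if j != i], cvs))
         for i in range(n)}
alive = set(range(n)) - {3}  # node 3 has left the system
notes = maintenance_round(0, views, alive, cvs, k, n, rng)
print(len(notes), "NOTIFY messages; new view of node 0:", sorted(views[0]))
```

Each round therefore does all three tasks from the slide in one pass: it removes an unresponsive entry, discovers monitoring pairs as a side effect of the exchange, and refreshes the coarse view with entries from Z.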

  13. AVMON • Using the monitoring overlay • Node X receives a NOTIFY(U, X) • X checks if U is already in its monitoring set • If not, X rechecks the hash-consistent condition • If the condition is true, X adds U to its monitoring set • Node X receives a NOTIFY(X, U) • X checks if U should be in its target set • Node X monitors the availability of its target set by sending monitoring pings every monitoring period, and it maintains the availability history of those nodes • Node X is responsible for reporting a subset of its monitoring set to any node Y that requires it
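
The NOTIFY handling on the slide above can be sketched as a small state object. The key point it illustrates is that X never trusts a NOTIFY blindly; it re-evaluates the monitoring condition itself. The class name `MonitorState`, the `relation` callable, and the list-based availability history are assumptions for this example; the slides do not specify how histories are stored.

```python
class MonitorState:
    """Per-node state for using the monitoring overlay (illustrative only)."""

    def __init__(self, node_id, relation):
        self.node_id = node_id
        self.relation = relation      # callable (monitor, target) -> bool
        self.monitoring_set = set()   # nodes that monitor this node
        self.target_set = set()       # nodes this node monitors
        self.history = {}             # target -> list of ping outcomes

    def on_notify(self, monitor, target):
        """Handle NOTIFY(monitor, target), re-verifying before accepting it."""
        if monitor == target:
            return
        if target == self.node_id and monitor not in self.monitoring_set:
            if self.relation(monitor, self.node_id):
                self.monitoring_set.add(monitor)
        elif monitor == self.node_id and target not in self.target_set:
            if self.relation(self.node_id, target):
                self.target_set.add(target)
                self.history[target] = []

    def record_ping(self, target, responded):
        """Store the outcome of one monitoring ping to a target."""
        if target in self.target_set:
            self.history[target].append(bool(responded))

    def availability(self, target):
        """Fraction of answered pings, or None if there is no history yet."""
        samples = self.history.get(target, [])
        return sum(samples) / len(samples) if samples else None

# A toy relation standing in for the hash-based condition.
allowed = {("u", "x"), ("x", "v")}
state = MonitorState("x", lambda a, b: (a, b) in allowed)
state.on_notify("u", "x")   # valid: u monitors x
state.on_notify("m", "x")   # forged: fails the recheck and is ignored
state.on_notify("x", "v")   # valid: x monitors v
for outcome in (True, True, False):
    state.record_ping("v", outcome)
print(state.monitoring_set, state.target_set, state.availability("v"))
```

The forged NOTIFY from "m" has no effect, which is how the recheck keeps a node's monitoring set from being polluted by false claims.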

  14. AVMON • Forgetful pinging • An optimization for cleaning monitoring and target sets of nodes that may never rejoin the system • These nodes cannot be deleted • Instead, if a node in X's target set is unresponsive for longer than a threshold, X reduces its monitoring ping frequency • Collusion-resilience • If X has O(N/log(N)) colluding nodes, it is probabilistically impossible for them to pollute X's monitoring set • Even for up to O(N/log(N)) colluding relationships
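
Forgetful pinging, as described on the slide above, only says that the ping frequency is reduced once a target has been unresponsive past a threshold. The sketch below shows one possible schedule; the doubling rule, the cap, and the name `ping_interval` are assumptions made for illustration and are not taken from the slides.

```python
def ping_interval(base_period, downtime, threshold, max_interval):
    """Return the time to wait before the next monitoring ping.

    Hypothetical schedule: ping every base_period while the target has been
    down for at most `threshold`; after that, double the interval once per
    elapsed threshold, up to max_interval. All values are in seconds.
    """
    if downtime <= threshold:
        return base_period
    doublings = int(downtime // threshold)
    return min(base_period * 2 ** doublings, max_interval)

# A target that keeps failing to respond is pinged less and less often.
for downtime in (0, 3600, 5400, 4 * 3600, 30 * 24 * 3600):
    print(downtime, "->", ping_interval(60, downtime, 3600, 86400))
```

The target is never removed from the set, so its history stays with the same monitor if it ever rejoins; only the cost of checking on it shrinks over time.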

  15. Final Observation • In the joining sub-protocol, how does X know about Y? • Effect of coarse view maintenance and discovery on history transfers

  16. Thank You
