An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services Jingyu Zhou *§, Lingkun Chu*, Tao Yang*§ * Ask Jeeves §University of California at Santa Barbara
Outline • Background & motivation • Membership protocol design • Implementation • Evaluation • Related work • Conclusion
Background • Large-scale 24x7 Internet services • Thousands of machines connected by many level-2 and level-3 switches (e.g., 10,000 at Ask Jeeves) • Multi-tiered architecture with data partitioning and replication • Some machines are frequently unavailable due to failures, operational errors, and scheduled service updates
Network Topology in Service Clusters • Multiple hosting centers across the Internet • Within a hosting center • Thousands of nodes • Many level-2 and level-3 switches • Complex switch topology
Motivation • Membership protocol • Yellow page directory – discovery of services and their attributes • Server aliveness – quick fault detection • Challenges • Efficiency • Scalability • Fast detection
Fast Failure Detection Is Crucial • Online auction service, even with replication: • Failure of one replica: 7s - 12s • Service unavailable: 10s - 13s
Communication Cost for Fast Detection • Communication requirements • Updates must propagate to all nodes • Fast detection needs a higher packet rate • High bandwidth consumption • Higher hardware cost • More chances of failure
Design Requirements of Membership Protocol for Large-scale Clusters • Efficient: bandwidth, # of packets • Topology-adaptive: localize traffic within switches • Scalable: scale to tens of thousands of nodes • Fast failure detection and information propagation.
Approaches • Centralized • Easy to implement • Single point of failure, not scalable, extra delay • Distributed • All-to-all broadcast [Shen’01]: doesn’t scale well • Gossip [Renesse’98]: probabilistic guarantees • Ring: slow to handle multiple failures • None of these approaches consider network topology
TAMP: Topology-Adaptive Membership Protocol • Topology-awareness • Form a hierarchical tree according to the network topology • Topology-adaptiveness • Network changes: add/remove/move switches • Service changes: add/remove/move nodes • Exploit the TTL field in IP packets to scope multicast groups (see the sketch below)
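The TTL-scoping idea can be illustrated with a small sketch. This helper is an assumption for illustration only (the function name, group address, and port are made up, not part of TAMP's implementation); it shows how the TTL on a multicast socket confines a group's traffic to a limited region of the topology.

```python
import socket
import struct

def open_group_socket(group_ip, port, ttl):
    """Open a UDP multicast socket whose outgoing packets carry the given TTL,
    so that low-TTL groups stay local to nearby switches (illustrative helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # A small TTL confines the multicast group to a small region of the
    # network; larger TTL values widen the scope for higher tree levels.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sock.bind(("", port))
    # Join the group so this node also receives membership traffic on it.
    mreq = struct.pack("4sl", socket.inet_aton(group_ip), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

# Example: a level-0 group scoped to TTL 1 (address and port are illustrative).
level0_socket = open_group_socket("239.1.0.0", 9000, ttl=1)
```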
Hierarchical Tree Formation Algorithm 1. Form small multicast groups with low TTL values; 2. Each multicast group performs a leader election; 3. Group leaders form higher-level groups with larger TTL values; 4. Stop when the maximum TTL value is reached; otherwise, go to Step 2. (A sketch of this loop follows.)
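A minimal sketch of this formation loop, under simplifying assumptions: group_by_ttl_scope() is a hypothetical stand-in for TTL-scoped multicast discovery, and taking the maximum node ID stands in for the per-group election.

```python
def build_tree(nodes, group_by_ttl_scope, max_levels=8):
    """Simulate the formation loop: group, elect, promote leaders, repeat.
    group_by_ttl_scope(members, level) is a hypothetical helper returning the
    groups of members reachable within the TTL used at that level."""
    level, members, tree = 0, list(nodes), []
    while members and level < max_levels:             # Step 4: bounded by the max TTL
        groups = group_by_ttl_scope(members, level)   # Step 1: low-TTL multicast groups
        leaders = [max(group) for group in groups]    # Step 2: each group elects a leader
        tree.append(groups)
        if len(leaders) == len(members):              # no further aggregation possible
            break
        members, level = leaders, level + 1           # Step 3: leaders form the next level
    return tree

# Toy run: 9 node IDs grouped in threes at every level.
chunks = lambda ms, lvl: [ms[i:i + 3] for i in range(0, len(ms), 3)]
print(build_tree(range(9), chunks))
```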
An Example • 3 Level-3 switches with 9 nodes
Node Joining Procedure • Purpose • Find/elect a leader • Exchange membership information • Process (sketched below) 1. Join a channel and listen; 2. If a leader exists, stop and bootstrap with the leader; 3. Otherwise, elect a leader (bully algorithm); 4. If elected leader, increase the channel ID & TTL and go to Step 1.
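The loop below sketches these steps from a single node's point of view; join_channel(), wait_for_leader(), bully_election(), and bootstrap_from() are hypothetical helpers passed in for illustration, not TAMP's actual API.

```python
def join(node, join_channel, wait_for_leader, bully_election, bootstrap_from,
         max_ttl=32):
    """One node's joining loop: listen, adopt an existing leader, or win the
    election and climb to the next channel with a wider TTL scope."""
    channel, ttl = 0, 1
    while ttl <= max_ttl:
        group = join_channel(channel, ttl)           # Step 1: join the channel and listen
        leader = wait_for_leader(group, timeout=2.0)
        if leader is not None and leader != node:
            bootstrap_from(leader)                   # Step 2: pull the membership view
            return
        if bully_election(group) != node:            # Step 3: bully election among members
            return                                   # lost the election; the winner climbs
        channel, ttl = channel + 1, ttl + 1          # Step 4: next channel ID, larger TTL
```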
Properties of TAMP • Upward propagation guarantee • A node is always aware of its leader • Messages can always be propagated to nodes at higher levels • Downward propagation guarantee • A node at level i must know the leaders at levels i-1, i-2, …, 0 • Messages can always be propagated to lower-level nodes • Eventual convergence • The view of every node converges
Update Protocol When the Cluster Structure Changes • Heartbeats for failure detection • When a leader receives an update, it multicasts the update both up and down the tree (see the sketch below)
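A sketch of both mechanisms on this slide, assuming a hypothetical send_multicast() helper and parent_channel / child_channel attributes on each leader; the 1-second heartbeat interval mirrors the evaluation settings later in the talk.

```python
import time

def heartbeat_loop(node, send_multicast, interval=1.0):
    """Periodically announce liveness on the node's own group channel."""
    while True:
        send_multicast(node.group_channel, ("HEARTBEAT", node.node_id, time.time()))
        time.sleep(interval)

def on_update(leader, update, send_multicast):
    """Relay a membership update both up and down the tree so it eventually
    reaches every node (the propagation guarantees from the previous slide)."""
    if leader.parent_channel is not None:   # up: toward higher-level leader groups
        send_multicast(leader.parent_channel, update)
    if leader.child_channel is not None:    # down: to the local group this leader manages
        send_multicast(leader.child_channel, update)
```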
Fault Tolerance Techniques • Leader failure: backup leader or election • Network partition failure • Time out all nodes managed by a failed leader • Hierarchical timeouts: longer timeouts for higher levels (sketched below) • Packet loss • Leaders exchange deltas since the last update • Piggyback the last three changes
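A sketch of the hierarchical-timeout idea; the base timeout and the growth factor are illustrative assumptions (the 5-second base mirrors the 5 missed 1-second heartbeats used in the evaluation).

```python
BASE_TIMEOUT = 5.0  # seconds of silence tolerated at level 0 (illustrative)

def timeout_for_level(level, factor=2.0):
    """Higher tree levels tolerate longer silences before declaring a failure."""
    return BASE_TIMEOUT * (factor ** level)

def check_failures(members, now):
    """Return members whose heartbeats have been silent past their level's timeout.
    Declaring a leader dead implicitly times out the nodes it manages."""
    return [m for m in members
            if now - m.last_heartbeat > timeout_for_level(m.level)]
```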
Scalability Analysis • Protocols: all-to-all, gossip, and TAMP • Basic performance factors • Failure detection time (Tfail_detect) • View convergence time (Tconverge) • Communication cost in terms of bandwidth (B)
Scalability Analysis (Cont.) • Two metrics • BDP = B * Tfail_detect: low failure detection time with low bandwidth is desired • BCP = B * Tconverge: low view convergence time with low bandwidth is desired • Notation: n = total # of nodes, k = size of each group (a constant)
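The two products can be compared with a back-of-the-envelope sketch. The scalings below are assumptions based on the trends reported on the evaluation slides (roughly quadratic bandwidth for all-to-all and gossip versus near-linear for TAMP; roughly constant detection and convergence for all-to-all and TAMP versus logarithmic for gossip); the constants are arbitrary and only relative growth matters.

```python
import math

def bdp_bcp(protocol, n, k=8):
    """Return (BDP, BCP) for n nodes and constant group size k under the
    assumed asymptotic scalings; units are arbitrary."""
    if protocol == "all-to-all":
        b, t_detect, t_converge = n * n, 1.0, 1.0
    elif protocol == "gossip":
        b, t_detect, t_converge = n * n, math.log(n), math.log(n)
    else:  # "tamp": traffic confined to constant-size groups at each level
        b, t_detect, t_converge = n * k, 1.0, 1.0
    return b * t_detect, b * t_converge

for proto in ("all-to-all", "gossip", "tamp"):
    print(proto, bdp_bcp(proto, n=10_000))
```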
Implementation • Inside Neptune middleware [Shen’01] – programming and runtime support for building cluster-based Internet services • Can be easily coupled with other clustering frameworks
Evaluation: Objectives & Settings • Metrics • Bandwidth • Failure detection time • View convergence time • Hardware settings • 100 dual PIII 1.4GHz nodes • 2 switches connected by a Gigabit switch • Protocol-related settings • Heartbeat frequency: 1 packet/s • A node is deemed dead after 5 consecutive losses • Gossip mistake probability: 0.1% • # of nodes: 20 – 100 in steps of 20
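As a quick sanity check on these settings (an assumption about how the loss threshold translates into latency, not a measured number): with one heartbeat per second and a node deemed dead after 5 consecutive losses, a failure is noticed roughly 5 to 6 seconds after the node's last heartbeat.

```python
HEARTBEAT_INTERVAL = 1.0  # one heartbeat packet per second, as configured above
LOSS_THRESHOLD = 5        # consecutive missed heartbeats before a node is deemed dead

# Detection falls between LOSS_THRESHOLD and LOSS_THRESHOLD + 1 intervals after
# the last received heartbeat (ignoring network and processing delay).
min_latency = LOSS_THRESHOLD * HEARTBEAT_INTERVAL
max_latency = (LOSS_THRESHOLD + 1) * HEARTBEAT_INTERVAL
print(f"expected detection latency: {min_latency:.0f}-{max_latency:.0f} s")
```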
Bandwidth Consumption • All-to-All & Gossip: quadratic increase • TAMP: close to linear
Failure Detection Time • Gossip: log(N) increase • All-to-All & TAMP: constant
View Convergence Time • Gossip: log(N) increase • All-to-All & TAMP: constant
Related Work • Membership & failure detection • [Chandra’96], [Fetzer’99], [Fetzer’01], [Neiger’96], and [Stok’94] • Gossip-style protocols • SCAMP, [Kempe’01], and [Renesse’98] • High-availability system (e.g., HA-Linux, Linux Heartbeat) • Cluster-based network services • TACC, Porcupine, Neptune, Ninja • Resource monitoring: Ganglia, NWS, MDS2
Contributions & Conclusions • TAMP is a highly efficient and scalable membership protocol for very large clusters • Exploits the TTL field in IP packets for a topology-adaptive design • Verified through property analysis and experimentation • Deployed on Ask Jeeves clusters with thousands of machines