Self-stabilizing Overlay Networks

Sukumar Ghosh, University of Iowa (Talk at Michigan Technological University). Work in progress, jointly with Andrew Berns and Sriram Pemmaraju.

On Thursday, 16th August 2007, Skype had an outage. (Skype is known to be a “self-healing” overlay network.) Skype’s explanation: The disruption was triggered by a massive restart of users’ computers across the globe within a very short timeframe, as they re-booted after receiving a routine set of patches through Windows Update.

Overlay Network A logical network laid on top of the Internet. [Figure: nodes A, B, and C on the Internet, joined by logical link AB and logical link BC]

The Formal Model Let V be a set of nodes. The function id : V → Z+ assigns a unique id to each node in V, and the function rs : V → {0, 1}* assigns a random bit string to each node in V. A family of overlay networks is a function ON : F → G, where F is the set of all triples λ = (V; id; rs) and G is the set of all directed graphs. The family of overlay networks associates a unique directed graph ON(λ) ∈ G with each labeled set λ = (V; id; rs) of nodes.

Structured vs. Unstructured Overlay networks
Structured: Network topology satisfies specific invariants. Examples: Chord, CAN, Pastry, Skip Graph, etc.
Unstructured: No restriction on network topology. Examples: Gnutella, Kazaa, BitTorrent, Skype, etc.

The Challenge Can an overlay network restore its correct functionality from an arbitrary initial configuration? Bad configurations can be caused by failures, perturbations, selfish actions, malicious attacks.

Autonomic Systems Self-management is the holy grail of all complex dynamic systems.

Self-stabilizing systems (Convergence) Recover from any arbitrary initial configuration to a legal configuration in a bounded number of steps, and (Closure) remain in the legal configuration thereafter, until another failure or perturbation occurs.

Self-stabilizing Overlay Networks Can an overlay network restore its topology from an arbitrary initial configuration? Does it make sense in unstructured networks? Does it make sense in structured networks?

Related work
• Self-stabilizing and Byzantine-tolerant overlay network. OPODIS 2007 [Dolev, Hoch, van Renesse]
• A distributed polylog time algorithm for self-stabilizing SKIP graph. PODC 2009 [Jacob, Richa, Scheideler et al.]
• Linearization: Locally self-stabilizing sorting in graphs. ALENEX 2007 [Onus, Richa, Scheideler]

Example: Linearization (Onus, Richa, Scheideler, ALENEX 2007) [Figure: an arbitrary connected topology on numbered nodes, and the sorted list 2 5 7 10 13 15 18 21 30 34] The ideal topology is a sorted list. The goal is to spontaneously recover to the ideal topology from an arbitrary connected topology.

Self-stabilizing algorithm: Linearization
• Left and right neighbors:
• ‘w’ is a left neighbor of node ‘u’ if {u, w} ∈ E and w < u.
• ‘w’ is a right neighbor of node ‘u’ if {u, w} ∈ E and u < w.
[Figure: node u=10 with left neighbors w1=2, w2=3, w3=6, w4=8 and right neighbors v1=19, v2=28, v3=30, v4=35]

Self-stabilizing algorithm: Linearization
• (The Algorithm)
• In each round do
• Convert left neighbors into a sorted list
• Convert right neighbors into a sorted list
[Figure: node u=10 with left neighbors w1=2, w2=3, w3=6, w4=8 and right neighbors v1=19, v2=28, v3=30, v4=35]
Takes at most (n-2) rounds. Slide borrowed from Onus et al.
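The round structure above can be sketched in Python. This is a simplified synchronous variant written for illustration, not the exact procedure from Onus et al.: in each round every node chains its sorted left neighbors, itself, and its sorted right neighbors, and the new edge set is the union of all chains. The function names, the star-shaped start graph, and the `max_rounds` guard are assumptions; no claim about the round bound is made for this variant.

```python
def linearize_round(edges):
    """One synchronous round: every node turns its neighborhood into a sorted chain."""
    nodes = {x for e in edges for x in e}
    new_edges = set()
    for u in nodes:
        nbrs = {x for e in edges if u in e for x in e if x != u}
        left = sorted(w for w in nbrs if w < u)    # left neighbors of u
        right = sorted(v for v in nbrs if v > u)   # right neighbors of u
        chain = left + [u] + right
        # Keep only edges between consecutive members of the chain.
        new_edges.update(frozenset(p) for p in zip(chain, chain[1:]))
    return new_edges


def linearize(edges, max_rounds=1000):
    """Repeat rounds until the edge set stops changing (hypothetical guard: max_rounds)."""
    edges = {frozenset(e) for e in edges}
    for rounds in range(max_rounds):
        nxt = linearize_round(edges)
        if nxt == edges:
            return edges, rounds
        edges = nxt
    raise RuntimeError("did not converge within max_rounds")


def sorted_list_edges(nodes):
    """The ideal topology: consecutive nodes in sorted order are linked."""
    ordered = sorted(nodes)
    return {frozenset(p) for p in zip(ordered, ordered[1:])}


# Start from a star around u=10, using the neighbor values shown on the slide.
NEIGHBORS = (2, 3, 6, 8, 19, 28, 30, 35)
star = [(10, x) for x in NEIGHBORS]
final_edges, rounds = linearize(star)
print(sorted(sorted(e) for e in final_edges), "after", rounds, "rounds")
```

Because every removed edge is replaced by a path through the chain, connectivity is preserved from round to round in this sketch.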

Evolution of Skip Graph (Aspnes, Shah, SODA 2003) [Figure: sorted linked list of keys 2 4 9 15 32 47 63 80 93 107] Search time is O(n) hops

SKIP Graph [Figure: keys 2 4 9 15 32 47 63 80 93 107 with membership vectors 001 100 110 010 111 101 000 011 101 010; Level 0 holds one list of all keys, Level 1 holds lists 0-- and 1--, Level 2 holds lists 00-, 01-, 10-, 11-] Node degree = O(log n), diameter = O(log n). Number of levels = O(log n). Search time now is O(log n) hops.
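As a rough illustration of how the levels arise, the sketch below groups keys by membership-vector prefix: at level i, keys sharing the same first i bits form one sorted list. The pairing of keys with vectors follows the order in which they appear on the slide, which is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed pairing: the i-th key on the slide carries the i-th membership vector.
VECTORS = {
    2: "001", 4: "100", 9: "110", 15: "010", 32: "111",
    47: "101", 63: "000", 80: "011", 93: "101", 107: "010",
}


def skip_graph_lists(vectors, levels):
    """Map level -> {prefix: sorted list of keys whose vector starts with prefix}."""
    lists = {}
    for level in range(levels + 1):
        groups = defaultdict(list)
        for key in sorted(vectors):
            groups[vectors[key][:level]].append(key)
        lists[level] = dict(groups)
    return lists


def neighbors(lists, vectors, key):
    """Union of a key's predecessor and successor in its list at every level."""
    nbrs = set()
    for level, groups in lists.items():
        group = groups[vectors[key][:level]]
        i = group.index(key)
        if i > 0:
            nbrs.add(group[i - 1])
        if i + 1 < len(group):
            nbrs.add(group[i + 1])
    return nbrs


lists = skip_graph_lists(VECTORS, levels=2)
for level in sorted(lists):
    print("Level", level, lists[level])
print("neighbors of 32:", sorted(neighbors(lists, VECTORS, 32)))
```

Each node appears in one list per level, which is where the O(log n) degree comes from when the number of levels is O(log n).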

SKIP Graph: the question Can we have a self-stabilizing skip graph that can spontaneously restore its topology starting from any “connected” initial configuration?

Why local checking is important Unless bad configurations are detected via local checking, periodic global snapshots are needed, which is disruptive for the system.

SKIP Graph is NOT locally checkable Self-stabilization requires local detection of errors, but certain failures are not locally checkable

SKIP+ graph Jacob, Richa, Scheideler et al. (PODC 2009) proposed a locally checkable version of SKIP Graph by adding a few extra edges to an existing Skip Graph. They called it a SKIP+ Graph. They presented an algorithm to stabilize such a topology in O(log^2 n) rounds with high probability. The algorithm is quite cumbersome. We try to devise a simpler and better solution.

Detectors [Figure: a network in which several nodes are marked as detectors] Our first step

Detector diameter The detector diameter of G is the maximum, over all nodes, of the hop distance in G between a node and its closest detector.
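The definition can be computed with a multi-source breadth-first search started from all detectors at once. The sketch below is illustrative only; the adjacency-dict representation, the function name, and the example path graph are assumptions.

```python
from collections import deque


def detector_diameter(adj, detectors):
    """Largest hop distance from any node to its closest detector."""
    if not detectors:
        raise ValueError("at least one detector is required")
    dist = {d: 0 for d in detectors}
    queue = deque(detectors)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(adj):
        raise ValueError("some node cannot reach any detector")
    return max(dist.values())


# Hypothetical example: a path a - b - c - d - e.
path = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
print(detector_diameter(path, {"b"}))        # node e is farthest from b
print(detector_diameter(path, {"b", "e"}))   # a second detector shrinks it
```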

Transitive Closure Framework Due to the local checkability property, any faulty configuration contains at least one detector.

Transitive Closure Framework Theorem For a SKIP+ graph, the detector diameter D = O(log n).


Transitive Closure Framework The neighbors of each detector become detectors in the next round. In O(log n) rounds, every node becomes a detector, and these detectors initiate the transitive closure process. After an additional O(log n) rounds, all nodes become connected with one another, and the topology becomes completely connected.

Transitive Closure Framework After all nodes become detectors and the topology becomes completely connected, the nodes rebuild the correct topology using a REPAIR subroutine. REPAIR takes only one round.
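A minimal simulation of the framework, under stated assumptions: in each round every detector links itself to its neighbors' neighbors, and the neighbors of detectors become detectors. Once the graph is a clique and every node is a detector, a stand-in `repair` builds the target topology in one more round. The sorted ring used as the legal topology, the function names, and the `max_rounds` guard are all hypothetical; the talk does not specify these details.

```python
def tcf_round(adj, detectors):
    """One round: detectors connect to neighbors of neighbors; detection spreads one hop."""
    new_adj = {u: set(nbrs) for u, nbrs in adj.items()}
    for u in detectors:
        for v in adj[u]:
            for w in adj[v]:
                if w != u:
                    new_adj[u].add(w)
                    new_adj[w].add(u)
    new_detectors = set(detectors)
    for u in detectors:
        new_detectors |= adj[u]
    return new_adj, new_detectors


def is_clique(adj):
    return all(len(nbrs) == len(adj) - 1 for nbrs in adj.values())


def repair(nodes):
    """Hypothetical legal topology: a ring over the nodes in sorted order."""
    ordered = sorted(nodes)
    n = len(ordered)
    return {u: {ordered[i - 1], ordered[(i + 1) % n]} for i, u in enumerate(ordered)}


def stabilize(adj, detectors, max_rounds=100):
    """Run rounds until clique + all detectors, then one REPAIR round."""
    rounds = 0
    while not (is_clique(adj) and detectors == set(adj)):
        if rounds == max_rounds:
            raise RuntimeError("no detector, or the graph is not connected")
        adj, detectors = tcf_round(adj, detectors)
        rounds += 1
    return repair(adj), rounds + 1


# Hypothetical start: a path 0 - 1 - ... - 7 with a single detector at node 0.
start = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 7} for i in range(8)}
legal, total_rounds = stabilize(start, {0})
print(legal, "in", total_rounds, "rounds")
```

The intermediate clique is what makes the space cost of this approach high: every node temporarily stores a link to every other node.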

The Repair Process Lemma If the network is completely connected and all nodes are detectors in round i, a legal overlay network will be built in round (i + 1), and no node will be a detector. Compare with the results of Jacob et al.

Local checkability Let L define a correct configuration of an overlay network. Then the network is locally checkable when L = p_0 ∧ p_1 ∧ p_2 ∧ … ∧ p_(n-1), where p_i is a local predicate involving process i and its immediate neighbors only. Most real-life networks are NOT locally checkable.

Example: a clique [Figure: a clique on nodes a, b, c] Theorem. A completely connected topology is locally checkable.

Chord is not locally checkable [Figure: a Chord ring and a loopy Chord ring]
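A small sketch of why a loopy ring escapes local detection. It uses one plausible local predicate (neither of a node's two ring neighbors lies strictly closer in the clockwise direction than its successor) and a global winding count; both are assumptions chosen for illustration, not the checks used in the talk. The 8-node key space and the successor maps are hypothetical.

```python
def clockwise(a, b, size):
    """Clockwise distance from key a to key b on a ring of the given size."""
    return (b - a) % size


def locally_consistent(succ, size):
    """Every node sees no known neighbor strictly between itself and its successor."""
    pred = {v: u for u, v in succ.items()}   # assumes succ is a permutation
    return all(
        clockwise(u, pred[u], size) >= clockwise(u, succ[u], size)
        for u in succ
    )


def winding_number(succ, size):
    """How many times the successor cycle wraps around the key space (global view)."""
    return sum(clockwise(u, v, size) for u, v in succ.items()) // size


SIZE = 8
proper = {u: (u + 1) % SIZE for u in range(SIZE)}           # 0 -> 1 -> ... -> 7 -> 0
loopy = {0: 2, 2: 4, 4: 6, 6: 1, 1: 3, 3: 5, 5: 7, 7: 0}   # wraps around twice

print(locally_consistent(proper, SIZE), winding_number(proper, SIZE))
print(locally_consistent(loopy, SIZE), winding_number(loopy, SIZE))
```

Both rings pass the local predicate, yet only the winding number, which needs a view of the whole ring, tells them apart.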

CAN is not locally checkable [Figure: Content Addressable Network (CAN) on a 2D torus] Replace the black edges by the red edges, and each column becomes a loopy Chord ring.

LCON: a locally checkable overlay network in a circular key space [Figure: a ring with N = 64 and nodes at keys 0, 3, 5, 7, 18, 23, 25, 32, 37, 40, 50, 54, 59]

LCON: a locally checkable overlay network in a circular key space Nmax = s × d. Let s = 16, d = 4. S-links for node u: one edge to each node in the range (u to u+s mod N). D-links for node u: Succ(u+s mod N), Succ(u+2s mod N), …, Succ(u+(d-1)s mod N). [Figure: the ring with nodes at keys 0, 3, 5, 7, 18, 23, 25, 32, 37, 40, 50, 54, 59, divided into four S-segments]
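The link rules can be tried out on the node positions shown on the ring (N = 64, s = 16, d = 4). The sketch below makes two assumptions that the slide does not settle: the S-link range is read as the open interval strictly between u and u+s, and Succ(x) is read as the first node at or after key x going clockwise. The function names are hypothetical.

```python
from bisect import bisect_left

N, S, D = 64, 16, 4
NODES = [0, 3, 5, 7, 18, 23, 25, 32, 37, 40, 50, 54, 59]   # sorted keys on the ring


def succ(nodes, x):
    """First node at or after key x, wrapping around (assumed convention)."""
    i = bisect_left(nodes, x)
    return nodes[i % len(nodes)]


def s_links(nodes, u, s, n):
    """Nodes strictly between u and u+s clockwise (assumed open range)."""
    return [v for v in nodes if 0 < (v - u) % n < s]


def d_links(nodes, u, s, d, n):
    """Succ(u + k*s mod n) for k = 1 .. d-1."""
    return [succ(nodes, (u + k * s) % n) for k in range(1, d)]


for u in (0, 54):
    print(u, "S-links:", s_links(NODES, u, S, N), "D-links:", d_links(NODES, u, S, D, N))
```

With a fully populated key space this reading gives s-1 S-links and d-1 D-links per node.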

Observations Observation Each node in LCON has (d+s-2) neighbors. When d= s, the size of the neighborhood is O(sqrt N). Theorem The detector diameter of LCON is at most two.
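One way to arrive at the count, assuming a node has at most s-1 S-links (the keys strictly between u and u+s) and exactly d-1 D-links, and taking N = s × d:

```latex
\begin{align*}
\text{neighbors}(u) &\le (s-1) + (d-1) = d + s - 2, \\
N = s \cdot d,\; d = s \;&\Rightarrow\; s = \sqrt{N}, \\
d + s - 2 &= 2\sqrt{N} - 2 = O(\sqrt{N}).
\end{align*}
```

For example, N = 64 with d = s = 8 gives at most 14 neighbors per node.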

Some properties of LCON Theorem. LCON is locally checkable. Main idea. Case 1. If the diameter is two, then every node can “see” every other node, and check if the topology is correct. Case 2. We show that if the diameter is greater than two, then there is at least one detector.

Self-stabilization of LCON The Transitive Closure Framework (TCF) will stabilize LCON in O(log N) time. But it may be a sledgehammer. What is the space complexity of stabilization using TCF?

Self-stabilization of LCON We have an algorithm customized for LCON that stabilizes LCON in polylog time, while the space complexity does not skyrocket to O(n)

Generalization of LCON Main idea Consider a CAN-like topology on a d-dimensional torus. Convert the “ring” in each dimension into an LCON ring. It is only partially shown in the figure on a 2-dimensional torus. Each node has O(d · N^(1/(2d))) neighbors.
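A sketch of where the bound may come from, assuming the N keys are spread evenly so that each of the d rings through a node spans N^{1/d} keys, and each ring is an LCON ring whose two parameters are both set to the square root of the ring size. The symbol \sigma for that common parameter is introduced here to avoid a clash with the dimension d.

```latex
\begin{align*}
\text{keys per ring} &= N^{1/d}, \qquad \sigma = \sqrt{N^{1/d}} = N^{1/(2d)}, \\
\text{neighbors per ring} &\le 2\sigma - 2 = 2N^{1/(2d)} - 2, \\
\text{neighbors per node} &\le d\,\bigl(2N^{1/(2d)} - 2\bigr) = O\!\bigl(d \cdot N^{1/(2d)}\bigr).
\end{align*}
```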

Conclusion
• A new problem of growing interest. We need efficient algorithms for stabilizing a variety of overlay topologies.
• The initial topology must be connected. Stabilization from a partitioned topology is impossible. Also, for a given (V, id, rs) the legal topology should be unique. Otherwise there will be an additional step for distributed consensus.
• Working on extending this to more fragile networks.

Questions?
