Peer-to-Peer Structured Overlay Networks Antonino Virgillito
Background Peer-to-peer systems • distribution • symmetry (communication, node roles) • decentralized control • self-organization • dynamicity
Data Lookup in P2P Systems • Data items spread over a large number of nodes • Which node stores which data item? • A lookup mechanism needed • Centralized directory -> bottleneck/single point of failure • Query Flooding -> scalability concerns • Need more structure!
More Issues • Organize, maintain overlay network • node arrivals • node failures • Resource allocation/load balancing • Resource location • Network proximity routing
What is a Distributed HashTable? • Exactly that • A service, distributed over multiple machines, with hash table semantics • put(key, value), Value = get(key) • Designed to work in a peer-to-peer (P2P) environment • No central control • Nodes under different administrative control • But of course can operate in an “infrastructure” sense
What is a DHT? • Hash table semantics: put(key, value), Value = get(key) • Key is a single flat string • Limited semantics compared to keyword search • Put() causes value to be stored at one (or more) peer(s) • Get() retrieves value from a peer • Put() and Get() accomplished with unicast routed messages • In other words, it scales • Other API calls to support application, like notification when neighbors come and go
Distributed Hash Tables (DHT) [Figure: key/value pairs (k1,v1) … (k6,v6) spread over the nodes of a P2P overlay network; operations: put(k,v), get(k)] • p2p overlay maps keys to nodes • completely decentralized and self-organizing • robust, scalable
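To make the put()/get() semantics concrete, here is a minimal single-process sketch of the abstraction. The class and helper names (ToyDHT, _owner) are illustrative, and ownership is decided by a placeholder rule rather than by real overlay routing.

```python
import hashlib

class ToyDHT:
    """Toy model of the DHT abstraction: put(key, value) / get(key)."""

    def __init__(self, node_ids):
        # one local dict per node stands in for storage on separate machines
        self.nodes = {n: {} for n in node_ids}

    def _owner(self, key):
        # hash the key into the identifier space, then pick the owning node;
        # a real DHT replaces this step with overlay routing (Chord, CAN, Pastry)
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        ids = sorted(self.nodes)
        return ids[h % len(ids)]

    def put(self, key, value):
        self.nodes[self._owner(key)][key] = value

    def get(self, key):
        return self.nodes[self._owner(key)].get(key)

dht = ToyDHT(node_ids=[0, 1, 3])
dht.put("k1", "v1")
print(dht.get("k1"))        # -> "v1", served by whichever node owns hash("k1")
```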
Popular DHTs • Tapestry (Berkeley) • Based on Plaxton trees (similar to hypercube routing) • The first* DHT • Complex and hard to maintain (hard to understand too!) • CAN (ACIRI), Chord (MIT), and Pastry (Rice/MSR Cambridge) • Second wave of DHTs (contemporary with and independent of each other)
DHTs Basics • Node IDs can be mapped to the hash key space • Given a hash key as a “destination address”, you can route through the network to a given node • Always route to the same node no matter where you start from • Requires no centralized control (completely distributed) • Small per-node state is independent of the number of nodes in the system (scalable) • Nodes can route around failures (fault-tolerant)
Things to look at • What is the structure? • How does routing work in the structure? • How does it deal with node joins and departures (structure maintenance)? • How does it scale? • How does it deal with locality? • What are the security issues?
The Chord Approach • Consistent Hashing • Logical Ring • Finger Pointers
The Chord Protocol • Provides: • A mapping successor: key -> node • To lookup key K, go to node successor(K) • successor defined using consistent hashing: • Key hash • Node hash • Both Keys and Nodes hash to same (circular) identifier space • successor(K)=first node with hash ID equal to or greater than hash(K)
Example: The Logical Ring (identifier space 2^3). Nodes 0, 1, 3; keys 1, 2, 6
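A minimal sketch of the successor mapping for this ring, assuming a global, sorted view of the live node IDs (which real Chord nodes never have):

```python
from bisect import bisect_left

def successor(key_id, node_ids):
    """First node whose ID is equal to or greater than key_id, wrapping
    around the circular identifier space (sketch with a global node list)."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, key_id) % len(ring)]

# The ring above: nodes 0, 1, 3 and keys 1, 2, 6
for k in (1, 2, 6):
    print(k, "->", successor(k, [0, 1, 3]))   # 1 -> 1, 2 -> 3, 6 -> 0 (wraps)
```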
Consistent Hashing [Karger et al. ‘97] • Some Nice Properties: • Smoothness: minimal key movement on node join/leave • Load Balancing: keys equitably distributed over nodes
Mapping Details • Range of the hash function • Circular ID space modulo 2^m • Compute the 160-bit SHA-1 hash and truncate it to m bits • Chance of collision is rare if m is large enough • Deterministic, but hard for an adversary to subvert
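A sketch of the hashing step, where truncation to m bits is done by reducing the SHA-1 digest modulo 2^m; the function name chord_id and the example inputs are illustrative only:

```python
import hashlib

def chord_id(name, m=160):
    """Hash a node address or key name into the circular ID space mod 2^m.
    Sketch: truncation to m bits == masking the low m bits of the digest."""
    digest = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return digest % (2 ** m)     # equivalently: digest & ((1 << m) - 1)

print(chord_id("198.51.100.7:4000", m=8))   # a node address on a small m=8 ring
print(chord_id("my-file.txt", m=8))         # a key hashed into the same space
```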
Chord State • Successor/predecessor pointers in the ring • Finger pointers: n.finger[i] = successor(n + 2^(i-1)) • Each node knows more about the portion of the circle close to it!
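A sketch of the finger-table definition, again computed from a global view of the node IDs purely for illustration:

```python
def build_fingers(n, node_ids, m):
    """Finger table of node n on a 2^m ring: finger[i] = successor(n + 2^(i-1)).
    Sketch with a global node list; real nodes learn these entries via lookups."""
    ring = sorted(node_ids)
    def succ(x):
        x %= 2 ** m
        for nid in ring:
            if nid >= x:
                return nid
        return ring[0]            # wrap around the circle
    return [succ(n + 2 ** (i - 1)) for i in range(1, m + 1)]

# Ring from the example: m = 3, nodes 0, 1, 3 -> fingers of node 1
print(build_fingers(1, [0, 1, 3], m=3))    # [3, 3, 0] for targets 2, 3, 5
```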
Chord: routing protocol Notation: n.foo( ) stands for a remote call to node n. • A set of nodes progressively closer to id is contacted remotely • Each node is queried for the closest node it knows of that precedes id • The process stops at a node whose successor is responsible for id (i.e., id falls between that node and its successor)
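An iterative sketch of this lookup, with per-node state modeled as dictionaries (succ, fingers) instead of remote calls:

```python
def in_interval(x, a, b, inclusive_right=False):
    """x in (a, b) or (a, b] on the circular ID space."""
    if a < b:
        return a < x <= b if inclusive_right else a < x < b
    return (x > a or x <= b) if inclusive_right else (x > a or x < b)

def find_successor(start, key_id, succ, fingers):
    """Sketch of Chord lookup with a global view of per-node state:
    succ[n] is n's successor, fingers[n] its finger list (ascending powers)."""
    n = start
    while not in_interval(key_id, n, succ[n], inclusive_right=True):
        # hop to the closest finger of n that precedes key_id
        candidate = n
        for f in reversed(fingers[n]):
            if in_interval(f, n, key_id):
                candidate = f
                break
        if candidate == n:        # no closer finger known: fall back to successor
            candidate = succ[n]
        n = candidate
    return succ[n]

# Example ring with m = 3, nodes 0, 1, 3 (fingers as in the previous sketch)
succ    = {0: 1, 1: 3, 3: 0}
fingers = {0: [1, 3, 0], 1: [3, 3, 0], 3: [0, 0, 0]}
print(find_successor(1, 6, succ, fingers))   # key 6 is stored at node 0
```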
Example: Chord Routing Finger Pointers for Node 1
Lookup Complexity • With high probability: O(log N) • Proof intuition: • Let p be the successor of the targeted key: the distance to p is at least halved at each step • Hence p is reached in at most m steps • Stronger claim: in O(log N) steps the distance drops to ≤ 2^m/N; from there even linear advance suffices, giving O(log N) lookup complexity
Chord invariants • Every key in the network can be located as long as the following invariants are preserved after joins and leaves: • Each node’s successor is correctly maintained • For every key k, node successor(k) is responsible for k
Chord: Node Joins • New node B learns of at least one existing node A via external means • B asks A to look up its finger-table information • Given that B's hash-id is b, A performs a lookup for B.finger[i] = successor(b + 2^(i-1)), skipping the lookup if the interval is already covered by finger[i-1] • B stores all finger information and sets up its pred/succ pointers
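A hedged sketch of B's finger-table initialization, where lookup(id) stands for the remote successor lookup that A performs on B's behalf; the reuse check implements the "interval not already included" optimization:

```python
def init_fingers(b, m, lookup):
    """Joining node b fills its finger table using one existing node's
    lookup(id) as the remote call (sketch; ids live on a 2^m circle)."""
    fingers = [lookup((b + 1) % 2 ** m)]              # finger[1] = successor(b + 2^0)
    for i in range(2, m + 1):
        start = (b + 2 ** (i - 1)) % 2 ** m
        prev = fingers[-1]
        # if start lies between b and the previous finger, that finger also
        # covers this interval and no extra remote lookup is needed
        covered = (b < start <= prev) if b < prev else (start > b or start <= prev)
        fingers.append(prev if covered else lookup(start))
    return fingers

# Example: node 6 joins the m = 3 ring {0, 1, 3}; lookup is provided by node A
ring = sorted([0, 1, 3])
lookup = lambda x: next((n for n in ring if n >= x), ring[0])
print(init_fingers(6, 3, lookup))   # [0, 0, 3]: fingers of node 6 at join time
```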
Node Joins (contd.) • Update the finger tables of existing nodes p such that: • p precedes b by at least 2^(i-1) • the i-th finger of node p succeeds b • Start from p = predecessor(b - 2^(i-1)) and proceed counter-clockwise while condition 2 holds • Transferring keys: • Only from successor(b) to b • A notification must be sent to the application
Example: finger table update Node 6 joins
Example: transferring keys Node 1 leaves
Concurrent Joins/Leaves • Need a stabilization protocol to guard against inconsistency • Note: incorrect finger pointers may only increase latency, but incorrect successor pointers may cause lookup failure! • Nodes periodically run the stabilization protocol • Ask the successor for its predecessor • If that node lies between us and the successor, adopt it as the new successor (and notify it) • The same procedure is also run at join
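A compact sketch of the stabilization round, mirroring the stabilize()/notify() pair from the Chord paper but with remote calls modeled as direct method calls on local objects:

```python
class ChordNode:
    """Sketch of periodic stabilization; ids live on a 2^m circle."""
    def __init__(self, nid, m):
        self.id, self.m = nid, m
        self.successor = self            # a lone node is its own successor
        self.predecessor = None

    def _between(self, x, a, b):         # x in (a, b) on the circle
        return (a < x < b) if a < b else (x > a or x < b)

    def stabilize(self):
        # ask our successor for its predecessor; adopt it if it sits between us
        x = self.successor.predecessor
        if x is not None and self._between(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)      # tell the successor about ourselves

    def notify(self, candidate):
        # accept candidate as predecessor if it is closer than the current one
        if self.predecessor is None or self._between(
                candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate

# Two-node example: b joins by pointing its successor at a, then both stabilize
a, b = ChordNode(1, m=3), ChordNode(6, m=3)
b.successor = a
for _ in range(2):
    a.stabilize(); b.stabilize()
print(a.successor.id, b.successor.id)    # -> 6 1 : the ring is consistent
```

A joining node only needs to set its successor via a lookup; stabilization then repairs the successor/predecessor pointers, and fingers can be refreshed lazily.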
CAN • Virtual d-dimensional Cartesian coordinate system on a d-torus • Example: 2-d [0,1] x [0,1] • Dynamically partitioned among all nodes • A pair (K,V) is stored by mapping key K to a point P in the space with a uniform hash function and storing (K,V) at the node whose zone contains P • An entry (K,V) is retrieved by applying the same hash function to map K to P and fetching the entry from the node whose zone contains P • If P is not contained in the zone of the requesting node or its neighboring zones, the request is routed to the neighbor node whose zone is nearest P
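A sketch of the key-to-point mapping, carving a SHA-1 digest into d coordinates to stand in for the uniform hash function (the helper name key_to_point is illustrative):

```python
import hashlib

def key_to_point(key, d=2):
    """Sketch: hash a key to a point P in the d-dimensional unit space by
    splitting the SHA-1 digest into d equal-sized chunks."""
    h = hashlib.sha1(key.encode()).digest()
    chunk = len(h) // d
    return tuple(int.from_bytes(h[i * chunk:(i + 1) * chunk], "big") / 2 ** (8 * chunk)
                 for i in range(d))

print(key_to_point("my-file.txt"))   # e.g. a point in [0,1] x [0,1]
```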
Routing in a CAN • Follow the straight-line path through the Cartesian space from source to destination coordinates • Each node maintains a table with the IP address and virtual coordinate zone of each of its neighbors • Greedy routing: forward to the neighbor closest to the destination • For a d-dimensional space partitioned into n equal zones, nodes maintain 2d neighbors • Average routing path length: (d/4)(n^(1/d)) hops
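A sketch of greedy forwarding on the unit 2-torus; here zone_center and neighbors are global maps for illustration, whereas a real CAN node knows only its own neighbor set:

```python
import math

def torus_dist(p, q):
    """Euclidean distance on the unit d-torus (coordinates wrap at 1.0)."""
    return math.sqrt(sum(min(abs(a - b), 1 - abs(a - b)) ** 2
                         for a, b in zip(p, q)))

def greedy_route(start, dest_point, neighbors, zone_center):
    """Repeatedly hand the message to the neighbor whose zone center is
    closest to the destination point; stop when no neighbor is closer."""
    node = start
    while True:
        best = min(neighbors[node] + [node],
                   key=lambda n: torus_dist(zone_center[n], dest_point))
        if best == node:          # no neighbor is closer: we own (or border) P
            return node
        node = best

# Two nodes each owning half of the unit square
zone_center = {"A": (0.25, 0.5), "B": (0.75, 0.5)}
neighbors   = {"A": ["B"], "B": ["A"]}
print(greedy_route("A", (0.9, 0.4), neighbors, zone_center))   # -> "B"
```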
CAN Construction • The joining node locates a bootstrap node using the CAN DNS entry • The bootstrap node provides IP addresses of random member nodes • The joining node sends a JOIN request for a random point P in the Cartesian space • The node whose zone contains P splits that zone and allocates "half" to the joining node • The (K,V) pairs in the allocated "half" are transferred to the joining node • The joining node learns its neighbor set from the previous zone occupant • The previous zone occupant updates its neighbor set
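A sketch of the zone split performed by the occupant when a joiner arrives, with a 2-d zone represented as a pair of corner points ((lo_x, lo_y), (hi_x, hi_y)); the helper names are illustrative:

```python
def split_zone(zone, axis):
    """Halve `zone` along `axis`: the occupant keeps one half, the joiner the other."""
    lo, hi = zone
    mid = (lo[axis] + hi[axis]) / 2
    keep_hi = list(hi); keep_hi[axis] = mid
    give_lo = list(lo); give_lo[axis] = mid
    return (lo, tuple(keep_hi)), (tuple(give_lo), hi)

def reassign_keys(kv_pairs, point_of, new_zone):
    """Hand over the (K, V) pairs whose hash point falls in the joiner's half."""
    lo, hi = new_zone
    inside = lambda p: all(lo[i] <= p[i] < hi[i] for i in range(len(lo)))
    return {k: v for k, v in kv_pairs.items() if inside(point_of(k))}

# The first node owns the whole unit square; a joiner picks P in the right half
old, new = split_zone(((0.0, 0.0), (1.0, 1.0)), axis=0)
print(old, new)   # ((0,0),(0.5,1)) stays, ((0.5,0),(1,1)) goes to the joiner
```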
Departure, Recovery and Maintenance • Graceful departure: the node hands over its zone and its (K,V) pairs to a neighbor • Node failure: unreachable nodes trigger an immediate takeover algorithm that allocates the failed node's zone to a neighbor • Failures are detected via the absence of periodic refresh messages • Each neighbor starts a takeover timer initialized in proportion to its own zone volume • When the timer fires, it sends a TAKEOVER message containing its zone volume to all of the failed node's neighbors • On receiving a TAKEOVER with a smaller volume a node kills its timer; otherwise it replies with its own TAKEOVER message • Nodes thus agree on the live neighbor with the smallest volume
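A sketch of the outcome of the takeover race: because timers are proportional to zone volume, the smallest-volume live neighbor fires first and the others cancel when they see its TAKEOVER message (the node names and volumes below are made up):

```python
def takeover(failed_neighbors):
    """Sketch of the takeover race. `failed_neighbors` maps each live neighbor
    of the failed node to its zone volume; timers fire in order of volume, so
    the smallest-volume neighbor wins and the rest cancel."""
    events = sorted(failed_neighbors.items(), key=lambda kv: kv[1])  # fire order
    winner, _ = events[0]
    cancelled = [n for n, _ in events[1:]]   # they saw a smaller-volume TAKEOVER
    return winner, cancelled

print(takeover({"n1": 0.25, "n2": 0.125, "n3": 0.5}))   # n2 wins; n1, n3 cancel
```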
Pastry Generic p2p location and routing substrate • Self-organizing overlay network • Lookup/insert an object in < log_16 N routing steps (expected) • O(log N) per-node state • Network proximity routing
Pastry: Object distribution [Ring figure: circular id space from 0 to 2^128-1 with objIds and nodeIds placed on it] • Consistent hashing • 128-bit circular id space • nodeIds (uniform random) • objIds (uniform random) • Invariant: the node with the numerically closest nodeId maintains the object
Pastry: Object insertion/lookup [Ring figure: Route(X) delivering a message with key X] • A message with key X is routed to the live node with the nodeId closest to X • Problem: a complete routing table is not feasible
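The delivery invariant can be sketched directly; the nodeIds below are the truncated example ids that appear in the routing figure later in this deck:

```python
def numerically_closest(key, node_ids, bits=128):
    """Sketch of the Pastry invariant: a key is stored on the live node whose
    nodeId is numerically closest to it in the circular 2^bits id space."""
    ring = 1 << bits
    dist = lambda n: min((key - n) % ring, (n - key) % ring)
    return min(node_ids, key=dist)

nodes = [0x65a1fc, 0xd13da3, 0xd4213f, 0xd462ba, 0xd467c4, 0xd471f1]
print(hex(numerically_closest(0xd46a1c, nodes)))   # -> 0xd467c4
```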
Pastry: Routing table of node 65a1fc [Figure: rows 0-3 of the table shown; the full table has log_16 N rows, one per shared-prefix length]
Pastry: Leaf sets • Each node maintains IP addresses of the nodes with the L/2 numerically closest larger and smaller nodeIds, respectively. • routing efficiency/robustness • fault detection (keep-alive) • application-specific local coordination
Pastry: Routing procedure
if (destination D is within range of our leaf set)
    forward to the numerically closest leaf-set member
else
    let l = length of the prefix shared with D
    let d = value of the l-th digit in D's address
    if (the routing-table entry R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node
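A hedged Python sketch of the same decision, assuming equal-length hexadecimal nodeIds (b = 4) given as strings and simplifying circular distance to plain |a - b|; the function and parameter names are illustrative:

```python
import os

def next_hop(self_id, key, leaf_set, routing_table):
    """Forwarding decision sketch. leaf_set is a list of nodeIds;
    routing_table[l][d] holds the nodeId sharing l digits with us whose
    next digit is d, or None."""
    num = lambda s: int(s, 16)
    if self_id == key:
        return self_id                                   # we are the root
    if leaf_set:
        lo, hi = min(map(num, leaf_set)), max(map(num, leaf_set))
        if lo <= num(key) <= hi:
            # destination falls within the leaf set: deliver to closest member
            return min(leaf_set + [self_id], key=lambda n: abs(num(n) - num(key)))
    l = len(os.path.commonprefix([self_id, key]))        # shared-prefix length
    d = int(key[l], 16)                                  # next digit of the key
    if routing_table[l][d] is not None:
        return routing_table[l][d]                       # the common case
    # rare case: any known node with an equally long prefix, numerically closer
    known = leaf_set + [e for row in routing_table for e in row if e]
    closer = [n for n in known
              if len(os.path.commonprefix([n, key])) >= l
              and abs(num(n) - num(key)) < abs(num(self_id) - num(key))]
    return min(closer, key=lambda n: abs(num(n) - num(key))) if closer else self_id
```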
Pastry: Routing Properties • log_16 N steps • O(log N) state [Ring figure: Route(d46a1c) starting at node 65a1fc and passing d13da3, d4213f, d462ba on its way to the node numerically closest to the key; d467c4 and d471f1 also shown on the ring]
Pastry: Performance Integrity of overlay message delivery: • guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously Number of routing hops: • No failures: < log_16 N expected, 128/b + 1 max • During failure recovery: • O(N) worst case, average case much better
Pastry Join • X = new node, A = bootstrap node, Z = the existing node with nodeId numerically closest to X • A finds Z for X • In the process, A, Z, and all nodes on the path send their state tables to X • X settles on its own tables • Possibly after contacting other nodes • X then tells everyone who needs to know about itself
Pastry Leave • Noticed by leaf-set neighbors when the leaving node doesn't respond • Neighbors ask the highest and lowest nodes in their leaf set for a new leaf set • Noticed by routing-table neighbors when a message forward fails • They can immediately route via another neighbor • The entry is fixed by asking another node in the same "row" for its neighbor • If this fails, ask a node one row up
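A rough sketch of the leaf-set repair step, assuming integer ids and ignoring wrap-around; ask_for_leaf_set models the remote query to the node at the affected end of the leaf set:

```python
def repair_leaf_set(my_id, leaf_set, failed, ask_for_leaf_set):
    """Remove the failed node, ask the extreme live member on that side for its
    leaf set, and adopt the closest new candidate (illustrative sketch only)."""
    live = [n for n in leaf_set if n != failed]
    edge = max(live) if failed > my_id else min(live)     # node at the affected end
    candidates = [n for n in ask_for_leaf_set(edge)
                  if n not in live and n not in (my_id, failed)]
    if candidates:
        live.append(min(candidates, key=lambda n: abs(n - my_id)))
    return sorted(live)

# Toy example: node 90 fails, node 60 is asked and node 80 is adopted
print(repair_leaf_set(50, [30, 40, 60, 90], failed=90,
                      ask_for_leaf_set=lambda n: [60, 80, 95]))   # -> [30, 40, 60, 80]
```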