Peer-to-Peer Networks • Outline: Overview, Pastry, OpenDHT • Contributions from Peter Druschel & Sean Rhea CS 561
What is Peer-to-Peer About? • Distribution • Decentralized control • Self-organization • Symmetric communication • P2P networks do two things: • Map objects onto nodes • Route requests to the node responsible for a given object CS 561
Object Distribution • Consistent hashing [Karger et al. '97] • 128-bit circular id space (ids run from 0 to 2^128 - 1) • nodeIds (uniform random) • objIds (uniform random) • Invariant: the node with the numerically closest nodeId maintains the object [Figure: circular id space showing nodeIds and an objId] CS 561
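To make the invariant concrete, here is a minimal sketch (not Pastry's code) of hashing names into a 128-bit circular id space and picking the numerically closest nodeId; the function names are made up for illustration.

    # Illustrative sketch: map names onto a 128-bit circular id space and
    # find the node with the numerically closest nodeId.
    import hashlib

    ID_SPACE = 2 ** 128

    def to_id(name):
        # Hash an arbitrary name (node address, object key) into the id space.
        return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % ID_SPACE

    def circular_distance(a, b):
        d = abs(a - b)
        return min(d, ID_SPACE - d)

    def responsible_node(obj_id, node_ids):
        # Invariant: the node with the numerically closest nodeId stores the object.
        return min(node_ids, key=lambda n: circular_distance(n, obj_id))

    nodes = [to_id("node-%d" % i) for i in range(8)]
    print(hex(responsible_node(to_id("some-object"), nodes)))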
Object Insertion/Lookup • A message with key X is routed to the live node with nodeId closest to X • Problem: a complete routing table is not feasible [Figure: Route(X) traversing the circular id space (0 to 2^128 - 1) toward X] CS 561
Routing: Distributed Hash Table • Properties: log16 N steps, O(log N) state [Figure: circular id space with nodes 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1; Route(d46a1c) is forwarded hop by hop, matching a longer prefix of the key at each step] CS 561
Leaf Sets • Each node maintains the IP addresses of the nodes with the L/2 numerically closest larger and the L/2 numerically closest smaller nodeIds • Routing efficiency/robustness • Fault detection (keep-alive) • Application-specific local coordination CS 561
Routing Procedure

    if (D is within range of our leaf set)
        forward to numerically closest member
    else
        let l = length of shared prefix
        let d = value of l-th digit in D's address
        if (RouteTab[l,d] exists)
            forward to RouteTab[l,d]
        else
            forward to a known node that
            (a) shares at least as long a prefix, and
            (b) is numerically closer than this node

CS 561
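The same procedure, transcribed into a hedged Python sketch; the node fields (id, leaf_set, leaf_set_min, route_table, known_nodes) and the helpers shared_prefix_len / hex_digit are assumptions standing in for Pastry's real state, not its actual API.

    # Sketch of the forwarding rule above (field and helper names are assumptions).
    def next_hop(node, D):
        # Case 1: D is within the range of our leaf set -> forward to the
        # numerically closest member (possibly ourselves).
        if node.leaf_set_min <= D <= node.leaf_set_max:
            return min(node.leaf_set | {node.id}, key=lambda n: abs(n - D))
        # Case 2: use the routing table entry indexed by (length of shared
        # prefix, value of the next digit of D).
        l = shared_prefix_len(node.id, D)   # hex digits in common with D
        d = hex_digit(D, l)                 # value of the l-th digit of D
        entry = node.route_table.get((l, d))
        if entry is not None:
            return entry
        # Case 3 (rare): forward to any known node that (a) shares at least
        # as long a prefix with D and (b) is numerically closer to D than we are.
        for n in node.known_nodes():
            if shared_prefix_len(n, D) >= l and abs(n - D) < abs(node.id - D):
                return n
        return node.id  # no better node known: this node is the destination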
Routing Integrity of overlay: • guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously Number of routing hops: • No failures: < log16 N expected, 128/b + 1 max • During failure recovery: O(N) worst case, average case much better CS 561
Node Addition • New node: d46a1c [Figure: the new node's join message Route(d46a1c) is forwarded across the overlay through existing nodes 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1] CS 561
Node Departure (Failure) Leaf set members exchange keep-alive messages • Leaf set repair (eager): request set from farthest live node in set • Routing table repair (lazy): get table from peers in the same row, then higher rows CS 561
PAST: Cooperative, archival file storage and distribution • Layered on top of Pastry • Strong persistence • High availability • Scalability • Reduced cost (no backup) • Efficient use of pooled resources CS 561
PAST API • Insert - store replica of a file at k diverse storage nodes • Lookup - retrieve file from a nearby live storage node that holds a copy • Reclaim - free storage associated with a file Files are immutable CS 561
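As a sketch of how these three calls could sit on top of a Pastry-like overlay; the class, the dht helpers (k_closest_nodes, route_to_closest), and the SHA-1 fileIds are illustrative assumptions, not PAST's actual code.

    # Illustrative PAST-style interface over an assumed DHT routing layer.
    import hashlib

    class PastClient:
        def __init__(self, dht, k):
            self.dht = dht   # underlying Pastry-like overlay (assumed API)
            self.k = k       # replication factor, bounded by the leaf set size

        def insert(self, data):
            file_id = int.from_bytes(hashlib.sha1(data).digest(), "big")
            # Store replicas on the k nodes with nodeIds closest to fileId.
            for node in self.dht.k_closest_nodes(file_id, self.k):
                node.store(file_id, data)
            return file_id

        def lookup(self, file_id):
            # Routing normally reaches a nearby replica holder first.
            return self.dht.route_to_closest(file_id).fetch(file_id)

        def reclaim(self, file_id):
            # Frees the storage; files are immutable, so there is no update.
            for node in self.dht.k_closest_nodes(file_id, self.k):
                node.delete(file_id)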
PAST: File Storage [Figure: Insert fileId routes the file toward the node with nodeId numerically closest to fileId] CS 561
PAST: File Storage • Storage Invariant: file "replicas" are stored on the k nodes with nodeIds closest to fileId (k is bounded by the leaf set size) [Figure: Insert fileId with k = 4 replicas] CS 561
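A tiny sketch of the invariant itself, assuming a flat list of nodeIds; in the running system these k closest nodes are simply the fileId's root node and its leaf-set neighbors, which is why k is bounded by the leaf set size.

    # Pick the k nodeIds numerically closest to fileId on the circular space.
    ID_SPACE = 2 ** 128

    def replica_set(file_id, node_ids, k):
        def dist(n):
            d = abs(n - file_id)
            return min(d, ID_SPACE - d)
        return sorted(node_ids, key=dist)[:k]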
PAST: File Retrieval • Lookup: the file is located in log16 N steps (expected) • Usually locates the replica nearest the client C [Figure: client C looks up fileId and retrieves from one of the k replicas] CS 561
DHT Deployment Today • Applications: PAST (MSR/Rice), Overnet (open), i3 (UCB), CFS (MIT), OStore (UCB), pSearch (HP), PIER (UCB), Coral (NYU) • Each runs its own DHT: Pastry, Kademlia, Chord, Tapestry, CAN, Bamboo, ... • Every application deploys its own DHT (DHT as a library), directly over IP connectivity CS 561
DHT Deployment Tomorrow? • Same applications: PAST (MSR/Rice), Overnet (open), i3 (UCB), CFS (MIT), OStore (UCB), pSearch (HP), PIER (UCB), Coral (NYU) • OpenDHT: one DHT, shared across applications (DHT as a service), reached through DHT indirection over IP connectivity [Figure: most applications share one DHT layer instead of each deploying its own] CS 561
Two Ways To Use a DHT • The Library Model • DHT code is linked into application binary • Pros: flexibility, high performance • The Service Model • DHT accessed as a service over RPC • Pros: easier deployment, less maintenance CS 561
The OpenDHT Service • 200-300 Bamboo [USENIX’04] nodes on PlanetLab • All in one slice, all managed by us • Clients can be arbitrary Internet hosts • Access DHT using RPC over TCP • Interface is simple put/get: • put(key, value) — stores value under key • get(key) — returns all the values stored under key • Running on PlanetLab since April 2004 • Building a community of users CS 561
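A toy in-memory model of this put/get interface, mimicking the semantics listed above (values accumulate under a key) rather than OpenDHT's actual RPC client:

    # Toy model of the put/get semantics above (not the real RPC interface).
    from collections import defaultdict

    class ToyDHT:
        def __init__(self):
            self.table = defaultdict(list)

        def put(self, key, value):
            # Stores value under key; puts with the same key do not overwrite.
            self.table[key].append(value)

        def get(self, key):
            # Returns all the values stored under key.
            return list(self.table[key])

    dht = ToyDHT()
    dht.put(b"disc-fingerprint", b"Album A / Track listing")
    dht.put(b"disc-fingerprint", b"Album A (alternate metadata)")
    print(dht.get(b"disc-fingerprint"))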
OpenDHT Applications CS 561
An Example Application: The CD Database [Figure: the client computes a disc fingerprint, asks the database "Recognize fingerprint?", and receives the album & track titles] CS 561
An Example Application: The CD Database [Figure: if the database answers "No such fingerprint", the user types in the album and track titles, which are sent back to the database] CS 561
A DHT-Based FreeDB Cache • FreeDB is a volunteer service • Has suffered outages as long as 48 hours • Service costs are borne largely by volunteer mirrors • Idea: build a cache of FreeDB with a DHT • Adds to the availability of the main service • Goal: explore how easy this is to do CS 561
Cache Illustration [Figure: new albums are put into the DHT keyed by disc fingerprint; clients query the DHT with a disc fingerprint and get back the disc info] CS 561
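A hedged sketch of the cache-aside logic the figure suggests: look the disc fingerprint up in the DHT first, fall back to FreeDB on a miss, and backfill the DHT. The function names and key format are illustrative assumptions, not the actual implementation.

    # Illustrative cache-aside lookup for the FreeDB cache.
    def lookup_disc(dht, fingerprint, freedb_query):
        cached = dht.get(fingerprint.encode())
        if cached:
            return cached[0]                  # cache hit: serve from the DHT
        info = freedb_query(fingerprint)      # miss: ask the FreeDB service
        if info is not None:
            dht.put(fingerprint.encode(), info)  # backfill the cache
        return info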
Is Providing DHT Service Hard? • Is it any different than just running Bamboo? • Yes, sharing makes the problem harder • OpenDHT is shared in two senses • Across applications: need a flexible interface • Across clients: need resource allocation CS 561
Sharing Between Applications • Must balance generality and ease-of-use • Many apps want only simple put/get • Others want lookup, anycast, multicast, etc. • OpenDHT allows only put/get • But use client-side library, ReDiR, to build others • Supports lookup, anycast, multicast, range search • Only constant latency increase on average • (Different approach used by DimChord [KR04]) CS 561
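ReDiR itself is more involved than fits on a slide; purely to illustrate what building primitives client-side over put/get means, here is a hedged sketch of a naive rendezvous-style join/anycast. This is NOT the ReDiR algorithm, which uses a tree of rendezvous keys to keep lookup costs low.

    # Naive group join/anycast layered on put/get, for illustration only.
    # ReDiR replaces the single rendezvous key with a tree of keys.
    import random

    def join_group(dht, group, member_addr):
        dht.put(group.encode(), member_addr.encode())   # advertise membership

    def anycast(dht, group):
        members = dht.get(group.encode())               # fetch current members
        return random.choice(members).decode() if members else None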
Sharing Between Clients • Must authenticate puts/gets/removes • If two clients put with same key, who wins? • Who can remove an existing put? • Must protect system’s resources • Or malicious clients can deny service to others • The remainder of this talk CS 561
Fair Storage Allocation • Our solution: give each client a fair share • Will define “fairness” in a few slides • Limits strength of malicious clients • Only as powerful as they are numerous • Protect storage on each DHT node separately • Global fairness is hard • Key choice imbalance is a burden on DHT • Reward clients that balance their key choices CS 561
Two Main Challenges • Making sure disk is available for new puts • As load changes over time, need to adapt • Without some free disk, our hands are tied • Allocating free disk fairly across clients • Adapt techniques from fair queuing CS 561
Making Sure Disk is Available • Can't store values indefinitely • Otherwise all storage will eventually fill • Add a time-to-live (TTL) to puts: put(key, value) → put(key, value, ttl) • (Different approach used by Palimpsest [RH03]) CS 561
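A minimal sketch of what the extra ttl argument implies for the store: each value carries an expiry time, and expired values are dropped (here, lazily on get). The class and field names are made up.

    # Sketch of put with a TTL: values expire and are pruned lazily on get.
    import time
    from collections import defaultdict

    class TtlDHT:
        def __init__(self):
            self.table = defaultdict(list)   # key -> list of (expiry_time, value)

        def put(self, key, value, ttl):
            self.table[key].append((time.time() + ttl, value))

        def get(self, key):
            now = time.time()
            live = [(exp, v) for (exp, v) in self.table[key] if exp > now]
            self.table[key] = live           # drop expired entries
            return [v for _, v in live]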
Making Sure Disk is Available • TTLs prevent long-term starvation • Eventually all puts will expire • Can still get short-term starvation [Figure timeline: client A arrives and fills the entire disk; client B arrives and asks for space; B starves until client A's values start expiring] CS 561
Making Sure Disk is Available • Stronger condition: be able to accept rmin bytes/sec of new data at all times [Figure: space vs. time; existing commitments expire over time, a candidate put of a given size and TTL is added, and space with slope rmin is reserved for future puts; the sum must stay below the max capacity] CS 561
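A hedged sketch of the acceptance test the figure describes: a candidate put is admitted only if, at every instant out to the maximum TTL, the bytes still committed plus the candidate plus the space reserved for future puts arriving at rmin stays under the disk capacity. The bookkeeping below is simplified for illustration (integer seconds, a flat list of commitments) and is not OpenDHT's exact code.

    # Simplified admission test for "accept r_min bytes/sec at all times".
    # committed is a list of (size_bytes, expiry_time) for already-accepted puts.
    def can_accept(committed, cand_size, cand_ttl, now, capacity, r_min, max_ttl):
        for tau in range(0, max_ttl + 1):            # look ahead up to the max TTL
            t = now + tau
            still_stored = sum(size for size, exp in committed if exp > t)
            candidate = cand_size if tau < cand_ttl else 0
            reserved_for_future = r_min * tau        # the slope-r_min region
            if still_stored + candidate + reserved_for_future > capacity:
                return False
        return True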
Fair Storage Allocation [Figure: incoming puts go into per-client put queues; if a client's queue is full, the put is rejected, otherwise it is enqueued; the node waits until it can accept a put without violating rmin, selects the most under-represented client, stores the value, and sends an accept message to the client] • The Big Decision: the definition of "most under-represented" CS 561
Defining "Most Under-Represented" • Not just sharing disk, but disk over time • A 1-byte put for 100 s is the same as a 100-byte put for 1 s • So the units are byte-seconds; call them commitments • Equalize total commitments granted? • No: leads to starvation • A fills the disk, B starts putting, A starves for up to the max TTL [Figure timeline: client A arrives and fills the entire disk; client B arrives and asks for space; B catches up with A, and now A starves] CS 561
Defining "Most Under-Represented" • Instead, equalize the rate of commitments granted • Service granted to one client depends only on the others putting "at the same time" [Figure timeline: client A arrives and fills the entire disk; client B arrives and asks for space; B catches up with A, and then A & B share the available rate] CS 561
Defining "Most Under-Represented" • Instead, equalize the rate of commitments granted • Service granted to one client depends only on the others putting "at the same time" • Mechanism inspired by Start-time Fair Queuing • Have a virtual time v(t) • Each put p_c^i (client c's i-th put, arriving at time A(p_c^i)) gets a start time S(p_c^i) and a finish time F(p_c^i):

    F(p_c^i) = S(p_c^i) + size(p_c^i) × ttl(p_c^i)
    S(p_c^i) = max(v(A(p_c^i)) − δ, F(p_c^{i-1}))
    v(t)     = maximum start time of all accepted puts

CS 561
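A hedged sketch of the bookkeeping behind these formulas: each accepted put advances its client's finish tag by size × ttl, and (following start-time fair queuing) the queued put with the smallest start tag belongs to the most under-represented client. The class, the delta handling, and the data structures are illustrative, not OpenDHT's code.

    # Sketch of fair-queuing tags over commitments (byte-seconds).
    class FairAllocator:
        def __init__(self, delta=0.0):
            self.v = 0.0             # virtual time: max start tag accepted so far
            self.last_finish = {}    # per-client finish tag F(p_c^{i-1})
            self.delta = delta

        def tag(self, client, size, ttl):
            # S(p_c^i) = max(v(A(p_c^i)) - delta, F(p_c^{i-1}))
            start = max(self.v - self.delta, self.last_finish.get(client, 0.0))
            # F(p_c^i) = S(p_c^i) + size(p_c^i) * ttl(p_c^i)
            finish = start + size * ttl
            return start, finish

        def accept(self, client, start, finish):
            # Called when this client's queued put (smallest start tag among
            # the queue heads) is chosen and stored.
            self.last_finish[client] = finish
            self.v = max(self.v, start)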
Performance • Only 28 of 7 million values lost in 3 months • Where “lost” means unavailable for a full hour • On Feb. 7, 2005, lost 60/190 nodes in 15 minutes to PL kernel bug, only lost one value CS 561
Performance • Median get latency ~250 ms • Median RTT between hosts ~ 140 ms • But 95th percentile get latency is atrocious • And even median spikes up from time to time CS 561
The Problem: Slow Nodes • Some PlanetLab nodes are just really slow • But set of slow nodes changes over time • Can’t “cherry pick” a set of fast nodes • Seems to be the case on RON as well • May even be true for managed clusters (MapReduce) • Modified OpenDHT to be robust to such slowness • Combination of delay-aware routing and redundancy • Median now 66 ms, 99th percentile is 320 ms • using 2X redundancy CS 561
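As a client-side flavor of the redundancy half of that fix, here is a hedged sketch that issues the same get to two gateways and returns whichever answers first; the gateway objects are placeholders, and the real change was made inside OpenDHT itself (delay-aware routing combined with redundant requests).

    # Mask slow nodes with 2x redundancy: race two gateways, take the first reply.
    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def redundant_get(gateways, key, timeout=5.0):
        pool = ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(gw.get, key) for gw in gateways[:2]]
        done, _ = wait(futures, timeout=timeout, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)     # don't block on the slower request
        for f in done:
            if f.exception() is None:
                return f.result()     # first successful answer wins
        return None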