Distributed Hash Tables Parallel and Distributed Computing Spring 2011 anda@cse.usf.edu
Distributed Hash Tables • Academic answer to p2p • Goals • Guaranteed lookup success • Provable bounds on search time • Provable scalability • Makes some things harder • Fuzzy queries / full-text search / etc. • Hot Topic in networking since introduction in ~2000/2001
DHT: Overview • Abstraction: a distributed “hash-table” (DHT) data structure supports two operations: • put(id, item); • item = get(id); • Implementation: nodes in system form a distributed data structure • Can be Ring, Tree, Hypercube, Skip List, Butterfly Network, ...
What Is a DHT? • A building block used to locate key-based objects over millions of hosts on the internet • Inspired by the traditional hash table: • key = Hash(name) • put(key, value) • get(key) -> value • Challenges • Decentralized: no central authority • Scalable: low network traffic overhead • Efficient: find items quickly (low latency) • Dynamic: nodes fail, new nodes join • General-purpose: flexible naming
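A minimal sketch of the hash-table-style interface described above, with the actual distribution stubbed out; `LocalDHT` and `dht_key` are illustrative names, not part of any real DHT library:

```python
import hashlib

def dht_key(name: str) -> int:
    """Map an application-level name to a fixed-size key (here a 160-bit SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

class LocalDHT:
    """Stand-in for a real DHT client: same put/get interface, backed by a plain dict."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        # A real DHT would route this to whichever node owns `key`.
        self._store[key] = value

    def get(self, key):
        # A real DHT would route the lookup the same way.
        return self._store.get(key)

dht = LocalDHT()
k = dht_key("yellow-submarine.mp3")
dht.put(k, b"<file data or its location>")
assert dht.get(k) == b"<file data or its location>"
```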
The Lookup Problem • A publisher somewhere in the network does Put(key=“title”, value=file data…); a client elsewhere issues Get(key=“title”) • Question: which node (N1…N6) should store the mapping, and how does the client find it across the Internet? • (diagram: publisher, client, and nodes N1–N6 connected through the Internet)
DHTs: Main Idea • The publisher stores Key=H(audio data), Value={artist, album title, track title} • The client issues Lookup(H(audio data)) and is routed to the node responsible for that key • (diagram: publisher, client, and nodes N1–N9)
DHT: Overview (2) • Structured Overlay Routing: • Join: On startup, contact a “bootstrap” node and integrate yourself into the distributed data structure; get a node id • Publish: Route publication for file id toward a close node id along the data structure • Search: Route a query for file id toward a close node id. Data structure guarantees that query will meet the publication. • Fetch: Two options: • Publication contains actual file => fetch from where query stops • Publication says “I have file X” => query tells you 128.2.1.3 has X, use IP routing to get X from 128.2.1.3
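The second fetch option (the publication only says who has the file) can be sketched as follows, reusing the illustrative `dht_key`/`LocalDHT` names from the earlier sketch; `download_from` is a hypothetical stand-in for a direct IP transfer:

```python
def download_from(ip: str, file_name: str) -> bytes:
    """Placeholder for fetching the file directly from the publisher over plain IP routing."""
    raise NotImplementedError(f"would contact {ip} to fetch {file_name}")

def publish(dht, file_name: str, my_ip: str) -> None:
    """Indirection: store *where* the file is, not the file itself."""
    dht.put(dht_key(file_name), my_ip.encode())

def fetch(dht, file_name: str) -> bytes:
    owner_ip = dht.get(dht_key(file_name))               # the query meets the publication
    return download_from(owner_ip.decode(), file_name)   # then use plain IP routing
```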
From Hash Tables to Distributed Hash Tables Challenge: scalably distributing the index space: • Scalability issue with ordinary hash tables: adding a new node (bucket) => many items must be moved • Solution: consistent hashing (Karger 97) Consistent hashing: • Circular ID space with a distance metric • Objects and nodes mapped onto the same space • A key is stored at its successor: the node with the next higher ID
DHT: Consistent Hashing • Circular ID space with nodes N32, N90, N105 and keys K5, K20, K80 • A key is stored at its successor: the node with the next higher ID • Here K5 and K20 are stored at N32, and K80 at N90 • (diagram: keys and nodes placed on the ring)
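A sketch of consistent hashing as described above; the node addresses and the use of the SHA-1 space are illustrative choices:

```python
import bisect
import hashlib

def h(x: str) -> int:
    return int.from_bytes(hashlib.sha1(x.encode()).digest(), "big")

class ConsistentHashRing:
    """Circular ID space: a key lives at its successor, the first node ID >= the key."""
    def __init__(self, node_names):
        self.ring = sorted(h(n) for n in node_names)      # node IDs in ring order
        self.name_of = {h(n): n for n in node_names}

    def successor(self, key_id: int) -> str:
        i = bisect.bisect_left(self.ring, key_id) % len(self.ring)   # wrap past the top
        return self.name_of[self.ring[i]]

ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.successor(h("yellow-submarine.mp3")))   # the node responsible for this key
```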
What Is a DHT? • Distributed Hash Table: key = Hash(data) lookup(key) -> IP address put(key, value) get( key) -> value • API supports a wide range of applications • DHT imposes no structure/meaning on keys • Key/value pairs are persistent and global • Can store keys in other DHT values • And thus build complex data structures
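As an illustration of storing keys inside other DHT values, here is a toy "directory" layered on top of the put/get sketch from earlier (`LocalDHT`/`dht_key` are the assumed, illustrative names):

```python
import json

def put_directory(dht, dir_name: str, file_names):
    """Store each file under its own key, then store a directory value that
    simply lists those keys -- a DHT value that points at other DHT values."""
    entry_keys = []
    for name in file_names:
        k = dht_key(name)
        dht.put(k, f"<contents of {name}>".encode())
        entry_keys.append(k)
    dht.put(dht_key(dir_name), json.dumps(entry_keys).encode())

def list_directory(dht, dir_name: str):
    """Follow the indirection: fetch the directory value, then each referenced value."""
    entry_keys = json.loads(dht.get(dht_key(dir_name)))
    return [dht.get(k) for k in entry_keys]
```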
Approaches • Different strategies • Chord: constructing a distributed hash table • CAN: Routing in a d-dimensional space • Many more… • Commonalities • Each peer maintains a small part of the index information (routing table) • Searches performed by directed message forwarding • Differences • Performance and qualitative criteria
DHT: Example - Chord • Associate with each node and each file a unique id in a one-dimensional space (a ring) • E.g., pick ids from the range [0 ... 2^m − 1] • Usually the hash of the file or of the IP address • Properties: • Routing table size is O(log N), where N is the total number of nodes • Guarantees that a file is found in O(log N) hops
Example 1: Distributed Hash Tables (Chord) • Hashing of search keys AND peer addresses onto binary keys of length m • Key identifier = SHA-1(key); node identifier = SHA-1(IP address) • SHA-1 distributes both uniformly • e.g. m = 8: key("yellow-submarine.mp3") = 17, key(192.178.0.1) = 3 • Data keys are stored at the next larger node key: data with hashed identifier k is stored at the peer p whose hashed identifier is the smallest node ID larger than k (k's successor); the node before k is its predecessor • (ring diagram: m = 8, keys stored at their successor peers, e.g. p2, p3) • Search possibilities? 1. Every peer knows every other peer: O(n) routing table size. 2. Peers know only their successor: O(1) routing table but O(n) search cost
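A sketch of mapping both data keys and peer addresses into the same m-bit identifier space; the concrete values it prints will differ from the 17 and 3 on the slide, which are just example numbers:

```python
import hashlib

def chord_id(s: str, m: int = 8) -> int:
    """Hash a data key or an IP address into the same m-bit identifier space."""
    digest = int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")
    return digest % (2 ** m)

key_id  = chord_id("yellow-submarine.mp3")   # data identifier in [0, 2^8 - 1]
node_id = chord_id("192.178.0.1")            # node identifier in the same space
# The key is stored at the first peer whose identifier is >= key_id (wrapping mod 2^m).
```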
DHT: Chord Basic Lookup • With only successor pointers, node N10's query “Where is key 80?” is passed around the ring (N10 → N32 → N60 → N90) until it reaches the answer “N90 has K80” • (diagram: ring with nodes N10, N32, N60, N90, N105, N120 and key K80 stored at N90)
DHT: Chord “Finger Table” • (diagram: node N80 with fingers spanning 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128 of the ring) • Entry i in the finger table of node n is the first node that succeeds or equals n + 2^i • In other words, the i-th finger points 1/2^(m−i) of the way around the ring
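A sketch of how the finger targets and table entries could be computed; m = 7 matches the 1/2 … 1/128 fractions in the figure, and the ring membership below is hypothetical:

```python
def finger_targets(n: int, m: int = 7):
    """Identifier that finger i of node n should point at: (n + 2**i) mod 2**m."""
    return [(n + 2 ** i) % (2 ** m) for i in range(m)]

def successor(nodes, ident: int, m: int = 7):
    """First node at or clockwise after `ident` on the ring."""
    space = 2 ** m
    return min(nodes, key=lambda p: (p - ident) % space)

nodes = [16, 32, 45, 80, 85, 96, 112]                  # hypothetical ring membership
finger_table = [successor(nodes, t) for t in finger_targets(80)]
print(list(zip(finger_targets(80), finger_table)))     # e.g. target 81 -> node 85, ...
```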
DHT: Chord Join • Assume an identifier space [0..7] (m = 3) • Node n1 joins • Succ. table of n1 (entries are id + 2^i → successor, for i = 0, 1, 2): (2 → 1), (3 → 1), (5 → 1) • (ring diagram: identifiers 0–7)
DHT: Chord Join • Node n2 joins • Succ. table of n1: (2 → 2), (3 → 1), (5 → 1) • Succ. table of n2: (3 → 1), (4 → 1), (6 → 1)
DHT: Chord Join • Nodes n0 and n6 join • Succ. table of n0: (1 → 1), (2 → 2), (4 → 6) • Succ. table of n1: (2 → 2), (3 → 6), (5 → 6) • Succ. table of n2: (3 → 6), (4 → 6), (6 → 6) • Succ. table of n6: (7 → 0), (0 → 0), (2 → 2)
DHT: Chord Join • Nodes: n0, n1, n2, n6 • Item: f7 (key 7), stored at n0, the successor of 7 • Succ. tables unchanged: n0: (1 → 1), (2 → 2), (4 → 6); n1: (2 → 2), (3 → 6), (5 → 6); n2: (3 → 6), (4 → 6), (6 → 6); n6: (7 → 0), (0 → 0), (2 → 2)
DHT: Chord Routing • Upon receiving a query for item id, a node: • checks whether it stores the item locally • if not, forwards the query to the largest entry in its successor table that does not exceed id (i.e. the farthest known node that does not overshoot the target) • Example: query(7) is routed around the ring until it reaches n0, which stores f7 • Succ. tables as before: n0: (1 → 1), (2 → 2), (4 → 6); n1: (2 → 2), (3 → 6), (5 → 6); n2: (3 → 6), (4 → 6), (6 → 6); n6: (7 → 0), (0 → 0), (2 → 2)
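A sketch of this greedy rule, using the m = 3 example above (the start node for query(7) is chosen purely for illustration):

```python
def between(x, a, b, space):
    """True if x lies in the half-open ring interval (a, b]."""
    return x != a and (x - a) % space <= (b - a) % space

def route(query_id, start, succ_tables, m=3):
    """Greedy Chord-style lookup: hop to the farthest table entry that does not
    pass the target; stop once the current node's successor owns the key."""
    space = 2 ** m
    node, hops = start, [start]
    while True:
        entries = succ_tables[node]
        nxt = entries[0]                               # immediate successor
        if between(query_id, node, nxt, space):        # successor owns the key: done
            hops.append(nxt)
            return hops
        cur = node
        ahead = [s for s in entries if between(s, cur, query_id, space)]
        node = max(ahead, key=lambda s: (s - cur) % space) if ahead else nxt
        hops.append(node)

# Successor tables for the m = 3 example (nodes 0, 1, 2, 6):
succ_tables = {0: [1, 2, 6], 1: [2, 6, 6], 2: [6, 6, 6], 6: [0, 0, 2]}
print(route(7, 1, succ_tables))   # query(7) starting at node 1: path 1 -> 6 -> 0
```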
DHT: Chord Summary • Routing table size? • log N fingers • Routing time? • Each hop is expected to halve the remaining distance to the target id => O(log N) hops.
Load Balancing in Chord • (figure: distribution of the number of keys per node; network size n = 10^4, 5 × 10^5 keys)
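To get a feel for what such a figure measures (parameters taken from the slide: n = 10^4 nodes, 5 × 10^5 keys), a quick simulation of keys per node under plain consistent hashing, with one ID per node and no virtual nodes:

```python
import bisect
import collections
import hashlib

def h(x: str) -> int:
    return int.from_bytes(hashlib.sha1(x.encode()).digest(), "big")

n_nodes, n_keys = 10_000, 500_000                 # parameters from the slide
node_ids = sorted(h(f"node-{i}") for i in range(n_nodes))

counts = collections.Counter()
for j in range(n_keys):
    key_id = h(f"key-{j}")
    owner = bisect.bisect_left(node_ids, key_id) % n_nodes   # successor owns the key
    counts[owner] += 1

loads = sorted(counts.get(i, 0) for i in range(n_nodes))     # include zero-key nodes
print("min / median / max keys per node:", loads[0], loads[n_nodes // 2], loads[-1])
```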
Length of Search Paths • (figure: distribution of search-path lengths; network size n = 2^12, 100 × 2^12 keys) • Average path length ≈ ½ log2(n)
Chord Discussion • Performance • Search latency: O(log n) (with high probability, provable) • Message Bandwidth: O(log n) (selective routing) • Storage cost: O(log n) (routing table) • Update cost: low (like search) • Node join/leave cost: O(log² n) • Resilience to failures: replication to successor nodes • Qualitative Criteria • search predicates: equality of keys only • global knowledge: key hashing, network origin • peer autonomy: nodes have, by virtue of their address, a specific role in the network
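The "replication to successor nodes" point can be sketched as writing each key to its successor plus the next r − 1 nodes clockwise; `dht_nodes` (a map from node ID to that node's local store) and r = 3 are illustrative assumptions, not Chord's actual maintenance protocol:

```python
import bisect

def replica_nodes(sorted_node_ids, key_id, r=3):
    """The successor of key_id plus the next r - 1 nodes clockwise on the ring."""
    i = bisect.bisect_left(sorted_node_ids, key_id) % len(sorted_node_ids)
    return [sorted_node_ids[(i + j) % len(sorted_node_ids)] for j in range(r)]

def replicated_put(dht_nodes, sorted_node_ids, key_id, value, r=3):
    """Store the value on r consecutive nodes so losing any single node loses no data."""
    for node_id in replica_nodes(sorted_node_ids, key_id, r):
        dht_nodes[node_id][key_id] = value
```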
Example 2: Topological Routing (CAN) • Based on hashing of keys into a d-dimensional space (a torus) • Each peer is responsible for keys of a subvolume of the space (a zone) • Each peer stores the addresses of peers responsible for the neighboring zones for routing • Search requests are greedily forwarded to the peers in the closest zones • Assignment of peers to zones depends on a random selection made by the peer
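A sketch of the greedy forwarding idea on the d-dimensional unit torus; `neighbors_of` is assumed to return the centre points of the neighboring zones and is an illustrative parameter, not part of the CAN paper's API:

```python
def torus_dist(a, b):
    """Euclidean distance on the unit d-torus (each coordinate wraps around at 1)."""
    return sum(min(abs(x - y), 1 - abs(x - y)) ** 2 for x, y in zip(a, b)) ** 0.5

def greedy_route(start, target, neighbors_of, max_hops=1000):
    """Forward the request to whichever neighboring zone is closest to the target point."""
    node, path = start, [start]
    for _ in range(max_hops):
        best = min(neighbors_of(node), key=lambda p: torus_dist(p, target))
        if torus_dist(best, target) >= torus_dist(node, target):
            break                  # no neighbor is closer: the current zone owns the point
        node = best
        path.append(node)
    return path
```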
CAN: Network Search and Join • Node 7 joins the network by choosing a random coordinate inside the zone currently owned by node 1; node 1 then splits its zone and hands half of it (and the corresponding keys) to node 7
CAN Refinements • Multiple realities • We can have r different coordinate spaces • Nodes hold a zone in each of them • Creates r replicas of the (key, value) pairs • Increases robustness • Reduces path length, as a search can be continued in the reality where the target is closest • Overloading zones • Several peers are responsible for the same zone • Splits are only performed if a maximum occupancy (e.g. 4) is reached • Nodes know all other nodes in the same zone • But only one peer per neighboring zone
CAN Discussion • Performance • Search latency: O(d · n^(1/d)), depends on the choice of d (with high probability, provable) • Message Bandwidth: O(d · n^(1/d)) (selective routing) • Storage cost: O(d) (routing table) • Update cost: low (like search) • Node join/leave cost: O(d · n^(1/d)) • Resilience to failures: realities and overloading • Qualitative Criteria • search predicates: spatial distance of multidimensional keys • global knowledge: key hashing, network origin • peer autonomy: nodes can decide on their position in the key space
DHT Applications Not only for sharing music anymore… • Global file systems [OceanStore, CFS, PAST, Pastiche, UsenetDHT] • Naming services [Chord-DNS, Twine, SFR] • DB query processing [PIER, Wisc] • Internet-scale data structures [PHT, Cone, SkipGraphs] • Communication services [i3, MCAN, Bayeux] • Event notification [Scribe, Herald] • File sharing [OverNet]
DHT: Discussion • Pros: • Guaranteed Lookup • O(log N) per node state and search scope • Cons: • No one uses them? (only one file sharing app) • Supporting non-exact match search is hard
When are p2p / DHTs useful? • Caching and “soft-state” data • Works well! BitTorrent, KaZaA, etc., all use peers as caches for hot data • Finding read-only data • Limited flooding finds hay • DHTs find needles • BUT
A Peer-to-peer Google? • Complex intersection queries (“the” + “who”) • Billions of hits for each term alone • Sophisticated ranking • Must compare many results before returning a subset to user • Very, very hard for a DHT / p2p system • Need high inter-node bandwidth • (This is exactly what Google does - massive clusters)
Writable, persistent p2p • Do you trust your data to 100,000 monkeys? • Node availability hurts • Ex: store 5 copies of the data on different nodes • When a node goes away, you must re-replicate the data it held • Hard drives are *huge*, but cable-modem upload bandwidth is tiny - perhaps 10 GB/day • At that rate, re-uploading the contents of a 200 GB hard drive takes about 20 days. A very expensive leave/replication situation!
Research Trends: A Superficial History Based on Articles in IPTPS • In the early ‘00s (2002-2004): • DHT-related applications, optimizations, reevaluations… (more than 50% of IPTPS papers!) • System characterization • Anonymization • 2005-… • BitTorrent: improvements, alternatives, gaming it • Security, incentives • More recently: • Live streaming • P2P TV (IPTV) • Games over P2P
What’s Missing? • Very important lessons learned • …but did we move beyond vertically-integrated applications? • Can we distribute complex services on top of p2p overlays?
P2P: Summary • Many different styles; remember pros and cons of each • centralized, flooding, swarming, unstructured and structured routing • Lessons learned: • Single points of failure are very bad • Flooding messages to everyone is bad • Underlying network topology is important • Not all nodes are equal • Need incentives to discourage freeloading • Privacy and security are important • Structure can provide theoretical bounds and guarantees