Plaxton's Mesh: Dynamic and Reliable DHT for Distributed Systems

4.3 Pastry • A DHT based on “Plaxton’s Mesh” • Plaxton’s idea: • Prefix-oriented search • Create copies of files in a distributed system • The number of nodes is fixed. • Pastry: • Based on Plaxton’s mesh • Supports dynamic changes in the set of nodes • Improved reliability • Considers the locality of nodes (proximity support)

Plaxton’s Mesh (1) • The method of Plaxton, Rajaraman and Richa (1997) places the copies of the provided data (here: objects) so that access time is minimized for all participants of the distributed system.

Plaxton’s Mesh (2) • Objects as copies • Each object A has a unique identification of size log(m) bits. The data volume of an object A in bits is denoted l(A). Each of the |V| = n nodes has a unique address, a bit string of length log(n). • Structure of the network • The IDs of nodes and stored objects are integers written to the base B = 2^b, b ∈ ℕ. Objects are assigned to nodes by the following rule: an object with ID A is stored on the node whose ID has the longest common prefix with A. If several such nodes exist, the one whose suffix matches A in the most bits is chosen. • This unique node is called the root. • The intention of this allocation is to distribute the objects as evenly as possible over the nodes.
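
The assignment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: IDs are modeled as equal-length digit strings, the tie-break is read here as "longest common suffix" (one interpretation of the slide's wording), and the names `common_prefix_len`, `common_suffix_len` and `root_node` are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_suffix_len(a, b):
    """Number of trailing characters that a and b share."""
    return common_prefix_len(a[::-1], b[::-1])


def root_node(object_id, node_ids):
    """Pick the node with the longest common prefix with object_id.

    Ties are broken by the longest common suffix (an assumption about
    what "suffix of a maximum bit match" means on the slide).
    """
    return max(
        node_ids,
        key=lambda n: (common_prefix_len(n, object_id),
                       common_suffix_len(n, object_id)),
    )


nodes = ["0120", "0131", "0211", "3210"]
print(root_node("0132", nodes))             # longest prefix "013" decides
print(root_node("0123", ["0100", "0113"]))  # prefix tie, suffix decides
```

In the second call both candidates share the prefix "01" with the object ID, so the suffix comparison selects the root.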

Plaxton’s Mesh (3) • For routing, every node v ∈ V has two tables: a pointer table and a neighbor table. • The entries in the pointer table of node u are triplets (A,v,k). The triplet indicates that object A is located on node v ∈ V with a cost of c(u,v) ≤ k. • The pointer table contains a selection of all objects in the network. • In the following description of the neighbor table, the parameter i refers to the length of the prefix of a node ID or object ID, and the parameter j identifies a character of the base {0,1,…,B-1}. • The neighbor table of a node v contains entries of the following types: • Primary (i,j) Neighbors • The node u ∈ V whose ID matches the ID of v in a prefix of i-1 characters and whose character at position i equals j. If several nodes have these properties, the one that minimizes the cost c(u,v) is selected. • If no such node exists, the node with the longest common prefix with v is selected. • If this selection leads to more than one node, the node with the maximum ID is selected from this set.

Plaxton’s Mesh (4) • Secondary (i,j) Neighbors • Let u be the primary (i,j)-neighbor of v and let the character in position i of the ID of u equal j. As secondary (i,j)-neighbors, a constant number d of nodes v’ are chosen whose ID matches the ID of v up to the (i-1)st character, whose next character is j, and which meet the condition c(v,v’) ≤ d*c(v,u). • If this selected set has fewer than d elements, it is extended with the nodes v’’ for which the value c(v,v’’) is minimal. • Primary Reverse (i,j) Neighbors • A node u is a primary reverse (i,j)-neighbor of v if v is the primary (i,j)-neighbor of u.

Plaxton’s Mesh (5) • Results (no proof is given here) • The expected cost of a read operation is O(f(l(A))·c(v,u)) if v wants to read object A and u is the nearest node to v that holds a copy of A. • The expected cost of an insert operation is O(C) and that of a delete operation is O(C log n), where C = max{c(u,v): u,v ∈ V}. • Let q be the number of objects that can be stored in the main memory of each node. The size of the auxiliary memory at each node is O(q log² n). • The number of nodes that are updated during insertion or deletion of an object is O(log n) in expectation, and O(log² n) with high probability.

Pastry’s Topology • Identifier space • l-bit identifiers (typically: l = 128), i.e. 2^l identifiers with wrap-around at 2^l − 1 ↔ 0 • interpret identifiers to the base 2^b (typically: b = 4, base 16) • build a prefix-based tree topology • The leaves can be keys and node IDs. • (key, value) pairs are managed by the numerically closest node. • Example: l = 6 (6-bit identifiers), b = 2 (base 4): x_0 x_1 x_2 … x_{l−1} (x_i ∈ {0,1}) → x_0 x_1 x_2 … x_{l/b−1} (x_i ∈ {0,1,…,2^b − 1}) • [Figure: prefix tree over the base-4 identifiers (root xxx; 0xx to 3xx; 00x to 33x), marking keys, nodes, and which node manages each key]
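
The re-interpretation of an l-bit identifier as l/b digits to the base 2^b can be sketched as follows. The function name `to_digits` is hypothetical, and the sketch assumes that b divides l, as in the slide's example (l = 6, b = 2).

```python
def to_digits(identifier, l, b):
    """Split an l-bit identifier into l/b digits of b bits each,
    most significant digit first."""
    if l % b != 0:
        raise ValueError("b must divide l")
    if not 0 <= identifier < 2 ** l:
        raise ValueError("identifier does not fit into l bits")
    mask = (1 << b) - 1
    # Shift the wanted b-bit group to the low end, then mask it out.
    return [(identifier >> shift) & mask for shift in range(l - b, -1, -b)]


# 6-bit identifier 011000 (binary) read in base 4
digits = to_digits(0b011000, l=6, b=2)
print("".join(str(d) for d in digits))
```

With these inputs the three digits are 1, 2 and 0, which is the identifier 120 used in the lookup example.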

Routing Basics (1) • Goal: find the node responsible for k, e.g., k = 120 • Tree-based search for lookup(k) • Traverse tree search structure top-down • Prefix-based routing for lookup(k) • Approximate tree search in the distributed scenario • Forward query to the known node with the longest prefix matching k • In our example: [Figure: the lookup of k = 120 traced through the prefix tree xxx; 0xx to 3xx; 00x to 33x]

Routing Basics (2) • Routing in Pastry • In each routing step, the query is routed towards the numerically closest node • That is, the query is routed to a node whose ID shares a prefix with the destination that is one character (= b bits) longer → O(log_{2^b} N) routing steps • If that is not possible: • route towards the node that is numerically closest to the ID • Example (b = 2), destination 012321: Start 321321 → 1. Hop 022222 → 2. Hop 013331 → 3. Hop 012110 → 4. Hop 012300 → 5. Hop 012322 → Destination 012321
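
The hop sequence from the example can be checked mechanically: each hop should match the destination in one more leading digit than the previous one. The following sketch only verifies that property for the IDs listed on the slide; `shared_prefix_len` is a hypothetical helper name.

```python
def shared_prefix_len(a, b):
    """Number of leading digits that two ID strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


destination = "012321"
# Start node, the five hops, and the destination, as listed on the slide.
path = ["321321", "022222", "013331", "012110", "012300", "012322", "012321"]

matched = [shared_prefix_len(node, destination) for node in path]
print(matched)

# Every step resolves exactly one further base-4 digit (b = 2 bits).
assert all(later == earlier + 1 for earlier, later in zip(matched, matched[1:]))
```

Because every hop resolves one digit, the number of hops is bounded by the number of digits in the ID, which is where the logarithmic step count comes from.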

Routing Example (1) • Example • Node-ID = 0221 • Base = 3 (not a power of 2 because it is easier to draw) • [Figure: base-3 prefix tree with the levels 0…, 1…, 2…; 00.. to 22..; 000. to 202.; the IDs 0002, 0211, 0221 and 2012 are marked]

Routing Example (2) • Data (key-value pairs) are maintained by the numerically closest node: • keys → nodes: 0002 → 0002, 0110 → 01** • [Figure: base-3 prefix tree with the levels 0…, 1…, 2…; 00.. to 22..; 000. to 202.; the IDs 0002, 0211, 0221 and 2012 are marked]

Links to Other Prefix Areas • Linking between prefix areas • Nodes within a certain prefix area also know the IP addresses of the other nodes in the same area. • Each node in a prefix area knows one or more nodes from another prefix area. • From which prefix areas should a node know other nodes? • Links to (base − 1) prefix areas on each prefix level

Linking Example • Node in area 222* knows nodes from the prefix areas 220*, 221* & 20**, 21** & 0***, 1*** • Thus, a logarithmic number of links: • For a prefix length p: (base − 1) links to other nodes with prefix length p, but with a different digit at position p • l/b different prefix lengths: l ~ log(N) • [Figure: prefix tree with the areas 0…, 1…, 2… and 20.., 21.., 22..]

Where to Store The Routing Tables (1) • Challenge • Efficiently distribute the search tree among the nodes • Honor network proximity • Pastry routing data per node • Routing table • Long-distance links to other nodes • Leaf set • Numerically closest nodes • Neighborhood set • Close nodes based on a proximity metric (e.g., latency)

Where to Store The Routing Tables (2) • Routing table • Long-distance links to other prefix areas • l/b rows: one per prefix length • 2^b − 1 columns: one per digit different from the local node ID • Routing table for node 120: [Table not preserved; it refers to the prefix tree xxx; 0xx to 3xx; 00x to 33x]
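
The layout of such a routing table follows directly from the row and column rule, so the prefix areas a node links to can be enumerated. The sketch below is an illustration under assumptions: IDs are digit strings with a base of at most 10, `*` marks an unresolved digit, and `routing_table_areas` is a hypothetical name. It lists prefix areas only; which concrete node fills each entry depends on the nodes present and on proximity.

```python
def routing_table_areas(node_id, base):
    """For each prefix length i, list the prefix areas that differ from
    node_id in digit i and agree with it in all digits before i."""
    rows = []
    for i, own_digit in enumerate(node_id):
        padding = "*" * (len(node_id) - i - 1)
        row = [
            node_id[:i] + str(d) + padding
            for d in range(base)
            if str(d) != own_digit  # the node's own digit needs no link
        ]
        rows.append(row)
    return rows


for i, row in enumerate(routing_table_areas("120", base=4)):
    print("row", i, row)
```

For node 120 in base 4 this gives three rows of three areas each, l/b = 3 rows with 2^b − 1 = 3 columns. Called with a base-3 ID from area 222*, the same function lists the areas 0***, 1***, 20**, 21**, 220* and 221* named in the linking example.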

Amount of Routing Data (1) • l/b rows with 2^b − 1 entries each • row i: holds the IDs of nodes whose ID shares an i-digit prefix with the local node • column j: digit(i+1) = j • Each entry contains the topologically closest node that meets these criteria. • Example: b = 2, N = 32, Node-ID = 32101 • [Table not preserved; its labels: digit at position i+1 (columns), shared prefix length with the Node-ID (rows), each cell holding the topologically closest node with prefix length i and digit(i+1) = j. Example cell: possible nodes 33xyz, of which 33123 is the topologically closest]
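
A rough size estimate for the routing table can be computed from l, b and the number of nodes N. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name is hypothetical, and the figure for populated rows uses the commonly cited estimate of about ⌈log_{2^b} N⌉ rows, which is an assumption brought in from the Pastry literature rather than from these slides.

```python
def routing_table_size(l, b, n_nodes):
    """Return (rows, columns, rows expected to be populated)."""
    rows = l // b              # one row per prefix length
    columns = 2 ** b - 1       # one column per digit other than the own one
    # Smallest r with (2^b)^r >= n_nodes, i.e. ceil(log_{2^b} N),
    # computed with integers to avoid floating-point rounding.
    populated, reach = 0, 1
    while reach < n_nodes:
        reach *= 2 ** b
        populated += 1
    return rows, columns, min(populated, rows)


print(routing_table_size(l=128, b=4, n_nodes=5000))
print(routing_table_size(l=6, b=2, n_nodes=20))
```

For the typical parameters l = 128 and b = 4 the table has room for 32 × 15 entries, but with a few thousand nodes only the first handful of rows is expected to contain live nodes.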

Amount of Routing Data (2) • Leaf set • contains the numerically closest nodes • has a fixed maximum size • similar to Chord's successor list • for routing and recovery from node departures • Neighbor set • contains nearby nodes • has a fixed maximum size • a scalar proximity metric is assumed to be available, e.g., IP hops, latency • irrelevant for routing • 'cache' of nearby candidates for the routing table • Example leaf set for Node ID = 32101: • Smaller NodeIDs: 32012, 32022, 32023, 32100 • Larger NodeIDs: 32110, 32120, 32121, 32123
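
How a leaf set like the one in the example could be derived from a list of known node IDs is sketched below. This is a simplified illustration: it ignores the wrap-around of the identifier ring, assumes an even `size` of at least 2, and uses the hypothetical name `leaf_set`. The two IDs 31000 and 33000 are made up to show that more distant nodes are left out.

```python
def leaf_set(node_id, known_ids, size, base=4):
    """Return (smaller, larger): the size/2 numerically closest known IDs
    below node_id and the size/2 closest above it (no ring wrap-around)."""
    half = size // 2
    own = int(node_id, base)
    ordered = sorted((n for n in known_ids if n != node_id),
                     key=lambda n: int(n, base))
    smaller = [n for n in ordered if int(n, base) < own][-half:]
    larger = [n for n in ordered if int(n, base) > own][:half]
    return smaller, larger


known = ["32012", "32022", "32023", "32100", "32110", "32120",
         "32121", "32123",
         "31000", "33000"]  # two hypothetical, more distant nodes
smaller, larger = leaf_set("32101", known, size=8)
print("smaller:", smaller)
print("larger: ", larger)
```

The result reproduces the eight IDs of the slide's example, four on each side of 32101.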

Long-Range Routing • When key k is not covered by the leaf set, the query for k is forwarded to • a node with a longer prefix match than the current node, or • a node with the same prefix length that is numerically closer. • Example at node 103220: lookup(102332) → 102303, lookup(103300) → 103302

Short-Range Routing • k is covered by the node IDs in the leaf set • pick the leaf node n_L numerically closest to k • n_L must be responsible for k • return n_L as answer to query for k • Example at node 103220: lookup(103312) → 103302
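
The two routing cases can be combined into one forwarding decision. The sketch below is a simplified reading of the rules on the slides, not the reference implementation: the leaf set and routing table contents for node 103220 are hypothetical (the slide's figure is not preserved), the routing table is modeled as a dict from (row, digit) to node ID, ring wrap-around is ignored, and `route` and `shared_prefix_len` are made-up names.

```python
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(node_id, key, leaves, table, base=4):
    """Return the node the query for key is handed to next."""
    val = lambda s: int(s, base)
    dist = lambda n: abs(val(n) - val(key))
    members = list(leaves) + [node_id]
    # Short-range case: key lies within the span of the leaf set.
    if min(map(val, members)) <= val(key) <= max(map(val, members)):
        return min(members, key=dist)
    # Long-range case: use the entry that extends the prefix by one digit.
    p = shared_prefix_len(node_id, key)
    entry = table.get((p, key[p]))
    if entry is not None:
        return entry
    # Fallback: any known node with an equally long prefix that is closer.
    candidates = [n for n in set(leaves) | set(table.values())
                  if shared_prefix_len(n, key) >= p and dist(n) < dist(node_id)]
    return min(candidates, key=dist, default=node_id)


leaves = ["103123", "103210", "103302", "103330"]   # hypothetical leaf set
table = {(2, "2"): "102303", (0, "0"): "031120"}    # hypothetical entries
print(route("103220", "103312", leaves, table))  # short-range case
print(route("103220", "102332", leaves, table))  # long-range case
```

With this hypothetical state, the first call resolves the key inside the leaf set and the second one follows the routing table entry in row 2, matching the two lookups named on the slides.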

Another Routing Example • Node-ID = 32101 • Key = 32102: key is in range of the leaf set • Key = 01200: no common prefix with 32101 → forward to prefix area 0---- • Key = 32200: common prefix 32 → forward to prefix area 322-- • Key = 33122: common prefix 3 → forward to prefix area 33---

Arrival of a New Node (1) • Node X wants to join the Pastry DHT • Determine Node-ID of X → 12333 (hash of IP address) • Initialize tables at node X • Send JOIN message to 12333 via topologically nearest Pastry node • [Figure: nodes X = 12333, A0 = 23231, A1 = 13231, A2 = 12222, A3 = 12311 and A4 = Z = 12332; X sends the message JOIN X]

Arrival of a New Node (2) • Node X wants to join the Pastry DHT • Node X copies the neighbor set from node A0

Arrival of a New Node (3) • Node X wants to join the Pastry DHT • Node A0 routes the message to node Z • Each node on the route sends a row of its routing table to X • Here: A0

Arrival of a New Node (4) • Node X wants to join the Pastry DHT • Node A0 routes the message to node Z • Each node on the route sends a row of its routing table to X • Here: A1

Arrival of a New Node (5) • Node X wants to join the Pastry DHT • Node A0 routes the message to node Z • Each node on the route sends a row of its routing table to X • Here: A2

Arrival of a New Node (6) • Node X wants to join the Pastry DHT • Node A0 routes the message to node Z • Each node on the route sends a row of its routing table to X • Here: A3

Arrival of a New Node (7) • Node X wants to join the Pastry DHT • Node A0 routes the message to node Z • Each node on the route sends a row of its routing table to X • Here: A4

Arrival of a New Node (8) • Node X wants to join the Pastry DHT • Node Z copies its leaf set to node X

Arrival of a New Node (9) • Some entries are missing • take the node IDs just visited • Some are doubtful • Entries pointing to “own-ID positions” are not required

Arrival of a New Node (10) • Node X wants to join the Pastry DHT • Node X sends its routing table to each neighbor
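
The join walk-through can be traced with the node IDs from the slides. The sketch below shows two things under simplifying assumptions: which routing-table row each node on the JOIN route would contribute (taken here as the row whose index equals the length of the prefix it shares with X), and how the visited nodes themselves can fill otherwise empty entries of X's table. The function names and the dict-based table are hypothetical.

```python
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def rows_for_new_node(x, route_nodes):
    """(node, row index) pairs: the row each node on the route sends to X."""
    return [(node, shared_prefix_len(node, x)) for node in route_nodes]


def seed_routing_table(x, route_nodes):
    """Use the visited nodes as initial entries, keyed by (row, digit)."""
    table = {}
    for node in route_nodes:
        p = shared_prefix_len(node, x)
        if p < len(x):  # a node identical to X has no slot
            table.setdefault((p, node[p]), node)
    return table


x = "12333"
route_nodes = ["23231", "13231", "12222", "12311", "12332"]  # A0 .. A4 = Z

for node, row in rows_for_new_node(x, route_nodes):
    print(node, "sends row", row)
print(seed_routing_table(x, route_nodes))
```

Since the shared prefix grows by one digit per hop, A0 to A4 contribute rows 0 to 4, which together cover every row of X's table.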

Arrival of a New Node (11) • Efficiency of the initialization procedure • Quality of routing table (b = 4, |L| = 16, |M| = 32, 5k nodes) • SL: transfer only the i-th routing table row of A_i • WT: transfer of the i-th routing table row of A_i as well as analysis of the leaf and neighbor sets • WTF: same as WT, but also query the newly discovered nodes from WT and analyze their data

Failure of a Pastry Node • Detection of failures • Periodic verification of nodes in the leaf set • “Are you alive?” messages also check the capability of the neighbors • A route query fails • Replacement of corrupted entries • leaf set • choose an alternative node from leaf(L) ∪ leaf(|L|/2) • ask these nodes for their leaf sets • Entry R_x^y in the routing table failed: • ask a neighbor node R_x^i (i ≠ y) of the same row for a route to R_x^y • if not successful, test an entry R_{x+1}^i in the next row
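
The repair of a failed routing-table entry can be sketched as a small lookup over the tables of other nodes. This is an illustration under assumptions, not Pastry's actual code: remote tables are modeled as an in-memory dict `tables[node][(row, digit)]`, liveness as a set `alive`, and all node IDs other than X = 12333 and 13231 are made up.

```python
def repair_entry(me, row, col, tables, alive):
    """Find a live replacement for the failed entry (row, col) of node me.

    First ask the other entries of the same row for their (row, col)
    entry; if none can help, ask the entries of the next row.
    """
    own = tables[me]
    for r in (row, row + 1):
        for (r2, c2), peer in sorted(own.items()):
            if r2 != r or (r2 == row and c2 == col) or peer not in alive:
                continue
            candidate = tables.get(peer, {}).get((row, col))
            if candidate is not None and candidate in alive and candidate != me:
                return candidate
    return None


tables = {
    "12333": {(1, "0"): "10211", (1, "1"): "11032", (1, "3"): "13231"},
    "10211": {(1, "3"): "13231"},  # knows only the failed node
    "11032": {(1, "3"): "13002"},  # knows a live alternative
}
alive = {"10211", "11032", "13002"}  # 13231 has failed

print(repair_entry("12333", 1, "3", tables, alive))
```

Any node in row 1 of X's table shares X's first digit, so its own row-1 entry for digit 3 is a valid candidate for X's failed slot.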

Performance Evaluation • Routing Performance • Number of Pastry hops (b = 4, |L| = 16, |M| = 32, 2·10^5 queries) • O(log N) for number of hops in the overlay • Overhead of overlay (in comparison to route between two nodes in the IP network) • But: routing table has only O(log N) entries instead of O(N)

Pastry: Summary • Complexity • O(log N) hops to the destination with the long-distance routing table • often even better than that through the leaf and neighbor sets • A storage overhead of O(log N) per node • Good support of locality • Explicit search of close nodes (according to some closeness metric) • Used in many applications • PAST (a file system) • Squirrel (a web cache)

Comparison: Complexity of Three DHTs

Conclusion – Distributed Hash Tables • The same hash function is used to label nodes, to assign keys to nodes and to search for keys • Lower search effort and memory requirements than unstructured peer-to-peer systems • O(log N) for Chord and Pastry • Chord and Pastry are similar • Pastry is better because of proximity support • CAN tends towards a linear search effort • but a memory requirement of only O(D) for D dimensions
