Koorde: A Simple Degree Optimal DHT. Frans Kaashoek, David Karger MIT Brought to you by the IRIS project. DHT Routing. Distributed hash tables Implement hash table interface Map any ID to the machine responsible for that ID (in a consistent fashion) Standard primitive for P2P
Koorde: A Simple Degree Optimal DHT • Frans Kaashoek, David Karger (MIT) • Brought to you by the IRIS project
DHT Routing • Distributed hash tables • Implement hash table interface • Map any ID to the machine responsible for that ID (in a consistent fashion) • Standard primitive for P2P • Machines not all aware of each other • Each tracks small set of “neighbors” • Route to responsible node via sequence of “hops” to neighbors
Performance Measures • Degree • How many neighbors nodes have • Hop count • How long to reach any destination node • Fault tolerance • How many nodes can fail • Maintenance overhead • E.g., making sure neighbors are up • Load balance • How evenly keys distribute among nodes
Tradeoffs • With larger degree, hope to achieve • Smaller hop count • Better fault tolerance • But higher degree implies • More routing table state per node • Higher maintenance overhead to keep routing tables up to date • Load balance “orthogonal issue”
Current Systems • Chord, Kademlia, Pastry, Tapestry • O(log n) degree • O(log n) hop count • O(log n) ratio load balance • Chord: O(1) load balance with O(log n) “virtual nodes” per real node • Multiplies degree to O(log^2 n)
Outliers • CAN • Degree d • O(d n^(1/d)) hops • Viceroy • O(log n) hop count • Constant average degree • But some nodes have degree log n
Lower Bounds to Shoot For • Theorem: if max degree is d, then hop count is at least log_d n • Proof: at most d^h nodes at distance h • Allows degree O(1) and O(log n) hops • Or degree O(log n) and O(log n / log log n) hops • Theorem: to tolerate half the nodes failing (e.g., net partition), need degree Ω(log n) • Proof: if less, some node loses all neighbors • Might as well take O(log n / log log n) hops!
Koorde • New routing protocol • Shares almost all aspects with Chord • But, meets (to within constant factor) all lower bounds just mentioned: • Degree 2 and O(log n) hops • Or degree log n and O(log n / loglog n) hops and fault tolerant • Like Chord, O(log n) load balance • or constant with O(log n) times degree
Chord Review • Chord consists of • Consistent hashing to assign IDs to nodes • Good load balance • Efficient routing protocol to find right node • Fast join/leave protocol • Few data items shifted • Fault tolerance to half of nodes failing • Efficient maintenance over time ■ Koorde: routing protocol to find right node
Consistent Hashing • Assign ID to “successor” node on ring • E.g., assign doc with hash 49 to node 51 [Figure: ring with nodes 0, 6, 13, 18, 22, 31, 36, 42, 47, 51, 60; key 49 maps to node 51]
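The successor rule on this slide can be sketched in a few lines. This is an illustrative sketch only: the node IDs are the ones shown on the ring, while the function name `successor` and the sorted-list representation are assumptions made for the example.

```python
from bisect import bisect_left

# Node IDs taken from the ring on the slide.
NODES = sorted([0, 6, 13, 18, 22, 31, 36, 42, 47, 51, 60])


def successor(key: int, nodes=NODES) -> int:
    """Return the first node whose ID is >= key, wrapping around the ring."""
    i = bisect_left(nodes, key)     # index of the first node ID >= key
    return nodes[i % len(nodes)]    # past the largest ID, wrap to the smallest


if __name__ == "__main__":
    print(successor(49))  # the slide's example: hash 49 is assigned to node 51
```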
Chord Routing • Each node keeps successor pointer • Also keeps power-of-two “fingers”: neighbors providing shortcuts • So log n fingers [Figure: ring with nodes 0, 6, 13, 18, 22, 31, 36, 42, 47, 51, 60]
Chord Lookups [Figure: lookup on the ring with nodes 0, 6, 13, 18, 22, 31, 36, 42, 47, 51, 60]
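A rough sketch of how power-of-two fingers could be computed for the ring shown on the slides. It assumes a 6-bit ID space (inferred from the largest node ID, 60, and not stated on the slide) and uses the usual Chord rule that finger i of node n is the successor of n + 2^i; the names `B`, `successor`, and `fingers` are hypothetical.

```python
from bisect import bisect_left

B = 6  # assumed ID width: the slide's IDs all fit in 6 bits
NODES = sorted([0, 6, 13, 18, 22, 31, 36, 42, 47, 51, 60])


def successor(key: int) -> int:
    """First node at or after key on the 2**B ring."""
    i = bisect_left(NODES, key % (1 << B))
    return NODES[i % len(NODES)]


def fingers(n: int) -> list:
    """Finger i points at the successor of n + 2**i, for i = 0 .. B-1."""
    return [successor(n + (1 << i)) for i in range(B)]


if __name__ == "__main__":
    for node in (0, 51):
        print(node, fingers(node))
```

Nearby fingers often collapse onto the same node on a sparse ring, so the number of distinct neighbors is about log n rather than B.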
Koorde Idea • Chord acts like a hypercube • Fingers flip one bit • Degree log n (log n different flips) • Diameter log n • Koorde uses a de Bruijn network • Fingers shift in one bit • Degree 2 (2 possible bits to shift in) • Diameter log n
De Bruijn Graph • Nodes are b-bit integers (b = log n) • Node u has 2 neighbors (bit shifts): 2u mod 2^b and (2u+1) mod 2^b [Figure: de Bruijn graph on the 3-bit nodes 000 through 111, edges labeled 0 or 1]
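The neighbor rule can be written directly from the formula on this slide. A minimal sketch; the function name `de_bruijn_neighbors` is an assumption for illustration.

```python
def de_bruijn_neighbors(u: int, b: int) -> tuple:
    """Out-neighbors of b-bit node u: shift left, drop the top bit,
    and append either a 0 or a 1."""
    mask = (1 << b) - 1                      # keeps the low b bits, i.e. mod 2**b
    return ((2 * u) & mask, (2 * u + 1) & mask)


if __name__ == "__main__":
    b = 3
    for u in range(1 << b):
        left, right = de_bruijn_neighbors(u, b)
        print(f"{u:0{b}b} -> {left:0{b}b}, {right:0{b}b}")
```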
De Bruijn Routing • Shift in destination bits one by one • b hops complete route • Route from 000 to 110: 000 → 001 → 011 → 110 [Figure: 3-bit de Bruijn graph with the route from 000 to 110]
Routing Code
• Procedure u.LOOKUP(k, toShift)
    /* u is machine, k is target key; toShift is target bits not yet shifted in */
    if k = u then
        return u   /* as owner for k */
    else   /* do de Bruijn hop: shift u left and append the top bit of toShift */
        t = u ∘ topBit(toShift)
        return t.LOOKUP(k, toShift << 1)
• Initially call self.LOOKUP(k, k)
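The routing procedure can be sketched iteratively on a complete de Bruijn graph, where every b-bit ID is a real node. The function name `de_bruijn_route` and the path-list return value are assumptions for illustration, not part of the slides.

```python
def de_bruijn_route(src: int, dst: int, b: int) -> list:
    """Route from src to dst by shifting in dst's bits, top bit first.
    Returns the list of nodes visited, starting with src."""
    mask = (1 << b) - 1
    path = [src]
    u, to_shift = src, dst
    for _ in range(b):
        if u == dst:                          # already at the owner of the key
            break
        top = (to_shift >> (b - 1)) & 1       # topBit(toShift)
        u = ((u << 1) | top) & mask           # u ∘ topBit(toShift), mod 2**b
        to_shift = (to_shift << 1) & mask     # toShift << 1
        path.append(u)
    return path


if __name__ == "__main__":
    route = de_bruijn_route(0b000, 0b110, 3)
    print(" -> ".join(f"{x:03b}" for x in route))
```

After at most b hops all of the source's original bits have been shifted out, so the current node equals the destination regardless of where the route started.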
Summary • Each node has 2 outgoing neighbors • Also two incoming • Can show good routing load balance • Need b = log n bits for n distinct nodes • So log n hops to route
Problems to Solve • Want b-bit ring, b >> log n, to avoid colliding identifiers as nodes join • Implies use b >> log n hops • Worse, most nodes not present to route! • Solutions • Imaginary routing: present nodes simulate routing actions of absent nodes • Short cuts: use gaps to start route with most of destination bits already shifted in
Imaginary routing • Node u holds two pointers • Successor on ring • One finger: predecessor of 2u (mod 2^b) • On sparse ring, is also predecessor of 2u+1 • So handles both de Bruijn edges • Node u “owns” all imaginary nodes between self and (real) successor • Simulates de Bruijn routing from those imaginary nodes to others by forwarding to the others’ real owners
Code
• Procedure u.LOOKUP(k, toShift, i)
    if k ∈ (u, u.successor] then
        return u.successor   /* as bucket for k */
    else if i ∈ (u, u.successor] then   /* i belongs to u; do de Bruijn hop */
        return u.finger.LOOKUP(k, toShift << 1, i ∘ topBit(toShift))
    else   /* i doesn’t belong to u; forward it */
        return u.successor.LOOKUP(k, toShift, i)
• Initially call self.LOOKUP(k, k, self)
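The lookup with imaginary routing can be simulated on a sparse ring. This is a sketch under stated assumptions rather than the authors' implementation: the `Ring` class and its global sorted list stand in for per-node successor and finger pointers, the recursion is unrolled into a loop, and the imaginary node starts at u + 1 instead of u so that it falls inside the half-open interval (u, u.successor] on the very first test.

```python
from bisect import bisect_left, bisect_right


def in_open_closed(x: int, a: int, b: int) -> bool:
    """True if x lies in the ring interval (a, b]; a == b means the whole ring."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


class Ring:
    def __init__(self, node_ids, b: int):
        self.b = b
        self.size = 1 << b
        self.nodes = sorted(node_ids)

    def successor(self, u: int) -> int:
        """Next real node strictly after u, wrapping around."""
        i = bisect_right(self.nodes, u)
        return self.nodes[i % len(self.nodes)]

    def finger(self, u: int) -> int:
        """Predecessor of 2u mod 2**b: the last real node strictly before it."""
        x = (2 * u) % self.size
        return self.nodes[bisect_left(self.nodes, x) - 1]  # index -1 wraps

    def lookup(self, start: int, k: int):
        """Return (node responsible for k, number of hops taken)."""
        mask = self.size - 1
        u, to_shift = start, k
        i = (start + 1) & mask            # assumed starting imaginary node
        hops = 0
        while True:
            succ = self.successor(u)
            if in_open_closed(k, u, succ):
                return succ, hops         # succ is the bucket for k
            if in_open_closed(i, u, succ):
                # i belongs to u: do a de Bruijn hop via the finger
                top = (to_shift >> (self.b - 1)) & 1
                i = ((i << 1) | top) & mask
                to_shift = (to_shift << 1) & mask
                u = self.finger(u)
            else:
                u = succ                  # i does not belong to u: forward it
            hops += 1


if __name__ == "__main__":
    ring = Ring([0, 6, 13, 18, 22, 31, 36, 42, 47, 51, 60], b=6)
    print(ring.lookup(0, 49))
```

The example reuses the 6-bit ring from the consistent hashing slide, so the lookup for key 49 should end at node 51 from any starting node.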
True route tracks imaginary [Figure: ring segment with labels start, finger (< double), imaginary (double), target, successor]
Correctness • Once b de Bruijn steps happen, done • At this point, i = k • Will follow successors to bucket for k • Successor steps delay de Bruijn steps, but not forever • After finite number of successor steps, reach predecessor of i • Conclude: all necessary de Bruijn steps happen in finite time. So correct.
How long? • Only b de Bruijn steps • Just bound (expected) number of successor steps per de Bruijn step • Nodes randomly distributed on ring • So node expects to own size 1/n interval • So distance to imaginary node on de Bruijn step is 1/n • De Bruijn step doubles everything, makes distance 2/n • Expect 2 nodes in interval of that size
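The expectation step in the last two bullets can be written out as a short calculation. It assumes, as the slide does, that the n node IDs are placed uniformly at random on a ring normalized to length 1.

```latex
\begin{align*}
\mathbb{E}\bigl[\text{nodes in an interval of length } \ell\bigr] &= n\,\ell \\
\ell = \tfrac{2}{n} \;\Longrightarrow\;
\mathbb{E}\bigl[\text{successor steps per de Bruijn step}\bigr] &\approx n \cdot \tfrac{2}{n} = 2
\end{align*}
```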
Few Successor Steps [Figure: ring segment with labels start and target; interval widths 1/n and < 2/n]
Summary • Each de Bruijn hop followed by 2 successor hops (in expectation) • b de Bruijn hops • Conclude 2b successor hops so 3b hops in total • Expectation argument extends to “with high probability” argument (same bounds) • Remaining problem: b>>log n, too big
Exploit Address Blocks • Only n real nodes • Each owns ~1/n “block” of keyspace • Within that block, only top log n bits “significant”; low bits arbitrary • So set low bits to high bits of target • Then just have to shift out log n most significant bits • So log n de Bruijn hops • So O(log n) hops in total
Example • Start at u = 001011011… • Successor 001110101…. • u “owns” imaginary 00101****** • Target 1101011…. • Set imaginary start 001011101011… • Only need to shift out 00101 • 5 hops, independent of b
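One way to pick the imaginary start node is to try preloading as many target bits as possible and keep the largest count that still lands inside u's block. The sketch below is an assumption-laden illustration: the function name `choose_start` is hypothetical, and because the bit strings on the slide are truncated, the demo uses a separate 6-bit example (u = 22, successor = 31, target 49) rather than the slide's values.

```python
def choose_start(u: int, succ: int, k: int, b: int) -> tuple:
    """Pick an imaginary node i in (u, succ] whose low j bits equal the top
    j bits of k, with j as large as possible. Returns (i, j); only the
    remaining b - j bits of k still need de Bruijn hops."""
    size = 1 << b
    block = (succ - u) % size or size        # length of (u, succ]; whole ring if alone
    for j in range(b, -1, -1):
        low = k >> (b - j)                   # top j bits of the target
        # smallest value above u that is congruent to low modulo 2**j
        offset = 1 + ((low - (u + 1)) % (1 << j))
        if offset <= block:
            return (u + offset) % size, j
    raise AssertionError("unreachable: j = 0 always fits")


if __name__ == "__main__":
    b = 6
    i, j = choose_start(22, 31, 49, b)
    print(f"start {i:0{b}b}, {j} target bits preloaded, {b - j} hops left")
```

In this example the block (22, 31] contains 011000, whose low five bits are the top five bits of the target 110001, so a single de Bruijn hop remains.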
Summary • Koorde uses • 2 neighbors per node • (one successor, one finger) • And requires O(log n) routing hops with high probability
Variant: Koorde-K • We used a binary de Bruijn network • Generalizes to other base K [Figure: de Bruijn graph with 3-digit node labels over the digits 0, 1, 2]
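The base-K generalization replaces "shift in one bit" with "shift in one base-K digit". A minimal sketch on a complete base-K de Bruijn graph; the function name `route_base_k` and its parameters are assumptions for illustration.

```python
def route_base_k(src: int, dst: int, k: int, digits: int) -> list:
    """Route on a complete base-k de Bruijn graph whose nodes are
    `digits`-digit base-k numbers, shifting in dst's digits top first."""
    size = k ** digits
    path = [src]
    u = src
    for pos in range(digits - 1, -1, -1):
        digit = (dst // k ** pos) % k     # next most significant digit of dst
        u = (u * k + digit) % size        # shift left by one digit, append it
        path.append(u)
    return path


if __name__ == "__main__":
    # Base 3, three digits: route from 000 to 121 (decimal 16).
    print(route_base_k(0, 16, k=3, digits=3))
```

With k = 2 this reduces to the binary routing used earlier, and the number of hops equals the number of base-k digits.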
Analysis • To represent n distinct node ids need log_K n base-K digits • Suggests log_K n hops to route • Same problem as Koorde: b >> log_K n • Same solution: imaginary routing • Node u points at predecessor(Ku) • Same analysis: base-K de Bruijn hops interspersed with successor hops
Successor Hops • Now de Bruijn hop multiplies ids by K • So expect K nodes between finger and next imaginary node • Implies K successor hops per de Bruijn hop • Gives K log_K n hops: no good • To avoid successor hops, u fingers predecessor(Ku) and following K nodes • Allows K successor hops by one finger • Gives O(log_K n) hops as desired
Summary • Using K fingers per node, can achieve O(log_K n) = O(log n / log K) routing hops • As discussed earlier, degree log n is necessary (and sufficient) for fault tolerance (and is degree of most previous systems) • So, O(log n / log log n) hops
Summary: What do we Gain? • Lower degree for same number of hops • Storage isn’t really an issue • But lower degree should translate into lower maintenance traffic • Lower hop count for same degree • And tunable • Other systems also have tunable hop count • But at low hop counts (high degree) their extra log factor in degree does matter
What do we lose? • Chord is “self stabilizing” • From successors, can build entire routing system quickly by “pointer jumping” to find fingers • Koorde is not • Given only successor pointers, no clear fast way to find fingers • Not a problem for joins, because joiner can use lookup to find its finger • But could be a problem if massive changes
More Info http://www.pdos.lcs.mit.edu/chord/