Understand self-organization in various domains and its application in structured overlays for scalability. Explore distributed hash tables and consistent hashing for overcoming challenges in large-scale systems.
Structured Overlays: self-organization and scalability • Acknowledgement: based on slides by Anwitaman Datta (Nanyang) and Ali Ghodsi
Self-organization • Self-organizing systems common in nature • Physics, biology, ecology, economics, sociology, cybernetics • Microscopic (local) interactions • Limited information, individual decisions • Distribution of control => decentralization • Symmetry in roles/peer-to-peer • Emergence of macroscopic (global) properties • Resilience • Fault tolerance as well as recovery • Adaptivity
A Distributed Systems Perspective (P2P) • Centralized solutions undesirable or unattainable • Exploit resources at the edge • no dedicated infrastructure/servers • peers act as both clients and servers (servent) • Autonomous participants • large scale • dynamic system and workload • source of unpredictability • e.g., correlated failures • No global control or knowledge • rely on self-organization
What’s a Distributed Hash Table? • An ordinary hash table, which is distributed • Every node provides a lookup operation • Given a key: return the associated value • Nodes keep routing pointers • If item not found locally, route to another node
Why’s that interesting? • Characteristic properties • Self-management in the presence of joins/leaves/failures • Routing information • Data items • Scalability • Number of nodes can be huge (to store a huge number of items) • However: search and maintenance costs scale sub-linearly (often logarithmically) with the number of nodes.
Global File System • Similar to DFS (e.g. NFS, AFS) • But files/metadata stored in directory • E.g. Wuala, WheelFS… • What is new? • Application logic self-managed • Add/remove servers on the fly • Automatic failure handling • Automatic load-balancing • No manual configuration for these ops • Directory (key → value, spread over nodes A to D): /home/... → 130.237.32.51; /usr/… → 193.10.64.99; /boot/… → 18.7.22.83; /etc/… → 128.178.50.12; …
P2P Web Servers • Distributed community Web Server • Pages stored in the directory • What is new? • Application logic self-managed • Automatically load-balances • Add/remove servers on the fly • Automatically handles failures • Example: • CoralCDN • Directory (key → value, spread over nodes A to D): www.s... → 130.237.32.51; www2 → 193.10.64.99; www3 → 18.7.22.83; cs.edu → 128.178.50.12; …
Name-based communication pattern • Map node names to location • Can store all kinds of contact information • Mediator peers for NAT hole punching • Profile information • Used this way by: • Internet Indirection Infrastructure (i3) • Host Identity Protocol (HIP) • P2P Session Initiation Protocol (P2PSIP) • Directory (key → value, spread over nodes A to D): anwita → 130.237.32.51; ali → 193.10.64.99; alberto → 18.7.22.83; ozalp → 128.178.50.12; …
Hash tables • Ordinary hash tables • put(key,value) • Store <key,value> in bucket (hash(key) mod 7) • get(key) • Fetch <key,v> s.t. <key,v> is in bucket (hash(key) mod 7) • Buckets: 0 1 2 3 4 5 6
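The put/get operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python's built-in hash() stands in for the slide's unspecified hash function, and the names buckets and bucket_index are hypothetical.

```python
NUM_BUCKETS = 7  # matches the "mod 7" on the slide

# One list of (key, value) pairs per bucket.
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_index(key):
    # Assumption: Python's built-in hash() plays the role of hash(key).
    return hash(key) % NUM_BUCKETS

def put(key, value):
    bucket = buckets[bucket_index(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)  # overwrite an existing entry
            return
    bucket.append((key, value))

def get(key):
    for k, v in buckets[bucket_index(key)]:
        if k == key:
            return v
    return None  # key not stored

put("ali", "193.10.64.99")
put("ali", "193.10.64.100")  # second put replaces the first value
print(get("ali"))
```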
DHT by mimicking Hash Tables • Let each bucket be a server • n servers means n buckets • Problem • How do we remove or add buckets? • A single bucket change requires re-shuffling a large fraction of items
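The re-shuffling problem can be made concrete with a small experiment. The sketch below assumes SHA-1 as a stable hash (an arbitrary choice, not from the slides) and counts how many of 10,000 hypothetical item names change bucket when a naive "hash mod n" scheme grows from 7 to 8 servers. An item stays put only if its hash gives the same remainder modulo 7 and modulo 8, so most items are expected to move.

```python
import hashlib

def h(key):
    # Stable hash; SHA-1 is an arbitrary choice for this sketch.
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")

def moved_fraction(keys, old_n, new_n):
    """Fraction of keys whose bucket changes when n goes old_n -> new_n."""
    moved = sum(1 for k in keys if h(k) % old_n != h(k) % new_n)
    return moved / len(keys)

keys = [f"item-{i}" for i in range(10_000)]  # hypothetical item names
frac = moved_fraction(keys, 7, 8)
print(f"{frac:.1%} of items change bucket when going from 7 to 8 servers")
```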
Consistent Hashing Idea • Logical name space, called the identifier space, consisting of identifiers {0,1,2,…, N-1} • Identifier space is a logical ring modulo N • Every node picks a random identifier • Example: • Space N=16 {0,…,15} • Five nodes a, b, c, d, e • a picks 6 • b picks 5 • c picks 0 • d picks 11 • e picks 2 • [Figure: identifier ring labelled 0 to 15]
Definition of Successor • The successor of an identifier is the first node met going in clockwise direction starting at the identifier • Example • succ(12)=14 • succ(15)=2 • succ(6)=6
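A minimal sketch of the successor function follows. The node set {2, 6, 14} is a hypothetical choice, picked only because it reproduces the three examples on this slide; the slide's own figure is not available.

```python
import bisect

def succ(identifier, nodes, ring_size):
    """First node met going clockwise, starting at the identifier itself."""
    ordered = sorted(nodes)
    i = bisect.bisect_left(ordered, identifier % ring_size)
    return ordered[i % len(ordered)]  # wrap around past the largest id

nodes = [2, 6, 14]  # hypothetical node set matching the slide's examples
for ident in (12, 15, 6):
    print(f"succ({ident}) = {succ(ident, nodes, 16)}")
```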
Where to store items? • Use globally known hash function, H • Each item <key,value> gets the identifier H(key) • Store item at successor of H(key) • Term: node is responsible for item k • Example • H(“Anwitaman”)=12 • H(“Ali”)=2 • H(“Alberto”)=9 • H(“Ozalp”)=14
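The placement rule can be sketched as follows. Two assumptions: the node set {0, 2, 5, 6, 11} is taken from the Chord pointer example later in the deck, and H is modelled as a lookup table holding the identifiers given on this slide rather than a real hash function.

```python
import bisect

RING_SIZE = 16
NODES = [0, 2, 5, 6, 11]  # assumed from the Chord pointer example

# Identifiers as given on the slide; a real H would compute these.
H = {"Anwitaman": 12, "Ali": 2, "Alberto": 9, "Ozalp": 14}

def succ(identifier):
    i = bisect.bisect_left(NODES, identifier % RING_SIZE)
    return NODES[i % len(NODES)]

def responsible_node(key):
    """Node that stores the item: the successor of H(key)."""
    return succ(H[key])

placement = {key: responsible_node(key) for key in H}
print(placement)
```

Under these assumptions, identifiers 12 and 14 both wrap around the ring to node 0, while 9 lands on node 11.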
Consistent Hashing: Summary • + Scalable • Each node stores avg D/n items (for D total items, n nodes) • Reshuffle on avg D/n items for every join/leave/failure • - However: global knowledge - everybody knows everybody • Akamai works this way • Amazon Dynamo too • + Load balancing • w.h.p. O(log n) imbalance • Can eliminate imbalance by having each server “simulate” log(n) random buckets
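The limited reshuffling on a join can be checked with a small simulation. This is a sketch under assumptions: identifiers are 32-bit values derived from SHA-1, and the node and item names are hypothetical. With n = 50 nodes, roughly D/n items are expected to move, though the exact count depends on where the new identifier lands.

```python
import bisect
import hashlib

def ident(name):
    """32-bit identifier from SHA-1 (hash choice is an assumption)."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest[:4], "big")

def owner(key_id, node_ids):
    """Successor of key_id among the sorted node identifiers."""
    i = bisect.bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]

nodes = sorted(ident(f"node-{i}") for i in range(50))
keys = [ident(f"item-{i}") for i in range(20_000)]
before = [owner(k, nodes) for k in keys]

# One node joins: only items in its arc of the ring change owner.
new_node = ident("node-new")
nodes_after = sorted(nodes + [new_node])
after = [owner(k, nodes_after) for k in keys]

moved = [a for b, a in zip(before, after) if a != b]
print(f"{len(moved)} of {len(keys)} items moved; all to the new node:",
      all(a == new_node for a in moved))
```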
Where to point (Chord)? • Each node points to its successor • The successor of a node p is succ(p+1) • Known as a node’s succ pointer • Each node points to its predecessor • First node met in anti-clockwise direction starting at p-1 • Known as a node’s pred pointer • Example • 0’s successor is succ(1)=2 • 2’s successor is succ(3)=5 • 5’s successor is succ(6)=6 • 6’s successor is succ(7)=11 • 11’s successor is succ(12)=0
DHT Lookup • To lookup a key k • Calculate H(k) • Follow succ pointers until item k is found • Example • Lookup “Alberto” at node 2 • H(“Alberto”)=9 • Traverse nodes: 2, 5, 6, 11 (BINGO) • Return “Trento” to initiator
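The lookup example can be traced with a short sketch. It assumes the node set {0, 2, 5, 6, 11} from the pointer example and models each node's local storage as a dictionary (STORE is a hypothetical name). The stop-after-one-lap check is an addition so that a missing key terminates.

```python
NODES = [0, 2, 5, 6, 11]
# succ pointer of each node: the next node clockwise on the ring.
SUCC = {p: NODES[(i + 1) % len(NODES)] for i, p in enumerate(NODES)}

STORE = {p: {} for p in NODES}
STORE[11]["Alberto"] = "Trento"  # H("Alberto") = 9 and succ(9) = 11

def lookup(key, start):
    """Follow succ pointers from start until the item is found."""
    path, node = [], start
    while True:
        path.append(node)
        if key in STORE[node]:
            return STORE[node][key], path
        node = SUCC[node]
        if node == start:  # went all the way around: key is not stored
            return None, path

print(lookup("Alberto", 2))
```

The number of nodes visited grows linearly with the ring size, which is what the finger pointers introduced later are meant to improve.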
Dealing with failures • Each node keeps a successor-list • Pointer to f closest successors • succ(p+1) • succ(succ(p+1)+1) • succ(succ(succ(p+1)+1)+1) • ... • Rule: If successor fails • Replace with closest alive successor • Rule: If predecessor fails • Set pred to nil • Set f=log(n) • With failure probability 0.5, the probability that all f nodes in the list fail is (1/2)^(log n) = 1/n, so w.h.p. at least one node in the list stays alive
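The arithmetic behind f = log(n) and the replacement rule can be sketched as below. The sketch assumes independent node failures and a base-2 logarithm; both function names are hypothetical.

```python
import math

def prob_all_successors_fail(n, p_fail=0.5):
    """Probability that all f = log2(n) list entries fail (independently)."""
    f = math.log2(n)
    return p_fail ** f

def closest_alive(successor_list, alive):
    """Slide rule: replace a failed succ with the closest alive successor."""
    for node in successor_list:  # list is ordered by ring distance
        if node in alive:
            return node
    return None  # the whole list failed

print(prob_all_successors_fail(1024))        # (1/2)^10 = 1/1024
print(closest_alive([2, 5, 6], {0, 6, 11}))  # 2 and 5 are down
```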
Handling Dynamism • Periodic stabilization used to make pointers eventually correct • Try pointing succ to closest alive successor • Try pointing pred to closest alive predecessor • Periodically at node p: • set v:=succ.pred • if v≠nil and v is in (p,succ] • set succ:=v • send a notify(p) to succ • When receiving notify(q) at node p: • if pred=nil or q is in (pred,p] • set pred:=q
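The stabilization rules can be simulated in-process. The sketch below follows the slide's pseudocode and its half-open intervals; the Node class, the direct method calls in place of network messages, and the example ring {0, 5, 11} with node 6 joining are all assumptions made for illustration.

```python
def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]; (a, a] is the whole ring."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self  # a lone node is its own successor
        self.pred = None

    def stabilize(self):
        # "Periodically at node p" from the slide.
        v = self.succ.pred
        if v is not None and in_half_open(v.id, self.id, self.succ.id):
            self.succ = v
        self.succ.notify(self)

    def notify(self, q):
        # "When receiving notify(q) at node p" from the slide.
        if self.pred is None or in_half_open(q.id, self.pred.id, self.id):
            self.pred = q

# A correct three-node ring 0 -> 5 -> 11 -> 0.
n0, n5, n11 = Node(0), Node(5), Node(11)
for a, b in [(n0, n5), (n5, n11), (n11, n0)]:
    a.succ, b.pred = b, a

# Node 6 joins: it only sets succ to the result of lookup(6), i.e. node 11.
n6 = Node(6)
n6.succ = n11

ring = [n0, n5, n6, n11]
for _ in range(3):  # a few stabilization rounds
    for node in ring:
        node.stabilize()

print([(n.id, n.succ.id, n.pred.id) for n in ring])
```

In this run, node 11 learns about node 6 in the first round, and node 5 adopts node 6 as its successor in the second.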
Handling joins • When new node n joins • Find n’s successor with lookup(n) • Set succ to n’s successor • Stabilization fixes the rest
Handling leaves • When n leaves • Just disappear (like failure) • When pred detected failed • Set pred to nil • When succ detected failed • Set succ to closest alive in successor list
Speeding up lookups with fingers • If only pointer to succ(p+1) is used • Worst case lookup time is n, for n nodes • Improving lookup time (binary search) • Point to succ(p+1) • Point to succ(p+2) • Point to succ(p+4) • Point to succ(p+8) • … • Point to succ(p+2^((log N)-1)) • Distance always halved to the destination, log hops
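A sketch of finger-based lookup on the running example follows. It assumes the node set {0, 2, 5, 6, 11} on a ring of size 16 and a global view for computing succ, which a real node would not have. The forwarding rule (jump to the farthest finger that does not overshoot the key) is the usual Chord rule, stated here as an assumption since the slide only lists the finger targets.

```python
import bisect

RING = 16
NODES = [0, 2, 5, 6, 11]  # node set taken from the Chord pointer example

def succ(i):
    j = bisect.bisect_left(NODES, i % RING)
    return NODES[j % len(NODES)]

def fingers(p):
    # succ(p+1), succ(p+2), succ(p+4), ..., succ(p + 2^((log N)-1))
    return [succ(p + 2 ** k) for k in range(RING.bit_length() - 1)]

def in_open(x, a, b):
    """x in the ring interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)

def in_half_open(x, a, b):
    """x in the ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)

def lookup(start, key_id):
    p, path = start, [start]
    while not in_half_open(key_id, p, succ(p + 1)):
        # Jump to the farthest finger that does not overshoot the key.
        p = next(f for f in reversed(fingers(p)) if in_open(f, p, key_id))
        path.append(p)
    path.append(succ(p + 1))  # node responsible for key_id
    return path

print(fingers(2))
print(lookup(2, 9))
```

For identifier 9 starting at node 2, the finger to node 6 skips node 5, so the path is one hop shorter than following succ pointers alone.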
Handling Dynamism of Fingers and SList • Node p periodically: • Update fingers • Lookup p+2^1, p+2^2, p+2^3, …, p+2^((log N)-1) • Update successor-list • slist := trunc(succ · succ.slist)
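Both periodic updates reduce to small computations, sketched below. The function names are hypothetical, the ring size is assumed to be a power of two, and f (the successor-list length) is passed explicitly.

```python
def finger_targets(p, ring_size):
    """Identifiers p+2^1, ..., p+2^((log N)-1) that node p looks up."""
    bits = ring_size.bit_length() - 1  # log2(N) for a power-of-two ring
    return [(p + 2 ** k) % ring_size for k in range(1, bits)]

def updated_slist(succ, succ_slist, f):
    """slist := trunc(succ . succ.slist): prepend succ, keep f entries."""
    return ([succ] + list(succ_slist))[:f]

print(finger_targets(11, 16))
print(updated_slist(2, [5, 6, 11], 3))
```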
Chord: Summary • Number of lookup hops is logarithmic in n • Fast routing/lookup like in a dictionary • Routing table size is logarithmic in n • Few nodes to ping
Reliable Routing • Iterative lookup • Generally slower • Reliability easy to achieve • Initiator in full control • Recursive lookup • Generally fast (use established links) • Several ways to do reliability • End-to-end timeouts • Any node timeouts • Difficult to determine timeout value
Replication of items • Successor-list replication (most systems) • Idea: replicate nodes • If node p responsible for set of items K • Replicate K on p’s immediate successors • Symmetric Replication • Idea: replicate identifiers • Items with key 0,16,32,48 equivalent • Whoever is responsible for 0, also stores 16,32,48 • Whoever is responsible for 16, also stores 0,32,48 • …
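Symmetric replication can be sketched as a mapping from an identifier to its set of equivalent identifiers. The ring size 64 and replication degree 4 are assumptions chosen to reproduce the slide's example (0, 16, 32, 48); the function name is hypothetical.

```python
def replica_ids(ident, ring_size=64, degree=4):
    """All identifiers treated as equivalent to ident."""
    step = ring_size // degree  # spacing between equivalent identifiers
    return sorted((ident + i * step) % ring_size for i in range(degree))

print(replica_ids(0))
print(replica_ids(16))  # same equivalence class as 0
print(replica_ids(17))
```

Because every member of a class maps to the same set, whichever node is responsible for one of the identifiers knows exactly which other identifiers it must also store.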
Towards proximity awareness • Plaxton mesh (PRR) • Pastry/Tapestry
Plaxton Mesh [PRR] • Identifiers represented with radix/base k • Often k=16, hexadecimal radix • Ring size N is a large power of k, e.g. 16^40
Plaxton Mesh (2) • Additional routing table on top of ring • Routing table construction by example • Node 3a7f keeps following routing table • Row 1: 0* 1* 2* self 4* 5* 6* 7* 8* 9* a* b* c* d* e* f* • Row 2: 30* 31* 32* 33* 34* 35* 36* 37* 38* 39* self 3b* 3c* 3d* 3e* 3f* • Row 3: 3a0* 3a1* 3a2* 3a3* 3a4* 3a5* 3a6* self 3a8* 3a9* 3aa* 3ab* 3ac* 3ad* 3ae* 3af* • Row 4: 3a70* 3a71* 3a72* 3a73* 3a74* 3a75* 3a76* 3a77* 3a78* 3a79* 3a7a* 3a7b* 3a7c* 3a7d* 3a7e* self • Kleene star * for wildcards • Flexibility to choose proximate neighbors • Invariant: row i of any node listed in row i is interchangeable
Plaxton Routing • To route from 1234 to abcd: • 1234 uses rt row 1: jump to a*, e.g. a999 • a999 uses rt row 2: jump to ab*, e.g. ab11 • ab11 uses rt row 3: jump to abc*, e.g. abc0 • abc0 uses rt row 4: jump to abcd • Routing terminates in log(N) hops • In practice log(n), where N is id size and n is number of nodes
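The prefix-routing steps can be sketched as below. Assumptions: the sketch uses a global list of node identifiers instead of per-node routing tables, and it deliberately picks the candidate that matches the fewest extra digits so that the path mirrors the slide's one-digit-per-hop example. A real node would pick a proximate entry from the appropriate row.

```python
def shared_prefix_len(a, b):
    """Number of leading digits that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(source, target, nodes):
    path, current = [source], source
    while current != target:
        row = shared_prefix_len(current, target)  # digits already matched
        wanted = target[: row + 1]                # match one more digit
        candidates = [n for n in nodes if n.startswith(wanted)]
        if not candidates:
            raise LookupError(f"no node with prefix {wanted}")
        # Sketch only: choose the candidate matching the fewest digits.
        current = min(candidates, key=lambda n: shared_prefix_len(n, target))
        path.append(current)
    return path

nodes = ["1234", "a999", "ab11", "abc0", "abcd"]  # ids from the slide
print(route("1234", "abcd", nodes))
```

Each hop fixes at least one more digit of the target, so the number of hops is bounded by the number of digits in an identifier.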
Pastry: extension to Plaxton mesh • Leaf set • Successor-list in both directions • Periodically gossiped to all leaf-set members, O(n^2) [Bamboo] • Plaxton-mesh on top of ring • Failures in routing table • Get replacement from any node on same row • Routing • Route directly to responsible node in leaf set, otherwise • Route to closer (prefix) node, otherwise • Route on ring
General Architecture for DHTs • Metric space S with distance function d • d(x,y)≥0 • d(x,x)=0 • d(x,y)=0 ⇔ x=y • d(x,y) + d(y,z) ≥ d(x,z) • d(x,y)=d(y,x) (not always in reality) • E.g.: • d(x,y) = y - x (mod N) Chord • d(x,y) = x xor y Kademlia • d(x,y) = sqrt( (x_1-y_1)^2 + … + (x_d-y_d)^2 ) CAN
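The three example distance functions translate directly into code. The function names are hypothetical; math.dist requires Python 3.8 or later.

```python
import math

def chord_distance(x, y, ring_size):
    """Clockwise distance from x to y on the ring (not symmetric)."""
    return (y - x) % ring_size

def kademlia_distance(x, y):
    """XOR of the two identifiers (symmetric)."""
    return x ^ y

def can_distance(x, y):
    """Euclidean distance between two points in d dimensions."""
    return math.dist(x, y)

print(chord_distance(11, 2, 16), chord_distance(2, 11, 16))
print(kademlia_distance(5, 6))
print(can_distance((0, 0), (3, 4)))
```

The two Chord values differ, which illustrates the "not always in reality" remark on symmetry.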
Graph Embedding • Embed a virtual graph for routing • Powers of 2 (Chord) • Plaxton mesh (Pastry/Tapestry) • Hypercube • Butterfly (Viceroy) • A node responsible for many virtual identifiers (keys) • E.g. a Chord node is responsible for all virtual ids between its predecessor and its own node id
XOR routing
Predicting routing entry liveness • Timeline: joined, last contacted, now • U: known uptime • A: time since last contacted (age) • With Pareto session time: • Delete entry if < threshold
Evaluation: performance/cost tradeoff • [Figure: performance (avg lookup latency, msec) vs. cost (bandwidth budget, bytes/node/sec)]
Comparing with parameterized DHTs • [Figure: avg lookup latency (msec) vs. avg bandwidth consumed (bytes/node/sec)]
Convex hull outlines best tradeoffs • [Figure: avg lookup latency (msec) vs. avg bandwidth consumed (bytes/node/sec)]
Lowest latency for varying churn • [Figure: avg lookup latency (msec) vs. median node session time (hours); fixed budget, variable churn] • Accordion has lowest latency at low churn • Accordion’s latency increases slightly at high churn
Accordion stays within budget • [Figure: avg bandwidth (bytes/node/sec) vs. median node session time (hours); fixed budget, variable churn] • Other protocols’ bandwidth increases with churn
DHTs • Characteristic property • Self-manage responsibilities in the presence of: • Node joins • Node leaves • Node failures • Load-imbalance • Replicas • Basic structure of DHTs • Metric space • Embed graph with efficient search algo • Let each node simulate many virtual nodes