
Structured Overlays - self-organization and scalability


Presentation Transcript


  1. Structured Overlays: self-organization and scalability. Acknowledgement: based on slides by Anwitaman Datta (Nanyang) and Ali Ghodsi

  2. Self-organization • Self-organizing systems are common in nature • Physics, biology, ecology, economics, sociology, cybernetics • Microscopic (local) interactions • Limited information, individual decisions • Distribution of control => decentralization • Symmetry in roles / peer-to-peer • Emergence of macroscopic (global) properties • Resilience • Fault tolerance as well as recovery • Adaptivity

  3. A Distributed Systems Perspective (P2P) • Centralized solutions undesirable or unattainable • Exploit resources at the edge • no dedicated infrastructure/servers • peers act as both clients and servers (servent) • Autonomous participants • large scale • dynamic system and workload • source of unpredictability • e.g., correlated failures • No global control or knowledge • rely on self-organization

  4. One solution: structured overlays / distributed hash tables

  5. What’s a Distributed Hash Table? • An ordinary hash table, which is distributed • Every node provides a lookup operation • Given a key: return the associated value • Nodes keep routing pointers • If an item is not found locally, the request is routed to another node

  6. Why’s that interesting? • Characteristic properties • Self-management in the presence of joins/leaves/failures • Routing information • Data items • Scalability • The number of nodes can be huge (to store a huge number of items) • However: search and maintenance costs scale sub-linearly (often logarithmically) with the number of nodes.

  7. Short interlude: applications

  8. Global File System • (Figure: nodes A–D; a directory maps file paths such as /home/…, /usr/…, /boot/…, /etc/… to server IP addresses) • Similar to a DFS (e.g., NFS, AFS) • But files/metadata are stored in the directory • E.g. Wuala, WheelFS… • What is new? • Application logic is self-managed • Add/remove servers on the fly • Automatic failure handling • Automatic load-balancing • No manual configuration for these operations

  9. P2P Web Servers • (Figure: nodes A–D; a directory maps host names such as www.s…, www2, www3, cs.edu to server IP addresses) • Distributed community Web server • Pages stored in the directory • What is new? • Application logic is self-managed • Automatically load-balances • Add/remove servers on the fly • Automatically handles failures • Example: CoralCDN

  10. Name-based communication pattern • (Figure: nodes A–D; a directory maps user names such as anwita, ali, alberto, ozalp to contact addresses) • Map node names to locations • Can store all kinds of contact information • Mediator peers for NAT hole punching • Profile information • Used this way by: • Internet Indirection Infrastructure (i3) • Host Identity Protocol (HIP) • P2P Session Initiation Protocol (P2PSIP)

  11. Towards DHT construction: consistent hashing

  12. Hash tables • Ordinary hash tables • put(key,value) • Store <key,value> in bucket (hash(key) mod 7) • get(key) • Fetch <key,v> s.t. <key,v> is in bucket (hash(key) mod 7) • (Figure: array of buckets 0–6)
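
As a concrete illustration of the bucket scheme on slide 12, here is a minimal Python sketch of a 7-bucket hash table; the function names and the use of Python's built-in hash() are illustrative assumptions, not taken from the slides:

    NUM_BUCKETS = 7
    buckets = [dict() for _ in range(NUM_BUCKETS)]    # bucket i holds its <key,value> pairs

    def put(key, value):
        buckets[hash(key) % NUM_BUCKETS][key] = value

    def get(key):
        return buckets[hash(key) % NUM_BUCKETS].get(key)

    put("Alberto", "Trento")
    print(get("Alberto"))    # Trento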

  13. DHT by mimicking Hash Tables • Let each bucket be a server • n servers means n buckets • Problem • How do we remove or add buckets? • A single bucket change requires re-shuffling a large fraction of items
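
To see why a single bucket change is costly, the following sketch (an illustration using assumed random string keys, not slide material) counts how many keys change bucket when hash(key) mod n goes from 7 to 8 buckets; roughly 7/8 of the keys move:

    import random, string

    def bucket(key, n):
        return hash(key) % n

    keys = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(10000)]
    moved = sum(1 for k in keys if bucket(k, 7) != bucket(k, 8))
    print(f"{moved / len(keys):.0%} of keys change bucket")    # typically around 87%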

  14. Consistent Hashing Idea • (Figure: identifier ring with identifiers 0–15) • Logical name space, called the identifier space, consisting of identifiers {0,1,2,…, N-1} • The identifier space is a logical ring modulo N • Every node picks a random identifier • Example: • Space N=16: {0,…,15} • Five nodes a, b, c, d, e • a picks 6 • b picks 5 • c picks 0 • d picks 5 • e picks 2

  15. Definition of Successor • The successor of an identifier is the first node met going in the clockwise direction starting at the identifier • Example • succ(12)=14 • succ(15)=2 • succ(6)=6

  16. Where to store items? • Use a globally known hash function H • Each item <key,value> gets the identifier H(key) • Store the item at the successor of H(key) • Term: that node is responsible for item k • Example • H(“Anwitaman”)=12 • H(“Ali”)=2 • H(“Alberto”)=9 • H(“Ozalp”)=14
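
A small sketch of the placement rule, with node identifiers chosen to match the succ() examples on slide 15 and the hash values hard-coded from slide 16 (the slides do not specify H, so these are assumptions):

    N = 16
    node_ids = sorted([2, 5, 6, 14])          # assumed node identifiers on the ring

    def successor(ident):
        # the first node met going clockwise from ident, wrapping around the ring
        for n in node_ids:
            if n >= ident:
                return n
        return node_ids[0]

    H = {"Anwitaman": 12, "Ali": 2, "Alberto": 9, "Ozalp": 14}    # example hashes from slide 16
    for key, ident in H.items():
        print(key, "is stored at node", successor(ident))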

  17. Consistent Hashing: Summary • + Scalable • Each node stores on average D/n items (for D total items, n nodes) • On average only D/n items are reshuffled for every join/leave/failure • - However: global knowledge - everybody knows everybody • Akamai works this way • Amazon Dynamo too • + Load balancing • w.h.p. O(log n) imbalance • The imbalance can be removed by having each server “simulate” log(n) random buckets

  18. Towards DHT construction: giving up on global knowledge

  19. Where to point (Chord)? • Each node points to its successor • The successor of a node p is succ(p+1) • Known as the node’s succ pointer • Each node also points to its predecessor • The first node met in the anti-clockwise direction starting at p-1 • Known as the node’s pred pointer • Example • 0’s successor is succ(1)=2 • 2’s successor is succ(3)=5 • 5’s successor is succ(6)=6 • 6’s successor is succ(7)=11 • 11’s successor is succ(12)=0

  20. DHT Lookup • To look up a key k • Calculate H(k) • Follow succ pointers until item k is found • Example • Look up “Alberto” at node 2 • H(“Alberto”)=9 • Traverse nodes: 2, 5, 6, 11 (BINGO) • Return “Trento” to the initiator
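
A hop-by-hop sketch of this lookup, assuming each node only knows its succ pointer; the node set follows slides 19–20, and the per-node item store is invented for illustration:

    N = 16
    succ = {0: 2, 2: 5, 5: 6, 6: 11, 11: 0}    # succ pointers from slide 19
    store = {11: {9: "Trento"}}                 # node 11 holds the item with identifier 9

    def between(x, a, b):
        # is x in the ring interval (a, b], identifiers taken modulo N?
        return 0 < (x - a) % N <= (b - a) % N

    def lookup(start, key_id):
        p, path = start, [start]
        while not between(key_id, p, succ[p]):
            p = succ[p]
            path.append(p)
        path.append(succ[p])                    # succ(p) is responsible for key_id
        return succ[p], path

    node, path = lookup(2, 9)                   # H("Alberto") = 9
    print(path, store[node][9])                 # [2, 5, 6, 11] Trento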

  21. Towards DHT construction: handling joins/leaves/failures

  22. Dealing with failures • Each node keeps a successor-list • Pointers to the f closest successors • succ(p+1) • succ(succ(p+1)+1) • succ(succ(succ(p+1)+1)+1) • ... • Rule: if the successor fails • Replace it with the closest alive successor • Rule: if the predecessor fails • Set pred to nil • Set f=log2(n) • With failure probability 0.5, the probability that every node in the list has failed is (1/2)^log2(n) = 1/n, so w.h.p. at least one successor survives
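
A minimal sketch of the two failure rules above; the Node class and the failed-set used for failure detection are assumptions made for illustration:

    failed = set()                                       # ids of nodes detected as failed

    class Node:
        def __init__(self, ident, slist):
            self.id, self.slist = ident, list(slist)     # slist: the f closest successors, in ring order
            self.succ, self.pred = slist[0], None

    def alive(n):
        return n not in failed

    def repair(node):
        if not alive(node.succ):
            node.succ = next(s for s in node.slist if alive(s))    # closest alive successor
        if node.pred is not None and not alive(node.pred):
            node.pred = None                                       # rule: set pred to nil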

  23. Handling Dynamism • Periodic stabilization is used to make pointers eventually correct • Try pointing succ to the closest alive successor • Try pointing pred to the closest alive predecessor • Periodically at node p: • set v := succ.pred • if v ≠ nil and v is in (p, succ] • set succ := v • send a notify(p) to succ • When receiving notify(q) at node p: • if pred = nil or q is in (pred, p] • set pred := q
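
The stabilization pseudocode above, transcribed into a runnable Python sketch; in-memory objects stand in for remote nodes, and RPC plus the single-node edge case are omitted:

    N = 16

    class Node:
        def __init__(self, ident):
            self.id, self.succ, self.pred = ident, None, None

    def between(x, a, b):
        return 0 < (x - a) % N <= (b - a) % N       # x in (a, b] on the ring

    def stabilize(p):                               # run periodically at node p
        v = p.succ.pred
        if v is not None and between(v.id, p.id, p.succ.id):
            p.succ = v
        notify(p.succ, p)                           # send notify(p) to succ

    def notify(p, q):                               # node p receives notify(q)
        if p.pred is None or between(q.id, p.pred.id, p.id):
            p.pred = q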

  24. Handling joins • When a new node n joins • Find n’s successor with lookup(n) • Set succ to n’s successor • Stabilization fixes the rest • Periodically at node p: • set v := succ.pred • if v ≠ nil and v is in (p, succ] • set succ := v • send a notify(p) to succ • When receiving notify(q) at node p: • if pred = nil or q is in (pred, p] • set pred := q

  25. Handling leaves • When n leaves • It just disappears (like a failure) • When pred is detected as failed • Set pred to nil • When succ is detected as failed • Set succ to the closest alive node in the successor list • Periodically at node p: • set v := succ.pred • if v ≠ nil and v is in (p, succ] • set succ := v • send a notify(p) to succ • When receiving notify(q) at node p: • if pred = nil or q is in (pred, p] • set pred := q

  26. Speeding up lookups with fingers • If only the pointer to succ(p+1) is used • The worst-case lookup time is n hops, for n nodes • Improving lookup time (binary-search-like) • Point to succ(p+1) • Point to succ(p+2) • Point to succ(p+4) • Point to succ(p+8) • … • Point to succ(p+2^(log2(N)-1)) • The distance to the destination is halved at each hop, so log hops
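
A sketch of finger-based routing on the 16-identifier ring with nodes as on slide 19: each lookup is forwarded to the finger closest to, but not past, the key; helper names are illustrative:

    import math

    N = 16
    node_ids = sorted([0, 2, 5, 6, 11])

    def successor(ident):
        for n in node_ids:
            if n >= ident:
                return n
        return node_ids[0]

    def between(x, a, b):
        return 0 < (x - a) % N <= (b - a) % N

    def fingers(p):
        # finger i points to succ(p + 2^i), for i = 0 .. log2(N) - 1
        return [successor((p + 2**i) % N) for i in range(int(math.log2(N)))]

    def lookup(p, key_id, hops=0):
        nxt = successor((p + 1) % N)
        if between(key_id, p, nxt):
            return nxt, hops + 1                    # the successor is responsible for the key
        candidates = [f for f in fingers(p) if between(f, p, (key_id - 1) % N)]
        best = max(candidates, key=lambda f: (f - p) % N) if candidates else nxt
        return lookup(best, key_id, hops + 1)       # greedy: closest preceding finger

    print(lookup(2, 9))                             # (11, 2): two hops instead of four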

  27. Handling Dynamism of Fingers and Successor-List • Node p periodically: • Updates its fingers • Look up p+2^1, p+2^2, p+2^3, …, p+2^(log2(N)-1) • Updates its successor-list • slist := trunc(succ · succ.slist), i.e., prepend succ to succ’s successor-list and truncate to f entries

  28. Chord: Summary • The number of lookup hops is logarithmic in n • Fast routing/lookup, as in a dictionary • Routing table size is logarithmic in n • Few nodes to ping

  29. Reliable Routing • Iterative lookup • Generally slower • Reliability is easy to achieve • The initiator stays in full control • Recursive lookup • Generally faster (uses established links) • Several ways to achieve reliability • End-to-end timeouts • Timeouts at any intermediate node • Difficult to determine the timeout value

  30. Replication of items • Successor-list replication (most systems) • Idea: replicate nodes • If node p is responsible for a set of items K • Replicate K on p’s immediate successors • Symmetric replication • Idea: replicate identifiers • Items with keys 0, 16, 32, 48 are equivalent • Whoever is responsible for 0 also stores 16, 32, 48 • Whoever is responsible for 16 also stores 0, 32, 48 • …
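
A sketch of symmetric replication's key equivalence classes; the 0/16/32/48 example implies an identifier space of N=64 and a replication degree of f=4, which are inferred parameters rather than values stated on the slide:

    def replica_keys(k, N=64, f=4):
        # keys equivalent to k: k, k + N/f, k + 2N/f, ... (mod N)
        return [(k + i * N // f) % N for i in range(f)]

    print(replica_keys(0))     # [0, 16, 32, 48]
    print(replica_keys(16))    # [16, 32, 48, 0]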

  31. Towards proximity awareness: Plaxton mesh (PRR), Pastry/Tapestry

  32. Plaxton Mesh [PRR] • Identifiers are represented in radix/base k • Often k=16, hexadecimal radix • The ring size N is a large power of k, e.g. 16^40

  33. Plaxton Mesh (2) • An additional routing table on top of the ring • Routing table construction by example: node 3a7f keeps the following routing table (the Kleene star * is a wildcard; “self” marks the node’s own prefix) • Row 1: 0* 1* 2* [self: 3*] 4* 5* 6* 7* 8* 9* a* b* c* d* e* f* • Row 2: 30* 31* … 39* [self: 3a*] 3b* … 3f* • Row 3: 3a0* … 3a6* [self: 3a7*] 3a8* … 3af* • Row 4: 3a70* … 3a7e* [self: 3a7f] • Flexibility to choose proximate neighbors • Invariant: any node matching the prefix of a row-i entry can fill that entry, so row-i entries are interchangeable

  34. Plaxton Routing • To route from 1234 to abcd: • 1234 uses routing-table row 1: jump to a*, e.g. a999 • a999 uses row 2: jump to ab*, e.g. ab11 • ab11 uses row 3: jump to abc*, e.g. abc0 • abc0 uses row 4: jump to abcd • Routing terminates in log(N) hops • In practice log(n) hops, where N is the size of the identifier space and n is the number of nodes
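
A sketch of the prefix-routing step: at every hop, forward to any known node that matches at least one more digit of the destination; the node list below is made up to reproduce the 1234-to-abcd example:

    def shared_prefix_len(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def route(src, dst, nodes):
        path, cur = [src], src
        while cur != dst:
            k = shared_prefix_len(cur, dst)
            # any node matching one more digit will do; proximity can break ties
            cur = next(n for n in nodes if shared_prefix_len(n, dst) > k)
            path.append(cur)
        return path

    nodes = ["1234", "a999", "ab11", "abc0", "abcd"]
    print(route("1234", "abcd", nodes))    # ['1234', 'a999', 'ab11', 'abc0', 'abcd']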

  35. Pastry – an extension of the Plaxton mesh • Leaf set • A successor-list in both directions • Periodically gossiped among all leaf-set members, O(n²) [Bamboo] • Plaxton mesh on top of the ring • Failures in the routing table • Get a replacement from any node on the same row • Routing • Route directly to the responsible node if it is in the leaf set, otherwise • Route to a node with a longer matching prefix, otherwise • Route along the ring

  36. Architecture of structured overlays: a formal view of DHTs

  37. General Architecture for DHTs • A metric space S with distance function d • d(x,y) ≥ 0 • d(x,x) = 0 • d(x,y) = 0 => x = y • d(x,z) ≤ d(x,y) + d(y,z) (triangle inequality) • d(x,y) = d(y,x) (not always in reality) • E.g.: • d(x,y) = (y – x) mod N (Chord) • d(x,y) = x xor y (Kademlia) • d(x,y) = sqrt((x1–y1)² + … + (xd–yd)²) (CAN)
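
The three example distance functions written out as code; this is a hedged sketch in which the identifier-space size and the coordinate representation are arbitrary choices:

    import math

    N = 2**16                                   # identifier-space size (illustrative)

    def d_chord(x, y):
        return (y - x) % N                      # clockwise ring distance (not symmetric)

    def d_kademlia(x, y):
        return x ^ y                            # XOR distance on integer identifiers

    def d_can(x, y):
        # Euclidean distance between d-dimensional coordinate tuples
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))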

  38. Graph Embedding • Embed a virtual graph for routing • Powers of 2 (Chord) • Plaxton mesh (Pastry/Tapestry) • Hypercube • Butterfly (Viceroy) • A node is responsible for many virtual identifiers (keys) • E.g., Chord nodes are responsible for all virtual identifiers between their predecessor and their own node id

  39. XOR routing

  40. Numerous optimizations

  41. Predicting routing-entry liveness • (Figure: timeline from “joined” to “last contacted” to “now”) • A: time since last contacted • U: known uptime (age) • With Pareto-distributed session times, estimate whether the entry is still alive • Delete the entry if the estimate falls below a threshold
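
Under the Pareto assumption this prediction has a closed form: if session times are Pareto with shape alpha and an entry has been up for u time units, the probability that it survives a further delta time units is (u / (u + delta))^alpha. The sketch below uses that formula with illustrative parameter names and threshold; it is not claimed to be the exact estimator used by the system:

    def p_alive(uptime, delta, alpha=1.0):
        # P(session > uptime + delta | session > uptime) for Pareto(alpha) session times
        return (uptime / (uptime + delta)) ** alpha

    THRESHOLD = 0.5                             # illustrative eviction threshold
    keep_entry = p_alive(uptime=3600, delta=600) >= THRESHOLD
    print(keep_entry)                           # True: about 0.86 probability of still being alive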

  42. Evaluation: performance/cost tradeoff • (Plot: performance = average lookup latency in msec vs. cost = bandwidth budget in bytes/node/sec)

  43. Comparing with parameterized DHTs • (Plot: average lookup latency in msec vs. average bandwidth consumed in bytes/node/sec)

  44. Convex hull outlines the best tradeoffs • (Plot: average lookup latency in msec vs. average bandwidth consumed in bytes/node/sec)

  45. Lowest latency for varying churn • (Plot: average lookup latency in msec vs. median node session time in hours; fixed budget, variable churn) • Accordion has the lowest latency at low churn • Accordion’s latency increases only slightly at high churn

  46. Accordion stays within its budget • (Plot: average bandwidth in bytes/node/sec vs. median node session time in hours; fixed budget, variable churn) • Other protocols’ bandwidth increases with churn

  47. DHTs • Characteristic property • Self-manage responsibilities in the presence of: • Node joins • Node leaves • Node failures • Load imbalance • Replicas • Basic structure of DHTs • A metric space • An embedded graph with an efficient search algorithm • Each node simulates many virtual nodes
