Principles of Reliable Distributed Systems Tutorial 4: SkipNet. Spring 2009 Alex Shraer. Reading Material. SkipNet: A Scalable Overlay Network with Practical Locality Properties Harvey, Jones, Saroiu, Theimer, Wolman Microsoft Research. Reminder: DHT Advantages.
Principles of Reliable Distributed Systems, Tutorial 4: SkipNet. Spring 2009, Alex Shraer
Reading Material • SkipNet: A Scalable Overlay Network with Practical Locality Properties. Harvey, Jones, Saroiu, Theimer, Wolman. Microsoft Research
Reminder: DHT Advantages • Peer-to-peer: no centralized control or infrastructure • Scalability: O(log N) routing, routing tables, join time • Load-balancing
DHT Disadvantages: SkipNet Motivation • No control over where data is stored • Data may be stored far from its users • Data may be stored outside its domain • Local accesses may leave the local organization • In practice, organizations want: • Content Locality: explicitly place data where we want it (inside the organization) • Path Locality: guarantee that local traffic (a user in the organization looking for a file of the organization) remains local • No prefix search • A prefix search, Search(key), returns the file whose name is the closest prefix to key.
Practical Requirements • Data Controllability: • Organizations want control over their own data • Even if local data is globally available • Manageability: • Data control allows for data administration, provisioning and manageability
Practical Requirements (cont’d) • Security: • Content and path locality are key building blocks for dealing with certain external attacks (DoS, Traffic analysis) • Data availability • Local data survives network partitions. • Performance • Data can be stored near clients that use it
SkipNet Content Locality • Place files at nodes according to names • Name ID space (DNS-like) • for files and nodes • node name = reverse DNS name of the host (com.microsoft.host1) • file names have same prefix • Problem?
Constrained Load-Balancing • Data uniformly distributed in designated subset of nodes • e.g., inside organization
SkipNet’s Two Name Spaces • Name ID space: com.microsoft.host1 (non-uniform) • Numeric ID space: h(com.microsoft.host1) (uniform)
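The two name spaces can be sketched in a few lines of Python. This is only an illustration: the helper names `name_id` and `numeric_id`, the choice of SHA-256 as the hash h, and the 8-bit ID width are assumptions made here, not details given on the slides.

```python
import hashlib

def name_id(dns_name: str) -> str:
    """Reverse the labels of a DNS name: host1.microsoft.com -> com.microsoft.host1."""
    return ".".join(reversed(dns_name.split(".")))

def numeric_id(name: str, bits: int = 8) -> str:
    """Hash a name ID to a fixed-width bit string (SHA-256 is an assumed choice of h)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (256 - bits)
    return format(value, f"0{bits}b")

hosts = ["host2.microsoft.com", "www.sun.com", "host1.microsoft.com"]
names = sorted(name_id(h) for h in hosts)  # name IDs cluster by organization
for n in names:
    print(n, numeric_id(n))                # numeric IDs are spread uniformly
```

Sorting by name ID keeps the hosts of one organization next to each other, while the hashed numeric IDs carry no such structure.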
Skip List data structure (Pugh 90) • In-memory dictionary data structure. • Elements are stored in a sorted linked list. • Problem: Search, Insert, Delete take O(N), where N is the number of nodes in the list. • Solution: a subset of nodes have additional links that skip over many list elements • Perfect (deterministic) skip list: • Pointer at level h skips over 2^h elements • Search: O(log N) • Insertion/deletion: expensive/awkward (figure: sorted list head, 3, 6, 7, 9, 12, 17, 19, 21, 25, 26, tail, with skip pointers)
Probabilistic Skip List • A node reaches level h with probability 1/2^h • Search, Insert, Delete: O(log N) w.h.p. (figure: the same list, head, 3, 6, 7, 9, 12, 17, 19, 21, 25, 26, tail, with randomly chosen node heights)
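A minimal probabilistic skip list in Python, as a sketch of the idea rather than Pugh's exact formulation. The class name, the dict-based nodes, the `MAX_LEVEL` cap of 16 and the fair coin flip are assumptions for illustration.

```python
import random

class SkipList:
    MAX_LEVEL = 16  # assumed cap on node height

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.head = {"key": None, "next": [None] * self.MAX_LEVEL}
        self.level = 1  # number of levels currently in use

    def _random_height(self):
        # Each extra level is kept with probability 1/2.
        h = 1
        while h < self.MAX_LEVEL and self.rng.random() < 0.5:
            h += 1
        return h

    def _predecessors(self, key):
        # Top-down search: at each level, move right while the next key is smaller.
        preds = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.level - 1, -1, -1):
            while node["next"][lvl] is not None and node["next"][lvl]["key"] < key:
                node = node["next"][lvl]
            preds[lvl] = node
        return preds

    def insert(self, key):
        preds = self._predecessors(key)
        h = self._random_height()
        self.level = max(self.level, h)
        new = {"key": key, "next": [None] * h}
        for lvl in range(h):  # splice the new node into every level it reaches
            new["next"][lvl] = preds[lvl]["next"][lvl]
            preds[lvl]["next"][lvl] = new

    def contains(self, key):
        cand = self._predecessors(key)[0]["next"][0]
        return cand is not None and cand["key"] == key

    def keys(self):
        out, node = [], self.head["next"][0]
        while node is not None:
            out.append(node["key"])
            node = node["next"][0]
        return out

sl = SkipList(seed=1)
for k in [21, 9, 26, 17, 6, 3, 25, 12, 19, 7]:
    sl.insert(k)
print(sl.keys(), sl.contains(17), sl.contains(18))
```

The bottom level is always the full sorted list, so correctness never depends on the random heights; they only affect how many elements a search can skip.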
Skip List: Good for Us? • The Good: • Sorted list: path locality for name-based search • O(log N) (w.h.p.) operations • The Bad: • Lookup can start only from the root (head) • Unequal load • Nodes on the top levels are very likely to be on the routing path
SkipNet Global View (figure: the full SkipNet routing infrastructure for an 8 node system, including the ring labels. Level L = 0 is the root ring containing all nodes A, D, M, O, T, V, X, Z; L = 1 has Ring 0 and Ring 1; L = 2 has Ring 00 to Ring 11; L = 3 has Ring 000 to Ring 111.)
SkipNet Structure • Skip Graph = Distributed Skip List • Every node belongs to rings at all levels • Search can start at any node • Use doubly linked lists at each level • Perfect vs. Probabilistic • Perfect: Pointers at level h point to nodes that are exactly 2^h nodes to the left and right. • Probabilistic: At level h, a node determines probabilistically which ring it belongs to.
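The ring structure can be sketched as follows: the ring at level h groups all nodes whose numeric IDs share the same first h digits, and each ring is kept sorted by name ID. The numeric IDs in `NUM_ID` are illustrative assumptions for the eight example nodes, and `build_rings` and `routing_table` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed numeric IDs for the 8 example nodes (illustrative only).
NUM_ID = {"A": "000", "T": "001", "D": "010", "X": "011",
          "Z": "100", "O": "101", "M": "110", "V": "111"}
LEVELS = 3

def build_rings(num_id, levels):
    """Map each ring label (numeric-ID prefix) to its members sorted by name ID.
    The empty label '' is the root ring."""
    rings = defaultdict(list)
    for name in sorted(num_id):
        for h in range(levels + 1):
            rings[num_id[name][:h]].append(name)
    return dict(rings)

def routing_table(node, num_id, rings, levels):
    """For each level h, the (counterclockwise, clockwise) neighbours of node
    in the level-h ring it belongs to."""
    table = []
    for h in range(levels + 1):
        ring = rings[num_id[node][:h]]
        i = ring.index(node)
        table.append((ring[i - 1], ring[(i + 1) % len(ring)]))
    return table

rings = build_rings(NUM_ID, LEVELS)
for h, (ccw, cw) in enumerate(routing_table("A", NUM_ID, rings, LEVELS)):
    print(f"L = {h}: counterclockwise {ccw}, clockwise {cw}")
```

With one node per top-level ring, the top-level pointers of a node point back to itself, which is why a routing table needs only about log N useful levels.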
SkipNet Routing Tables (figure: the ring diagram for levels L = 0 to L = 3, from the root ring up to Ring 000 to Ring 111, with the label "Node A’s Routing Table")
An Alternative View (figure: SkipNet nodes ordered by name ID, each labeled with a 3-bit numeric ID. Routing tables of nodes A and V shown.)
Routing By Name ID • Like search in a Skip List • Simple Rule: • Forward the message to the node that is closest to the destination, without going too far. • Route either clockwise or counterclockwise • Terminates when the message arrives at a node whose name ID is closest to the destination. • Number of hops is O(log N) w.h.p.
Example: Routing from A to V (figure sequence in three steps over the ring diagram, levels L = 0 to L = 3; the middle step is labeled "Node T’s Routing Table")
Example: Routing to Object • Route from A to F -> Terminates at E (figure: ring diagram, levels L = 0 to L = 3, with nodes A, D, E, O, T, V, X, Z)
Name ID Routing Algorithm
// Invoked at all nodes (including the source and
// destination nodes) along the routing path.
RouteByNameID(msg) {
  // Forward along the longest pointer
  // that is between us and msg.nameID.
  h = localNode.maxHeight;
  while (h >= 0) {
    nbr = localNode.RouteTable[msg.dir][h];
    if (LiesBetween(localNode.nameID, nbr.nameID, msg.nameID, msg.dir)) {
      SendToNode(msg, nbr);
      return;
    }
    h = h - 1;
  }
  // h<0 implies we are the closest node.
  DeliverMessage(msg.msg);
}
SendMsg(nameID, msg) {
  if( LongestPrefix(nameID,localNode.nameID)==0 )
    msg.dir = RandomDirection();   // slide callout: Load Balancing
  else if( nameID<localNode.nameID )
    msg.dir = counterClockwise;    // slide callout: Path Locality
  else
    msg.dir = clockwise;           // slide callout: Path Locality
  msg.nameID = nameID;
  RouteByNameID(msg);
}
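The name ID routing rule can be tried out on a small simulated overlay. The following Python sketch mirrors RouteByNameID but runs all nodes in one process; the numeric IDs in `NUM_ID` are assumed values for the eight example nodes, and `neighbour`, `lies_between` and `route_by_name_id` are hypothetical names. The direction is passed in explicitly so that the result is deterministic.

```python
# Assumed numeric IDs for the 8 example nodes (illustrative only).
NUM_ID = {"A": "000", "T": "001", "D": "010", "X": "011",
          "Z": "100", "O": "101", "M": "110", "V": "111"}
MAX_HEIGHT = 3

def neighbour(node, h, clockwise):
    """Level-h pointer: adjacent node in the ring of nodes sharing h numeric-ID digits."""
    ring = sorted(n for n in NUM_ID if NUM_ID[n][:h] == NUM_ID[node][:h])
    i = ring.index(node)
    return ring[(i + 1) % len(ring)] if clockwise else ring[i - 1]

def lies_between(start, nbr, dest, clockwise):
    """True if nbr is past start but not past dest in the travel direction
    (name IDs are compared as strings, with wrap-around)."""
    if clockwise:
        if start < dest:
            return start < nbr <= dest
        return nbr > start or nbr <= dest
    if dest < start:
        return dest <= nbr < start
    return nbr < start or nbr >= dest

def route_by_name_id(src, dest, clockwise):
    """Return the list of nodes visited when routing from src towards name ID dest."""
    path, node = [src], src
    while node != dest:
        for h in range(MAX_HEIGHT, -1, -1):  # try the longest pointer first
            nbr = neighbour(node, h, clockwise)
            if nbr != node and lies_between(node, nbr, dest, clockwise):
                node = nbr
                path.append(node)
                break
        else:
            break  # no pointer qualifies: this node is the closest one
    return path

print(route_by_name_id("A", "V", clockwise=True))   # ['A', 'T', 'V']
print(route_by_name_id("A", "V", clockwise=False))  # ['A', 'X', 'V']
print(route_by_name_id("A", "F", clockwise=True))   # ['A', 'D']
```

In this assumed node set there is no node named F, so the clockwise route for F stops at D, the last node that does not overshoot the destination.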
Routing By Numeric ID • Numeric IDs are random, so no ring is sorted by them • We can’t route top-down! • Bottom-up Routing • Routing begins in the level 0 ring and continues until a node is found whose numeric ID matches the destination numeric ID in the first digit. • Messages are forwarded from a ring at level h, R_h, to a ring at level h+1, R_{h+1}, such that the nodes in R_{h+1} share h+1 digits with the destination numeric ID. • Terminates when the message is delivered, or when none of the nodes in R_h share h+1 digits with the destination numeric ID
Example: Routing by Numeric ID • Hash(“Foo.c”) = 101 (figure: ring diagram for levels L = 0 to L = 3, from the root ring up to Ring 000 to Ring 111, showing where Foo.c is routed)
Routing by Numeric ID • The same routing tables are used for routing by nameID and numericID • The number of message hops is O(log N) whp • What sequential data structure does this search resemble?
Routing Algorithm
// Invoked at all nodes (including the source and destination nodes) along the routing path.
// Initially: msg.ringLvl = -1, msg.startNode = msg.bestNode = null & msg.finalDestination = false
RouteByNumericID(msg) {
  if (msg.numID == localNode.numID || msg.finalDestination) {
    DeliverMessage(msg.msg);
    return;
  }
  if (localNode == msg.startNode) {
    // Done traversing current ring.
    msg.finalDestination = true;
    SendToNode(msg, msg.bestNode);
    return;
  }
  h = CommonPrefixLen(msg.numID, localNode.numID);
  if (h > msg.ringLvl) {
    // Found a higher ring.
    msg.ringLvl = h;
    msg.startNode = msg.bestNode = localNode;
  } else if (abs(localNode.numID - msg.numID) < abs(msg.bestNode.numID - msg.numID)) {
    // Found a better candidate for current ring.
    msg.bestNode = localNode;
  }
  // Forward along current ring.
  nbr = localNode.RouteTable[clockWise][msg.ringLvl];
  SendToNode(msg, nbr);
}
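A single-process Python simulation of the bottom-up search may help in following the message state. It is a sketch under assumptions: the numeric IDs in `NUM_ID` are illustrative values for the eight example nodes, and `ring_successor`, `common_prefix_len` and `route_by_numeric_id` are hypothetical names. A node that would forward the message to itself is not listed twice in the returned path.

```python
# Assumed numeric IDs for the 8 example nodes (illustrative only).
NUM_ID = {"A": "000", "T": "001", "D": "010", "X": "011",
          "Z": "100", "O": "101", "M": "110", "V": "111"}

def ring_successor(num_id, node, level):
    """Clockwise neighbour of node in its ring at the given level."""
    ring = sorted(n for n in num_id if num_id[n][:level] == num_id[node][:level])
    return ring[(ring.index(node) + 1) % len(ring)]

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route_by_numeric_id(num_id, src, target):
    """Return (delivery node, path) for a message addressed to numeric ID target."""
    def dist(n):
        return abs(int(num_id[n], 2) - int(target, 2))

    ring_lvl, start, best, final = -1, None, None, False
    node, path = src, [src]
    while True:
        if num_id[node] == target or final:
            return node, path
        if node == start:                # done traversing the current ring
            final, nxt = True, best
        else:
            h = common_prefix_len(target, num_id[node])
            if h > ring_lvl:             # found a higher ring
                ring_lvl, start, best = h, node, node
            elif dist(node) < dist(best):
                best = node              # better candidate in the current ring
            nxt = ring_successor(num_id, node, ring_lvl)
        if nxt != node:
            path.append(nxt)
        node = nxt

print(route_by_numeric_id(NUM_ID, "A", "101"))  # ('O', ['A', 'D', 'M', 'O'])
without_o = {n: i for n, i in NUM_ID.items() if n != "O"}
print(route_by_numeric_id(without_o, "A", "101"))  # ('Z', ['A', 'D', 'M', 'V', 'Z'])
```

The second call shows the termination case where no node has the exact numeric ID: the message ends at the best candidate in the highest ring that was reached.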
Routing Summary • It all depends on how we look at the routing tables … • What is the data structure consisting of all the pointers in the rings that a specific node’s name ID belongs to? • A Skip List! Search is top-down. • What is the data structure consisting of all the rings with respect to searching by numeric ID? • A Trie! Search is bottom-up. • The search in both directions takes O(log N) messages whp. • Ready for join/departure procedures?
Node Join • Two-stage process: (1) bottom-up + (2) top-down • Bottom-up: find the top level ring that matches the node’s numeric ID. • Top-down: build the new node’s routing table • Find a neighbor in the top ring using name ID search. • Starting from this neighbor, search for the name ID at the next lower level and thus find the neighbors at that level. • Repeat until the search reaches the root ring. • Update of the existing nodes’ routing tables: • after the new node has joined the root ring.
Node join illustrated (figure: the joining node and rings P, P0 and P1; annotation: "Only a few in expectation")
Node Join - Analysis • Key ideas: • Climb to a weakly populated ring. • Search for the node’s neighbors at the lower levels only after finding the neighbors at the higher levels. • The range of nodes traversed at a level = the range between the neighbors at the next higher level. • Insertion traverses O(log N) hops whp • Expected O(log N) levels, constant number of neighbors at each level.
Node Departure/Failure • Graceful (notified) vs crash departure • Key issue: updating the routing tables • Key idea: separate vital info from optimizations • Routing is correct as long as the root level ring is maintained. • Other levels are regarded as optimization hints • Does this remind you of something? • Upper-ring membership is maintained through a background repair process.
Leaf Sets • Idea = use redundant pointers at level 0: • Protect from independent failures • Improve the search performance • Store L/2 pointers in each direction • SkipNet uses L=16 • Not an original SkipNet idea: also used in Pastry.
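As a rough illustration of leaf sets, the Python sketch below takes a root ring sorted by name ID and returns the L/2 nearest nodes on each side of a given node, then picks the first pointer that has not failed. The function names and the `failed` set are assumptions for illustration; a real node would learn about failures through timeouts or notifications.

```python
def leaf_set(ring, node, size=16):
    """Return (counterclockwise, clockwise) leaf pointers of node, nearest first.
    ring is the root ring sorted by name ID; size is L (16 in SkipNet)."""
    i, n = ring.index(node), len(ring)
    half = min(size // 2, n - 1)  # small rings cannot fill a full leaf set
    ccw = [ring[(i - d) % n] for d in range(1, half + 1)]
    cw = [ring[(i + d) % n] for d in range(1, half + 1)]
    return ccw, cw

def first_live(pointers, failed):
    """First pointer, nearest first, that is not known to have failed."""
    for p in pointers:
        if p not in failed:
            return p
    return None

ring = ["A", "D", "M", "O", "T", "V", "X", "Z"]
ccw, cw = leaf_set(ring, "A", size=4)
print(ccw, cw)                      # ['Z', 'X'] ['D', 'M']
print(first_live(cw, failed={"D"}))  # 'M'
```

With L/2 pointers per direction, the root ring stays connected unless L/2 consecutive neighbours fail at once.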
Constrained Load Balancing (CLB) • Multiple DHTs with differing scopes using a single SkipNet structure • A result of the ability to route in both address spaces • Divide data object names into two parts with "!": a CLB domain (name routing) before the "!" and a CLB suffix (numeric routing) after it, e.g. microsoft.com!skipnet.html • microsoft.com/skipnet.html! – controlled placement • !microsoft.com/skipnet.html – Global DHT
CLB Example • File ID = “com.microsoft!skipnet.html” • Route by name ID to com.microsoft • Inside com.microsoft, route by numeric ID to hash(“skipnet.html”) (figure: ring of domains com.microsoft, com.sun, edu.ucb, gov.irs, with skipnet.html placed inside com.microsoft)
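The two CLB steps can be sketched in Python: split the file ID at "!", restrict the candidates to nodes inside the CLB domain, and pick the node whose numeric ID best matches the hash of the suffix. This is a centralized approximation of the routing, not the distributed algorithm itself. SHA-256, the 16-bit ID width and the names `split_clb_name`, `numeric_id` and `place` are assumptions, and the explicit-placement form with an empty suffix is not modelled.

```python
import hashlib

def numeric_id(text, bits=16):
    """Assumed hash: top bits of SHA-256, as an integer."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits)

def split_clb_name(full_name):
    """Split '<CLB domain>!<CLB suffix>' at the first '!'."""
    domain, sep, suffix = full_name.partition("!")
    if not sep:
        raise ValueError("expected '<CLB domain>!<CLB suffix>'")
    return domain, suffix

def place(full_name, node_names, bits=16):
    """Pick the node inside the CLB domain whose numeric ID best matches hash(suffix)."""
    domain, suffix = split_clb_name(full_name)
    if domain:
        candidates = [n for n in node_names
                      if n == domain or n.startswith(domain + ".")]
    else:
        candidates = list(node_names)  # empty domain: global DHT
    if not candidates:
        raise LookupError("no node inside domain " + domain)
    target = numeric_id(suffix, bits)

    def score(name):
        nid = numeric_id(name, bits)
        shared = bits - (nid ^ target).bit_length()  # common prefix length in bits
        return (shared, -abs(nid - target))

    return max(candidates, key=score)

nodes = ["com.microsoft.host1", "com.microsoft.host2", "com.microsoft.research.host3",
         "com.sun.host1", "edu.ucb.host1", "gov.irs.host1"]
print(place("com.microsoft!skipnet.html", nodes))
print(place("!com.microsoft/skipnet.html", nodes))
```

The first call can only return a com.microsoft node, while the second may land on any node in the system.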
SkipNet Path Locality • Organizations correspond to contiguous SkipNet segments • Internal routing by NameID remains internal • Nodes have left / right pointers (figure: ring of domains com.microsoft, com.sun, edu.ucb, gov.irs, with com.microsoft.research as a sub-segment of com.microsoft)
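The contiguity claim follows from sorting by name ID: all names that share a prefix are adjacent in lexicographic order. The Python sketch below checks this for a small, made-up set of node names and walks the level 0 ring between two nodes of one organization; `domain_segment` and `level0_path` are hypothetical helpers.

```python
def domain_segment(sorted_names, prefix):
    """Return (first, last) indices of the nodes whose name ID starts with prefix."""
    idx = [i for i, n in enumerate(sorted_names) if n.startswith(prefix)]
    if not idx:
        return None
    first, last = idx[0], idx[-1]
    assert idx == list(range(first, last + 1))  # the segment has no gaps
    return first, last

def level0_path(sorted_names, src, dst):
    """Walk the level 0 ring from src to dst in the direction SendMsg would pick
    for names that share a prefix (no wrap-around is needed in that case)."""
    i, j = sorted_names.index(src), sorted_names.index(dst)
    step = 1 if j >= i else -1
    path = [sorted_names[i]]
    while i != j:
        i += step
        path.append(sorted_names[i])
    return path

names = sorted(["gov.irs.w", "com.microsoft.z", "com.sun.x", "edu.ucb.y",
                "com.microsoft.a", "com.microsoft.research.b"])
print(domain_segment(names, "com.microsoft"))           # (0, 2)
print(level0_path(names, "com.microsoft.a", "com.microsoft.z"))
```

Higher-level pointers only shorten this walk and never overshoot the destination, so every hop between two com.microsoft nodes stays inside the com.microsoft segment.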