Tapestry: Scalable and Fault-tolerant Routing and Location

Stanford Networking Seminar, October 2001
Ben Y. Zhao, ravenben@eecs.berkeley.edu

Challenges in the Wide-area
• Trends:
  • Exponential growth in CPU, storage
  • Network expanding in reach and b/w
• Can applications leverage new resources?
  • Scalability: increasing users, requests, traffic
  • Resilience: more components → inversely low MTBF
  • Management: intermittent resource availability → complex management schemes
• Proposal: an infrastructure that solves these issues and passes benefits onto applications

Driving Applications
• Leverage of cheap & plentiful resources: CPU cycles, storage, network bandwidth
• Global applications share distributed resources
  • Shared computation: SETI, Entropia
  • Shared storage: OceanStore, Gnutella, Scale-8
  • Shared bandwidth: application-level multicast, content distribution networks

Key: Location and Routing
• Hard problem:
  • Locating and messaging to resources and data
• Goals for a wide-area overlay infrastructure
  • Easy to deploy
  • Scalable to millions of nodes, billions of objects
  • Available in presence of routine faults
  • Self-configuring, adaptive to network changes
  • Localize effects of operations/failures

Talk Outline
• Motivation
• Tapestry overview
• Fault-tolerant operation
• Deployment / evaluation
• Related / ongoing work

What is Tapestry?
• A prototype of a decentralized, scalable, fault-tolerant, adaptive location and routing infrastructure (Zhao, Kubiatowicz, Joseph et al., U.C. Berkeley)
• Network layer of OceanStore
• Routing: suffix-based hypercube
  • Similar to Plaxton, Rajaraman, Richa (SPAA97)
• Decentralized location:
  • Virtual hierarchy per object with cached location references
• Core API:
  • publishObject(ObjectID, [serverID])
  • routeMsgToObject(ObjectID)
  • routeMsgToNode(NodeID)

Routing and Location
• Namespace (nodes and objects)
  • 160 bits → 2^80 names before name collision
  • Each object has its own hierarchy rooted at Root: f(ObjectID) = RootID, via a dynamic mapping function
• Suffix routing from A to B
  • At the h-th hop, arrive at nearest node hop(h) s.t. hop(h) shares a suffix with B of length h digits
  • Example: 5324 routes to 0629 via 5324 → 2349 → 1429 → 7629 → 0629
• Object location:
  • Root responsible for storing object's location
  • Publish / search both route incrementally to root
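The suffix-routing invariant on this slide can be checked mechanically. The following is a minimal sketch, not Tapestry code: the helper names `shared_suffix_len` and `is_suffix_route` are made up for illustration, and IDs are treated as plain digit strings.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits that IDs a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def is_suffix_route(path, dest):
    """True if the node reached at hop h shares at least h trailing digits with dest."""
    return all(shared_suffix_len(node, dest) >= h
               for h, node in enumerate(path[1:], start=1))


# The example route from the slide: 5324 -> 2349 -> 1429 -> 7629 -> 0629
path = ["5324", "2349", "1429", "7629", "0629"]
print([shared_suffix_len(node, "0629") for node in path])  # [0, 1, 2, 3, 4]
print(is_suffix_route(path, "0629"))                        # True
```

Each hop resolves one more trailing digit of the destination, so a route through a namespace of d-digit IDs needs at most d hops.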

Publish / Lookup
• Publish object with ObjectID:
    // route towards "virtual root," ID = ObjectID
    For (i = 0; i < Log2(N); i += j) {    // define hierarchy
        // j is the number of bits per digit (e.g. for hex digits, j = 4)
        Insert entry into nearest node that matches on last i bits
        If no match is found, deterministically choose an alternative
    }
    // the real root node is found when no external routes are left
• Lookup object:
    // traverse the same path to the root as publish, but search for an entry at each node
    For (i = 0; i < Log2(N); i += j) {
        Search for cached object location
    }
    // once found, route via IP or Tapestry to the object
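A toy model may make the publish and lookup walks concrete. This sketch makes several simplifying assumptions that are not from the slides: every node can see the full node list, "nearest" is replaced by a deterministic ordering (smallest improvement in suffix match, then lowest ID), and the root is the best-matching node with ties broken by highest ID. All function names are hypothetical.

```python
from collections import defaultdict


def suffix_len(a, b):
    """Number of trailing digits shared by two equal-length IDs."""
    n = 0
    while n < len(a) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def path_to_root(nodes, start, object_id):
    """Nodes visited when routing from start toward the object's root."""
    # Assumed tie-break: the root is the best-matching node, highest ID wins.
    root = max(nodes, key=lambda n: (suffix_len(n, object_id), n))
    path = [start]
    while path[-1] != root:
        level = suffix_len(path[-1], object_id)
        better = [n for n in nodes if suffix_len(n, object_id) > level]
        # Stand-in for "nearest node": smallest improvement first, then lowest ID.
        nxt = min(better, key=lambda n: (suffix_len(n, object_id), n)) if better else root
        path.append(nxt)
    return path


def publish(nodes, pointers, server, object_id):
    """Leave a location pointer at every node on the way to the root."""
    for node in path_to_root(nodes, server, object_id):
        pointers[node][object_id] = server


def lookup(nodes, pointers, client, object_id):
    """Walk toward the root; stop at the first node holding a pointer."""
    for node in path_to_root(nodes, client, object_id):
        if object_id in pointers[node]:
            return pointers[node][object_id], node
    return None


nodes = ["5324", "2349", "1429", "7629", "0629", "1111", "3329"]
pointers = defaultdict(dict)
publish(nodes, pointers, server="5324", object_id="0629")
print(lookup(nodes, pointers, client="3329", object_id="0629"))  # ('5324', '7629')
```

In this run the lookup from 3329 is answered at 7629, one hop before the root 0629, because the publish path already left a pointer there.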

Tapestry Mesh: Incremental suffix-based routing
[Figure: mesh of nodes with links labeled by routing level (1 to 4). NodeIDs shown: 0x43FE, 0x79FE, 0x23FE, 0x993E, 0x73FE, 0x44FE, 0xF990, 0x035E, 0x04FE, 0x13FE, 0x555E, 0xABFE, 0x9990, 0x239E, 0x73FF, 0x1290, 0x423E]

Routing in Detail
Example: octal digits, 2^12 namespace, 5712 → 7510
Nodes shown on the route, in order: 5712, 0880, 3210, 4510, 7510

Neighbor map for "5712" (octal), one column per routing level:

Digit   Level 4   Level 3   Level 2   Level 1
0       0712      x012      xx02      xxx0
1       1712      x112      5712      xxx1
2       2712      x212      xx22      5712
3       3712      x312      xx32      xxx3
4       4712      x412      xx42      xxx4
5       5712      x512      xx52      xxx5
6       6712      x612      xx62      xxx6
7       7712      5712      xx72      xxx7
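The neighbor map for 5712 follows a regular pattern: the level-L entry for digit d fixes the last L-1 digits of the node's own ID, places d in front of them, and leaves the remaining leading digits as wildcards. A small sketch that regenerates the patterns (the function names `slot` and `neighbor_map` are illustrative, not part of Tapestry):

```python
DIGITS = "0123456789ABCDEF"


def slot(node_id: str, level: int, digit: str) -> str:
    """Suffix pattern stored at (level, digit); the node itself if it matches."""
    k = len(node_id) - level           # index of the digit this level varies
    if node_id[k] == digit:
        return node_id                 # the node fills its own slot
    return "x" * k + digit + node_id[k + 1:]


def neighbor_map(node_id: str, base: int = 8):
    """Rows indexed by digit; columns run from the highest level down to level 1."""
    levels = range(len(node_id), 0, -1)
    return [[slot(node_id, lv, DIGITS[d]) for lv in levels] for d in range(base)]


for row in neighbor_map("5712"):
    print(" ".join(row))
# first row: 0712 x012 xx02 xxx0
```

Routing from 5712 toward 7510 would consult the level-1 slot for digit 0 (xxx0) first, then the level-2 slot for digit 1 at the next node, and so on, one digit of the destination per hop.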

Object Location: Randomization and Locality

Fault-tolerant Location
• Minimized soft-state vs. explicit fault-recovery
• Redundant roots
  • Object names hashed w/ small salts → multiple names/roots
  • Queries and publishing utilize all roots in parallel
  • P(finding reference w/ partition) = 1 − (1/2)^n, where n = # of roots
• Soft-state periodic republish
  • 50 million files/node, daily republish, b = 16, N = 2^160, 40 B/msg: worst-case update traffic 156 kb/s
  • Expected traffic w/ 2^40 real nodes: 39 kb/s
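The redundant-root idea can be sketched in a few lines. The slide does not say which hash function or salt format is used, so SHA-1 (chosen only because it yields 160-bit IDs) and the `name:salt` encoding below are assumptions, and both function names are hypothetical.

```python
import hashlib


def salted_root_ids(object_name: str, num_roots: int):
    """Derive one 160-bit ID per salt; each ID maps to its own root."""
    ids = []
    for salt in range(num_roots):
        # Assumed encoding: "<name>:<salt>" hashed with SHA-1.
        digest = hashlib.sha1(f"{object_name}:{salt}".encode()).hexdigest()
        ids.append(digest)
    return ids


def p_find_reference(num_roots: int) -> float:
    """Slide formula: 1 - (1/2)^n for n roots."""
    return 1 - 0.5 ** num_roots


print(len(salted_root_ids("report.pdf", 3)))            # 3
print([p_find_reference(n) for n in (1, 2, 3, 4)])      # [0.5, 0.75, 0.875, 0.9375]
```

Each additional root halves the chance that a partition hides every reference, which is why a small number of salts already gives most of the benefit.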

Fault-tolerant Routing
• Strategy:
  • Detect failures via soft-state probe packets
  • Route around problematic hop via backup pointers
• Handling:
  • 3 forward pointers per outgoing route (2 backups)
  • 2nd chance algorithm for intermittent failures
  • Upgrade backup pointers and replace
• Protocols:
  • First Reachable Link Selection (FRLS)
  • Proactive Duplicate Packet Routing
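One way to picture the handling rules is a route entry that holds a primary pointer and two backups. The slide gives no details of the second-chance algorithm, so the policy below (a pointer is skipped after one missed probe and evicted after two consecutive misses, at which point backups move up) is an assumption for illustration only; `RouteEntry` and its methods are hypothetical names.

```python
class RouteEntry:
    """One outgoing route: a primary pointer followed by two backups."""

    def __init__(self, pointers):
        self.pointers = list(pointers)              # ordered, primary first
        self.missed = {p: 0 for p in self.pointers}

    def next_hop(self):
        """First pointer that is not currently failing its probes."""
        for p in self.pointers:
            if self.missed[p] == 0:
                return p
        return None

    def probe_result(self, node, ok, replacement=None):
        """Update state from one soft-state probe of `node`."""
        if node not in self.missed:
            return
        if ok:
            self.missed[node] = 0                   # recovered: back in service
            return
        self.missed[node] += 1
        if self.missed[node] >= 2:                  # second chance used up
            self.pointers.remove(node)              # backups move up one place
            del self.missed[node]
            if replacement is not None:
                self.pointers.append(replacement)
                self.missed[replacement] = 0


entry = RouteEntry(["A", "B", "C"])
entry.probe_result("A", ok=False)
print(entry.next_hop())     # B  (A is suspect but kept)
entry.probe_result("A", ok=True)
print(entry.next_hop())     # A  (intermittent failure, A restored)
```

Keeping a suspect pointer for one extra probe avoids discarding a nearby neighbor over a transient loss, at the cost of briefly routing through a backup.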

Summary
• Decentralized location and routing infrastructure
  • Core routing similar to PRR97
  • Distributed algorithms for object-root mapping, node insertion / deletion
  • Fault-handling with redundancy, soft-state beacons, self-repair
  • Decentralized and scalable, with locality
• Analytical properties
  • Per-node routing table size: b · log_b(N)
    • N = size of namespace, n = # of physical nodes
  • Find object in log_b(n) overlay hops
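Plugging in the parameters quoted on the fault-tolerant location slide (b = 16, N = 2^160, 2^40 physical nodes) gives a feel for these bounds. This is plain arithmetic on the two formulas; the helper name `digits` is made up for the example.

```python
import math


def digits(bits: int, base: int) -> float:
    """log_base(2^bits): number of base-`base` digits needed for `bits` bits."""
    return bits / math.log2(base)


b = 16
table_entries = b * digits(160, b)   # b * log_b(N) with N = 2^160
lookup_hops = digits(40, b)          # log_b(n) with n = 2^40

print(table_entries)   # 640.0
print(lookup_hops)     # 10.0
```

So a node in a 160-bit hexadecimal namespace keeps on the order of 640 routing entries, while a lookup among 2^40 nodes takes about 10 overlay hops.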

Deployment Status
• Java implementation in OceanStore
  • Running static Tapestry
  • Deploying dynamic Tapestry with fault-tolerant routing
• Packet-level simulator
  • Delay measured in network hops
  • No cross traffic or queuing delays
  • Topologies: AS, MBone, GT-ITM, TIERS
• ns2 simulations

Evaluation Results
• Cached object pointers
  • Efficient lookup for nearby objects
  • Reasonable storage overhead
• Multiple object roots
  • Improves availability under attack
  • Improves performance and perf. stability
• Reliable packet delivery
  • Redundant pointers approximate optimal reachability
  • FRLS, a simple fault-tolerant UDP protocol

First Reachable Link Selection
• Use periodic UDP packets to gauge link condition
• Packets routed to shortest "good" link
• Assumes IP cannot correct routing table in time for packet delivery
[Figure: IP vs. Tapestry routing among nodes A, B, C, D, E; annotation: "No path exists to dest."]

Bayeux
• Global-scale application-level multicast (NOSSDAV 2001)
• Scalability
  • Scales to > 10^5 nodes
  • Self-forming member group partitions
• Fault tolerance
  • Multicast root replication
  • FRLS for resilient packet delivery
• More optimizations
  • Group ID clustering for better b/w utilization

Bayeux: Multicast
[Figure: multicast tree over the Tapestry mesh from the Multicast Root to two Receivers, with links labeled 79FE and 793E. Nodes shown: 79FE, 23FE, 993E, F9FE, 43FE, 73FE, 44FE, F990, 29FE, 035E, 04FE, 13FE, 555E, ABFE, 9990, 793E, 239E, 1290, 093E, 423E]

Bayeux: Tree Partitioning
[Figure: Receivers send JOIN messages toward two Multicast Roots over the Tapestry mesh. Nodes shown: 79FE, 23FE, 993E, F9FE, 43FE, 73FE, 44FE, F990, 29FE, 035E, 04FE, 13FE, 555E, ABFE, 9990, 793E, 239E, 1290, 093E, 423E]

Overlay Routing Networks
• CAN: Ratnasamy et al. (ACIRI / UCB)
  • Uses d-dimensional coordinate space to implement distributed hash table
  • Route to neighbor closest to destination coordinate
  • Properties: fast insertion / deletion; constant-sized routing state; unconstrained # of hops; overlay distance not proportional to physical distance
• Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
  • Linear namespace modeled as circular address space
  • "Finger table" points to a logarithmic # of increasingly remote hosts
  • Properties: simplicity in algorithms; fast fault-recovery; log_2(N) hops and routing state; overlay distance not proportional to physical distance
• Pastry: Rowstron and Druschel (Microsoft / Rice)
  • Hypercube routing similar to PRR97
  • Objects replicated to servers by name
  • Properties: fast fault-recovery; log(N) hops and routing state; data replication required for fault-tolerance

Ongoing Research
• Fault-tolerant routing
  • Reliable Overlay Networks (MIT)
  • Fault-tolerant Overlay Routing (UCB)
• Application-level multicast
  • Bayeux (UCB), CAN (AT&T), Scribe and Herald (Microsoft)
• File systems
  • OceanStore (UCB)
  • PAST (Microsoft / Rice)
  • Cooperative File System (MIT)

For More Information
• Tapestry: http://www.cs.berkeley.edu/~ravenben/tapestry
• OceanStore: http://oceanstore.cs.berkeley.edu
• Related papers:
  • http://oceanstore.cs.berkeley.edu/publications
  • http://www.cs.berkeley.edu/~ravenben/publications
• ravenben@cs.berkeley.edu
