Tapestry:A Resilient Global-Scale Overlay for Service Deployment

Tapestry:A Resilient Global-Scale Overlay for Service Deployment Zhao, Huang, Stribling, Rhea, Joseph, Kubiatowicz Presented by Rebecca Longmuir

What is Tapestry? • A peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of message directly to nearby copies of an object or service using only localized resources. • An extensible infrastructure that provides decentralized object location and routing(DOLR)

What is DOLR? • Interface focusing on routing of message to endpoints such as nodes or object replicas. • Virtualizes resources, since endpoints are named by opaque identifiers encoding nothing about physical location • Allows for message delivery in instable underlying infrastructure. Developers only need to think about the dynamics of the network for optimization

Some background • Nodes participate in the overlay and are assigned nodeIDs uniformly at random from a large identifier space. • More than one node maybe be on one physical host. • Application-specific endpoints are assigned globally unique identifiers(GUIDs) select from the same identifier space.

More background • Node N has nodeID Nid and an object O has GUID OG. • Every message contains an application-specific identifier Aid

Four part DOLR networking API • PUBLISHOBJECT(OG,Aid): Publish or make available, object O on the local node. This is a best effort call and receives no confirmation • UNPUBLISHOBJECT(OG,Aid): Best effort attempt to remove location mappings for O • ROUTETOOBJECT(OG,Aid): Routes message to location of object with GUID OG • ROUTETONODE(N,Aid,Exact): Routes message to application Aid on node N. “Exact” specifies whether destination ID needs to be matched exactly to deliver payload

Routing and Object Location • Tapestry dynamically maps each identifier G to an Unique live node called the identifier's root GR • If node N exists with Nid = G then this is the root of G • Messages are delivered by using a routing table consisting of nodeIDs and IP addresses of nodes it communicates with referred to as neighbors. • Routing involves sending messages forward across neighbor links to nodes whose nodeIDs are progressively closer

Routing Mesh • Tapestry uses local tables at each node, called neighbor maps, to route overlay messages to the destination ID digit by digit • A node N has a neighbor map with multiple levels, where each level contains links to nodes matching a prefix up to a digit position in the ID and contains a number of entries equal to the ID’s base

Routing Mesh • The primary ith entry in the jth level is the ID and location of closest node that begins with the prefix(N,j-1)+i for example the 3259 is the ninth entry of the fourth level for 325AE • It is this idea of closest node that proves the locality properties of Tapestry

Routing Mesh • Since the nth hop shares a prefix of length >=n with the destination node, Tapestry looks in the (n+1)th level map for the entry matching the next digit in the destination ID • This guarantees that any existing node is reached in at most logBN logical hops where N is the namespace size and the IDs are of base B and assuming consistent neighbor maps. • If a digit cannot be matched than Tapestry looks for a close digit using ‘surrogate routing’ where each nonexistent ID is mapped to some live node with a similar ID

Providing resilience in the Routing Mesh • Make use of redundant routing paths • Each Primary neighbor is augmented by backup links sharing the same prefix • There are c x B pointers on a level(c is the number of neighbor links that differ only at the nth digit on the nth routing level) • The total size of the neighbor maps is c x B x logBN • The expected total number of back pointers is c x B x logBN

Object Publication and Location • Each root node inherits a unique spanning tree for routing • This is utilized to locate objects by sending out soft-state directory information across nodes

Object Publication and Location • A server S, storing an object O periodically publishes this object by routing a publish message toward OR • Each node along the publication path stores a pointer mapping • For replicas of an object that are on separate sever each server publishes its copy • Node location is stored for replicas in sorted order of network latency

Object Publication and Location • A client locates an object by routing a message to OR • Each node checks to see if it has location map for the object • If a node has a location map it send the request to the servicing node S, otherwise, it sends the message onward to OR

Object Publication and Location • Each hop toward the root reduces the number of nodes satisfying the next hop prefix constraint by a factor of the identifier base • The path to the root is a function of the destination ID only not the source ID • The closer a query gets to the object the more likely it is to cross the paths with the objects published path and the sooner it will reach the object

Dynamic Node Algorithms • There are many mechanisms in place to maintain consistency and to make sure objects are not lost • Many of the control messages require acknowledgments and are retransmitted as needed

Node Insertion (four components) • Need-to-know nodes are notified of N, because N fills a null entry in their routing tables • N might become the new object root for existing objects. References to those objects must be moved to N to maintain object availability • The algorithms must construct a near optimal routing table for N • Nodes near N are notified and may consider using N in their routing tables as an optimization

Node Insertion • Begins at N’s surrogate S (the “root” node that Nid maps to in the existing network) • S determines the longest prefix length it shares with Nid this length is p • S sends out an Acknowledge Multicast message to all existing nodes sharing the same prefix • The nodes that get the message add N to their routing tables and transfer references of locally rooted pointers as needed

Node Insertion • N’s initial neighbor set for it’s routing table are the nodes reached by the multicast • Beginning at level p N does an interactive nearest neighbor search • N uses the neighbor set to fill level p, trims this set to size k and requests that the k nodes send there backpointers • N decrements p and repeats the process until all levels are filled

Multiple nodes inserting at once • Every node A in the multicast keeps track of every node B that is still multicasting down to its neighbors • This lets any node C that is multicasting to A know of B’s existence • Also multicast keep track of the holes in the new node’s routing table and check their tables for any entries that can fill in these holes

Voluntary Node Deletion • When a node N leaves it tells all of the nodes making the set D that are it’s backpointers that it is leaving and of replacement nodes at each level from N’s routing table • The notified nodes send a republish object traffic to both N and its replacement • N routes reference of locally rooted objects to their new roots and lets D know when it is done

Involuntary Node Deletion • Tapestry improves object availability and routing in failure prone networks by building redundancy into the routing tables and object location references • Nodes send periodic beacons to detect outgoing links and node failures. • When these problems are notices repair of the routing mesh starts and redistribution and replication of the object location references begins • This is also helped by soft-state republishing of object references

Component Architecture • Transport layer provides the abstraction of communication channels from one overlay node to another • Neighbor Link provides secure but unreliable datagram facilities to layers above, including the fragmentation and reassembly of large messages • Neighbor Links also provides fault detection through keep alive message, plus latency and loss rate estimation • Neighbor Link optimizes message processing by parsing the message headers and only deserializing the message contents when required

Component Architecture • The router implements functionality unique to Tapestry • This layer includes the routing table and local object points • The router examines the destination GUID of the message that is receives and determines the next hop using the routing table and local object pointers • Message are passed back to the neighbor link layer for delivery

Component Architecture • Flow chart of the object location process • Also keep in mind the routing table and object pointer database are constantly changing as the network changes because of nodes entering or leaving or latency in the network

Tapestry Upcall Interface • To support functions that require greater control of details than the DOLR API can provide Tapestry supports an extensible upcall mechanism • Three primary calls provide the interaction between Tapestry and application handlers(G is a generic ID) • Deliver(G, Aid, Msg): Invoked on incoming messages destined for the local node • Forward(G,Aid,Msg): Invoked on incoming upcall-enabled messages • Route(G,Aid, Msg, NextHopNode): Invoked by the application handler to forward a message on to the NextHopNode

Tapestry Upcall Interface • Tapestry sends the message to the application via Forward(). The handler is responsible for calling Route() with the final destination. Finally, Tapestry invokes Deliver() on messages destined for the local node to complete routing

Implementation of a Tapestry Node • Tapestry is implemented as an event-driven system for high throughput and scalability • This requires an asynchronous I/O layer as well as an efficient model for internal communication and control between components

Implementation of a Tapestry Node • Network stage – is a combination of part of the transport layer and part of the neighbor link layer, providing neighbor communication that is not provided by the operating system. It also works with the Patchwork monitoring to measure loss rates and latency • Core router - utilizes the routing and object reference tables to handle application driven messages. It is the critical path for all messages entering or exiting the system

Implementation of a Tapestry Node • Node membership – is responsible for handling the integration of new nodes into the Tapestry mesh and the voluntary exit of nodes • Mesh Repair-responsible for adapting the mesh as the environment changes, including network failures, updating the routing table for network latency • Patchwork- uses soft-state beacons to probe outgoing links for reliability and performance allowing Tapestry to respond to failures and changes in the topology

Evaluation • Preformed on Several platforms • Microbenchmarks were run on local cluster, measure the large scale performance of a deployed Tapestry on the PlanetLab global testbed and make use of a local network simulation layer to support controlled repeatable experiments • Also should be noted to enable a wider variety of experiments multiple Tapestry node instance where place on each physical machine. Each instance only shares code not data. In some cases this leads to a decrease in the time to exchange message but is more demanding on the processor so the system is slowed in other ways

Performance in a Stable Network • Used micro benchmarks on a network of two nodes to isolate Tapestry’s message processing overhead. • The sender establishes a binary network with the receiver and sends 10,0001 messages fore each message size • The receiver measures the latency for each size using the interarrival time between the first and last messages

Microbenchmarks on Stable Tapestry • First they eliminated network delay to measure raw message processing by placing both nodes on different ports on the same machine. • To see how performance scaled with processor speed the test was preformed on different machines • For very small messages there is a dominant constant processing time. For messages larger than 2kB the cost of copying the data dominates and processing time becomes linear

Microbenchmarks on Stable Tapestry • Measurements of the routing throughput show that throughput is low for small messages where processing dominates but quickly increases as the messages increase in size • For the average 4kB message the P-IV can process 7100 messages/s and where as the P-III can process 3200 messages/s

Routing Overhead to Nodes and Objects • The RDP is computed for node routing by measuring all pairs roundtrip routing latencies between the 400 Tapestry instances used and dividing each by the corresponding ping round-trip time • Figure 13 shows that the median values for node-node routing RDP starts at ~3 and slowly decreases ~1 • The object RDP is measure as a ration of one-way Tapestry route to object latency, verses the one-way network latency • Figure 14 shows the RDP values sorted by their ping values and collect in 5ms bins with the 90th percentile and median values calculated per bin

Object Location Optimization • This figure was used to demonstrate that optimization can significantly lower the RDP observed by the bulk of all requesters for local-area network distance • Their technique trades extra storage space in the network for faster routing

Single Node Insertion • Measured the overhead required for a single node to join the Tapestry network, in terms of time required for the network and to stabilize and control message bandwidth during insertion • Figure 16 shows insertion time as a function of the network size. It shows that latencies scale sublinearly with the size of the network • For each datapoint a Tapestry network of size N is constructed and repeatedly a single node is inserted and deleted

Single Node Insertion • Figure 17 shows that the total bandwidth for a single node insertion scales logarithmically with the network size

Parallel Node Insertion • Started with a stable network of 200 nodes. Then repeated each parallel insert 20 times • They plotted the min, median and the 90th percentile. • There is significant variation in the 90th percentile, which is attributed to the effects of node virtualization

Continuous Convergence and Self-Repair • Instead of measuring latency, these tests focused on large-scale behavior under failures • The routing to nodes test measures the success rate of sending request to random keys in the namespace • The routing to objects test sends messages to previously published objects, located at servers which were guaranteed to stay alive in the network • Performance metrics include bandwidth and success rate of requests successfully reaching their destinations

Continuous Convergence and Self-Repair • For both figures 20% of the existing network is killed and after 15 minutes new nodes equal to 50% of the existing network are inserted • Only a small fraction of the request are affected when large portions of the network fail. • All massive failures and inserts lead to a small dip in success rate but quickly return to 100%

Continuous Convergence and Self-Repair • Each test included two churns of different levels of dynamicity • Constant change has little effect on Tapestry performance as success rates rarely fall even slightly below 100% • These dips happen independently of the parameters given to the churn

Comparison with other peer-to-peer systems • Unlike Gnutella Tapestry guarantees that queries will find existing objects • Similar to Chord and CAN in that they all scale well and guarantee that queries find existing objects under nonfailure conditions • Differ from Chord and CAN which take network distance into account when constructing their routing overlay. Tapestry instead constructs locally optimal routing tables from initialization and maintain them in order to reduce routing stretch • Tapestry allows applications to place objects according to their needs

Conclusion • Tapestry provides efficient and scalable routing of messages directly to nodes and objects in a large, sparse address space • Simulations show Tapestry performs near optimally under faults, while a small protion(~5%) of the queries fail on the faulty wide-area deployment Tapestry: A Resilient Global-scale Overlay for Service Deployment Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John Kubiatowicz: IEEE Journal on Selected Areas in Communications, January 2004, Vol. 22, No. 1

Tapestry:A Resilient Global-Scale Overlay for Service Deployment