RON: Resilient Overlay Networks David Andersen, Hari Balakrishnan, Frans Kaashoek, and Robert Morris http://nms.lcs.mit.edu/ron/
Overlay Networks • An overlay network is a computer network built on top of another network. • Nodes are connected by virtual links. • Each virtual link corresponds to a path in the underlying network • What if you want to experiment with a new routing protocol? • What if you want a network that provides new capabilities that are valuable to some peers and applications? • Overlay networks are not new • Gnutella, Chord, Pastry, Kelips, VPNs… • RON is an overlay on top of the underlying Internet
Why RON? • BGP scales well, but is not fault-tolerant • Detailed information is kept only inside ASes; information between ASes is filtered and summarized • Some links are invisible to BGP, preventing it from making good decisions • BGP's fault recovery takes many minutes • 3 min minimum detection + recovery time; often 15 mins (Labovitz 97-00) • 40% of outages took 30+ mins to repair (Labovitz 97-00) • 5% of faults last more than 2.75 hours (Chandra 01) • Link outages are common • 10% of routes available < 95% of the time (Labovitz 97-00) • 65% of routes available < 99.9% of the time (Labovitz 97-00) • 3.3% of all routes had serious problems (Paxson 95-97) • Route selection in BGP uses fixed and simple metrics
Motivation: Network Redundancy • Multiple paths exist between most hosts • Many paths are hidden due to private peering • Indirect paths may offer better performance • Non-transitive reachability • A and C can't reach each other, but B can reach them both • RON tries to exploit this redundancy in the underlying Internet
RON's Goal • Fast failure detection and recovery • In seconds (reduced by a factor of 10) • Tighter integration with applications • A failure that is fatal for one application may be acceptable for another • Optimize routes for latency, throughput, etc. • Expressive policy routing • Fine-grained policy specification • e.g. keep commercial traffic off Internet2
What can RON do? • Videoconferencing • Multi-person collaborations • Virtual Private Networks (VPNs) across the public Internet • Branch offices of companies
What does RON do? • Small network: 3-50 nodes • Continuous measurement of each pair-wise virtual link • A trade-off between scalability and recovery efficiency • Compute path properties • Based on different metrics, e.g. latency, loss rate… • Pick the best path out of the direct and indirect ones • One indirect hop is enough • Forward traffic over that path
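A minimal sketch of this direct-vs-one-intermediate-hop path selection, assuming a latency metric and a hypothetical measure(a, b) probing helper; the real RON router also optimizes for loss rate and throughput.

```python
# Sketch of RON-style path selection with at most one intermediate hop.
# measure(a, b) is a hypothetical helper returning the probed latency (in
# seconds) of the virtual link a -> b; RON also tracks loss rate and throughput.

def best_path(src, dst, nodes, measure):
    """Return (hops, latency) for the best direct or one-hop indirect path."""
    best_hops, best_cost = [src, dst], measure(src, dst)
    for mid in nodes:
        if mid in (src, dst):
            continue
        cost = measure(src, mid) + measure(mid, dst)  # one indirect hop is enough
        if cost < best_cost:
            best_hops, best_cost = [src, mid, dst], cost
    return best_hops, best_cost
```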
Design • Conduit: the set of APIs used by a RON client to interact with RON • Router: computes the forwarding tables (link-state dissemination through RON) • Forwarder: receives and sends packets, asks the router for the best path, and also passively probes for performance data
Failure Detection • Active monitoring: send probes on each virtual link • Probe interval: 12 seconds • Probe timeout: 3 seconds • Routing update interval: 14 seconds • Passive measurement of ongoing traffic • Detects failures in under 20 s • Faster than waiting for TCP to time out
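A minimal sketch of the active-probing loop under the parameters above (12 s probe interval, 3 s timeout); send_probe and the consecutive-loss threshold are assumptions for illustration, not the paper's exact protocol.

```python
import time

PROBE_INTERVAL = 12   # seconds between probes on a healthy virtual link
PROBE_TIMEOUT = 3     # seconds to wait for each probe response
MAX_LOSSES = 3        # consecutive losses before declaring the link down (assumed)

def monitor_link(peer, send_probe):
    """Actively probe one virtual link and report when it appears dead.

    send_probe(peer, timeout) is a hypothetical helper returning True if the
    peer answered within `timeout` seconds.
    """
    while True:
        if send_probe(peer, PROBE_TIMEOUT):
            time.sleep(PROBE_INTERVAL)        # link looks healthy, probe again later
            continue
        # A probe was lost: retry back-to-back instead of waiting a full interval.
        for _ in range(MAX_LOSSES - 1):
            if send_probe(peer, PROBE_TIMEOUT):
                break
        else:
            # Detection takes roughly one probe interval plus a few timeouts,
            # i.e. on the order of the ~20 s figure above.
            print(f"virtual link to {peer} considered down")
        time.sleep(PROBE_INTERVAL)
```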
Policy Routing • RON allows users or administrators to define the types of traffic allowed on particular links • Traditionally, routing is based on destination and source addresses, but RON allows for routing based on other information • Router computes a forwarding table for each policy • Packets classified with policy tag and routed accordingly
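A minimal sketch of per-policy forwarding as described above, with one precomputed forwarding table per policy; the policy names, destinations, and classifier rule are purely illustrative.

```python
# Each policy gets its own forwarding table; a packet is classified,
# tagged with a policy, and then looked up in that policy's table.

forwarding_tables = {
    "default":      {"dstA": "via-nodeX", "dstB": "direct"},
    "no-internet2": {"dstA": "direct",    "dstB": "via-nodeY"},
}

def classify(packet):
    # Illustrative rule: keep commercial traffic off Internet2.
    return "no-internet2" if packet.get("commercial") else "default"

def next_hop(packet):
    policy = classify(packet)               # attach the policy tag
    return forwarding_tables[policy][packet["dst"]]

print(next_hop({"dst": "dstA", "commercial": True}))   # -> "direct"
```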
RON Overhead • Probe packet: 69 bytes • Probing and routing state traffic grows as O(N²) • Size is restricted: one node per site • To achieve 12-25 second recovery: • Reasonable overhead: about 10% of the bandwidth of a broadband Internet link
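A back-of-the-envelope sketch of why probing traffic grows as O(N²), assuming for simplicity one 69-byte probe per directed virtual link per 12-second interval; the real protocol exchanges a few packets per probe and also sends routing updates.

```python
PROBE_BYTES = 69
PROBE_INTERVAL = 12.0   # seconds

def aggregate_probe_rate(n):
    """Approximate total probe traffic (bits/s) across an n-node RON."""
    links = n * (n - 1)                      # every ordered node pair is probed
    return links * PROBE_BYTES * 8 / PROBE_INTERVAL

for n in (10, 50):
    print(n, "nodes:", round(aggregate_probe_rate(n) / 1000, 1), "kbps aggregate")
# The quadratic growth is one reason RON is limited to roughly 3-50 nodes.
```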
Experiments • Real-world deployment of RON at several Internet sites • RON1: 12 hosts in the US and Europe • 64 hours of measurements in March 2001 • RON2: 16 hosts • 85 hours of measurements in May 2001 • RON1's outage detection and path selection routed around 100% of the observed outages; RON2 routed around 60% of them
Experiments: Loss Rate • Measured on RON1 • Loss rates averaged over 30-minute intervals • The samples measure unidirectional loss • Note: the RON router uses bidirectional information to optimize unidirectional loss rates
Experiments: Latency • Measured on RON1 • 5-minute average latencies • Shown as a CDF
Experiments: Throughput • Measured on RON1 • 2,035 samples in total • 5% of samples doubled their throughput, while 1% received less than 50% of the direct-path throughput
Experiments: Flooding Attack • Run on the Utah Network Emulation Testbed • The attack begins at the 5th second • RON takes 13 seconds to reroute the connection • Shown as a receiver-side TCP sequence trace
Drawbacks • NAT • Naming: cache a "reply to" address/port pair • If two hosts are both behind NATs: treat the path as an outage and attempt to route around it • Violation of AUPs and BGP transit policies • Since RONs are small, this can be resolved at an administrative level
Thoughts • RON handles failure recovery and lets the underlying Internet focus on scalability • The authors provide implementation details for their idea • Overlay networks are used to solve the path-failure detection and recovery problem • An overlay is to the network what a virtual machine is to a computer
Discussions • Is RON scalable? How many nodes can a RON contain? • What is the downside of fine-grained policy routing? Can it run on ordinary PCs? • What happens if many overlay networks are built on top of the Internet? • How does node distribution affect performance? • Why does RON increase the average latency on some fast paths? • Overhead • Why use link-state rather than distance-vector routing to build the routing tables? • Speed of convergence • Size
DHT: Distributed Hash Table
Flash Back • DHT = Distributed Hash Table • A distributed service that provides hash table semantics • Two fundamental operations: Insert, Lookup • Performance concerns • Operation complexity • Load balance • Locality • Maintenance • Fault tolerance • Scalability
Flash Back • Napster • O(1) messages per operation • The central server becomes a single point of failure • Gnutella • O(N) lookup cost • Chord • O(log N) lookup cost • Stabilization protocol is not efficient
Pastry • Similarities between Pastry and Chord: • Both use SHA-1 to generate node ids and message keys • A file is stored at the node whose id is closest to its key • Data lookup and insertion take O(log N) routing steps • Aspects in which Pastry outperforms Chord: • Better locality • Better membership maintenance
Design of Pastry • Think of ids and keys as sequences of digits in base 2^b • In each routing step, the message gets one digit closer to the destination • Each Pastry node maintains three tables: • Routing table • Neighborhood set • Leaf set
Pastry – Routing Table • A (log_{2^b} N) × 2^b table • The (i, j) entry is a node whose id: • shares the same first (i-1) digits with the present node's id • has j as its i-th digit • Typical value of b is 4 • Resembles Chord's finger table
Pastry – Neighborhood Set & Leaf Set • Neighborhood set M • The |M| nodes that are closest (according to the proximity metric) to the host • Used to maintain locality properties • Leaf set L • The |L| nodes whose ids are closest to the host's • Divided into two halves: |L|/2 nodes with ids greater than the host's and |L|/2 nodes with ids smaller • Resembles the successor/predecessor lists in Chord • Typical values of |L|, |M|: 2^b or 2×2^b
Comparison with Chord • The k-th finger-table entry differs from the node only by the last k bits => resembles Pastry's routing table • The nodes numerically closest to the node (successors/predecessors) => resemble Pastry's leaf set
Pastry – Routing • Given a message with key k: • Check whether k is covered by the range of the leaf set • If so, forward to the numerically closest node in the leaf set and we are done • Otherwise, check the routing table for a node whose id is one digit closer to k • If both checks fail, forward to the numerically closest node among the routing table, leaf set, and neighborhood set • Case 3 is unlikely • 2% when |L| = 2^b, 0.6% when |L| = 2×2^b • Expected number of routing steps: O(log_{2^b} N)
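A minimal, self-contained sketch of the routing decision above, treating ids and keys as hex strings (b = 4) and omitting all network I/O; the table layout and helper names are illustrative, not Pastry's actual API.

```python
# Pastry routing sketch. routing_table[row][digit] holds a node id or None;
# leaf_set and neighbor_set are plain lists of node ids (hex strings).

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def closest(key, candidates):
    return min(candidates, key=lambda n: abs(int(n, 16) - int(key, 16)))

def route(my_id, key, routing_table, leaf_set, neighbor_set):
    """Return the next hop for `key`, or my_id if this node is responsible."""
    # 1. If the key falls within the leaf set's range, deliver to the
    #    numerically closest leaf (possibly ourselves).
    leaves = leaf_set + [my_id]
    ids = [int(n, 16) for n in leaves]
    if min(ids) <= int(key, 16) <= max(ids):
        return closest(key, leaves)
    # 2. Otherwise look for a routing-table entry one digit closer to the key.
    row = shared_prefix_len(my_id, key)
    digit = int(key[row], 16)
    if row < len(routing_table) and routing_table[row][digit] is not None:
        return routing_table[row][digit]
    # 3. Rare case: fall back to the numerically closest node we know about.
    known = [n for n in leaf_set + neighbor_set +
             [e for r in routing_table for e in r if e] if n != my_id]
    return closest(key, known) if known else my_id
```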
Pastry – Node Join • Suppose X wants to join and knows a nearby node A • A uses the routing protocol to locate Z, the existing node whose id is numerically closest to X's • Every node on the route from A to Z sends its routing table to X • X constructs its own routing table from the received ones • X adopts Z's leaf set and A's neighborhood set as its own • X informs every node that appears in its tables
Pastry – Node Leave • Passive failure detection: no heartbeats • A failure is discovered only when a node tries to contact the failed node • If the failed node is in the leaf set • Ask the node with the largest/smallest id on that side of the leaf set for a replacement • If the failed node is in the routing table • Ask the nodes in the same row for a replacement • If the failed node is in the neighborhood set • Not specified, but easily handled by asking the other neighbors
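A minimal sketch of the leaf-set repair step described above, assuming a hypothetical fetch_leaf_set(node) RPC; it asks the node with the largest/smallest id on the failed node's side for a replacement.

```python
def repair_leaf_set(my_id, leaf_set, failed, fetch_leaf_set):
    """Replace `failed` in the leaf set; ids are hex strings.

    fetch_leaf_set(node) is a hypothetical RPC returning that node's leaf set.
    """
    above = int(failed, 16) > int(my_id, 16)
    side = [n for n in leaf_set
            if n != failed and (int(n, 16) > int(my_id, 16)) == above]
    repaired = [n for n in leaf_set if n != failed]
    if not side:
        return repaired
    # The largest/smallest live node on that side knows leaves we do not.
    edge = max(side, key=lambda n: abs(int(n, 16) - int(my_id, 16)))
    candidates = set(fetch_leaf_set(edge)) - set(leaf_set) - {my_id, failed}
    if candidates:
        # Adopt the candidate numerically closest to our id as the replacement.
        repaired.append(min(candidates,
                            key=lambda n: abs(int(n, 16) - int(my_id, 16))))
    return repaired
```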
Pastry – Locality • A node tends to select nearby nodes for its routing table • A joining node X builds its routing table from the nodes on the route from A to Z, which tend to be close to X • To further improve locality, X then obtains routing tables from the nodes it now knows about and updates its own (the second stage)
Pastry – Locality • There are fewer and fewer candidate nodes as the row level goes up • The distance to the nearest suitable node therefore increases with the row level • Let the route from A to Z be A -> B1 -> B2 -> … -> Z • Bi is a reasonable choice for the i-th row of X, since it is in the (i-1)-th row of Bi-1
Pastry – Experiment on efficiency b = 4, |L| = 16, |M| = 32, 200,000 lookups
Pastry – Experiment on efficiency b = 4, |L| = 16, |M| = 32, N = 100,000, and 200,000 lookups
Pastry – Experiment on locality b = 4, |L| = 16, |M| = 32, and 200,000 lookups
Pastry – Experiment on locality SL: no locality WT: no 2nd stage WTF: with 2nd stage b = 4, |L| = 16, |M| = 32, N = 5,000
Pastry – Experiment on fault tolerance b = 4, |L| = 16, |M| = 32, N = 5,000 with 500 failing nodes Averaged only over the affected queries
Kelips • O(1) file lookup cost • O(√N) storage per node • Uses gossip to implement multicast • Very resilient to node failures • Stores only file metadata, not the files themselves
Kelips • The nodes are divided evenly into √N affinity groups • SHA-1 is used to obtain a node's id • The id modulo the number of affinity groups determines which group the node joins • Every node maintains the following tables • Affinity group view: the set of nodes in the same affinity group • Contacts: a constant-sized (2 in the implementation) set of nodes in each of the other affinity groups • Filetuples: a set of tuples representing every file stored in the affinity group
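A minimal sketch of the per-node state and affinity-group mapping described above; the field layout and k (the number of groups, roughly √N) are illustrative.

```python
import hashlib

def affinity_group(identifier: str, k: int) -> int:
    """Map a node or file identifier to one of k affinity groups via SHA-1."""
    return int(hashlib.sha1(identifier.encode()).hexdigest(), 16) % k

class KelipsNode:
    def __init__(self, node_id: str, k: int):
        self.node_id = node_id
        self.k = k                    # k is roughly sqrt(N)
        self.group = affinity_group(node_id, k)
        self.view = set()             # affinity group view: peers in our group
        self.contacts = {}            # group index -> small set of contact nodes
        self.filetuples = {}          # file name -> IP of the node storing it
```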
Kelips – Insertion • Obtain the hash value of the file and find the corresponding affinity group • Send the request to the closest contact of that affinity group • The contact randomly picks a member of that affinity group to store the metadata • The storing node uses gossiping to disseminate the filetuple • O(1) message complexity; gossiping completes in O(log N) rounds
Kelips – Data Lookup • Obtain the hash value of the file and find the corresponding affinity group • Send the request to the closest contact of that affinity group • The contact looks up its filetuple table and returns the IP of the node holding the file • About 2 message transmission times
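A minimal sketch of both insertion and lookup as described on the last two slides; affinity_group mirrors the structure sketch above, while gossip and query_filetuples are hypothetical helpers standing in for Kelips' epidemic dissemination and the contact's filetuple lookup.

```python
import hashlib

def affinity_group(name: str, k: int) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % k

def insert(filename, owner_ip, k, contacts, gossip):
    """Insert: hash the file name and hand the filetuple to that group's contact.

    contacts: group index -> list of (ip, rtt) pairs; gossip(group, tuple, via)
    is a hypothetical helper that spreads the new filetuple inside the group.
    """
    group = affinity_group(filename, k)
    contact_ip, _ = min(contacts[group], key=lambda c: c[1])    # closest contact
    gossip(group, {"file": filename, "ip": owner_ip}, via=contact_ip)

def lookup(filename, k, contacts, query_filetuples):
    """Lookup: ask the closest contact of the file's group for the owner's IP.

    query_filetuples(contact_ip, filename) is a hypothetical RPC that reads the
    contact's filetuple table; the whole operation is about two message times.
    """
    group = affinity_group(filename, k)
    contact_ip, _ = min(contacts[group], key=lambda c: c[1])
    return query_filetuples(contact_ip, filename)
```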
Kelips – Maintenance • Affinity group views, contacts, and filetuples all require heartbeats to keep from expiring • Gossiping is used to spread heartbeats: • Gossip messages carry a fixed-size sample of recently received information • The information may need to be split across several packets • In every round, a node forwards the information to a constant-sized, randomly chosen set of nodes • To improve locality, a node prefers to gossip with nodes that are close to it • This incurs only constant background traffic • Node joins/leaves are trivial to handle
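A minimal sketch of one gossip round for heartbeat dissemination as described above; the fan-out, payload budget, and send helper are assumptions for illustration.

```python
import random
import time

FANOUT = 3        # constant-sized set of gossip targets per round (assumed)
MAX_ITEMS = 32    # fixed-size payload: only the most recent heartbeats (assumed)

def heartbeat(my_id, heartbeats):
    """Refresh our own entry; entries that stop being refreshed eventually expire."""
    heartbeats[my_id] = time.time()

def gossip_round(peers, heartbeats, send):
    """Forward the freshest heartbeats to a few, preferably nearby, peers.

    peers: list of (peer_id, rtt) pairs; send(peer_id, payload) is a
    hypothetical network helper.
    """
    payload = dict(sorted(heartbeats.items(), key=lambda kv: kv[1],
                          reverse=True)[:MAX_ITEMS])
    # Bias target selection toward low-RTT peers to improve locality,
    # while keeping some randomness so information still spreads everywhere.
    nearby = sorted(peers, key=lambda p: p[1])[:FANOUT * 2]
    for peer_id, _ in random.sample(nearby, min(FANOUT, len(nearby))):
        send(peer_id, payload)
```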
Kelips – Experiment on Load Balance N = 1500 with 38 affinity groups Load balance is particularly important in Kelips
Kelips – Experiment on Insertion N = 1000 with 30 affinity groups Note: no failures occur in this experiment
Kelips – Experiment on Fault Tolerance N = 1000 with 30 affinity groups 500 nodes are deleted at time t = 1300
Comparison and Thoughts • Pastry requires O(log N) storage and O(log N) lookup complexity • Passive failure detection • Saves bandwidth, but may not cope with frequent node joins/leaves • Security: what if some nodes are malicious? • Keep redundant routing tables and randomly choose among them • Replicate data among numerically nearby nodes • Kelips uses O(√N) storage and O(1) lookup complexity • Sacrifices memory for efficiency • May not scale to millions of nodes • The size must be known in advance • Not a serious issue if the size doesn't change dramatically • Adaptive membership maintenance • Security • We can replicate metadata, but that uses more bandwidth