This talk presents a decentralized overlay infrastructure for object location and routing, with node IDs and keys drawn from a randomized namespace. It covers incremental prefix routing, small per-node routing tables with log(n) neighbors per node, and compares the approach with unstructured peer-to-peer systems and with other structured overlays.
An Overlay Infrastructure for Decentralized Object Location and Routing
Ben Y. Zhao (ravenben@cs.ucsb.edu), University of California at Santa Barbara
Structured Peer-to-Peer Overlays
• Node IDs and keys drawn from a randomized namespace (SHA-1; see the sketch after this slide)
• incremental routing towards the destination ID
• each node has a small set of outgoing routes, e.g. prefix routing
• log(n) neighbors per node, log(n) hops between any node pair
[Figure: prefix routing example; a message addressed to ABCD is forwarded through A930, AB5F, ABC0 to node ABCE, matching a longer prefix at each hop]
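As a concrete illustration of the randomized namespace, here is a minimal sketch (our own example code, not part of any Tapestry release; the class name NamespaceId and the 4-digit truncation are purely illustrative) that hashes an arbitrary name with SHA-1 and keeps a fixed number of hex digits as the overlay ID.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: derive a node/object ID in a randomized namespace by
// hashing an arbitrary name with SHA-1 and truncating to the digit length
// used by the overlay (4 hex digits here, to match the slide's examples).
public class NamespaceId {

    public static String toHexId(String name, int hexDigits) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(name.getBytes(StandardCharsets.UTF_8));
        // Zero-pad the 160-bit digest to 40 hex characters, then truncate.
        String hex = String.format("%040x", new BigInteger(1, digest));
        return hex.substring(0, hexDigits).toUpperCase();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Hypothetical names; deployments would hash node addresses or content names.
        System.out.println(toHexId("node-169.231.10.5:4444", 4));
        System.out.println(toHexId("some-object-name", 4));
    }
}
```

Because SHA-1 output is effectively uniform, node IDs and keys end up spread evenly over the namespace regardless of where nodes actually sit in the network.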
Related Work
• Unstructured peer-to-peer approaches
  • Napster, Gnutella, KaZaA
  • probabilistic search (optimized for the hay, not the needle)
  • locality-agnostic routing (resulting in high network bandwidth costs)
• Structured peer-to-peer overlays
  • the first protocols (2001): Tapestry, Pastry, Chord, CAN
  • then: Kademlia, SkipNet, Viceroy, Symphony, Koorde, Ulysses…
  • distinction: how to choose your neighbors
    • Tapestry, Pastry: latency-optimized routing mesh
  • distinction: application interface
    • distributed hash table: put(key, data); data = get(key)
    • Tapestry: decentralized object location and routing
Chord
• Node IDs are numbers on a ring
• closeness defined by numerical proximity
• finger table
  • keep a route to the next node 2^i away in the namespace, for each i
  • routing table size: log2(n), where n = total # of nodes
• routing (see the sketch after this slide)
  • iterative hops from the source
  • at most log2(n) hops
[Figure: identifier ring of size 1024 with nodes at 0, 128, 256, 384, 512, 640, 768, 896]
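A minimal sketch of the finger-table idea, assuming a 1024-slot ring like the one drawn above (our own simplified code, not the Chord implementation; the shared `nodes` set is a stand-in for per-node state): each node keeps a route to the owner of id + 2^i, and a lookup repeatedly jumps to the closest preceding finger, so it finishes in at most about log2(n) hops.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Simplified Chord-style lookup on a 2^10 = 1024-slot ring.
// For brevity every live node ID sits in one sorted set and fingers are derived
// from it; a real Chord node stores only its own finger table.
public class ChordSketch {
    static final int M = 10;
    static final int RING = 1 << M;
    static final NavigableSet<Integer> nodes = new TreeSet<>();

    // Owner of an identifier: first live node clockwise from id (inclusive).
    static int successor(int id) {
        Integer s = nodes.ceiling(Math.floorMod(id, RING));
        return (s != null) ? s : nodes.first();
    }

    // Next live node strictly after node n.
    static int nextNode(int n) {
        Integer s = nodes.higher(n);
        return (s != null) ? s : nodes.first();
    }

    // i-th finger of node n: owner of n + 2^i.
    static int finger(int n, int i) { return successor(n + (1 << i)); }

    // Clockwise interval tests on the ring.
    static boolean inOpen(int x, int a, int b)     { return (a < b) ? (x > a && x < b)  : (x > a || x < b); }
    static boolean inHalfOpen(int x, int a, int b) { return (a < b) ? (x > a && x <= b) : (x > a || x <= b); }

    static int closestPrecedingFinger(int n, int key) {
        for (int i = M - 1; i >= 0; i--) {
            int f = finger(n, i);
            if (inOpen(f, n, key)) return f;
        }
        return n;
    }

    // Route from 'start' toward 'key', counting finger hops until the owner is known.
    static int lookup(int start, int key) {
        int current = start, hops = 0;
        while (!inHalfOpen(key, current, nextNode(current))) {
            current = closestPrecedingFinger(current, key);
            hops++;
        }
        System.out.println("key " + key + " owned by " + nextNode(current) + " after " + hops + " finger hops");
        return nextNode(current);
    }

    public static void main(String[] args) {
        for (int n = 0; n < RING; n += 128) nodes.add(n);  // the 8 nodes on the slide's ring
        lookup(0, 700);   // hops 0 -> 512 -> 640; key 700 is owned by node 768
    }
}
```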
Chord II
• Pros
  • simplicity
• Cons
  • limited flexibility in routing
  • neighbor choices unrelated to network proximity (though they can be optimized over time)
• Application interface
  • distributed hash table (DHash)
Tapestry / Pastry
• incremental prefix routing
  • 1111 → 0XXX → 00XX → 000X → 0000
• routing table
  • at level i, keep nodes matching at least i digits
  • table size: b * logb(n)
• routing
  • recursive routing from the source
  • at most logb(n) hops
[Figure: identifier ring of size 1024 with nodes at 0, 128, 256, 384, 512, 640, 768, 896]
Routing in Detail
• Example: octal digits, 2^12 namespace, routing from 2175 to 0157
• Route taken: 2175 → 0880 → 0123 → 0154 → 0157, resolving one more digit of the destination at each hop (see the sketch after this slide)

Neighbor map for node "2175" (octal); columns are routing levels 1-4, "----" marks the slot covered by 2175 itself:

Digit   Level 1   Level 2   Level 3   Level 4
0       0xxx      20xx      210x      2170
1       1xxx      ----      211x      2171
2       ----      22xx      212x      2172
3       3xxx      23xx      213x      2173
4       4xxx      24xx      214x      2174
5       5xxx      25xx      215x      ----
6       6xxx      26xx      216x      2176
7       7xxx      27xx      ----      2177
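The neighbor map above is just a two-dimensional table indexed by (routing level, next digit). Below is a small sketch in that spirit (our own code; the class and method names are illustrative, not the Tapestry implementation): each hop forwards to the entry that matches one more digit of the destination, which is exactly how 2175 reaches 0157 via 0880, 0123, and 0154.

```java
// Illustrative prefix-routing sketch matching the octal example above.
// routingTable[level][digit] holds a neighbor that shares 'level' digits
// with the local node and has 'digit' as its next digit.
public class PrefixRouter {
    final String localId;          // e.g. "2175"
    final int base;                // e.g. 8 for octal digits
    final String[][] routingTable;

    PrefixRouter(String localId, int base) {
        this.localId = localId;
        this.base = base;
        this.routingTable = new String[localId.length()][base];
    }

    static int sharedPrefixLength(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    void addNeighbor(String neighborId) {
        int level = sharedPrefixLength(localId, neighborId);
        if (level >= localId.length()) return;                  // that's us
        int digit = Character.digit(neighborId.charAt(level), base);
        routingTable[level][digit] = neighborId;                 // a real node keeps the closest such candidate
    }

    // Forward to the entry that matches one more digit of the destination.
    String nextHop(String destId) {
        int level = sharedPrefixLength(localId, destId);
        if (level == destId.length()) return localId;            // we are the destination
        int digit = Character.digit(destId.charAt(level), base);
        return routingTable[level][digit];                        // null: empty slot, handled by surrogate routing
    }

    public static void main(String[] args) {
        PrefixRouter at2175 = new PrefixRouter("2175", 8);
        at2175.addNeighbor("0880");
        // From 2175, a message for 0157 leaves through the level-0, digit-0 slot:
        System.out.println(at2175.nextHop("0157"));               // prints 0880
    }
}
```

Repeating nextHop at every node resolves one digit per hop, which is where the logb(n)-hop bound on the previous slide comes from.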
Tapestry / Pastry II
• Pros
  • large flexibility in neighbor choice: choose nodes closest in physical distance
  • can tune routing table size and routing hops using the parameter b
• Cons
  • more complex than Chord to implement / understand
• Application interface
  • Tapestry: decentralized object location
  • Pastry: distributed hash table
Talk Outline
• Motivation and background
• What makes Tapestry different
• Tapestry deployment performance
• Wrap-up
So What Makes Tapestry Different?
• It's all about performance
• Proximity routing
  • leverage flexibility in the routing rules
  • for each routing table entry, choose the node that satisfies the prefix requirement and is closest in network latency (see the sketch after this slide)
  • result: end-to-end latency "proportional" to actual IP latency
• DOLR interface
  • applications choose where to place objects
  • use application-level knowledge to optimize access time
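The proximity rule in the second bullet group fits in a few lines of code. A hedged sketch (ours, not Tapestry's implementation; pingMillis is a stub standing in for a real RTT probe): whenever several nodes qualify for the same routing-table slot, keep the one with the lowest measured latency.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative proximity neighbor selection: for every routing-table slot,
// keep whichever qualifying candidate has the lowest measured round-trip time.
public class ProximitySelector {
    static final class Entry {
        final String id; final double rttMs;
        Entry(String id, double rttMs) { this.id = id; this.rttMs = rttMs; }
    }

    // slot key such as "level=1,digit=3" -> currently chosen neighbor
    final Map<String, Entry> table = new ConcurrentHashMap<>();

    // Stub: a deployment would measure a real round-trip time here.
    double pingMillis(String neighborId) { return Math.random() * 100; }

    void offerCandidate(String slot, String candidateId) {
        Entry fresh = new Entry(candidateId, pingMillis(candidateId));
        // Keep the existing entry unless the new candidate is strictly closer.
        table.merge(slot, fresh, (old, cand) -> cand.rttMs < old.rttMs ? cand : old);
    }
}
```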
Why Proximity Routing?
• Fewer/shorter IP hops: shorter end-to-end latency, less bandwidth and congestion, less likely to cross broken or lossy links
Performance Impact (Proximity)
• Simulated Tapestry with and without proximity routing on a 5000-node transit-stub network
• Measured pair-wise routing stretch between 200 random nodes
Decentralized Object Location & Routing
• redirect data traffic using log(n) in-network redirection pointers
• average # of pointers per machine: log(n) * average # of files per machine
• key to performance: a proximity-enabled routing mesh with routing convergence (see the sketch after this slide)
[Figure: a server publishes object k with publish(k) across the backbone; clients reach it with routeobj(k)]
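A minimal sketch of the publish/locate mechanism described above (our own code, not Tapestry's API; OverlayPathOracle is an assumed stand-in for the overlay's hop-by-hop routing): publish(k) drops a pointer at every hop from the object's server toward the key's root node, and routeObj(k) walks from the client toward the same root, diverting at the first pointer it finds. Routing convergence is what makes the two paths meet early for nearby clients.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DolrSketch {
    interface OverlayPathOracle { List<String> path(String fromNode, String towardId); }

    // pointer cache at each overlay node: object key -> node holding a nearby replica
    final Map<String, Map<String, String>> pointersAt = new HashMap<>();
    final OverlayPathOracle overlay;   // assumed stand-in for hop-by-hop overlay routing

    DolrSketch(OverlayPathOracle overlay) { this.overlay = overlay; }

    // publish(k): walk toward root(k), leaving a (k -> server) pointer at every hop.
    void publish(String key, String serverNode) {
        for (String hop : overlay.path(serverNode, key)) {
            pointersAt.computeIfAbsent(hop, n -> new HashMap<>()).put(key, serverNode);
        }
    }

    // routeObj(k): walk from the client toward root(k); divert at the first pointer found.
    String routeObj(String key, String clientNode) {
        for (String hop : overlay.path(clientNode, key)) {
            Map<String, String> ptrs = pointersAt.get(hop);
            if (ptrs != null && ptrs.containsKey(key)) return ptrs.get(key);
        }
        return null;   // reached the root without finding a pointer: object not published
    }
}
```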
DOLR vs. Distributed Hash Table
• DHT: hashing the content name determines replica placement
  • modifications require replicating the new version back into the DHT
• DOLR: the application places a copy near the requests, and the overlay routes messages to it (interface sketch after this slide)
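Seen from the application, the difference is just the contract it programs against. Sketching the two contracts as Java interfaces (method names follow the slides' wording and are illustrative, not the exact signatures of any release):

```java
// Illustrative interface contrast only; not the exact API of Tapestry, Pastry, or DHash.
interface DistributedHashTable {
    void put(byte[] key, byte[] data);                 // the overlay decides where replicas live
    byte[] get(byte[] key);                            // the overlay fetches from wherever it stored them
}

interface DecentralizedObjectLocation {
    void publish(byte[] key);                          // advertise an object the application stored locally
    void routeToObject(byte[] key, byte[] message);    // the overlay delivers the message to a nearby replica
}
```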
Performance Impact (DOLR)
• Simulated Tapestry with DOLR and DHT interfaces on a 5000-node transit-stub network
• Measured route-to-object latency from clients in 2 stub networks
• DHT: 5 object replicas; DOLR: 1 replica placed in each stub network
Weaving a Tapestry
• inserting node 0123 into the network (see the sketch after this slide):
  1. route to our own ID; find the 012X nodes and fill the last routing table column
  2. request backpointers to 01XX nodes
  3. measure distance, add to the routing table
  4. prune to the nearest K nodes
  5. repeat steps 2-4 for each shorter prefix level
[Figure: node 0123 joining an existing Tapestry; routing table columns XXXX, 0XXX, 01XX, 012X filled from existing nodes such as 0120, 0121, 0122]
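A sketch of the join loop above (ours, not the deployed implementation; the Network methods routeToOwnId, backpointersOf, and measureRttMs are assumed stand-ins for the real RPCs, and K = 3 is an arbitrary illustrative value):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the insertion steps listed on the slide.
public class NodeInsertion {
    interface Network {
        String routeToOwnId(String newId);                    // step 1: surrogate sharing the longest prefix
        List<String> backpointersOf(String node, int level);  // step 2: who points at 'node' at this level
        double measureRttMs(String node);                     // step 3: network distance probe
    }

    static final int K = 3;  // candidates kept per routing level (illustrative value)

    static Map<Integer, List<String>> buildRoutingTable(String newId, Network net) {
        Map<Integer, List<String>> routingTable = new HashMap<>();
        List<String> contacts = new ArrayList<>();
        contacts.add(net.routeToOwnId(newId));                // start at the longest-prefix level
        for (int level = newId.length() - 1; level >= 0; level--) {
            // Step 2: widen the candidate set via backpointers at this level.
            Set<String> candidates = new HashSet<>(contacts);
            for (String c : contacts) candidates.addAll(net.backpointersOf(c, level));
            // Steps 3-4: sort by measured distance and prune to the nearest K.
            List<String> sorted = new ArrayList<>(candidates);
            sorted.sort(Comparator.comparingDouble(net::measureRttMs));
            contacts = sorted.subList(0, Math.min(K, sorted.size()));
            routingTable.put(level, new ArrayList<>(contacts));
            // Step 5: repeat for the next (shorter) prefix level.
        }
        return routingTable;
    }
}
```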
Talk Outline
• Motivation and background
• What makes Tapestry different
• Tapestry deployment performance
• Wrap-up
Implementation Performance
• Java implementation
  • 35,000+ lines in core Tapestry, 1500+ downloads
• Micro-benchmarks
  • per-message overhead ~50 μs; most latency comes from byte copying
  • performance scales with CPU speedup
  • 5 KB messages on a P-IV 2.4 GHz: throughput ~10,000 msgs/sec
• Routing stretch
  • route to node: < 2
  • route to objects/endpoints: < 3 (higher stretch for close-by objects)
Stability Under Membership Changes
• Routing operations on a 40-node Tapestry cluster
• Churn: nodes join/leave every 10 seconds, average lifetime = 2 minutes
[Chart: routing success rate (%) under kill-nodes, constant-churn, and large-group-join scenarios]
Micro-benchmark Methodology
• Experiment run in a LAN with GBit Ethernet
• Sender sends 60,001 messages at full speed
• Measure inter-arrival time for the last 50,000 messages (see the sketch after this slide)
  • discarding the first 10,000 messages removes cold-start effects
  • averaging over 50,000 messages smooths out network jitter
[Figure: sender and receiver Tapestry nodes, each with a control process, connected over a LAN link]
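For concreteness, a receiver-side sketch of this measurement (our own code, not the benchmark harness used in the experiments): timestamp every arrival, skip the first 10,000 messages as warm-up, and compute throughput from the remaining 50,000 inter-arrival gaps.

```java
// Illustrative receiver-side measurement for the methodology above.
public class InterArrival {
    static final int WARMUP = 10_000, MEASURED = 50_000;
    private final long[] arrivalNanos = new long[WARMUP + MEASURED + 1];  // 60,001 messages
    private int count = 0;

    // Called once per delivered message.
    public void onMessage() {
        if (count < arrivalNanos.length) arrivalNanos[count++] = System.nanoTime();
    }

    public double messagesPerSecond() {
        long first = arrivalNanos[WARMUP];            // skip cold-start effects
        long last = arrivalNanos[count - 1];
        double seconds = (last - first) / 1e9;
        return (count - 1 - WARMUP) / seconds;        // inter-arrival gaps, not messages
    }
}
```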
Micro-benchmark Results (LAN)
• Per-message overhead ~50 μs; latency dominated by byte copying
• Performance scales with CPU speedup
• For 5 KB messages, throughput = ~10,000 msgs/sec
[Chart: micro-benchmark results; the 100 Mb/s label comes from the plot]
Large Scale Methodology
• PlanetLab global network
  • 500 machines at 100+ institutions, in North America, Europe, Australia, Asia, and Africa
  • 1.26 GHz PIII (1 GB RAM), 1.8 GHz P4 (2 GB RAM)
  • North American machines (2/3 of the total) are on Internet2
• Tapestry Java deployment
  • 6-7 nodes on each physical machine
  • IBM Java JDK 1.3.0
  • node virtualization inside the JVM and SEDA
  • scheduling between virtual nodes increases latency
Node to Node Routing (PlanetLab)
• Ratio of end-to-end routing latency to ping distance between nodes
• All node pairs measured, placed into buckets
[Plot annotation: median = 31.5, 90th percentile = 135]
Latency to Insert a Node
• Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing network
• Humps are due to the expected filling of each routing level
Thanks! Questions, comments? ravenben@cs.ucsb.edu
Object Location (PlanetLab)
• Ratio of end-to-end latency to the client-object ping distance
• Local-area stretch improved with additional location state
[Plot annotation: 90th percentile = 158]
Bandwidth to Insert a Node
• Bandwidth cost of dynamically inserting a node into the Tapestry, amortized across the nodes in the network
• Per-node bandwidth decreases with the size of the network