
Chord+DHash+Ivy: Building Principled Peer-to-Peer Systems

Robert Morris, rtm@lcs.mit.edu. Joint work with F. Kaashoek, D. Karger, I. Stoica, H. Balakrishnan, F. Dabek, T. Gil, B. Chen, and A. Muthitacharoen.


Presentation Transcript


  1. Chord+DHash+Ivy: Building Principled Peer-to-Peer Systems Robert Morris rtm@lcs.mit.edu Joint work with F. Kaashoek, D. Karger, I. Stoica, H. Balakrishnan, F. Dabek, T. Gil, B. Chen, and A. Muthitacharoen

  2. What is a P2P system? [Figure: five nodes connected to one another through the Internet] • A distributed system architecture: • No centralized control • Nodes are symmetric in function • Large number of unreliable nodes • Enabled by technology improvements

  3. The promise of P2P computing • High capacity through parallelism: • Many disks • Many network connections • Many CPUs • Reliability: • Many replicas • Geographic distribution • Automatic configuration • Useful in public and proprietary settings

  4. Distributed hash table (DHT) [Figure: three layers running over many nodes. The distributed application (Ivy) calls put(key, data) and get(key) → data on the distributed hash table (DHash); DHash calls lookup(key) → node IP address on the lookup service (Chord)] • Application may be distributed over many nodes • DHT distributes data storage over many nodes

  5. A DHT has a good interface • Put(key, value) and get(key) → value • Call a key/value pair a “block” • API supports a wide range of applications • DHT imposes no structure/meaning on keys • Key/value pairs are persistent and global • Can store keys in other DHT values • And thus build complex data structures
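
The last two bullets (storing keys inside other values to build data structures) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `ToyDHT` is an in-memory stand-in for a real DHT, and the choice of SHA-1(value) as the key and of a 20-byte "next" pointer at the front of each block is hypothetical, not the layout used by DHash.

```python
import hashlib
from typing import Dict, List, Optional

NIL = b"\x00" * 20  # hypothetical end-of-list marker (same width as a SHA-1 digest)


class ToyDHT:
    """In-memory stand-in for a DHT; a real one spreads blocks over many nodes."""

    def __init__(self) -> None:
        self._blocks: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._blocks[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._blocks.get(key)


def put_block(dht: ToyDHT, value: bytes) -> bytes:
    """Store a value under the SHA-1 of its bytes (an assumed keying scheme)."""
    key = hashlib.sha1(value).digest()
    dht.put(key, value)
    return key


def put_list(dht: ToyDHT, items: List[bytes]) -> bytes:
    """Store items as a linked list: each block is <key of next block><payload>."""
    next_key = NIL
    for item in reversed(items):
        next_key = put_block(dht, next_key + item)
    return next_key  # key of the first block, or NIL for an empty list


def read_list(dht: ToyDHT, head: bytes) -> List[bytes]:
    """Follow the embedded keys from the head block to the end of the list."""
    out: List[bytes] = []
    key = head
    while key != NIL:
        block = dht.get(key)
        if block is None:
            raise KeyError("missing block")
        key, payload = block[:20], block[20:]
        out.append(payload)
    return out


dht = ToyDHT()
head = put_list(dht, [b"alpha", b"beta", b"gamma"])
items = read_list(dht, head)
```

Only the head key has to be remembered; every other block is reachable through keys stored inside values, which is the sense in which get/put alone suffices for larger structures.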

  6. A DHT makes a good shared infrastructure • Many applications can share one DHT service • Much as applications share the Internet • Eases deployment of new applications • Pools resources from many participants • Efficient due to statistical multiplexing • Fault-tolerant due to geographic distribution

  7. Many recent DHT-based projects • File sharing [CFS, OceanStore, PAST, …] • Web cache [Squirrel, ..] • Backup store [Pastiche] • Censor-resistant stores [Eternity, FreeNet,..] • DB query and indexing [Hellerstein, …] • Event notification [Scribe] • Naming systems [ChordDNS, Twine, ..] • Communication primitives [I3, …] Common thread: data is location-independent

  8. The lookup problem [Figure: nodes N1 to N6 on the Internet. A publisher does Put(key=“title”, value=file data…); a client issues Get(key=“title”). Which node holds the key?] • At the heart of all DHTs

  9. Centralized lookup (Napster) [Figure: the publisher at N4, holding key=“title”, value=file data…, registers SetLoc(“title”, N4) with a central DB; the client sends Lookup(“title”) to the DB] • Simple, but O(N) state and a single point of failure

  10. Flooded queries (Gnutella) [Figure: the client's Lookup(“title”) is flooded from node to node until it reaches the publisher at N4, which holds key=“title”, value=file data…] • Robust, but worst case O(N) messages per lookup

  11. Routed queries (Freenet, Chord, etc.) [Figure: the client's Lookup(“title”) is forwarded node by node along a single path to the publisher, which holds key=“title”, value=file data…]

  12. Chord lookup algorithm properties • Interface: lookup(key) → IP address • Efficient: O(log N) messages per lookup • N is the total number of servers • Scalable: O(log N) state per node • Robust: survives massive failures • Simple to analyze

  13. Chord IDs • Key identifier = SHA-1(key) • Node identifier = SHA-1(IP address) • SHA-1 distributes both uniformly • How to map key IDs to node IDs?

  14. Chord Hashes a Key to its Successor [Figure: circular ID space holding both key IDs and node IDs. N10 stores K5, K10; N32 stores K11, K30; N60 stores K33, K40, K52; N80 stores K65, K70; N100 stores K100] • Successor: node with next highest ID
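
The key-to-successor rule can be checked against the figure with a short sketch. The function names (`chord_id`, `successor`) are hypothetical; the small integer IDs are the ones on the slide, standing in for full 160-bit SHA-1 identifiers.

```python
import hashlib
from bisect import bisect_left
from typing import List


def chord_id(name: bytes) -> int:
    """160-bit identifier from SHA-1, as the 'Chord IDs' slide describes."""
    return int.from_bytes(hashlib.sha1(name).digest(), "big")


def successor(node_ids: List[int], key_id: int) -> int:
    """First node whose ID is >= key_id, wrapping around the circular ID space.

    node_ids must be sorted in ascending order.
    """
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]


# Node and key IDs taken from the slide's figure.
nodes = [10, 32, 60, 80, 100]
keys = [5, 10, 11, 30, 33, 40, 52, 65, 70, 100]
assignment = {k: successor(nodes, k) for k in keys}
```

A key whose ID is larger than every node ID wraps around to the lowest-numbered node, so `successor(nodes, 101)` is node 10.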

  15. Basic Lookup [Figure: ring of N5, N10, N20, N32, N40, N60, N80, N99, N110. The query “Where is key 50?” travels around the ring and the answer “Key 50 is at N60” comes back] • Lookups find the ID’s predecessor • Correct if successors are correct

  16. Successor Lists Ensure Robust Lookup [Figure: ring with each node's successor list. N5: 10, 20, 32; N10: 20, 32, 40; N20: 32, 40, 60; N32: 40, 60, 80; N40: 60, 80, 99; N60: 80, 99, 110; N80: 99, 110, 5; N99: 110, 5, 10; N110: 5, 10, 20] • Each node remembers r successors • Lookup can skip over dead nodes to find blocks
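
The successor lists in the figure (r = 3) and the "skip over dead nodes" behaviour can be reproduced with a small sketch. The helper names and the `alive` set are assumptions for illustration; a real node would discover failures through timeouts rather than a shared set.

```python
from typing import Dict, List, Optional, Set


def successor_list(ring: List[int], node: int, r: int) -> List[int]:
    """The r nodes that follow `node` clockwise on the sorted ring."""
    i = ring.index(node)
    return [ring[(i + k) % len(ring)] for k in range(1, r + 1)]


def first_live_successor(
    node: int, lists: Dict[int, List[int]], alive: Set[int]
) -> Optional[int]:
    """Skip dead entries; None means all r remembered successors have failed."""
    for candidate in lists[node]:
        if candidate in alive:
            return candidate
    return None


# Node IDs from the slide's figure, with r = 3 successors per node.
ring = [5, 10, 20, 32, 40, 60, 80, 99, 110]
lists = {n: successor_list(ring, n, 3) for n in ring}

# Suppose N10 and N20 fail at the same time: N5 can still make progress.
alive = set(ring) - {10, 20}
next_hop = first_live_successor(5, lists, alive)
```

With r successors remembered, a node is stuck only if all r of them fail together, which is why the list length trades state for robustness.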

  17. Chord “Finger Table” Accelerates Lookups [Figure: N80's fingers point to the nodes ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]

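  18. Chord lookups take O(log N) hops [Figure: ring of N5, N10, N20, N32, N60, N80, N99, N110. Lookup(K19) follows finger pointers around the ring toward K19, which lies between N10 and N20]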
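
A finger-table lookup over the ring in the figure can be sketched as follows. This is a simplified, single-process simulation under stated assumptions: a 7-bit ID space (128 IDs, chosen to fit the slide's node IDs), globally known membership for building the tables, no failures, and a lookup that starts at N80. The hop sequence it produces is not claimed to match the arrows in the original figure.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

M = 7            # assumed number of ID bits: 2**7 = 128 identifiers
SIZE = 1 << M
RING = [5, 10, 20, 32, 60, 80, 99, 110]  # node IDs from the slide's figure


def successor(key_id: int) -> int:
    """First node at or after key_id, wrapping around the ring."""
    i = bisect_left(RING, key_id % SIZE)
    return RING[i % len(RING)]


def in_open(x: int, a: int, b: int) -> bool:
    """True if x lies in the circular interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def in_half_open(x: int, a: int, b: int) -> bool:
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


# Finger i of node n is the successor of n + 2**i; finger 0 is n's successor.
FINGERS: Dict[int, List[int]] = {
    n: [successor(n + (1 << i)) for i in range(M)] for n in RING
}


def lookup(start: int, key_id: int) -> Tuple[int, List[int]]:
    """Return (node responsible for key_id, list of nodes visited)."""
    n, path = start, [start]
    # Stop once the key falls between the current node and its successor.
    while not in_half_open(key_id, n, FINGERS[n][0]):
        # Jump to the farthest finger that still precedes the key.
        n = next(f for f in reversed(FINGERS[n]) if in_open(f, n, key_id))
        path.append(n)
    return FINGERS[n][0], path


owner, path = lookup(80, 19)
```

Each hop uses the largest finger that does not overshoot the key, so the remaining distance shrinks quickly; here the lookup visits N80, N5, and N10 before returning N20 as the successor of K19.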

  19. Simulation Results: ½ log₂(N) [Plot: average messages per lookup (y-axis) versus number of nodes (x-axis)] • Error bars mark 1st and 99th percentiles

  20. DHash Properties • Builds key/value storage on Chord • Replicates blocks for availability • Caches blocks for load balance • Authenticates block contents

  21. DHash Replicates blocks at r successors [Figure: ring of N5, N10, N20, N40, N50, N60, N68, N80, N99, N110, with Block 17 placed on the successors of ID 17] • Replicas are easy to find if successor fails • Hashed node IDs ensure independent failure

  22. DHash Data Authentication • Two types of DHash blocks: • Content-hash: key = SHA-1(data) • Public-key: key is a public key, data is signed by that key • DHash servers verify before accepting • Clients verify result of get(key)
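
The content-hash case can be sketched with the standard library alone. The class and function names are hypothetical and do not reflect the DHash code. Public-key blocks are left out because verifying signatures needs a cryptography library that the Python standard library does not provide.

```python
import hashlib
from typing import Dict, Optional


class ContentHashServer:
    """Toy server that accepts a block only if key == SHA-1(data)."""

    def __init__(self) -> None:
        self.blocks: Dict[bytes, bytes] = {}

    def put(self, key: bytes, data: bytes) -> None:
        # Server-side check: refuse blocks whose key does not match the data.
        if hashlib.sha1(data).digest() != key:
            raise ValueError("key is not the SHA-1 of the data")
        self.blocks[key] = data

    def get(self, key: bytes) -> Optional[bytes]:
        return self.blocks.get(key)


def verified_get(server: ContentHashServer, key: bytes) -> Optional[bytes]:
    """Client-side check: do not trust the server, re-hash what it returns."""
    data = server.get(key)
    if data is None or hashlib.sha1(data).digest() != key:
        return None
    return data


server = ContentHashServer()
good = b"hello"
good_key = hashlib.sha1(good).digest()
server.put(good_key, good)

# A misbehaving server that swaps the stored bytes is caught by the client.
server.blocks[good_key] = b"tampered"
result_after_tamper = verified_get(server, good_key)
```

Because the key is derived from the data, neither side needs to trust the other: a block that fails the hash check is simply rejected.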

  23. Ivy File System Properties • Traditional file-system interface (almost) • Read/write for multiple users • No central components • Trusted service from untrusted components

  24. Straw Man: Shared Structure [Figure: Root Inode → Directory Block → File1 Inode, File2 Inode, File3 Inode → File3 Data] • Standard meta-data in DHT blocks? • What about locking during updates? • Requires 100% trust

  25. Ivy Design Overview • Log structured • Avoids in-place updates • Each participant writes only its own log • Avoids concurrent updates to DHT data • Each participant reads all logs • Private snapshots for speed

  26. Ivy Software Structure [Figure: an App makes system calls to the kernel's NFS Client, which sends NFS RPCs to the user-level Ivy Server; the Ivy Server talks to DHT Nodes across the Internet]

  27. One Participant’s Ivy Log [Figure: a Log Head pointing into a chain of records (Record 1, Record 2, Record 3). The log head is a mutable public-key signed DHT block; the records are immutable content-hash DHT blocks] • Log-head contains DHT key of most recent record • Each record contains DHT key of previous record

  28. Ivy I-Numbers • Every file has a unique I-Number • Log records contain I-Numbers • Ivy returns I-Numbers to NFS client • NFS requests contain I-Numbers • In the NFS file handle

  29. NFS/Ivy Communication Example • echo hello > d/aaa [Figure: requests from the local NFS client and replies from the local Ivy server: LOOKUP(“d”, I-Num=1000) → I-Num=1000; CREATE(“aaa”, I-Num=1000) → I-Num=9956; WRITE(“hello”, 0, I-Num=9956) → OK] • LOOKUP finds the I-Number of directory “d” • CREATE creates file “aaa” in directory “d” • WRITE writes “hello” at offset 0 in file “aaa”

  30. Log Records for File Creation [Figure: a Log Head and its records: (Type: Create, I-num: 9956); (Type: Link, Dir I-num: 1000, File I-num: 9956, Name: “aaa”); (Type: Write, I-num: 9956, Offset: 0, Data: “hello”); …] • A log record describes a change to the file system

  31. Scanning an Ivy Log [Figure: log records (Type: Link, Dir I-num: 1000, File I-num: 9956, Name: “aaa”); (Type: Link, Dir I-num: 1000, File I-num: 9876, Name: “bbb”); (Type: Remove, Dir I-num: 1000, Name: “aaa”)] • A scan follows the log backwards in time • LOOKUP(name, dir I-num): last Link, but stop at Remove • READDIR(dir I-num): accumulate Links, minus Removes
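
The log structure and the two scans can be sketched for a single log. This is a simplified illustration, not Ivy's record format: the JSON encoding, the field names (`type`, `dir`, `file`, `name`, `prev`), and the plain dictionary standing in for the DHT are all assumptions, and the log head is an ordinary variable rather than a signed public-key block.

```python
import hashlib
import json
from typing import Dict, Iterator, Optional

dht: Dict[str, bytes] = {}  # stand-in for content-hash DHT blocks


def append(log_head: Optional[str], record: dict) -> str:
    """Store a record that points at the previous one; return the new log head."""
    data = json.dumps(dict(record, prev=log_head), sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    dht[key] = data
    return key


def scan(log_head: Optional[str]) -> Iterator[dict]:
    """Follow the log backwards in time, newest record first."""
    key = log_head
    while key is not None:
        record = json.loads(dht[key])
        yield record
        key = record["prev"]


def lookup(log_head: Optional[str], name: str, dir_inum: int) -> Optional[int]:
    """Most recent Link for the name, unless a newer Remove is seen first."""
    for rec in scan(log_head):
        if rec.get("dir") != dir_inum or rec.get("name") != name:
            continue
        if rec["type"] == "Remove":
            return None
        if rec["type"] == "Link":
            return rec["file"]
    return None


def readdir(log_head: Optional[str], dir_inum: int) -> Dict[str, int]:
    """Accumulate Links, ignoring names hidden by a newer Remove."""
    removed, entries = set(), {}
    for rec in scan(log_head):
        if rec.get("dir") != dir_inum:
            continue
        if rec["type"] == "Remove":
            removed.add(rec["name"])
        elif rec["type"] == "Link":
            if rec["name"] not in removed and rec["name"] not in entries:
                entries[rec["name"]] = rec["file"]
    return entries


# Records from the two slides above, appended oldest first.
head = None
head = append(head, {"type": "Create", "inum": 9956})
head = append(head, {"type": "Link", "dir": 1000, "file": 9956, "name": "aaa"})
head = append(head, {"type": "Write", "inum": 9956, "offset": 0, "data": "hello"})
head = append(head, {"type": "Link", "dir": 1000, "file": 9876, "name": "bbb"})
before_remove = lookup(head, "aaa", 1000)
head = append(head, {"type": "Remove", "dir": 1000, "name": "aaa"})
```

After the Remove is appended, a lookup of “aaa” stops at the Remove record and reports no such file, while “bbb” is still found through its Link record.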

  32. Finding Other Logs: The View Block [Figure: the View Block lists Pub Key 1, Pub Key 2, and Pub Key 3, which lead to Log Head 1, Log Head 2, and Log Head 3] • View block is immutable (content-hash DHT block) • View block’s DHT key names the file system • Example: /ivy/37ae5ff901/aaa

  33. Reading Multiple Logs [Figure: Log Head 1 with records numbered 20, 27, 31, 32; Log Head 2 with records numbered 26, 27, 28, 29, 30] • Problem: how to interleave log records? • Red numbers indicate real time of record creation • But we cannot count on synchronized clocks

  34. Vector Timestamps Encode Partial Order [Figure: Log Head 1 with records numbered 20, 27, 31, 32; Log Head 2 with records numbered 26, 27, 28, 29, 30] • Each log record contains vector of DHT keys • One vector entry per log • Entry points to log’s most recent record
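
The partial order that such vectors encode can be sketched with a simplification. Ivy's vectors hold DHT keys; this sketch substitutes integer positions (how many records of each log had been seen) so that entries can be compared directly. That substitution, and the `compare` function itself, are assumptions for illustration.

```python
from typing import Sequence


def compare(a: Sequence[int], b: Sequence[int]) -> str:
    """Classify two vector timestamps with one entry per log.

    Returns "before" if a precedes b, "after" if b precedes a,
    "equal" if they match, and "concurrent" if neither precedes the other.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have one entry per log")
    a_le_b = all(x <= y for x, y in zip(a, b))
    b_le_a = all(y <= x for x, y in zip(a, b))
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Two logs: the first entry counts records seen in log 1, the second in log 2.
r1 = (1, 0)   # written after seeing one record of log 1, none of log 2
r2 = (1, 1)   # written after seeing r1's state plus one record of log 2
r3 = (2, 0)   # written without having seen any record of log 2
```

Here `r1` precedes both `r2` and `r3`, while `r2` and `r3` are concurrent: neither writer had seen the other's record, so the vectors alone do not order them and a reader needs some additional tie-breaking rule.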

  35. Snapshots • Scanning the logs is slow • Each participant keeps a private snapshot • Log pointers as of snapshot creation time • Table of all I-nodes • Each file’s attributes and contents • Reflects all participants’ logs • Participant updates periodically from logs • All snapshots share storage in the DHT

  36. Simultaneous Updates • Ordinary file servers serialize all updates • Ivy does not • Most cases are not a problem: • Simultaneous writes to the same file • Simultaneous creation of different files in same directory • Problem case: • Unlink(“a”) and rename(“a”, “b”) at same time • Ivy correctly lets only one take effect • But it may return “success” status for both

  37. Integrity • Can attacker corrupt my files? • Not unless attacker is in my Ivy view • What if a participant goes bad? • Others can ignore participant’s whole log • Ignore entries after some date • Ignore just harmful records

  38. Ivy Performance • Half as fast as NFS on LAN and WAN • Scalable w/ # of participants • These results were taken yesterday…

  39. Local Benchmark Configuration [Figure: App, NFS Client, Ivy Server, DHash Server] • One log • One DHash server • Ivy+DHash all on one host

  40. Ivy Local Performance on MAB • Modified Andrew Benchmark times (seconds) • NFS: client – LAN – server • 7 seconds doing public key signatures, 3 in DHash

  41. WAN Benchmark Details • 4 DHash nodes at MIT, CMU, NYU, Cornell • Round-trip times: 8, 14, 22 milliseconds • No DHash replication • 4 logs • One active writer at MIT • Whole-file read on open() • Whole-file write on close() • NFS client/server round-trip time is 14 ms

  42. Ivy WAN Performance • 47 seconds fetching log heads, 4 writing log head, 16 inserting log records, 22 in crypto and CPU

  43. Ivy Performance w/ Many Logs • MAB on 4-node WAN • One active writer • Increasing cost due to growing vector timestamps

  44. Related Work • DHTs • Pastry, CAN, Tapestry • File systems • LFS, Zebra, xFS • Byzantine agreement • BFS, OceanStore, Farsite

  45. Summary • Exploring use of DHTs as a building block • Put/get API is general • Provides availability, authentication • Harnesses decentralized peer-to-peer groups • Case study of DHT use: Ivy • Read/write peer-to-peer file system • Trustable system from untrusted pieces • http://pdos.lcs.mit.edu/chord
