CPE 401 / 601 Computer Network Systems Distributed Hash Tables Mehmet Gunes Modified from Ashwin Bharambe, Robert Morris, Krishna Gummadi and Jinyang Li
Peer to peer systems • Symmetric in node functionality • Anybody can join and leave • Everybody gives and takes • Great potential • Huge aggregate network capacity, storage, etc. • Resilient to single points of failure and attack • E.g., Napster, Gnutella, BitTorrent, Skype, Joost
The lookup problem • [Diagram: a publisher calls put(key=“title”, value=file data…) and a client calls get(key=“title”); which of the nodes N1…N6 across the Internet holds the data?] • At the heart of all DHTs
Centralized lookup (Napster) • [Diagram: the publisher at N4 registers SetLoc(“title”, N4) with a central DB; the client asks the DB Lookup(“title”) and then fetches key=“title”, value=file data… from N4] • Simple, but O(N) state and a single point of failure • Not scalable • Central service needs $$$
Improved #1: Hierarchical lookup • Performs hierarchical lookup (like DNS) • More scalable than a central site • Root of the hierarchy is still a single point of vulnerability
Flooded queries (Gnutella) • [Diagram: the client’s Lookup(“title”) is flooded from node to node until it reaches the publisher holding key=“title”, value=file data…] • Too much overhead traffic • Limited in scale • No guarantee of results • Robust, but worst case O(N) messages per lookup
Improved #2: super peers • E.g. Kazaa, Skype • Lookup only floods super peers • Still no guarantees for lookup results
DHT approach: routing • Structured: • put(key, value); value = get(key) • maintain routing state for fast lookup • Symmetric: • all nodes have identical roles • Many protocols: • Chord, Kademlia, Pastry, Tapestry, CAN, Viceroy
Routed queries (Freenet, Chord, etc.) • [Diagram: the client’s Lookup(“title”) is routed hop by hop through intermediate nodes toward the publisher holding key=“title”, value=file data…]
Hash Table • Name-value pairs (or key-value pairs) • E.g., “Mehmet Hadi Gunes” and mgunes@unr.edu • E.g., “http://cse.unr.edu/” and the Web page • E.g., “HitSong.mp3” and “12.78.183.2” • Hash table • Data structure that associates keys with values: value = lookup(key)
Key-Value Examples • Amazon: • Key: customerID • Value: customer profile (e.g., buying history, credit card, …) • Facebook, Twitter: • Key: UserID • Value: user profile (e.g., posting history, photos, friends, …) • iCloud/iTunes: • Key: movie/song name • Value: movie, song
System Examples • Amazon: • Dynamo: internal key-value store used to power Amazon.com (shopping cart) • Simple Storage Service (S3) • BigTable/HBase/Hypertable: distributed, scalable data storage • Cassandra: “distributed data management system” (developed by Facebook) • Memcached: in-memory key-value store for small chunks of arbitrary data (strings, objects) • eDonkey/eMule: peer-to-peer file sharing systems • …
Distributed Hash Table • Hash table spread over many nodes • Distributed over a wide area • Main design goals • Decentralization • no central coordinator • Scalability • efficient even with large # of nodes • Fault tolerance • tolerate nodes joining/leaving
Distributed hash table (DHT) • [Diagram: a distributed application calls put(key, data) and get(key) on the DHT layer; the DHT uses a lookup service, lookup(key) -> node IP address, to reach the node that stores the key] • DHT distributes data storage over many nodes • Each node knows little information • Low computational/memory overhead
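As a rough illustration of this layering, here is a minimal, self-contained Python sketch (not from the slides): a hypothetical LookupService maps each key to a node, and a DHT class exposes put/get on top of it. The node addresses and the placement rule are invented for illustration.

```python
import hashlib

# Layering from the slide: application -> put/get -> DHT -> lookup(key) -> node.
# Node names and the simplistic placement rule are stand-ins for illustration.

class Node:
    def __init__(self, addr):
        self.addr = addr
        self.store = {}                          # local key -> value storage

class LookupService:
    """Maps a key to the node responsible for it; real DHTs do this via routing."""
    def __init__(self, nodes):
        self.nodes = nodes

    def lookup(self, key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]   # toy placement, not a real protocol

class DHT:
    """put/get interface offered to the distributed application."""
    def __init__(self, lookup_service):
        self.lookup_service = lookup_service

    def put(self, key, data):
        self.lookup_service.lookup(key).store[key] = data

    def get(self, key):
        return self.lookup_service.lookup(key).store.get(key)

nodes = [Node(f"10.0.0.{i}") for i in range(1, 4)]
dht = DHT(LookupService(nodes))
dht.put("HitSong.mp3", "12.78.183.2")
print(dht.get("HitSong.mp3"))                    # -> 12.78.183.2
```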
Key-Value Store • Also called a Distributed Hash Table (DHT) • Main idea: partition the set of key-value pairs across many machines
DHT: basic idea • [Diagram sequence: many nodes, each holding a share of the (K, V) pairs; neighboring nodes are “connected” at the application level] • Operation: take key as input; route messages to the node holding that key • insert(K1, V1) is routed to the responsible node, which stores (K1, V1) • retrieve(K1) is routed the same way and returns V1
Distributed Hash Table • Two key design decisions • How do we map names onto nodes? • How do we route a request to that node?
Hash Functions • Hashing • Transform the key into a number • And use the number to index an array • Example hash function • Hash(x) = x mod 101, mapping to 0, 1, …, 100 • Challenges • What if there are more than 101 nodes? Fewer? • Which nodes correspond to each hash value? • What if nodes come and go over time?
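A small sketch of why the last question is hard with a plain mod-N hash: the bucket function below follows the Hash(x) = x mod 101 example, and the script simply counts how many keys change buckets when one bucket is added. The key strings and the counts are illustrative only.

```python
import hashlib

# Why "nodes come and go" hurts a plain mod-N hash: adding one bucket remaps
# almost every key. The numbers printed are just what this script computes.

def bucket(key, n_buckets):
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % n_buckets                 # Hash(x) = x mod n, as on the slide

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if bucket(k, 101) != bucket(k, 102))
print(f"{moved / len(keys):.0%} of keys move when going from 101 to 102 buckets")
# Typically ~99% of the keys move, which is what consistent hashing avoids.
```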
Consistent Hashing • “view” = subset of hash buckets that are visible • For this conversation, “view” is O(n) neighbors • But don’t need strong consistency on views • Desired features • Balanced: in any one view, load is equal across buckets • Smoothness: little impact on hash bucket contents when buckets are added/removed • Spread: small set of hash buckets that may hold an object regardless of views • Load: across views, # objects assigned to hash bucket is small
Challenges Fault Tolerance: handle machine failures without losing data and without degradation in performance Scalability: • Need to scale to thousands of machines • Need to allow easy addition of new machines Consistency: maintain data consistency in face of node failures and message losses Heterogeneity (if deployed as peer-to-peer systems): • Latency: ms to secs • Bandwidth: kb/s to Gb/s …
Key Fundamental Design Idea - I • Consistent Hashing • Map keys and nodes to an identifier space; implicit assignment of responsibility • [Diagram: nodes A, B, C, D placed along the identifier space 0000000000 … 1111111111] • Mapping performed using hash functions (e.g., SHA-1) • Spread nodes and keys uniformly throughout the space
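A minimal consistent-hashing sketch under the assumptions above: node names and keys are hashed (SHA-1, as on the slide) into one identifier space, and a key is assigned to the first node that follows it around the circle. The node names, the 16-bit identifier space, and the Ring class are illustrative, not any particular protocol’s API.

```python
import bisect
import hashlib

# Consistent hashing: nodes and keys share one circular identifier space; a key
# belongs to the first node whose identifier follows it (wrapping around).

def ident(name, bits=16):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** bits)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ident(n), n) for n in nodes)
        self.ids = [p for p, _ in self.points]

    def node_for(self, key):
        i = bisect.bisect_right(self.ids, ident(key))
        return self.points[i % len(self.points)][1]   # wrap around the circle

ring = Ring(["A", "B", "C", "D"])
print(ring.node_for("HitSong.mp3"))
# Adding or removing one node only moves the keys between it and its neighbor.
```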
Fundamental Design Idea - II • Prefix / Hypercube routing • [Diagram: a query routed hop by hop from source to destination]
Definition of a DHT • Hash table supports two operations • insert(key, value) • value = lookup(key) • Distributed • Map hash-buckets to nodes • Requirements • Uniform distribution of buckets • Cost of insert and lookup should scale well • Amount of local state (routing table size) should scale well
DHT ideas • Scalable routing needs a notion of “closeness” • Ring structure [Chord]
Different notions of “closeness” • Tree-like structure [Pastry, Kademlia, Tapestry] • [Diagram: leaf identifiers 01100100, 01100101, 01100110, 01100111 under the prefixes 011000**, 011001**, 011010**, 011011**; lookup key 01100101] • Distance measured as the length of the longest prefix shared with the lookup key
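A sketch of this tree-like closeness measure: the helper below counts the shared prefix length between a node identifier and the lookup key, and one hop simply forwards to the known node with the longest match. The 8-bit identifiers are the ones shown on the slide; the routing step itself is simplified.

```python
# Prefix "closeness" (Pastry/Kademlia/Tapestry style): distance is judged by how
# long a prefix a node identifier shares with the lookup key.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

key = "01100101"
known_nodes = ["01100100", "01100110", "01100111", "01101000"]
# Each hop forwards the query to the known node sharing the longest prefix.
best = max(known_nodes, key=lambda n: shared_prefix_len(n, key))
print(best, shared_prefix_len(best, key))   # -> 01100100, 7 matching bits
```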
Different notions of “closeness” • Cartesian space [CAN] • [Diagram: the unit square from (0,0) to (1,1) partitioned into rectangular zones, one per node] • Distance measured as geometric distance
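And a sketch of the Cartesian notion of closeness: assuming each node owns a zone of the unit square, a lookup is forwarded toward the node whose zone centre is geometrically closest to the key’s point. The zone centres and the key’s coordinates below are invented.

```python
import math

# Cartesian "closeness" (CAN): a key hashes to a point in the coordinate space,
# and queries move toward the node owning the zone closest to that point.

def distance(p, q):
    return math.dist(p, q)          # straight-line distance in the coordinate space

key_point = (0.6, 0.3)              # a key hashed to (x, y) coordinates
zone_centres = {"A": (0.25, 0.75), "B": (0.75, 0.75), "C": (0.6, 0.4), "D": (0.9, 0.25)}
closest = min(zone_centres, key=lambda n: distance(zone_centres[n], key_point))
print(closest)                       # -> C in this made-up layout
```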
Directory-Based Architecture • Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with those keys • [Diagram: put(K14, V14) goes to the Master/Directory, which records K14 -> N3 and forwards put(K14, V14) to node N3 for storage]
Directory-Based Architecture • Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with those keys • [Diagram: get(K14) goes to the Master/Directory, which looks up K14 -> N3, fetches V14 from N3, and returns it to the requester]
Directory-Based Architecture • Having the master relay the requests: recursive query • Another method: iterative query • Return the node to the requester and let the requester contact the node itself • [Diagrams: put(K14, V14) and get(K14) are sent to the Master/Directory, which replies “N3”; the requester then stores to, or fetches V14 from, N3 directly]
Discussion: Iterative vs. Recursive Query • Recursive query • Advantages: • Faster, as typically the master/directory is closer to the nodes • Easier to maintain consistency, as the master/directory can serialize puts()/gets() • Disadvantages: scalability bottleneck, as all values go through the master/directory • Iterative query • Advantages: more scalable • Disadvantages: slower, harder to enforce data consistency • [Diagram: the recursive get(K14) returns V14 through the Master/Directory; the iterative get(K14) returns N3 and the requester fetches V14 from N3 itself]
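The toy sketch below contrasts the two query styles, assuming a hypothetical Directory class that knows which node stores each key: the recursive variant relays the value through the directory, while the iterative variant only returns the node and lets the requester fetch the value itself.

```python
# Illustrative only: the class and method names are made up for this sketch.

class Node:
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)

class Directory:
    def __init__(self, placement):            # placement: key -> Node
        self.placement = placement

    def get_recursive(self, key):
        # Directory contacts the node itself and relays the value back.
        return self.placement[key].get(key)

    def get_iterative(self, key):
        # Directory only tells the requester which node to ask.
        return self.placement[key]

n3 = Node(); n3.store["K14"] = "V14"
directory = Directory({"K14": n3})

print(directory.get_recursive("K14"))             # value flows through the directory
print(directory.get_iterative("K14").get("K14"))  # requester contacts N3 directly
```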
Fault Tolerance • Replicate each value on several nodes • Usually, place replicas on different racks in a datacenter to guard against rack failures • [Diagram: put(K14, V14) goes to the Master/Directory, which records K14 -> N1, N3; the put is then propagated so that both N1 and N3 store V14 (recursive replication)]
Fault Tolerance • Again, we can have • Recursive replication (previous slide) • Iterative replication (this slide) • [Diagram: the Master/Directory records K14 -> N1, N3 and the requester sends put(K14, V14) to N1 and to N3 itself]
Fault Tolerance • Or we can use a recursive query and iterative replication • [Diagram: put(K14, V14) goes to the Master/Directory, which itself sends put(K14, V14) to both N1 and N3]
Scalability Storage: use more nodes Number of requests: • Can serve requests from all nodes on which a value is stored in parallel • Master can replicate a popular value on more nodes Master/directory scalability: • Replicate it • Partition it, so different keys are served by different masters/directories Q: How do you partition?
Scalability: Load Balancing • Directory keeps track of storage availability at each node • Preferentially insert new values on nodes with more storage available • What happens when a new node is added? • Move values from heavily loaded nodes to the new node • What happens when a node fails? • Need to replicate values from the failed node to other nodes
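A minimal sketch of the placement rule above, assuming the directory tracks free storage per node: a new value simply goes to the node with the most space left. The node names and numbers are invented.

```python
# Directory-side load balancing sketch: prefer the node with the most free storage.

free_space_gb = {"N1": 120, "N2": 35, "N3": 410, "N50": 260}   # made-up capacities

def pick_node_for_insert(free_space):
    return max(free_space, key=free_space.get)   # most available storage wins

target = pick_node_for_insert(free_space_gb)
print(target)                                    # -> N3 in this example
free_space_gb[target] -= 1                       # account for the newly stored value
```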
Consistency Need to make sure that a value is replicated correctly How do we know a value has been replicated on every node? • Wait for acknowledgements from every node What happens if a node fails during replication? • Pick another node and try again What happens if a node is slow? • Slow down the entire put()? Pick another node? In general, with multiple replicas • Slow puts and fast gets
Consistency (cont’d) • With concurrent updates (i.e., puts to the same key), we may need to make sure the updates are applied in the same order everywhere • [Diagram: put(K14, V14’) and put(K14, V14’’) reach N1 and N3 in reverse order, so the two replicas end up with different values] • What does get(K14) return? • Undefined!
Consistency Key consistency: All nodes agree on who holds a particular key Data consistency: All nodes agree on the value associated with a key Reachability consistency: All nodes are connected and can be reached in a lookup. Chord aims to provide eventual reachability consistency, upon which key and data consistency are built.
Quorum Consensus Improve put() and get() operation performance • Define a replica set of size N • put() waits for acknowledgements from at least W replicas • get() waits for responses from at least R replicas • W+R > N Why does it work? • There is at least one node that contains the update
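A toy sketch of the quorum rule, assuming N=3 replicas held as plain dictionaries and a made-up chance that a replica misses a write: put() succeeds only after W acknowledgements, get() polls R replicas, and W + R > N forces the set of replicas read to overlap the set of replicas written.

```python
import random

# Quorum sketch: with W + R > N, at least one polled replica has the update.

N, W, R = 3, 2, 2
replicas = [dict() for _ in range(N)]            # the replica set for one key

def quorum_put(key, value):
    acks = 0
    for rep in replicas:
        if random.random() < 0.9:                # simulate an occasional failed write
            rep[key] = value
            acks += 1
    return acks >= W                             # success needs at least W ACKs

def quorum_get(key):
    polled = random.sample(replicas, R)          # ask any R replicas out of N
    values = [rep[key] for rep in polled if key in rep]
    return values[0] if values else None

if quorum_put("K14", "V14"):
    print(quorum_get("K14"))                     # the overlap guarantees V14 is seen
```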
Quorum Consensus Example • N=3, W=2, R=2 • Replica set for K14: {N1, N3, N4} • Assume the put() to N3 fails • [Diagram: put(K14, V14) is sent to N1, N3, and N4; N1 and N4 store V14 and acknowledge, so the put succeeds with W=2 ACKs]
Quorum Consensus Example • Now, issuing get() to any two nodes out of the three will return the answer • [Diagram: get(K14) to N3 returns nil, while get(K14) to N4 returns V14]
Scaling Up Directory • Challenge: • The directory contains a number of entries equal to the number of (key, value) tuples in the system • Can be hundreds of billions of entries! • Solution: consistent hashing • Associate to each node a unique id in a uni-dimensional space 0..2^m-1 • Partition this space across the machines • Assume keys are in the same uni-dimensional space • Each (key, value) pair is stored at the node with the smallest ID larger than the key
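A sketch of this placement rule, assuming a handful of made-up node identifiers in a 2^m-sized space: each key is stored at the node with the smallest ID larger than the key, wrapping around the identifier space.

```python
import bisect

# Successor placement in an identifier space 0 .. 2^m - 1. Node IDs and the
# example keys are arbitrary small integers chosen for illustration.

m = 6                                   # identifier space 0 .. 63
node_ids = sorted([5, 18, 27, 44, 59])  # IDs the nodes hashed to

def responsible_node(key_id):
    i = bisect.bisect_right(node_ids, key_id)      # first node ID larger than the key
    return node_ids[i % len(node_ids)]             # wrap around past 2^m - 1

print(responsible_node(20))   # -> 27
print(responsible_node(60))   # -> 5 (wraps around the ring)
```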