Amazon’s Key-Value Store: Dynamo (UCSB CS271). Adapted from Amazon’s Dynamo presentation. Source: DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon’s Highly Available Key-Value Store. SOSP 2007.
Motivation • Reliability at massive scale • The slightest outage has significant financial consequences • High write availability • Amazon’s platform: tens of thousands of servers and network components, geographically dispersed • Provide persistent storage in spite of failures • Sacrifice consistency to achieve performance, reliability, and scalability
Dynamo Design Rationale • Most services need only key-based access: • Best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, and so on. • For these access patterns, the prevalent RDBMS-based application design would be catastrophic for scalability and availability. • Dynamo therefore provides a primary-key-only interface.
Dynamo Design Overview • Data partitioning using consistent hashing • Data replication • Consistency via version vectors • Replica synchronization via a quorum protocol • Gossip-based failure detection and membership protocol
System Requirements • Data & query model: • Read/write operations via primary key • No relational schema: <key, value> objects • Objects are typically smaller than 1 MB • Consistency guarantees: • Weak consistency • Only single-key updates; no isolation guarantees for read-modify-write sequences • Efficiency: • SLAs stated at the 99.9th percentile of operation latency • Notes: • Commodity hardware • Minimal security measures, since Dynamo runs in Amazon’s internal, trusted environment
Service Level Agreements (SLAs) • An application can deliver its functionality in bounded time only if every dependency in the platform delivers its functionality with even tighter bounds. • Example SLA: a service guarantees a response within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second.
System Interface • Two basic operations: • get(key): • Locates the replicas for the key • Returns the object plus a context (opaque metadata, including the version) • put(key, context, object): • Determines replica placement from the key and writes the replicas to disk • Context: version information (a vector clock) • The key is hashed to a 128-bit identifier that determines placement
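A minimal sketch of how this interface might look from a client’s point of view, in Python. The class and method names are illustrative, not Amazon’s actual API; only the get/put shapes and the 128-bit key hash follow the slide above.

```python
# Illustrative client-side view of the Dynamo interface.
# Names are hypothetical; only get/put and the 128-bit key hash follow the paper.

import hashlib


class DynamoClient:
    def get(self, key: bytes):
        """Locate the replicas for `key`; return (versions, context).

        `versions` may hold more than one object if replicas diverged;
        `context` is opaque metadata that includes the vector clock.
        """
        raise NotImplementedError

    def put(self, key: bytes, context, obj: bytes) -> None:
        """Write `obj` at the replicas for `key`.

        The caller passes back the `context` from a prior get() so the
        store can record causality in the vector clock.
        """
        raise NotImplementedError

    @staticmethod
    def position(key: bytes) -> int:
        # The key is hashed to a 128-bit identifier (MD5 in the paper)
        # that fixes its position on the consistent-hashing ring.
        return int.from_bytes(hashlib.md5(key).digest(), "big")
```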
Partition Algorithm • Consistent hashing: the output range of a hash function is treated as a fixed circular space or “ring,” a la Chord. • Virtual nodes: each physical node can be responsible for more than one virtual node (to deal with non-uniform data and load distribution).
Virtual Nodes
Advantages of using virtual nodes • The number of virtual nodes a physical node is responsible for can be chosen based on its capacity, accounting for heterogeneity in the physical infrastructure. • A physical node’s load is spread around the ring, so no single node becomes a hot spot. • If a node becomes unavailable, the load it handled is evenly dispersed across the remaining available nodes. • When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes. • A small sketch of the mechanism follows below.
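A minimal sketch of consistent hashing with virtual nodes, assuming MD5-derived tokens and a simple "node#vnode-i" naming scheme (both are assumptions for illustration, not Dynamo’s exact mechanism).

```python
# Minimal consistent-hashing ring with virtual nodes (illustrative sketch).

import bisect
import hashlib


def _token(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")


class Ring:
    def __init__(self):
        self._tokens = []      # sorted virtual-node positions on the ring
        self._owner = {}       # token -> physical node name

    def add_node(self, node: str, num_vnodes: int) -> None:
        # More virtual nodes can be assigned to more capable machines.
        for i in range(num_vnodes):
            t = _token(f"{node}#vnode{i}")
            bisect.insort(self._tokens, t)
            self._owner[t] = node

    def successor(self, key: bytes) -> str:
        # The first virtual node clockwise from hash(key) owns the key.
        h = int.from_bytes(hashlib.md5(key).digest(), "big")
        idx = bisect.bisect_right(self._tokens, h) % len(self._tokens)
        return self._owner[self._tokens[idx]]
```

For example, after `ring.add_node("A", 32)` and similar calls for other machines, `ring.successor(b"cart:1234")` returns the physical node whose virtual node first follows the key clockwise; giving a larger machine more virtual nodes is what lets the ring account for heterogeneous capacities.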
Replication • Each data item is replicated at N hosts. • Preference list: the list of nodes responsible for storing a particular key. • Some fine-tuning is needed to account for virtual nodes (see below).
Preference Lists • The list of nodes responsible for storing a particular key. • Due to failures, the preference list contains more than N nodes. • Due to virtual nodes, the preference list skips ring positions to ensure it contains distinct physical nodes, as sketched below.
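Continuing the Ring sketch above, a hedged sketch of preference-list construction: walk clockwise from the key’s position and collect N distinct physical nodes, skipping virtual nodes whose machine is already on the list. (Illustrative only, not Dynamo’s exact code.)

```python
# Build a preference list of N distinct physical nodes for a key,
# reusing the Ring class from the earlier sketch.

import bisect
import hashlib


def preference_list(ring: "Ring", key: bytes, n: int) -> list:
    h = int.from_bytes(hashlib.md5(key).digest(), "big")
    start = bisect.bisect_right(ring._tokens, h)
    chosen, seen = [], set()
    for i in range(len(ring._tokens)):
        token = ring._tokens[(start + i) % len(ring._tokens)]
        node = ring._owner[token]
        if node not in seen:        # skip virtual nodes of machines already chosen
            seen.add(node)
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen
```

With N = 3 this yields the three distinct machines that replicate the key, which is the basis of the preference list described above.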
Data Versioning • A put() call may return to its caller before the update has been applied at all replicas. • A get() call may return multiple versions of the same object. • Challenge: distinct, diverging versions of an object may coexist. • Solution: use vector clocks to capture causality between different versions of the same object.
Vector Clock • A vector clock is a list of (node, counter) pairs. • Every version of every object is associated with one vector clock. • If all counters in the first object’s clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten. • Otherwise the versions are concurrent: the application reconciles the divergent versions and collapses them into a single new version.
Vector clock example
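The original slide walks through a concrete version history; below is a minimal sketch of the underlying comparison, with vector clocks represented as node-to-counter dictionaries (the representation is an assumption for illustration).

```python
# Causality check on vector clocks (dicts of node -> counter).
# If every counter in clock `a` is <= the corresponding counter in clock `b`,
# the version with clock `a` is an ancestor of the one with clock `b` and can
# be discarded; otherwise the versions conflict and go to the application.

def descends(b: dict, a: dict) -> bool:
    """True if the version with clock `b` causally descends from clock `a`."""
    return all(a[node] <= b.get(node, 0) for node in a)


def reconcile_needed(a: dict, b: dict) -> bool:
    # Neither clock dominates the other: the versions are concurrent.
    return not descends(a, b) and not descends(b, a)
```

For example, the clock {Sx: 2} is an ancestor of {Sx: 2, Sy: 1}, but {Sx: 2, Sy: 1} and {Sx: 2, Sz: 1} are concurrent and must be reconciled by the application.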
Routing Requests • Two options: • Route the request through a generic load balancer that selects a node based on load information. • Use a partition-aware client library that routes the request directly to the relevant node. • A gossip protocol propagates membership changes: every second, each node contacts a peer chosen at random, and the two nodes reconcile their membership-change histories (sketched below).
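A rough sketch of one gossip round, under the assumption that each node’s membership view is a map from node name to the timestamp of its latest membership change; the `exchange` RPC is hypothetical.

```python
# One gossip round: contact a random peer and fold its membership view into ours.

import random


def gossip_round(my_view: dict, peers: list, exchange) -> None:
    """Contact one randomly chosen peer and merge its membership view into ours."""
    peer = random.choice(peers)
    peer_view = exchange(peer, my_view)    # hypothetical RPC; peer merges our view on its side
    for node, ts in peer_view.items():
        if ts > my_view.get(node, 0.0):    # keep the most recent change per node
            my_view[node] = ts
```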
Sloppy Quorum • R and W are the minimum numbers of nodes that must participate in a successful read and write operation, respectively. • Setting R + W > N yields a quorum-like system. • In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency and availability.
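A one-line way to see the quorum condition; the (N, R, W) = (3, 2, 2) configuration used in the assertions is the common setting reported in the Dynamo paper.

```python
# R + W > N forces every read quorum to intersect the most recent write quorum.

def is_quorum(n: int, r: int, w: int) -> bool:
    return r + w > n


assert is_quorum(3, 2, 2)       # common Dynamo configuration: overlap of at least one replica
assert not is_quorum(3, 1, 1)   # R + W <= N: a read may miss the latest write
```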
Highlights of Dynamo • High write availability • Optimistic replication, with vector clocks for conflict resolution • Consistent hashing (as in Chord) in a controlled environment • Quorums for relaxed consistency.
Cassandra (Facebook). Source: Lakshman and Malik: Cassandra - A Decentralized Structured Storage System. LADIS 2009.
Data Model • Key-value store, but closer to Bigtable. • Basically a distributed multi-dimensional map indexed by a key (pictured below). • The value is structured into columns, which are grouped into column families: simple and super (a column family within a column family). • An operation is atomic on a single row. • API: insert, get, and delete.
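One way to picture this model is as nested maps; the row key, family names, and values below are made up for illustration.

```python
# Cassandra's data model viewed as nested maps (illustrative only):
# row key -> column family -> column name -> value, with a super column
# family adding one more level of nesting.

row = {
    "user:42": {                               # row key
        "profile": {                           # simple column family
            "name": "alice",
            "email": "alice@example.com",
        },
        "posts": {                             # super column family
            "2009-08-01": {"title": "hello", "body": "..."},
        },
    },
}
```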
System Architecture • Like Dynamo (and Chord). • Uses an order-preserving hash function on a fixed circular space. The node responsible for a key is called its coordinator. • Order-preserving hashing can cause non-uniform data distribution: the system keeps track of the distribution and reorganizes the ring if necessary.
Replication • Each item is replicated at N hosts. • Replication policies: Rack Unaware; Rack Aware (within a data center); Datacenter Aware. • The system has an elected leader. • When a node joins the system, the leader assigns it a range of data items and replicas. • Each node is aware of every other node in the system and the ranges they are responsible for.
Membership and Failure Detection • Gossip-based mechanism to maintain cluster membership. • A node determines which nodes are up and down using a failure detector. • The Φ accrual failure detector returns a suspicion level, Φ, for each monitored node. • If a node starts suspecting A at Φ = 1, 2, or 3, the likelihood that the suspicion is a mistake is about 10%, 1%, and 0.1%, respectively. • Every node maintains a sliding window of interarrival times of gossip messages from other nodes, estimates the interarrival-time distribution, and then computes Φ; the distribution is approximated as exponential.
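A minimal sketch of the accrual detector under the exponential approximation mentioned above, assuming the observed mean interarrival time is m: the probability a heartbeat is still outstanding after t seconds is e^(-t/m), so Φ(t) = -log10(e^(-t/m)) = t / (m ln 10). The class layout and window size are illustrative.

```python
# Phi accrual failure detector with the exponential approximation:
# phi(t) = t / (mean * ln 10), so phi = 1, 2, 3 corresponds to roughly a
# 10%, 1%, 0.1% chance of being wrong if the node is actually alive.

import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window: int = 1000):
        self._intervals = deque(maxlen=window)   # sliding window of heartbeat interarrival times
        self._last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self._last_heartbeat is not None:
            self._intervals.append(now - self._last_heartbeat)
        self._last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self._intervals or self._last_heartbeat is None:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        elapsed = now - self._last_heartbeat
        return elapsed / (mean * math.log(10))
```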
Operations • Use quorums: R and W. • If R + W > N, a read will return the latest value. • Otherwise, reads return the value with the highest timestamp among the replicas contacted, and so may return stale versions. • Read repair: with every read, push the newest version to any out-of-date replicas. • Anti-entropy: compute Merkle trees to catch any out-of-sync data (expensive). • Each write goes first to a persistent commit log, then to an in-memory data structure (sketched below).
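A sketch of the write path described in the last bullet, appending to a durable commit log before updating an in-memory table; the file format and the timestamp-based overwrite rule are assumptions for illustration.

```python
# Write path sketch: commit log first, then the in-memory table.

import json


class WritePath:
    def __init__(self, log_path: str = "commitlog.jsonl"):
        self._log = open(log_path, "a")
        self._memtable = {}                       # key -> (timestamp, value)

    def write(self, key: str, value: str, timestamp: float) -> None:
        record = {"key": key, "value": value, "ts": timestamp}
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()                         # persist to the commit log first
        current = self._memtable.get(key)
        if current is None or timestamp > current[0]:
            self._memtable[key] = (timestamp, value)   # then apply to the memtable
```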