Amazon’s Key-Value Store: Dynamo (UCSB CS271). Adapted from Amazon’s Dynamo presentation. Source: DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon’s Highly Available Key-Value Store. SOSP 2007.
Motivation • Reliability at massive scale • The slightest outage has significant financial consequences • High write availability • Amazon’s platform: tens of thousands of servers and network components, geographically dispersed • Provide persistent storage in spite of failures • Sacrifice consistency to achieve performance, reliability, and scalability
Dynamo Design Rationale • Most services need only key-based access: • Best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, and so on. • For these access patterns, the prevalent RDBMS-based application design would be catastrophic for scalability and availability. • Dynamo therefore provides a primary-key-only interface.
Dynamo Design Overview • Data partitioning using consistent hashing • Data replication • Consistency via version vectors • Replica synchronization via a quorum protocol • Gossip-based failure detection and membership protocol
System Requirements • Data & query model: • Read/write operations via primary key • No relational schema: <key, value> objects • Objects are typically smaller than 1 MB • Consistency guarantees: • Weak consistency • Only single-key updates; no isolation guarantees for read-modify-write sequences • Efficiency: • SLAs stated at the 99.9th percentile of operation latency • Notes: • Commodity hardware • Minimal security measures, since Dynamo runs in Amazon’s internal, trusted environment
Service Level Agreements (SLAs) • An application can deliver its functionality in bounded time only if every dependency in the platform delivers its functionality with even tighter bounds. • Example SLA: a service guarantees a response within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second.
System Interface • Two basic operations: • get(key): • Locates the replicas for the key • Returns the object plus a context (opaque metadata, including the version) • put(key, context, object): • Determines replica placement from the key and writes the replicas to disk • Context: version information (a vector clock) • The key is hashed to a 128-bit identifier that determines placement
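A minimal sketch of how this interface might look from a client’s point of view, in Python. The class and method names are illustrative, not Amazon’s actual API; only the get/put shapes and the 128-bit key hash follow the slide above.

```python
# Illustrative client-side view of the Dynamo interface.
# Names are hypothetical; only get/put and the 128-bit key hash follow the paper.

import hashlib


class DynamoClient:
    def get(self, key: bytes):
        """Locate the replicas for `key`; return (versions, context).

        `versions` may hold more than one object if replicas diverged;
        `context` is opaque metadata that includes the vector clock.
        """
        raise NotImplementedError

    def put(self, key: bytes, context, obj: bytes) -> None:
        """Write `obj` at the replicas for `key`.

        The caller passes back the `context` from a prior get() so the
        store can record causality in the vector clock.
        """
        raise NotImplementedError

    @staticmethod
    def position(key: bytes) -> int:
        # The key is hashed to a 128-bit identifier (MD5 in the paper)
        # that fixes its position on the consistent-hashing ring.
        return int.from_bytes(hashlib.md5(key).digest(), "big")
```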
Partition Algorithm • Consistent hashing: the output range of a hash function is treated as a fixed circular space or “ring,” a la Chord. • Virtual nodes: each physical node can be responsible for more than one virtual node (to deal with non-uniform data and load distribution).
Virtual Nodes
Advantages of using virtual nodes • The number of virtual nodes a physical node is responsible for can be chosen based on its capacity, accounting for heterogeneity in the physical infrastructure. • A physical node’s load is spread around the ring, so no single node becomes a hot spot. • If a node becomes unavailable, the load it handled is evenly dispersed across the remaining available nodes. • When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes. • A small sketch of the mechanism follows below.
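A minimal sketch of consistent hashing with virtual nodes, assuming MD5-derived tokens and a simple "node#vnode-i" naming scheme (both are assumptions for illustration, not Dynamo’s exact mechanism).

```python
# Minimal consistent-hashing ring with virtual nodes (illustrative sketch).

import bisect
import hashlib


def _token(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")


class Ring:
    def __init__(self):
        self._tokens = []      # sorted virtual-node positions on the ring
        self._owner = {}       # token -> physical node name

    def add_node(self, node: str, num_vnodes: int) -> None:
        # More virtual nodes can be assigned to more capable machines.
        for i in range(num_vnodes):
            t = _token(f"{node}#vnode{i}")
            bisect.insort(self._tokens, t)
            self._owner[t] = node

    def successor(self, key: bytes) -> str:
        # The first virtual node clockwise from hash(key) owns the key.
        h = int.from_bytes(hashlib.md5(key).digest(), "big")
        idx = bisect.bisect_right(self._tokens, h) % len(self._tokens)
        return self._owner[self._tokens[idx]]
```

For example, after `ring.add_node("A", 32)` and similar calls for other machines, `ring.successor(b"cart:1234")` returns the physical node whose virtual node first follows the key clockwise; giving a larger machine more virtual nodes is what lets the ring account for heterogeneous capacities.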
Replication • Each data item is replicated at N hosts. • Preference list: the list of nodes responsible for storing a particular key. • Some fine-tuning is needed to account for virtual nodes (see below).
Preference Lists • The list of nodes responsible for storing a particular key. • Due to failures, the preference list contains more than N nodes. • Due to virtual nodes, the preference list skips ring positions to ensure it contains distinct physical nodes, as sketched below.
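Continuing the Ring sketch above, a hedged sketch of preference-list construction: walk clockwise from the key’s position and collect N distinct physical nodes, skipping virtual nodes whose machine is already on the list. (Illustrative only, not Dynamo’s exact code.)

```python
# Build a preference list of N distinct physical nodes for a key,
# reusing the Ring class from the earlier sketch.

import bisect
import hashlib


def preference_list(ring: "Ring", key: bytes, n: int) -> list:
    h = int.from_bytes(hashlib.md5(key).digest(), "big")
    start = bisect.bisect_right(ring._tokens, h)
    chosen, seen = [], set()
    for i in range(len(ring._tokens)):
        token = ring._tokens[(start + i) % len(ring._tokens)]
        node = ring._owner[token]
        if node not in seen:        # skip virtual nodes of machines already chosen
            seen.add(node)
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen
```

With N = 3 this yields the three distinct machines that replicate the key, which is the basis of the preference list described above.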
Data Versioning • A put() call may return to its caller before the update has been applied at all replicas. • A get() call may return multiple versions of the same object. • Challenge: distinct, diverging versions of an object may coexist. • Solution: use vector clocks to capture causality between different versions of the same object.
Vector Clock • A vector clock is a list of (node, counter) pairs. • Every version of every object is associated with one vector clock. • If all counters in the first object’s clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten. • Otherwise the versions are concurrent: the application reconciles the divergent versions and collapses them into a single new version.
Vector clock example
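The original slide walks through a concrete version history; below is a minimal sketch of the underlying comparison, with vector clocks represented as node-to-counter dictionaries (the representation is an assumption for illustration).

```python
# Causality check on vector clocks (dicts of node -> counter).
# If every counter in clock `a` is <= the corresponding counter in clock `b`,
# the version with clock `a` is an ancestor of the one with clock `b` and can
# be discarded; otherwise the versions conflict and go to the application.

def descends(b: dict, a: dict) -> bool:
    """True if the version with clock `b` causally descends from clock `a`."""
    return all(a[node] <= b.get(node, 0) for node in a)


def reconcile_needed(a: dict, b: dict) -> bool:
    # Neither clock dominates the other: the versions are concurrent.
    return not descends(a, b) and not descends(b, a)
```

For example, the clock {Sx: 2} is an ancestor of {Sx: 2, Sy: 1}, but {Sx: 2, Sy: 1} and {Sx: 2, Sz: 1} are concurrent and must be reconciled by the application.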
Routing Requests • Two options: • Route the request through a generic load balancer that selects a node based on load information. • Use a partition-aware client library that routes the request directly to the relevant node. • A gossip protocol propagates membership changes: every second, each node contacts a peer chosen at random, and the two nodes reconcile their membership-change histories (sketched below).
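A rough sketch of one gossip round, under the assumption that each node’s membership view is a map from node name to the timestamp of its latest membership change; the `exchange` RPC is hypothetical.

```python
# One gossip round: contact a random peer and fold its membership view into ours.

import random


def gossip_round(my_view: dict, peers: list, exchange) -> None:
    """Contact one randomly chosen peer and merge its membership view into ours."""
    peer = random.choice(peers)
    peer_view = exchange(peer, my_view)    # hypothetical RPC; peer merges our view on its side
    for node, ts in peer_view.items():
        if ts > my_view.get(node, 0.0):    # keep the most recent change per node
            my_view[node] = ts
```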
Sloppy Quorum • R and W are the minimum numbers of nodes that must participate in a successful read and write operation, respectively. • Setting R + W > N yields a quorum-like system. • In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency and availability.
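A one-line way to see the quorum condition; the (N, R, W) = (3, 2, 2) configuration used in the assertions is the common setting reported in the Dynamo paper.

```python
# R + W > N forces every read quorum to intersect the most recent write quorum.

def is_quorum(n: int, r: int, w: int) -> bool:
    return r + w > n


assert is_quorum(3, 2, 2)       # common Dynamo configuration: overlap of at least one replica
assert not is_quorum(3, 1, 1)   # R + W <= N: a read may miss the latest write
```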
Highlights of Dynamo • High write availability • Optimistic replication, with vector clocks for conflict resolution • Consistent hashing (as in Chord) in a controlled environment • Quorums for relaxed consistency.
Cassandra (Facebook). Source: Lakshman and Malik: Cassandra - A Decentralized Structured Storage System. LADIS 2009.
Data Model • Key-value store, but closer to Bigtable. • Basically a distributed multi-dimensional map indexed by a key (pictured below). • The value is structured into columns, which are grouped into column families: simple and super (a column family within a column family). • An operation is atomic on a single row. • API: insert, get, and delete.
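One way to picture this model is as nested maps; the row key, family names, and values below are made up for illustration.

```python
# Cassandra's data model viewed as nested maps (illustrative only):
# row key -> column family -> column name -> value, with a super column
# family adding one more level of nesting.

row = {
    "user:42": {                               # row key
        "profile": {                           # simple column family
            "name": "alice",
            "email": "alice@example.com",
        },
        "posts": {                             # super column family
            "2009-08-01": {"title": "hello", "body": "..."},
        },
    },
}
```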
System Architecture • Like Dynamo (and Chord). • Uses an order-preserving hash function on a fixed circular space. The node responsible for a key is called its coordinator. • Order-preserving hashing can cause non-uniform data distribution: the system keeps track of the distribution and reorganizes the ring if necessary.
Replication • Each item is replicated at N hosts. • Replication policies: Rack Unaware; Rack Aware (within a data center); Datacenter Aware. • The system has an elected leader. • When a node joins the system, the leader assigns it a range of data items and replicas. • Each node is aware of every other node in the system and the ranges they are responsible for.
Membership and Failure Detection • Gossip-based mechanism to maintain cluster membership. • A node determines which nodes are up and down using a failure detector. • The Φ accrual failure detector returns a suspicion level, Φ, for each monitored node. • If a node starts suspecting A at Φ = 1, 2, or 3, the likelihood that the suspicion is a mistake is about 10%, 1%, and 0.1%, respectively. • Every node maintains a sliding window of interarrival times of gossip messages from other nodes, estimates the interarrival-time distribution, and then computes Φ; the distribution is approximated as exponential.
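A minimal sketch of the accrual detector under the exponential approximation mentioned above, assuming the observed mean interarrival time is m: the probability a heartbeat is still outstanding after t seconds is e^(-t/m), so Φ(t) = -log10(e^(-t/m)) = t / (m ln 10). The class layout and window size are illustrative.

```python
# Phi accrual failure detector with the exponential approximation:
# phi(t) = t / (mean * ln 10), so phi = 1, 2, 3 corresponds to roughly a
# 10%, 1%, 0.1% chance of being wrong if the node is actually alive.

import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window: int = 1000):
        self._intervals = deque(maxlen=window)   # sliding window of heartbeat interarrival times
        self._last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self._last_heartbeat is not None:
            self._intervals.append(now - self._last_heartbeat)
        self._last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self._intervals or self._last_heartbeat is None:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        elapsed = now - self._last_heartbeat
        return elapsed / (mean * math.log(10))
```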
Operations • Use quorums: R and W. • If R + W > N, a read will return the latest value. • Otherwise, reads return the value with the highest timestamp among the replicas contacted, and so may return stale versions. • Read repair: with every read, push the newest version to any out-of-date replicas. • Anti-entropy: compute Merkle trees to catch any out-of-sync data (expensive). • Each write goes first to a persistent commit log, then to an in-memory data structure (sketched below).
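A sketch of the write path described in the last bullet, appending to a durable commit log before updating an in-memory table; the file format and the timestamp-based overwrite rule are assumptions for illustration.

```python
# Write path sketch: commit log first, then the in-memory table.

import json


class WritePath:
    def __init__(self, log_path: str = "commitlog.jsonl"):
        self._log = open(log_path, "a")
        self._memtable = {}                       # key -> (timestamp, value)

    def write(self, key: str, value: str, timestamp: float) -> None:
        record = {"key": key, "value": value, "ts": timestamp}
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()                         # persist to the commit log first
        current = self._memtable.get(key)
        if current is None or timestamp > current[0]:
            self._memtable[key] = (timestamp, value)   # then apply to the memtable
```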