240 likes | 550 Views
Dynamo: Amazon's Highly Available Key-Value Store . CScI 8101 : Advanced Operating Systems Presented by: Chaithra KN. Clouded Data: Comprehending Scalable Data Management Systems.
E N D
Dynamo: Amazon's Highly Available Key-Value Store CScI 8101 : Advanced Operating Systems Presented by: Chaithra KN
Clouded Data: Comprehending Scalable Data ManagementSystems Managing petabytes of data for millions of users - challenge for big internet based enterprises such as Google, Yahoo!, and Amazon. Traditional Relational Database Systems – unsuitable (expensive, complex data model, repartitioning is complex) An architectural redesign of data management systems with requirements like: - High scalability - High availability - Low latency - Weaker consistency - Lower application generality
Clouded Data: Comprehending Scalable Data ManagementSystems Some services do not require complex querying and management functionality offered by RDBMS - access data by primary-key only Downgrade some of the service guarantees of traditional RDBMS There is a shift towards NoSQL or non-relational data stores - key-value stores. But relational databases are still needed. Example - in online stores, shopping cart and session management can be handled with a key-value system but the checkout process needs a traditional database Hybrid systems – with traditional RDBMS for some data and key-value stores wherever suitable. A key-value store allows storage of values by keys Amazon Dynamo, Yahoo! PNUTS, Google BigData
Amazon's Dynamo in a nutshell Highly decentralized – architecture similar to Peer-to-peer systems Provides key-value only interface. The value could be just a blob, the data store is not aware of what the contents of value are. Targeted mainly at applications that need an “always writeable” data store where no updates are rejected Built for latency sensitive applications that require atleast 99.9% of read and write operations to be performed within a few hundred milliseconds Sacrifices consistency under certain failure conditions to achieve high reliability - ”eventual consistency”. Nodes are homogenous in their functionality. Integrates and customizes a variety of techniques to achieve its goals: Consistent hashing, object versioning, vector clocks, sloppy quorum, hinted handoff, gossip based distributed failure detection and membership protocol
System Assumptions and Requirements Query Model: simple read and write operations to a data item that is uniquely identified by a key – no complex querying or accesses ranging multiple data items ACID Properties: Does not provide consistency and isolation guarantees Efficiency: latency requirements which are generally measured at the 99.9th percentile of the distribution – built for applications that have stringent SLA requirements Ex: A service that guarantees to provide a resonse within 300ms for 99.9% of its requests for a peak client load of 500 requests per second Internal only - Operation environment is assumed to be non-hostile and there are no security related requirements such as authentication and authorization.
Design requirements Incremental scalability Symmetry Decentralization Heterogenity Conflict resolution is pushed to reads in order to ensure that writes are never rejected.
Consistent Hashing Partitioning algorithm used to dynamically partition the data. The output range of a hash function is treated as a fixed circular space or ring. Each node in the system is assigned a random value within this space which represents its position in the ring. Data item's key hashed to determine the node which will process it. Virtual nodes – each node assigned to multiple positions Advantage - departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected. No need to store system state – i.e. no trackers needed to keep track of where my data is going into.
Advantages of virtual nodes If a node becomes unavailable the load handled by this node is evenly dispersed across the remaining available nodes. When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes. The number of virtual nodes that a node is responsible can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure Incremental scalability and heterogenity achieved.
Replication Needed to achieve high availability and durability Replicates data on N hosts – N is configurable Data stored locally + replicated on coordinator's N-1 clockwise successor nodes. Preference list – list of nodes responsible for storing a particular key.
Vector clocks Eventual consistency – updates propagated to all replicas asynchronously. All replicas eventually become consistent through reconciliation measures taken at the time of read operations. Multiple versions of data exist – updates can happen to older versions – divergent versions reconciled later. When newer versions subsume previous versions – system decides the final version. Version branching – conflicting versions of an object – the client performs the reconciliation (collapse). One vector clock associated with every version of every object – (node,counter) pair. While updating, a client must specify which version it is updating – using context obtained from an earlier read. Reconciliation during reads
System interface, execution of get and put get(key) put(key, context, object) – context contains information like version of the object Uses consistency protocol similar to quorum-like systems. R – minimum number of nodes that must participate in a successful read operation. W - minimum number of nodes that must participate in a successful write operation. Quorum-like system : R + W > N N – top N nodes from the preference list Strict quorum affects availablility and durability.
Selecting a node Two strategies that a client can use to select a node: (1) route its request through a generic load balancer that will select a node based on load information, or (2) use a partition-aware client library that routes requests directly to the appropriate coordinator nodes. The advantage of the first approach - client does not have to link any code specific to Dynamo in its application Second strategy - can achieve lower latency because it skips a potential forwarding step.
Sloppy quorum and hinted handoff Sloppy quorum used to handle temporary server failures and network partitions. All read and write operations are performed on first N healthy nodes from the preference list, which may or may not always be the first N nodes encountered while walking the consistent hashing ring. * The replacement node will get a hint in its metadata using which it will send replica to the intended recipient has recovered – Hinted handoff * To account for node failures, preference list contains more than N nodes.
Other techniques Merkel trees – to detect inconsistencies between replicas and to minimize the amount of transferred data. Gossipping – Nodes can be added and removed multiple times – Gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership.
Implementation Each storage node has 3 software components: - Request coordination - Membership and failure detection - Local persistence engine
Balancing performance and durability Buffered read and writes Missing writes if server crashes - One of the N replicas can perform a durable write without affecting performance
Positives and Negatives + No need to store system data separately.No concept of dynamic system state with consistency guarantees on that. + A single node is mapped onto multiple virtual nodes. This leads to uniform data and load distribution. + Merkel tree approach – very less data transfer needed to maintain consistency of replicas. + Client applications can tune the value of N, W and R to achieve their desired level of performance, availability and durability. - Static provisioning of hash space can lead to over-provisioning - Conflict reconciliations at application layer to solve inconsistency problems might increase software development cost.
Comparing PNUTS and Dynamo PNUTS - Hashed / Ordered tables - Generation based versioning - Communication through Pub / Sub YMB infrastructure (optimized for geographically separated replicas) - Partitioning into tablets - Timeline based consistency Dynamo - Key – value pairs - Vector clocks used - Gossip based - Consistent hashing - Eventual consistency and reconciliation
Clouded Data Ad hoc systems - It remains to be seen if such application level coordination without support from the data management service will be sufficient with more complex operations and heavier workloads. Atomic objects to single data items – is this sufficient?