DCS 3. Key-value Stores and NoSQL Wang Qi 2013.10.27
Outline • Why NoSQL? • Key-value stores and NoSQL • Cassandra's internals and technologies • When should we use NoSQL? • How to shift from SQL (RDBMS) to NoSQL
Why NoSQL • RDBMS • Data stored in tables • Schema-based structured tables • Queried using SQL (Structured Query Language) • ACID (Atomicity, Consistency, Isolation, Durability)
Mismatch with today's workloads • Data: large and unstructured • Lots of random reads and writes • Foreign keys and join queries are rarely needed • Too many locks • Need • Speed (low latency) • No single point of failure (high availability) • Incremental scalability • Scale out, not up: use more off-the-shelf (COTS) machines, not more powerful machines
Cassandra • Designed and open-sourced by Facebook • Features • Highly scalable and available • Eventually consistent • Distributed • Key-value store • Distributed technologies from Dynamo + data model from BigTable = Cassandra
Cassandra Internals: Data Model • Column • Name, Value, Timestamp • Up to 2 billion columns per row • No schemas • Variable number of columns • Variable types of values • Stored in sorted order by column name
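The column model above can be sketched in a few lines. This is a minimal illustration, not Cassandra's actual code: the `Column` and `Row` names are hypothetical, and an in-memory dict stands in for the storage engine.

```python
# Hypothetical sketch of Cassandra's column model: a row is a
# schema-less bag of (name, value, timestamp) columns, read back
# in sorted order by column name.
from dataclasses import dataclass
import time

@dataclass
class Column:
    name: str
    value: object      # value types can vary column to column
    timestamp: float   # used to resolve conflicting writes

class Row:
    def __init__(self):
        self._columns = {}

    def insert(self, name, value):
        # No schema: any column name/value may be added at any time.
        self._columns[name] = Column(name, value, time.time())

    def columns(self):
        # Columns come back sorted by name, as in an SSTable.
        return [self._columns[n] for n in sorted(self._columns)]

row = Row()
row.insert("email", "qi@example.com")
row.insert("age", 30)
print([c.name for c in row.columns()])  # -> ['age', 'email']
```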
Cassandra Internals: Write Path • Client sends a write request to one node in the cluster (the Coordinator) • Data partition: decide the node(s) on which the data resides • Consistent hashing • Replication strategy • Quorum • Store in the data node: Commit log -> Memtables -> Respond to client • LSM-Tree (Log-Structured Merge Tree)
Consistent hashing • Partitions data based on the primary key • Assigns a hash value to each primary key • Each node is responsible for a range of hash values • Data is placed on the node whose range covers its hash value
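The placement rule above can be sketched as a hash ring. This is a simplified illustration: the `Ring` class and the MD5 hash choice are assumptions for the sketch (Cassandra uses its own partitioners), and virtual nodes are omitted.

```python
# Minimal consistent-hashing sketch: each node owns the arc of the
# ring ending at its hash position; a key goes to the first node
# found walking clockwise from the key's hash.
import hashlib
from bisect import bisect

class Ring:
    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        positions = [p for p, _ in self._ring]
        # bisect finds the first node position past the key's hash;
        # the modulo wraps around the ring.
        i = bisect(positions, h) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["N16", "N45", "N80", "N112"])
print(ring.node_for("K13"))  # the same key always maps to the same node
```

Because only the arcs adjacent to a joining or leaving node change owner, adding a node remaps only a fraction of the keys — the property that gives Cassandra its incremental scalability.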
Replication strategy • Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. • [Figure: a ring of nodes (N16, N32, N45, N80, N96, N112); the coordinator (typically one per DC) routes a read/write for key K13 to its primary replica and to backup replicas.]
Quorum and Consistency • Quorum: a way of selecting sets so that any pair of sets intersects • E.g., any arbitrary set with at least Q = N/2 + 1 nodes • N = total number of replicas for this key • R = read replica count, W = write replica count • Write to any W replicas, read from any R replicas. If W + R > N, you have consistency, i.e., each read returns the latest written value • Cassandra's tunable consistency levels: ONE, QUORUM, ALL, etc.
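The W + R > N condition can be checked exhaustively for small N. This brute-force sketch (the `quorums_intersect` helper is hypothetical) verifies the intersection property the slide relies on: every read set must overlap every write set, so some replica in the read set holds the latest write.

```python
# Brute-force check of the quorum intersection property:
# if W + R > N, every write set of size W intersects every
# read set of size R, so reads see the latest write.
from itertools import combinations

def quorums_intersect(n, r, w):
    replicas = range(n)
    for write_set in combinations(replicas, w):
        for read_set in combinations(replicas, r):
            if not set(write_set) & set(read_set):
                return False  # found disjoint quorums: stale read possible
    return True

assert quorums_intersect(3, 2, 2)      # QUORUM/QUORUM on N=3: consistent
assert not quorums_intersect(3, 1, 1)  # ONE/ONE: a read can miss the write
```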
Log-Structured Merge Tree • B+Tree: random writes • LSM-Tree: sequential writes
Writes at a data node On receiving a write • 1. Log it in the on-disk commit log (the log is append-only) • 2. Make changes to the appropriate memtables • In-memory representation of multiple key-value pairs • Later, when a memtable reaches a threshold, flush it to disk • Data file: an SSTable (Sorted String Table) – a list of key-value pairs, sorted by key • Index file: an SSTable of (key, position in data SSTable) pairs • Compaction: merge multiple SSTables into one • Data updates that accumulate over time generate many SSTables • Compaction improves read performance
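The steps above can be sketched as follows. This is a hedged toy model, not Cassandra's storage engine: the `DataNode` class, the flush threshold, and the in-memory lists standing in for disk files are all assumptions for illustration (the index file and recovery path are omitted).

```python
# Toy write path at a data node: append to a commit log, update an
# in-memory memtable, flush it to a sorted immutable SSTable at a
# threshold, and compact SSTables by merging (newer tables win).
class DataNode:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only; real systems use it for recovery
        self.memtable = {}     # in-memory key -> value
        self.sstables = []     # each: list of (key, value), sorted by key
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, sequentially
        self.memtable[key] = value            # 2. update the memtable
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # One sequential write of a sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Merge all SSTables into one; later (newer) tables overwrite
        # earlier entries for the same key.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())]
```

Note that every disk touch here is an append or a sequential dump — the reason the LSM-Tree design favors write throughput over the random in-place writes of a B+Tree.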
Cassandra Internals: Read Path • Client sends a read request to one node in the cluster (the Coordinator) • Data partition: similar to writes • Read in the data node: Row Cache -> Memtable -> Bloom Filter -> Key Cache -> Memory index -> Disk index -> SSTable -> Respond to coordinator • Bloom Filter • Coordinator compares the replica results and responds to the client • Read repair
Bloom Filter • Compact way of representing a set of items • Checking for existence in the set is cheap • Some probability of false positives: an item not in the set may test true as being in the set • On insert, set all hashed bits • On check-if-present, return true if all hashed bits are set • False positives are possible; false negatives are not
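A minimal sketch of the insert/check logic described above. The bit-array size, the number of hash functions, and the SHA-256-with-salt hashing scheme are illustrative choices, not what Cassandra uses internally.

```python
# Toy Bloom filter: k hash positions per item; insert sets all k bits,
# lookup answers "definitely not present" or "possibly present".
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting one hash function k ways.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True          # set all hashed bits

    def might_contain(self, item):
        # True may be a false positive; False is always correct --
        # which is why a negative lets the read path skip an SSTable.
        return all(self.bits[p] for p in self._positions(item))
```

On the read path this is the cheap first question asked of each SSTable: a negative answer avoids touching its index and data files at all.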
Cassandra Internals: Read Path • Read in the data node: Row Cache -> Memtable and SSTables • SSTable read path: Bloom Filter -> Key Cache -> Memory index -> Disk index -> SSTable • Respond to coordinator
Cassandra Internals: Eventual Consistency • Cassandra's consistency is eventual: as data is replicated, the latest version sits on some nodes while older versions remain on others; eventually, all nodes see the latest version. • Mechanisms: • Hinted handoff • Read repair • Anti-entropy
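Read repair, one of the mechanisms listed above, can be sketched as follows. This is a simplified model: plain dicts stand in for replicas, the `read_with_repair` function is hypothetical, and each stored value carries the write timestamp used to pick a winner.

```python
# Sketch of read repair: the coordinator reads from several replicas,
# picks the version with the newest timestamp, and writes it back to
# any replica holding a stale (or missing) version.
def read_with_repair(replicas, key):
    # Each replica maps key -> (value, timestamp).
    versions = [r[key] for r in replicas if key in r]
    latest = max(versions, key=lambda v: v[1])  # newest timestamp wins
    for r in replicas:
        if r.get(key) != latest:
            r[key] = latest        # repair the stale replica in place
    return latest[0]

r1 = {"K13": ("v2", 20)}
r2 = {"K13": ("v1", 10)}   # stale replica
print(read_with_repair([r1, r2], "K13"))  # -> v2, and r2 is repaired
```

Each read thus nudges the replicas toward agreement, which is what makes the consistency "eventual" rather than never.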
Cluster Membership and Failure Detection • Gossip-based cluster membership • Each membership entry holds: node address, generation (local), heartbeat version • Protocol: • Nodes periodically gossip their membership list • On receipt, the local membership list is updated • If any heartbeat is older than Tfail, that node is marked as failed • [Figure: four nodes (1–4) exchanging membership lists]
A Gossip Round in Cassandra • Node A generates a digest of its local state and sends it to node B. • Node B compares the digest with its local information, then replies with an ack carrying the full data for any entries where B's information is newer. • Node A merges the ack, then sends its own ack back to node B with the entries where A's information is newer; finally, node B updates its state from this message.
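The effect of the exchange above can be sketched with a simplified model. Two assumptions to note: full heartbeat maps are exchanged directly rather than digests followed by deltas, and a bare heartbeat counter stands in for Cassandra's (generation, version) pair.

```python
# Simplified gossip round: each node keeps a heartbeat counter per
# member; after a round, both sides hold the fresher entry for every
# known node.
def merge(local, remote):
    # Per node address, keep whichever heartbeat is higher.
    for addr, hb in remote.items():
        if hb > local.get(addr, -1):
            local[addr] = hb

def gossip_round(a, b):
    merge(b, a)   # A's state reaches B; B adopts anything newer
    merge(a, b)   # B's reply brings its newer entries back to A

a = {"A": 5, "B": 1}
b = {"B": 3, "C": 7}
gossip_round(a, b)
assert a == b == {"A": 5, "B": 3, "C": 7}
```

With each node gossiping to a few random peers per period, fresh heartbeats spread through the cluster in O(log N) rounds, which is why gossip scales to large memberships.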
Transactions in Cassandra • Atomicity • Row-level atomicity • Consistency • Tunable consistency • Isolation • Row-level isolation • Durability • Writes are durable through the commit log
Performance Evaluation • On > 50 GB of data • MySQL • Writes: 300 ms avg • Reads: 350 ms avg • Cassandra • Writes: 0.12 ms avg • Reads: 15 ms avg
When should we use NoSQL • Big enough data • Commodity (off-the-shelf) nodes rather than high-performance hardware • Can live without RDBMS features: • Secondary indexes • Transactions • Advanced query languages
Cassandra data modeling: don't think of a relational DBMS • The data model is a nested sorted map: SortedMap<RowKey, SortedMap<ColumnName, (ColumnValue, Timestamp)>> • Storing values in column names • The sorted map gives efficient key lookup and efficient range scans. • The number of column keys is almost unbounded. • Model column families around query patterns • Moderately de-normalize and duplicate for read performance • In a high-scale distributed store there are no joins, so the costs of normalization are magnified. • With a fully normalized schema, reads may perform much worse.
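The nested-sorted-map model and the "values in column names" pattern can be sketched together. Everything here is illustrative: a plain dict sorted on read stands in for the on-disk sorted map, and the `like:item:<id>` column-name convention is a hypothetical example of the pattern.

```python
# Sketch of SortedMap<RowKey, SortedMap<ColumnName, Value>> with
# values stored in the column *names*: each liked item becomes a
# column, so one sorted range scan over a row answers the query.
data = {}   # row key -> {column name -> value}

def insert(row_key, column_name, value=b""):
    data.setdefault(row_key, {})[column_name] = value

def slice_columns(row_key, start, end):
    # Cassandra serves this from already-sorted storage; sorting on
    # read here just illustrates the same range-scan semantics.
    row = data.get(row_key, {})
    return [c for c in sorted(row) if start <= c <= end]

insert("user:1", "like:item:42")
insert("user:1", "like:item:7")
print(slice_columns("user:1", "like:", "like:~"))
```

The column value is left empty: the column name alone carries the data, which is exactly what "storing values in column names" means.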
Example: ‘Like’ relationship between User & Item • Get user by user id • Get item by item id • Get all the items that a particular user likes • Get all the users who like a particular item
A replica of the relational model • There is no easy way to query all the items a particular user likes, or all the users who like a particular item, because there are no efficient secondary indexes.
Normalized entities with de-normalized custom indexes • Title and username are de-normalized into User_By_Item and Item_By_User respectively. It becomes efficient to query all the item titles liked by a given user, and all the user names who like a given item.
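The de-normalized index tables above can be sketched with plain dicts. The `User_By_Item` and `Item_By_User` names follow the slide; the `like` helper and the row layouts are assumptions for the sketch.

```python
# Sketch of the custom-index pattern: a "like" writes into both index
# column families, duplicating title and username so that each of the
# four queries is a single row lookup with no join.
users, items = {}, {}
item_by_user, user_by_item = {}, {}   # de-normalized custom indexes

def like(user_id, item_id):
    # Duplicate the title and username into the index rows at write
    # time; reads then never need to touch the entity tables.
    item_by_user.setdefault(user_id, {})[item_id] = items[item_id]["title"]
    user_by_item.setdefault(item_id, {})[user_id] = users[user_id]["name"]

users["u1"] = {"name": "wang"}
items["i1"] = {"title": "NoSQL intro"}
like("u1", "i1")
print(item_by_user["u1"])   # all item titles liked by user u1
print(user_by_item["i1"])   # all user names who like item i1
```

This is the trade the slide describes: extra writes and duplicated data in exchange for reads that match the query patterns exactly.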
Best Practices of Cassandra Data Modeling • Keep column names short, unless you use the column name to store actual data • Column names are stored repeatedly. • Design the data model so that operations are idempotent • Idempotent operations tolerate partial failures, since they can be retried safely. • If you need transactional behavior, try to model your data so that you only need to update a single row at a time • Cassandra offers row-level atomicity.
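The idempotency point can be made concrete with a two-line contrast. The `set_views`/`increment_views` names and the counter example are hypothetical illustrations.

```python
# Idempotent vs. non-idempotent updates: setting a column to an
# absolute value can be retried safely after a timeout; a relative
# increment cannot, because a retry double-counts.
state = {"views": 0}

def set_views(value):          # idempotent: applying twice == once
    state["views"] = value

def increment_views():         # not idempotent: each retry adds again
    state["views"] += 1

set_views(10); set_views(10)          # retried write: still 10
assert state["views"] == 10
increment_views(); increment_views()  # retried write: now 12, not 11
assert state["views"] == 12
```

Since a coordinator cannot always tell whether a timed-out write was applied, modeling updates as absolute writes lets clients retry blindly without corrupting the data.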