DCS 3. Key-value Stores and NoSQL Wang Qi 2013.10.27
Outline • Why NoSQL? • Key-value stores and NoSQL • Cassandra's internals and technologies • When should we use NoSQL? • How to shift from SQL (RDBMS) to NoSQL
Why NoSQL • RDBMS • Data stored in tables • Schema-based structured tables • Queried using SQL (Structured Query Language) • ACID (Atomicity, Consistency, Isolation, Durability)
Mismatch with today's workloads • Data: large and unstructured • Lots of random reads and writes • Foreign keys and join queries are rarely needed • Too many locks • Need • Speed (low latency) • No single point of failure (high availability) • Incremental scalability • Scale out, not up: use more off-the-shelf (COTS) machines, not more powerful machines
Cassandra • Designed and open-sourced by Facebook • Features • Highly scalable and available • Eventually consistent • Distributed • Key-value store • Distributed technologies from Dynamo + data model from BigTable = Cassandra
Cassandra Internals: Data Model • Column • Name, Value, Timestamp • Up to 2 billion columns per row • No schemas • Variable number of columns • Variable types of values • Stored in sorted order by column name
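The column model above can be sketched in a few lines. This is a minimal illustration, not Cassandra's actual code: the `Column` and `Row` names are hypothetical, and an in-memory dict stands in for the storage engine.

```python
# Hypothetical sketch of Cassandra's column model: a row is a
# schema-less bag of (name, value, timestamp) columns, read back
# in sorted order by column name.
from dataclasses import dataclass
import time

@dataclass
class Column:
    name: str
    value: object      # value types can vary column to column
    timestamp: float   # used to resolve conflicting writes

class Row:
    def __init__(self):
        self._columns = {}

    def insert(self, name, value):
        # No schema: any column name/value may be added at any time.
        self._columns[name] = Column(name, value, time.time())

    def columns(self):
        # Columns come back sorted by name, as in an SSTable.
        return [self._columns[n] for n in sorted(self._columns)]

row = Row()
row.insert("email", "qi@example.com")
row.insert("age", 30)
print([c.name for c in row.columns()])  # -> ['age', 'email']
```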
Cassandra Internals: Write Path • Client sends a write request to one node in the cluster (the Coordinator) • Data partition: decide the node(s) on which the data resides • Consistent hashing • Replication strategy • Quorum • Store in the data node: Commit log -> Memtables -> Respond to client • LSM-Tree (Log-Structured Merge Tree)
Consistent hashing • Partitions data based on the primary key • Assigns a hash value to each primary key • Each node is responsible for a range of hash values • Data is placed on the node whose range covers its hash value
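The placement rule above can be sketched as a hash ring. This is a simplified illustration: the `Ring` class and the MD5 hash choice are assumptions for the sketch (Cassandra uses its own partitioners), and virtual nodes are omitted.

```python
# Minimal consistent-hashing sketch: each node owns the arc of the
# ring ending at its hash position; a key goes to the first node
# found walking clockwise from the key's hash.
import hashlib
from bisect import bisect

class Ring:
    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        positions = [p for p, _ in self._ring]
        # bisect finds the first node position past the key's hash;
        # the modulo wraps around the ring.
        i = bisect(positions, h) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["N16", "N45", "N80", "N112"])
print(ring.node_for("K13"))  # the same key always maps to the same node
```

Because only the arcs adjacent to a joining or leaving node change owner, adding a node remaps only a fraction of the keys — the property that gives Cassandra its incremental scalability.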
Replication strategy • Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. • [Figure: a ring of nodes (N16, N32, N45, N80, N96, N112); the coordinator (typically one per DC) routes a read/write for key K13 to its primary replica and to backup replicas.]
Quorum and Consistency • Quorum: a way of selecting sets so that any pair of sets intersects • E.g., any arbitrary set with at least Q = N/2 + 1 nodes • N = total number of replicas for this key • R = read replica count, W = write replica count • Write to any W replicas, read from any R replicas. If W + R > N, you have consistency, i.e., each read returns the latest written value • Cassandra's tunable consistency levels: ONE, QUORUM, ALL, etc.
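The W + R > N condition can be checked exhaustively for small N. This brute-force sketch (the `quorums_intersect` helper is hypothetical) verifies the intersection property the slide relies on: every read set must overlap every write set, so some replica in the read set holds the latest write.

```python
# Brute-force check of the quorum intersection property:
# if W + R > N, every write set of size W intersects every
# read set of size R, so reads see the latest write.
from itertools import combinations

def quorums_intersect(n, r, w):
    replicas = range(n)
    for write_set in combinations(replicas, w):
        for read_set in combinations(replicas, r):
            if not set(write_set) & set(read_set):
                return False  # found disjoint quorums: stale read possible
    return True

assert quorums_intersect(3, 2, 2)      # QUORUM/QUORUM on N=3: consistent
assert not quorums_intersect(3, 1, 1)  # ONE/ONE: a read can miss the write
```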
Log-Structured Merge Tree • B+Tree: random writes • LSM-Tree: sequential writes
Writes at a data node On receiving a write • 1. Log it in the on-disk commit log (the log is append-only) • 2. Make changes to the appropriate memtables • In-memory representation of multiple key-value pairs • Later, when a memtable reaches a threshold, flush it to disk • Data file: an SSTable (Sorted String Table) – a list of key-value pairs, sorted by key • Index file: an SSTable of (key, position in data SSTable) pairs • Compaction: merge multiple SSTables into one • Data updates that accumulate over time generate many SSTables • Compaction improves read performance
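The steps above can be sketched as follows. This is a hedged toy model, not Cassandra's storage engine: the `DataNode` class, the flush threshold, and the in-memory lists standing in for disk files are all assumptions for illustration (the index file and recovery path are omitted).

```python
# Toy write path at a data node: append to a commit log, update an
# in-memory memtable, flush it to a sorted immutable SSTable at a
# threshold, and compact SSTables by merging (newer tables win).
class DataNode:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only; real systems use it for recovery
        self.memtable = {}     # in-memory key -> value
        self.sstables = []     # each: list of (key, value), sorted by key
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, sequentially
        self.memtable[key] = value            # 2. update the memtable
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # One sequential write of a sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Merge all SSTables into one; later (newer) tables overwrite
        # earlier entries for the same key.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())]
```

Note that every disk touch here is an append or a sequential dump — the reason the LSM-Tree design favors write throughput over the random in-place writes of a B+Tree.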
Cassandra Internals: Read Path • Client sends a read request to one node in the cluster (the Coordinator) • Data partition: similar to writes • Read in the data node: Row Cache -> Memtable -> Bloom Filter -> Key Cache -> Memory index -> Disk index -> SSTable -> Respond to coordinator • Bloom Filter • Coordinator compares the replica results and responds to the client • Read repair
Bloom Filter • Compact way of representing a set of items • Checking for existence in the set is cheap • Some probability of false positives: an item not in the set may test true as being in the set • On insert, set all hashed bits • On check-if-present, return true if all hashed bits are set • False positives are possible; false negatives are not
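A minimal sketch of the insert/check logic described above. The bit-array size, the number of hash functions, and the SHA-256-with-salt hashing scheme are illustrative choices, not what Cassandra uses internally.

```python
# Toy Bloom filter: k hash positions per item; insert sets all k bits,
# lookup answers "definitely not present" or "possibly present".
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting one hash function k ways.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True          # set all hashed bits

    def might_contain(self, item):
        # True may be a false positive; False is always correct --
        # which is why a negative lets the read path skip an SSTable.
        return all(self.bits[p] for p in self._positions(item))
```

On the read path this is the cheap first question asked of each SSTable: a negative answer avoids touching its index and data files at all.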
Cassandra Internals: Read Path • Read in the data node: Row Cache -> Memtable and SSTables • SSTable read path: Bloom Filter -> Key Cache -> Memory index -> Disk index -> SSTable • Respond to coordinator
Cassandra Internals: Eventual Consistency • Cassandra's consistency is eventual: as data is replicated, the latest version sits on some nodes while older versions remain on others; eventually, all nodes see the latest version. • Mechanisms: • Hinted handoff • Read repair • Anti-entropy
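Read repair, one of the mechanisms listed above, can be sketched as follows. This is a simplified model: plain dicts stand in for replicas, the `read_with_repair` function is hypothetical, and each stored value carries the write timestamp used to pick a winner.

```python
# Sketch of read repair: the coordinator reads from several replicas,
# picks the version with the newest timestamp, and writes it back to
# any replica holding a stale (or missing) version.
def read_with_repair(replicas, key):
    # Each replica maps key -> (value, timestamp).
    versions = [r[key] for r in replicas if key in r]
    latest = max(versions, key=lambda v: v[1])  # newest timestamp wins
    for r in replicas:
        if r.get(key) != latest:
            r[key] = latest        # repair the stale replica in place
    return latest[0]

r1 = {"K13": ("v2", 20)}
r2 = {"K13": ("v1", 10)}   # stale replica
print(read_with_repair([r1, r2], "K13"))  # -> v2, and r2 is repaired
```

Each read thus nudges the replicas toward agreement, which is what makes the consistency "eventual" rather than never.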
Cluster Membership and Failure Detection • Gossip-based cluster membership • Each membership entry holds: node address, generation (local), heartbeat version • Protocol: • Nodes periodically gossip their membership list • On receipt, the local membership list is updated • If any heartbeat is older than Tfail, that node is marked as failed • [Figure: four nodes (1–4) exchanging membership lists]
A Gossip Round in Cassandra • Node A generates a digest of its local state and sends it to node B. • Node B compares the digest with its local information, then replies with an ack carrying the full data for any entries where B's information is newer. • Node A merges the ack, then sends its own ack back to node B with the entries where A's information is newer; finally, node B updates its state from this message.
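The effect of the exchange above can be sketched with a simplified model. Two assumptions to note: full heartbeat maps are exchanged directly rather than digests followed by deltas, and a bare heartbeat counter stands in for Cassandra's (generation, version) pair.

```python
# Simplified gossip round: each node keeps a heartbeat counter per
# member; after a round, both sides hold the fresher entry for every
# known node.
def merge(local, remote):
    # Per node address, keep whichever heartbeat is higher.
    for addr, hb in remote.items():
        if hb > local.get(addr, -1):
            local[addr] = hb

def gossip_round(a, b):
    merge(b, a)   # A's state reaches B; B adopts anything newer
    merge(a, b)   # B's reply brings its newer entries back to A

a = {"A": 5, "B": 1}
b = {"B": 3, "C": 7}
gossip_round(a, b)
assert a == b == {"A": 5, "B": 3, "C": 7}
```

With each node gossiping to a few random peers per period, fresh heartbeats spread through the cluster in O(log N) rounds, which is why gossip scales to large memberships.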
Transactions in Cassandra • Atomicity • Row-level atomicity • Consistency • Tunable consistency • Isolation • Row-level isolation • Durability • Writes are durable through the commit log
Performance Evaluation • On > 50 GB of data • MySQL • Writes: 300 ms avg • Reads: 350 ms avg • Cassandra • Writes: 0.12 ms avg • Reads: 15 ms avg
When should we use NoSQL • Big enough data • Commodity (off-the-shelf) nodes rather than high-performance hardware • Can live without RDBMS features: • Secondary indexes • Transactions • Advanced query languages
Cassandra data modeling: don't think of a relational DBMS • The data model is a nested sorted map: SortedMap<RowKey, SortedMap<ColumnName, (ColumnValue, Timestamp)>> • Storing values in column names • The sorted map gives efficient key lookup and efficient range scans. • The number of column keys is almost unbounded. • Model column families around query patterns • Moderately de-normalize and duplicate for read performance • In a high-scale distributed store there are no joins, so the costs of normalization are magnified. • With a fully normalized schema, reads may perform much worse.
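The nested-sorted-map model and the "values in column names" pattern can be sketched together. Everything here is illustrative: a plain dict sorted on read stands in for the on-disk sorted map, and the `like:item:<id>` column-name convention is a hypothetical example of the pattern.

```python
# Sketch of SortedMap<RowKey, SortedMap<ColumnName, Value>> with
# values stored in the column *names*: each liked item becomes a
# column, so one sorted range scan over a row answers the query.
data = {}   # row key -> {column name -> value}

def insert(row_key, column_name, value=b""):
    data.setdefault(row_key, {})[column_name] = value

def slice_columns(row_key, start, end):
    # Cassandra serves this from already-sorted storage; sorting on
    # read here just illustrates the same range-scan semantics.
    row = data.get(row_key, {})
    return [c for c in sorted(row) if start <= c <= end]

insert("user:1", "like:item:42")
insert("user:1", "like:item:7")
print(slice_columns("user:1", "like:", "like:~"))
```

The column value is left empty: the column name alone carries the data, which is exactly what "storing values in column names" means.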
Example: ‘Like’ relationship between User & Item • Get user by user id • Get item by item id • Get all the items that a particular user likes • Get all the users who like a particular item
A replica of the relational model • There is no easy way to query all the items a particular user likes, or all the users who like a particular item, because there are no efficient secondary indexes.
Normalized entities with de-normalized custom indexes • Title and username are de-normalized into User_By_Item and Item_By_User respectively. It becomes efficient to query all the item titles liked by a given user, and all the user names who like a given item.
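The de-normalized index tables above can be sketched with plain dicts. The `User_By_Item` and `Item_By_User` names follow the slide; the `like` helper and the row layouts are assumptions for the sketch.

```python
# Sketch of the custom-index pattern: a "like" writes into both index
# column families, duplicating title and username so that each of the
# four queries is a single row lookup with no join.
users, items = {}, {}
item_by_user, user_by_item = {}, {}   # de-normalized custom indexes

def like(user_id, item_id):
    # Duplicate the title and username into the index rows at write
    # time; reads then never need to touch the entity tables.
    item_by_user.setdefault(user_id, {})[item_id] = items[item_id]["title"]
    user_by_item.setdefault(item_id, {})[user_id] = users[user_id]["name"]

users["u1"] = {"name": "wang"}
items["i1"] = {"title": "NoSQL intro"}
like("u1", "i1")
print(item_by_user["u1"])   # all item titles liked by user u1
print(user_by_item["i1"])   # all user names who like item i1
```

This is the trade the slide describes: extra writes and duplicated data in exchange for reads that match the query patterns exactly.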
Best Practices of Cassandra Data Modeling • Keep column names short, unless you use the column name to store actual data • Column names are stored repeatedly. • Design the data model so that operations are idempotent • Idempotent operations tolerate partial failures, since they can be retried safely. • If you need transactional behavior, try to model your data so that you only need to update a single row at a time • Cassandra offers row-level atomicity.
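The idempotency point can be made concrete with a two-line contrast. The `set_views`/`increment_views` names and the counter example are hypothetical illustrations.

```python
# Idempotent vs. non-idempotent updates: setting a column to an
# absolute value can be retried safely after a timeout; a relative
# increment cannot, because a retry double-counts.
state = {"views": 0}

def set_views(value):          # idempotent: applying twice == once
    state["views"] = value

def increment_views():         # not idempotent: each retry adds again
    state["views"] += 1

set_views(10); set_views(10)          # retried write: still 10
assert state["views"] == 10
increment_views(); increment_views()  # retried write: now 12, not 11
assert state["views"] == 12
```

Since a coordinator cannot always tell whether a timed-out write was applied, modeling updates as absolute writes lets clients retry blindly without corrupting the data.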