Introduction to NoSQL: Cassandra, HBase. Many thanks to D. Tsoumakos for the slides. Material adapted from slides by Perry Hoekstra and Gary Dusbabek (Rackspace). CS 525, Indranil Gupta.
History of the World • Relational Databases – mainstay of business • Web-based applications caused spikes in load • Especially true for public-facing e-Commerce sites • Developers began to front the RDBMS with memcache or integrate other caching mechanisms within the application (e.g., Ehcache)
SQL • Specialized data structures (think B-trees) • Shines with complicated queries • Focus on fast query & analysis • Not necessarily on large datasets
Scaling Up • Issues with scaling up when the dataset is just too big • RDBMS were not designed to be distributed • Began to look at multi-node database solutions • Known as ‘scaling out’ or ‘horizontal scaling’ • Different approaches include: • Master-slave • Sharding
Scaling RDBMS – Master/Slave • Master-Slave • All writes are written to the master. All reads performed against the replicated slave databases • Critical reads may be incorrect as writes may not have been propagated down • Large data sets can pose problems as master needs to duplicate data to slaves
Scaling RDBMS - Sharding • Partitioning or sharding • Scales well for both reads and writes • Not transparent: the application needs to be partition-aware, as sketched below • Can no longer have relationships/joins across partitions • Loss of referential integrity across shards
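A minimal sketch of application-level shard routing, assuming a hypothetical list of per-shard databases; it shows why the application must be partition-aware.

```python
# Hypothetical shard names; in practice these would be DB connections.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(user_id: int) -> str:
    """Pick a shard by hashing the partition key (here, a user id)."""
    return SHARDS[hash(user_id) % len(SHARDS)]

# Reads and writes for one user go to one shard...
print(shard_for(42))
# ...but a join across users on different shards must be done in
# application code, and referential integrity across shards is lost.
```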
What is NoSQL? • Stands for Not Only SQL • Class of non-relational data storage systems • Usually do not require a fixed table schema nor do they use the concept of joins • All NoSQL offerings relax one or more of the ACID properties (will talk about the CAP theorem)
How did we get here? • Explosion of social media sites (Facebook, Twitter) with large data needs • Rise of cloud-based storage such as Amazon S3 (Simple Storage Service) • Just as developers moved to dynamically-typed languages (Ruby/Groovy), a shift to dynamically-typed data with frequent schema changes • Open-source community
More Programming and Less Database Design Alternative to traditional relational DBMS • Flexible schema • Quicker/cheaper to set up • Massive scalability • Relaxed consistency → higher performance & availability • No declarative query language → more programming • Relaxed consistency → fewer guarantees
Challenge: Coordination • The solution to availability and scalability is to decentralize and replicate functions and data…but how do we coordinate the nodes? • data consistency • update propagation • mutual exclusion • consistent global states • group membership • group communication • event ordering • distributed consensus • quorum consensus
Dynamo and BigTable • Three major papers were the seeds of the NoSQL movement: • BigTable (Google) • Dynamo (Amazon) – gossip protocol (discovery and error detection), distributed key-value data store, eventual consistency • CAP Theorem (Brewer)
CAP Theorem • Proposed by Eric Brewer (Berkeley) • Subsequently proved by Gilbert and Lynch • In a distributed system you can satisfy at most 2 out of the 3 guarantees • Consistency: all nodes have same data at any time • Availability: the system allows operations all the time • Partition-tolerance: the system continues to work in spite of network partitions
CAP Theorem • Three properties of a system: consistency, availability, and partition-tolerance • “You can have at most two of these three properties for any shared-data system” • To scale out, you have to partition; that leaves either consistency or availability to choose from • In almost all cases, you would choose availability over consistency
Fox & Brewer “CAP Theorem”: C-A-P, choose two. Claim: every distributed system is on one side of the triangle whose vertices are Consistency, Availability, and Partition-resilience. • CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others (e.g., quorum) • CA: available and consistent, unless there is a partition • AP: a reachable replica provides service even in a partition, but may be inconsistent if there is a failure
Availability • Traditionally thought of as the server/process being available five 9’s of the time (99.999%) • However, for a large multi-node system, at almost any point in time there’s a good chance that some node is down or that there is a network disruption among the nodes • We want a system that is resilient in the face of network disruption
Consistency Model • A consistency model determines rules for visibility and apparent order of updates • For example: • Row X is replicated on nodes M and N • Client A writes row X to node N • Some period of time t elapses • Client B reads row X from node M • Does client B see the write from client A? For NoSQL, the answer would be: maybe • Consistency is a continuum with tradeoffs • CAP Theorem states: strict consistency can’t be achieved at the same time as availability and partition-tolerance (see the sketch below)
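A minimal sketch of the row-X scenario above, assuming replicas are plain dicts and replication is asynchronous; all names are illustrative.

```python
node_M, node_N = {}, {}          # two replicas of the same table
pending = []                     # updates not yet propagated

def write(replica, key, value):
    replica[key] = value
    pending.append((key, value)) # will reach the other replica eventually

write(node_N, "row_X", "v2")     # client A writes to node N
print(node_M.get("row_X"))       # client B reads node M -> None (stale!)

for key, value in pending:       # eventual propagation
    node_M[key] = value
print(node_M.get("row_X"))       # now "v2" -- replicas converged
```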
Eventual Consistency • When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent • For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service • Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID • Basically Available – faults are possible, but not a fault of the whole system • Soft state – copies of a data item may be inconsistent • Eventually Consistent – copies become consistent at some later time if there are no more updates to that data item
What kinds of NoSQL? • NoSQL solutions fall into two major areas: • Key/Value, or ‘the big hash table’ • Amazon S3 (Dynamo) • Voldemort • Scalaris • Schema-less, which comes in multiple flavors: column-based, document-based, or graph-based • Cassandra (column-based) • CouchDB (document-based) • Neo4J (graph-based) • HBase (column-based)
Categories of NoSQL databases • Key-value stores • Column NoSQL databases • Document-based • Graph database (neo4j, InfoGrid) • XML databases (myXMLDB, Tamino, Sedna)
Key/Value • Pros: • very fast • very scalable • simple model • able to distribute horizontally • Cons: • many data structures (objects) can’t be easily modeled as key/value pairs
Schema-Less • Pros: • schema-less data model is richer than key/value pairs • eventual consistency • many are distributed • still provide excellent performance and scalability • Cons: • typically no ACID transactions or joins
Common Advantages • Cheap, easy to implement (open source) • Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned • Down nodes easily replaced • No single point of failure • Easy to distribute • Don't require a schema • Can scale up and down • Relax the data consistency requirement (CAP)
Typical NoSQL API • Basic API access: • get(key) -- Extract the value given a key • put(key, value) -- Create or update the value given its key • delete(key) -- Remove the key and its associated value • execute(key, operation, parameters) -- Invoke an operation on the value (given its key), which is a special data structure (e.g., a List, Set, or Map); a minimal sketch follows
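A minimal in-memory sketch of the API above; real stores distribute and persist the data, but the client-facing surface is this small.

```python
class KVStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke an operation on a structured value (e.g., a list or set).
        return getattr(self._data[key], operation)(*parameters)

store = KVStore()
store.put("cart:42", [])
store.execute("cart:42", "append", "rocket-skates")
print(store.get("cart:42"))      # ['rocket-skates']
```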
What am I giving up? • joins • group by • order by • ACID transactions • SQL as a sometimes frustrating but still powerful query language • easy integration with other applications that support SQL
Cassandra • Amazon Dynamo • Consistent hashing • Partitioning • Replication • One-hop routing • Google BigTable • Column Families • Memtables • SSTables
Origins • Pre-2008: developed at Facebook • 2008: open-sourced • 2009: Apache Incubator project
Distributed and Scalable • Horizontal! • All nodes are identical • No master, no single point of failure • Adding nodes is simple • Automatic cluster maintenance
Data Model • Within Cassandra, you will refer to data this way: • Column: smallest data element, a tuple with a name and a value
Data Model (figure): a single column – a name/value pair
Data Model (figure): a single row – a key with one or more columns
Data Model – ColumnFamily: Rockets (each row is a key plus a set of name:value columns)
Key 1 → name: Acme Jet Propelled Unicycle, toon: Beep Prepared, inventoryQty: 4, wheels: 1
Key 2 → name: Little Giant Do-It-Yourself Rocket-Sled Kit, toon: Ready, Set, Zoom, inventoryQty: 5, brakes: false
Key 3 → name: Rocket-Powered Roller Skates, toon: Hot Rod and Reel, inventoryQty: 1, brakes: false
Data: Schema-free Sparse Table • Flexible column naming • You define the sort order • A row is not required to have a specific column just because another row does (see the sketch below)
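A minimal sketch of the sparse model above: a ColumnFamily as a dict of rows, each row a dict of column name → value; the Rockets data is illustrative.

```python
rockets = {
    "1": {"name": "Acme Jet Propelled Unicycle",
          "toon": "Beep Prepared", "inventoryQty": 4, "wheels": 1},
    "2": {"name": "Little Giant Do-It-Yourself Rocket-Sled Kit",
          "toon": "Ready, Set, Zoom", "inventoryQty": 5, "brakes": False},
    "3": {"name": "Rocket-Powered Roller Skates",
          "toon": "Hot Rod and Reel", "inventoryQty": 1, "brakes": False},
}

# Row 1 has a 'wheels' column that rows 2 and 3 lack -- no schema forces it.
print(rockets["1"].get("wheels"))   # 1
print(rockets["2"].get("wheels"))   # None
```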
Consistent Hashing • Partition using consistent hashing (a minimal sketch follows) • Keys hash to a point on a fixed circular space • The ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots • Nodes take positions on the circle • A, B, and D exist • B is responsible for the AB range • D is responsible for the BD range • A is responsible for the DA range • C joins • B, D split ranges • C gets BC from D
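A minimal consistent-hashing sketch, assuming MD5 as the hash and a sorted list of node positions on the ring; adding node C only moves the keys in the arc it takes over.

```python
import bisect, hashlib

def ring_pos(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_pos(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # Walk clockwise: the first node at or after the key's position.
        i = bisect.bisect(self.points, (ring_pos(key),))
        return self.points[i % len(self.points)][1]

    def add(self, node: str):
        bisect.insort(self.points, (ring_pos(node), node))

ring = Ring(["A", "B", "D"])
before = {k: ring.node_for(k) for k in ("key%d" % i for i in range(8))}
ring.add("C")
after = {k: ring.node_for(k) for k in before}
print([k for k in before if before[k] != after[k]])  # only keys in C's new arc move
```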
Inserting: Overview • Simple: put(key, col, value) • Complex: put(key, [col:value, …, col:value]) • Batch: multiple keys in one operation
Inserting: Writes • Commit log for durability • Configurable fsync • Sequential writes only • Memtable – no disk access (no reads or seeks) • SSTables are final (become read-only) • Indexes • Bloom filter • Raw data • Bottom line: FAST!!!
Writes • Need to be lock-free and fast (no reads or disk seeks) • Client sends write to one front-end node in the Cassandra cluster • Front-end = Coordinator, assigned per key • Which (via the partitioning function) sends it to all replica nodes responsible for the key • Always writable: Hinted Handoff mechanism • If any replica is down, the coordinator writes to all other replicas, and keeps the write locally until the down replica comes back up • When all replicas are down, the Coordinator (front end) buffers writes (for up to a few hours) • Provides atomicity for a given key (i.e., within a ColumnFamily) • One ring per datacenter • Per-DC coordinator elected to coordinate with other DCs • Election done via Zookeeper, which runs a Paxos variant
Writes at a replica node • On receiving a write: • 1. Log it in the disk commit log • 2. Make changes to the appropriate memtables • In-memory representation of multiple key-value pairs • Later, when a memtable is full or old, flush it to disk • Data file: an SSTable (Sorted String Table) – a list of key-value pairs, sorted by key • Index file: an SSTable of (key, position in data SSTable) pairs • And a Bloom filter (for efficient search) – see the Bloom Filter slide • Compaction: data updates accumulate over time, so SSTables and logs need to be compacted • Merge SSTables, e.g., by merging updates for a key • Run periodically and locally at each server • A row may be split over multiple SSTables, so reads may need to touch several of them • Reads may be slower than writes • A minimal sketch of this write path follows
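A minimal sketch of the per-replica write path above: append to a commit log, update an in-memory memtable, and flush it to a sorted, immutable "SSTable" when it grows too large; the in-memory lists stand in for on-disk files, and the size limit is made up.

```python
memtable = {}
commit_log = []          # stands in for the on-disk commit log
sstables = []            # each flush produces one immutable sorted table
MEMTABLE_LIMIT = 3       # illustrative flush threshold

def write(key, columns):
    commit_log.append((key, columns))             # 1. durability first
    memtable.setdefault(key, {}).update(columns)  # 2. in-memory update
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    # SSTable = key-sorted, read-only list of (key, columns) pairs.
    sstables.append(sorted(memtable.items()))
    memtable.clear()

for i in range(4):
    write("row%d" % i, {"col": i})
print(len(sstables), memtable)  # one flushed SSTable, one row still in memory
```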
Querying: Overview • You need a key or keys: • Single: key=‘a’ • Range: key=‘a’ through ‘f’ • And columns to retrieve: • Slice: cols={bar through kite} • By name: key=‘b’ cols={bar, cat, llama} • Nothing like SQL “WHERE col=‘faz’” • But secondary indices are being worked on (see CASSANDRA-749) • A minimal sketch of these query shapes follows
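A minimal sketch of key/slice queries against the dict model used earlier; there is no "WHERE col = value", so you address rows by key and then slice or name the columns you want. All data is illustrative.

```python
rows = {
    "a": {"bar": 1, "cat": 2, "kite": 3},
    "b": {"bar": 4, "cat": 5, "llama": 6},
}

def get_slice(key, start, end):
    # Columns are kept sorted by name; return those in [start, end].
    return {c: v for c, v in sorted(rows[key].items()) if start <= c <= end}

def get_columns(key, names):
    return {c: rows[key][c] for c in names if c in rows[key]}

print(get_slice("a", "bar", "kite"))              # slice by column range
print(get_columns("b", ["bar", "cat", "llama"]))  # by name
```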
Querying: Reads • Practically lock-free • SSTable proliferation • New in 0.6: • Row cache (avoids SSTable lookup; not write-through) • Key cache (avoids index scan)
Deletes and Reads • Delete: don’t delete the item right away • Add a tombstone to the log • Compaction will eventually remove the tombstone and delete the item • Read: similar to writes, except: • Coordinator contacts a number of replicas (e.g., in the same rack) specified by the consistency level • Forwards the read to replicas that have responded quickest in the past • Returns the value with the latest timestamp • Coordinator also fetches the value from multiple replicas • Checks consistency in the background, initiating a read repair if any two values differ • This brings all replicas up to date • A row may be split across multiple SSTables => reads need to touch multiple SSTables => reads slower than writes (but still fast) • A sketch of timestamped reads with read repair follows
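A minimal sketch of timestamped reads with read repair, assuming each replica stores (value, timestamp) pairs and the coordinator returns the value with the latest timestamp, then repairs stale replicas.

```python
replicas = [
    {"row_X": ("v1", 10)},
    {"row_X": ("v2", 20)},   # this replica saw a newer write
    {"row_X": ("v1", 10)},
]

def read(key):
    versions = [r[key] for r in replicas if key in r]
    value, ts = max(versions, key=lambda vt: vt[1])  # latest timestamp wins
    for r in replicas:                               # read repair
        if r.get(key, (None, -1))[1] < ts:
            r[key] = (value, ts)
    return value

print(read("row_X"))   # 'v2'; all replicas now hold ('v2', 20)
```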
Bloom Filter • Compact way of representing a set of items • Checking for existence in the set is cheap • Some probability of false positives: an item not in the set may check true as being in the set • Never false negatives • (Figure: a large bit map with indices 0..127; a key K is run through hash functions Hash1..Hashk, each selecting a bit) • On insert, set all hashed bits • On check-if-present, return true if all hashed bits are set • False positive rate is low: with k = 4 hash functions, 100 items, and 3200 bits, the FP rate ≈ 0.02%
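A minimal Bloom filter sketch matching the numbers above: k = 4 hash positions over m = 3200 bits. Double hashing (two base hashes combined) stands in for k independent hash functions.

```python
import hashlib

M, K = 3200, 4

class BloomFilter:
    def __init__(self):
        self.bits = [0] * M

    def _positions(self, item: str):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % M for i in range(K)]

    def insert(self, item: str):
        for p in self._positions(item):
            self.bits[p] = 1          # set all k hashed bits

    def maybe_contains(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for i in range(100):
    bf.insert("key%d" % i)
print(bf.maybe_contains("key7"))      # True (really present)
print(bf.maybe_contains("absent"))    # almost surely False (~0.02% FP rate)
```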
Cassandra uses Quorums • Quorum = a way of selecting sets so that any pair of sets intersects • E.g., any arbitrary set with at least Q = N/2 + 1 nodes • Where N = total number of replicas for this key • Reads • Wait for R replicas (R specified by clients) • In the background, check for consistency of the remaining N-R replicas, and initiate read repair if needed • Writes come in two default flavors • Block until quorum is reached • Async: write to any node • R = read replica count, W = write replica count • If W + R > N and W > N/2, you have consistency, i.e., each read returns the latest written value (see the sketch below) • Reasonable choices: (W=1, R=N), (W=N, R=1), or (W=Q, R=Q)
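A minimal sketch of the W + R > N overlap argument: if a write reaches W replicas and a read consults R replicas with W + R > N, at least one replica in the read set holds the latest write. The brute-force check over a small N is for illustration only.

```python
import itertools

N = 5
replicas = set(range(N))

def overlap_guaranteed(W, R):
    # Check every possible write-set / read-set pair of the given sizes.
    return all(set(w) & set(r)
               for w in itertools.combinations(replicas, W)
               for r in itertools.combinations(replicas, R))

print(overlap_guaranteed(3, 3))  # True:  3 + 3 > 5 -> quorum overlap
print(overlap_guaranteed(2, 3))  # False: 2 + 3 = 5 -> a read can miss the write
```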
Wrapping Up • Use Cassandra if you want/need • High write throughput • Near-linear scalability • Automated replication/fault tolerance • Can tolerate missing RDBMS features