.NET and NoSQL: Introducing Cassandra. John Zablocki, Development Manager, HealthcareSource; Organizer, Beantown ALT.NET. Beantown ALT.NET, 2011-10-26. New England Code Camp – 10/29/2011. WP7 Location @ Dev Boston Meetup – 11/3/2011. DDD w/ Steve Bohlen @ Beantown ALT.NET – 11/28/2011.
.NET and NoSQL: Introducing Cassandra • John Zablocki • Development Manager, HealthcareSource • Organizer, Beantown ALT.NET • Beantown ALT.NET, 2011-10-26
New England Code Camp – 10/29/2011 • WP7 Location @ Dev Boston Meetup – 11/3/2011 • DDD w/ Steve Bohlen @ Beantown ALT.NET – 11/28/2011 Shameless Plugs
NoSQL Overview • Cassandra Basic Concepts • Cassandra Data Model • Client API • Cassandra and .NET • Questions? Agenda
NoSQL Not Only SQL
Coined in 1998 by Carlo Strozzi to describe a database that did not expose a SQL interface • In 2009, Eric Evans reintroduced the term to describe the growing non-RDBMS movement • Broadly refers to a set of data stores that do not use SQL or a relational data model • Popularized by large web presences such as Google, Facebook and Amazon What is NoSQL?
NoSQL databases come in a variety of flavors * • XML (myXMLDB, Tamino, Sedna) • Tabular (HBase, Bigtable) • Key/Value (Redis, Memcached with Berkeley DB) • Object (db4o, JADE) • Graph (Trinity, Neo4j, InfoGrid) • Document store (CouchDB, MongoDB) • Eventually Consistent Key/Value Store (Cassandra, Dynamo) • * loose taxonomies NoSQL Databases
RDBMS Administrators are highly paid • Highly paid individuals often buy larger than average homes or cars • Larger than average homes and cars require more energy than smaller homes and cars • Therefore RDBMSs contribute to global warming more than NoSQL databases, which typically do not require the addition of a DBA RDBMSs and the Environment
RDBMSs often require high-end servers and are taxing on disks • High-end servers consume more electricity than mid-range servers • Taxed disks fail more often than untaxed disks • Therefore RDBMSs require more energy and produce more waste (lots of hard drives in landfills) than NoSQL DBs, which run on mid-range servers.
The current healthcare crisis requires talented software engineers to fix the outdated or non-existent IT systems of the hospital system • Talented software engineers spend a great deal of time mapping objects to tables in RDBMSs • Talented software engineers are unable to fix healthcare because they are too busy mapping objects to tables • Therefore RDBMSs are causing illnesses NoSQL and Healthcare
Three disruptive technologies you should be paying attention to today… • NoSQL databases and big data technologies – especially MongoDB, CouchDB, Cassandra, HBase, MapReduce and Hadoop • Evented I/O Web Servers – especially Node.js and to a lesser extent Tornado • Functional programming languages – especially Scala, F# and Erlang Please Pardon the Interruption…
Open source, Apache supported project • Originally written by Facebook for its Inbox search feature • FB now uses a proprietary fork • Written in Java. Yes, Java. • Column-oriented with row-oriented properties • Schemaless • Data stored in sparse, multidimensional hashtables • Sparse meaning that rows may have one or more columns and need not share the same set of columns • Distributed and Decentralized • Highly Available and Fault Tolerant • Elastic Scalability • Tunable Consistency • MapReduce via Hadoop About Cassandra
Column-Oriented • Content stored by column, rather than by row: • 1,2,3; Smith,Jones,Johnson; Joe,Mary,Cathy; 40000,50000,44000; • More efficient when an aggregate needs to be computed over many rows • More efficient when writing new values for a column to all rows at once • Better compression is possible, due to the fact that modern compression schemes make use of the similarity of adjacent data (column data is uniform) • Less efficient for multi-column reads • Less efficient for multi-column writes Column-Oriented
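The trade-off on this slide can be sketched with plain Python lists. This is only an illustration of the two storage layouts, using the three records from the slide; it is not Cassandra's on-disk format, and the variable names are made up for the example.

```python
# Three employee records, as on the slide: id, last name, first name, salary.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

# Row-oriented layout: the fields of each record are stored next to each other.
row_store = [field for row in rows for field in row]

# Column-oriented layout: all values of one column are stored next to each other.
ids, last_names, first_names, salaries = (list(col) for col in zip(*rows))
column_store = ids + last_names + first_names + salaries

# An aggregate over one column reads one contiguous list in the column layout...
total_salary = sum(salaries)

# ...but has to skip over unrelated fields in the row layout.
total_salary_from_rows = sum(row_store[3::4])

# Reading one whole record is the opposite case: a single slice in the row
# layout, one lookup per column in the column layout.
record_from_rows = tuple(row_store[4:8])
record_from_columns = (ids[1], last_names[1], first_names[1], salaries[1])
```

Both layouts hold the same data; what differs is which access pattern touches contiguous values.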
Cassandra is meant to run on multiple nodes • Single node is possible, but Cassandra’s benefits will not be realized • Every node is identical • No Master/Slave • Peer-to-peer protocol keeps data in sync (gossip) Distributed and Decentralized
Periodic, pairwise interactions • Bounded size information exchange • One agent changes the state of another • Reliable communication is not assumed • Low frequency of interactions to minimize protocol costs • Some form of randomness in peer selection Gossip Protocol
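A toy simulation may help make these properties concrete. It assumes a simple push-pull exchange in which each pair keeps the higher-versioned state; the function names and the (version, value) state format are invented for the sketch and are not Cassandra's gossiper.

```python
import random

def gossip_round(states, rng):
    """One round: every node contacts one randomly chosen peer, and the
    pair keeps whichever (version, value) state is newer."""
    nodes = list(states)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        newest = max(states[node], states[peer])  # tuples compare by version first
        states[node] = states[peer] = newest

def rounds_until_converged(states, rng, max_rounds=100):
    """Run gossip rounds until all nodes agree; None if max_rounds is exceeded."""
    for round_number in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if len(set(states.values())) == 1:
            return round_number
    return None

rng = random.Random(42)  # seeded so the run is repeatable
states = {f"node{i}": (0, "old") for i in range(8)}
states["node0"] = (1, "new")  # a single node learns something new
rounds = rounds_until_converged(states, rng)
```

Each exchange is pairwise and carries a bounded amount of information, yet the update spreads to every node after a small number of rounds.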
Vertical Scaling • Throw hardware at the problem • More memory, faster CPU, etc. • Horizontal Scaling (Clustering) • Add more machines • Possibly partition the data across machines • Elastic Scaling • Horizontal cluster that can scale up and scale down seamlessly • New nodes can be brought online and begin serving requests with partial data • New nodes come online without service disruption Elastic Scalability
Consistency - ensures transactions move a database from one consistent state to another • Cassandra supports tunable consistency • Strict (sequential) consistency – all nodes see all writes in the same order • A read always returns the most recent write • Causal consistency – potentially causally related operations seen by all nodes in the same order • Concurrent writes are not causally related • Timestamps used to determine the cause of events • Weak (eventual) consistency – all updates will propagate to all nodes, but not immediately • See Eric Brewer’s CAP Theorem Tunable Consistency
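In practice, "tunable" means that each read and write names how many replicas must respond. The sketch below assumes the consistency level names ONE, QUORUM and ALL and the common rule of thumb that a read sees the latest write when R + W > N (N being the replication factor). Neither the rule nor the helper names appear on the slide.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must acknowledge an operation at a given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")

def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when every set of read replicas must overlap every set of
    write replicas, i.e. R + W > N."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor

# With three replicas: QUORUM writes and QUORUM reads overlap (2 + 2 > 3),
# while ONE/ONE trades that guarantee for lower latency (1 + 1 <= 3).
strong = read_sees_latest_write("QUORUM", "QUORUM", 3)
eventual = read_sees_latest_write("ONE", "ONE", 3)
```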
Large-scale distributed systems have three competing requirements • Consistency – all nodes see the same data at the same time • Availability – All clients will always be able to read and write data and all requests will receive a response of success or failure • Partition Tolerance – The system will continue to function, even in the face of network segmentation failures • Theorem states that a distributed system can satisfy only 2 of these 3 properties at the same time Brewer’s CAP Theorem
Consistency and Availability • Two-phase commit for distributed transactions • System blocks on a network partition • Consistency and Partition Tolerance • Pessimistic locking • Node failure hinders availability • Availability and Partition Tolerance • System always returns data, even if inaccurate • Optimistic locking • DNS, web caching Brewer’s CAP Theorem
Clusters (rings) • Set of nodes that appear as a single server • Single node is still a cluster • Container for keyspaces • Keyspaces • Analogous to a relational database • Has name and set of attributes to define keyspace-wide behavior • Replication factor (# of nodes that will hold a copy of each row) • Replica placement strategy (how rows are copied) • Column Families Cassandra Data Model
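One way to picture the ring, the replication factor and a placement strategy together is the sketch below. It assumes a simple strategy that hashes the row key to a token and places copies on the next nodes clockwise around the ring; the token size, node names and function names are illustrative only.

```python
import bisect
import hashlib

def token(key, ring_size=2**32):
    """Hash a row key to a position on the ring (MD5 here, as an assumption)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % ring_size

def replicas_for(row_key, node_tokens, replication_factor):
    """Return the nodes that hold a row. node_tokens maps node name -> token."""
    ring = sorted((t, name) for name, t in node_tokens.items())
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token(row_key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

# Four hypothetical nodes spaced evenly around the ring.
node_tokens = {"a": 0, "b": 2**30, "c": 2**31, "d": 3 * 2**30}
replicas = replicas_for("Goodfellas", node_tokens, replication_factor=3)
```

With a replication factor of 3, every row lands on three consecutive nodes, so any single node can fail without losing data.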
Column Families • Analogous to a relational table • Container for an ordered collection of rows • Columns • Basic data structure in Cassandra • Consists of a name, value and clock (timestamp) • Defined with a key name sorting rule (ascii, integer, etc.) • Value sorting is not possible • Names and values stored as Java byte arrays • May be indexed for queries • Super Columns • A special column with values that are maps of subcolumns (standard columns) • Single level of nesting only • Subcolumns are not indexed – read a supercolumn and all of its columns are read as well Cassandra Data Model
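The nesting described on these two slides can be modeled with ordinary dictionaries. This is a rough mental model only, assuming a keyspace and column family named as in the CLI examples later in the deck; the helper functions are hypothetical.

```python
import time

def make_column(name, value):
    # A column is a name, a value and a timestamp (the "clock" on the slide).
    return {"name": name, "value": value, "timestamp": time.time()}

cluster = {
    "BeantownAltNet": {        # keyspace
        "movies": {},          # column family: row key -> {column name -> column}
    }
}

def set_column(keyspace, column_family, row_key, name, value):
    row = cluster[keyspace][column_family].setdefault(row_key, {})
    row[name] = make_column(name, value)

def get_row(keyspace, column_family, row_key):
    row = cluster[keyspace][column_family][row_key]
    # Columns are returned sorted by name, as a comparator would order them.
    return [(name, row[name]["value"]) for name in sorted(row)]

set_column("BeantownAltNet", "movies", "Goodfellas", "Year", 1990)
set_column("BeantownAltNet", "movies", "Goodfellas", "Genre", "Drama")
set_column("BeantownAltNet", "movies", "Casino", "Genre", "Drama")  # sparse row

goodfellas = get_row("BeantownAltNet", "movies", "Goodfellas")
```

Rows in the same column family do not need the same columns: "Casino" has only a Genre.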
System keyspace stores metadata about the cluster, similar to the master db in SQL Server • Peer-to-peer distribution model where behavior of each node is identical (no Master/Slave) • New node added to cluster without disruption • Accepts requests only after learning topology • Gossip protocol where gossiper runs every second on a timer • Each node has information about the others • Anti-entropy is the replica synchronization mechanism in Cassandra • Nodes exchange hashes of column family data in order to determine whether read-repair is needed Cassandra Architecture
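The idea behind exchanging hashes is that two replicas can detect a difference without shipping the data itself. The sketch below is a deliberately flat version of that idea, with one digest per column family; Cassandra builds Merkle trees so that it can narrow down which ranges differ. The function names are invented for the example.

```python
import hashlib

def column_family_digest(rows):
    """Hash the contents of a column family (row key -> {column name: value})
    in a stable order, so equal data always gives an equal digest."""
    digest = hashlib.sha256()
    for key in sorted(rows):
        digest.update(repr((key, sorted(rows[key].items()))).encode())
    return digest.hexdigest()

def needs_repair(replica_a, replica_b):
    """Replicas only need to exchange data when their digests differ."""
    return column_family_digest(replica_a) != column_family_digest(replica_b)

replica_1 = {"Goodfellas": {"Genre": "Drama", "Year": "1990"}}
replica_2 = {"Goodfellas": {"Year": "1990", "Genre": "Drama"}}   # same data
replica_3 = {"Goodfellas": {"Genre": "Crime", "Year": "1990"}}   # stale value

in_sync = not needs_repair(replica_1, replica_2)
out_of_sync = needs_repair(replica_1, replica_3)
```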
Writes are immediately written to a commit log and subsequently written to an in-memory store called the memtable • At a specified threshold objects in the memtable are flushed to disk to an immutable structure called a sorted string table (SSTable) • Hinted handoffs allow nodes to receive a write intended for another node if that other node goes offline. The hint tells the receiving node to update the offline node when back online Cassandra Architecture
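The write path on this slide can be sketched as a small class. Everything here is a simplification under stated assumptions: the commit log is a list rather than a file, the flush threshold counts keys rather than bytes, and the class and method names are made up.

```python
class Node:
    def __init__(self, memtable_threshold=3):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory store, key -> value
        self.sstables = []        # flushed, immutable, sorted tables
        self.memtable_threshold = memtable_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log the write first
        self.memtable[key] = value              # 2. then update the memtable
        if len(self.memtable) >= self.memtable_threshold:
            self.flush()

    def flush(self):
        # An SSTable is sorted by key and never modified once written,
        # which is modeled here with a tuple.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # newest SSTable first
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None

node = Node(memtable_threshold=2)
node.write("a", 1)
node.write("b", 2)   # reaches the threshold, so the memtable is flushed
node.write("a", 3)   # newer value lives in the memtable
```

Because SSTables are never updated in place, older values pile up on disk until compaction merges them.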
Compaction is the operation of merging SSTables • Keys are merged • Columns are combined • Tombstones are discarded • New index created • Merged data are sorted • Bloom filters are used to reduce disk access • Fast, probabilistic data structures for testing whether an element is a member of a set (false positives are possible, false negatives are not) • Tombstones are deletion markers on records • All delete commands in Cassandra are soft deletes Cassandra Architecture
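Two ideas from this slide lend themselves to short sketches: merging SSTables so that the newest entry per key wins and tombstoned keys disappear, and a Bloom filter that can rule out a lookup before any disk read. Both are simplified under assumptions: entries are (key, timestamp, value) tuples, tombstones are dropped immediately (Cassandra keeps them for a grace period first), and the filter sizes are arbitrary.

```python
import hashlib

TOMBSTONE = object()  # marker written instead of a value by a soft delete

def compact(sstables):
    """Merge SSTables into one sorted table. The newest timestamp wins per
    key, and keys whose newest entry is a tombstone are left out."""
    newest = {}
    for table in sstables:
        for key, timestamp, value in table:
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    return sorted(
        (key, timestamp, value)
        for key, (timestamp, value) in newest.items()
        if value is not TOMBSTONE
    )

class BloomFilter:
    def __init__(self, size=64, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[position] for position in self._positions(key))

older = [("Casino", 1, "Drama"), ("Goodfellas", 1, "Crime")]
newer = [("Casino", 2, TOMBSTONE), ("Goodfellas", 2, "Drama")]
merged = compact([older, newer])

bloom = BloomFilter()
for key, _, _ in merged:
    bloom.add(key)
```

A lookup for a key the filter rejects can skip the SSTable entirely, which is where the saving in disk access comes from.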
Using Cassandra The Windows Experience
Install the Java 1.6 (or later) SDK • Set environment variable JAVA_HOME to the install path of the JDK • Download the binaries from http://cassandra.apache.org/download/ • Unzip to Program Files (x86) or some other directory, optionally set PATH • Set environment variable CASSANDRA_HOME to directory above • In command line, navigate to bin under CASSANDRA_HOME and run cassandra Installing Cassandra on Windows
Command line interface • Navigate to bin, under CASSANDRA_HOME and run cassandra-cli • Generally useful for development, but not meant to be a full-blown client • Allows for basic administration (creating keyspaces, column management, etc.) • Commands must be terminated with a ; Cassandra-CLI
Connect to a server • connect localhost/9160; • Connect to a server at CLI start • cassandra-cli localhost/9160 • System information commands • show cluster name; • show keyspaces; • show api version; Cassandra-CLI
Create a keyspace • create keyspace BeantownAltNet; • Switch to keyspace • use BeantownAltNet; • Create a column family • create column family movies with comparator=UTF8Type and key_validation_class=UTF8Type; • View information about column family • describe keyspace BeantownAltNet; Cassandra-CLI
See this JIRA issue and then run (v..8): • assume movies keys as ascii; • Add a row of data • set movies['Goodfellas']['Genre'] = 'Drama'; • set movies['Goodfellas']['Year'] = 1990; • Count the columns • count movies['Goodfellas']; • Get the row and column • get movies['Goodfellas']; • get movies['Goodfellas']['Genre'];
Create an index on Genre • update column family movies with column_metadata=[{column_name:Genre, index_type:0, index_name:IdxGenre, validation_type:UTF8Type}]; • Query by genre • get movies where Genre = 'Drama'; • Remove a column • del movies['Goodfellas']['Year']; • Remove a row • del movies['Goodfellas']; Cassandra-CLI
Used for Cassandra’s client API • Effectively an RPC serialization mechanism • Software framework for scalable, cross-language services development • Combines software stack with code generation to build services • Support for C++, Java, Python, PHP, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk and OCaml • struct UserProfile { 1: i32 uid, 2: string name, 3: string blurb } service UserStorage { void store(1: UserProfile user), UserProfile retrieve(1: i32 uid) } Thrift
CQL is a DSL similar to SQL, meant to better abstract the details of the server operations from the clients (still requires Thrift) • Currently, CQL drivers exist only for Java and Python • CREATE KEYSPACE BeantownAltNet with replication_factor=1; • CREATE COLUMNFAMILY movies ( key VARCHAR PRIMARY KEY, genre VARCHAR, year INT); • INSERT INTO movies (key, genre, year) VALUES ('Zoolander', 'Comedy', 2001); • SELECT key, genre, year FROM movies; • SELECT key, genre, year FROM movies WHERE genre='Comedy'; Cassandra Query Language (CQL)
Command line CQL tool that ships with the Python CQL driver • Windows installation • Grab the precompiled Windows Thrift binaries for Python and copy to site-packages http://www.dreamcubes.com/b2/software-development/20/thrift-with-python-on-win32/ • Download cassandra-dbapi2 from http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2/source/checkout and run - setup.py install • easy_install pyreadline • Run - python cqlsh localhost 9160 CQLSH
CREATE KEYSPACE foo WITH strategy_class='SimpleStrategy' AND strategy_options:replication_factor=1; • CREATE COLUMNFAMILY users (key VARCHAR PRIMARY KEY, nickname VARCHAR); • INSERT INTO users (key, nickname) VALUES ('jzablocki', 'zblock'); • SELECT * FROM users; CQLSH
.NET and Cassandra The Client Libraries
Currently, there are three well-maintained, community-sponsored client libraries • Cassandra-Sharp - http://code.google.com/p/cassandra-sharp/ • Aquiles - http://aquiles.codeplex.com/ • FluentCassandra - https://github.com/managedfusion/fluentcassandra • No official Apache client .NET Client Libraries
Configured in App/Web.config • Simple API over most common Thrift calls • Additional support for Cassandra commands via Execute method and Client class • Support for executing CQL Cassandra-Sharp
https://bitbucket.org/johnzablocki/codevoyeur-samples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStart-CassandraSharp Cassandra-Sharp Demo
Configured in App/Web.config • Simple wrapper over most common Thrift calls • No direct support for executing CQL (though an internal class does have CQL execution) Aquiles
https://bitbucket.org/johnzablocki/codevoyeur-samples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStart-Aquiles Aquiles Demo
Intended to be an idiomatic .NET Cassandra framework (i.e., more like .NET than Java) • Makes use of the .NET 4.0 dynamic feature • Raw Thrift commands are abstracted • No current support for CQL • Developed by Nick Berardi FluentCassandra
https://bitbucket.org/johnzablocki/codevoyeur-samples/src/1e2aeb969518/src/PresentationSamples/NoSQLAndDotNet/CassandraQuickStart FluentCassandra Demo
Non-Relational Design Codd is Dead
Materialized View • Store redundant data for more efficient queries • MovieGenres['Drama']['Goodfellas'] = null; • MovieGenres['Drama']['Casino'] = null; • Valueless Column • All data necessary to satisfy a query is in the column name. No value needed (see above) • Aggregate Key • Combine values with a delimiter to create a composite key • ZipCodes['Wethersfield:CT'] = '06109'; • ZipCodes['Cambridge:MA'] = '02140'; Design Patterns
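The three patterns can be tried out with plain dictionaries standing in for column families. The movie rows and the aggregate_key helper are assumptions added for the example; only the MovieGenres and ZipCodes entries come from the slide.

```python
# Source column family: one row per movie (illustrative data).
movies = {
    "Goodfellas": {"Genre": "Drama", "Year": 1990},
    "Casino": {"Genre": "Drama", "Year": 1995},
    "Zoolander": {"Genre": "Comedy", "Year": 2001},
}

# Materialized view with valueless columns: one row per genre, one column
# per movie title. The column name answers the query, so the value is None.
movie_genres = {}
for title, columns in movies.items():
    movie_genres.setdefault(columns["Genre"], {})[title] = None

# "Which movies are dramas?" becomes a single row read.
dramas = sorted(movie_genres["Drama"])

def aggregate_key(*parts, delimiter=":"):
    """Aggregate key: join several values into one composite row key."""
    return delimiter.join(parts)

zip_codes = {
    aggregate_key("Wethersfield", "CT"): "06109",
    aggregate_key("Cambridge", "MA"): "02140",
}
cambridge_zip = zip_codes[aggregate_key("Cambridge", "MA")]
```

The cost of the materialized view is that every write to movies must also update movie_genres, since nothing keeps the two in sync automatically.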
http://dllhell.net – my blog • http://codevoyeur.com – code projects • http://linkedin.com/in/johnzablocki • http://twitter.com/codevoyeur • http://cassandra.apache.org/ • http://bitbucket.org/johnzablocki/codevoyeur-samples - code from this presentation • http://shop.oreilly.com/product/0636920010852.do - O’Reilly’s Cassandra - The Definitive Guide • http://about.me/johnzablocki Links