Modeling in Cassandra (C*)

Phoenix Cassandra Users - July 2013 Modeling in Cassandra (C*) • All examples using C* 1.2.x and CQL3

Agenda • Me • A bit about our use case • Data Model • Modeling • Closing thoughts • Q&A • References

Me • 15+ years developing enterprise software applications for Supply Chain, Finance, and Human Resources • Interested in Big Data technologies, Agile methodologies, Math, Robotics, Music, Cycling • LinkedIn: http://linkd.in/18rG6oz

A bit about our use case • Currently ingesting ~70 million records a day and growing as we get more customers • Using C* 1.1 (DataStax 2.2) and using CQL3 beta; planning to move to 1.2 • Need to delete data without impacting the system • Data size is 1TB for ~30 days worth of customer data • One use case needs to produce customer reports within 2 hours after the data has been ingested • Currently looking back 15 days, but need to be able to analyze and lookup data up to 180 days • Some use cases require up to 13 months

Data Model

Data Model Don’t use this to design your model in C*!

Data Model • Map of a map • Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

Data Model • Benefits: • Efficient key lookup • Efficient scans due to natural sort of the columns • Max number of cells (rows x columns) in a single partition is 2 billion

Modeling

Modeling • ... but isn’t this NoSQL?

Modeling • Business requirements • Data structure

Modeling • Relational model for “library” system

Modeling • what users are reading this book?

Modeling • who is reading this book?

Modeling • C* model looks similar CREATE TABLE books ( ISBN text, title text, pub_date varchar, PRIMARY KEY (ISBN)); CREATE TABLE users ( userid text, fname text, lname text, PRIMARY KEY (userid));

Modeling • But ... there are no joins in C*!!

Modeling • However, all the users for a book are in a single row! • Sequential reads Wide row! CREATE TABLE users_by_book ( ISBN text, userid text, checkout_dt timeuuid, PRIMARY KEY (ISBN, userid));

Modeling • But you just said there are no joins in C*! • De-normalization is the way, depending on the use case • C* is great for writes CREATE TABLE users_by_book ( ISBN text, userid text, fname text, lname text, title text, checkout_dt timeuuid, PRIMARY KEY (ISBN, userid));

Modeling • We need to know the books a user is reading CREATE TABLE books_by_user ( ISBN text, userid text, fname text, lname text, title text, checkout_dt timeuuid, PRIMARY KEY (userid, ISBN));

Modeling • what if I want the last 5 books a user has recently checked out?

Modeling • what if I want the last 5 books a user has recently checked out? CREATE TABLE books_by_user ( ISBN text, userid text, fname text, lname text, title text, checkout_dt timestamp, PRIMARY KEY (userid, checkout_dt, ISBN)) WITH CLUSTERING ORDER BY (checkout_dt DESC); Sort columns in reverse order

Modeling • what if I want the last 5 books a user has recently checked out? CREATE TABLE books_by_user ( ISBN text, userid text, fname text, lname text, title text, checkout_dt timestamp, PRIMARY KEY (userid, checkout_dt, ISBN)) WITH CLUSTERING ORDER BY (checkout_dt DESC); Sort columns in reverse order select userid, ISBN, title from books_by_user where userid = ‘Ivanovich’ LIMIT 5;

Modeling ... so far • Don’t think relational tables • Start with entities and relationships, if possible • Use query patterns • Duplicate and de-normalize data as needed

Modeling ... a bit more • We want to know books checked out by branch and date

Modeling ... a bit more • We want to know books checked out by branch and date CREATE TABLE books_by_branch_date ( ISBN text, branch text, checkout_date varchar, title text, PRIMARY KEY (branch, checkout_date, ISBN));

Modeling ... a bit more • We want to know books checked out by branch and date CREATE TABLE books_by_branch_date ( ISBN text, branch text, checkout_date varchar, title text, PRIMARY KEY (branch, checkout_date, ISBN)); Wait! This will cluster all the books for all dates in a single row for the branch!

Modeling ... a bit more • We want to know books checked out by branch and date CREATE TABLE books_by_branch_date ( ISBN text, branch text, checkout_date varchar, title text, PRIMARY KEY ((branch, checkout_date), ISBN));

Modeling ... a bit more • We want to know books checked out by branch and date CREATE TABLE books_by_branch_date ( ISBN text, branch text, checkout_date varchar, title text, PRIMARY KEY ((branch, checkout_date), ISBN)); Composite row key!

Modeling ... a bit more • Let’s add some tags to our books ... CREATE TABLE books ( ISBN text, title text, pub_date varchar, tags set<text>, PRIMARY KEY (ISBN));

Modeling ... a bit more • Let’s add some tags to our books ... CREATE TABLE books ( ISBN text, title text, pub_date varchar, tags set<text>, PRIMARY KEY (ISBN)); Collections! Set, List, and Map

Modeling ... a bit more • Let’s add some tags to our books ... CREATE TABLE books ( ISBN text, title text, pub_date varchar, tags set<text>, PRIMARY KEY (ISBN)); Collections! Set, List, and Map cqlsh:test> select * from books; isbn | pub_date | tags | title---------------------+-------------+-------------------+------------------ 1-123456-89762 | 09011963 | {Heroes, Marvel} | Avengers 101

Modeling ... a bit more • Purge data older than 30 days in our book by branch and date tracking

Modeling ... a bit more • Purge data older than 30 days in our book by branch and date tracking • Try to delete data by date? Hmm ...

Modeling ... a bit more • Purge data older than 30 days in our book by branch and date tracking • TTL to the rescue!

Modeling ... a bit more • Purge data older than 30 days in our book by branch and date tracking • TTL to the rescue! insert into books_by_branch_date( isbn, branch, checkout_date, title) values ('1-123456-89762', 'Mustang','07222013', 'Avengers 101') USING TTL 2592000; // 30 days in seconds

Closing thoughts • Think about query patterns and data usages • Try out things and see what works for your use case • Use UUID (Universal Unique ID) - 128 bit number for surrogate keys • Use natural keys with caution • Use latest C* drivers (Java, Python, C#) • Learn THY C*!! • Don’t be shy and ask the community when in doubt

Q & A

References • http://wiki.apache.org/cassandra/DataModel • http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/ • http://www.datastax.com/documentation/cassandra/1.2/

Modeling in Cassandra (C*)