Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG

Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG

The NoSQL Movement Meetup June 11 2009 in San Francisco NoSQL name proposed by Eric Evans Hadoop/HBase (Yahoo) 2004 BigTable (Google) 2007 Dynamo (Amazon) 2008 Cassandra (Facebook) Project Voldemort (LinkedIn) NoSQL Conferences

Relational Database/SQL

Database Timeline 1970 Codd Relational Model 1989 SQL-89 1999 SQL:1999 Object Relational 1979 Oracle 1992 SQL-92 2003 SQL:2003 Analytics extensions 1974 SEQUEL 1970 1970 1980 1990 2000 2010 1980 Gray Transaction 1995 Bernstein et al Critique of ANSI SQL Isolation Levels 1969 CODASYL - Network database - Schema - DDL/DML 1981 Bernstein and Goodman Multi-version Concurrency Control

Relational Model Table – n-tuple • Normalized data • “Atomic” • Multi-column Key • Operations on tables: • select, project, join • Relationship on key • Primary Key • Foreign Key Column Row Key

SQL • Designed for Transaction Processing • Good • Easily handles simple cases • Everyone has a Query Language • Bad • Data access language (not Turing complete) • Declarative Language (4GL) • Impedance mismatch with procedural languages • Complicated cases get repetitive

Normalization • Refine design of structured data • “Atomic” • No repeating groups • Data item depends on key (and nothing else) • Avoid modification anomalies • Ensure every data item is stored only once • Avoid bias to any particular pattern of querying • Allow data to be accessed from every angle • Denormalization

Star Schema Example Fact Table Date_key Store_key Promotion_key Product_key Receipt_number Quantity Revenue Unit_price Date_key Day_in_week Day_in_month Day_in_year Day_name Week_in_month Week_in_year Month_nbr Month_name Quarter Year Holiday Holiday_desc … Product Promotion Store Date

Database Summary • Costs • Fixed schema • Normalization • Transform data on load • Cost of scaling • Problems with large objects • Complicated software • Benefits • Mature technology • Precise querying • Star Schema – historic data

Tuple Store/NoSQL

Tuple Storage Systems • Google Database System • Chubby – Lock/metadata manager • Google File System – Distributed file system • Bigtable – Tuple storage on GFS • Map Reduce – Data processing on tuples • Other tuple stores • Voldemort – Amazon Dynamo • Cassandra • HBase • Hypertable

Tuple Store Model • One Table • Operate on Map • Set of (Key, Value) • Structured Key • Unstructured Value • Operations: • select, project • Map Reduce Tuple Store Key Value Key Column Timestamp

Map Reduce • Define two functions • Map • Input: tuple • Output: list of tuples • Reduce • Input: key, list of values • Output: list or tuple • Specify a cluster • Specify input and output tuple stores • Framework does the rest { Map(k1, v1) } -> { list(k2, v2) } { list(k2, v2) } -> { (k2, list(v2)) } { Reduce(k2, list(v2)) } -> { list(v3)} -> { (k2, v3) }

URL Web Page URL Web Page URL Web Page URL Web Page … Map Reduce Example For each web page count the number of pages that reference that page Input tuple store is WWW { Map(k1, v1) } -> { list(k2, v2) } { list(k2, v2) } -> { (k2, list(v2)) } { Reduce(k2, list(v2)) } -> { (k2, v3)} Map Function: for each anchor on web page, emit (anchorURL, 1) Output tuple store is{ (URL, count) } Reduce Function: emit (anchorURL, sum(list))

Example in SQL For each web page count the number of pages that reference that page CREATE TABLE links ( URL page NOT NULL, URL ref_page NOT NULL, PRIMARY KEY page, ref_page ) SELECT ref_page, count(DISTINCT page) FROM links GROUP BY ref_page

Tuple Store Summary • Semi-structured data • No need to normalize data • Simple implementations • Cheap, fast, scalable • Map Reduce Processing • Simple programming (for geeks) • Issues • No guidance from schema • No model for historic data Hadoop wins Sort Benchmark

Synthesis

SQL Structured data Precise Historic data Needs transformation Scalability issues NoSQL Cheap Scalable Handles large data Summary

Enterprise Model Money Content Analytics ? Relational DB NoSQL Issues: - Data volume - Query requirements Metadata?

Analytics Architecture Map Reduce Processing Tuple Store Reports TB+/day GB++/day RDB Data Warehouse Cubes Reports etc.

Summary It is all about structured data How much do we want? How much can we afford?

Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG