230 likes | 382 Views
Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG. The NoSQL Movement. Meetup June 11 2009 in San Francisco NoSQL name proposed by Eric Evans. Hadoop/HBase (Yahoo). 2004 BigTable (Google) 2007 Dynamo (Amazon) 2008 Cassandra (Facebook). Project Voldemort (LinkedIn).
E N D
Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG
The NoSQL Movement Meetup June 11 2009 in San Francisco NoSQL name proposed by Eric Evans Hadoop/HBase (Yahoo) 2004 BigTable (Google) 2007 Dynamo (Amazon) 2008 Cassandra (Facebook) Project Voldemort (LinkedIn) NoSQL Conferences
Database Timeline 1970 Codd Relational Model 1989 SQL-89 1999 SQL:1999 Object Relational 1979 Oracle 1992 SQL-92 2003 SQL:2003 Analytics extensions 1974 SEQUEL 1970 1970 1980 1990 2000 2010 1980 Gray Transaction 1995 Bernstein et al Critique of ANSI SQL Isolation Levels 1969 CODASYL - Network database - Schema - DDL/DML 1981 Bernstein and Goodman Multi-version Concurrency Control
Relational Model Table – n-tuple • Normalized data • “Atomic” • Multi-column Key • Operations on tables: • select, project, join • Relationship on key • Primary Key • Foreign Key Column Row Key
SQL • Designed for Transaction Processing • Good • Easily handles simple cases • Everyone has a Query Language • Bad • Data access language (not Turing complete) • Declarative Language (4GL) • Impedance mismatch with procedural languages • Complicated cases get repetitive
Normalization • Refine design of structured data • “Atomic” • No repeating groups • Data item depends on key (and nothing else) • Avoid modification anomalies • Ensure every data item is stored only once • Avoid bias to any particular pattern of querying • Allow data to be accessed from every angle • Denormalization
Star Schema Example Fact Table Date_key Store_key Promotion_key Product_key Receipt_number Quantity Revenue Unit_price Date_key Day_in_week Day_in_month Day_in_year Day_name Week_in_month Week_in_year Month_nbr Month_name Quarter Year Holiday Holiday_desc … Product Promotion Store Date
Database Summary • Costs • Fixed schema • Normalization • Transform data on load • Cost of scaling • Problems with large objects • Complicated software • Benefits • Mature technology • Precise querying • Star Schema – historic data
Tuple Storage Systems • Google Database System • Chubby – Lock/metadata manager • Google File System – Distributed file system • Bigtable – Tuple storage on GFS • Map Reduce – Data processing on tuples • Other tuple stores • Voldemort – Amazon Dynamo • Cassandra • HBase • Hypertable
Tuple Store Model • One Table • Operate on Map • Set of (Key, Value) • Structured Key • Unstructured Value • Operations: • select, project • Map Reduce Tuple Store Key Value Key Column Timestamp
Map Reduce • Define two functions • Map • Input: tuple • Output: list of tuples • Reduce • Input: key, list of values • Output: list or tuple • Specify a cluster • Specify input and output tuple stores • Framework does the rest { Map(k1, v1) } -> { list(k2, v2) } { list(k2, v2) } -> { (k2, list(v2)) } { Reduce(k2, list(v2)) } -> { list(v3)} -> { (k2, v3) }
URL Web Page URL Web Page URL Web Page URL Web Page … Map Reduce Example For each web page count the number of pages that reference that page Input tuple store is WWW { Map(k1, v1) } -> { list(k2, v2) } { list(k2, v2) } -> { (k2, list(v2)) } { Reduce(k2, list(v2)) } -> { (k2, v3)} Map Function: for each anchor on web page, emit (anchorURL, 1) Output tuple store is{ (URL, count) } Reduce Function: emit (anchorURL, sum(list))
Example in SQL For each web page count the number of pages that reference that page CREATE TABLE links ( URL page NOT NULL, URL ref_page NOT NULL, PRIMARY KEY page, ref_page ) SELECT ref_page, count(DISTINCT page) FROM links GROUP BY ref_page
Tuple Store Summary • Semi-structured data • No need to normalize data • Simple implementations • Cheap, fast, scalable • Map Reduce Processing • Simple programming (for geeks) • Issues • No guidance from schema • No model for historic data Hadoop wins Sort Benchmark
SQL Structured data Precise Historic data Needs transformation Scalability issues NoSQL Cheap Scalable Handles large data Summary
Enterprise Model Money Content Analytics ? Relational DB NoSQL Issues: - Data volume - Query requirements Metadata?
Analytics Architecture Map Reduce Processing Tuple Store Reports TB+/day GB++/day RDB Data Warehouse Cubes Reports etc.
Summary It is all about structured data How much do we want? How much can we afford?