1 / 21

Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG

Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG. The NoSQL Movement. Meetup June 11 2009 in San Francisco NoSQL name proposed by Eric Evans. Hadoop/HBase (Yahoo). 2004 BigTable (Google) 2007 Dynamo (Amazon) 2008 Cassandra (Facebook). Project Voldemort (LinkedIn).

faraji
Download Presentation

Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Analytics: SQL or NoSQL? Richard Taylor Chair Business Intelligence SIG

  2. The NoSQL Movement Meetup June 11 2009 in San Francisco NoSQL name proposed by Eric Evans Hadoop/HBase (Yahoo) 2004 BigTable (Google) 2007 Dynamo (Amazon) 2008 Cassandra (Facebook) Project Voldemort (LinkedIn) NoSQL Conferences

  3. Relational Database/SQL

  4. Database Timeline 1970 Codd Relational Model 1989 SQL-89 1999 SQL:1999 Object Relational 1979 Oracle 1992 SQL-92 2003 SQL:2003 Analytics extensions 1974 SEQUEL 1970 1970 1980 1990 2000 2010 1980 Gray Transaction 1995 Bernstein et al Critique of ANSI SQL Isolation Levels 1969 CODASYL - Network database - Schema - DDL/DML 1981 Bernstein and Goodman Multi-version Concurrency Control

  5. Relational Model Table – n-tuple • Normalized data • “Atomic” • Multi-column Key • Operations on tables: • select, project, join • Relationship on key • Primary Key • Foreign Key Column Row Key

  6. SQL • Designed for Transaction Processing • Good • Easily handles simple cases • Everyone has a Query Language • Bad • Data access language (not Turing complete) • Declarative Language (4GL) • Impedance mismatch with procedural languages • Complicated cases get repetitive

  7. Normalization • Refine design of structured data • “Atomic” • No repeating groups • Data item depends on key (and nothing else) • Avoid modification anomalies • Ensure every data item is stored only once • Avoid bias to any particular pattern of querying • Allow data to be accessed from every angle • Denormalization

  8. Star Schema Example Fact Table Date_key Store_key Promotion_key Product_key Receipt_number Quantity Revenue Unit_price Date_key Day_in_week Day_in_month Day_in_year Day_name Week_in_month Week_in_year Month_nbr Month_name Quarter Year Holiday Holiday_desc … Product Promotion Store Date

  9. Database Summary • Costs • Fixed schema • Normalization • Transform data on load • Cost of scaling • Problems with large objects • Complicated software • Benefits • Mature technology • Precise querying • Star Schema – historic data

  10. Tuple Store/NoSQL

  11. Tuple Storage Systems • Google Database System • Chubby – Lock/metadata manager • Google File System – Distributed file system • Bigtable – Tuple storage on GFS • Map Reduce – Data processing on tuples • Other tuple stores • Voldemort – Amazon Dynamo • Cassandra • HBase • Hypertable

  12. Tuple Store Model • One Table • Operate on Map • Set of (Key, Value) • Structured Key • Unstructured Value • Operations: • select, project • Map Reduce Tuple Store Key Value Key Column Timestamp

  13. Map Reduce • Define two functions • Map • Input: tuple • Output: list of tuples • Reduce • Input: key, list of values • Output: list or tuple • Specify a cluster • Specify input and output tuple stores • Framework does the rest { Map(k1, v1) } -> { list(k2, v2) } { list(k2, v2) } -> { (k2, list(v2)) } { Reduce(k2, list(v2)) } -> { list(v3)} -> { (k2, v3) }

  14. URL Web Page URL Web Page URL Web Page URL Web Page … Map Reduce Example For each web page count the number of pages that reference that page Input tuple store is WWW { Map(k1, v1) } -> { list(k2, v2) } { list(k2, v2) } -> { (k2, list(v2)) } { Reduce(k2, list(v2)) } -> { (k2, v3)} Map Function: for each anchor on web page, emit (anchorURL, 1) Output tuple store is{ (URL, count) } Reduce Function: emit (anchorURL, sum(list))

  15. Example in SQL For each web page count the number of pages that reference that page CREATE TABLE links ( URL page NOT NULL, URL ref_page NOT NULL, PRIMARY KEY page, ref_page ) SELECT ref_page, count(DISTINCT page) FROM links GROUP BY ref_page

  16. Tuple Store Summary • Semi-structured data • No need to normalize data • Simple implementations • Cheap, fast, scalable • Map Reduce Processing • Simple programming (for geeks) • Issues • No guidance from schema • No model for historic data Hadoop wins Sort Benchmark

  17. Synthesis

  18. SQL Structured data Precise Historic data Needs transformation Scalability issues NoSQL Cheap Scalable Handles large data Summary

  19. Enterprise Model Money Content Analytics ? Relational DB NoSQL Issues: - Data volume - Query requirements Metadata?

  20. Analytics Architecture Map Reduce Processing Tuple Store Reports TB+/day GB++/day RDB Data Warehouse Cubes Reports etc.

  21. Summary It is all about structured data How much do we want? How much can we afford?

More Related