Big Data Analytics: Map-Reduce and Spark
Slides by: Joseph E. Gonzalez (jegonzal@cs.berkeley.edu)
Guest Lecturer: Vikram Sreekanti
From SQL to Big Data (with SQL)
• A few weeks ago…
  • Databases
  • (Relational) Database Management Systems
  • SQL: Structured Query Language
• Today
  • More on databases and database design
  • Enterprise data management and the data lake
  • Introduction to distributed data storage and processing
  • Spark
Data in the Organization A little bit of buzzword bingo!
How we like to think of data in the organization: a single system (e.g., Inventory).
The reality: many separate systems, such as Sales (Asia), Sales (US), Inventory, and Advertising.
Operational Data Stores (Sales (Asia), Sales (US), Inventory, Advertising)
• Capture the now
• Many different databases across an organization
• Mission critical… be careful! Serving live ongoing business operations (e.g., managing inventory)
• Different formats (e.g., currency) and different schemas (acquisitions...)
• Live systems often don't maintain history
Data Warehouse
• Collects and organizes historical data from multiple sources
• We would like a consolidated, clean, historical snapshot of the data.
Data Warehouse: collects and organizes historical data from multiple sources (Sales (Asia), Sales (US), Inventory, Advertising).
Data is periodically ETLed into the data warehouse:
• Extracted from remote sources
• Transformed to standard schemas
• Loaded into the (typically) relational (SQL) data system
Extract, Transform, Load (ETL)
Extract & Load: provides a snapshot of operational data
• Historical snapshot
• Data in a single system
• Isolates analytics queries (e.g., deep learning) from business-critical services (e.g., processing user purchases)
• Easy!
Transform: clean and prepare data for analytics in a unified representation
• Difficult: often requires specialized code and tools
• Different schemas, encodings, granularities
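To make the transform step concrete, here is a minimal ETL sketch. The feeds, field names, and the fixed exchange rate are all made up for illustration: two regional sales feeds arrive with different schemas and currencies, get transformed to one standard schema, and are loaded into a single "warehouse" list.

```python
def extract_us():
    # US feed: amounts already in USD, ISO dates
    return [{"item": "phone", "amount_usd": 500.0, "date": "2018-03-01"}]

def extract_asia():
    # Asia feed: amounts in JPY, different field names and date format
    return [{"product": "phone", "amount_jpy": 55000.0, "day": "2018/03/01"}]

JPY_PER_USD = 110.0  # assumed fixed rate, for the sketch only

def transform_asia(row):
    # Map to the warehouse schema: standard names, USD amounts, ISO dates
    return {
        "item": row["product"],
        "amount_usd": round(row["amount_jpy"] / JPY_PER_USD, 2),
        "date": row["day"].replace("/", "-"),
    }

def load(warehouse, rows):
    warehouse.extend(rows)

warehouse = []
load(warehouse, extract_us())                                 # already standard
load(warehouse, [transform_asia(r) for r in extract_asia()])  # needs transform
```

The extract and load steps are mechanical; almost all of the work (and all of the judgment calls, like the exchange rate) lives in the transform.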
Data Warehouse: collects and organizes historical data from multiple sources via ETL (Sales (Asia), Sales (US), Inventory, Advertising).
How is data organized in the data warehouse?
Example Sales Data
• Big table: many columns and rows
• Substantial redundancy: expensive to store and access
• Easy to make mistakes while updating
• Could we organize the data more efficiently?
Multidimensional Data Model
• Sales Fact Table linked to Dimension Tables: Locations, Products, Time
• A multidimensional "cube" of data, indexed by Location Id, Product Id, and Time Id
[Figure: sales quantities arranged as a 3-D cube over products, locations, and time]
Multidimensional Data Model
• Fact Table
  • Minimizes redundant info
  • Reduces data errors
• Dimensions (Locations, Products, Time)
  • Easy to manage and summarize
  • Rename: Galaxy1 → Phablet
• Normalized representation
• How do we do analysis? Joins!
The Star Schema: the Sales Fact Table sits in the center, with the dimension tables (Products, Time, Locations) around it. This looks like a star…
The Snowflake Schema: the dimension tables are themselves normalized into sub-dimension tables. This looks like a snowflake…?
Which schema illustration would best organize this data? http://bit.ly/ds100-sp18-star
Data(calid, student_name, year, major, major_grade_req, asg1_name, asg1_pts, asg1_score, asg1_grader_name, asg2_name, asg2_pts, asg2_score, asg2_grader_name, avg_grade)
[Figure: three candidate schema diagrams (A), (B), and (C) relating Students, Grades, Graders, Assignments, and Majors]
Which schema illustration would best organize this data?
Data(calid, student_name, year, major, major_grade_req, asg1_name, asg1_pts, asg1_score, asg1_grader_name, asg2_name, asg2_pts, asg2_score, asg2_grader_name, avg_grade)
• Grades(calid, asg_name, grader_id, score)
• Graders(grader_id, grader_name)
• Students(calid, name, year, major_name, avg_grade)
• Majors(major_name, grade_req)
• Assignments(asg_name, asg_pts)
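The normalized schema on this slide can be sketched directly in SQLite; the rows below are made up, and only a subset of the tables is populated. The point is the last line: analysis over a normalized schema means joining the fact table (Grades) back to its dimensions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE Students(calid INTEGER PRIMARY KEY, name TEXT,
                      year INTEGER, major_name TEXT);
CREATE TABLE Assignments(asg_name TEXT PRIMARY KEY, asg_pts INTEGER);
CREATE TABLE Graders(grader_id INTEGER PRIMARY KEY, grader_name TEXT);
CREATE TABLE Grades(calid INTEGER, asg_name TEXT, grader_id INTEGER, score REAL);
""")
cur.execute("INSERT INTO Students VALUES (1, 'Ada', 2, 'CS')")
cur.execute("INSERT INTO Assignments VALUES ('hw1', 100)")
cur.execute("INSERT INTO Graders VALUES (7, 'Grace')")
cur.execute("INSERT INTO Grades VALUES (1, 'hw1', 7, 95.0)")

# Reassemble one wide row of the original Data(...) table via joins
row = cur.execute("""
    SELECT s.name, g.asg_name, a.asg_pts, g.score, gr.grader_name
    FROM Grades g
    JOIN Students s     ON g.calid = s.calid
    JOIN Assignments a  ON g.asg_name = a.asg_name
    JOIN Graders gr     ON g.grader_id = gr.grader_id
""").fetchone()
```

Each fact about an assignment, grader, or student is stored exactly once, so renaming a grader or fixing a point total touches a single row.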
Online Analytics Processing (OLAP)
Users interact with multidimensional data by:
• Constructing ad-hoc and often complex SQL queries
• Using graphical tools to construct queries
• Sharing views that summarize data across important dimensions
Cross Tabulation (Pivot Tables)
• Aggregate data across pairs of dimensions
• Pivot tables: a graphical interface to select dimensions and an aggregation function (e.g., SUM, MAX, MEAN)
• Implemented as GROUP BY queries
• Related to contingency tables and marginalization in statistics
• What about many dimensions?
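A cross tabulation is just a GROUP BY over two dimensions with an aggregate. A minimal sketch, with made-up (item, region, quantity) rows and SUM as the aggregation function:

```python
from collections import defaultdict

sales = [
    ("phone",  "US",   3),
    ("phone",  "Asia", 5),
    ("tablet", "US",   2),
    ("tablet", "US",   1),
]

# (item, region) -> SUM(quantity): the cells of the pivot table
pivot = defaultdict(int)
for item, region, qty in sales:
    pivot[(item, region)] += qty
```

A pivot-table UI then lays these (item, region) cells out as a grid, with items as rows and regions as columns.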
The Cube Operator
• Generalizes cross-tabulation to higher dimensions: aggregates over every subset of the grouping dimensions, with * marking the dimensions that were aggregated out.
• In SQL:
SELECT Item, Color, SUM(Quantity) AS QtySum
FROM Furniture
GROUP BY CUBE (Item, Color);
[Figure: the sales cube over Product Id, Location Id, and Time Id, with marginal totals (*) on each face]
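Standard SQLite has no CUBE operator, but the semantics can be sketched in a few lines of Python: for each row, add its quantity into one cell per subset of the grouping dimensions, using "*" for aggregated-out dimensions. The Furniture rows are made up.

```python
from collections import defaultdict

rows = [("chair", "red", 30), ("chair", "blue", 20), ("table", "red", 50)]

cube = defaultdict(int)
for item, color, qty in rows:
    # Subsets of the dimension indices {0: item, 1: color}
    for keep in [(), (0,), (1,), (0, 1)]:
        key = tuple(v if i in keep else "*"
                    for i, v in enumerate((item, color)))
        cube[key] += qty
```

The ("*", "*") cell is the grand total, ("chair", "*") sums over colors, and so on: exactly the marginals a GROUP BY CUBE query would return as rows with NULLs in the aggregated columns.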
OLAP Queries
• Slicing: selecting a single value for a dimension
• Dicing: selecting a range of values in multiple dimensions
[Figure: a slice and a dice of the sales cube]
OLAP Queries
• Rollup: aggregating along a dimension (e.g., summing the Day 1, Day 2, and Day 3 cells into totals)
• Drill-Down: de-aggregating along a dimension
[Figure: the daily sales cube rolled up over the time dimension, and drilled back down]
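Rollup can be sketched as summing cube cells over the dimension being aggregated away; drill-down is simply returning to the finer-grained cells. The cells below, keyed by hypothetical (product, location, day) ids, are made up.

```python
cells = {
    ("p1", "l1", 1): 15, ("p1", "l1", 2): 10, ("p1", "l1", 3): 25,
    ("p2", "l1", 1): 6,  ("p2", "l1", 2): 8,  ("p2", "l1", 3): 30,
}

# Roll up over the day dimension: (product, location) -> total quantity
rollup = {}
for (product, location, day), qty in cells.items():
    key = (product, location)
    rollup[key] = rollup.get(key, 0) + qty
```

Drilling down means keeping the original `cells` around so a user can expand a rolled-up total back into its per-day values.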
Reporting and Business Intelligence (BI)
• Analysts use high-level tools to interact with their data
• The tools automatically generate SQL queries
• Queries can get big!
• This is common!
Data Warehouse (recap): collects and organizes historical data from multiple sources via ETL.
So far:
• Star schemas
• Data cubes
• OLAP queries
Beyond the Data Warehouse
• How do we deal with semi-structured and unstructured data (text/log data, photos & videos)?
• Do we really want to force a schema on load?
• How do we load and process this data in a relational system? It can be difficult... it requires thought... often it is terrible!
• How do we clean and organize this data? Depends on use…
Data Lake*
• Store a copy of all the data, in one place, in its original "natural" form
• Enable data consumers to choose how to transform and use data: schema on read
• What could go wrong?
*Still being defined… [buzzword disclaimer]
The Dark Side of Data Lakes
• Cultural shift: from "curate" to "save everything!"
  • Noise begins to dominate signal
• Limited data governance and planning
  • Example: hdfs://important/joseph_big_file3.csv_with_json
  • What does it contain? When and who created it?
• No cleaning and verification: lots of dirty data
• New tools are more complex, and old tools no longer work
Enter the data scientist
A Brighter Future for Data Lakes
Enter the data scientist
• Data scientists bring new skills
  • Distributed data processing and cleaning
  • Machine learning, computer vision, and statistical sampling
• Technologies are improving
  • SQL over large files
  • Self-describing file formats & catalog managers
• Organizations are evolving
  • Tracking data usage and file permissions
  • New job title: data engineers
How do we store and compute on large unstructured datasets?
• Requirements:
  • Handle very large files spanning multiple computers
  • Use cheap commodity devices that fail frequently
  • Support distributed data processing quickly and easily
• Solutions:
  • Distributed file systems: spread data over multiple machines; assume machine failure is common, so build in redundancy
  • Distributed computing: load and process files on multiple machines concurrently; again, assume failure is common and use redundancy
  • Functional programming: a computational pattern that exposes parallelism
Fault-Tolerant Distributed File Systems [Ghemawat et al., SOSP'03]
How do we store and access very large files across cheap commodity devices?
Fault-Tolerant Distributed File Systems [Ghemawat et al., SOSP'03]
• Split the big file into smaller parts. How?
  • Ideally at record boundaries
  • What if records are big?
[Figure: the file is split into parts A, B, C, and D to be spread across Machines 1–4]
Fault-Tolerant Distributed File Systems [Ghemawat et al., SOSP'03]
• Split large files over multiple machines
  • Easily support massive files spanning machines
  • Read parts of the file in parallel: fast reads of large files
• Often built using cheap commodity storage devices
• But cheap commodity storage devices will fail!
[Figure: parts A–D each replicated three ways across Machines 1–4]
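The split-and-replicate idea can be sketched in a few lines. This is only an illustration of the placement pattern, not how GFS or HDFS actually choose replica locations (real systems use rack-aware placement and a metadata server); the chunk size, machine names, and round-robin policy are all assumptions of the sketch.

```python
def place_chunks(file_bytes, chunk_size, machines, replicas=3):
    """Split a byte string into fixed-size chunks and assign each chunk
    to `replicas` machines, round-robin."""
    chunks = [file_bytes[i:i + chunk_size]
              for i in range(0, len(file_bytes), chunk_size)]
    placement = {}  # chunk index -> list of machines holding a copy
    for idx in range(len(chunks)):
        placement[idx] = [machines[(idx + r) % len(machines)]
                          for r in range(replicas)]
    return chunks, placement

chunks, placement = place_chunks(b"x" * 10, chunk_size=4,
                                 machines=["m1", "m2", "m3", "m4"])
```

With three copies of every chunk on distinct machines, any single machine can fail without losing data, and reads of a large file can proceed from several machines in parallel.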
Fault-Tolerant Distributed File Systems [Ghemawat et al., SOSP'03]
Failure Event
• When machines fail, the parts stored on them are lost
• Because each part is replicated across multiple machines, no data is lost
• The surviving copies are re-replicated onto the remaining machines to restore redundancy
[Figure: two machines fail; the remaining machines still hold copies of parts A–D and re-replicate them]
Map-Reduce: Distributed Aggregation over Very Large Files
How would you compute the number of occurrences of each word in all the books using a team of people?
Simple Solution
1) Divide Books Across Individuals
2) Compute Counts Locally
3) Aggregate Tables
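The three manual steps map directly onto map-reduce: dividing books across individuals is partitioning, counting locally is the map (plus a local combine), and merging the tables is the reduce. A minimal sketch with made-up "books":

```python
from collections import Counter

books = ["the cat sat", "the dog sat on the mat"]  # one "book" per partition

# 1) + 2) Each worker counts the words in its own partition locally
local_counts = [Counter(book.split()) for book in books]

# 3) Aggregate the per-partition tables into one global table
total = Counter()
for counts in local_counts:
    total.update(counts)  # merges by summing counts per word
```

In a real map-reduce system the local tables would be shuffled so each reducer merges the counts for a disjoint subset of words, letting step 3 also run in parallel.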