Dremel: Interactive Analysis of Web-Scale Datasets

Ridvan Dongelci
Department of Information and Computer Science
Aalto University, School of Science and Technology
ridvan.dongelci@aalto.fi
April 15, 2013

• Dremel and Motivation
• Columnar Storage
• Query Language and Execution
• Experiments
• Observations and Conclusion

Background
• Web-scale Dataset
• Data Exploration, Rapid Prototyping
• Speed Matters

Dremel
• Interactive Analysis of Web-Scale Datasets
• Scalable, Fault Tolerant, Fast
• Analysis on in situ (in-place) data
  • Bigtable, Google File System
• Widely Used in Google
  • BigQuery, Google Books, Web Analysis

Dremel Key Concepts
• Nested Columnar Storage
• Google Protocol Buffers for Processing and Storing
• SQL-like Query Language
• Execution with Serving Trees
  • Inspired by Web Search

Data Model
• Record-wise vs. Columnar Storage
• SELECT SUM(A.B.C) FROM t
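The contrast between the two layouts for a query such as SELECT SUM(A.B.C) FROM t can be sketched in Python. This is only an illustration under simplifying assumptions: the three sample records, the extra field A.D, and the function names are hypothetical, and repeated fields are ignored. The point is that the columnar layout lets the query touch a single contiguous list instead of every full record.

```python
# Three hypothetical records with a nested path A.B.C and an unrelated field A.D.
records = [
    {"A": {"B": {"C": 1}, "D": "x"}},
    {"A": {"B": {"C": 2}, "D": "y"}},
    {"A": {"B": {"C": 3}, "D": "z"}},
]


def sum_record_wise(rows):
    """Record-wise layout: every full record is visited to reach A.B.C."""
    return sum(row["A"]["B"]["C"] for row in rows)


def to_columns(rows):
    """Columnar layout: each leaf path is stored as its own contiguous list."""
    return {
        "A.B.C": [row["A"]["B"]["C"] for row in rows],
        "A.D": [row["A"]["D"] for row in rows],
    }


def sum_columnar(columns):
    """Only the A.B.C column is read; A.D is never touched."""
    return sum(columns["A.B.C"])


columns = to_columns(records)
print(sum_record_wise(records))  # 6
print(sum_columnar(columns))     # 6
```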

Nested Columnar Storage
• Repetition and Definition Levels
• Example column Name.Language.Code:
  • r1.Name1.Language.Code: ‘en-us’
  • r1.Name1.Language.Code: ‘en’
  • r1.Name2: (no code)
  • r1.Name3: ‘en-gb’
  • r2.Name1: (no code)
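A minimal Python sketch of how repetition and definition levels could be derived for the Name.Language.Code column of this example. It assumes the schema used in the Dremel paper, where Name and Language are repeated groups and Code is a required field. The dictionary layout of the records and the function name stripe_code are illustrative only, not Dremel's API. The repetition level says at which repeated field the value repeats (0 = new record, 1 = new Name, 2 = new Language); the definition level says how many of the optional or repeated fields on the path are actually present.

```python
# The two sample records, restricted to the Name.Language.Code path.
r1 = {"Name": [
    {"Language": [{"Code": "en-us"}, {"Code": "en"}]},
    {"Language": []},                      # Name2 has no Language
    {"Language": [{"Code": "en-gb"}]},
]}
r2 = {"Name": [{"Language": []}]}          # Name1 has no Language


def stripe_code(record):
    """Return (value, repetition_level, definition_level) triples for one record."""
    triples = []
    names = record.get("Name", [])
    if not names:
        # Neither Name nor Language is present.
        return [(None, 0, 0)]
    for i, name in enumerate(names):
        # The first Name starts a new record (r=0); later ones repeat at Name (r=1).
        r_name = 0 if i == 0 else 1
        languages = name.get("Language", [])
        if not languages:
            # Name is present, Language is not: a NULL with d=1.
            triples.append((None, r_name, 1))
            continue
        for j, language in enumerate(languages):
            # Later Languages within the same Name repeat at Language (r=2).
            r = r_name if j == 0 else 2
            triples.append((language["Code"], r, 2))
    return triples


print(stripe_code(r1))
# [('en-us', 0, 2), ('en', 2, 2), (None, 1, 1), ('en-gb', 1, 2)]
print(stripe_code(r2))
# [(None, 0, 1)]
```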

Nested Columnar Storage
• Splitting Records into Columns
• Record Assembly
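Record assembly runs in the opposite direction: the nested structure is rebuilt from the stored (value, repetition level, definition level) triples. The Python sketch below handles only the single column Name.Language.Code, with the level values the Dremel paper gives for its two sample records. The paper describes a finite state machine that assembles several columns at once; this simplified single-column loop and the name assemble_code are assumptions made for illustration, and the input is assumed to be well formed (the first triple has repetition level 0).

```python
# Stored column Name.Language.Code as (value, repetition level, definition level).
column = [
    ("en-us", 0, 2),
    ("en", 2, 2),
    (None, 1, 1),
    ("en-gb", 1, 2),
    (None, 0, 1),
]


def assemble_code(triples):
    """Rebuild records of the form Name -> Language -> Code from one column."""
    records = []
    for value, r, d in triples:
        if r == 0:
            records.append({"Name": []})        # r=0 starts a new record
        names = records[-1]["Name"]
        if d == 0:
            continue                            # no Name present at all
        if r <= 1:
            names.append({"Language": []})      # r<=1 starts a new Name
        if d == 2:
            # Language and Code are defined: attach the value to the current Name.
            names[-1]["Language"].append({"Code": value})
    return records


for record in assemble_code(column):
    print(record)
# {'Name': [{'Language': [{'Code': 'en-us'}, {'Code': 'en'}]}, {'Language': []}, {'Language': [{'Code': 'en-gb'}]}]}
# {'Name': [{'Language': []}]}
```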

Tablet Layout and Tricks
• Tablet Storage and Horizontal Partitioning
• Save Space
  • Nulls are not stored
  • Definition levels are not stored if the value is always defined
  • Repetition levels are stored only when needed
  • Levels are packed as bit sequences
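The bit-packing trick can be illustrated with a short Python sketch. Each level needs only as many bits as its maximum value requires, so a column whose maximum level is 2 needs 2 bits per level, and a level that is always 0 needs no bits at all. The function names and the choice of packing into one Python integer are assumptions for illustration; the actual on-disk encoding is not given on the slide.

```python
def bits_needed(max_level):
    """Bits per level value; 0 when the level can only ever be 0."""
    return max_level.bit_length()


def pack_levels(levels, max_level):
    """Pack a list of small integers into one integer, first level in the top bits."""
    width = bits_needed(max_level)
    packed = 0
    for level in levels:
        packed = (packed << width) | level
    return packed, width


def unpack_levels(packed, width, count):
    """Inverse of pack_levels; count is the number of levels that were packed."""
    if width == 0:
        return [0] * count
    mask = (1 << width) - 1
    return [(packed >> (width * (count - 1 - i))) & mask for i in range(count)]


levels = [0, 2, 1, 1, 0]
packed, width = pack_levels(levels, max_level=2)
print(width)                                      # 2
print(unpack_levels(packed, width, len(levels)))  # [0, 2, 1, 1, 0]
```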

Query Language
• SQL-like Language
• Efficient on columnar storage
• Takes one or more tables and their schemas as input
• Outputs a table and its schema
• WHERE prunes branches
• Supports the Following Operations
  • Nested sub-queries, inter/intra-record aggregation
  • Top-K queries, joins, user-defined functions

Query Language Example

Query Execution
• Many queries are one-pass
• Execution on Serving Trees
  • Parallel scheduling and aggregation
  • Fault tolerance and dealing with stragglers
• Root Server Receives the Incoming Query
  • Fetches metadata and schema
  • Determines the tablets
  • Rewrites the query
  • Sends it down the serving tree
  • Aggregates the results

Query Execution Example
• Query received by the root: SELECT A, COUNT(B) FROM T GROUP BY A
• Query after rewriting: SELECT A, SUM(c) FROM (R1 UNION ALL ... RN) GROUP BY A
• Partial result from server i: Ri = SELECT A, COUNT(B) AS c FROM Ti GROUP BY A
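The rewrite works because COUNT can be computed per partition and the partial counts summed afterwards. A minimal Python sketch of that idea follows, using two hypothetical tablets held as lists of dictionaries. The function names leaf_query and root_query and the sample rows are assumptions for illustration; this is not how Dremel itself is implemented.

```python
def leaf_query(tablet):
    """Ri = SELECT A, COUNT(B) AS c FROM Ti GROUP BY A (COUNT skips NULLs)."""
    counts = {}
    for row in tablet:
        increment = 1 if row["B"] is not None else 0
        counts[row["A"]] = counts.get(row["A"], 0) + increment
    return counts


def root_query(partials):
    """SELECT A, SUM(c) FROM (R1 UNION ALL ... RN) GROUP BY A."""
    totals = {}
    for partial in partials:
        for a, c in partial.items():
            totals[a] = totals.get(a, 0) + c
    return totals


# Table T split into two hypothetical tablets T1 and T2.
tablets = [
    [{"A": "x", "B": 1}, {"A": "y", "B": None}],
    [{"A": "x", "B": 2}, {"A": "y", "B": 3}],
]

partials = [leaf_query(tablet) for tablet in tablets]
print(root_query(partials))  # {'x': 2, 'y': 1}

# Same answer as running the original query over the whole table at once.
whole_table = [row for tablet in tablets for row in tablet]
print(leaf_query(whole_table))  # {'x': 2, 'y': 1}
```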

Beyond One-Pass and Query Dispatcher
• Dremel Supports More Than One-Pass
  • Broadcast Join
  • Repartition the Data
  • SELECT-INTO
• Query Dispatch Based on Priority and Load Balance
  • Fault tolerance with rescheduling
  • Slots and Histograms
  • Approximation and Tablet percentage

Experiment Environment
• Real Google Datasets
• Uncompressed about one Petabyte
• Three-way Replicated except one
• 100K to 800K tablets

Local Disk Performance
• Trade-off of Columnar vs. Record-Oriented Storage
• 1 GB of data on a dual-core Intel machine with 70 MB/s read bandwidth
• Columnar: 375 MB, light compression
• Record-oriented: same size, heavier compression

MR and Dremel
• Average term frequency is analyzed
• 3000 MapReduce workers and 3000 Dremel nodes
• 0.5 TB read on columnar storage as opposed to 87 TB on record-oriented storage
• Overhead of launching and scheduling jobs, assembling records

Serving Tree Topology
• Queries with Different Numbers of Levels
• First query reads about 60 GB
• Second query reads about 180 GB
• 2-level is 1:2900, 3-level is 1:100:2900
• 4-level is 1:10:100:2900
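The topologies differ in how many child results each server has to aggregate. The Python sketch below computes the average fan-out per level from the ratio strings on the slide. The helper name fan_outs is hypothetical, and the calculation assumes children are spread evenly across parents.

```python
def fan_outs(topology):
    """Average number of children per server at each level of a serving tree.

    topology is a ratio string such as "1:100:2900" (root : intermediate : leaves).
    """
    counts = [int(part) for part in topology.split(":")]
    return [child / parent for parent, child in zip(counts, counts[1:])]


for topology in ("1:2900", "1:100:2900", "1:10:100:2900"):
    print(topology, fan_outs(topology))
# 1:2900 [2900.0]
# 1:100:2900 [100.0, 29.0]
# 1:10:100:2900 [10.0, 10.0, 29.0]
```

With two levels the root alone combines 2900 partial results; adding intermediate levels cuts that to 100 or 10 per server, with the remaining aggregation done in parallel lower in the tree.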

Per Tablet Histogram
• Tablet processing rates are investigated
• 99% of tablets for the first query are processed within 1 second
• 99% of tablets for the second query are processed within 2 seconds

Within-Record Aggregation
• Effect of Nesting and Columnar Storage
• Only 13 GB is read due to columnar storage
• Without nesting, the query would be much more expensive

Scalability
• Top 20 aid values on 4.2 TB of compressed data
• 1000 to 4000 nodes
• CPU time is nearly identical, about 300K seconds
• Near-linear scalability

Stragglers
• Only two-way replication on T5
• 99% done in 5 seconds
• Less replication, more stragglers

Observations
• Scan-based queries can be executed at web scale
• Near-linear scalability is achievable
• MapReduce can benefit from columnar storage
• Parallel DBMSs can benefit from serving trees
• Record assembly and parsing are expensive
• MapReduce and Dremel can be used complementarily

Thank You for Your Patience
Questions & Comments


Algorithms
