Dremel: Interactive Analysis of Web-Scale Datasets
Ridvan Dongelci
Department of Information and Computer Science
Aalto University, School of Science and Technology
ridvan.dongelci@aalto.fi
April 15, 2013
Outline
• Dremel and Motivation
• Columnar Storage
• Query Language and Execution
• Experiments
• Observations and Conclusion
Background
• Web-Scale Datasets
• Data Exploration, Rapid Prototyping
• Speed Matters
Dremel
• Interactive Analysis of Web-Scale Datasets
• Scalable, Fault Tolerant, Fast
• Analysis on in situ (in-place) data
  • Bigtable, Google File System
• Widely Used in Google
  • BigQuery, Google Books, Web Analysis
Dremel Key Concepts
• Nested Columnar Storage
• Google Protocol Buffers for Processing and Storing
• SQL-like Query Language
• Execution with Serving Trees
  • Inspired by Web Search
Data Model
• Record-wise vs. Columnar Storage
• SELECT SUM(A.B.C) FROM t
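To make the record-wise vs. columnar contrast concrete, here is a minimal Python sketch of the query SELECT SUM(A.B.C) FROM t over a toy table. The table contents, the extra field D, and all function names are assumptions made for illustration. They are not taken from the slides or from Dremel itself.

```python
# Hypothetical toy table; the path A.B.C mirrors the query on the slide,
# and D stands in for all the other fields a real record would carry.
records = [
    {"A": {"B": {"C": 3}}, "D": "x" * 10},
    {"A": {"B": {"C": 4}}, "D": "y" * 10},
    {"A": {"B": {"C": 5}}, "D": "z" * 10},
]


def sum_record_wise(rows):
    """Record-oriented scan: each full record is visited to reach A.B.C."""
    return sum(row["A"]["B"]["C"] for row in rows)


def to_columns(rows):
    """Store each leaf field path as its own contiguous list."""
    return {
        "A.B.C": [row["A"]["B"]["C"] for row in rows],
        "D": [row["D"] for row in rows],
    }


def sum_columnar(columns):
    """Columnar scan: only the A.B.C column is read; D is never touched."""
    return sum(columns["A.B.C"])


columns = to_columns(records)
row_total = sum_record_wise(records)
col_total = sum_columnar(columns)
```

Both scans return the same total. The difference is in how much data has to be read to get there, which is the point of the slide.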
Nested Columnar Storage
• Repetition and Definition Levels
• Example column Name.Language.Code:
  • r1.Name1.Language.Code: 'en-us'
  • r1.Name1.Language.Code: 'en'
  • r1.Name2: NULL
  • r1.Name3.Language.Code: 'en-gb'
  • r2.Name1: NULL
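The following Python sketch shows one way repetition and definition levels could be computed for the single column Name.Language.Code. It assumes that Name and Language are repeated fields and that Code is required inside Language. The record literals and the function name are hypothetical, and the code is hard-wired to this one path rather than being a general striping algorithm.

```python
# Two hypothetical records shaped like the r1/r2 example; only the fields
# on the path Name.Language.Code are modelled.
r1 = {"Name": [
    {"Language": [{"Code": "en-us"}, {"Code": "en"}]},
    {},                                  # Name2 has no Language
    {"Language": [{"Code": "en-gb"}]},
]}
r2 = {"Name": [{}]}                      # Name1 has no Language


def stripe_code(record):
    """Return (value, repetition_level, definition_level) triples.

    Repetition level: which repeated field on the path repeated most
    recently (0 = new record, 1 = Name, 2 = Language).
    Definition level: how many repeated/optional fields on the path are
    actually present (0..2 here, because Code is assumed required).
    """
    names = record.get("Name", [])
    if not names:
        return [(None, 0, 0)]
    out = []
    for i, name in enumerate(names):
        name_r = 0 if i == 0 else 1      # first Name starts the record
        languages = name.get("Language", [])
        if not languages:
            out.append((None, name_r, 1))  # only Name is defined
            continue
        for j, language in enumerate(languages):
            r = name_r if j == 0 else 2  # later Languages repeat at level 2
            out.append((language["Code"], r, 2))
    return out


column_r1 = stripe_code(r1)
column_r2 = stripe_code(r2)
```

The NULL entries carry a definition level below the maximum. That is how the column records where a Name exists without any Language, so the record structure can be rebuilt later.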
Nested Columnar Storage
• Splitting Records into Columns
• Record Assembly
Tablet Layout and Tricks
• Tablet Storage and Horizontal Partitioning
• Save Space
  • Nulls are not stored
  • Definition levels are not stored if always defined
  • Repetition levels are only stored when needed
  • Levels are packed as bit sequences
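As an illustration of packing levels into a bit sequence, the sketch below uses a simple fixed-width scheme: each level takes just enough bits to hold the column's maximum level. The slides do not describe the actual on-disk encoding, so this scheme and the helper names are assumptions.

```python
def bits_needed(max_level):
    """Bits required to represent any level in 0..max_level (at least 1)."""
    return max(1, max_level.bit_length())


def pack_levels(levels, max_level):
    """Pack small integers into bytes, most significant level first."""
    width = bits_needed(max_level)
    packed = 0
    for level in levels:
        packed = (packed << width) | level
    n_bytes = (len(levels) * width + 7) // 8
    return packed.to_bytes(n_bytes, "big"), width


def unpack_levels(data, count, width):
    """Inverse of pack_levels for a known level count and bit width."""
    packed = int.from_bytes(data, "big")
    mask = (1 << width) - 1
    return [(packed >> (width * (count - 1 - i))) & mask for i in range(count)]


levels = [0, 2, 1, 1, 0]                 # hypothetical repetition levels
data, width = pack_levels(levels, max_level=2)
restored = unpack_levels(data, len(levels), width)
```

With a maximum level of 2, each level takes two bits, so the five levels fit in two bytes instead of five.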
Query Language
• SQL-like Language
  • Efficient on columnar storage
  • Takes one or more tables and their schemas as input
  • Outputs a table and its schema
  • WHERE prunes branches
• Supports the Following Operations
  • Nested sub-queries, inter- and intra-record aggregation
  • Top-K queries, joins, user-defined functions
Query Language Example
Query Execution
• Many Queries Are One-Pass
• Execution on Serving Trees
  • Parallel scheduling and aggregation
  • Fault tolerance and handling of stragglers
• Root Server Receives the Incoming Query
  • Fetches metadata and schema
  • Determines tablets
  • Rewrites the query
  • Sends it down the serving tree
  • Aggregates the results
Query Execution Example
• Original query: SELECT A, COUNT(B) FROM T GROUP BY A
• Rewritten at the root: SELECT A, SUM(c) FROM (R1 UNION ALL ... RN) GROUP BY A
• Evaluated on each partition Ti of T: Ri = SELECT A, COUNT(B) AS c FROM Ti GROUP BY A
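The rewrite can be checked with a small Python sketch: each partition produces partial counts (the Ri queries), and the root sums them per group. The tablet contents and function names are hypothetical, and the whole serving tree is collapsed into a single merge step for brevity.

```python
def leaf_query(tablet):
    """Ri = SELECT A, COUNT(B) AS c FROM Ti GROUP BY A.

    COUNT(B) ignores NULL values of B, but every group of A still appears.
    """
    counts = {}
    for row in tablet:
        counts.setdefault(row["A"], 0)
        if row.get("B") is not None:
            counts[row["A"]] += 1
    return counts


def merge_partials(partials):
    """SELECT A, SUM(c) FROM (R1 UNION ALL ... RN) GROUP BY A."""
    totals = {}
    for partial in partials:
        for key, c in partial.items():
            totals[key] = totals.get(key, 0) + c
    return totals


# Hypothetical tablets T1 and T2 that together make up table T.
t1 = [{"A": "x", "B": 1}, {"A": "y", "B": None}]
t2 = [{"A": "x", "B": 2}, {"A": "y", "B": 3}, {"A": "x", "B": 4}]

distributed = merge_partials([leaf_query(t1), leaf_query(t2)])
single_pass = leaf_query(t1 + t2)        # the original query run on all of T
```

The rewrite works because COUNT can be decomposed into per-partition counts that are then summed. Intermediate levels of a serving tree could apply the same merge step to their children's results.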
Beyond One-Pass and Query Dispatcher
• Dremel Supports More Than One-Pass Queries
  • Broadcast join
  • Repartitioning the data
  • SELECT-INTO
• Query Dispatch Based on Priority and Load Balancing
  • Fault tolerance with rescheduling
  • Slots and histograms
  • Approximation and tablet percentage
Experiment Environment
• Real Google Datasets
• About one petabyte uncompressed
• Three-way replicated, except one
• 100K to 800K tablets
Local Disk Performance
• Trade-offs of Columnar vs. Record-Oriented Storage
• 1 GB of data on a dual-core Intel machine with 70 MB/s read bandwidth
• Columnar: 375 MB with light compression
• Record-oriented: same size with heavier compression
MR and Dremel
• Average term frequency is analyzed
• 3000 MapReduce workers and 3000 Dremel nodes
• 0.5 TB read with columnar storage, as opposed to 87 TB with record-oriented storage
• Overhead of launching and scheduling jobs and assembling records
Serving Tree Topology
• Queries with Different Numbers of Levels
• First query reads about 60 GB
• Second query reads about 180 GB
• 2-level is 1:2900, 3-level is 1:100:2900, 4-level is 1:10:100:2900
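The topology ratios translate directly into an average fan-out per level, which the short Python sketch below computes. It assumes children are spread evenly across the servers of the level above. The slides do not state this, and the function name is made up for the example.

```python
def fan_outs(topology):
    """Average number of children per server between consecutive levels."""
    return [child / parent for parent, child in zip(topology, topology[1:])]


two_level = fan_outs([1, 2900])
three_level = fan_outs([1, 100, 2900])
four_level = fan_outs([1, 10, 100, 2900])
```

Under that assumption, the root of the 2-level tree has to merge results from all 2900 leaf servers by itself, whereas in the 3- and 4-level trees no server merges more than 100 children.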
Per-Tablet Histogram
• Tablet processing rates are investigated
• 99% of tablets for the first query are done within 1 second
• 99% of tablets for the second query are done within 2 seconds
Within-Record Aggregation
• Effect of Nesting and Columnar Storage
• Only 13 GB is read due to columnar storage
• Without nesting, the query would be much more expensive
Scalability
• Top-20 aid values on 4.2 TB of compressed data
• 1000 to 4000 nodes
• Total CPU time is nearly identical, about 300K seconds
• Near-linear scalability
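A back-of-the-envelope check of what near-linear scaling means here: if total CPU time stays at roughly 300K seconds, the average CPU time per node should shrink in proportion to the node count. The Python sketch below does only that division. The intermediate node counts (2000 and 3000) are illustrative, and the results are idealized averages, not measured execution times.

```python
TOTAL_CPU_SECONDS = 300_000              # approximate figure from the slide


def cpu_seconds_per_node(total_cpu_seconds, nodes):
    """Average CPU seconds each node contributes if work is split evenly."""
    return total_cpu_seconds / nodes


per_node = {
    nodes: cpu_seconds_per_node(TOTAL_CPU_SECONDS, nodes)
    for nodes in (1000, 2000, 3000, 4000)
}
```

Quadrupling the nodes from 1000 to 4000 cuts the average per-node share from 300 to 75 CPU seconds. A system whose elapsed time tracks this share scales near-linearly.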
Stragglers
• Only two-way replication on T5
• 99% done in 5 seconds
• Less replication, more stragglers
Observations
• Scan-based queries can be executed at web scale
• Near-linear scalability is achievable
• MapReduce can benefit from columnar storage
• Parallel DBMSs can benefit from serving trees
• Record assembly and parsing are expensive
• MapReduce and Dremel can be used complementarily
Thank You for Your Patience
Questions & Comments
Algorithms