MapReduce VS Parallel DBMSs Presenter: Ran Ding
Outline • 1. Introduction • 2. Where MR wins • 3. DBMS “sweet spot” tests • 4. Why the Parallel DBMS wins • 5. Conclusion
Introduction - MR • The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access. • Example: Hadoop
Introduction - Parallel DBMS • Parallel DBMSs appeared in the mid-1980s. • The Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.
Introduction - Horizontal partitioning • Distributing the rows of a relational table across the nodes of the cluster so they can be processed in parallel.
Introduction - DBMS • One benefit is that the system automatically manages the alternative partitioning strategies for the tables involved in the query. • Examples: hash, range, and round-robin partitioning
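The three partitioning strategies named above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular DBMS implements them: the function names, the `boundaries` parameter, and the use of `zlib.crc32` as the hash function are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right

def hash_partition(rows, key, n_nodes):
    """Send each row to the node chosen by a stable hash of its key."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 is used only because it is stable across runs (assumption).
        node = zlib.crc32(str(row[key]).encode("utf-8")) % n_nodes
        parts[node].append(row)
    return parts

def range_partition(rows, key, boundaries):
    """Node i holds keys k with boundaries[i-1] <= k < boundaries[i]."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts

def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their content."""
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts
```

Hash and range partitioning keep rows with the same key on the same node, which helps joins and grouping; round-robin only balances the row counts.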
Introduction - Mapping parallel DBMS onto MapReduce • It is not easy! • UDFs (user-defined functions) help: Map-style logic can be written as a UDF. • Reduce is like GROUP BY in SQL.
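To make the GROUP BY analogy concrete, here is a small single-process sketch of a query like `SELECT dept, SUM(salary) ... GROUP BY dept` written in the Map/Reduce style. The record fields (`dept`, `salary`) and the helper `run_mapreduce` are hypothetical names for this example, and a real MR framework would distribute the shuffle step across nodes.

```python
from collections import defaultdict

def map_fn(record):
    # Emit (group key, value); plays the role of a per-row UDF.
    yield record["dept"], record["salary"]

def reduce_fn(key, values):
    # Aggregate all values of one key; plays the role of SUM under GROUP BY.
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # "shuffle": collect values by key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))
```

Calling `run_mapreduce(employees, map_fn, reduce_fn)` on a list of such records returns one total per department, as the SQL query would.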
Where MR wins • 1. ETL and “read once” data sets • 2. Complex analytics • 3. Semi-structured data • 4. Quick-and-dirty analyses • 5. Limited-budget operations
ETL and “read once” data sets • ETL: extract-transform-load • An MR system can be considered a general-purpose parallel ETL system. • For DBMSs, ETL is usually performed by separate ETL products upstream of the database
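The "read once" ETL pattern can be sketched as follows: read raw log lines, parse and clean them, then load the result into a back-end store. The log format, the table name `visits`, and the use of an in-memory SQLite database as the back end are assumptions made for this illustration only.

```python
import sqlite3

# Hypothetical raw log: "date url http_status", with one malformed line.
RAW_LOG = [
    "2009-01-01 /index.html 200",
    "malformed line",
    "2009-01-02 /about.html 404",
]

def extract_and_transform(lines):
    """Parse each line once, drop malformed ones, convert field types."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # cleaning step: skip lines that do not fit the format
        day, url, status = fields
        yield day, url, int(status)

def load(rows, conn):
    """Load the cleaned rows into the back-end table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits (day TEXT, url TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    conn.commit()
```

With `conn = sqlite3.connect(":memory:")`, calling `load(extract_and_transform(RAW_LOG), conn)` reads the raw data exactly once and leaves the two well-formed rows in the table.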
Complex analytics • Some analyses (for example, multi-pass data mining and clustering) cannot be structured as single SQL aggregate queries • MR is a good candidate for them
Semi-structured data • MR systems are a good fit when the data is being prepared for loading into a back-end system • A DBMS requires wide tables with many attributes • Plus, MR-style systems can easily store and process such data
Quick-and-dirty analyses • A DBMS needs the programmer to write the schema and then load the data • With MR, just copy the data!
Limited-budget operations • MR systems are mostly open source and free • Parallel DBMS: huge cost
Why the Parallel DBMS wins • 1. Repetitive record parsing • 2. Compression • 3. Pipelining • 4. Scheduling • 5. Column-oriented storage
Repetitive record parsing • In MR, each Map and Reduce task must repeatedly parse string fields and convert them into the appropriate type • In a DBMS, records are parsed once, when the data is initially loaded.
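The cost difference can be illustrated by counting parse operations. In this sketch, the record layout (`id,name,age`), the function names, and the `parse_calls` counter are all hypothetical; the point is only that the MR-style path re-parses the raw text on every query, while the DBMS-style path parses once at load time.

```python
RAW = ["1,alice,30", "2,bob,25", "3,carol,41"]
parse_calls = {"n": 0}  # counter used only to make the cost visible

def parse(line):
    """Convert one raw text record into typed fields."""
    parse_calls["n"] += 1
    ident, name, age = line.split(",")
    return int(ident), name, int(age)

def mr_style_query(raw_lines, min_age):
    # Every query starts from raw text, so every record is parsed again.
    return [name for _, name, age in map(parse, raw_lines) if age >= min_age]

def dbms_style_load(raw_lines):
    # Parsing happens once, when the data is loaded.
    return [parse(line) for line in raw_lines]

def dbms_style_query(table, min_age):
    # Queries run over already-typed records; no parsing here.
    return [name for _, name, age in table if age >= min_age]
```

Running two queries over three records costs six parse calls in the MR style, but only three (all at load time) in the DBMS style.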
Compression • The effect is hard to pin down • Commercial DBMSs may use carefully tuned compression algorithms
Pipelining • In a parallel DBMS, data is streamed from producer to consumer • The intermediate data is never written to disk • In an MR system, the producer writes its results to a local data structure, and consumers then read from it
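The two data-flow styles can be contrasted in a single process. This is only an analogy under stated assumptions: a Python generator stands in for the DBMS-style stream, and a temporary JSON-lines file stands in for the local data structure an MR producer writes. All function names are made up for the example.

```python
import json
import os
import tempfile

def producer(n):
    """Produce n intermediate records, one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * i}

def pipelined_sum(n):
    # DBMS-style: each record flows straight to the consumer;
    # no intermediate result is stored anywhere.
    return sum(rec["value"] for rec in producer(n))

def materialized_sum(n):
    # MR-style: the producer first writes everything to a local file,
    # and only then does the consumer read it back.
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    try:
        with os.fdopen(fd, "w") as out:
            for rec in producer(n):
                out.write(json.dumps(rec) + "\n")
        with open(path) as src:
            return sum(json.loads(line)["value"] for line in src)
    finally:
        os.remove(path)
```

Both functions return the same answer; the second pays for an extra write and read of the whole intermediate result, which is the overhead the slide refers to.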
Scheduling • In a parallel DBMS, every node knows in advance what it should do, because the query plan is fixed at compile time • In an MR system, tasks are scheduled on processing nodes one storage block at a time, at runtime.
Column-oriented storage • Vertica is a column store • It reads only the attributes necessary for solving the user query • DBMS-X and Hadoop are both row stores
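A toy comparison shows why reading only the needed attributes matters. The sample data, the field names, and the "fields read" count are assumptions for illustration; real column stores such as Vertica work on compressed on-disk columns, not Python lists.

```python
ROWS = [
    {"id": 1, "url": "/a", "revenue": 10.0},
    {"id": 2, "url": "/b", "revenue": 2.5},
    {"id": 3, "url": "/a", "revenue": 4.0},
]

def to_columns(rows):
    """Turn a list of records into one list per attribute."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def total_revenue_row_store(rows):
    # A row store has to touch whole records, so all fields are read.
    fields_read = sum(len(row) for row in rows)
    return sum(row["revenue"] for row in rows), fields_read

def total_revenue_column_store(columns):
    # A column store reads just the one attribute the query needs.
    col = columns["revenue"]
    return sum(col), len(col)
```

For this three-attribute table, the row-store version reads nine fields to answer the query and the column-store version reads three; the gap grows with the number of attributes the query does not use.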
What MR should learn from parallel DBMSs • MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.
Conclusion • MR systems are powerful tools for ETL-style applications and for complex analytics. • If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.