HadoopDB

HadoopDB Inneke Ponet

Introduction • Technologies for data analysis • HadoopDB • Desired properties • Layers of HadoopDB • HadoopDB Components

Introduction • More and more data needs to be stored and processed. • People want to do more and more complex calculations on their collected data. • Analytical databases on high-end machines are moving towards cheaper lower-end machines. • The analytical database market is 27% of the database software market and is growing at a rate of 10.3% annually.

Technologies for data analysis Parallel databases: • good performance, • good efficiency. MapReduce-based systems: • superior scalability, • good fault tolerance, • good flexibility to handle unstructured data.

Parallel databases • Support for standard relational tables and SQL. • Implement techniques for better performance: • Indexing, compression, materialized views, result caching, I/O sharing. • Data is partitioned (shared-nothing architecture), transparent to the end user.

Shared-nothing architecture The DBMSs of most analytical databases are deployed on a shared-nothing architecture: • A collection of machines that • are independent, • are possibly virtual, • have their own local disk and local main memory, • are connected by a high-speed network. Scalability of machines. Analysis tasks are easy to parallelize.

MapReduce A technology from Google: • processes (un)structured data that is distributed over many nodes in a shared-nothing cluster; • works at enormous scale. Map and Reduce: parallel without communicating; Map-repartition-Reduce cycles.
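The Map-repartition-Reduce cycle on the slide above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop's API: the function names (`map_phase`, `repartition`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to each record independently and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def repartition(pairs):
    """Group values by key, standing in for the shuffle between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its grouped values independently."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical example task: count words across lines of text.
def word_count_map(line):
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    return sum(counts)

def run_job(records):
    pairs = map_phase(records, word_count_map)
    return reduce_phase(repartition(pairs), word_count_reduce)
```

Because each call to `word_count_map` and `word_count_reduce` touches only its own input, a real cluster can run them on different nodes without the tasks communicating with each other; only the repartition step moves data.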

MapReduce: advantages No detailed query execution plan is fixed in advance; scheduling happens at runtime: it adjusts to node failures and slow nodes by (re)assigning tasks to faster nodes. Checkpoints the output to local disk, minimizing the amount of work that must be redone in case of a failure.

HadoopDB Hybrid database: a combination of: • traditional DBMS, • MapReduce technology. Developed by Yale University students: Azza Abouzeid and Kamil Bajda-Pawlikowski. It is free and open source.

Desired properties • Performance • Fault tolerance • Heterogeneous environment • Flexible query interface • Scalability

A. Performance • Primary characteristic used to distinguish systems. • MapReduce: misses the benefits of first modeling and loading data before processing => slower performance than parallel databases. • Cost saving: a faster software product is cheaper than a hardware upgrade or buying additional hardware.

B. Fault tolerance • Successfully commit transactions. • Make progress on a workload. • Heterogeneity and scalability => more faults, BUT MapReduce has good fault tolerance: • reassigning tasks; • sub-tasks minimize the effect of faults. • Parallel databases: assumption that failures are rare; more testing => slower performance.

C. Heterogeneous environment • Nodes don't always run on • identical hardware, • an identical virtual machine. Different performance. • Parallel databases: not tested on more than 100 nodes.

D. Flexible query interface • Easy to make queries: SQL and non-SQL interface languages, use of tools. • Robust mechanism for writing UDFs. • Parallel databases: SQL, ODBC and UDFs. • MapReduce-based systems: it is possible (Hive), but not always (Hadoop).

E. Scalability Traditional DBMS: • only scalable to 100 nodes. MapReduce-based systems: • designed to scale to thousands of nodes in a shared-nothing architecture.

Desired properties

Layers of HadoopDB • Communication: Hadoop • Database: PostgreSQL • Translation: Hive

Hadoop • Communication layer of HadoopDB. • The Hadoop framework has two layers: • Hadoop Distributed File System (HDFS), • MapReduce framework. Cost: free/open source MapReduce.

PostgreSQL • Relational DBMS. • (Possible) database layer of HadoopDB.  Cost: free/open source.

Hive • Translation layer. • Processing of a SQL query: • Query => Abstract Syntax Tree. • MetaStore: schema of the table(s). • Logical query plan: DAG of relational operators. • Optimized plan. • Physical executable plan: MapReduce job(s). • XML plan: DAG serialized. • Hive Driver executes a Hadoop job.

HadoopDB Components • Database Connector: • Interface between independent database systems; • Extends the InputFormat class (of Hadoop); • Connects to any JDBC-compliant database. • Catalog: • Meta-information about the databases: • connection parameters, • metadata. • XML file in HDFS accessed by: • Master node, • Worker/Slave nodes.

HadoopDB Components (2) • Data loader: • Global hasher: • Custom MapReduce job => files in HDFS; • Repartitioning data upon loading. • Local hasher: • Copies a partition from HDFS to the local file system; • Partitions the file into smaller-sized chunks.
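The two hashing stages of the data loader can be illustrated with a small in-memory sketch. It is an assumption-laden illustration rather than HadoopDB's loader: the names `stable_hash`, `global_hash`, and `local_hash`, the use of CRC32, and the salt strings are all hypothetical, and real partitions are files rather than Python lists.

```python
import zlib

def stable_hash(key, salt=""):
    """Deterministic hash of a key (built-in hash() of str varies between runs)."""
    return zlib.crc32((salt + str(key)).encode("utf-8"))

def global_hash(rows, key_index, num_nodes):
    """Global hasher: repartition all rows into one partition per node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        node = stable_hash(row[key_index], salt="global:") % num_nodes
        partitions[node].append(row)
    return partitions

def local_hash(partition, key_index, num_chunks):
    """Local hasher: split one node's partition into smaller chunks.

    A different salt is used so that chunk assignment is not correlated
    with the node assignment made by global_hash.
    """
    chunks = [[] for _ in range(num_chunks)]
    for row in partition:
        chunk = stable_hash(row[key_index], salt="local:") % num_chunks
        chunks[chunk].append(row)
    return chunks
```

Rows with the same key always land in the same node partition and the same chunk, which is what allows later operations on that key to run locally.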

HadoopDB Components (3) • SQL to MapReduce: • Parallel database front-end to process SQL queries. HiveQL ↓ Transform MapReduce jobs: • Connect to tables stored in HDFS; • Consist of DAGs of relational operators that operate as iterators. • Assumption: no collocation of tables: • Operations on multiple tables => Reduce function. NOT in HadoopDB: a join operation can be pushed to the database layer.

HadoopDB Components (4) • SQL/SMS planner: • Modifies Hive: • Updates the MetaStore • Two passes over the physical plan: • Determine the partition keys for the ReduceSink Operators. • Operators are: • converted into SQL queries; • pushed into the database layer. • Only filter, select and aggregation operators.
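The idea of pushing filter and aggregation operators into the database layer can be sketched as follows. This is a hedged illustration, not the SMS planner: in-memory SQLite databases stand in for the node-local PostgreSQL instances, and the `sales` table, its columns, and the function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create an in-memory database holding one partition (stand-in for a node-local DBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

# Filter and aggregation are expressed as SQL and executed inside each node's database.
PUSHED_SQL = "SELECT region, SUM(amount) FROM sales WHERE amount > ? GROUP BY region"

def map_task(conn, min_amount):
    """Map side: run the pushed-down query and emit (region, partial_sum) pairs."""
    return conn.execute(PUSHED_SQL, (min_amount,)).fetchall()

def reduce_task(partials):
    """Reduce side: merge the partial sums produced by the nodes."""
    totals = defaultdict(int)
    for region, subtotal in partials:
        totals[region] += subtotal
    return dict(totals)

def run_query(nodes, min_amount):
    partials = []
    for conn in nodes:
        partials.extend(map_task(conn, min_amount))
    return reduce_task(partials)
```

In this sketch the data is not partitioned on `region`, so a Reduce step is still needed to merge the per-node sums; the benefit is that each node ships only a few aggregated rows instead of its raw data.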

HadoopDB Components (5)

Questions?
