HadoopDB Inneke Ponet
Introduction • Technologies for data analysis • HadoopDB • Desired properties • Layers of HadoopDB • HadoopDB Components
Introduction • More and more data needs to be stored and processed. • People want to do more and more complex calculations on their collected data. • Analytical databases are moving from high-end machines towards cheaper, lower-end machines. • The analytical database market is 27% of the database software market and is growing at a rate of 10.3% annually.
Technologies for data analysis Parallel databases: • good performance, • good efficiency. MapReduce-based systems: • superior scalability, • good fault tolerance, • good flexibility to handle unstructured data.
Parallel databases • Support for standard relational tables and SQL. • Implement techniques for better performance: • indexing, compression, materialized views, result caching, I/O sharing. • Data is partitioned (shared-nothing architecture), transparently to the end user.
Shared-nothing architecture The DBMSs of most analytical databases are deployed on a shared-nothing architecture: • A collection of machines that • are independent, • are possibly virtual, • have their own local disk and local main memory, • are connected by a high-speed network. Scales by adding machines. Analysis tasks are easy to parallelize.
MapReduce A technology from Google: • processes (un)structured data that is distributed over many nodes in a shared-nothing cluster; • works at enormous scale. Map and Reduce tasks: run in parallel without communicating with each other; Map-repartition-Reduce cycles.
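The Map-repartition-Reduce cycle can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names (`map_phase`, `repartition`, `reduce_phase`, `run_word_count`) and the word-count task are assumptions chosen for illustration only.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to each record independently; each call emits (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def repartition(pairs, num_reducers):
    """Route every pair to one reducer by hashing its key, grouping values per key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partitions, reduce_fn):
    """Each partition is reduced on its own; no partition needs data from another."""
    result = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = reduce_fn(key, values)
    return result

def word_count_map(line):
    # Illustrative Map function: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def word_count_reduce(key, values):
    # Illustrative Reduce function: sum the counts collected for one word.
    return sum(values)

def run_word_count(lines, num_reducers=2):
    pairs = map_phase(lines, word_count_map)
    partitions = repartition(pairs, num_reducers)
    return reduce_phase(partitions, word_count_reduce)
```

Because all values for a given key land in the same partition, each Reduce call sees the complete list for its key, which is why the Map and Reduce tasks do not have to communicate with each other.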
MapReduce: advantages No detailed query execution plan is fixed in advance; tasks are scheduled at runtime: • adjusts to node failures and slow nodes by (re)assigning tasks to faster nodes. • Checkpoints the output to local disk, minimizing the work that has to be redone in case of a failure.
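Runtime scheduling with reassignment can be sketched as a small simulation. The failure model here (a worker in `failing_workers` dies on the first task it receives) and the name `run_tasks` are assumptions for illustration, not Hadoop's actual scheduler.

```python
from collections import deque

def run_tasks(tasks, workers, failing_workers):
    """Assign tasks to workers at runtime and re-queue a task when its worker fails.

    tasks: list of task ids.
    workers: list of worker names, used round-robin.
    failing_workers: set of worker names that fail on the first task they get
                     (hypothetical failure model).
    Returns a dict mapping each task id to the worker that completed it.
    """
    pending = deque(tasks)
    alive = deque(workers)
    completed = {}
    while pending:
        if not alive:
            raise RuntimeError("no workers left to run the remaining tasks")
        task = pending.popleft()
        worker = alive.popleft()
        if worker in failing_workers:
            # The worker is dropped and only this one task is put back in the queue.
            pending.append(task)
            continue
        completed[task] = worker
        alive.append(worker)
    return completed
```

Tasks that already finished stay in `completed`, so only the task that was running on the failed worker is repeated.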
HadoopDB Hybrid database: a combination of: • traditional DBMS, • MapReduce technology. Developed by Yale University students: Azza Abouzeid and Kamil Bajda-Pawlikowski. It is free and open source.
Desired properties • Performance • Fault tolerance • Heterogeneous environment • Flexible query interface • Scalability
A. Performance • Primary characteristic by which systems distinguish themselves. • MapReduce: first modeling and loading data before processing => slower performance than parallel databases. • Cost saving: a faster software product is cheaper than a hardware upgrade or buying additional hardware.
B. Fault tolerance • Successfully commit transactions. • Make progress on a workload. • Heterogeneity and scalability => more faults, BUT MapReduce has good fault tolerance: • reassigning tasks; • sub-tasks minimize the effect of faults. • Parallel databases: assumption that failures are rare; more testing => slower performance.
C. Heterogeneous environment • Nodes don't always run on • identical hardware, • an identical virtual machine. Different performance. • Parallel databases: not tested on more than 100 nodes.
D. Flexible query interface • Easy to write queries: SQL and non-SQL interface languages, use of tools. • Robust mechanism for writing UDFs. • Parallel databases: SQL, ODBC and UDFs. • MapReduce-based systems: possible (Hive), but not always (Hadoop).
E. Scalability Traditional DBMS: • only scalable to 100 nodes. MapReduce-based systems: • designed to scale to thousands of nodes in a shared-nothing architecture.
Layers of HadoopDB • Communication: Hadoop • Database: PostgreSQL • Translation: Hive
Hadoop • Communication layer of HadoopDB. • The Hadoop framework has two layers: • Hadoop Distributed File System (HDFS), • MapReduce framework. Cost: free/open source MapReduce.
PostgreSQL • Relational DBMS. • (Possible) database layer of HadoopDB. Cost: free/open source.
Hive • Translation layer. • Processing of a SQL query: • Query Abstract Syntax Tree. • MetaStore: schema of the table(s). • Logical query plan: DAG of relational operators. • Optimized plan. • Physical executable plan: MapReduce job(s). • XML plan: serialized DAG. • Hive Driver executes a Hadoop job.
HadoopDB components • Database Connector: • Interface between the independent database systems and Hadoop; • Extends the InputFormat class (of Hadoop); • Connects to any JDBC-compliant database. • Catalog: • Meta-information about the databases: • connection parameters, • metadata. • XML file in HDFS accessed by: • Master node, • Worker/Slave nodes.
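The role of the Database Connector, handing the rows of a SQL query to a Map task as key-value pairs, can be sketched in Python. This is only an analogy: the real connector is a Java class extending Hadoop's InputFormat and talking to the database through JDBC. Here Python's standard `sqlite3` module stands in for a node-local database, and `read_records`, `demo`, and the `sales` table are hypothetical names.

```python
import sqlite3

def read_records(connection, sql, params=()):
    """Run a SQL query on one node-local database and yield (row_number, row) pairs.

    The pairs play the role of the key-value records an input format
    hands to a Map task.
    """
    cursor = connection.execute(sql, params)
    for row_number, row in enumerate(cursor):
        yield row_number, row

def demo():
    # In-memory database standing in for the database on a single worker node.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10), ("west", 5), ("east", 7)],
    )
    query = "SELECT region, amount FROM sales WHERE amount > ? ORDER BY amount"
    records = list(read_records(conn, query, (6,)))
    conn.close()
    return records
```

In this sketch the connection details are hard-coded; in HadoopDB they would come from the catalog.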
HadoopDB Components (2) • Data loader: • Global hasher: • Custom MapReduce job over files in HDFS; • Repartitions data upon loading. • Local hasher: • Copies a partition from HDFS to the local file system; • Partitions the file into smaller-sized chunks.
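The two hashing steps of the data loader can be sketched as follows. This is an in-memory illustration under stated assumptions: the function names, the CRC32-based hash, and the salt used to decorrelate the two levels are all choices made for this example and are not taken from HadoopDB.

```python
import zlib

def stable_hash(key, salt=""):
    """Deterministic hash of a key; the salt yields independent hash functions."""
    return zlib.crc32((salt + str(key)).encode("utf-8"))

def global_hash(rows, key_fn, num_nodes):
    """Global step: split all rows into one partition per node, by partition key."""
    node_partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        node_partitions[stable_hash(key_fn(row)) % num_nodes].append(row)
    return node_partitions

def local_hash(node_rows, key_fn, num_chunks):
    """Local step: split one node's partition into smaller chunks.

    A different salt is used so that chunk choice is not tied to node choice.
    """
    chunks = [[] for _ in range(num_chunks)]
    for row in node_rows:
        chunks[stable_hash(key_fn(row), salt="local:") % num_chunks].append(row)
    return chunks
```

Rows with the same partition key always end up on the same node and in the same chunk, which is what makes it possible to process them locally later on.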
HadoopDB Components (3) • SQL to MapReduce: • Parallel database front-end to process SQL queries. HiveQL ↓ transformed into MapReduce jobs: • Connect to tables stored in HDFS; • Consist of DAGs of relational operators that operate as iterators. • Assumption: no collocation of tables: • Operations on multiple tables happen in the Reduce function. NOT in HadoopDB: a join operation can be pushed to the database layer.
HadoopDB Components (4) • SQL/SMS planner: • Modifies Hive: • Updates the MetaStore. • Two passes over the physical plan: • Determine the partition keys for the ReduceSink operators. • Operators are: • converted into SQL query(ies); • pushed into the database layer. • Only filter, select and aggregation operators.
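The push-down step can be sketched as splitting a plan into a prefix that becomes a SQL query and a remainder that stays in MapReduce. The operator representation (dicts with a `type` field), the names `split_plan` and `to_sql`, and the rule that an aggregation ends the pushed prefix are assumptions made for this sketch; the actual planner works on Hive's physical plan and also handles partition keys, which is not modelled here.

```python
# Operator types that this sketch treats as safe to run inside the database.
PUSHABLE = {"filter", "select", "aggregate"}

def split_plan(operators):
    """Split a bottom-up operator list into (pushed, remaining).

    Pushing stops at the first non-pushable operator, or right after an
    aggregation (a simplification so that no HAVING clause is needed).
    """
    pushed = []
    for index, op in enumerate(operators):
        if op["type"] not in PUSHABLE:
            return pushed, operators[index:]
        pushed.append(op)
        if op["type"] == "aggregate":
            return pushed, operators[index + 1:]
    return pushed, []

def to_sql(table, pushed):
    """Turn the pushed operators into a single SQL query string."""
    columns = "*"
    where = []
    group_by = ""
    for op in pushed:
        if op["type"] == "filter":
            where.append(op["predicate"])
        elif op["type"] == "select":
            columns = ", ".join(op["columns"])
        elif op["type"] == "aggregate":
            columns = ", ".join(op["group_by"] + [op["expression"]])
            group_by = ", ".join(op["group_by"])
    sql = "SELECT " + columns + " FROM " + table
    if where:
        sql += " WHERE " + " AND ".join(where)
    if group_by:
        sql += " GROUP BY " + group_by
    return sql
```

For a plan consisting of a filter, an aggregation and a join, the filter and aggregation become one SQL query for the database layer, and the join is left for the MapReduce side.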