100 likes | 498 Views
Integrating Hadoop and Parallel DBMS. Yu Xu , Pekka Kostamaa , Like Gao - Abhishek Kamurthi. Introduction. Now-a-days Distributed File Systems are being widely used. Google started GFS and MapReduce . Yahoo uses Hadoop . Microsoft has Dryad and Scope. Parallel DBMS vs Hadoop.
E N D
Integrating Hadoop and Parallel DBMS Yu Xu, PekkaKostamaa, Like Gao -AbhishekKamurthi
Introduction • Now-a-days Distributed File Systems are being widely used. • Google started GFS and MapReduce. • Yahoo uses Hadoop. • Microsoft has Dryad and Scope.
Parallel DBMS vsHadoop • Slow to load very high volume data into an RDBMS • Fast execution of queries • Easy to write SQL for complex BI analysis • Expensive • HDFS has reliability and quick load time • 2-3 times slower in execution of queries • Difficult to write MapReduce programs • Low Cost
Three interfaces developed for integration • DirectLoad– Load Hadoop data into EDW • TeradataInputFormat – EDW data in MapReduce programs • Table UDF – Retrieve Hadoop data as a table
Parallel Loading of Hadoop Data to Teradata EDW Teradata EDW 4 3 2 1 Hadoop AMP Direct Load NameNode 1 Hadoop API AMP Direct Load G A T E W A Y 2 Hadoop API DataNode Hadoop API Direct Load 3 AMP Hadoop API DataNode Hadoop API 4 Direct Load AMP Hadoop API 5 AMP Direct Load 6 AMP Direct Load AMP- Access Module Processors
Retrieving EDW Data from MapReducePrograms MapReduce Program Teradata EDW 1 2 3 TeradataInputFormat CREATE TABLE T AS (Q) WITH DATA PRIMARY INDEX ( c1 ) PARTITION BY (c2 MOD M) + 1 Q AMP AMP #(c1) SELECT * FROM T WHERE PARTITION = i Mapper 1 AMP SELECT * FROM T WHERE PARTITION = i Mapper 2 AMP AMP SELECT * FROM T WHERE PARTITION = i Mapper M AMP
Accessing Hadoop Data From SQLvia Table UDF A table UDF (User Defined Function) named HDFSUDF which pulls data from Hadoop to Teradata EDW INSERT INTO Tab1 SELECT * FROM TABLE (HDFSUDF(‘mydfsfile.txt’)) AS T1;
Contd… Teradata EDW 4 3 2 1 5 Hadoop # AMP Size(mydfsfile.txt) AMP HDFSUDF NameNode 1 AMP HDFSUDF DataNode HDFSUDF AMP DataNode HDFSUDF AMP AMP HDFSUDF AMP HDFSUDF
Future work • Moving more computation from Hadoop to Teradata EDW or from Teradata EDW to Hadoop.