80 likes | 382 Views
Integrating Hadoop and Parallel DBMS. Yu Xu , Pekka Kostamaa , Like Gao Teradata. Presentation By Abhishek Biswas Prepared for CS795 Distributed Systems Date 4/11/2011. Parallel DBMS vs Hadoop. Slow to load very high volume data into an RDBMS Fast execution of queries
E N D
Integrating Hadoop and Parallel DBMS Yu Xu, PekkaKostamaa, Like Gao Teradata Presentation By Abhishek Biswas Prepared for CS795 Distributed Systems Date 4/11/2011
Parallel DBMS vs Hadoop • Slow to load very high volume data into an RDBMS • Fast execution of queries • Easy to write SQL for complex BI analysis • Expensive • HDFS has reliability and quick load time • 2-3 times slower in execution of queries • Difficult to write MapReduce programs • Low Cost
Efficient Integration ofHadoop and Teradata EDW • Motivation • Use Hadoop DFS to store extreme large volumes of data not needed immediately • Use parallel RDBMS to store vital information • Use a parallel DBMS like Teradata Enterprise Data Warehouse(EDW) for performance and more BI functionality on both data stores • Three interfaces developed for integration • DirectLoad – Load Hadoop data into EDW • TeradataInputFormat – EDW data in MapReduce programs • Table UDF – Retrieve Hadoop data as a table
Parallel Loading of Hadoop Data to Teradata EDW Teradata EDW 4 3 2 1 Hadoop AMP Direct Load NameNode 1 Hadoop API AMP Direct Load G A T E W A Y 2 Hadoop API DataNode Hadoop API Direct Load 3 AMP Hadoop API DataNode Hadoop API 4 Direct Load AMP Hadoop API 5 AMP Direct Load 6 AMP Direct Load AMP- Access Module Processors
Retrieving EDW Data from MapReducePrograms MapReduce Program Teradata EDW 1 2 3 TeradataInputFormat CREATE TABLE T AS (Q) WITH DATA PRIMARY INDEX ( c1 ) PARTITION BY (c2 MOD M) + 1 Q AMP AMP #(c1) SELECT * FROM T WHERE PARTITION = i Mapper 1 AMP SELECT * FROM T WHERE PARTITION = i Mapper 2 AMP AMP SELECT * FROM T WHERE PARTITION = i Mapper M AMP
Accessing Hadoop Data From SQLvia Table UDF • A table UDF (User Defined Function) named HDFSUDF which pulls data from Hadoop to Teradata EDW • INSERT INTO Tab1 SELECT * FROM TABLE (HDFSUDF(‘’)) AS T1;
Contd… Teradata EDW 4 3 2 1 5 Hadoop # AMP Size(mydfsfile.txt) AMP HDFSUDF NameNode 1 AMP HDFSUDF DataNode HDFSUDF AMP DataNode HDFSUDF AMP AMP HDFSUDF AMP HDFSUDF