HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads • Submitted to: Dr. Amer Badarneh • By: Muneera AL_Quraan
Outline • Reasons for the data explosion • Shared-nothing architecture • Parallel databases • MapReduce systems • The Idea of HadoopDB • Desired properties • Hadoop Implementation Background • The Architecture of HadoopDB • HadoopDB’s Components • HadoopDB’s Evaluation
Reasons for the data explosion • The amount of data that analytical database systems must store and process is exploding because of: • increased automation (more business processes are becoming digitized) • the proliferation of sensors and data-producing devices • Web-scale interactions with customers • government compliance demands, along with strategic corporate initiatives, that require more historical data to be kept online for analysis
Shared-nothing architecture • Because of the exploding data problem, the analytical database start-ups discussed in the paper deploy their DBMSs on a shared-nothing architecture: • a collection of independent (possibly virtual) machines • each with local disk and local main memory • connected together on a high-speed network • In the paper, analytical DBMSs deployed on a shared-nothing architecture are referred to as parallel databases.
Parallel databases • The performance and efficiency of parallel databases make them well suited to big data analysis. • Parallel databases have been shown to scale well into the tens of nodes. • Reasons why parallel databases do not scale well into the hundreds of nodes: • failures become increasingly common as more nodes are added to a system • pure homogeneity is nearly impossible to achieve at scale • parallel databases have not been tested at larger scales
MapReduce systems • Hadoop is one MapReduce system. • MapReduce is a good choice for big data analysis because of its: • Scalability (proven in Google’s internal operations) • Fault tolerance • Flexibility to handle unstructured data • Cost (open source)
The Idea of HadoopDB • Combine the scalability advantages of MapReduce with the performance and efficiency advantages of parallel databases to build a hybrid system that: • is well suited to the analytical DBMS market • can handle the future demands of data-intensive applications
Desired properties • Performance • Fault tolerance • a query does not have to be restarted if one of the nodes involved in processing it fails • Ability to run in a heterogeneous environment • Flexible query interface • a robust mechanism that lets the user write user-defined functions (UDFs) and queries • The goal of HadoopDB is to achieve all of the desired properties.
Hadoop Implementation Background • Hadoop consists of two layers: • a data storage layer, the Hadoop Distributed File System (HDFS) • a data processing layer, the MapReduce framework
Hadoop Implementation Background • HDFS: • A block-structured file system managed by a central NameNode. • Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. • The NameNode maintains metadata about the size and location of blocks and their replicas.
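The block bookkeeping described on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: `ToyNameNode`, its tiny `block_size`, and the round-robin replica placement are all assumptions made for illustration (real HDFS uses far larger blocks and a rack-aware placement policy).

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for the NameNode: stores block metadata only."""
    data_nodes: list
    block_size: int = 4      # toy value so the example stays readable
    replication: int = 2     # replicas per block (assumed for this sketch)
    # file name -> list of (block_index, block_length, [replica node names])
    block_map: dict = field(default_factory=dict)

    def add_file(self, name, size):
        """Split a file of `size` units into fixed-size blocks and record replicas."""
        blocks = []
        offset, index = 0, 0
        while offset < size:
            length = min(self.block_size, size - offset)  # last block may be short
            # Simplified round-robin placement across DataNodes.
            replicas = [
                self.data_nodes[(index + r) % len(self.data_nodes)]
                for r in range(self.replication)
            ]
            blocks.append((index, length, replicas))
            offset += length
            index += 1
        self.block_map[name] = blocks
        return blocks


name_node = ToyNameNode(data_nodes=["dn1", "dn2", "dn3"])
layout = name_node.add_file("visits.log", size=10)
for block in layout:
    print(block)
```

The point of the sketch is the division of labour: the DataNodes would hold the bytes, while the NameNode only needs the size and location of each block and its replicas.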
Hadoop Implementation Background • The MapReduce Framework follows a simple master-slave architecture. • The master is a single JobTracker • and the slaves or worker nodes are TaskTrackers. • The JobTracker • handles the runtime scheduling of MapReduce jobs • and maintains information on each TaskTracker’s load and available resources.
Hadoop Implementation Background • TaskTrackers • regularly update the JobTracker with their status • InputFormat library • represents the interface between the storage and processing layers
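The interaction between the JobTracker and the TaskTrackers can be illustrated with a small Python sketch. It assumes a deliberately simple policy (give each task to the tracker that last reported the most free slots); the class `ToyJobTracker`, its methods, and that policy are hypothetical and are not Hadoop's actual scheduler.

```python
class ToyJobTracker:
    """Hypothetical master: tracks reported capacity and assigns tasks."""

    def __init__(self):
        # tracker name -> free task slots, as of the most recent status update
        self.free_slots = {}

    def heartbeat(self, tracker, free_slots):
        """Called by a TaskTracker to report its current status."""
        self.free_slots[tracker] = free_slots

    def schedule(self, tasks):
        """Assign tasks greedily to the least-loaded tracker (assumed policy)."""
        assignments = {}
        for task in tasks:
            candidates = sorted(t for t, free in self.free_slots.items() if free > 0)
            if not candidates:
                break  # no capacity left; remaining tasks wait for later updates
            # Most free slots wins; ties fall to the alphabetically first tracker.
            tracker = min(candidates, key=lambda t: -self.free_slots[t])
            assignments[task] = tracker
            self.free_slots[tracker] -= 1
        return assignments


job_tracker = ToyJobTracker()
job_tracker.heartbeat("tt1", free_slots=2)
job_tracker.heartbeat("tt2", free_slots=1)
assignments = job_tracker.schedule(["map-0", "map-1", "map-2", "map-3"])
print(assignments)
```

With three free slots in total, the fourth task stays unassigned until a later status update reports new capacity.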
HadoopDB’s Components • Database Connector: • the interface between the independent database systems residing on nodes in the cluster and the TaskTrackers • It extends Hadoop’s InputFormat class • and is part of the InputFormat Implementations library • The Connector connects to the database, executes the SQL query, and returns the results as key-value pairs.
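The Connector's job (connect, run a SQL query, hand back key-value pairs) can be sketched in Python. This is only an illustration of the idea: SQLite stands in for a node-local database, and `read_as_key_value`, the `rankings` table, and its columns are assumptions of the sketch rather than HadoopDB's Java implementation.

```python
import sqlite3


def read_as_key_value(connection, sql, key_column=0):
    """Run a SQL query and yield (key, row) pairs, as an input reader might."""
    cursor = connection.execute(sql)
    for row in cursor:
        # The chosen column becomes the key; the whole row is the value.
        yield row[key_column], row


# In-memory SQLite database standing in for one node's local DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (page_url TEXT, page_rank INTEGER)")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("a.com", 12), ("b.com", 3), ("c.com", 40)],
)

pairs = list(
    read_as_key_value(
        conn,
        "SELECT page_url, page_rank FROM rankings "
        "WHERE page_rank > 10 ORDER BY page_url",
    )
)
print(pairs)
```

Because the filter runs inside the database, only matching rows reach the processing layer as key-value pairs.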
HadoopDB’s Components • Catalog: • maintains meta-information about the databases: • connection parameters such as database location, driver class, and credentials • metadata such as the data sets contained in the cluster, replica locations, and data partitioning properties
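As a rough picture of what such a catalog holds, the Python sketch below models one entry per database chunk. Every field name, host name, and the driver class string is hypothetical and chosen only to mirror the categories listed on the slide; it does not reproduce HadoopDB's actual catalog format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record for one chunk of a data set."""
    data_set: str             # data set contained in the cluster
    partition_key: str        # data partitioning property
    chunk_id: int
    primary_host: str         # database location
    driver_class: str         # connection parameter (made-up value below)
    user: str                 # credential (password omitted in this sketch)
    replica_hosts: tuple      # replica locations


catalog = [
    CatalogEntry("user_visits", "source_ip", 0, "node1",
                 "org.example.Driver", "analyst", ("node2",)),
    CatalogEntry("user_visits", "source_ip", 1, "node2",
                 "org.example.Driver", "analyst", ("node3",)),
]


def locations_for(entries, data_set, chunk_id):
    """Return every host that holds the given chunk, primary first."""
    for entry in entries:
        if entry.data_set == data_set and entry.chunk_id == chunk_id:
            return [entry.primary_host, *entry.replica_hosts]
    return []


print(locations_for(catalog, "user_visits", 1))
```

A planner or scheduler could use a lookup like `locations_for` to decide which node should run the work for a given chunk.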
HadoopDB’s Components • Data Loader: • responsible for: • globally repartitioning data on a given partition key upon loading • breaking apart single-node data into multiple smaller partitions or chunks • bulk-loading the single-node databases with the chunks • consists of two main components: • Global Hasher and Local Hasher • the two use different hash functions to ensure chunks are of a uniform size
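The two-stage partitioning can be sketched in Python as follows. The sketch assumes a salted MD5 hash and a fixed number of chunks per node; `stable_hash`, `global_hash`, `local_hash`, and those parameters are illustrative names, not HadoopDB's code. The two stages use different salts so that the second split is not correlated with the first.

```python
import hashlib


def stable_hash(key, salt):
    """Deterministic hash of a key; the salt selects the hash function."""
    digest = hashlib.md5(f"{salt}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def global_hash(records, key_of, num_nodes):
    """Stage 1: repartition all records across nodes by partition key."""
    partitions = [[] for _ in range(num_nodes)]
    for record in records:
        partitions[stable_hash(key_of(record), "global") % num_nodes].append(record)
    return partitions


def local_hash(partition, key_of, num_chunks):
    """Stage 2: split one node's partition into smaller chunks."""
    chunks = [[] for _ in range(num_chunks)]
    for record in partition:
        chunks[stable_hash(key_of(record), "local") % num_chunks].append(record)
    return chunks


records = [(f"user{i}", i) for i in range(100)]
key_of = lambda record: record[0]

node_partitions = global_hash(records, key_of, num_nodes=4)
node_chunks = [local_hash(p, key_of, num_chunks=3) for p in node_partitions]

for node, chunks in enumerate(node_chunks):
    print(node, [len(c) for c in chunks])
```

Each resulting chunk would then be bulk-loaded into the single-node database on its node.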
HadoopDB’s Components • SQL to MapReduce to SQL (SMS) Planner: • The SMS planner extends Hive. • Hive transforms HiveQL, a variant of SQL, into MapReduce jobs that connect to tables stored as files in HDFS.
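To make the SQL-to-MapReduce idea concrete, here is a minimal Python sketch of how a grouped aggregate such as `SELECT source_ip, SUM(ad_revenue) FROM user_visits GROUP BY source_ip` breaks down into map, shuffle, and reduce steps. The query, column names, and function names are illustrative assumptions; this is not the plan that Hive or the SMS planner actually generates.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (group key, value) for every input row."""
    for source_ip, ad_revenue in rows:
        yield source_ip, ad_revenue


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate (SUM) to each group."""
    return {key: sum(values) for key, values in groups.items()}


rows = [("10.0.0.1", 5), ("10.0.0.2", 7), ("10.0.0.1", 3)]
totals = reduce_phase(shuffle(map_phase(rows)))
print(totals)
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reduce step.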
HadoopDB’s Evaluation • Compares the performance of: • Hadoop • HadoopDB • Vertica • DBMS-X
HadoopDB’s Evaluation • The 5 benchmark tasks: • The first task operates on unstructured data • Grep Task • Three tasks operate on structured data • Selection Task • Aggregation Task • Join Task • The final task operates on both structured and unstructured data • UDF Aggregation Task
HadoopDB’s Evaluation • The datasets used in these experiments: • UserVisits table (20GB) • Documents table (1GB) per node • Rankings table
HadoopDB’s Evaluation • Summary of Results: • HadoopDB is able to approach the performance of the parallel database systems • HadoopDB consistently outperforms Hadoop • HadoopDB’s load time is about 10 times longer than Hadoop’s