HadoopDB : An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

HadoopDB:An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Submitted to: Dr. AmerBadarneh By : MuneeraAL_Quraan

Outlines • Reasons for data exploding • Shared-nothing architecture • Parallel databases • MapReduce systems • The Idea of HadoopDB • Desired properties • Hadoop Implementation Background • The Architecture of HadoopDB • HadoopDB’s Components • HadoopDB’s Evaluation

Reasons for data exploding • The amount of data that needs to be stored and processed by analytical database systems is exploding: • the increased automation (more business processes are becoming digitized) • the proliferation of sensors and data-producing devices • Web-scale interactions with customers • government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis

Shared-nothing architecture • Because of the exploding data problem, all of the mentioned analytical database start-ups deploy their DBMS on a shared-nothing architecture • a collection of independent (possibly virtual) machines • each with local disk and local main memory • connected together on a high-speed network • In this paper, the analytical DBMS systems that deploy on a shared-nothing architecture means parallel databases.

Parallel databases • Performance and efficiency of parallel databases makes them well suited to perform big data analysis • Parallel databases proved to scale really well into the tens of nodes. • Reasons why parallel databases do not scale well into the hundreds of nodes : • failures become increasingly common as one adds more nodes to a system • nearly impossible to achieve pure homogeneity at scale • parallel databases have not been tested at larger scales applications

MapReduce systems • Hadoop is one of MapReduce system. • Good choice for big data analysis : • Scalability(proved success in Google’s internal operations) • Fault tolerance • Flexibility to handle unstructured data • Cost (open source)

The Idea of HadoopDB • Combine the scalability advantages of MapReduce with the performance and efficiency advantages of parallel databases to achieve a hybrid system: • well suited for the analytical DBMS market. • and can handle the future demands of data intensive applications.

Desired properties • Performance • Fault Tolerance • does not have to restart a query if one of the nodes involved in query processing fails. • Ability to run in a heterogeneous environment • Flexible query interface • have a robust mechanism for allowing the user to write user defined functions (UDFs) and queries • The goal of HadoopDB is to achieve all of the desired properties

Hadoop Implementation Background • Hadoop consists of two layers: • a data storage layer or the Hadoop Distributed File System (HDFS) • a data processing layer or the MapReduce Framework.

Hadoop Implementation Background • HDFS: • is a block-structured file system managed by a central NameNode. • Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. • The NameNode maintains metadata about the size and location of blocks and their replicas.

Hadoop Implementation Background • The MapReduce Framework follows a simple master-slave architecture. • The master is a single JobTracker • and the slaves or worker nodes are TaskTrackers. • The JobTracker • handles the runtime scheduling of MapReduce jobs • and maintains information on each TaskTracker’s load and available resources.

Hadoop Implementation Background • TaskTrackers • regularly update the JobTracker with their status • InputFormat library • represents the interface between the storage and processing layers

The Architecture of HadoopDB

HadoopDB’s Components • Database Connector: • the interface between independent database systems residing on nodes in the cluster and TaskTrackers. • It extends Hadoop’s InputFormat class • and is part of the Input-Format Implementations library • The Connector connects to the database, executes the SQL query and returns results as key-value pairs.

HadoopDB’s Components • Catalog: • maintains metainformation about the databases • connection parameters such as database location, driver class and credentials • metadata such as data sets contained in the cluster, replica locations, and data partitioning properties

HadoopDB’s Components • Data Loader: • responsible for: • globally repartitioning data on a given partition key upon loading • breaking apart single node data into multiple smaller partitions or chunks • bulk-loading the single-node databases with the chunks • consists of two main components: • Global Hasher and Local Hasher • to ensure chunks are of a uniform size

HadoopDB’s Components • SQL to MapReduce to SQL (SMS) Planner: • The SMS planner extends Hive. • Hive transforms HiveQL, a variant of SQL, into MapReduce jobs that connect to tables stored as files in HDFS.

HadoopDB’s Evaluation • By compare the performance of: • Hadoop • HadoopDB • Vertica • DBMS-X

HadoopDB’s Evaluation • The 5 benchmark tasks: • First task operate on unstructured data • The “Grep task” • Three tasks operate on structured data • Selection Task • Aggregation Task • Join task • final task operates on both structured and unstructured data • UDF Aggregation Task

HadoopDB’s Evaluation • The datasets used in this experiments: • UserVisits table (20GB) • Documents table (1GB) per node • Rankings table

HadoopDB’s Evaluation • Summary of Results: • HadoopDB is able to approach the performance of the parallel database systems • HadoopDB consistently outperforms Hadoop • HadoopDB’s load time is about 10 times longer than Hadoop’s

HadoopDB : An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

HadoopDB : An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Presentation Transcript

Algorithmic Sustainable Design: The Future of Architectural Theory.

Gasoline/Electric Hybrid Vehicle Update ‘06

Pre-analytical Laboratory Errors

Capacity Planning for the Newer Workloads

NORMAN support for validation of methods at the European level

Review of Analytical Methods Part 1: Spectrophotometry

Quality Issues in Coagulation Laboratory

DBMS: Past, Present, and the Future

CPE 619 Workloads: Types, Selection, Characterization

Apache Mahout Feb 13, 2012 Shannon Quinn

MapReduce , Collective Communication, and Services

Memory Systems in the Multi-Core Era Lecture 2.2: Emerging Technologies and Hybrid Memories

Analytical Chemistry

Architectural Design

Unit 1 Introduction to DBMS (Database Management Systems)

Analytical Exposition

Information Systems Development

Hybrid Vehicles

Basic SQL