
HadoopDB : An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Submitted to: Dr. Amer Badarneh. By: Muneera AL_Quraan.


Presentation Transcript


  1. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Submitted to: Dr. Amer Badarneh By: Muneera AL_Quraan

  2. Outline • Reasons for data exploding • Shared-nothing architecture • Parallel databases • MapReduce systems • The Idea of HadoopDB • Desired properties • Hadoop Implementation Background • The Architecture of HadoopDB • HadoopDB’s Components • HadoopDB’s Evaluation

  3. Reasons for data exploding • The amount of data that needs to be stored and processed by analytical database systems is exploding, driven by: • increased automation (more business processes are becoming digitized) • the proliferation of sensors and data-producing devices • Web-scale interactions with customers • government compliance demands, along with strategic corporate initiatives, that require more historical data to be kept online for analysis

  4. Shared-nothing architecture • Because of the exploding data problem, the analytical database start-ups mentioned in the paper all deploy their DBMS on a shared-nothing architecture: • a collection of independent (possibly virtual) machines • each with local disk and local main memory • connected together on a high-speed network • In the paper, analytical DBMS systems that deploy on a shared-nothing architecture are referred to as parallel databases.

  5. Parallel databases • The performance and efficiency of parallel databases make them well suited to big data analysis. • Parallel databases have proved to scale well into the tens of nodes. • Reasons why parallel databases do not scale well into the hundreds of nodes: • failures become increasingly common as one adds more nodes to a system • it is nearly impossible to achieve pure homogeneity at scale • parallel databases have not been tested at larger scales

  6. MapReduce systems • Hadoop is an open-source MapReduce system. • MapReduce is a good choice for big data analysis because of its: • Scalability (proven in Google’s internal operations) • Fault tolerance • Flexibility to handle unstructured data • Cost (open source)
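
To make the "flexibility to handle unstructured data" point concrete, here is a minimal single-process sketch of the MapReduce programming model in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and a word count over raw text lines stands in for an analytical job.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Call reduce_fn once per key with all values for that key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    # The input is plain text: no schema has to be declared up front.
    return [(word.lower(), 1) for word in line.split()]


def word_count_reduce(word, counts):
    return sum(counts)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine (toy model)."""
    return reduce_phase(shuffle(map_phase(records, word_count_map)), word_count_reduce)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network; the sketch only shows the shape of the two user-supplied functions.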

  7. The Idea of HadoopDB • Combine the scalability advantages of MapReduce with the performance and efficiency advantages of parallel databases to achieve a hybrid system that: • is well suited for the analytical DBMS market • can handle the future demands of data-intensive applications

  8. Desired properties • Performance • Fault Tolerance • a query does not have to be restarted if one of the nodes involved in query processing fails • Ability to run in a heterogeneous environment • Flexible query interface • a robust mechanism for allowing the user to write user-defined functions (UDFs) and queries • The goal of HadoopDB is to achieve all of the desired properties.

  9. Hadoop Implementation Background • Hadoop consists of two layers: • a data storage layer, the Hadoop Distributed File System (HDFS) • a data processing layer, the MapReduce Framework

  10. Hadoop Implementation Background • HDFS is a block-structured file system managed by a central NameNode. • Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. • The NameNode maintains metadata about the size and location of blocks and their replicas.
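
The block and metadata bookkeeping described on this slide can be sketched as a toy model in Python. This is not the HDFS API: `ToyNameNode`, its tiny `block_size`, and the round-robin replica placement are assumptions made for illustration (real HDFS uses its own placement policy and much larger blocks).

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Toy stand-in for a NameNode: tracks block sizes and replica locations only."""
    data_nodes: list
    block_size: int = 4      # bytes per block; deliberately tiny for the example
    replication: int = 2     # replicas per block
    block_map: dict = field(default_factory=dict)  # (file, block index) -> (size, nodes)

    def add_file(self, name, content):
        """Split content into fixed-size blocks and record where each replica lives."""
        blocks = [content[i:i + self.block_size]
                  for i in range(0, len(content), self.block_size)]
        for index, block in enumerate(blocks):
            # Assumed placement: consecutive DataNodes, wrapping around the list.
            replicas = [self.data_nodes[(index + r) % len(self.data_nodes)]
                        for r in range(self.replication)]
            self.block_map[(name, index)] = (len(block), replicas)
        return len(blocks)

    def locate(self, name, index):
        """Return (block size, replica DataNodes) for one block of a file."""
        return self.block_map[(name, index)]
```

Note that the sketch stores only metadata, mirroring the slide: the NameNode knows the size and location of each block, while the block contents would live on the DataNodes.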

  11. Hadoop Implementation Background • The MapReduce Framework follows a simple master-slave architecture: the master is a single JobTracker, and the slaves (worker nodes) are TaskTrackers. • The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker’s load and available resources.

  12. Hadoop Implementation Background • TaskTrackers regularly update the JobTracker with their status. • The InputFormat library represents the interface between the storage and processing layers.
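
The JobTracker and TaskTracker interaction on the last two slides can be sketched as follows. This is a toy model in Python, not Hadoop's scheduler: the class `ToyJobTracker`, the notion of "free slots", and the tie-breaking rule are assumptions made for illustration.

```python
class ToyJobTracker:
    """Toy master: remembers each tracker's reported capacity and assigns tasks."""

    def __init__(self):
        self.free_slots = {}  # tracker name -> free task slots from its last heartbeat

    def heartbeat(self, tracker, free_slots):
        """Called by a TaskTracker to report its current status."""
        self.free_slots[tracker] = free_slots

    def assign(self, task, preferred=()):
        """Pick a tracker for task, or return None if the cluster is full.

        Trackers listed in `preferred` (for example, those holding the task's
        input data) are tried first; otherwise the least loaded tracker wins.
        """
        candidates = [t for t in preferred if self.free_slots.get(t, 0) > 0]
        if not candidates:
            candidates = [t for t, free in self.free_slots.items() if free > 0]
        if not candidates:
            return None
        chosen = max(candidates, key=lambda t: self.free_slots[t])
        self.free_slots[chosen] -= 1
        return chosen
```

A usage example: after `heartbeat("tt1", 1)` and `heartbeat("tt2", 2)`, a task that prefers `tt1` is placed there; once `tt1` is full, later tasks fall back to `tt2`.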

  13. The Architecture of HadoopDB

  14. HadoopDB’s Components • Database Connector: • the interface between the independent database systems residing on nodes in the cluster and the TaskTrackers. • It extends Hadoop’s InputFormat class and is part of the Input-Format Implementations library. • The Connector connects to the database, executes the SQL query, and returns the results as key-value pairs.
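
The Connector's job of turning SQL results into key-value pairs can be illustrated with a short Python sketch. The real Connector is a Java class extending Hadoop's InputFormat; here an in-memory SQLite database stands in for a node-local database, and the function names, the `rankings` schema, and the choice of the first column as the key are all assumptions of this example.

```python
import sqlite3


def read_as_key_value(connection, sql, key_column=0):
    """Run sql on a node-local database and yield (key, value) pairs.

    The key is one column of each row; the value is a tuple of the remaining columns.
    """
    cursor = connection.execute(sql)
    for row in cursor:
        key = row[key_column]
        value = tuple(v for i, v in enumerate(row) if i != key_column)
        yield key, value


def demo_connection():
    """Build a small in-memory database standing in for one node's DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?)",
        [("a.com", 12), ("b.com", 3), ("c.com", 40)],
    )
    return conn
```

The filtering happens inside the database, so a downstream map task sees only the rows that survive the `WHERE` clause, already shaped as key-value pairs.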

  15. HadoopDB’s Components • Catalog: • maintains meta-information about the databases, including: • connection parameters such as database location, driver class, and credentials • metadata such as the data sets contained in the cluster, replica locations, and data partitioning properties

  16. HadoopDB’s Components • Data Loader: • responsible for: • globally repartitioning data on a given partition key upon loading • breaking apart single-node data into multiple smaller partitions or chunks • bulk-loading the single-node databases with the chunks • consists of two main components: • Global Hasher and Local Hasher • the two use different hash functions to ensure chunks are of a uniform size
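
The two-level partitioning performed by the Data Loader can be sketched as below. This is an illustrative Python model, not HadoopDB's loader: the function names, the use of MD5, and the per-level salt are assumptions. The salt is there so that the second split does not simply repeat the first one.

```python
import hashlib


def _hash(key, salt):
    """Deterministic hash of key; the salt makes the two levels independent."""
    digest = hashlib.md5(f"{salt}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def global_hash(records, key_fn, num_nodes):
    """First level: route every record to one node by its partition key."""
    partitions = [[] for _ in range(num_nodes)]
    for record in records:
        partitions[_hash(key_fn(record), "global") % num_nodes].append(record)
    return partitions


def local_hash(partition, key_fn, num_chunks):
    """Second level: split one node's partition into smaller chunks."""
    chunks = [[] for _ in range(num_chunks)]
    for record in partition:
        chunks[_hash(key_fn(record), "local") % num_chunks].append(record)
    return chunks
```

Each chunk produced by `local_hash` would then be bulk-loaded into the node's database; that step is omitted here.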

  17. HadoopDB’s Components • SQL to MapReduce to SQL (SMS) Planner: • The SMS planner extends Hive. • Hive transforms HiveQL, a variant of SQL, into MapReduce jobs that connect to tables stored as files in HDFS.
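
The "SQL to MapReduce to SQL" idea can be illustrated with a toy aggregation: the SQL part of the work runs inside each node's database, and a reduce step merges the per-node partial results. This Python sketch is not the planner's actual output; in-memory SQLite databases stand in for the node databases, and the `UserVisits` columns (`sourceIP`, `adRevenue`) and all function names are assumptions of the example.

```python
import sqlite3
from collections import defaultdict

# Query pushed down to every node; each node aggregates only its own rows.
NODE_SQL = "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"


def make_node(rows):
    """Create an in-memory database holding one node's share of UserVisits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def map_on_node(conn):
    """Map side: the SQL runs inside the node's database and emits partial sums."""
    return conn.execute(NODE_SQL).fetchall()


def reduce_partials(partials):
    """Reduce side: add up the partial sums that share a sourceIP."""
    totals = defaultdict(float)
    for source_ip, revenue in partials:
        totals[source_ip] += revenue
    return dict(totals)


def run_query(nodes):
    """Collect partial results from all nodes, then merge them."""
    partials = [pair for conn in nodes for pair in map_on_node(conn)]
    return reduce_partials(partials)
```

The design point is that each database does as much of the work as it can locally, so the MapReduce layer only has to combine small partial results rather than scan raw rows.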

  18. HadoopDB’s Evaluation • By comparing the performance of: • Hadoop • HadoopDB • Vertica • DBMS-X

  19. HadoopDB’s Evaluation • The 5 benchmark tasks: • The first task operates on unstructured data: • Grep Task • Three tasks operate on structured data: • Selection Task • Aggregation Task • Join Task • The final task operates on both structured and unstructured data: • UDF Aggregation Task

  20. HadoopDB’s Evaluation • The datasets used in these experiments: • UserVisits table (20GB) • Documents table (1GB) per node • Rankings table

  21. HadoopDB’s Evaluation • Summary of Results: • HadoopDB is able to approach the performance of the parallel database systems • HadoopDB consistently outperforms Hadoop • HadoopDB’s load time is about 10 times longer than Hadoop’s
