
Hadoop Ecosystem Overview

Hadoop Ecosystem Overview. CMSC 491/691 Hadoop-Based Distributed Computing, Spring 2014, Adam Shook. Agenda: introduce Hadoop projects to prepare you for your group work; intimate detail will be provided in future lectures; discuss potential use cases for each project.


Presentation Transcript


  1. Hadoop Ecosystem Overview CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Agenda • Introduce Hadoop projects to prepare you for your group work • Intimate detail will be provided in future lectures • Discuss potential use cases for each project

  3. HDFS • Hadoop Distributed File System • High-performance file system for storing data • We’ve talked about this enough

  4. Hadoop MapReduce • High-performance, fault-tolerant data processing system • We’ve also talked about this enough

  5. YARN • Abstract framework for distributed application development • Splits the functionality of the JobTracker into two components: ResourceManager and ApplicationMaster • TaskTracker becomes NodeManager • Containers instead of map and reduce slots • Configurable amount of memory per NodeManager
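
To make the "containers instead of slots" point concrete, here is a minimal arithmetic sketch. It assumes a NodeManager offers one fixed memory budget (in YARN this is set with the `yarn.nodemanager.resource.memory-mb` property) and that every container asks for the same amount; the function name and the numbers are hypothetical, and CPU limits are ignored.

```python
def containers_per_node(node_memory_mb: int, container_memory_mb: int) -> int:
    """Return how many equally sized containers fit in one NodeManager's memory."""
    if container_memory_mb <= 0:
        raise ValueError("container size must be positive")
    # Memory that cannot hold a whole container is simply left unused.
    return node_memory_mb // container_memory_mb


# Hypothetical node that offers 8 GB to YARN.
node_mb = 8192
print(containers_per_node(node_mb, 1024))  # 8
print(containers_per_node(node_mb, 3072))  # 2
```

Unlike fixed map and reduce slots, the same node can host many small containers or a few large ones, depending on what each application requests.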

  6. MapReduce 2.x on YARN • MapReduce API has not changed • Binary-level backwards compatible (no recompile) • ApplicationMaster launches and monitors job via YARN • MapReduce History Server to store… history • Enabled Yahoo! to scale beyond 4,000 nodes

  7. Hadoop Ecosystem • Core Technologies • Hadoop Distributed File System • Hadoop MapReduce • Many other tools… • Which we will be discussing… now

  8. Apache Sqoop • Top-level Apache project designed for efficient transfer between Apache Hadoop and structured data stores • Use cases?

  9. Apache Flume • Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data • Use cases?

  10. Apache Pig • Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs • Infrastructure compiles language to a sequence of MapReduce programs • Use cases?

  11. Apache Hive • Data warehouse facilitating querying and managing large datasets • Compiles SQL-like queries into MapReduce programs • Use cases?
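
As a rough illustration of what "compiles SQL-like queries into MapReduce programs" means, the sketch below hand-translates a query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` into map, shuffle, and reduce steps in plain Python. The table, column names, and functions are hypothetical, and this is not how Hive's planner is implemented; it only shows the shape of the resulting job.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row["dept"], 1


def shuffle(pairs):
    """Shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: COUNT(*) is the sum of the 1s emitted for each key."""
    return {key: sum(values) for key, values in groups.items()}


def group_by_count(rows):
    return reduce_phase(shuffle(map_phase(rows)))


# Hypothetical "employees" table.
rows = [
    {"name": "ann", "dept": "eng"},
    {"name": "bob", "dept": "ops"},
    {"name": "cat", "dept": "eng"},
]
print(group_by_count(rows))  # {'eng': 2, 'ops': 1}
```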

  12. Hadoop Streaming • Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer • Use cases?
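
A minimal word-count sketch of the Streaming contract: the mapper and reducer are ordinary programs that read lines and write tab-separated `key<TAB>value` lines, and the reducer receives its input sorted by key. Here both are written as Python functions over an iterable of lines so they can be tried locally; in a real job each would be a script reading `sys.stdin` and printing to standard output. The function names and the local `simulate` helper are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


print(simulate(["a b a", "b a"]))  # ['a\t3', 'b\t2']
```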

  13. Which high-level API is for you? • What are you comfortable with? • What are you being told to use?

  14. Apache HBase • Distributed, scalable, big data store • Data stored as sorted key/value pairs, with the key consisting of a row and column • Use cases?
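
The "sorted key/value pairs, with the key consisting of a row and column" idea can be sketched with a small in-memory store. This is a toy model, not the HBase client API: the class and method names are hypothetical, and timestamps, column families, and persistence are left out. It shows why keeping keys sorted makes row-range scans cheap.

```python
import bisect


class SortedKV:
    """Toy sorted key/value store keyed by (row, column)."""

    def __init__(self):
        self._keys = []  # (row, column) tuples, kept in sorted order
        self._vals = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, row, column):
        return self._vals.get((row, column))

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        # (start_row,) sorts before every (start_row, column) key.
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            key = self._keys[i]
            yield key, self._vals[key]
            i += 1


store = SortedKV()
store.put("user2", "info:name", "Bob")
store.put("user1", "info:name", "Ann")
store.put("user1", "info:email", "ann@example.com")
store.put("user3", "info:name", "Cat")
for key, value in store.scan("user1", "user3"):
    print(key, value)
```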

  15. Apache Accumulo • Robust, scalable, high-performance data storage and retrieval key/value store • Cell-based access controls (i.e., cell-level security) • Use cases?
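
A simplified sketch of cell-level security: every cell carries its own set of required labels, and a reader only sees cells whose labels are all covered by the reader's authorizations. Accumulo's real visibility labels are boolean expressions (with and/or combinations); treating them as a plain set of required tokens is an assumption made here to keep the example short, and the function name and sample data are hypothetical.

```python
def visible_cells(cells, authorizations):
    """Return the (key, value) pairs the given authorizations may read.

    Each cell is a (key, required_labels, value) triple; a cell is visible
    only if every required label is among the reader's authorizations.
    """
    auths = frozenset(authorizations)
    return [
        (key, value)
        for key, required, value in cells
        if frozenset(required) <= auths
    ]


# Hypothetical patient record: each cell has its own label set.
cells = [
    (("patient1", "name"), {"staff"}, "Ann"),
    (("patient1", "diagnosis"), {"staff", "doctor"}, "flu"),
    (("patient1", "room"), set(), "12B"),  # no labels: visible to all
]
print(visible_cells(cells, {"staff"}))
print(visible_cells(cells, {"staff", "doctor"}))
```

Two readers scanning the same row get different views, which is the difference from table-level or row-level access control.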

  16. Apache Avro • Data serialization system for the Hadoop ecosystem • Use cases?

  17. Parquet • Columnar storage format for Hadoop • Use cases?
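
To show what "columnar" means, the sketch below pivots row-oriented records into one list per column. It only illustrates the layout idea and is not the Parquet file format (no row groups, encodings, or compression); the function name and sample records are hypothetical, and all rows are assumed to have the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical log records stored row by row.
rows = [
    {"host": "a", "status": 200, "bytes": 512},
    {"host": "b", "status": 404, "bytes": 128},
    {"host": "a", "status": 200, "bytes": 2048},
]
columns = to_columns(rows)

# An aggregate over one field touches only that column's values.
print(sum(columns["bytes"]))  # 2688
```

Queries that read a few columns out of many benefit most from this layout, and values of the same type stored together tend to compress well.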

  18. Apache Mahout • Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce • Use cases?

  19. Apache Oozie • Workflow scheduler system to manage Apache Hadoop jobs • Use cases?

  20. Storm • Distributed real-time computation system • How is this different from MapReduce? • Use cases?

  21. ZooKeeper • Effort to develop and maintain an open-source server enabling highly reliable distributed coordination • Use cases?

  22. SQL on Hadoop • Apache Drill, Cloudera Impala, Facebook Presto, Hortonworks Stinger, Pivotal HAWQ, etc. • SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store • Use cases? Non-use cases?

  23. Other Hadoop Projects • We won’t be covering these in detail later on

  24. Apache Cassandra • NoSQL database for managing large amounts of structured, semi-structured, and unstructured data • Support for clusters spanning multiple datacenters • Unlike HBase and Accumulo, data is not stored on HDFS • Use cases? Non-use cases?

  25. Azkaban • Batch workflow job scheduler to run Hadoop jobs • Use cases?

  26. Apache Spark • Fast and general engine for large-scale data processing • Write applications in Java, Scala, or Python • Use cases?

  27. Shark • Large-scale data warehouse for Spark and compatible with Apache Hive • Use cases?

  28. Redis, Memcached, etc. • Open-source in-memory key/value stores • Use cases?

  29. Review • A lot of projects are available to you for your group work • Feel free to explore and use projects other than the ones I have listed here • Get permission if you plan on using one as part of your group work project quota

  30. References • *.apache.org • parquet.io • storm-project.net • redis.io • spark.incubator.apache.org • incubator.apache.org/drill • github.com/amplab/shark/wiki
