Hadoop Ecosystem Overview

CMSC 491/691 Hadoop-Based Distributed Computing • Spring 2014 • Adam Shook

Agenda • Introduce Hadoop projects to prepare you for your group work • Intimate detail will be provided in future lectures • Discuss potential use cases for each project

HDFS • Hadoop Distributed File System • High-performance file system for storing data • We’ve talked about this enough

Hadoop MapReduce • High-performance, fault-tolerant data processing system • We’ve also talked about this enough

YARN • Abstract framework for distributed application development • Split functionality of JobTracker into two components • ResourceManager • ApplicationMaster • TaskTracker becomes NodeManager • Containers instead of map and reduce slots • Configurable amount of memory per NodeManager
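The move from fixed map and reduce slots to containers can be pictured with a little arithmetic. The following is a minimal sketch, not YARN's actual scheduler: the function name and the memory figures are assumptions chosen for illustration, and it ignores CPU cores and scheduler policy entirely.

```python
def containers_per_node(node_memory_mb, container_memory_mb):
    """Return how many equally sized containers fit in one NodeManager's memory."""
    if container_memory_mb <= 0:
        raise ValueError("container size must be positive")
    return node_memory_mb // container_memory_mb


# Hypothetical NodeManager configured to offer 8192 MB to containers.
print(containers_per_node(8192, 1024))  # 8 small containers
print(containers_per_node(8192, 3072))  # 2 large containers, 2048 MB left over
```

The point of the sketch is that the same node can host many small tasks or a few large ones, which a fixed number of map and reduce slots could not express.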

MapReduce 2.x on YARN • MapReduce API has not changed • Binary-level backwards compatible (no recompile) • Application Master launches and monitors job via YARN • MapReduce History Server to store… history • Enabled Yahoo! to scale beyond 4,000 nodes

Hadoop Ecosystem • Core Technologies • Hadoop Distributed File System • Hadoop MapReduce • Many other tools… • Which we will be discussing… now

Apache Sqoop • Top-level Apache project designed for efficient transfer between Apache Hadoop and structured data stores • Use cases?

Apache Flume • Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data • Use cases?

Apache Pig • Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs • Infrastructure compiles language to a sequence of MapReduce programs • Use cases?

Apache Hive • Data warehouse facilitating querying and managing large datasets • Compiles SQL-like queries into MapReduce programs • Use cases?
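To give a feel for what "compiles into MapReduce programs" means for Pig and Hive, here is a rough sketch of how a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be expressed as map, shuffle, and reduce steps. This is plain Python and not what either compiler actually generates; the table, column names, and function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit a (dept, 1) pair for every input row."""
    for row in rows:
        yield row["dept"], 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key to produce COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input table.
employees = [
    {"name": "alice", "dept": "eng"},
    {"name": "bob", "dept": "ops"},
    {"name": "carol", "dept": "eng"},
]

print(reduce_phase(shuffle(map_phase(employees))))  # {'eng': 2, 'ops': 1}
```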

Hadoop Streaming • Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer • Use cases?
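A Streaming mapper or reducer is just a program that reads lines and writes tab-separated key/value lines. The sketch below shows a word count in that style, written as Python functions over lines so it can be run locally; in a real job each function would live in its own script reading standard input. The sorted() call stands in for the framework's sort between the two phases, and all names here are illustrative assumptions.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts for each word; input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort -> reduce on two hypothetical input lines.
text = ["the quick fox", "the lazy dog"]
for output_line in reducer(sorted(mapper(text))):
    print(output_line)
```

Because the contract is only lines in and lines out, the same structure works for any executable, which is the appeal of Streaming for teams that do not want to write Java.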

Which high-level API is for you? • What are you comfortable with? • What are you being told to use?

Apache HBase • Distributed, scalable, big data store • Data stored as sorted key/value pairs, with the key consisting of a row and column • Use cases?
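The "sorted key/value pairs" idea is what makes range scans cheap. Here is a minimal in-memory sketch of that data model, assuming a key made of (row, column); the class and method names are invented for illustration and are not the HBase client API.

```python
import bisect


class SortedKeyValueStore:
    """Toy store that keeps (row, column) keys in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # maps (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKeyValueStore()
store.put("row1", "info:name", "alice")
store.put("row3", "info:name", "carol")
store.put("row2", "info:name", "bob")
store.put("row2", "info:age", "30")

for key, value in store.scan("row2", "row3"):
    print(key, value)
```

Since the keys are sorted, all cells of a row sit next to each other and a scan only touches the requested key range.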

Apache Accumulo • Robust, scalable, high-performance data storage and retrieval key/value store • Cell-based access controls • i.e. cell-level security • Use cases?
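Cell-level security means every cell carries its own visibility label and a read only returns the cells the reader is authorized to see. The sketch below is a deliberately simplified model: each cell lists a set of labels that must all be held by the reader. Accumulo's real visibility expressions are richer than this, and the function name and sample data are assumptions for the example.

```python
def visible_cells(cells, authorizations):
    """Return the (key, value) pairs whose required labels are all held by the reader."""
    held = set(authorizations)
    return [
        (key, value)
        for key, value, required in cells
        if set(required) <= held
    ]


# Hypothetical cells: (key, value, labels required to read the cell).
cells = [
    (("row1", "name"), "alice", set()),
    (("row1", "diagnosis"), "flu", {"medical"}),
    (("row1", "salary"), "90000", {"medical", "hr"}),
]

print(visible_cells(cells, {"medical"}))
```

Two readers scanning the same row can therefore get different results, which is the use case that separates Accumulo from a store with only table-level permissions.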

Apache Avro • Data serialization system for the Hadoop ecosystem • Use cases?

Parquet • Columnar storage format for Hadoop • Use cases?
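The difference between row and columnar storage can be sketched in a few lines. This is only an illustration of the layout idea, not the Parquet file format; the function name and records are assumptions, and every record is assumed to have the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dictionaries into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical web log records stored row by row.
rows = [
    {"url": "/index", "status": 200, "bytes": 512},
    {"url": "/about", "status": 200, "bytes": 256},
    {"url": "/missing", "status": 404, "bytes": 64},
]

columns = to_columns(rows)

# An aggregate over one field reads a single column and skips the others.
print(sum(columns["bytes"]))  # 832
```

Queries that touch a few fields of wide records benefit most, since the untouched columns never need to be read.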

Apache Mahout • Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce • Use cases?

Apache Oozie • Workflow scheduler system to manage Apache Hadoop jobs • Use cases?
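A workflow scheduler's core job is to run actions in an order that respects their dependencies. Here is a small sketch of that idea using Python's standard graphlib module (Python 3.9 or later). It does not use Oozie's workflow definition format, and the action names are made up for the example.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return action names in an order where every action follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical workflow: each action maps to the set of actions it waits for.
workflow = {
    "import_data": set(),
    "clean_data": {"import_data"},
    "build_report": {"clean_data"},
}

print(execution_order(workflow))  # ['import_data', 'clean_data', 'build_report']
```

A real scheduler adds triggers, retries, and failure handling on top of this ordering.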

Storm • Distributed real-time computation system • How is this different from MapReduce? • Use cases?
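One way to think about the question on this slide: a batch job sees the whole input and produces one answer at the end, while a stream processor updates its answer as each item arrives. The sketch below contrasts the two for a word count. It is plain Python, not the Storm API, and both function names are assumptions for illustration.

```python
from collections import Counter


def batch_count(words):
    """Batch style: consume the full input, then return the final counts."""
    return Counter(words)


def streaming_count(word_stream):
    """Stream style: emit an updated count as soon as each word arrives."""
    counts = Counter()
    for word in word_stream:
        counts[word] += 1
        yield word, counts[word]


words = ["a", "b", "a"]

print(batch_count(words))            # Counter({'a': 2, 'b': 1})
print(list(streaming_count(words)))  # [('a', 1), ('b', 1), ('a', 2)]
```

The streaming version never needs the input to end, which is why it suits unbounded sources such as click streams or sensor feeds.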

ZooKeeper • Effort to develop and maintain an open-source server enabling highly reliable distributed coordination • Use cases?

SQL on Hadoop • Apache Drill, Cloudera Impala, Facebook Presto, Hortonworks Stinger, Pivotal HAWQ, etc. • SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store • Use cases? Non use cases?

Other Hadoop Projects • We won’t be covering these in detail later on

Apache Cassandra • NoSQL database for managing large amounts of structured, semi-structured, and unstructured data • Support for clusters spanning multiple datacenters • Unlike HBase and Accumulo, data is not stored on HDFS • Use cases? Non use cases?

Azkaban • Batch workflow job scheduler to run Hadoop jobs • Use cases?

Apache Spark • Fast and general engine for large-scale data processing • Write applications in Java, Scala, or Python • Use cases?

Shark • Large-scale data warehouse for Spark and compatible with Apache Hive • Use cases?

Redis, Memcached, etc. • Open-source in-memory key/value stores • Use cases?

Review • A lot of projects are available to you for your group work • Feel free to explore and use projects other than the ones I have listed here • Get permission if you plan on counting one toward your group work project quota

References • *.apache.org • parquet.io • storm-project.net • redis.io • spark.incubator.apache.org • incubator.apache.org/drill • github.com/amplab/shark/wiki
