Hadoop 101

Hadoop 101: Just the basics… and a bit more if you are interested
Hortonworks

Hortonworks & Hadoop: A Long History
Timeline, 2004 to 2013: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform
• Focus on INNOVATION
  • 2005: Hadoop created at Yahoo! to solve SEARCH
• Focus on OPERATIONS
  • 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters
• Focus on STABILITY
  • 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo

Hadoop: It’s About Scale & Structure
Hadoop versus a traditional RDBMS, compared on SCALE (storage & processing):
• Schema: required on read (Hadoop); required on write (traditional RDBMS)
• Governance: loosely structured (Hadoop); standards and structured (traditional RDBMS)
• Processing: processing coupled with data (Hadoop); limited, no data processing (traditional RDBMS)
• Data types: multi and unstructured (Hadoop); structured (traditional RDBMS)
• Transactions: optimized for analytics (Hadoop); optimized, reliable (traditional RDBMS)
• Best fit use (Hadoop): Data Discovery, Processing unstructured data, Massive Storage/Processing
• Best fit use (traditional RDBMS): Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store
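The difference between schema on write and schema on read can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout (`name,score`), the function names, and the sample data are hypothetical, and nothing here uses a Hadoop or database API.

```python
def load_schema_on_write(raw_lines, schema):
    """Schema on write: every record is converted at load time.

    A record that does not fit the schema raises ValueError, so bad
    data never reaches the table.
    """
    table = []
    for line in raw_lines:
        fields = line.split(",")
        table.append(tuple(cast(value) for cast, value in zip(schema, fields)))
    return table


def query_schema_on_read(raw_lines, column, cast):
    """Schema on read: raw lines are stored untouched.

    Structure is applied only when a query runs, and only to the
    column that the query needs. Records that do not fit are skipped.
    """
    for line in raw_lines:
        try:
            yield cast(line.split(",")[column])
        except (ValueError, IndexError):
            continue


# Hypothetical sample data: the second record has a malformed score.
raw = ["alice,42", "bob,n/a", "carol,7"]
scores = list(query_schema_on_read(raw, 1, int))   # bad record skipped
names = list(query_schema_on_read(raw, 0, str))    # all three records usable
```

In this sketch the same raw lines answer two different questions, while `load_schema_on_write(raw, (str, int))` would reject the whole load because of the one malformed record.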

Apache Hadoop & A Hadoop “Distribution”
• Apache Hadoop is a project
  • Governed by the Apache Software Foundation (ASF)
  • Comprises the core services of MapReduce, YARN and HDFS
• A Hadoop distribution is a package of projects
  • Packages Apache Hadoop and related Apache projects
  • In addition to the core, includes functions across
    • Data services, which allow you to manipulate and move data (Hive, HBase, Pig, Flume, Sqoop)
    • Operational services, which help manage the cluster (Ambari, Falcon and Oozie)
  • Tested for consistency across the entire package
  • Hardened for the enterprise

Apache Hadoop Core
Hadoop is a distributed storage & processing technology
Key Characteristics
• Scalable
  • Efficiently store and process petabytes of data
• Reliable
  • Redundant storage
  • Failover across nodes and racks
• Flexible
  • Apply schema on analysis and sharing of the data
• Economical
  • Use commodity hardware
  • Open source software guards against vendor lock-in
CORE SERVICES: Process, Resource Management, Storage
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
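To make "redundant storage" and "failover across nodes and racks" concrete, here is a minimal sketch of rack-aware replica placement. It is loosely modeled on the commonly described HDFS default for three replicas (one on the writer's node, two on a single other rack), but it is a simplification, not HDFS code. The function name, the topology layout, and the node names are all hypothetical.

```python
import random


def place_replicas(topology, writer_node, replication=3, rng=None):
    """Choose nodes for one block's replicas, spread over two racks.

    topology maps a rack name to a list of node names (hypothetical).
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # First replica: the node that is writing the block.
    chosen = [writer_node]

    # Second replica: a node on a different rack, so that losing the
    # writer's whole rack does not lose the block.
    remote_racks = [r for r in topology if r != local_rack and topology[r]]
    if remote_racks and replication >= 2:
        remote_rack = rng.choice(remote_racks)
        second = rng.choice(topology[remote_rack])
        chosen.append(second)

        # Third replica: another node on that same remote rack.
        if replication >= 3:
            others = [n for n in topology[remote_rack] if n != second]
            if others:
                chosen.append(rng.choice(others))

    # Any replicas still missing go to unused nodes anywhere.
    remaining = [n for n in rack_of if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[: max(0, replication - len(chosen))])
    return chosen[:replication]


# Hypothetical two-rack cluster.
topology = {
    "rack1": ["r1n1", "r1n2", "r1n3"],
    "rack2": ["r2n1", "r2n2", "r2n3"],
}
replicas = place_replicas(topology, "r1n1")
```

With this layout the block survives the loss of any single node and of either rack, because at least one replica always sits on the other rack.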

Apache Hadoop Core
Hadoop is a distributed storage & processing technology
• HDFS: Distributed file system for storing large quantities of data across multiple nodes on commodity hardware
• MapReduce: Framework for processing data stored in HDFS
• YARN (new in 2.0): Resource management framework that mediates application access to data stored in HDFS
CORE SERVICES: MapReduce, YARN, HDFS
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
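The MapReduce processing model can be illustrated with the classic word count, written here as a single-process Python sketch. It shows the three phases (map, shuffle, reduce) only; it does not use the Hadoop API, and the function names are chosen for this illustration. In a real cluster the map and reduce calls would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big cluster"])
# counts == {"big": 2, "data": 1, "cluster": 1}
```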

Functional Translation of a Hadoop Distribution
HORTONWORKS DATA PLATFORM (HDP)
• OPERATIONAL SERVICES: Provision, Manage & Monitor the cluster
• DATA SERVICES: Manage, Move & Access Data
• CORE SERVICES: Store and Process Data

A Functional Translation of a Hadoop Distribution
HORTONWORKS DATA PLATFORM (HDP)
• OPERATIONAL SERVICES: Cluster Management, Dataset Management, Schedule
• DATA SERVICES: Data Access, Data Movement
• CORE SERVICES: Process, Resource Management, Storage
• Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots

A Hadoop Distribution: More Than Hadoop Core
Hortonworks Data Platform (HDP)
• Core Services
  • Storage & processing
• Data Services
  • Movement and interaction
• Operational Services
  • Management, monitoring
• Platform Services
  • The “ilities”
Components by layer:
• OPERATIONAL SERVICES: Ambari, Falcon, Oozie
• DATA SERVICES: Pig, Hive, Flume, Sqoop, HBase
• CORE SERVICES: MapReduce, Tez, YARN, HDFS
• Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots

Apache Flume: Loading Stream Data
Apache Flume: Store Log Files & Events
• Distributed service for efficiently collecting, aggregating, and moving streams of log data into HDFS
• Primary use case: move web log files directly into Hadoop
[Diagram: HDP stack with the data services Pig, Hive, Flume, HBase and Sqoop]

Apache Sqoop: Loading Databases
Apache Sqoop: Get Data from/to SQL Databases
• SQ-OOP: SQL to Hadoop
• Tools and connectors that enable data from traditional SQL databases and data warehouses to be stored to & retrieved from Hadoop
[Diagram: HDP stack with the data services Pig, Hive, Flume, HBase and Sqoop]

Apache HBase: NoSQL for Interactive Apps
Apache HBase: NoSQL data store for Interactive Apps
• Non-relational, columnar database that allows for high speed access to data in Hadoop
• Commonly used to enrich existing applications with data stored in HDFS: the data is loaded directly into HBase for access by an on-line application (e.g. recommendation engine)
[Diagram: HDP stack with the data services Pig, Hive, Flume, HBase and Sqoop]
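The kind of access pattern an on-line application gets from a store like HBase can be sketched with a toy in-memory class: rows addressed by key, values addressed by `family:qualifier` column names, and range scans over sorted row keys. This is an illustrative assumption about the data model only. `TinyColumnStore`, its methods, and the sample keys are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class TinyColumnStore:
    """Toy store: row key -> "family:qualifier" -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell. Rows need not share the same columns."""
        self._rows[row_key][column] = value

    def get(self, row_key, family=None):
        """Read one row by key, optionally limited to one column family."""
        row = self._rows.get(row_key, {})
        if family is None:
            return dict(row)
        prefix = family + ":"
        return {col: val for col, val in row.items() if col.startswith(prefix)}

    def scan(self, start, stop):
        """Yield rows whose key falls in [start, stop), in key order."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, dict(self._rows[key])


# Hypothetical recommendation data keyed by user id.
store = TinyColumnStore()
store.put("user#001", "profile:name", "Ada")
store.put("user#001", "recs:top", "item-17")
store.put("user#002", "profile:name", "Grace")

recommendations = store.get("user#001", family="recs")
```

The point of the design is that a lookup by row key touches one row only, which is what makes it suitable for serving an interactive application.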

Apache Pig: Scripting in Hadoop
Apache Pig: Scripting Interface for Hadoop
• Write complex data transformations using a simple scripting language
• Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort among others
[Diagram: HDP stack with the data services Pig, Hive, Flume, HBase and Sqoop]
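The three transformations named above (join, aggregate, sort) can be chained into a small pipeline. The sketch below uses plain Python rather than Pig Latin, so it shows only the shape of such a data flow. The helper names and the sample records are hypothetical.

```python
from collections import defaultdict


def inner_join(left, right, key):
    """Join: pair every left record with the right records sharing its key."""
    index = defaultdict(list)
    for record in right:
        index[record[key]].append(record)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def count_by(records, key):
    """Aggregate: count the records in each group."""
    counts = defaultdict(int)
    for record in records:
        counts[record[key]] += 1
    return dict(counts)


def sort_desc(counts):
    """Sort: largest count first, ties broken by group name."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical inputs: page visits and a user lookup table.
visits = [
    {"user": "a", "url": "/home"},
    {"user": "a", "url": "/cart"},
    {"user": "b", "url": "/home"},
    {"user": "c", "url": "/home"},   # no matching user record
]
users = [{"user": "a", "country": "US"}, {"user": "b", "country": "DE"}]

visits_by_country = sort_desc(count_by(inner_join(visits, users, "user"), "country"))
# visits_by_country == [("US", 2), ("DE", 1)]
```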

Apache Hive: SQL in Hadoop
Apache Hive: SQL interface in Hadoop
• De facto SQL interface, enables a world of tools on Hadoop
• Scales from GB to PB across all queries
• Good for both batch and interactive queries
• First application to use Apache Tez
[Diagram: HDP stack with the data services Pig, Hive, Flume, HBase and Sqoop, and Tez in the core services]

Operational Services
Hadoop needs operational services for productive operations & management
• OPERATIONAL SERVICES: Provision, Manage & Monitor the cluster
• DATA SERVICES: Manage, Move & Access Data
• CORE SERVICES: Store and Process Data

Apache Ambari: Management & Monitoring
Apache Ambari: Provision, Manage and Monitor a Hadoop Cluster
• Intuitive user interface that makes controlling a cluster easy and productive
• Operational metrics and dashboards for insight into overall health
• Designed as a standalone tool or integrated with existing management tools
[Diagram: HDP stack with the operational services Ambari, Falcon and Oozie]

Apache Oozie: Schedule Management
Apache Oozie: Schedule management for Hadoop
• Allows you to schedule backend functions of Hadoop
• Largely operational but highly important to maintain efficient use of a cluster
[Diagram: HDP stack with the operational services Ambari, Falcon and Oozie]

Apache Falcon: Data Lifecycle Management
Apache Falcon: Dataset Lifecycle Management
• Automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases
• Manage & maintain data flows
• Enables visibility into data lineage and traceability
[Diagram: HDP stack with the operational services Ambari, Falcon and Oozie]

A Functional Translation of a Hadoop Distribution
HORTONWORKS DATA PLATFORM (HDP)
• OPERATIONAL SERVICES: Cluster Management, Dataset Management, Schedule
• DATA SERVICES: Data Access, Data Movement
• CORE SERVICES: Process, Resource Management, Storage
• Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots

HDP: Reliable, Consistent & Current
HDP demonstrates most recent community innovation
Hortonworks Data Platform releases:
• HDP 1.0: June 2012
• HDP 1.1: Sept 2012
• HDP 1.2: Feb 2013
• HDP 1.3: May 2013
• HDP 2.0: Oct 2013
[Chart: version of each component in each HDP release. Components: Hadoop, Pig, HCatalog, Hive, HBase, Sqoop, Flume, Oozie, Zookeeper, Mahout, Ambari]

Future of Hadoop: Beyond Batch
A shift from the old to the new…
• HADOOP 1.0: Single Use System (Batch Apps)
  • MapReduce (cluster resource management & data processing)
  • HDFS (redundant, reliable storage)
• HADOOP 2.0: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
  • MapReduce (batch), Tez (interactive), Others (varied)
  • YARN (operating system: cluster resource management)
  • HDFS2 (redundant, reliable storage)

Hadoop: a FLEXIBLE Multi-use Data Platform
Apache YARN: the Hadoop 2.0 Operating System
• Apache YARN enables data processing models beyond MapReduce (batch), such as interactive, online, streaming and beyond.
• Interact with all data in multiple ways simultaneously
Data Processing Engines Run Natively IN Hadoop
• BATCH: MapReduce
• INTERACTIVE: Tez
• ONLINE: HBase
• STREAMING: Storm
• GRAPH: Giraph
• OTHERS: REEF, LASR, HPA
All engines run on YARN (operating system: cluster resource management) over HDFS2 (redundant, reliable storage)
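The idea of one resource manager shared by several processing engines can be sketched as a toy allocator that hands out memory "containers" to whichever application asks. This is a deliberately simplified illustration, not YARN's scheduler or API. `TinyResourceManager`, its placement rule (pick the node with the most free memory), and the application names are assumptions made for the example.

```python
class TinyResourceManager:
    """Toy cluster-wide allocator shared by all applications."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)   # node name -> free memory in MB
        self.containers = []               # (application, node, memory in MB)

    def request(self, app, memory_mb):
        """Grant a container on the node with the most free memory.

        Returns the node name, or None when no node can fit the request.
        """
        node = max(self.free, key=lambda name: self.free[name])
        if self.free[node] < memory_mb:
            return None
        self.free[node] -= memory_mb
        self.containers.append((app, node, memory_mb))
        return node

    def release(self, app):
        """Return all of an application's containers to the pool."""
        kept = []
        for owner, node, memory_mb in self.containers:
            if owner == app:
                self.free[node] += memory_mb
            else:
                kept.append((owner, node, memory_mb))
        self.containers = kept


# Hypothetical two-node cluster shared by a batch job and an interactive query.
rm = TinyResourceManager({"node1": 4096, "node2": 2048})
rm.request("batch-job", 1024)
rm.request("interactive-query", 2048)
rm.release("batch-job")
```

Because the allocator knows nothing about what runs inside a container, batch, interactive and streaming workloads can share the same nodes, which is the shift the Hadoop 2.0 slides describe.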

An Emerging Data Architecture
• APPLICATIONS: Custom Applications, Packaged Applications, Business Analytics
• DEV & DATA TOOLS: Build & Test
• OPERATIONAL TOOLS: Manage & Monitor
• DATA SYSTEM: Repositories (RDBMS, EDW, MPP)
• SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

One Hadoop: Deep Integration
• APPLICATIONS
• DEV & DATA TOOLS
• OPERATIONAL TOOLS
• DATA SYSTEM: RDBMS, EDW, MPP
• Products shown: HANA, BusinessObjects BI
• INFRASTRUCTURE
• SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

Try Hadoop Today
Download the Hortonworks Sandbox
• Learn Hadoop
• Build a Proof of Concept
• Test New Functionality
THANK YOU!
