Hadoop Update: Big Data Analytics
May 23rd, 2012
Matt Mead, Cloudera
What is Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is:
• Scalable
• Fault tolerant
• Distributed

Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework

Together, they provide storage and computation in a single, scalable system.
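To make that division of labor concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API: HDFS holds the input and output, while the map and reduce phases do the computation. Paths and the job name are illustrative.

```java
// Minimal word-count sketch using the classic org.apache.hadoop.mapreduce API.
// Input/output paths live in HDFS; the map and reduce phases do the computation.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in this task's input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the per-word counts shuffled in from all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run as a standard job jar, e.g. `hadoop jar wordcount.jar WordCount /in /out`; note that the output directory must not already exist.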
Why Use Hadoop?

Move beyond rigid legacy frameworks.

1. Hadoop helps you derive the complete value of all your data.
• Drives revenue by extracting value from data that was previously out of reach
• Controls costs by storing data more affordably than any other platform

2. Hadoop grows with your business.
• Proven at petabyte scale
• Capacity and performance grow simultaneously
• Leverages commodity hardware to mitigate costs

3. Hadoop is 100% Apache® licensed and open source.
• No vendor lock-in
• Community development
• Rich ecosystem of related projects

4. Hadoop handles any data type, in any quantity.
• Structured, unstructured
• Schema, no schema
• High volume, low volume
• All kinds of analytic applications
The Need for CDH

1. The Apache Hadoop ecosystem is complex.
• Many different components – lots of moving parts
• Most companies require more than just HDFS and MapReduce
• Creating a Hadoop stack is time-consuming and requires specific expertise:
  – Component and version selection
  – Integration (internal and external)
  – System testing with end-to-end workflows

2. Enterprises consume software in a certain way.
• A system, not a silo
• Tested and stable
• Documented and supported
• Predictable release schedule
Core Values of CDH

A Hadoop system with everything you need for production use.

Components of the CDH stack:
• File system mount: FUSE-DFS
• UI framework / SDK: Hue, Hue SDK
• Workflow / scheduling: Apache Oozie
• Metadata: Apache Hive
• Data integration: Apache Flume, Apache Sqoop
• Languages / compilers: Apache Pig, Apache Hive, Apache Mahout
• Fast read/write access: Apache HBase
• Storage and processing: HDFS, MapReduce
• Coordination: Apache ZooKeeper
The Need for CDH

A set of open source components, packaged into a single system.

CORE APACHE HADOOP
• HDFS – distributed, scalable, fault-tolerant file system
• MapReduce – parallel processing framework for large data sets

WORKFLOW / COORDINATION
• Apache Oozie – server-based workflow engine for Hadoop activities
• Apache ZooKeeper – highly reliable distributed coordination service

QUERY / ANALYTICS
• Apache Hive – SQL-like language and metadata repository
• Apache Pig – high-level language for expressing data analysis programs
• Apache HBase – Hadoop database for random, real-time read/write access
• Apache Mahout – library of machine learning algorithms for Apache Hadoop

DATA INTEGRATION
• Apache Sqoop – integrates Hadoop with RDBMSs
• Apache Flume – distributed service for collecting and aggregating log and event data
• Fuse-DFS – module within Hadoop for mounting HDFS as a traditional file system

GUI / SDK
• Hue – browser-based desktop interface for interacting with Hadoop

CLOUD
• Apache Whirr – library for running Hadoop in the cloud
Core Hadoop Use Cases

Two core use cases, applied across verticals:

Vertical       | Data Processing            | Advanced Analytics
Web            | Clickstream Sessionization | Social Network Analysis
Media          | Engagement                 | Content Optimization
Telco          | Mediation                  | Network Analytics
Retail         | Data Factory               | Loyalty & Promotions Analysis
Financial      | Trade Reconciliation       | Fraud Analysis
Federal        | SIGINT                     | Entity Analysis
Bioinformatics | Genome Mapping             | Sequencing Analysis
FMV & Image Processing

Data Processing – Full Motion Video & Image Processing
• Record-by-record processing -> easy parallelization
• Choosing the right "unit of work" is important
• Raw data lands in HDFS
• Existing image analyzers are adapted to map-only / MapReduce jobs (see the sketch below)
• Scales horizontally
• Simple detections: vehicles, structures, faces
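A hedged sketch of that adaptation pattern: a map-only job where each input record is one frame or image, and a pre-existing analyzer runs once per record. ImageDetector is a hypothetical stand-in for such a legacy analyzer, and the (Text, BytesWritable) record layout assumes frames were packed into SequenceFiles.

```java
// Map-only sketch: one input record = one image/video frame, analyzed in place.
// ImageDetector is a hypothetical stand-in for an existing legacy analyzer.
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FrameDetectionMapper extends Mapper<Text, BytesWritable, Text, Text> {

  /** Hypothetical wrapper interface around an existing image analyzer. */
  public interface ImageDetector {
    Iterable<String> detect(byte[] imageBytes);
  }

  private ImageDetector detector;

  @Override
  protected void setup(Context context) {
    // A real job would load the concrete analyzer here (e.g. shipped via the
    // distributed cache); a no-op stub keeps this sketch self-contained.
    detector = new ImageDetector() {
      public Iterable<String> detect(byte[] imageBytes) {
        return Collections.emptyList();
      }
    };
  }

  @Override
  protected void map(Text frameId, BytesWritable frame, Context context)
      throws IOException, InterruptedException {
    // Trim the BytesWritable backing buffer to its valid length.
    byte[] imageBytes = Arrays.copyOf(frame.getBytes(), frame.getLength());
    // Run the unchanged analyzer once per record and emit each detection,
    // e.g. ("frame-000123", "vehicle").
    for (String detection : detector.detect(imageBytes)) {
      context.write(frameId, new Text(detection));
    }
  }
}
```

Setting job.setNumReduceTasks(0) in the driver makes the job map-only, so detections are written straight back to HDFS and throughput scales with the number of input splits.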
Cybersecurity Analysis

Advanced Analytics – Cybersecurity Analysis
• Rates and flows – ingest rates can exceed several gigabytes per second
• Can be complex because of mixed-workload clusters
• Typically involves ad-hoc, question-oriented analytics
• "Productionized" use cases allow insight by non-analysts (see the sketch below)
• Existing open source solution: SHERPASURFING
  – Focuses on the cybersecurity analysis underpinnings for common data sets (pcap, netflow, audit logs, etc.)
  – Provides a means to ask questions without reinventing all the plumbing
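As an illustration of one such "productionized" question, the sketch below totals bytes per source IP from flow records landed in HDFS as CSV text. The column layout (source IP in field 0, byte count in field 4) is an assumption made for illustration; real netflow exports differ, and SHERPASURFING's own schemas are not shown here.

```java
// Sketch of a canned cybersecurity question: total bytes sent per source IP.
// Assumed CSV layout: field 0 = source IP, field 4 = byte count (illustrative).
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BytesBySourceMapper extends Mapper<Object, Text, Text, LongWritable> {
  private final Text srcIp = new Text();
  private final LongWritable bytes = new LongWritable();

  @Override
  protected void map(Object key, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    if (fields.length < 5) {
      return;  // skip malformed records rather than failing the task
    }
    try {
      bytes.set(Long.parseLong(fields[4].trim()));
    } catch (NumberFormatException e) {
      return;  // skip records with a non-numeric byte count
    }
    srcIp.set(fields[0].trim());
    context.write(srcIp, bytes);
  }
}
```

Paired with Hadoop's stock org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer, this yields per-source totals that a non-analyst can consume without touching the plumbing.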
Index Preparation

Data Processing – Index Preparation
• Hadoop's seminal use case
• Dynamic partitioning -> easy parallelization
• String interning
• Inverted index construction (sketched below)
• Dimensional data capture
• Destination indices: Lucene/Solr (and derivatives), Endeca
• Existing solution: USASearch (http://usasearch.howto.gov/)
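A minimal sketch of the inverted index construction named above: map emits (term, docId) pairs, and reduce collapses each term's postings into one line. Treating the input key as a document ID is an assumption; in practice the ID is often derived from the input file path instead.

```java
// Inverted-index sketch: term -> deduplicated list of documents containing it.
// Assumes (docId, body) input records, e.g. read from a SequenceFile.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  // Map: emit (term, docId) for every term in the document body.
  public static class TermMapper extends Mapper<Text, Text, Text, Text> {
    private final Text term = new Text();

    @Override
    protected void map(Text docId, Text body, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(body.toString().toLowerCase());
      while (itr.hasMoreTokens()) {
        term.set(itr.nextToken());
        context.write(term, docId);
      }
    }
  }

  // Reduce: deduplicate each term's postings into one output line, ready for
  // bulk load into a destination index such as Lucene/Solr.
  public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      Set<String> postings = new HashSet<String>();
      for (Text docId : docIds) {
        postings.add(docId.toString());
      }
      StringBuilder joined = new StringBuilder();
      for (String id : postings) {
        if (joined.length() > 0) {
          joined.append(',');
        }
        joined.append(id);
      }
      context.write(term, new Text(joined.toString()));
    }
  }
}
```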
Data Landing Zone

Data Processing – Schema-less Enterprise Data Warehouse / Landing Zone
• Begins as storage, light ingest processing, and retrieval (see the sketch below)
• Capacity scales horizontally
• Schema-less -> holds arbitrary content
• Schema-less -> allows ad-hoc fusion and analysis
• Additional analytic workload forces decisions
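The storage-and-retrieval starting point can be as simple as the HDFS FileSystem API: land raw files with no schema imposed, then read them back on demand. A minimal sketch; the namenode address and all paths are illustrative.

```java
// Landing-zone sketch using the HDFS FileSystem API: schema-less ingest and
// retrieval of arbitrary files. Host names and paths are illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandingZone {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative namenode address; older Hadoop releases use fs.default.name.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    FileSystem fs = FileSystem.get(conf);
    try {
      // Light ingest: copy arbitrary content into HDFS; no schema is declared.
      fs.copyFromLocalFile(new Path("/data/raw/events.log"),
                           new Path("/landing/raw/events.log"));

      // Retrieval: pull a landed file back out for ad-hoc fusion or analysis.
      fs.copyToLocalFile(new Path("/landing/raw/events.log"),
                         new Path("/tmp/events.log"));
    } finally {
      fs.close();
    }
  }
}
```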
Hadoop: Getting Started

Reactive adoption
• Forced by scale, or by the cost of scaling

Proactive adoption
• Seek talent ahead of the need to build
• Identify data sets
• Determine high-value use cases that change organizational outcomes
• Start with 10-20 nodes and 10+ TB unless data sets are super-dimensional

Either way
• Talent is a major challenge
• Start with "data processing" use cases
• Physical infrastructure is complex; make the software infrastructure simple to manage
Customer Success

Self-source deployment vs. Cloudera Enterprise, for a 500-node deployment:
• Option 1: Cloudera Enterprise – estimated cost $2 million, deployment time ~2 months
• Option 2: Self-source – estimated cost $4.8 million, deployment time ~6 months

[Chart: cost in $ millions vs. time required for production deployment, in months]

Note: Cost estimates include personnel, software, and hardware.
Source: Cloudera internal estimates.
Customer Success

Cloudera Enterprise Subscription vs. Self-Source
Contact Us

Erin Hawley
Business Development, Cloudera DoD Engagement
ehawley@cloudera.com

Matt Mead
Sr. Systems Engineer, Cloudera Federal Engagements
mmead@cloudera.com