Hadoop Update: Big Data Analytics
May 23rd, 2012
Matt Mead, Cloudera
What is Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is:
• Scalable
• Fault tolerant
• Distributed

Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework

Together, they provide storage and computation in a single, scalable system.
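To make that division of labor concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API: HDFS holds the input and output, while the map and reduce phases do the computation. Paths and the job name are illustrative.

```java
// Minimal word-count sketch using the classic org.apache.hadoop.mapreduce API.
// Input/output paths live in HDFS; the map and reduce phases do the computation.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in this task's input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the per-word counts shuffled in from all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run as a standard job jar, e.g. `hadoop jar wordcount.jar WordCount /in /out`; note that the output directory must not already exist.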
Why Use Hadoop?

Move beyond rigid legacy frameworks.

1. Hadoop helps you derive the complete value of all your data.
• Drives revenue by extracting value from data that was previously out of reach
• Controls costs by storing data more affordably than any other platform

2. Hadoop grows with your business.
• Proven at petabyte scale
• Capacity and performance grow simultaneously
• Leverages commodity hardware to mitigate costs

3. Hadoop is 100% Apache® licensed and open source.
• No vendor lock-in
• Community development
• Rich ecosystem of related projects

4. Hadoop handles any data type, in any quantity.
• Structured, unstructured
• Schema, no schema
• High volume, low volume
• All kinds of analytic applications
The Need for CDH

1. The Apache Hadoop ecosystem is complex.
• Many different components – lots of moving parts
• Most companies require more than just HDFS and MapReduce
• Creating a Hadoop stack is time-consuming and requires specific expertise:
  – Component and version selection
  – Integration (internal and external)
  – System testing with end-to-end workflows

2. Enterprises consume software in a certain way.
• A system, not a silo
• Tested and stable
• Documented and supported
• Predictable release schedule
Core Values of CDH

A Hadoop system with everything you need for production use.

Components of the CDH stack:
• File system mount: FUSE-DFS
• UI framework / SDK: Hue, Hue SDK
• Workflow / scheduling: Apache Oozie
• Metadata: Apache Hive
• Data integration: Apache Flume, Apache Sqoop
• Languages / compilers: Apache Pig, Apache Hive, Apache Mahout
• Fast read/write access: Apache HBase
• Storage and processing: HDFS, MapReduce
• Coordination: Apache ZooKeeper
The Need for CDH

A set of open source components, packaged into a single system.

CORE APACHE HADOOP
• HDFS – distributed, scalable, fault-tolerant file system
• MapReduce – parallel processing framework for large data sets

WORKFLOW / COORDINATION
• Apache Oozie – server-based workflow engine for Hadoop activities
• Apache ZooKeeper – highly reliable distributed coordination service

QUERY / ANALYTICS
• Apache Hive – SQL-like language and metadata repository
• Apache Pig – high-level language for expressing data analysis programs
• Apache HBase – Hadoop database for random, real-time read/write access
• Apache Mahout – library of machine learning algorithms for Apache Hadoop

DATA INTEGRATION
• Apache Sqoop – integrates Hadoop with RDBMSs
• Apache Flume – distributed service for collecting and aggregating log and event data
• Fuse-DFS – module within Hadoop for mounting HDFS as a traditional file system

GUI / SDK
• Hue – browser-based desktop interface for interacting with Hadoop

CLOUD
• Apache Whirr – library for running Hadoop in the cloud
Core Hadoop Use Cases

Two core use cases, applied across verticals:

Vertical       | Data Processing            | Advanced Analytics
Web            | Clickstream Sessionization | Social Network Analysis
Media          | Engagement                 | Content Optimization
Telco          | Mediation                  | Network Analytics
Retail         | Data Factory               | Loyalty & Promotions Analysis
Financial      | Trade Reconciliation       | Fraud Analysis
Federal        | SIGINT                     | Entity Analysis
Bioinformatics | Genome Mapping             | Sequencing Analysis
FMV & Image Processing

Data Processing – Full Motion Video & Image Processing
• Record-by-record processing -> easy parallelization
• Choosing the right "unit of work" is important
• Raw data lands in HDFS
• Existing image analyzers are adapted to map-only / MapReduce jobs (see the sketch below)
• Scales horizontally
• Simple detections: vehicles, structures, faces
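A hedged sketch of that adaptation pattern: a map-only job where each input record is one frame or image, and a pre-existing analyzer runs once per record. ImageDetector is a hypothetical stand-in for such a legacy analyzer, and the (Text, BytesWritable) record layout assumes frames were packed into SequenceFiles.

```java
// Map-only sketch: one input record = one image/video frame, analyzed in place.
// ImageDetector is a hypothetical stand-in for an existing legacy analyzer.
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FrameDetectionMapper extends Mapper<Text, BytesWritable, Text, Text> {

  /** Hypothetical wrapper interface around an existing image analyzer. */
  public interface ImageDetector {
    Iterable<String> detect(byte[] imageBytes);
  }

  private ImageDetector detector;

  @Override
  protected void setup(Context context) {
    // A real job would load the concrete analyzer here (e.g. shipped via the
    // distributed cache); a no-op stub keeps this sketch self-contained.
    detector = new ImageDetector() {
      public Iterable<String> detect(byte[] imageBytes) {
        return Collections.emptyList();
      }
    };
  }

  @Override
  protected void map(Text frameId, BytesWritable frame, Context context)
      throws IOException, InterruptedException {
    // Trim the BytesWritable backing buffer to its valid length.
    byte[] imageBytes = Arrays.copyOf(frame.getBytes(), frame.getLength());
    // Run the unchanged analyzer once per record and emit each detection,
    // e.g. ("frame-000123", "vehicle").
    for (String detection : detector.detect(imageBytes)) {
      context.write(frameId, new Text(detection));
    }
  }
}
```

Setting job.setNumReduceTasks(0) in the driver makes the job map-only, so detections are written straight back to HDFS and throughput scales with the number of input splits.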
Cybersecurity Analysis

Advanced Analytics – Cybersecurity Analysis
• Rates and flows – ingest rates can exceed several gigabytes per second
• Can be complex because of mixed-workload clusters
• Typically involves ad-hoc, question-oriented analytics
• "Productionized" use cases allow insight by non-analysts (see the sketch below)
• Existing open source solution: SHERPASURFING
  – Focuses on the cybersecurity analysis underpinnings for common data sets (pcap, netflow, audit logs, etc.)
  – Provides a means to ask questions without reinventing all the plumbing
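As an illustration of one such "productionized" question, the sketch below totals bytes per source IP from flow records landed in HDFS as CSV text. The column layout (source IP in field 0, byte count in field 4) is an assumption made for illustration; real netflow exports differ, and SHERPASURFING's own schemas are not shown here.

```java
// Sketch of a canned cybersecurity question: total bytes sent per source IP.
// Assumed CSV layout: field 0 = source IP, field 4 = byte count (illustrative).
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BytesBySourceMapper extends Mapper<Object, Text, Text, LongWritable> {
  private final Text srcIp = new Text();
  private final LongWritable bytes = new LongWritable();

  @Override
  protected void map(Object key, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    if (fields.length < 5) {
      return;  // skip malformed records rather than failing the task
    }
    try {
      bytes.set(Long.parseLong(fields[4].trim()));
    } catch (NumberFormatException e) {
      return;  // skip records with a non-numeric byte count
    }
    srcIp.set(fields[0].trim());
    context.write(srcIp, bytes);
  }
}
```

Paired with Hadoop's stock org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer, this yields per-source totals that a non-analyst can consume without touching the plumbing.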
Index Preparation

Data Processing – Index Preparation
• Hadoop's seminal use case
• Dynamic partitioning -> easy parallelization
• String interning
• Inverted index construction (sketched below)
• Dimensional data capture
• Destination indices: Lucene/Solr (and derivatives), Endeca
• Existing solution: USASearch (http://usasearch.howto.gov/)
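A minimal sketch of the inverted index construction named above: map emits (term, docId) pairs, and reduce collapses each term's postings into one line. Treating the input key as a document ID is an assumption; in practice the ID is often derived from the input file path instead.

```java
// Inverted-index sketch: term -> deduplicated list of documents containing it.
// Assumes (docId, body) input records, e.g. read from a SequenceFile.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  // Map: emit (term, docId) for every term in the document body.
  public static class TermMapper extends Mapper<Text, Text, Text, Text> {
    private final Text term = new Text();

    @Override
    protected void map(Text docId, Text body, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(body.toString().toLowerCase());
      while (itr.hasMoreTokens()) {
        term.set(itr.nextToken());
        context.write(term, docId);
      }
    }
  }

  // Reduce: deduplicate each term's postings into one output line, ready for
  // bulk load into a destination index such as Lucene/Solr.
  public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      Set<String> postings = new HashSet<String>();
      for (Text docId : docIds) {
        postings.add(docId.toString());
      }
      StringBuilder joined = new StringBuilder();
      for (String id : postings) {
        if (joined.length() > 0) {
          joined.append(',');
        }
        joined.append(id);
      }
      context.write(term, new Text(joined.toString()));
    }
  }
}
```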
Data Landing Zone

Data Processing – Schema-less Enterprise Data Warehouse / Landing Zone
• Begins as storage, light ingest processing, and retrieval (see the sketch below)
• Capacity scales horizontally
• Schema-less -> holds arbitrary content
• Schema-less -> allows ad-hoc fusion and analysis
• Additional analytic workload forces decisions
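The storage-and-retrieval starting point can be as simple as the HDFS FileSystem API: land raw files with no schema imposed, then read them back on demand. A minimal sketch; the namenode address and all paths are illustrative.

```java
// Landing-zone sketch using the HDFS FileSystem API: schema-less ingest and
// retrieval of arbitrary files. Host names and paths are illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandingZone {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative namenode address; older Hadoop releases use fs.default.name.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    FileSystem fs = FileSystem.get(conf);
    try {
      // Light ingest: copy arbitrary content into HDFS; no schema is declared.
      fs.copyFromLocalFile(new Path("/data/raw/events.log"),
                           new Path("/landing/raw/events.log"));

      // Retrieval: pull a landed file back out for ad-hoc fusion or analysis.
      fs.copyToLocalFile(new Path("/landing/raw/events.log"),
                         new Path("/tmp/events.log"));
    } finally {
      fs.close();
    }
  }
}
```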
Hadoop: Getting Started

Reactive adoption
• Forced by scale, or by the cost of scaling

Proactive adoption
• Seek talent ahead of the need to build
• Identify data sets
• Determine high-value use cases that change organizational outcomes
• Start with 10-20 nodes and 10+ TB unless data sets are super-dimensional

Either way
• Talent is a major challenge
• Start with "data processing" use cases
• Physical infrastructure is complex; make the software infrastructure simple to manage
Customer Success

Self-source deployment vs. Cloudera Enterprise, for a 500-node deployment:
• Option 1: Cloudera Enterprise – estimated cost $2 million, deployment time ~2 months
• Option 2: Self-source – estimated cost $4.8 million, deployment time ~6 months

[Chart: cost in $ millions vs. time required for production deployment, in months]

Note: Cost estimates include personnel, software, and hardware.
Source: Cloudera internal estimates.
Customer Success

Cloudera Enterprise Subscription vs. Self-Source
Contact Us

Erin Hawley
Business Development, Cloudera DoD Engagement
ehawley@cloudera.com

Matt Mead
Sr. Systems Engineer, Cloudera Federal Engagements
mmead@cloudera.com