Data Management Platform on Hadoop (Incubating) Srikanth Sundarrajan, Venkatesh Seetharam
Agenda 1. Motivation 2. Falcon Overview 3. Case Studies 4. Questions & Answers
Data Processing Landscape (diagram labels): External data source; Acquire (Import); Data Processing (Transform/Pipeline); Replicate (Copy); Export; Archive; Eviction
Process Management – Relays (picture courtesy: http://istockphoto.com/)
Late Data Management (picture courtesy: http://iwebask.com)
Data Retention As Service (picture courtesy: http://vimeo.com/)
Data Replication As Service (picture courtesy: http://boylesmedia.com)
Data Acquisition As Service (picture courtesy: http://wmpu.org)
Operability – Dashboard (picture courtesy: http://www.opentrack.ch/)
Holistic Declaration of Intent (picture courtesy: http://bigboxdetox.com)
Entity Dependency Graph (diagram labels): Cluster (Hadoop / HBase …); External data source; Feed; Process; the entities are connected by "depends" edges
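The dependency graph on this slide can be illustrated with a small sketch. This is not Falcon's implementation; the class `EntityGraph`, its methods, and the entity names are hypothetical. The direction of the edges (a feed depends on a cluster, a process depends on feeds and a cluster) is an assumption read from the slide labels.

```python
from collections import defaultdict


class EntityGraph:
    """Toy registry of entities and the entities they depend on (hypothetical)."""

    def __init__(self):
        self.kind = {}                       # entity name -> "cluster" | "feed" | "process"
        self.depends_on = defaultdict(set)   # entity name -> names it depends on
        self.dependents = defaultdict(set)   # entity name -> names that depend on it

    def submit(self, name, kind, depends_on=()):
        # Reject an entity whose dependencies have not been submitted yet.
        missing = [d for d in depends_on if d not in self.kind]
        if missing:
            raise ValueError(f"{name} depends on unknown entities: {missing}")
        self.kind[name] = kind
        for dep in depends_on:
            self.depends_on[name].add(dep)
            self.dependents[dep].add(name)

    def can_delete(self, name):
        # An entity is safe to delete only if nothing else depends on it.
        return not self.dependents[name]


# Assumed edge direction: feed -> cluster, process -> feed and cluster.
graph = EntityGraph()
graph.submit("primary-cluster", "cluster")
graph.submit("clicks", "feed", ["primary-cluster"])
graph.submit("summarizer", "process", ["primary-cluster", "clicks"])
```

Tracking both directions of each edge makes the two common checks cheap: validating a new entity at submission time, and refusing to delete an entity that others still reference.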
High Level Architecture (diagram labels): Apache Falcon; Hadoop; Config store; Entity; Oozie; CLI/REST; Entity status; HCatalog; Process status / notification; Messaging (JMS)
Feed Schedule (diagram labels): Cluster xml; Feed xml; Falcon; Falcon config store / Graph; Retention / Replication workflow; Oozie Scheduler; HDFS; Catalog service; Instance Management; JMS Notification per action
Process Schedule (diagram labels): Cluster/feed xml; Process xml; Falcon; Falcon config store / Graph; Process workflow; Oozie Scheduler; HDFS; Catalog service; Instance Management; JMS Notification per available feed
Physical Architecture (diagram labels): Falcon and Scheduler in each of Colo1, Colo2 and Colo3; Falcon – Prism (global view)
Multi Cluster – Failover (diagram labels): Primary Hadoop Cluster with Staged Data, Cleansed Data, Conformed Data, Presented Data, BI and Analytics; Replication; Failover Hadoop Cluster with Staged Data and Presented Data • Falcon manages workflow, replication or both. • Enables business continuity without requiring full data reprocessing. • Failover clusters require less storage and CPU.
Retention Policies (diagram): Staged Data, retain 5 years; Cleansed Data, retain 3 years; Conformed Data, retain 3 years; Presented Data, retain last copy only • Sophisticated retention policies expressed in one place. • Simplify data retention for audit, compliance, or for data re-processing.
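The retention policies on this slide can be sketched as a small eviction calculation. This is an illustration, not Falcon's retention logic. The pairing of each dataset with a retention period is assumed from the order in which the labels appear on the slide, a year is approximated as 365 days, and the names `RETENTION`, `instances_to_evict` and `keep_last_copy_only` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed pairing of datasets to retention periods (slide order); 1 year ~ 365 days.
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
}


def instances_to_evict(instances, limit, now):
    """Return the instance timestamps that are older than the retention limit."""
    cutoff = now - limit
    return [ts for ts in instances if ts < cutoff]


def keep_last_copy_only(instances):
    """Return every instance except the newest one (the 'last copy' is kept)."""
    return sorted(instances)[:-1]


now = datetime(2014, 1, 1)
staged = [datetime(2008, 1, 1), datetime(2012, 1, 1)]
evicted = instances_to_evict(staged, RETENTION["staged"], now)
```

Keeping the policy table separate from the eviction function mirrors the slide's point that retention rules are expressed in one place.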
CASE STUDY Distributed Processing Example: Digital Advertising @ InMobi
Hadoop @ InMobi • About InMobi • World's leading independent mobile advertising company • Hadoop usage at InMobi • ~ 6 clusters • > 1PB of storage • > 5TB new data ingested each day • > 20TB data crunched each day • > 200 nodes in HDFS/MR clusters & > 40 nodes in HBase • > 175K Hadoop jobs / day • > 60K Oozie workflows / day • 300+ Falcon feed definitions • 100+ Falcon process definitions
Processing – Single Data Center (diagram labels): Ad Request data; Impression render event; Click event; Conversion event; Continuous Streaming (minutely); Enrichment (minutely/5 minutely); Summarizer; Hourly summary
Global Aggregation (diagram labels): Data Center 1 … Data Center N, each with Ad Request data, Impression render event, Click event, Conversion event, Continuous Streaming (minutely), Enrichment (minutely/5 minutely), Summarizer and Hourly summary; Consumable global aggregate
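The global aggregation step can be sketched as a merge of per-data-center hourly summaries. This is a minimal illustration under stated assumptions: the summary layout (a mapping from hour to event counts), the function `global_aggregate`, and the sample data are all hypothetical and do not come from the InMobi pipeline.

```python
from collections import Counter


def global_aggregate(per_datacenter):
    """Sum hourly summary counters from every data center, keyed by hour."""
    totals = {}
    for summaries in per_datacenter.values():
        for hour, counts in summaries.items():
            # Counter.update adds counts instead of overwriting them.
            totals.setdefault(hour, Counter()).update(counts)
    return totals


# Hypothetical hourly summaries from two data centers.
summaries = {
    "dc1": {"2014-01-01T00": {"clicks": 3, "impressions": 10}},
    "dc2": {"2014-01-01T00": {"clicks": 2}, "2014-01-01T01": {"clicks": 1}},
}
aggregate = global_aggregate(summaries)
```

Because the per-hour counts are additive, the merge does not depend on the order in which data centers report.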
Future 1. Security 2. Embed Pig/Hive scripts 3. Data Acquisition – file-based 4. Monitoring/Management Dashboard
Questions? • Apache Falcon • http://falcon.incubator.apache.org • mailto: dev@falcon.incubator.apache.org • Srikanth Sundarrajan • sriksun@apache.org • #sriksun • Venkatesh Seetharam • venkatesh@apache.org • #innerzeal