MarketShare Big Data Analytics

MarketShare Big Data Analytics: A Big Data Analytics Architecture for the Cloud

With and Without Buzzwords • With: Big Data analytics on the Cloud • Without: >50 TB analytics pipeline on Amazon Web Services using Hadoop

MarketShare: Big Data Analytics on the Cloud • Cloud architecture evolution • A real-life big data workflow on the cloud • Fixing issues in a big data workflow • (* Source: Interbrand 2011 report. MarketShare confidential and proprietary.)

Cloud + Big Data

Traditional 3-tier architecture • Components: User, Application Server, Master App Server, Database Server • Eliminate accessibility restrictions

Moving to web-based applications • Corporate data center: Web Server, Master App Server, Application Server(s), Database Server(s), Storage Server(s) • User on a laptop • Eliminate hardware expenditure

Moving to the cloud • Amazon Web Services cloud: Web Server (EC2 instance), App Server (EC2 instance), sharded EC2 instance with MySQL • User on a laptop • Eliminate distributed storage

Moving big data to Hadoop • Amazon Web Services cloud: Web Server (EC2 instance), App Server (EC2 instance), EC2 instance with MySQL, HDFS cluster • User on a laptop • Eliminate distributed systems management

Compute Elasticity • Amazon EC2 permanent instances: Web Server (EC2 instance), App Server (EC2 instance), EC2 instance with MySQL • Amazon EC2 on-demand instances: Amazon Elastic MapReduce (on-demand Hadoop instances) • User on a laptop

Storage Elasticity • Amazon EC2 permanent instances: Web Server (EC2 instance), App Server (EC2 instance), EC2 instance with MySQL • Amazon EC2 on-demand instances: Amazon Elastic MapReduce • Amazon managed storage: Amazon Simple Storage Service (S3) • Managed database services: RDS database instance • User on a laptop

Network Elasticity • Elastic Load Balancer • Two groups of Amazon EC2 permanent instances, each with a Web Server and an App Server (EC2 instances) • Amazon EC2 on-demand instances: Amazon Elastic MapReduce • Amazon managed storage: Amazon Simple Storage Service (S3), RDS database instance • User on a laptop

Defining the Cloud • Cloud = Managed Storage + Network Elasticity + On Demand Compute

An Example Big Data Workflow

Workflow stages • Raw Data Repository (SCP, FTP) • hadoop fs -put • Provision Cluster: Estimate Cluster Size, Verify Cluster Topology, Fix Cluster & Deploy Binaries • Transformation and Validation: Cleaning & truncation, GUID propagation, Sequencing, Summarization, Stats transformation, Statistical Analysis, Joining with dimensions, Transform in Generic Schema • Schema labels: Generic Schema, LFIS, LUIS, Funnel • Report Generation and Visualization • Customer Data Repository
Failure callouts • MR job not launched • Mapred job timeout, 95% done • Hadoop stopped responding
Cluster operations • Add Nodes • Rebalance Partitions • Restart • Terminate Cluster
Script excerpt:
....................................
-- GUID Propagation
CREATE TABLE userid_guid_map AS SELECT DISTINCT user_id, guid FROM dbact_guid;
CREATE TABLE mapcount AS SELECT COUNT(*) AS cnt, user_id FROM userid_guid_map GROUP BY user_id;
CREATE TABLE impr_guid AS SELECT i.*, m.guid FROM impr i JOIN userid_guid_map m ON (i.user_id = m.user_id);
SELECT COUNT(*) FROM impr_guid;
....................................
-- Exception Handling
!<cmd>
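The GUID propagation step in the script excerpt can be tried out locally before running it on a cluster. The following is a minimal sketch that replays the same statements against an in-memory SQLite database using Python's standard `sqlite3` module. The sample rows, the `ad_id` column, and the reading of the impressions table name as `impr` (aliased `i`) are assumptions made for illustration; they are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample data; column sets are assumptions for illustration.
cur.execute("CREATE TABLE dbact_guid (user_id TEXT, guid TEXT)")
cur.execute("CREATE TABLE impr (user_id TEXT, ad_id TEXT)")
cur.executemany("INSERT INTO dbact_guid VALUES (?, ?)",
                [("u1", "g1"), ("u1", "g1"), ("u2", "g2")])
cur.executemany("INSERT INTO impr VALUES (?, ?)",
                [("u1", "ad7"), ("u2", "ad8"), ("u3", "ad9")])

# Step 1: build the distinct user_id -> guid mapping.
cur.execute("CREATE TABLE userid_guid_map AS "
            "SELECT DISTINCT user_id, guid FROM dbact_guid")

# Step 2: count mappings per user (useful for spotting users with many GUIDs).
cur.execute("CREATE TABLE mapcount AS "
            "SELECT COUNT(*) AS cnt, user_id FROM userid_guid_map "
            "GROUP BY user_id")

# Step 3: propagate the GUID onto every impression with a known user_id.
cur.execute("CREATE TABLE impr_guid AS "
            "SELECT i.*, m.guid FROM impr i "
            "JOIN userid_guid_map m ON (i.user_id = m.user_id)")

# Validation: how many impressions received a GUID?
propagated = cur.execute("SELECT COUNT(*) FROM impr_guid").fetchone()[0]
rows = cur.execute(
    "SELECT user_id, ad_id, guid FROM impr_guid ORDER BY user_id"
).fetchall()
print(propagated, rows)
```

Because the join is an inner join, impressions whose `user_id` has no mapping (here `u3`) are dropped, which is why the final `COUNT(*)` works as a validation check on the step.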

The big picture

Data Management on the Cloud: Big Data + Dynamic Provisioning

Defining the cloud for databases • A dynamically provisioned commodity cluster of virtual machines with the following characteristics: • Infinite: a large number of nodes can be commissioned in minutes • Taxi meter: most services are billed at an hourly usage level

3 new problems for database query processing • When should resources be added to a data processing system? (Partition management for low cost on the cloud) • How many of these resources should be permanent? (Materialization with intermittent scalability) • Where should these resources be added in the stack? (Replication to improve query performance)

Transforming Data to Actionable Insights • Cube materialization: incremental updates, multi-dimensional indexes, multi-dimensional partitions • [Figure: campaign heat map over Time, Geography and Publishers; engagement shown as High, Medium or Low]

Generic Schema

Life of a Query • Query • Optimize Query • Identify Partitions • Access Partitions • Consolidate Results
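The stages on this slide can be sketched as a small pipeline. This is only an illustrative sketch, not the MarketShare query engine: the `Query` type, the `PARTITIONS` layout keyed by state, and the sum-only aggregation are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    states: tuple  # partition keys the query touches (hypothetical)
    metric: str    # column to sum (hypothetical)


# Hypothetical partitions, keyed by state.
PARTITIONS = {
    "CA": [{"clicks": 10}, {"clicks": 5}],
    "NY": [{"clicks": 7}],
    "TX": [{"clicks": 3}],
}


def optimize(query):
    # Deduplicate and sort keys so each partition is visited once.
    return Query(tuple(sorted(set(query.states))), query.metric)


def identify_partitions(query):
    # Keep only keys for which a partition actually exists.
    return [key for key in query.states if key in PARTITIONS]


def access_partition(key, metric):
    # Partial aggregate computed locally on one partition.
    return sum(row[metric] for row in PARTITIONS[key])


def consolidate(partials):
    # Merge the per-partition partial results.
    return sum(partials)


def run_query(query):
    optimized = optimize(query)
    keys = identify_partitions(optimized)
    partials = [access_partition(key, optimized.metric) for key in keys]
    return consolidate(partials)


total = run_query(Query(("NY", "CA", "CA", "ZZ"), "clicks"))
print(total)
```

In a real deployment the `access_partition` calls would run on the nodes that hold each partition, so only the small partial results travel to the consolidation step.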

Tuple Access Layer • Locate the partition for the fast dimension values • Note here that STATE is set to ‘*’ but it has already been materialized • Therefore, we eliminate any sum across all states • On the local partition, do the aggregation to calculate the ‘*’

5 types of partitions maintained in the cloud • Exclusive: EC2 nodes that are allocated for specific keys • Temporary: EC2 nodes that host temporary replicas • Permanent: EC2 nodes that are permanently allocated to service queries • Archive: S3 storage that is not accessible directly by the query engine • Intermittent: EC2 nodes that are allocated by the loading engine

Taxi Meter: Making resources transient

Usage Patterns • Usage patterns vary throughout the day and throughout the week • A couple of periods of heavy usage daily, followed by moderate to low usage • [Chart: computing resources by time of day (4 – 6am, 8 – 10am, 10 – Noon, Noon – 4pm, 4 – 6pm), annotated with these activities: data loading and cube materialization; users review standard reports looking for campaign exceptions; users run ad hoc queries to understand campaign exceptions; users make any necessary campaign adjustments; quick review of key reports by users prior to heading home]

Traditional Computing Approach • The traditional approach buys enough computing resources to meet peak usage demand • Even many cloud “solutions” provide only the peak computing power option, with no way to dynamically reallocate computing resources to match current usage demand • Result: substantial waste in computing resources and money • [Chart: maximum computing resources vs. computing resources used, by time of day (4 – 6am, 8 – 10am, 10 – Noon, Noon – 4pm, 4 – 6pm); the gap is marked $$]

“Adaptive” Computing Economics • Finely matching computing resources to user usage patterns can provide a 50% to 90% cost savings versus the traditional computing resource allocation approach • Result: lower cost with improvements in availability and performance • [Chart: maximum computing resources, adaptive computing resources and computing resources used, by time of day (4 – 6am, 8 – 10am, 10 – Noon, Noon – 4pm, 4 – 6pm)]
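The comparison between peak provisioning and adaptive provisioning comes down to node-hours: peak provisioning pays for the busiest hour all day, while adaptive provisioning pays for what each hour needs. A minimal sketch follows. The hourly demand profile is entirely hypothetical (the slides give no node counts per time slot) and is shaped only loosely after the usage pattern described earlier, with a load spike in the early morning.

```python
# Hypothetical number of nodes needed in each hour of one day (24 values).
HOURLY_DEMAND = (
    [2] * 4     # midnight - 4am: idle
    + [10] * 2  # 4 - 6am: data loading and cube materialization
    + [2] * 2   # 6 - 8am: idle
    + [6] * 2   # 8 - 10am: standard reports
    + [8] * 2   # 10 - noon: ad hoc queries
    + [4] * 4   # noon - 4pm: campaign adjustments
    + [6] * 2   # 4 - 6pm: quick review of key reports
    + [2] * 6   # 6pm - midnight: idle
)


def peak_node_hours(demand):
    # Traditional approach: hold the peak node count for every hour.
    return max(demand) * len(demand)


def adaptive_node_hours(demand):
    # Adaptive approach: pay only for the nodes each hour needs.
    return sum(demand)


def savings_fraction(demand):
    return 1 - adaptive_node_hours(demand) / peak_node_hours(demand)


print(peak_node_hours(HOURLY_DEMAND),
      adaptive_node_hours(HOURLY_DEMAND),
      round(savings_fraction(HOURLY_DEMAND), 3))
```

With this made-up profile the savings land near the low end of the 50% to 90% range quoted on the slide; a spikier profile, with a short heavy load window and long quiet periods, pushes the figure higher.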

Intermittent scalability: Using a large number of nodes during load time

Managing CapEx with Role Based Clusters • Single cluster for data cleansing, load and query • 15 TB • 100 nodes • Monthly cost = $28,800

Role Based Clusters • 2 hours daily for load on 10 nodes • Query on 5 nodes • Monthly cost = $2,052 • [Diagram: Ad Server Data and Search Engine Data feed Data Cleansing, then Query Ready Partitions, Build Cube and Hibernate Cube; the UI sits on the query side]
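The node-hour arithmetic behind these figures can be worked through as follows. This is a sketch under stated assumptions: a 30-day month, and a flat rate of $0.40 per node-hour, which is inferred from the single-cluster figure ($28,800 for 100 nodes running around the clock) rather than stated on the slides. Under that flat rate the load and query clusters alone come to less than the slide's $2,052, so the slide's figure presumably uses different rates or includes components that are not itemized; the sketch does not try to reproduce it.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: 30-day billing month


def monthly_cost_cents(nodes, hours_per_day, cents_per_node_hour,
                       days=DAYS_PER_MONTH):
    # Cost in integer cents to avoid floating-point rounding.
    return nodes * hours_per_day * days * cents_per_node_hour


# Inferred rate: $28,800 / (100 nodes * 24 h * 30 days) = $0.40 per node-hour.
RATE_CENTS = 2_880_000 // (100 * HOURS_PER_DAY * DAYS_PER_MONTH)

# Single cluster for cleansing, load and query: 100 nodes, always on.
single_cluster = monthly_cost_cents(100, HOURS_PER_DAY, RATE_CENTS)

# Role-based clusters: 10 load nodes for 2 hours a day, 5 query nodes always on.
load_cluster = monthly_cost_cents(10, 2, RATE_CENTS)
query_cluster = monthly_cost_cents(5, HOURS_PER_DAY, RATE_CENTS)
role_based = load_cluster + query_cluster

# Savings implied by the two monthly figures quoted on the slides.
slide_savings = 1 - 2052 / 28800

print(single_cluster / 100, role_based / 100, round(slide_savings, 3))
```

Either way the conclusion is the same: paying for the 10 load nodes only during the daily load window, and keeping just the 5 query nodes running, cuts the monthly bill by more than 90%.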

Selective replication for hot partitions

Partition level query slowdown • Dynamic statistics: the query execution system logs status for each partition; if a particular partition is regularly lagging behind, it is marked for replication • Static statistics: the query execution system identifies skews in specific partitions; partitions with size skew etc. are marked for replication • [Diagram: operational (EC2) nodes holding partitions P1 to P6]
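Both kinds of statistics can be turned into a simple marking rule. The sketch below is one possible reading, not the system's actual logic: "regularly lagging" is interpreted as being slower than twice the median partition latency in at least half of the logged queries, and "size skew" as being more than twice the median partition size. The thresholds, function names and sample numbers are all hypothetical.

```python
from statistics import median


def lagging_partitions(query_logs, factor=2.0, min_fraction=0.5):
    """Dynamic statistics: partitions that are slow in many logged queries.

    query_logs is a list of {partition: latency_seconds} dicts, one per query.
    """
    slow_counts = {}
    for log in query_logs:
        typical = median(log.values())
        for partition, latency in log.items():
            if latency > factor * typical:
                slow_counts[partition] = slow_counts.get(partition, 0) + 1
    needed = min_fraction * len(query_logs)
    return sorted(p for p, n in slow_counts.items() if n >= needed)


def skewed_partitions(sizes, factor=2.0):
    """Static statistics: partitions much larger than the typical partition."""
    typical = median(sizes.values())
    return sorted(p for p, size in sizes.items() if size > factor * typical)


# Hypothetical per-query latency logs and partition sizes.
logs = [
    {"P1": 1.0, "P2": 1.2, "P3": 5.0, "P4": 1.1},
    {"P1": 0.9, "P2": 1.0, "P3": 4.0, "P4": 3.0},
    {"P1": 1.0, "P2": 1.1, "P3": 6.0, "P4": 1.0},
]
sizes = {"P1": 100, "P2": 110, "P3": 95, "P4": 105, "P5": 100, "P6": 400}

marked = sorted(set(lagging_partitions(logs)) | set(skewed_partitions(sizes)))
print(marked)
```

Using the median rather than the mean keeps a single very slow or very large partition from dragging the baseline up and hiding itself.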

Fixing partition level slowdown • If the query execution system detects SLA violations, it: • Adds two new temporary nodes (Temp 1) • Creates new replicas for the ‘hot’ partitions • [Diagram: operational (EC2) nodes Node 1, Node 2 and Node 3 holding partitions P1 to P6; temporary (EC2) node Temp 1 holding extra replicas of P3 and P6]
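The reaction to an SLA violation can be sketched as a pure function over a placement map. This is an illustrative sketch only: the placement dictionary, the per-partition latency input, the 2-second SLA and the round-robin assignment of replicas to temporary nodes are assumptions, and provisioning of the temporary EC2 nodes themselves is left out.

```python
def add_replicas(placement, hot_partitions, temp_nodes):
    """Return a new placement with each hot partition copied to a temp node."""
    if not temp_nodes:
        raise ValueError("at least one temporary node is required")
    new_placement = {node: list(parts) for node, parts in placement.items()}
    for node in temp_nodes:
        new_placement.setdefault(node, [])
    for index, partition in enumerate(hot_partitions):
        target = temp_nodes[index % len(temp_nodes)]  # round-robin
        if partition not in new_placement[target]:
            new_placement[target].append(partition)
    return new_placement


def handle_sla_violations(placement, latencies, sla_seconds, temp_nodes):
    """Replicate every partition whose latency exceeds the SLA."""
    hot = sorted(p for p, t in latencies.items() if t > sla_seconds)
    if not hot:
        return placement  # nothing to fix, keep the cluster as it is
    return add_replicas(placement, hot, temp_nodes)


# Hypothetical cluster state.
placement = {
    "Node 1": ["P1", "P2"],
    "Node 2": ["P3", "P4"],
    "Node 3": ["P5", "P6"],
}
latencies = {"P1": 0.8, "P2": 0.9, "P3": 3.5, "P4": 1.0, "P5": 0.7, "P6": 4.2}

fixed = handle_sla_violations(placement, latencies, sla_seconds=2.0,
                              temp_nodes=["Temp 1"])
print(fixed["Temp 1"])
```

Returning a new placement instead of mutating the old one makes it easy to drop the temporary nodes again once the hot spot has passed.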

Key level query slowdown • Key-level dynamic statistics: a particular key takes time for materializing various facets of the cube • [Diagram: operational (EC2) nodes holding partitions P1 to P6]

Fixing key level slowdown • If the query execution system detects SLA violations for a particular key, it: • Adds a new temporary node (Temp 2) • Denormalizes the key such that all data for that key is materialized • [Diagram: operational (EC2) nodes Node 1, Node 2 and Node 3 holding partitions P1 to P6; temporary (EC2) node Node 5 holding the materialized key KN]

Partitions can be in 5 different states • Isolated (EC2) • Replicated (temporary EC2) • Operational (EC2) • Archive (S3/EBS) • Load (temporary EC2) • Transitions: isolate keys on separate machines; create replicas if a partition is hot; post load, archive the partitions • Data Retrieval Module: retrieve data from materialized keys; access replicas if the base partition is overwhelmed; access base partitions if the data is not materialized
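The five states and the Data Retrieval Module's order of preference can be sketched as an enum plus a routing function. This is an interpretation, not the module's actual code: in particular, it assumes that "materialized" data lives in the isolated state (keys isolated on separate machines), and the boolean inputs are hypothetical stand-ins for whatever bookkeeping the real system keeps.

```python
from enum import Enum


class PartitionState(Enum):
    ISOLATED = "isolated (EC2)"
    REPLICATED = "replicated (temporary EC2)"
    OPERATIONAL = "operational (EC2)"
    ARCHIVE = "archive (S3/EBS)"
    LOAD = "load (temporary EC2)"


def choose_source(key_materialized, base_overwhelmed, has_replica):
    """Pick where the retrieval module should read from.

    Order of preference, following the slide:
    1. materialized data for the key, if it exists;
    2. a replica, if the base partition is overwhelmed and a replica exists;
    3. the base (operational) partition otherwise.
    """
    if key_materialized:
        return PartitionState.ISOLATED
    if base_overwhelmed and has_replica:
        return PartitionState.REPLICATED
    return PartitionState.OPERATIONAL


print(choose_source(key_materialized=False, base_overwhelmed=True,
                    has_replica=True).value)
```

Archive and load partitions never appear as a return value, which matches the earlier note that archive storage is not directly accessible by the query engine.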

Next Steps • Lots of challenges in cloud + modeling • Collaboration opportunities • We are hiring!
