Spark - Shark Data Analytics Stack on a Hadoop Cluster

Spark - Shark Data Analytics Stack on a Hadoop Cluster April 22, 2013

Big Data Week Data Science Group April 23, 2013

Michael Malak Data Analytics Senior Engineer at Time Warner Cable Technicaltidbit.com

Chris Deptula Senior Big Data Consultant 317.840.2935 chris.deptula@openbi.com @chrisdeptula http://www.openbi.com

Michael Walker Managing Partner 720.373.2200 m@rosebt.com http://www.rosebt.com

Agenda • The Big Data Problem • Spark Ecosystem • NFL Data Science Use Case • Visualizing Data

The Big Data Problem

Speed Kills in Data Science

Hype Cycle for Emerging Tech 2012

Hype Cycle for Big Data 2012

Evolution of DW Architecture

Emerging DW Architecture

Next-Generation Data Architecture

Big Data Ecosystem Parts

DW Database Systems MQ 2013

Total Enterprise Data Growth 2005-2015

Structured vs Unstructured Data

Modern DW/BI Analytical Ecosystems

Big Data Ecosystem Parts

The Internet of Things

Big Data 4 V's

New World of Databases

Hadoop

Big Data Vendor Focused on Hadoop and NoSQL Revenue 2012

Big Data Analytics Infrastructure

The Spark Ecosystem

Agenda • What Hadoop gives us • What everyone is complaining about in 2013 • Spark • Berkeley Team • BDAS (Berkeley Data Analytics Stack) • RDDs (Resilient Distributed Datasets) • Shark • Spark Streaming • Other Spark subsystems

What Hadoop Gives Us • HDFS • Map/Reduce

Hadoop: HDFS (Image from mark.chmarny.com)

Hadoop: Map/Reduce (Images from blog.octo.com and people.apache.org/~rdonkin)

Map/Reduce Tools • Pig Script • HiveQL • HBase App • Hive • Pig • Hadoop • Linux

Hadoop Distribution Dogs in the Race (table columns: Hadoop Distribution, Query Tool; query tools named: Apache Drill, Stinger)

Other Open Source Solutions • Druid • Spark

Not just caching, but streaming • 1st generation: HDFS • 2nd generation: Caching & “Push” Map/Reduce • 3rd generation: Streaming

Berkeley Team (Image from Ion Stoica’s slides from Strata 2013 presentation) • 40 students • 8 faculty • 3 staff software engineers • Silicon Valley style skunkworks office space • 2 years into 6 year program

BDAS (Berkeley Data Analytics Stack), top to bottom • Bagel App, Shark App, Spark Streaming App • Bagel, Shark, Spark Streaming, Spark App • Spark • Hadoop/HDFS, Mesos • Linux

RDDs (Resilient Distributed Datasets) (Image from Matei Zaharia’s paper)

RDDs: Laziness
    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))  // lazy; _.startsWith("ERROR") is short for x => x.startsWith("ERROR")
      .map(_.split('\t')(2))                          // lazy
      .filter(_.contains("foo"))                      // lazy
    val cnt = errors.count                            // action!
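The laziness on this slide can be tried without a cluster. The following is a minimal sketch that uses a plain Scala `Iterator` as a local stand-in for an RDD; it is not Spark code. The object name, the sample log lines, and the `inspected` counter are all hypothetical and exist only to make the deferred evaluation visible.

```scala
// Local stand-in for the slide's pipeline: an Iterator is lazy in the same
// spirit as an RDD, so nothing runs until a terminal call consumes it.
object LazyPipelineSketch {
  // Hypothetical sample log lines (tab-separated: level, time, message).
  val sampleLines: List[String] = List(
    "ERROR\t10:01\tfoo failed",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar failed",
    "ERROR\t10:04\tfoo timed out"
  )

  // Counts how many lines the first filter has actually inspected.
  var inspected: Int = 0

  // Builds the filter -> map -> filter chain without running it.
  def buildPipeline(lines: List[String]): Iterator[String] =
    lines.iterator
      .filter { line =>
        inspected += 1
        line.startsWith("ERROR")
      }
      .map(_.split('\t')(2))      // keep the third tab-separated field
      .filter(_.contains("foo"))
}
```

Calling `buildPipeline` leaves `inspected` at zero. Only a consuming call such as `.size`, which plays the role of `count` here, walks the input.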

RDDs: Transformations vs. Actions
Transformations • map(func) • filter(func) • flatMap(func) • sample(withReplacement, frac, seed) • union(otherDataset) • groupByKey[K,V]() • reduceByKey[K,V](func) • join[K,V,W](otherDataset) • cogroup[K,V,W1,W2](other1, other2) • cartesian[U](otherDataset) • sortByKey[K,V]
Actions • reduce(func) • collect() • count() • take(n) • first() • saveAsTextFile(path) • saveAsSequenceFile(path) • foreach(func)
Note: [K,V] in Scala is the same as <K,V> in C++ templates or Java generics.
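To illustrate what the key-value transformations compute and how the `[K,V]` type parameters read in practice, here is a hedged sketch over local Scala collections. `PairOpsSketch`, `groupByKeyLocal`, and `reduceByKeyLocal` are hypothetical names, not Spark APIs, and the code runs eagerly on one machine rather than lazily on a cluster.

```scala
// Local analogues of two pair transformations, written with plain collections.
object PairOpsSketch {
  // Analogue of groupByKey: collect all values that share a key.
  def groupByKeyLocal[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Analogue of reduceByKey: merge the values of each key with func.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKeyLocal(pairs).map { case (k, vs) => (k, vs.reduce(func)) }
}
```

For example, `PairOpsSketch.reduceByKeyLocal(Seq(("a", 1), ("b", 2), ("a", 3)))(_ + _)` sums the values per key.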

Hive vs. Shark • Hive: HiveQL over HDFS files • Shark: HiveQL over HDFS files + RDDs

Shark: Copy from HDFS to RDD
    CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
    CREATE TABLE wiki_cached AS SELECT * FROM wiki;
Each statement creates a table that is stored in a cluster’s memory using RDD.cache(). The first requests caching through the shark.cache table property; the second relies on the _cached suffix in the table name.

Shark: Just a Shim (Images from Reynold Xin’s presentation)

What about “Big Data”? (Figure: data-size scale from KB to PB, labeled “Shark Effectiveness”)

Median Hadoop job input size (Image from Reynold Xin’s presentation)

Spark Streaming: Motivation • HDFS • x1,000,000 clients

Spark Streaming: DStream
A DStream is “a series of small batches”: a sequence of RDDs, one per 2 sec interval.
• RDD (2 sec): {"id": "hercman", "eventType": "buyGoods"} {"id": "hercman", "eventType": "buyGoods"} {"id": "shewolf", "eventType": "error"}
• RDD (2 sec): {"id": "shewolf", "eventType": "error"}
• . . .
• RDD (2 sec): {"id": "catlover", "eventType": "buyGoods"} {"id": "hercman", "eventType": "logOff"}
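The "series of small batches" idea can be sketched locally by cutting a list of timestamped events into fixed 2-second windows and processing each window on its own. This is an illustration with plain Scala collections, not the Spark Streaming API. `MicroBatchSketch`, the `Event` case class, and its `timestampMs` field are assumptions added for the example; the slide's records carry only `id` and `eventType`.

```scala
// Local illustration of micro-batching: each batch stands in for one RDD.
object MicroBatchSketch {
  // Hypothetical event record; timestampMs is an assumed field.
  final case class Event(id: String, eventType: String, timestampMs: Long)

  // Splits events into fixed-width batches, ordered by batch index.
  // Intervals that contain no events produce no batch in this sketch.
  def toBatches(events: Seq[Event], batchMs: Long = 2000L): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / batchMs)
      .toSeq
      .sortBy(_._1)

  // Per-batch processing, as one might run on each RDD: count events by type.
  def countByType(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.eventType).map { case (t, es) => (t, es.size) }
}
```

Each `(index, events)` pair returned by `toBatches` corresponds to one 2-second slot, and `countByType` shows the kind of small job that would run once per slot.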
