Spark - Shark Data Analytics Stack on a Hadoop Cluster. April 22, 2013. Big Data Week Data Science Group. April 23, 2013. Michael Malak, Data Analytics Senior Engineer at Time Warner Cable, Technicaltidbit.com. Chris Deptula, Senior Big Data Consultant, 317.840.2935
Spark - Shark Data Analytics Stack on a Hadoop Cluster April 22, 2013
Big Data Week Data Science Group April 23, 2013
Michael Malak Data Analytics Senior Engineer at Time Warner Cable Technicaltidbit.com
Chris Deptula Senior Big Data Consultant 317.840.2935 chris.deptula@openbi.com @chrisdeptula http://www.openbi.com
Michael Walker Managing Partner 720.373.2200 m@rosebt.com http://www.rosebt.com
Agenda
• The Big Data Problem
• Spark Ecosystem
• NFL Data Science Use Case
• Visualizing Data
Agenda
• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
• Berkeley Team
• BDAS (Berkeley Data Analytics Stack)
• RDDs (Resilient Distributed Datasets)
• Shark
• Spark Streaming
• Other Spark subsystems
What Hadoop Gives Us
• HDFS
• Map/Reduce
Hadoop: HDFS
Image from mark.chmarny.com
Hadoop: Map/Reduce
Images from blog.octo.com and people.apache.org/~rdonkin
Map/Reduce Tools (stack diagram, top to bottom)
• Pig Script, HiveQL, HBase App
• Hive, Pig
• Hadoop
• Linux
Hadoop Distribution Dogs in the Race
• Table columns: Hadoop Distribution, Query Tool
• Query tools named: Apache Drill, Stinger
Other Open Source Solutions
• Druid
• Spark
Not just caching, but streaming
• 1st generation: HDFS
• 2nd generation: Caching & "Push" Map/Reduce
• 3rd generation: Streaming
Berkeley Team
Image from Ion Stoica's slides from his Strata 2013 presentation
• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into a 6-year program
BDAS (Berkeley Data Analytics Stack) (stack diagram, top to bottom)
• Bagel App, Shark App, Spark Streaming App
• Bagel, Shark, Spark Streaming, Spark App
• Spark
• Hadoop/HDFS, Mesos
• Linux
RDDs (Resilient Distributed Datasets)
Image from Matei Zaharia's paper
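RDDs: Laziness
    // _.startsWith("ERROR") is shorthand for x => x.startsWith("ERROR")
    val lines = spark.textFile("hdfs://...")             // lazy
    val errors = lines.filter(_.startsWith("ERROR"))    // lazy
                      .map(_.split('\t')(2))            // lazy
                      .filter(_.contains("foo"))        // lazy
    val cnt = errors.count                               // action!
Every step up to errors is lazy; nothing is computed until the count action runs.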
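The laziness shown on this slide can be tried without a cluster. The following is a minimal sketch, assuming plain Scala standard-library views as a stand-in for RDDs (no Spark involved); the object name, the sample log lines, and the linesRead counter are hypothetical and only serve the illustration.

```scala
object LazyPipelineSketch {
  // Made-up, tab-separated log lines standing in for the HDFS file.
  val sampleLines: Seq[String] = Seq(
    "ERROR\tdb\tfoo timeout",
    "INFO\tweb\tstarted",
    "ERROR\tweb\tbar failed",
    "ERROR\tdb\tfoo refused"
  )

  // Returns (lines read before the action, lines read after it, final count).
  def run(rawLines: Seq[String]): (Int, Int, Int) = {
    var linesRead = 0 // incremented each time the pipeline pulls a line

    // .view makes map/filter lazy, loosely like RDD transformations.
    val lines = rawLines.view.map { l => linesRead += 1; l }
    val errors = lines
      .filter(_.startsWith("ERROR"))
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))

    val readBeforeAction = linesRead // nothing has been evaluated yet
    val cnt = errors.size            // plays the role of the count action
    (readBeforeAction, linesRead, cnt)
  }
}
```

Calling LazyPipelineSketch.run(LazyPipelineSketch.sampleLines) should report zero lines read before the size call and a count of 2 afterwards. Unlike an RDD, a local view has no partitioning, lineage-based recovery, or caching, so this illustrates only the deferred evaluation.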
RDDs: Transformations vs. Actions
Transformations
• map(func)
• filter(func)
• flatMap(func)
• sample(withReplacement, frac, seed)
• union(otherDataset)
• groupByKey[K,V]()
• reduceByKey[K,V](func)
• join[K,V,W](otherDataset)
• cogroup[K,V,W1,W2](other1, other2)
• cartesian[U](otherDataset)
• sortByKey[K,V]
Actions
• reduce(func)
• collect()
• count()
• take(n)
• first()
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• foreach(func)
Note: type parameters such as [K,V] in Scala play the same role as <K,V> in C++ templates and Java generics.
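To make the key/value operations and the [K,V] notation concrete, here is a minimal word-count sketch on local Scala collections. It is not Spark code: reduceByKeyLocal is a hypothetical helper that only mimics the result of reduceByKey on an in-memory sequence of pairs, and the sample documents are made up.

```scala
object PairOpsSketch {
  // Hypothetical helper: mimics the effect of reduceByKey on a local Seq of pairs.
  // [K, V] are type parameters, comparable to <K, V> in Java generics.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Made-up input documents.
  val docs: Seq[String] = Seq("spark shark spark", "shark bagel")

  // flatMap splits documents into words; map turns each word into a (word, 1) pair.
  val pairs: Seq[(String, Int)] = docs.flatMap(_.split(" ")).map(w => (w, 1))

  // Sum the 1s per word, as reduceByKey(_ + _) would on a pair RDD.
  val counts: Map[String, Int] = reduceByKeyLocal(pairs)(_ + _)
}
```

On a real RDD the flatMap and map steps would be transformations and would stay unevaluated until an action such as collect() or count() ran; on a local Seq everything is computed eagerly.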
Hive vs. Shark
• Hive: HiveQL over HDFS files
• Shark: HiveQL over HDFS files + RDDs
Shark: Copy from HDFS to RDD
    CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
    CREATE TABLE wiki_cached AS SELECT * FROM wiki;
Either statement creates a table that is stored in the cluster's memory using RDD.cache().
Shark: Just a Shim
Images from Reynold Xin's presentation
What about "Big Data"?
Chart: "Shark Effectiveness" shown against a data-size scale of PB, TB, GB, MB, KB
Median Hadoop job input size
Image from Reynold Xin's presentation
Spark Streaming: Motivation
Diagram: HDFS, x1,000,000 clients
Spark Streaming: DStream
A DStream is "a series of small batches", each one an RDD covering 2 sec:
• RDD (2 sec): {"id": "hercman", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "buyGoods"}, {"id": "shewolf", "eventType": "error"}
• RDD (2 sec): {"id": "shewolf", "eventType": "error"}
• . . .
• RDD (2 sec): {"id": "catlover", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "logOff"}
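The "series of small batches" idea can be sketched in plain Scala by cutting a list of timestamped events into fixed 2-second windows, with each window playing the role of one RDD in the DStream. This is only an illustration under assumptions, not the Spark Streaming API: the Event case class, the timestamps, and the toBatches helper are all hypothetical.

```scala
object MicroBatchSketch {
  // Hypothetical event shape; timestamps are made up for the example.
  final case class Event(timestampMs: Long, id: String, eventType: String)

  // Group events into consecutive fixed-width windows, ordered by time.
  // Windows that contain no events are simply absent from the result.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timestampMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  val events: Seq[Event] = Seq(
    Event(100L, "hercman", "buyGoods"),
    Event(1500L, "shewolf", "error"),
    Event(2500L, "shewolf", "error"),
    Event(4200L, "catlover", "buyGoods"),
    Event(5900L, "hercman", "logOff")
  )

  // 2-second batches, matching the interval on the slide.
  val batches: Seq[Seq[Event]] = toBatches(events, 2000L)

  // The same per-batch computation is applied to every small batch.
  val errorsPerBatch: Seq[Int] = batches.map(_.count(_.eventType == "error"))
}
```

The point of the sketch is that each batch is an ordinary finite dataset, so the same filter and count logic used for batch jobs applies to every interval. A real streaming system receives events continuously and would also emit a batch for an interval in which nothing arrived.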