
Spark - Shark Data Analytics Stack on a Hadoop Cluster

Spark - Shark Data Analytics Stack on a Hadoop Cluster. April 22, 2013. Big Data Week Data Science Group. April 23, 2013. Michael Malak, Data Analytics Senior Engineer at Time Warner Cable, Technicaltidbit.com. Chris Deptula, Senior Big Data Consultant, 317.840.2935


Presentation Transcript


  1. Spark - Shark Data Analytics Stack on a Hadoop Cluster April 22, 2013

  2. Big Data Week Data Science Group April 23, 2013

  3. Michael Malak Data Analytics Senior Engineer at Time Warner Cable Technicaltidbit.com

  4. Chris Deptula Senior Big Data Consultant 317.840.2935 chris.deptula@openbi.com @chrisdeptula http://www.openbi.com

  5. Michael Walker Managing Partner 720.373.2200 m@rosebt.com http://www.rosebt.com

  6. Agenda: The Big Data Problem; Spark Ecosystem; NFL Data Science Use Case; Visualizing Data

  7. The Big Data Problem

  8. Speed Kills in Data Science

  9. Hype Cycle for Emerging Tech 2012

  10. Hype Cycle for Big Data 2012

  11. Evolution of DW Architecture

  12. Emerging DW Architecture

  13. Next-Generation Data Architecture

  14. Big Data Ecosystem Parts

  15. DW Database Systems MQ 2013

  16. Total Enterprise Data Growth 2005-2015

  17. Structured vs Unstructured Data

  18. Modern DW/BI Analytical Ecosystems

  19. Big Data Ecosystem Parts

  20. The Internet of Things

  21. Big Data 4 V's

  22. New World of Databases

  23. New World of Databases

  24. Hadoop

  25. Hadoop

  26. Hadoop

  27. Big Data Vendor Focused on Hadoop and NoSQL Revenue 2012

  28. Big Data Analytics Infrastructure

  29. The Spark Ecosystem

  30. Agenda • What Hadoop gives us • What everyone is complaining about in 2013 • Spark • Berkeley Team • BDAS (Berkeley Data Analytics Stack) • RDDs (Resilient Distributed Datasets) • Shark • Spark Streaming • Other Spark subsystems technicaltidbit.com

  31. What Hadoop Gives Us: HDFS, Map/Reduce

  32. Hadoop: HDFS (image from mark.chmarny.com)

  33. Hadoop: Map/Reduce (images from blog.octo.com and people.apache.org/~rdonkin)

  34. Map/Reduce Tools: Pig Script, HiveQL, HBase App, Hive, Pig, Hadoop, Linux

  35. Hadoop Distribution Dogs in the Race (columns: Hadoop Distribution, Query Tool): Apache Drill, Stinger

  36. Other Open Source Solutions: Druid, Spark

  37. Not just caching, but streaming. 1st generation: HDFS. 2nd generation: Caching & “Push” Map/Reduce. 3rd generation: Streaming

  38. Berkeley Team (image from Ion Stoica’s slides from Strata 2013 presentation): 40 students; 8 faculty; 3 staff software engineers; Silicon Valley style skunkworks office space; 2 years into 6 year program

  39. BDAS (Berkeley Data Analytics Stack), layers as listed on the slide: Bagel App, Shark App, Spark Streaming App; Bagel, Shark, Spark Streaming, Spark App; Spark; Hadoop/HDFS, Mesos; Linux

  40. RDDs (Resilient Distributed Datasets). Image from Matei Zaharia’s paper

  41. RDDs: Laziness
      val lines = spark.textFile("hdfs://...")           // lazy
      val errors = lines.filter(_.startsWith("ERROR"))   // lazy; _.startsWith("ERROR") is short for x => x.startsWith("ERROR")
                        .map(_.split('\t')(2))           // lazy
                        .filter(_.contains("foo"))       // lazy
      val cnt = errors.count                             // action: triggers the computation

  42. RDDs: Transformations vs. Actions
      Transformations: map(func), filter(func), flatMap(func), sample(withReplacement, frac, seed), union(otherDataset), groupByKey[K,V](func), reduceByKey[K,V](func), join[K,V,W](otherDataset), cogroup[K,V,W1,W2](other1, other2), cartesian[U](otherDataset), sortByKey[K,V]
      Actions: reduce(func), collect(), count(), take(n), first(), saveAsTextFile(path), saveAsSequenceFile(path), foreach(func)
      Note: [K,V] in Scala plays the same role as <K,V> in C++ templates and Java generics

  43. Hive vs. Shark. Hive: HiveQL on HDFS files. Shark: HiveQL on HDFS files + RDDs

  44. Shark: Copy from HDFS to RDD
      CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
      CREATE TABLE wiki_cached AS SELECT * FROM wiki;
      Creates a table that is stored in a cluster’s memory using RDD.cache().

  45. Shark: Just a Shim (images from Reynold Xin’s presentation)

  46. What about “Big Data”? (Figure: data-size scale PB, TB, GB, MB, KB, labeled “Shark Effectiveness”)

  47. Median Hadoop job input size (image from Reynold Xin’s presentation)

  48. Spark Streaming: Motivation (diagram: HDFS, x1,000,000 clients)

  49. Spark Streaming: DStream (“A series of small batches”)
      RDD (2 sec): {"id": "hercman", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "buyGoods"}, {"id": "shewolf", "eventType": "error"}
      RDD (2 sec): {"id": "shewolf", "eventType": "error"}
      . . .
      RDD (2 sec): {"id": "catlover", "eventType": "buyGoods"}, {"id": "hercman", "eventType": "logOff"}
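
Slide 41's point, that transformations only describe a computation and nothing runs until an action such as count is called, can be tried without a cluster. The following is a minimal plain-Scala analogy, not Spark code: a standard-library lazy view stands in for an RDD, and the sample log lines, the `LazinessSketch` object, and the `evaluated` counter are hypothetical names introduced only for illustration.

```scala
// Plain Scala analogy (no Spark dependency): a lazy view stands in for an RDD.
object LazinessSketch {
  // Hypothetical sample log lines, tab-separated: level, date, message.
  val lines: Seq[String] = Seq(
    "ERROR\t2013-04-22\tfoo failed",
    "INFO\t2013-04-22\tall good",
    "ERROR\t2013-04-22\tbar failed",
    "ERROR\t2013-04-23\tfoo timed out"
  )

  // Counts how often the first filter's predicate has actually run.
  var evaluated = 0

  // "Transformations": chaining on a view only records the pipeline.
  val errors = lines.view
    .filter { line => evaluated += 1; line.startsWith("ERROR") }
    .map(_.split('\t')(2))
    .filter(_.contains("foo"))

  def main(args: Array[String]): Unit = {
    println(s"predicate calls before the action: $evaluated")
    val cnt = errors.size // "action": forces the whole pipeline to run
    println(s"count = $cnt, predicate calls after the action: $evaluated")
  }
}
```

The counter stays at zero until `size` is called, which mirrors the "All Lazy" and "Action!" labels on the slide. Unlike a view, a real RDD also carries lineage and partitioning information, which this sketch does not attempt to model.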

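
Slide 49 describes a DStream as "a series of small batches", one RDD per 2-second interval. The sketch below imitates that batching in plain Scala rather than with the Spark Streaming API. The `Event` case class, the arrival times, and the `toBatches` helper are assumptions made for illustration, since the slide shows only ids and event types.

```scala
// Plain Scala analogy (no Spark dependency) for 2-second micro-batches.
object MicroBatchSketch {
  // Hypothetical event record; field names follow the JSON on the slide.
  final case class Event(timeMs: Long, id: String, eventType: String)

  // Hypothetical arrival times in milliseconds.
  val events: Seq[Event] = Seq(
    Event(100L, "hercman", "buyGoods"),
    Event(1500L, "shewolf", "error"),
    Event(2300L, "catlover", "buyGoods"),
    Event(3900L, "hercman", "logOff"),
    Event(4100L, "shewolf", "error")
  )

  val batchMs = 2000L // the slide's 2-second batch interval

  // Assign each event to a batch by integer division of its arrival time,
  // then order the batches by index.
  def toBatches(in: Seq[Event]): Seq[(Long, Seq[Event])] =
    in.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1)

  def main(args: Array[String]): Unit =
    for ((index, batch) <- toBatches(events)) {
      val errors = batch.count(_.eventType == "error")
      println(s"batch $index: ${batch.size} events, $errors errors")
    }
}
```

Each tuple returned by `toBatches` plays the role of one RDD in the stream, so any per-batch computation (here, counting error events) is ordinary batch code applied repeatedly.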