
Introduction to Spark, Shark, and Spark Streaming

This talk provides an overview of Spark, Shark, and Spark Streaming, covering their architecture, deployment methodology, and performance, and explains how they fit into the BDAS and Hadoop stacks.


Presentation Transcript


  1. Spark, Shark and Spark Streaming Introduction, Part 2 • Tushar Kale • tusharkale@in.ibm.com • June 2015

  2. This Talk • Introduction to Shark, Spark and Spark Streaming • Architecture • Deployment Methodology • Performance • References

  3. Data Processing Stack • Data Processing Layer • Resource Management Layer • Storage Layer

  4. Hadoop Stack • Data Processing Layer: Hive, Pig, HBase, Storm, Hadoop MR, … • Resource Management Layer: Hadoop YARN • Storage Layer: HDFS, S3, …

  5. BDAS Stack • Data Processing Layer: Spark Streaming, BlinkDB, GraphX, MLbase, Shark (SQL), MLlib, Spark • Resource Management Layer: Mesos • Storage Layer: Tachyon over HDFS, S3, …

  6. How do BDAS & Hadoop fit together? • The two stacks run side by side: at the data processing layer, BDAS contributes Spark, Spark Streaming, Shark (SQL), MLlib/MLbase, BlinkDB, and GraphX, while Hadoop contributes Hadoop MR, Hive, Pig, HBase, and Storm • At the resource management layer, Mesos runs alongside Hadoop YARN • At the storage layer, Tachyon runs alongside HDFS, S3, …

  7. Apache Mesos • Enables multiple frameworks to share the same cluster resources (e.g., Hadoop, Storm, Spark) • Twitter's large-scale deployment: 6,000+ servers and 500+ engineers running jobs on Mesos • Third-party Mesos schedulers: AirBnB's Chronos and Twitter's Aurora • Mesosphere: a startup commercializing Mesos

  8. Apache Spark • Distributed execution engine • Fault-tolerant, efficient in-memory storage (RDDs) • Powerful programming model and APIs (Scala, Python, Java) • Fast: up to 100x faster than Hadoop • Easy to use: 5-10x less code than Hadoop • General: supports interactive & iterative apps • Two major releases since the last AMP Camp

  9. Spark Streaming • Large-scale streaming computation • Implements streaming as a sequence of <1s jobs • Fault tolerant • Handles stragglers • Ensures exactly-once semantics • Integrated with Spark: unifies batch, interactive, and streaming computations • Alpha release (Spring 2013) • A brief sketch follows
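To make the micro-batch model concrete, here is a minimal sketch (not from the original slides) of a streaming word count over 1-second batches; the socket source on localhost:9999 is a hypothetical input.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))  // each 1-second batch becomes one small Spark job

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical TCP source
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.print()  // emit per-batch word counts

    ssc.start()
    ssc.awaitTermination()
  }
}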

  10. Shark • Hive over Spark: full support for HiveQL and UDFs • Up to 100x faster when input is in memory • Up to 5-10x faster when input is on disk • Running on hundreds of nodes at Yahoo! • Two major releases alongside Spark

  11. Unified Programming Models • Unified system for SQL, graph processing, machine learning • All share the same set of workers and caches

  12. BlinkDB • Trades off query accuracy against performance using sampling • Why? In-memory processing doesn't guarantee interactive processing • E.g., ~10s of seconds just to scan 512 GB of RAM! • The gap between memory capacity and transfer rate keeps growing: capacity doubles every ~18 months, transfer rate only every ~36 months (e.g., 512 GB of RAM served at 40-60 GB/s across 16 cores)

  13. Key Insights • Input is often noisy: exact computations do not guarantee exact answers • Error is often acceptable if small and bounded • Main challenge: estimating errors for arbitrary computations • Alpha release (August 2013) • Allows users to build uniform and stratified samples • Provides error bounds for simple aggregate queries

  14. GraphX • Combines data-parallel and graph-parallel computations • Provides powerful abstractions: PowerGraph and Pregel implemented in less than 20 LOC! • Leverages Spark's fault tolerance • Alpha release: expected fall 2013 • A short sketch follows
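As a flavor of the API, a hedged sketch using the GraphX interface as it later stabilized (the alpha API differed); the edge-list path is hypothetical and sc is the shell's SparkContext.

import org.apache.spark.graphx.GraphLoader

// Load a graph from a hypothetical "srcId dstId" edge-list file
val graph = GraphLoader.edgeListFile(sc, "hdfs:/sparkdata/edges.txt")

// Run PageRank to convergence tolerance 0.001 and show the top-ranked vertices
val ranks = graph.pageRank(0.001).vertices
ranks.collect().sortBy(p => -p._2).take(5).foreach(println)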

  15. MLlib and MLbase • MLlib: high-quality library of ML algorithms • Will be released with Spark 0.8 (September 2013) • MLbase: makes ML accessible to non-experts • Declarative interface: lets users say what they want, e.g., classify(data) • Automatically picks the best algorithm for the given data and time budget • Allows developers to easily add and test new algorithms • Alpha release of MLI, the first component of MLbase, in September 2013 • A short MLlib sketch follows
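For illustration, a minimal MLlib sketch using the Spark 1.x API (hedged: the 0.8-era API differed slightly); the CSV path and its "label,feature,feature,…" layout are hypothetical.

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Parse hypothetical "label,f1,f2,..." lines into labeled points
val points = sc.textFile("hdfs:/sparkdata/points.csv").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}

val model = LogisticRegressionWithSGD.train(points, 100)  // 100 SGD iterations
model.predict(Vectors.dense(0.5, 1.2))                    // score a new point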

  16. Tachyon • In-memory, fault-tolerant storage system • Flexible API, including the HDFS API • Allows multiple frameworks (including Hadoop) to share in-memory data • Alpha release (June 2013) • A short sketch follows
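Because Tachyon speaks the HDFS API, Spark can address it with an ordinary path URI. A sketch (the host name is hypothetical; 19998 is Tachyon's default port, and the Tachyon client jar must be on the classpath):

val rdd = sc.textFile("tachyon://master:19998/sparkdata/input.txt")  // read from Tachyon
rdd.saveAsTextFile("tachyon://master:19998/sparkdata/output")        // write back; other frameworks can share it in memory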

  17. Compatibility with the Existing Ecosystem • Spark Streaming accepts inputs from Kafka, Flume, Twitter, TCP sockets, … • GraphX offers a GraphLab-compatible API • Shark supports the Hive API • The resource management layer (Mesos) supports Hadoop, Storm, and MPI • The storage layer (Tachyon over HDFS, S3, …) exposes the HDFS API

  18. Summary • BDAS addresses the next set of Big Data challenges • Unifies batch, interactive, and streaming computations • Makes it easy to develop sophisticated applications • Supports graph & ML algorithms and approximate queries • Has witnessed significant adoption: 20+ companies and 70+ individuals contributing code • Exciting ongoing work: MLbase, GraphX, BlinkDB, …

  19. This Talk • Introduction to Shark, Spark and Spark Streaming • Architecture • Deployment Methodology • Performance • References

  20. RDDs • Three methods for creation: parallelizing an existing collection, referencing a dataset, or transforming another RDD (see the sketch below) • The dataset can live in any storage supported by Hadoop: HDFS, Cassandra, HBase, Amazon S3, and others • File types supported: text files, SequenceFiles, and any other Hadoop InputFormat
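A minimal sketch of the three creation methods in the Scala shell (the HDFS path is illustrative):

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))           // 1. parallelize an existing collection
val fromStorage    = sc.textFile("hdfs:/sparkdata/input.txt")  // 2. reference an external dataset
val fromOtherRDD   = fromStorage.map(_.toUpperCase)            // 3. derive from another RDD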

  21. Scala and Python • Spark comes with two shells: Scala and Python • APIs are available for Scala, Python, and Java • Appropriate versions ship with each Spark release • Spark's native language is Scala, so it is more natural to write Spark applications in Scala • This presentation focuses on code examples in Scala

  22. Spark's Scala and Python Shells • A powerful tool to analyze data interactively • The Scala shell runs on the Java VM, so it can leverage existing Java libraries • Scala: launch the shell from the Spark home directory with ./bin/spark-shell and read in a text file with scala> val textFile = sc.textFile("README.txt") • Python: launch with ./bin/pyspark and read in a text file with >>> textFile = sc.textFile("README.txt")

  23. Scala • 'Scalable Language' • An object-oriented, functional programming language • Runs in a JVM • Java interoperability • Functions are passable objects • Two approaches: anonymous function syntax, x => x + 1, or static methods in a global singleton object: object MyFunctions { def func1(s: String): String = {…} } myRdd.map(MyFunctions.func1)

  24. Code Execution (1) • 'spark-shell' provides the Spark context as 'sc'
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
File sparkQuotes.txt: DAN Spark is cool / BOB Spark is fun / BRIAN Spark is great / DAN Scala is awesome / BOB Scala is flexible

  25. Code Execution (2)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
(Diagram: the quotes RDD now references File: sparkQuotes.txt)

  26. Code Execution (3)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
(Diagram: danQuotes is defined from quotes, which references the file)

  27. Code Execution (4)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
(Diagram: danSpark is defined from danQuotes; nothing has executed yet)

  28. Code Execution (5)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
(Diagram: the action triggers execution: a HadoopRDD reads the file, danQuotes holds the two DAN lines, danSpark holds "Spark" and "Scala", and count() returns 1)

  29. RDD Transformations • Transformations are lazily evaluated • Each returns a pointer to the transformed RDD (a few common ones are sketched below) • Full documentation at http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.package
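A sketch of a few common transformations (the path is illustrative; nothing executes until an action is called):

val lines = sc.textFile("hdfs:/sparkdata/input.txt")
val words = lines.flatMap(_.split(" "))   // one output element per word
val caps  = words.map(_.toUpperCase)      // transform each element
val longW = words.filter(_.length > 5)    // keep only matching elements
val uniq  = words.distinct()              // remove duplicates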

  30. RDD Actions • Actions return values (a few common ones are sketched below) • Full documentation at http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.package
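A sketch of a few common actions; each triggers execution and returns a value (or writes output) rather than another lazy RDD:

val words = sc.textFile("hdfs:/sparkdata/input.txt").flatMap(_.split(" "))
words.count()                                // number of elements
words.first()                                // first element
words.take(5)                                // first five elements as an Array
words.map(_.length).reduce(_ + _)            // total characters across all words
words.saveAsTextFile("hdfs:/sparkdata/out")  // write the dataset out (illustrative path)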

  31. RDD Persistence • Each node stores in memory any partitions of the cache that it computes • It reuses them in other actions on that dataset (or datasets derived from it) • Future actions are much faster (often by more than 10x) • Two methods for RDD persistence: persist() and cache() (see the sketch below)
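A sketch of both methods (the path is illustrative):

val logs = sc.textFile("hdfs:/sparkdata/logs.txt")
logs.cache()                               // shorthand for persist() at the default MEMORY_ONLY level
logs.filter(_.contains("ERROR")).count()   // first action computes and caches the partitions
logs.filter(_.contains("WARN")).count()    // subsequent actions reuse the cached partitions
// persist() additionally lets you pick a storage level, e.g.:
// logs.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)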

  32. Quick Introduction to DataFrames • An experimental API introduced in Spark 1.3 • A distributed collection of data organized into columns • Targeted at the Python ecosystem • Equivalent to a table in a database or a data frame in R/Python • Much richer optimization than other DataFrame implementations • Can be constructed from a wide variety of sources and APIs

  33. Create a DataFrame
val df = sqlContext.jsonFile("/home/ned/attendees.json")
df.show()
df.printSchema()
df.select("First Name").show()
df.select("First Name", "Age").show()
df.filter(df("age") > 40).show()
df.groupBy("age").count().show()

  34. Create a DataFrame from an RDD
case class attendees_class(first_name: String, last_name: String, age: Int)
val attendees = sc.textFile("/home/ned/attendees.csv").map(_.split(",")).map(p => attendees_class(p(0), p(1), p(2).trim.toInt)).toDF()
attendees.registerTempTable("attendees")
val youngppl = sqlContext.sql("select first_name, last_name from attendees where age < 35")
youngppl.map(t => "Name: " + t(0) + " " + t(1)).collect().foreach(println)

  35. SparkContext in Applications • The main entry point for Spark functionality • Represents the connection to a Spark cluster • Used to create RDDs, accumulators, and broadcast variables on that cluster • In the Spark shell, the SparkContext, sc, is automatically initialized for you • In a Spark program, import some classes and implicit conversions into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

  36. A Spark Standalone Application in Scala • Three parts: import statements, SparkConf and SparkContext, and transformations and actions
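The slide's code listing did not survive the transcript, so here is a minimal sketch of such an application (a hedged reconstruction, not the original listing), reusing the sparkQuotes.txt example from earlier:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Transformations and actions
    val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
    val count = quotes.filter(_.contains("Spark")).count()
    println("Lines mentioning Spark: " + count)
    sc.stop()
  }
}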

  37. Running Standalone Applications • Define the dependencies • Scala: simple.sbt • Create the typical directory structure with the files • Create a JAR package containing the application's code • Scala: sbt package • Use spark-submit to run the program (a sketch of the steps follows)
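A sketch of the dependency file and launch commands, assuming the SimpleApp example above and the Spark 1.2.1 / Scala 2.10 versions referenced elsewhere in this deck:

// simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"

// Package and submit (run sbt from the project directory, spark-submit from the Spark home directory):
//   sbt package
//   ./bin/spark-submit --class SimpleApp --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar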

  38. Fault Recovery • RDDs track lineage information that can be used to efficiently recompute lost data • Ex: msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2]) • (Lineage diagram: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD)

  39. Which Language Should I Use? • Standalone programs can be written in any of the three, but the interactive shell supports only Python and Scala • Python users can use Python for both • Java users should consider learning Scala for the shell • Performance: Java and Scala are faster thanks to static typing, but Python is often fine

  40. Scala Cheat Sheet
Variables:
var x: Int = 7
var x = 7      // type inferred
val y = "hi"   // read-only
Functions:
def square(x: Int): Int = x * x
def square(x: Int): Int = { x * x }  // last line is returned
Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)   // Array(3, 4, 5)
nums.map(x => x + 2)          // same
nums.map(_ + 2)               // same
nums.reduce((x, y) => x + y)  // 6
nums.reduce(_ + _)            // same
Java interop:
import java.net.URL
new URL("http://cnn.com").openStream()
More details: scala-lang.org

  41. Spark in Scala and Java
// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();

  42. This Talk • Introduction to Shark, Spark and Spark Streaming • Architecture • Deployment Methodology • Performance • References

  43. Behavior with Less RAM

  44. Performance • Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency • Tested with 100 text streams on 100 EC2 instances with 4 cores each

  45. Performance and Generality (Unified Computation Models) • Streaming (Spark Streaming) • Batch (ML, Spark) • Interactive (SQL, Shark)

  46. Example: Video Quality Diagnosis • Full query over 17 TB of input: 772.34 sec latency • Sampled query over 1.7 GB of input: 1.78 sec latency (440x faster!) • Top 10 worst performers identical in both cases

  47. This Talk • Introduction to Shark, Spark and Spark Streaming • Architecture • Deployment Methodology • Performance • References

  48. References • https://amplab.cs.berkeley.edu/software/ • www.bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/ • www.ibm.com/analytics/us/en/technology/spark/

  49. THANK YOU
