This talk provides an overview of Spark, Shark, and Spark Streaming, including their architecture, deployment methodology, and performance. It also explores how BDAS and Hadoop fit together.
Spark, Shark and Spark Streaming Introduction Part2 Tushar Kale tusharkale@in.ibm.com June 2015
This Talk • Introduction to Shark, Spark and Spark Streaming • Architecture • Deployment Methodology • Performance • References
Data Processing Stack • Data Processing Layer • Resource Management Layer • Storage Layer
Hadoop Stack • Data Processing Layer: Hive, Pig, HBase, Storm, …, Hadoop MR • Resource Management Layer: Hadoop YARN • Storage Layer: HDFS, S3, …
BDAS Stack • Data Processing Layer: Spark Streaming, BlinkDB, GraphX, MLbase, Shark SQL, MLlib, Spark • Resource Management Layer: Mesos • Storage Layer: Tachyon over HDFS, S3, …
How do BDAS & Hadoop fit together? • Data Processing Layer: Spark Streaming, BlinkDB, GraphX, MLbase, Shark SQL, MLlib, and Spark run alongside Hive, Pig, HBase, Storm, and Hadoop MR • Resource Management Layer: Mesos and Hadoop YARN • Storage Layer: Tachyon over HDFS, S3, …
Apache Mesos • Enables multiple frameworks to share the same cluster resources (e.g., Hadoop, Storm, Spark) • Twitter's large-scale deployment • 6,000+ servers • 500+ engineers running jobs on Mesos • Third-party Mesos schedulers • Airbnb's Chronos • Twitter's Aurora • Mesosphere: startup to commercialize Mesos
Apache Spark • Distributed execution engine • Fault-tolerant, efficient in-memory storage (RDDs) • Powerful programming model and APIs (Scala, Python, Java) • Fast: up to 100x faster than Hadoop • Easy to use: 5-10x less code than Hadoop • General: supports interactive & iterative apps • Two major releases since the last AMPCamp
Spark Streaming • Large-scale streaming computation • Implements streaming as a sequence of <1s batch jobs • Fault tolerant • Handles stragglers • Ensures exactly-once semantics • Integrated with Spark: unifies batch, interactive, and streaming computations • Alpha release (Spring 2013)
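To illustrate the micro-batch model described above, here is a minimal Spark Streaming word-count sketch. It is not from the original slides; the application name, socket host/port, and 1-second batch interval are illustrative assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    // Each 1-second batch becomes a small Spark job, per the micro-batch model above
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source: text lines arriving on a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}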
Shark • Hive over Spark: full support for HiveQL and UDFs • Up to 100x faster when input is in memory • Up to 5-10x faster when input is on disk • Running on hundreds of nodes at Yahoo! • Two major releases alongside Spark
Unified Programming Models • Unified system for SQL, graph processing, machine learning • All share the same set of workers and caches
BlinkDB • Trades off query performance against accuracy using sampling • Why? • In-memory processing doesn't guarantee interactive processing • E.g., ~10s of seconds just to scan 512 GB of RAM at 40-60 GB/s on 16 cores! • The gap between memory capacity and transfer rate is increasing: capacity doubles roughly every 18 months, while bandwidth doubles roughly every 36 months
Key Insights • Input is often noisy: exact computations do not guarantee exact answers • Error is often acceptable if small and bounded • Main challenge: estimate errors for arbitrary computations • Alpha release (August 2013) • Allows users to build uniform and stratified samples • Provides error bounds for simple aggregate queries
GraphX • Combines data-parallel and graph-parallel computations • Provides powerful abstractions: • PowerGraph and Pregel implemented in less than 20 LOC! • Leverages Spark's fault tolerance • Alpha release: expected this fall
MLlib and MLbase • MLlib: high-quality library of ML algorithms • Will be released with Spark 0.8 (September 2013) • MLbase: makes ML accessible to non-experts • Declarative interface: allows users to say what they want • E.g., classify(data) • Automatically picks the best algorithm for the given data and time • Allows developers to easily add and test new algorithms • Alpha release of MLI, the first component of MLbase, in September 2013
Tachyon • In-memory, fault-tolerant storage system • Flexible API, including the HDFS API • Allows multiple frameworks (including Hadoop) to share in-memory data • Alpha release (June 2013)
Compatibility with the Existing Ecosystem • Spark Streaming accepts inputs from Kafka, Flume, Twitter, TCP sockets, … • GraphX supports the GraphLab API • Shark SQL supports the Hive API • Mesos (Resource Management Layer) supports Hadoop, Storm, MPI • Tachyon (Storage Layer) exposes the HDFS API over HDFS, S3, …
Summary • BDAS: addresses the next Big Data challenges • Unifies batch, interactive, and streaming computations • Easy to develop sophisticated applications • Supports graph & ML algorithms, approximate queries • Witnessed significant adoption • 20+ companies, 70+ individuals contributing code • Exciting ongoing work • MLbase, GraphX, BlinkDB, …
This Talk • Introduction to Shark, Spark and Spark Streaming • Architecture • Deployment Methodology • Performance • References
RDDs • Three methods for creation (illustrated below) • Parallelizing an existing collection • Referencing a dataset in external storage • Transforming an existing RDD • Datasets can come from any storage supported by Hadoop • HDFS • Cassandra • HBase • Amazon S3 • Others • File types supported • Text files • SequenceFiles • Hadoop InputFormat
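The following Scala sketch shows all three creation methods in the spark-shell (where sc is available). The collection contents are illustrative; the HDFS path reuses the sample file shown later in this deck.

// 1. Parallelizing an existing collection
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing a dataset in external storage
val lines = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")

// 3. Creating an RDD from another RDD via a transformation
val upper = lines.map(_.toUpperCase)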
Scala and Python • Spark comes with two shells • Scala • Python • APIs available for Scala, Python, and Java • Appropriate versions for each Spark release • Spark's native language is Scala, so it is more natural to write Spark applications in Scala • This presentation focuses on code examples in Scala
Spark's Scala and Python Shell • Powerful tool to analyze data interactively • The Scala shell runs on the Java VM • Can leverage existing Java libraries • Scala: • To launch the Scala shell (from the Spark home directory): ./bin/spark-shell • To read in a text file: scala> val textFile = sc.textFile("README.txt") • Python: • To launch the Python shell (from the Spark home directory): ./bin/pyspark • To read in a text file: >>> textFile = sc.textFile("README.txt")
Scala • 'Scalable Language' • Object-oriented, functional programming language • Runs in a JVM • Java interoperability • Functions are passable objects • Two approaches (sketched below) • Anonymous function syntax: x => x + 1 • Static methods in a global singleton object: object MyFunctions { def func1(s: String): String = {…} } myRdd.map(MyFunctions.func1)
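A self-contained sketch of both approaches to passing functions into Spark. The object and value names (MyFunctions, toUpper, words) and the sample data are illustrative, not from the original slides.

import org.apache.spark.{SparkConf, SparkContext}

object MyFunctions {
  // Static method in a global singleton object
  def toUpper(s: String): String = s.toUpperCase
}

object FunctionPassingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FunctionPassingExample"))
    val words = sc.parallelize(Seq("spark", "scala", "shark"))

    // Approach 1: anonymous function syntax
    val upper1 = words.map(s => s.toUpperCase)

    // Approach 2: static method in a singleton object
    val upper2 = words.map(MyFunctions.toUpper)

    println(upper1.collect().mkString(", "))  // SPARK, SCALA, SHARK
    println(upper2.collect().mkString(", "))  // same result
    sc.stop()
  }
}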
Code Execution (1) • 'spark-shell' provides the Spark context as 'sc' // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() File: sparkQuotes.txt DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible
Code Execution (2) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() RDD: quotes File: sparkQuotes.txt DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible
Code Execution (3) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() RDD: danQuotes RDD: quotes File: sparkQuotes.txt DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible
Code Execution (4) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() RDD: danQuotes RDD: quotes File: sparkQuotes.txt RDD: danSpark DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible
Code Execution (5) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() File: sparkQuotes.txt DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible RDD: quotes (HadoopRDD) holds all five lines of the file RDD: danQuotes holds the two DAN lines RDD: danSpark holds Spark, Scala Action result: 1
RDD Transformations • Transformations are lazily evaluated • A transformation returns a pointer to the new, transformed RDD Full documentation at http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.package
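A few common transformations, shown as a Scala sketch (the input values are illustrative):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// map: apply a function to each element
val squares = nums.map(x => x * x)        // 1, 4, 9, 16, 25

// filter: keep elements matching a predicate
val evens = nums.filter(_ % 2 == 0)       // 2, 4

// flatMap: map each element to zero or more elements
val phrases = sc.parallelize(Seq("Spark is cool"))
val tokens = phrases.flatMap(_.split(" "))  // Spark, is, cool

// Nothing has executed yet: transformations stay lazy until an action is called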
RDD Actions • Actions return values Full documentation at http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.package
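A few common actions, shown as a Scala sketch (the input values and output path are illustrative):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

nums.count()        // 5
nums.collect()      // Array(1, 2, 3, 4, 5) -- brings the data back to the driver
nums.take(2)        // Array(1, 2)
nums.reduce(_ + _)  // 15

// Saving is also an action (placeholder output path)
nums.saveAsTextFile("hdfs:/sparkdata/nums")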
RDD Persistence • Each node stores any partitions of the cache that it computes in memory • Reuses them in other actions on that dataset (or datasets derived from it) • Future actions are much faster (often by more than 10x) • Two methods for RDD persistence: persist() and cache()
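A minimal sketch of caching a dataset before reusing it across actions; the storage level shown for persist() is one option among several:

import org.apache.spark.storage.StorageLevel

val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val danQuotes = quotes.filter(_.startsWith("DAN")).cache()

danQuotes.count()                              // first action: computes and caches the partitions
danQuotes.filter(_.contains("Spark")).count()  // reuses the cached partitions, so it runs faster

// persist() lets you pick other storage levels, e.g. spill to disk when memory is full
quotes.persist(StorageLevel.MEMORY_AND_DISK)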
Quick Introduction to DataFrames • Experimental API introduced in Spark 1.3 • Distributed collection of data organized in named columns • Targeted at the Python ecosystem • Equivalent to tables in databases or data frames in R/Python • Much richer optimization than other DataFrame implementations • Can be constructed from a wide variety of sources and APIs
Create a DataFrame val df = sqlContext.jsonFile("/home/ned/attendees.json") df.show() df.printSchema() df.select("First Name").show() df.select("First Name","Age").show() df.filter(df("age")>40).show() df.groupBy("age").count().show()
Create a DataFrame from an RDD import sqlContext.implicits._ case class attendees_class(first_name: String, last_name: String, age: Int) val attendees = sc.textFile("/home/ned/attendees.csv").map(_.split(",")).map(p => attendees_class(p(0), p(1), p(2).trim.toInt)).toDF() attendees.registerTempTable("attendees") val youngppl = sqlContext.sql("select first_name, last_name from attendees where age < 35") youngppl.map(t => "Name: " + t(0) + " " + t(1)).collect().foreach(println)
SparkContext in Applications • The main entry point for Spark functionality • Represents the connection to a Spark cluster • Create RDDs, accumulators, and broadcast variables on that cluster • In the Spark shell, the SparkContext, sc, is automatically initialized for you to use • In a Spark program, import some classes and implicit conversions into your program: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf
A Spark Standalone Application in Scala • Import statements • SparkConf and SparkContext • Transformations and Actions (see the sketch below)
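The original slide showed the application as a screenshot; below is a minimal sketch of such an application. The application name, object name, and the word being counted are illustrative assumptions, and the input path reuses the sample file from earlier slides.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Transformations and Actions
    val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
    val sparkLines = quotes.filter(_.contains("Spark"))          // transformation (lazy)
    println("Lines mentioning Spark: " + sparkLines.count())     // action

    sc.stop()
  }
}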
Running Standalone Applications • Define the dependencies • Scala - simple.sbt • Create the typical directory structure with the files • Create a JAR package containing the application's code • Scala: sbt package • Use spark-submit to run the program (example below)
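For illustration, here is one possible simple.sbt for the SimpleApp sketch above; the Scala and Spark versions are assumptions (chosen to match the 1.2.1 documentation linked earlier), not values from the original slides.

// simple.sbt
name := "Simple Application"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"

After sbt package, the resulting JAR can be submitted with something like ./bin/spark-submit --class SimpleApp --master local[4] target/scala-2.10/simple-application_2.10-1.0.jar (the master URL and JAR path are placeholders that depend on your cluster and project name).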
Fault Recovery • RDDs track lineage information that can be used to efficiently recompute lost data • Ex: msgs = textFile.filter(lambda s: s.startsWith("ERROR")).map(lambda s: s.split("\t")[2]) • Lineage: HDFS File -> Filtered RDD (filter(func = _.contains(...))) -> Mapped RDD (map(func = _.split(...)))
Which Language Should I Use? • Standalone programs can be written in any of the three languages, but the interactive shell supports only Python & Scala • Python users: can use Python for both • Java users: consider learning Scala for the shell • Performance: Java & Scala are faster due to static typing, but Python is often fine
Scala Cheat Sheet
Variables:
var x: Int = 7
var x = 7       // type inferred
val y = "hi"    // read-only
Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = {
  x*x           // last line returned
}
Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)   // {3,4,5}
nums.map(x => x + 2)          // same
nums.map(_ + 2)               // same
nums.reduce((x, y) => x + y)  // 6
nums.reduce(_ + _)            // same
Java interop:
import java.net.URL
new URL("http://cnn.com").openStream()
More details: scala-lang.org
Spark in Scala and Java
// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
This Talk • Introduction to Shark, Spark and Spark Streaming • Architecture • Deployment Methodology • Performance • References
Performance Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency • Tested with 100 text streams on 100 EC2 instances with 4 cores each
Performance and Generality (Unified Computation Models) • Streaming (Spark Streaming) • Batch (ML, Spark) • Interactive (SQL, Shark)
Example: Video Quality Diagnosis • Top 10 worst performers identical! • 440x faster! • Latency: 772.34 sec (17 TB input) vs. 1.78 sec (1.7 GB input)
This Talk • Introduction to Shark, Spark and Spark Streaming • Architecture • Deployment Methodology • Implementation Next Steps • References
https://amplab.cs.berkeley.edu/software/ • www.bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals/ • www.ibm.com/analytics/us/en/technology/spark/