Introduction to Apache Spark
Nicolas A Perez
Software Engineer at IPC
MapR Certified Spark Developer
Organizer of the Miami Scala Meetup
https://twitter.com/@anicolaspp
https://medium.com/@anicolaspp
Spark • General-purpose computing framework, significantly faster than traditional MapReduce • Used for large-scale data processing • Runs everywhere • Flexible (SQL, Streaming, GraphX, MLlib) • Easy to use (Scala, Java, Python, and R APIs)
General Purpose

val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
The Spark Context (sc)

val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

The SparkContext class tells Spark how to access the cluster.
RDD • Resilient • Distributed • Datasets
Transformations on RDD • map • flatMap • filter • union • intersection • distinct • groupByKey • reduceByKey • aggregateByKey • sortByKey • join
Actions on RDD • reduce • collect • first • take • takeSample • countByKey • saveAsTextFile (a short example combining transformations and actions follows)
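A minimal sketch (with made-up data) combining a few of the transformations above with actions; only the actions actually trigger computation.

val nums = sc.parallelize(1 to 100)

val evenSquares = nums
  .filter(_ % 2 == 0)   // transformation: keep even numbers (lazy)
  .map(n => n * n)      // transformation: square them (still lazy)

val howMany = evenSquares.count()    // action: runs the job and returns a count
val firstTen = evenSquares.take(10)  // action: returns the first 10 elements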
RDD • RDDs can be created from many kinds of sources (text files, HDFS, raw sockets, AWS S3, Azure Blob Storage, Cassandra, etc.) • Transformations on RDDs are lazy; nothing is computed until an action is called • Spark represents an RDD's lineage as a DAG, which allows lost partitions to be re-computed
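A short sketch of that laziness, assuming a hypothetical events.log file: the transformations only describe the DAG, and the job runs when the action is called.

val lines  = sc.textFile("events.log")           // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))   // transformation: still lazy

val numErrors = errors.count()                   // action: builds and runs the DAG
// If a partition is lost, Spark re-computes it from this lineage.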
Deployment Platforms • YARN • Mesos • AWS EMR • Azure HDInsight • Standalone
Word Counting

val rdd = sc.textFile("path to the file")

val counts: RDD[(String, Int)] =
  rdd.flatMap(line => line.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
Spark SQL • Allows us to represent our data in a tabular format • We can run SQL queries on it • We can easily integrate different sources and run parallel queries on all of them at once • We can use standard tools that use SQL to query any kind of data at scale
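A minimal sketch of the idea, assuming a Spark 1.x SQLContext and a hypothetical people.json file: load the data, register it as a table, and query it with plain SQL.

val people = sqlContext.read.json("people.json")

people.registerTempTable("people")   // expose the DataFrame to SQL

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()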
Built-in Data Sources • JSON • Parquet • Text Files • ORC • Hive SerDes • Hive Tables • JDBC
Third-Party Data Sources • Cassandra • Impala • Drill • CSV Files • Other custom sources (read my blog to see how to implement your own)
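As one example of plugging in a third-party source, this sketch reads CSV through the Databricks spark-csv package, assuming it is on the classpath and that a hypothetical transactions.csv file exists; custom sources are loaded the same way, via format(...).

val transactions = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line holds the column names
  .option("inferSchema", "true")   // let the source guess the column types
  .load("transactions.csv")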
Spark SQL Important Abstractions • DataFrames • Datasets

val people = sqlContext.read.json(path).as[Person]
Spark Dataset API • Strongly typed tabular representation • Uses a typed schema for data representation • Encoder optimizations for faster data access
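A short sketch of the typed API, assuming a hypothetical people.json whose records match the Person case class (the implicits import brings in the encoders):

case class Person(name: String, age: Long)

import sqlContext.implicits._

val people = sqlContext.read.json("people.json").as[Person]

// Operations are now checked against Person's fields at compile time.
val adultNames = people.filter(p => p.age >= 18).map(_.name)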
Spark Data Frames

val sc = new SparkContext(config)
val sqlContext = new HiveContext(sc)

val transactionsDF = sqlContext
  .read
  .format("com.nico.datasource.dat")
  .load("~/transactions/")

transactionsDF.registerTempTable("some_table")

More at: https://goo.gl/qKPJdi
Spark Streaming • StreamingContext • Built-in file streaming and raw socket streaming • Libraries for Twitter, Kafka, AWS Kinesis, Flume, etc. • Can be extended to stream from any source • Batch processing (micro-batches) • Windowed operations let a stream look back over earlier batches • stream.countByWindow(Seconds(20), Seconds(10))
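A minimal streaming sketch, assuming a local socket source on a made-up host and port: a 10-second batch interval, with a count over a sliding 20-second window.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoint")   // window operations need a checkpoint dir

val stream = ssc.socketTextStream("localhost", 9999)

val countsInWindow = stream.countByWindow(Seconds(20), Seconds(10))
countsInWindow.print()

ssc.start()
ssc.awaitTermination()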
What can you get from Spark Streaming? • Millions of events per second (billions with the right deployment) • A concise API shared with the other Spark components • Fault tolerance • Exactly-once semantics out of the box (for DFS and Kafka sources) • Integration with Spark SQL, MLlib, and GraphX
Be careful • Not everyone needs streaming • Processing time must stay below the batch interval, otherwise back pressure builds up • You might get out-of-order data • Applications need fine tuning since they run continuously • You need to plan your deployment strategy carefully
Gotchas

// Anti-pattern: the connection is created on the driver, but send() runs on
// the workers, so the connection would have to be serialized and shipped.
dstream.foreachRDD { rdd =>
  val connection = createNewConnection()
  rdd.foreach { record =>
    connection.send(record)
  }
}
Gotchas

// Anti-pattern: creating and tearing down a connection for every single
// record works, but is very expensive.
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}
Gotchas

// Better: create one connection per partition and reuse it for every record
// in that partition.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
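A further refinement, sketched in the same spirit: reuse connections across batches through a static, lazily created pool. ConnectionPool here is a hypothetical helper, not part of Spark.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection()   // grab a pooled connection
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)       // return it for reuse instead of closing
  }
}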