
Introduction to Apache Spark




Presentation Transcript


  1. Introduction to Apache Spark Nicolas A Perez Software Engineer at IPC MapR Certified Spark Developer Organizer of the Miami Scala Meetup https://twitter.com/@anicolaspp https://medium.com/@anicolaspp

  2. What is Apache Spark

  3. Spark • General-purpose computing framework, dramatically faster than Hadoop MapReduce for many workloads • Used for large-scale data processing • Runs everywhere • Flexible (SQL, Streaming, GraphX, MLlib) • Easy to use (Scala, Java, Python, and R APIs)

  4. Daytona GraySort Contest

  5. General Purpose
  val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
    val x = Math.random()
    val y = Math.random()
    if (x*x + y*y < 1) 1 else 0
  }.reduce(_ + _)
  println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

  6. The Stack

  7. The Spark Context (sc)
  • val conf = new SparkConf().setAppName(appName).setMaster(master)
    val sc = new SparkContext(conf)
  • val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
  The SparkContext class tells Spark how to access the cluster.

  8. RDD • Resilient • Distributed • Datasets

  9. Transformations on RDD: map, distinct, join, flatMap, groupByKey, filter, reduceByKey, union, aggregateByKey, intersection, sortByKey

  10. Actions on RDD: reduce, countByKey, take, collect, saveAsTextFile, takeSample, first
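
A quick sketch tying the two slides together (assuming the SparkContext sc from slide 7): transformations are lazy and only describe the computation, while actions trigger it.

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  val summed = pairs.reduceByKey(_ + _)   // transformation: lazy, nothing runs yet
  val sorted = summed.sortByKey()         // transformation: still lazy

  sorted.collect().foreach(println)       // action: runs the job, prints (a,4) and (b,2)
  println(sorted.first())                 // action: (a,4)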

  11. RDD Execution Graph - Logical View

  12. Physical View

  13. RDD • RDDs can be created from many kinds of sources (text files, HDFS, raw sockets, AWS S3, Azure Blob Storage, Cassandra, etc.) • RDDs are lazy: transformations only describe the computation • RDDs are represented by Spark as a DAG, which allows lost partitions to be re-computed
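
A minimal sketch of that laziness (the file name data.txt is an assumption): transformations only extend the DAG, actions execute it, and cache() lets later actions reuse the result.

  val lines = sc.textFile("data.txt")     // nothing is read yet
  val lengths = lines.map(_.length)       // still lazy: just another node in the DAG
  lengths.cache()                         // mark for reuse across actions

  val total = lengths.reduce(_ + _)       // first action: materializes the whole lineage
  val longest = lengths.max()             // second action: served from the cache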

  14. Deployment Platforms • YARN • Mesos • AWS EMR • Azure HDInsight • Standalone

  15. The ABC Example

  16. Word Counting
  val rdd = sc.textFile("path to the file")
  val counts: RDD[(String, Int)] =
    rdd.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

  17. DEMO

  18. But there is already word counting on Hadoop!

  19. Spark SQL • Allows us to represent our data in a tabular format • We can run SQL queries on it • We can easily integrate different sources and run parallel queries on all of them at once • We can use standard tools that use SQL to query any kind of data at scale
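
A minimal sketch of querying data with SQL (assuming a Spark 1.x sqlContext and a hypothetical people.json file, in the same style as slide 25):

  val people = sqlContext.read.json("people.json")   // schema is inferred from the JSON
  people.registerTempTable("people")                 // expose the DataFrame to the SQL engine

  val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
  adults.show()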

  20. Built-in Data Sources • JSON • Parquet • Text Files • ORC • SerDe • Hive Tables • JDBC

  21. Third-Party Data Sources • Cassandra • Impala • Drill • CSV Files • Custom sources (read my blog to see how to implement your own)

  22. Spark SQL Important Abstractions • Data Frames • Data Sets • val people = sqlContext.read.json(path).as[Person]
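
Expanding the snippet above into a runnable sketch (the Person case class and the people.json path are assumptions, not part of the original slide):

  case class Person(name: String, age: Long)

  import sqlContext.implicits._                        // brings the Person encoder into scope
  val people = sqlContext.read.json("people.json").as[Person]

  people.filter(_.age >= 18).show()                    // typed lambda, checked at compile time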

  23. Spark Data Set API • Strongly typed tabular representation • Uses schema for data representation (typed schema) • Encoder optimizations for faster data access

  24. Spark SQL, a Distributed Query Engine

  25. Spark Data Frames
  val sc = new SparkContext(config)
  val sql = new HiveContext(sc)
  val transactionsDF = sql
    .read
    .format("com.nico.datasource.dat")
    .load("~/transactions/")
  transactionsDF.registerTempTable("some_table")
  More at: https://goo.gl/qKPJdi

  26. Spark Streaming • StreamingContext • Built-in file streaming and raw socket streaming • Libraries for Twitter, Kafka, AWS Kinesis, Flume, etc. • Can be extended to stream from any source • Batch processing (micro batches) • Streams can look back over past data • Windowed operations, e.g. stream.countByWindow(Seconds(20), Seconds(5)) (see the sketch below)
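
A minimal sketch of a StreamingContext with a windowed operation (the host, port, batch interval, and checkpoint directory are assumptions):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(5))          // 5-second micro batches
  ssc.checkpoint("/tmp/streaming-checkpoint")             // required by windowed counts

  val lines = ssc.socketTextStream("localhost", 9999)     // raw socket stream
  lines.countByWindow(Seconds(20), Seconds(5)).print()    // events seen in the last 20 seconds

  ssc.start()
  ssc.awaitTermination()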

  27. Streaming Architecture Overview

  28. Streaming Internals

  29. From DStream to RDD

  30. What can you get from Spark Streaming? • Millions of events per second (billions with the right deployment) • Concise API, the same used in the other Spark components • Fault tolerance • Exactly-once semantics out of the box (for DFS and Kafka) • Integration with Spark SQL, MLlib, GraphX

  31. Be careful • Not everyone needs streaming • Processing time must be smaller than the batch interval, or back pressure builds up • You might get out-of-order data • Applications need fine tuning since they have to run all the time • You need to plan your deployment strategy carefully

  32. Twitter Streaming Demo?

  33. It is better if we create our own streaming server!

  34. Demo

  35. Gotchas
  // Wrong: the connection is created on the driver, so it has to be serialized
  // and shipped to the workers, which typically fails or silently misbehaves.
  dstream.foreachRDD { rdd =>
    val connection = createNewConnection()
    rdd.foreach { record =>
      connection.send(record)
    }
  }

  36. Gotchas
  // Still wrong: correct results, but a new connection is opened and closed
  // for every single record, which is very expensive.
  dstream.foreachRDD { rdd =>
    rdd.foreach { record =>
      val connection = createNewConnection()
      connection.send(record)
      connection.close()
    }
  }

  37. Gotchas
  // Better: one connection per partition, reused for all records in it.
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      val connection = createNewConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      connection.close()
    }
  }
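
An even cheaper variant, sketched with a hypothetical ConnectionPool object (a lazily initialized, static pool on each executor), reuses connections across batches instead of opening one per partition:

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      val connection = ConnectionPool.getConnection()   // hypothetical static, lazy pool
      partitionOfRecords.foreach(record => connection.send(record))
      ConnectionPool.returnConnection(connection)       // return for reuse instead of closing
    }
  }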

  38. MLlib & GraphX to be continued...

  39. Questions?
