Introduction to Apache Spark
Nicolas A Perez
Software Engineer at IPC
MapR Certified Spark Developer
Organizer of the Miami Scala Meetup
https://twitter.com/@anicolaspp
https://medium.com/@anicolaspp
Spark • General-purpose computing framework, significantly faster than traditional MapReduce • Used for large-scale data processing • Runs everywhere • Flexible (SQL, Streaming, GraphX, MLlib) • Easy to use (Scala, Java, Python, and R APIs)
General Purpose

val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
The Spark Context (sc)

val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

The SparkContext class tells Spark how to access the cluster.
RDD • Resilient • Distributed • Datasets
Transformations on RDD • map • flatMap • filter • union • intersection • distinct • groupByKey • reduceByKey • aggregateByKey • sortByKey • join
Actions on RDD • reduce • collect • first • take • takeSample • countByKey • saveAsTextFile (a short example combining transformations and actions follows)
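A minimal sketch (with made-up data) combining a few of the transformations above with actions; only the actions actually trigger computation.

val nums = sc.parallelize(1 to 100)

val evenSquares = nums
  .filter(_ % 2 == 0)   // transformation: keep even numbers (lazy)
  .map(n => n * n)      // transformation: square them (still lazy)

val howMany = evenSquares.count()    // action: runs the job and returns a count
val firstTen = evenSquares.take(10)  // action: returns the first 10 elements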
RDD • RDDs can be created from many kinds of sources (text files, HDFS, raw sockets, AWS S3, Azure Blob Storage, Cassandra, etc.) • Transformations on RDDs are lazy; nothing is computed until an action is called • Spark represents an RDD's lineage as a DAG, which allows lost partitions to be re-computed
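A short sketch of that laziness, assuming a hypothetical events.log file: the transformations only describe the DAG, and the job runs when the action is called.

val lines  = sc.textFile("events.log")           // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))   // transformation: still lazy

val numErrors = errors.count()                   // action: builds and runs the DAG
// If a partition is lost, Spark re-computes it from this lineage.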
Deployment Platforms • YARN • Mesos • AWS EMR • Azure HDInsight • Standalone
Word Counting

val rdd = sc.textFile("path to the file")

val counts: RDD[(String, Int)] =
  rdd.flatMap(line => line.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
Spark SQL • Allows us to represent our data in a tabular format • We can run SQL queries on it • We can easily integrate different sources and run parallel queries on all of them at once • We can use standard tools that use SQL to query any kind of data at scale
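A minimal sketch of the idea, assuming a Spark 1.x SQLContext and a hypothetical people.json file: load the data, register it as a table, and query it with plain SQL.

val people = sqlContext.read.json("people.json")

people.registerTempTable("people")   // expose the DataFrame to SQL

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()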
Built-in Data Sources • JSON • Parquet • Text Files • ORC • Hive SerDes • Hive Tables • JDBC
Third-Party Data Sources • Cassandra • Impala • Drill • CSV Files • Other custom sources (read my blog to see how to implement your own)
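As one example of plugging in a third-party source, this sketch reads CSV through the Databricks spark-csv package, assuming it is on the classpath and that a hypothetical transactions.csv file exists; custom sources are loaded the same way, via format(...).

val transactions = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line holds the column names
  .option("inferSchema", "true")   // let the source guess the column types
  .load("transactions.csv")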
Spark SQL Important Abstractions • DataFrames • Datasets

val people = sqlContext.read.json(path).as[Person]
Spark Dataset API • Strongly typed tabular representation • Uses a typed schema for data representation • Encoder optimizations for faster data access
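A short sketch of the typed API, assuming a hypothetical people.json whose records match the Person case class (the implicits import brings in the encoders):

case class Person(name: String, age: Long)

import sqlContext.implicits._

val people = sqlContext.read.json("people.json").as[Person]

// Operations are now checked against Person's fields at compile time.
val adultNames = people.filter(p => p.age >= 18).map(_.name)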
Spark Data Frames

val sc = new SparkContext(config)
val sqlContext = new HiveContext(sc)

val transactionsDF = sqlContext
  .read
  .format("com.nico.datasource.dat")
  .load("~/transactions/")

transactionsDF.registerTempTable("some_table")

More at: https://goo.gl/qKPJdi
Spark Streaming • StreamingContext • Built-in file streaming and raw socket streaming • Libraries for Twitter, Kafka, AWS Kinesis, Flume, etc. • Can be extended to stream from any source • Batch processing (micro-batches) • Windowed operations let a stream look back over earlier batches • stream.countByWindow(Seconds(20), Seconds(10))
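A minimal streaming sketch, assuming a local socket source on a made-up host and port: a 10-second batch interval, with a count over a sliding 20-second window.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoint")   // window operations need a checkpoint dir

val stream = ssc.socketTextStream("localhost", 9999)

val countsInWindow = stream.countByWindow(Seconds(20), Seconds(10))
countsInWindow.print()

ssc.start()
ssc.awaitTermination()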
What can you get from Spark Streaming? • Millions of events per second (billions with the right deployment) • A concise API shared with the other Spark components • Fault tolerance • Exactly-once semantics out of the box (for DFS and Kafka sources) • Integration with Spark SQL, MLlib, and GraphX
Be careful • Not everyone needs streaming • Processing time must stay below the batch interval, otherwise back pressure builds up • You might get out-of-order data • Applications need fine tuning since they run continuously • You need to plan your deployment strategy carefully
Gotchas

// Anti-pattern: the connection is created on the driver, but send() runs on
// the workers, so the connection would have to be serialized and shipped.
dstream.foreachRDD { rdd =>
  val connection = createNewConnection()
  rdd.foreach { record =>
    connection.send(record)
  }
}
Gotchas

// Anti-pattern: creating and tearing down a connection for every single
// record works, but is very expensive.
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}
Gotchas

// Better: create one connection per partition and reuse it for every record
// in that partition.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
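A further refinement, sketched in the same spirit: reuse connections across batches through a static, lazily created pool. ConnectionPool here is a hypothetical helper, not part of Spark.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection()   // grab a pooled connection
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)       // return it for reuse instead of closing
  }
}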