This Edureka "What is Spark" tutorial will introduce you to the big data analytics framework Apache Spark. This tutorial is ideal for both beginners and professionals who want to learn or brush up on their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Analytics
2) What is Apache Spark?
3) Why Apache Spark?
4) Using Spark with Hadoop
5) Apache Spark Features
6) Apache Spark Architecture
7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX
8) Demo: Analyze Flight Data Using Apache Spark
What to expect?
1) Why Apache Spark?
2) Spark Features
3) Spark Ecosystem
4) Use Case
5) Hands-On Examples
Big Data Analytics
Data Generated Every Minute!
Big Data Analytics
➢ Big Data Analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information
➢ Big Data Analytics is of two types:
1. Batch Analytics
2. Real-Time Analytics
Spark For Real Time Analysis
Use cases for real-time analytics: Healthcare, Stock Market, Telecommunications, Banking, Government
Our requirements:
➢ Process data in real time
➢ Handle input from multiple sources
➢ Easy to use
➢ Faster processing
What Is Spark?
What Is Spark?
➢ Apache Spark is an open-source cluster-computing framework for real-time processing, maintained by the Apache Software Foundation
➢ Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
➢ It extends the Hadoop MapReduce model to efficiently support more types of computations
Figure: Real-Time Processing In Spark
Figure: Data Parallelism In Spark
Why Spark?
➢ Speed: up to 100x faster than Hadoop MapReduce for large-scale data processing
➢ Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities
➢ Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
➢ Polyglot: can be programmed in Scala, Java, Python and R
Spark Success Story
Spark Success Story
➢ NYSE: real-time analysis of stock market data
➢ Twitter sentiment analysis with Spark: trending topics can be used to create campaigns and attract a larger audience; sentiment analysis helps in crisis management, service adjusting and target marketing
➢ Banking: credit card fraud detection
➢ Genomic sequencing
Using Hadoop Through Spark
Spark And Hadoop
➢ Spark can be used along with MapReduce in the same Hadoop cluster, or separately as a standalone processing framework
➢ Spark applications can also be run on YARN (Hadoop NextGen)
➢ Spark can run on top of HDFS to leverage its distributed, replicated storage
➢ MapReduce and Spark are used together: MapReduce for batch processing and Spark for real-time processing
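As a minimal sketch of the HDFS integration (the namenode address and file path below are hypothetical, for illustration only), Spark reads a file from HDFS like any other data source:

//Read a file from HDFS (hypothetical namenode address and path)
val logs = sc.textFile("hdfs://namenode:9000/data/logs.txt")
//Count the lines containing the word ERROR
logs.filter(_.contains("ERROR")).count()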
Spark Features
➢ Speed
➢ Multiple Languages
➢ Advanced Analytics
➢ Real Time
➢ Hadoop Integration
➢ Machine Learning
Spark Features
➢ Speed: Spark runs up to 100x faster than MapReduce
➢ Supports multiple data sources
➢ Real-time computation and low latency because of in-memory computation
➢ Lazy Evaluation: delays evaluation till needed
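To illustrate lazy evaluation, here is a minimal sketch (run in spark-shell, where sc is the SparkContext; the numbers are made up). Transformations only record a lineage; nothing executes until an action is called:

val nums = sc.parallelize(1 to 1000000)
val doubled = nums.map(_ * 2) //transformation: recorded, not executed yet
doubled.take(5)               //action: triggers the actual computation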
Spark Features
➢ Hadoop Integration
➢ Machine Learning for iterative tasks
Spark Components: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
Spark Components
➢ Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
➢ Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
➢ Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
➢ MLlib (Machine Learning): machine learning libraries built on top of Spark
➢ GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
➢ SparkR (R on Spark): package for the R language to enable R users to leverage Spark power from the R shell
Spark Components
➢ DataFrames: tabular data abstraction introduced by Spark SQL
➢ ML Pipelines: make it easier to combine multiple algorithms or workflows
Spark Core
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing
It is responsible for:
➢ Memory management and fault recovery
➢ Scheduling, distributing and monitoring jobs on a cluster
➢ Interacting with storage systems
Figure: Spark Core Job Cluster
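As a minimal sketch of Spark Core in action (run in spark-shell, where sc is the SparkContext; the input strings are made up), a simple word count shows the engine scheduling and distributing a job across the cluster:

val lines = sc.parallelize(Seq("spark core schedules jobs", "spark core manages memory"))
val counts = lines.flatMap(_.split(" "))  //split each line into words
                  .map(word => (word, 1)) //pair each word with a count of 1
                  .reduceByKey(_ + _)     //sum the counts per word
counts.collect().foreach(println)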
Spark Architecture
Figure: Components of a Spark cluster (a Driver Program running the Spark Context connects through a Cluster Manager to Worker Nodes, each running an Executor with a Cache that executes Tasks)
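A minimal sketch of how a driver program creates the Spark Context shown in the figure (the app name and the local master URL are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
//The driver builds a configuration and connects to a cluster manager
//("local[2]" here runs Spark locally with 2 threads, for illustration)
val conf = new SparkConf().setAppName("ArchitectureDemo").setMaster("local[2]")
val sc = new SparkContext(conf)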
Spark Streaming
Spark Streaming
➢ Spark Streaming is used for processing real-time streaming data
➢ It is a useful addition to the core Spark API
➢ Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams
➢ The fundamental stream unit is the DStream, which is basically a series of RDDs to process the real-time data
Figure: Streams In Spark Streaming
Spark Streaming
Figure: Overview Of Spark Streaming (streaming and static data sources feed Spark Streaming, which works with Spark SQL (SQL + DataFrames) and MLlib (Machine Learning) before writing to data storage systems)
Spark Streaming
Figure: Incoming streams of data divided into batches (an input data stream enters the streaming engine and leaves as batches of processed data)
Figure: Data from a variety of sources (Kafka, Flume, HDFS/S3, Kinesis, Twitter) to various storage systems (HDFS, databases, dashboards)
Figure: Input data stream divided into discrete chunks of data (a DStream is a sequence of RDDs: RDD @ time 1, RDD @ time 2, ...)
Figure: Extracting words from an InputStream (a flatMap operation transforms each batch of lines in a DStream into a DStream of words)
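A minimal sketch of the flatMap example from the last figure (assuming a spark-shell where sc is available; the socket source on localhost:9999 is hypothetical):

import org.apache.spark.streaming.{Seconds, StreamingContext}
//Create a streaming context with 1-second batches
val ssc = new StreamingContext(sc, Seconds(1))
//Hypothetical source: lines of text arriving on a local socket
val lines = ssc.socketTextStream("localhost", 9999)
//flatMap each batch of lines into a DStream of words
val words = lines.flatMap(_.split(" "))
words.print()
ssc.start()
ssc.awaitTermination()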
Spark SQL
Spark SQL Features
1. Spark SQL integrates relational processing with Spark's functional programming
2. Spark SQL is used for structured/semi-structured data analysis in Spark
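A minimal sketch of points 1 and 2 (assuming a spark-shell where spark is the SparkSession; people.json is a hypothetical file of semi-structured records):

//Load semi-structured data and query it relationally
val df = spark.read.json("people.json")
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()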
Spark SQL Features
3. Support for various data formats
4. SQL queries can be converted into RDDs for transformations (in the figure, invoking RDD 2 computes all partitions of RDD 1)
Spark SQL Features
5. Performance and scalability
Spark SQL Features
6. Standard JDBC/ODBC connectivity
7. User-defined functions (UDFs) let users define new column-based functions to extend the Spark vocabulary
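A minimal sketch of feature 7 (assuming a spark-shell where spark is the SparkSession; the sample rows and the toUpper function are made-up examples):

import spark.implicits._
import org.apache.spark.sql.functions.udf
//A made-up column-based function that upper-cases a string column
val toUpper = udf((s: String) => s.toUpperCase)
val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
df.withColumn("name_upper", toUpper(df("name"))).show()
//Register the same function for use inside SQL queries
spark.udf.register("toUpper", (s: String) => s.toUpperCase)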
Spark SQL Flow Diagram
Spark SQL has the following libraries:
1. Data Source API
2. DataFrame API (named columns)
3. Interpreter & Optimizer
4. SQL Service
The flow diagram represents a Spark SQL process using all four libraries in sequence, from the Data Source API down to the Resilient Distributed Dataset
MLlib
MLlib
Machine learning may be broken down into two classes of algorithms:
➢ Supervised: algorithms use labelled data in which both the input and output are provided to the algorithm
• Classification: Naïve Bayes, SVM
• Regression: Linear, Logistic
➢ Unsupervised: algorithms do not have the outputs in advance; these algorithms are left to make sense of the data without labels
• Clustering: K-Means
• Dimensionality Reduction: Principal Component Analysis, SVD
MLlib - Techniques
There are 3 common techniques for machine learning:
1. Classification: a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes. Some common use cases for classification include: i) credit card fraud detection, ii) email spam detection
2. Clustering: in clustering, an algorithm groups objects into categories by analyzing similarities between input examples
MLlib - Techniques
3. Collaborative Filtering: collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part)
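A minimal clustering sketch with MLlib's K-Means (assuming a spark-shell; data.txt is a hypothetical file of space-separated numeric features):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
//Parse a hypothetical file of space-separated numeric features into vectors
val data = sc.textFile("data.txt").map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()
//Cluster the data into 2 groups with up to 20 iterations
val model = KMeans.train(data, 2, 20)
model.clusterCenters.foreach(println)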
GraphX
GraphX – Graph Concepts
➢ A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that connect them. The vertices are the objects and the edges are the relationships between them (e.g. vertices John and Sam connected by a "Friends" edge)
➢ A directed graph is a graph where the edges have a direction associated with them, e.g. user Sam follows John on Twitter
GraphX – Triplet View
GraphX's Graph class contains members to access edges and vertices
The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class
GraphX – Property Graph
➢ GraphX is the Spark API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a Resilient Distributed Property Graph
➢ The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. The parallel edges allow multiple relationships between the same vertices
Figure: A property graph with vertex properties (e.g. the airports LAX and SJC) and edge properties (the routes between them)
GraphX – Example
To understand GraphX, let us consider the graph below. The vertices carry the names and ages of people: Alice (28), Bob (27), Charlie (65), David (42), Ed (55) and Fran (50). Each edge represents that one person likes another, and its weight is a measure of the likeability.
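The next slide builds a graph from a vertexArray and an edgeArray, which the slides never show. One possible definition, consistent with the output that follows (the edge weights are illustrative), is sketched below:

import org.apache.spark.graphx._
//Vertices: (id, (name, age))
val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
)
//Edges: Edge(srcId, dstId, likeability weight)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3)
)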
GraphX – Example: Display names and ages

//Create the vertex and edge RDDs and build the graph
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

//Filter for people older than 30 and print their names and ages
graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
  case (id, (name, age)) => println(s"$name is $age")
}

Output:
David is 42
Fran is 50
Ed is 55
Charlie is 65
GraphX – Example: Display relations

//Print who likes whom by iterating over the graph's triplets
for (triplet <- graph.triplets.collect) {
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}

Output:
Bob likes Alice
Bob likes David
Charlie likes Bob
Charlie likes Fran
David likes Alice
Ed likes Bob
Ed likes Charlie
Ed likes Fran
Use Case: Analyze Flight Data Using Spark GraphX
Use Case: Problem Statement
To analyze real-time flight data using Spark GraphX, provide near real-time computation results and visualize the results using Google Data Studio
Computations to be done:
➢ Compute the total number of flight routes
➢ Compute and sort the longest flight routes
➢ Display the airport with the highest-degree vertex
➢ List the most important airports according to PageRank
➢ List the routes with the lowest flight costs
We will use Spark GraphX for the above computations and visualize the results using Google Data Studio
Use Case: Flight Dataset
The attributes of each row are as below:
1. Day Of Month
2. Day Of Week
3. Carrier Code
4. Unique ID - Tail Number
5. Flight Number
6. Origin Airport ID
7. Origin Airport Code
8. Destination Airport ID
9. Destination Airport Code
10. Scheduled Departure Time
11. Actual Departure Time
12. Departure Delay In Minutes
13. Scheduled Arrival Time
14. Actual Arrival Time
15. Arrival Delay Minutes
16. Elapsed Time
17. Distance
Figure: USA Airport Flight Data
Use Case: Flow Diagram
1. Huge amount of flight data
2. Database storing real-time flight data
3. Creating a graph using GraphX
4. Query 1: compute the longest flight routes; Query 2: calculate the top busiest airports; Query 3: calculate the routes with the lowest flight costs
5. Visualizing using Google Data Studio (USA flight mapping)
Use Case: Starting Spark Shell

//Importing the necessary classes
import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.IntParam
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

//Creating a case class 'Flight'
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String, flnum: Int, org_id: Long, origin: String, dest_id: Long, dest: String, crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double, arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

//Defining a 'parseFlight' function to parse a CSV line into the 'Flight' class
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5).toLong, line(6), line(7).toLong, line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble, line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble, line(16).toInt)
}
Use Case: Creating Edges For Graph Mapping

//Load the data into an RDD 'textRDD'
val textRDD = sc.textFile("/home/edureka/Downloads/AirportDataset.csv")

//Parse the RDD of CSV lines into an RDD of Flight classes
val flightsRDD = textRDD.map(parseFlight).cache()

//Create the airports RDD with ID and name
val airports = flightsRDD.map(flight => (flight.org_id, flight.origin)).distinct
airports.take(1)

//Defining a default vertex called 'nowhere' and mapping airport IDs to names for printlns
val nowhere = "nowhere"
val airportMap = airports.map { case ((org_id), name) => (org_id -> name) }.collect.toList.toMap

//Create the routes RDD with source ID, destination ID and distance
val routes = flightsRDD.map(flight => ((flight.org_id, flight.dest_id), flight.dist)).distinct
routes.take(2)

//Create the edges RDD with source ID, destination ID and distance
val edges = routes.map { case ((org_id, dest_id), distance) => Edge(org_id.toLong, dest_id.toLong, distance) }
edges.take(1)
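The slides end here. As a sketch of how the queries from the problem statement could continue on top of these edges (the PageRank tolerance and the take counts are illustrative, not from the source):

//Build the graph from the airports and edges, with 'nowhere' as the default vertex
val graph = Graph(airports, edges, nowhere)

//Compute the total number of flight routes
graph.numEdges

//Compute and sort the longest flight routes
graph.triplets.sortBy(_.attr, ascending = false).take(3)

//Display the airport with the highest-degree vertex
val maxDegree = graph.degrees.reduce((a, b) => if (a._2 > b._2) a else b)
airportMap(maxDegree._1)

//List the most important airports according to PageRank
val ranks = graph.pageRank(0.001).vertices
ranks.join(airports).sortBy(_._2._1, ascending = false).map(_._2._2).take(5)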