960 likes | 1.63k Views
This Edureka Spark SQL Tutorial will help you to understand how Apache Spark offers SQL power in real-time. This tutorial also demonstrates an use case on Stock Market Analysis using Spark SQL. Below are the topics covered in this tutorial:<br><br>1) Limitations of Apache Hive<br>2) Spark SQL Advantages Over Hive<br>3) Spark SQL Success Story<br>4) Spark SQL Features<br>5) Architecture of Spark SQL<br>6) Spark SQL Libraries<br>7) Querying Using Spark SQL<br>8) Demo: Stock Market Analysis With Spark SQL
E N D
5 Best Practices in DevOps Culture www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
What to expect? 2 Spark SQL Libraries 1 Why Spark SQL 3 Spark SQL Features 5 4 Use Case Hands-On Examples www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Why Spark SQL? www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Why Do We Need Spark SQL? Spark SQL was built to overcome the limitations of Apache Hive running on top of Spark. Limitations of Apache Hive Hive uses MapReduce which lags in performance with medium and small sized datasets ( <200 GB) No resume capability Hive cannot drop encrypted databases www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Advantages Over Hive 1 600 secs Faster execution 50 secs 2 Advantages No migration hurdles 3 Real time querying Batch Spark SQL uses the metastore services of Hive to query the data stored and managed by Hive. How? www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Success Story www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Success Story Twitter Sentiment Analysis With Spark SQL NYSE: Real Time Analysis of Stock Market Data Sentiment helps in crisis management, service adjusting and target marketing Trending Topics can be used to create campaigns and attract larger audience Banking: Credit Card Fraud Detection Genomic Sequencing www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Features SQL Integration With Spark Uniform Data Access Seamless Support Transformations Performance Standard Connectivity User Defined Functions www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Features 1 Spark SQL integrates relational processing with Spark’s functional programming. 2 Spark SQL is used for the structured/semi structured data analysis in Spark. www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Features Support for various data formats 3 4 SQL queries can be converted into RDDs for transformations RDD 2 RDD 1 Shuffle transform Drop split point Invoking RDD 2 computes all partitions of RDD 1 www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Overview 5 Performance And Scalability www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Features Standard JDBC/ODBC Connectivity 6 User Defined Functions lets users define new Column-based functions to extend the Spark vocabulary 7 User www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
UDF Example Creating a UDF ‘toUpperCase’ to convert a string to upper case Registering our UDF in the list of functions www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Architecture www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Architecture Architecture Of Spark SQL Spark SQL & HQL DataFrame DSL DataFrame API Data Source API CSV JSON JDBC www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Libraries www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Spark SQL Libraries Spark SQL has the following libraries: 1 Data Source API DataFrame API 2 Interpreter & Optimizer 3 4 SQL Service www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Data Source API Data Source API DataFrame API Interpreter & Optimizer SQL Service www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Data Source API Data Source API Data Source API is used to read and store structured and semi- structured data into Spark SQL Features: Structured/ Semi-structured data Multiple formats 3rdparty integration www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
DataFrame API Data Source API DataFrame API Interpreter & Optimizer SQL Service www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
DataFrame API DataFrame API converts the data that is read through Data Source API into tabular columns to help perform SQL operations Data Source API Features: Distributed collection of data organized into named columns Equivalent to a relational table in SQL DataFrame API Lazily evaluated Named Columns www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
SQL Interpreter & Optimizer Data Source API DataFrame API Interpreter & Optimizer SQL Service www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
SQL Interpreter & Optimizer SQL Interpreter & Optimizer handles the functional programming part of Spark SQL. It transforms the DataFrames RDDs to get the required results in the required formats. Features: Functional programming Transforming trees Faster than RDDs Processes all size data e.g. Catalyst: A modular library for distinct optimization Interpreter & Optimizer Resilient Distributed Dataset www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
SQL Service Data Source API DataFrame API Interpreter & Optimizer SQL Service www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
SQL Service SQL Service is the entry point for working along structured data in Spark SQL is used to fetch the result from the interpreted & optimized data We have thus used all the four libraries in sequence. This completes a Spark SQL process Interpreter & Optimizer Spark SQL Service Resilient Distributed Dataset www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Querying Using Spark SQL Starting Up Spark Shell Creating Dataset Adding Schema To RDD JSON Dataset Hive Tables www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Starting Up Spark Shell Starting Up Spark Shell Creating Dataset Adding Schema To RDD JSON Dataset Hive Tables www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Starting Up Spark Shell - Intialization //We first import a Spark Session into Apache Spark. import org.apache.spark.sql.SparkSession //Creating a Spark Session ‘spark’ using the ‘builder()’ function. val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate() //Importing the Implicts class into our ‘spark’ Session. import spark.implicits._ //We now create a DataFrame ‘df’ and import data from the ’employee.json’ file. val df = spark.read.json("examples/src/main/resources/employee.json") //Displaying the DataFrame ‘df’. The result is a table of ages and names from our ’employee.json’ file. df.show() www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Starting Up Spark Shell – Spark Session www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Creating Datasets Starting Up Spark Shell Creating Dataset Adding Schema To RDD JSON Dataset Hive Tables www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Creating Dataset - Case Class & Dataset After understanding DataFrames, let us now move on to Dataset API. The below code creates a Dataset class in SparkSQL. //Creating a class ‘Employee’ to store name and age of an employee. case class Employee(name: String, age: Long) //Assigning a Dataset ‘caseClassDS’ to store the record of Andrew. val caseClassDS = Seq(Employee("Andrew", 55)).toDS() //Displaying the Dataset ‘caseClassDS’. caseClassDS.show() //Creating a primitive Dataset to demonstrate mapping of DataFrames into Datasets. val primitiveDS = Seq(1, 2, 3).toDS() //Assigning the above sequence into an array. primitiveDS.map(_ + 1).collect() www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Creating Dataset - Case Class & Dataset www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Creating Dataset – Reading File //Setting the path to our JSON file ’employee.json’. val path = "examples/src/main/resources/employee.json" //Creating a Dataset and from the file. val employeeDS = spark.read.json(path).as[Employee] //Displaying the contents of ’employeeDS’ Dataset. employeeDS.show() www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Creating Dataset – Reading File www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema To RDDs Starting Up Spark Shell Creating Dataset Adding Schema To RDD JSON Dataset Hive Tables www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema To RDDs – Initialization //Importing Expression Encoder for RDDs, Encoder library and Implicts class into the shell. import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder import org.apache.spark.sql.Encoder import spark.implicits._ //Creating an ’employeeDF’ DataFrame from ’employee.txt’ and mapping the columns based on delimiter comma ‘,’ into a temporary view ’employee’. val employeeDF = spark.sparkContext.textFile("examples/src/main/resources/employee.txt").map(_.split(",")).ma p(attributes => Employee(attributes(0), attributes(1).trim.toInt)).toDF() //Creating the temporary view ’employee’. employeeDF.createOrReplaceTempView("employee") //Defining a DataFrame ‘youngstersDF’ which will contain all the employees between the ages of 18 and 30. val youngstersDF = spark.sql("SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30") //Mapping the names from the RDD into ‘youngstersDF’ to display the names of youngsters. youngstersDF.map(youngster => "Name: " + youngster(0)).show() www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema To RDDs – Initialization www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema To RDDs - Transformation //Converting the mapped names into string for transformations. youngstersDF.map(youngster => "Name: " + youngster.getAs[String]("name")).show() //Using the mapEncoder from Implicits class to map the names to the ages. implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]] //Mapping the names to the ages of our ‘youngstersDF’ DataFrame. The result is an array with names mapped to their respective ages. youngstersDF.map(youngster => youngster.getValuesMap[Any](List("name", "age"))).collect() www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema To RDDs - Transformation www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema – Reading File & Adding Schema //Importing the ‘types’ class into the Spark Shell. import org.apache.spark.sql.types._ //Importing ‘Row’ class into the Spark Shell. Row is used in mapping RDD Schema. import org.apache.spark.sql.Row //Creating a RDD ’employeeRDD’ from the text file ’employee.txt’. val employeeRDD = spark.sparkContext.textFile("examples/src/main/resources/employee.txt") //Defining the schema as “name age”. This is used to map the columns of the RDD. val schemaString = "name age" //Defining ‘fields’ RDD which will be the output after mapping the ’employeeRDD’ to the schema ‘schemaString’. val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)) //Obtaining the type of ‘fields’ RDD into ‘schema’. val schema = StructType(fields) www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema – Reading File & Adding Schema www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema – Transformation Result //We now create a RDD called ‘rowRDD’ and transform the ’employeeRDD’ using the ‘map’ function into ‘rowRDD’. val rowRDD = employeeRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim)) //We define a DataFrame ’employeeDF’ and store the RDD schema into it. val employeeDF = spark.createDataFrame(rowRDD, schema) //Creating a temporary view of ’employeeDF’ into ’employee’. employeeDF.createOrReplaceTempView("employee") //Performing the SQL operation on ’employee’ to display the contents of employee. val results = spark.sql("SELECT name FROM employee") //Displaying the names of the previous operation from the ’employee’ view. results.map(attributes => "Name: " + attributes(0)).show() www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
Adding Schema – Transformation Result www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
JSON Dataset Starting Up Spark Shell Creating Dataset Adding Schema To RDD JSON Dataset Hive Tables www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
JSON Data – Loading File //Importing Implicits class into the shell. import spark.implicits._ //Creating an ’employeeDF’ DataFrame from our ’employee.json’ file. val employeeDF = spark.read.json("examples/src/main/resources/employee.json") www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
JSON Data – Loading File www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
JSON Data – Parquet File //Creating a ‘parquetFile’ temporary view of our DataFrame. employeeDF.write.parquet("employee.parquet") val parquetFileDF = spark.read.parquet("employee.parquet") parquetFileDF.createOrReplaceTempView("parquetFile") //Selecting the names of people between the ages of 18 and 30 from our Parquet file. val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 18 AND 30") //Displaying the result of the Spark SQL operation. namesDF.map(attributes => "Name: " + attributes(0)).show() www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
JSON Data – Parquet File www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
JSON Dataset – Creating DataFrame //Setting to path to our ’employee.json’ file. val path = "examples/src/main/resources/employee.json" //Creating a DataFrame ’employeeDF’ from our JSON file. val employeeDF = spark.read.json(path) //Printing the schema of ’employeeDF’. employeeDF.printSchema() //Creating a temporary view of the DataFrame into ’employee’. employeeDF.createOrReplaceTempView("employee") //Defining a DataFrame ‘youngsterNamesDF’ which stores the names of all the employees between the ages of 18 and 30 present in ’employee’. val youngsterNamesDF = spark.sql("SELECT name FROM employee WHERE age BETWEEN 18 AND 30") //Displaying the contents of our DataFrame. youngsterNamesDF.show() www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
JSON Dataset – Creating DataFrame www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING