Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training | Edureka

EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

What to expect?
- Spark Overview
- Hadoop Overview
- Spark vs Hadoop
- Why Spark Hadoop?
- Using Hadoop With Spark
- Use Case
- Conclusion

Spark Overview

What is Spark?
- Apache Spark is an open-source cluster-computing framework for fast, near-real-time data processing. It originated at UC Berkeley's AMPLab and is now maintained by the Apache Software Foundation.
- Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- It generalizes the MapReduce model to efficiently support more types of computation. It is a separate engine and does not run on top of Hadoop MapReduce.
Figure: Real Time Processing In Spark
Figure: Data Parallelism In Spark (serial vs parallel execution, showing the reduction in time)

Spark Overview
- Spark is used in real-time processing
- Polyglot: Can be programmed in Scala, Java, Python and R
- Real-time computation and low latency because of in-memory computation
- Lazy Evaluation: Delays evaluation till needed
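The lazy-evaluation point above can be illustrated with a minimal sketch. It assumes Spark 2.x or later is on the classpath and runs in local mode; the object name LazyEvalSketch and the accumulator name mapCalls are made up for illustration and do not come from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch (not from the slides): a transformation such as map()
// is only recorded; the work happens once an action such as reduce() runs.
object LazyEvalSketch {
  // Returns (map calls counted before the action, calls counted after, sum of squares).
  def run(): (Long, Long, Int) = {
    val conf = new SparkConf().setAppName("lazy-eval-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("mapCalls")

      // Transformation: nothing is computed here, Spark only builds the lineage.
      val squares = sc.parallelize(1 to 5).map { n => calls.add(1); n * n }
      val before: Long = calls.value

      // Action: triggers the actual computation of the map above.
      val total = squares.reduce(_ + _)
      val after: Long = calls.value

      (before, after, total)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after, total) = run()
    println(s"map calls before action: $before, after action: $after, total: $total")
  }
}
```

The accumulator is used here only as a probe. Spark may re-run tasks, so accumulator values updated inside transformations are best treated as diagnostic rather than exact.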

Spark Ecosystem
- Spark SQL (SQL): Used for structured data. Can run unmodified Hive queries on existing Hadoop deployments.
- Spark Streaming (Streaming): Enables analytical and interactive apps for live streaming data.
- MLlib (Machine Learning): Machine learning libraries built on top of Spark.
- GraphX (Graph Computation): Graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts.
- SparkR (R on Spark): Package for the R language that lets R users leverage Spark from the R shell.
- Spark Core Engine: The core engine for the entire Spark framework. Provides utilities and architecture for the other components.

Spark Features
- Speed: Up to 100x faster than Hadoop MapReduce for large-scale data processing
- Powerful Caching: Simple programming layer provides powerful caching and disk persistence capabilities
- Deployment: Can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
- Polyglot: Can be programmed in Scala, Java, Python and R

Spark Use Cases
- Twitter Sentiment Analysis With Spark: Trending topics can be used to create campaigns and attract a larger audience. Sentiment helps in crisis management, service adjusting and target marketing.
- NYSE: Real Time Analysis of Stock Market Data
- Banking: Credit Card Fraud Detection
- Genomic Sequencing

Hadoop Overview

What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
- HDFS (Storage): Allows us to dump any kind of data across the cluster
- MapReduce (Processing): Allows parallel processing of the data stored in HDFS
Figure: Hadoop Cluster (master and slaves)

Hadoop Ecosystem

Hadoop Features
- Flexibility: Flexible with all kinds of data
- Scalability: In-built capability of integrating seamlessly with cloud-based services
- Reliability: Hadoop infrastructure has in-built fault tolerance features
- Economical: Usage of commodity hardware minimizes the cost of ownership

Hadoop Use Cases
- E-Commerce Data Analytics
- Politics: US Presidential Election
- Banking: Credit Card Fraud Detection
- Healthcare

Spark vs Hadoop

Spark vs Hadoop
Use Cases For Real Time Analytics:
- Healthcare
- Stock Market
- Telecommunications
- Banking
- Government
Our Requirements:
- Process data in real-time
- Handle input from multiple sources
- Easy to use
- Faster processing

Spark vs Hadoop
- Spark runs up to 100x faster than Hadoop MapReduce.
- The in-memory processing in Spark is what makes it faster than MapReduce.
- Spark is not considered a replacement for Hadoop but an extension to it.
Figure: Page Rank Performance, iteration time (s) for Hadoop, Basic Spark, and Spark + Controlled Partitioning
The best case as per our chart is when Spark is used alongside Hadoop. Let us dive in and use Hadoop with Spark.

Why Use Spark with Hadoop?

Why Spark Hadoop?
Using Spark and Hadoop together lets us combine Spark's processing with the best of Hadoop: HDFS for storage and YARN for resource allocation.
Figure: Input data from storage sources (Spark Streaming, CSV, Sequence File, Avro, Parquet) flows into Spark, with YARN for resource allocation, HDFS for storage and MapReduce as optional processing, and produces output data

Using Hadoop with Spark
- Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework
- Spark applications can also be run on YARN (Hadoop NextGen)
- Spark can run on top of HDFS to leverage the distributed replicated storage
- MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing

YARN Deployment With Spark
YARN Cluster Mode:
- In YARN-Cluster mode, the Spark driver runs inside an application master process which is managed by YARN
- The client can go away after initiating the application
YARN Client Mode:
- In YARN-Client mode, the Spark driver runs in the client process
- The application master is only used for requesting resources from YARN
Figure: Cluster Deployment Mode
Figure: Client Deployment Mode

Use Case – Sports Analysis Using Spark Hadoop

Use Case
Problem Statement: To build a Sports Analysis system using Spark Hadoop for predicting game results and player rankings for sports like Basketball, Football, Cricket, Soccer, etc. We will demonstrate the same using Basketball for our use case.
Players pictured:
- Stephen Curry, NBA MVP 2015 & 2016
- Kevin Durant, NBA MVP 2014
- Joe Hassett, Highest 3 Pt Normalized
- LeBron James, NBA MVP '10, '12 & '13

Use Case – Flow Diagram
1. Huge amount of Sports data
2. Data Stored in HDFS
3. Using Spark Processing for Analysis
4. Queries:
   - Query 1: Predict the NBA Most Valuable Player (MVP)
   - Query 2: Calculate Top Scorers Per Season
   - Query 3: Compare Teams to Predict Winners

Use Case – Dataset
Figure: Dataset from http://www.basketball-reference.com/leagues/NBA_2016_per_game.html

Use Case – Initializing Spark Packages

    // Importing the necessary packages
    import org.apache.spark.rdd._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.IntParam
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import org.apache.spark.util.StatCounter
    import org.apache.spark.sql.Row
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import scala.collection.mutable.ListBuffer
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import scala.io.Source
    import scala.collection.mutable.HashMap
    import java.io.File

Use Case – Reading Data From HDFS

    // Creating an object basketball containing our main() method
    object basketball {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("basketball").setMaster("local[2]")
        val sc = new SparkContext(sparkConf)

        // Tag every statistics line with its season and write it back to HDFS
        for (i <- 1980 to 2016) {
          println(i)
          val yearStats = sc.textFile(s"hdfs://localhost:9000/basketball/BasketballStats/leagues_NBA_$i*")
          yearStats
            .filter(x => x.contains(","))
            .map(x => (i, x))
            .saveAsTextFile(s"hdfs://localhost:9000/basketball/BasketballStatsWithYear/$i/")
        }
        // main() continues with the code on the following slides

Use Case – Parsing Data And Broadcasting

    // Read in all the statistics
    val stats = sc.textFile("hdfs://localhost:9000/basketball/BasketballStatsWithYear4/*/*")
      .repartition(sc.defaultParallelism)

    // Filter out the junk rows and clean up data for errors
    val filteredStats = stats
      .filter(line => !line.contains("FG%"))
      .filter(line => line.contains(","))
      .map(line => line.replace("*", "").replace(",,", ",0,"))
    filteredStats.cache()

    // Parse statistics and save as Map
    // (processStats is a helper that is not shown on the slides)
    val txtStat = Array("FG", "FGA", "FG%", "3P", "3PA", "3P%", "2P", "2PA", "2P%", "eFG%",
      "FT", "FTA", "FT%", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS")
    val aggStats = processStats(filteredStats, txtStat).collectAsMap

    // Collect RDD into map and broadcast it into 'broadcastStats'
    val broadcastStats = sc.broadcast(aggStats)

Use Case – Player Statistics Transformations

    // Parse stats and track weights
    val txtStatZ = Array("FG", "FT", "3P", "TRB", "AST", "STL", "BLK", "TOV", "PTS")
    val zStats = processStats(filteredStats, txtStatZ, broadcastStats.value).collectAsMap

    // Collect RDD into Map and broadcast into 'zBroadcastStats'
    val zBroadcastStats = sc.broadcast(zStats)

    // Parse stats and normalize
    // (uses zBroadcastStats, so it has to come after the broadcast above;
    // bbParse is a helper that is not shown on the slides)
    val nStats = filteredStats.map(x => bbParse(x, broadcastStats.value, zBroadcastStats.value))

    // Map RDD to RDD[Row] so that we can turn it into a DataFrame
    val nPlayer = nStats.map(x =>
      Row.fromSeq(Array(x.name, x.year, x.age, x.position, x.team, x.gp, x.gs, x.mp) ++
        x.stats ++ x.statsZ ++ Array(x.valueZ) ++ x.statsN ++ Array(x.valueN)))
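The slides do not show processStats or bbParse, so the exact normalization is not visible here. Assuming the z-prefixed values (zStats, statsZ) are ordinary per-season z-scores, the idea can be sketched with plain Scala collections instead of RDDs. ZScoreSketch, PlayerStat and the sample rows are hypothetical and exist only for this illustration.

```scala
// Hypothetical sketch: per-season z-score of one statistic using plain Scala
// collections. This is NOT the processStats/bbParse code from the slides.
object ZScoreSketch {
  final case class PlayerStat(name: String, year: Int, value: Double)

  // Mean and population standard deviation of the statistic for each season.
  def seasonStats(rows: Seq[PlayerStat]): Map[Int, (Double, Double)] =
    rows.groupBy(_.year).map { case (year, rs) =>
      val xs = rs.map(_.value)
      val mean = xs.sum / xs.size
      val variance = xs.map(x => math.pow(x - mean, 2)).sum / xs.size
      (year, (mean, math.sqrt(variance)))
    }

  // z = (value - season mean) / season standard deviation.
  // A season with zero spread yields 0.0 to avoid dividing by zero.
  def zScores(rows: Seq[PlayerStat]): Seq[(PlayerStat, Double)] = {
    val stats = seasonStats(rows)
    rows.map { r =>
      val (mean, sd) = stats(r.year)
      val z = if (sd == 0.0) 0.0 else (r.value - mean) / sd
      (r, z)
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, not real NBA data.
    val rows = Seq(
      PlayerStat("A", 2016, 1.0),
      PlayerStat("B", 2016, 3.0),
      PlayerStat("C", 1981, 0.5)
    )
    zScores(rows).foreach { case (r, z) => println(s"${r.name} ${r.year} $z") }
  }
}
```

Normalizing within a season is what makes players from different eras comparable: a value is judged against that season's league average rather than against raw totals.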

Use Case – Querying through Spark SQL

Use Case – Getting All Player Statistics

    // Create schema for the DataFrame
    val schemaN = StructType(
      StructField("name", StringType, true) ::
      StructField("year", IntegerType, true) ::
      ...
      StructField("nTOT", DoubleType, true) :: Nil
    )

    // Create DataFrame 'dfPlayersT' and register as 'tPlayers'
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val dfPlayersT = sqlContext.createDataFrame(nPlayer, schemaN)
    dfPlayersT.registerTempTable("tPlayers")

    // Create DataFrame 'dfPlayers' and register as 'Players'
    // exp = the player's age minus the youngest age at which that player appears
    val dfPlayers = sqlContext.sql("select age-min_age as exp, tPlayers.* from tPlayers join (select name, min(age) as min_age from tPlayers group by name) as t1 on tPlayers.name=t1.name order by tPlayers.name, exp")
    dfPlayers.registerTempTable("Players")
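The join that builds 'Players' derives an exp column: a player's age in a given row minus the youngest age recorded for that player's name. As a rough sketch of the same logic outside Spark SQL, here it is with plain Scala collections. ExperienceSketch, PlayerRow and the sample rows are hypothetical names and data used only for this illustration.

```scala
// Hypothetical sketch: the "age - min_age as exp" join, rewritten with plain
// Scala collections to show what the SQL computes.
object ExperienceSketch {
  final case class PlayerRow(name: String, year: Int, age: Int)

  // Pairs every row with exp = age - (minimum age seen for that name),
  // ordered by name and then exp, like the ORDER BY in the query.
  def withExperience(rows: Seq[PlayerRow]): Seq[(Int, PlayerRow)] = {
    // Equivalent of: select name, min(age) as min_age from tPlayers group by name
    val minAge: Map[String, Int] =
      rows.groupBy(_.name).map { case (name, rs) => (name, rs.map(_.age).min) }

    rows
      .map(r => (r.age - minAge(r.name), r))
      .sortBy { case (exp, r) => (r.name, exp) }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, not real NBA data.
    val rows = Seq(
      PlayerRow("X", 2017, 22),
      PlayerRow("X", 2015, 20),
      PlayerRow("Y", 2016, 25)
    )
    withExperience(rows).foreach { case (exp, r) => println(s"$exp $r") }
  }
}
```

Like the SQL, this groups by name only, so two different players who share a name would be treated as one.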

Use Case – Storing Best Players Into HDFS

    // Calculate the best players of 2016
    val mvp = sqlContext.sql("Select name, zTot from Players where year=2016 order by zTot desc").cache
    mvp.show

    // Storing the best players of 2016 into HDFS
    mvp.write.format("csv").save("hdfs://localhost:9000/basketball/output.csv")

    // Listing the full numbers of LeBron James
    sqlContext.sql("Select * from Players where year=2016 and name='LeBron James'").collect.foreach(println)

    // Ranking the top 10 players on the average 3 pointers scored per game in 2016
    sqlContext.sql("select name, 3p, z3p from Players where year=2016 order by z3p desc").take(10).foreach(println)

Use Case – Storing Best Players Into HDFS
Figure: Best Player Of 2016
Figure: Most 3 Pointers In 2016
Figure: All Stats Of LeBron James

Use Case – Sample Result File in HDFS
Figure: Output directory in HDFS file system (showing the output directory path)
Figure: Output file containing top NBA players of 2016

Use Case – Highest 3 Point Shooters

    // All time 3 point shooting ranking
    sqlContext.sql("select name, 3p, z3p from Players order by 3p desc").take(10).foreach(println)

    // All time 3 point shooting ranking normalized to their leagues
    sqlContext.sql("select name, 3p, z3p from Players order by z3p desc").take(10).foreach(println)

    // Look up the average number of 3 pointers per game in 2016
    broadcastStats.value("2016_3P_avg")

    // Look up the average number of 3 pointers per game in 1981
    broadcastStats.value("1981_3P_avg")

Use Case – Highest 3 Point Shooters
Figure: Best All Time 3 Point Shooter
Figure: Best All Time 3 Point Shooter Normalized To Their Season

Use Case – Prediction Analysis Results

Use Case – Who Will Be The 2016 NBA MVP?

    sqlContext.sql("select name, zTot from Players where year=2016 order by zTot desc").take(10).foreach(println)

Candidates pictured: LeBron James, James Harden, Dwyane Wade, Kobe Bryant, Russell Westbrook, Stephen Curry

Use Case – Predicting MVP 2016
Our model predicts Stephen Curry as the NBA MVP of 2016. Hell yeah! It matches the actual NBA MVP of 2016.

Summary

Summary
- Spark Overview
- Hadoop Overview
- Spark vs Hadoop
- Why Spark Hadoop?
- YARN Spark Deployment
- Sport Analysis

Conclusion

Conclusion
Congrats! We have demonstrated the power of Spark Hadoop in prediction analytics. The hands-on examples should give you the confidence to work on future projects you encounter in Apache Spark and Hadoop.

Thank You … Questions/Queries/Feedback
