This Edureka Spark Hadoop tutorial will help you understand how to use Spark and Hadoop together. It is ideal for beginners as well as professionals who want to learn or brush up on their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) Using Hadoop With Spark
6) Use Case - Sports Analytics (NBA)
EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
What to expect?
1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) Using Hadoop With Spark
6) Use Case
7) Conclusion
Spark Overview
What is Spark?
Apache Spark is an open-source cluster-computing framework for real-time processing, maintained by the Apache Software Foundation.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark extends the MapReduce model to efficiently support more types of computation, without depending on Hadoop MapReduce itself.
Figure: Real Time Processing In Spark
Figure: Data Parallelism In Spark (serial vs parallel execution, showing the reduction in time)
Spark Overview
Spark is used in real-time processing.
Polyglot: can be programmed in Scala, Java, Python and R.
Real-time computation and low latency because of in-memory computation.
Lazy evaluation: delays evaluation until a result is needed.
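Lazy evaluation can be illustrated without a cluster. The sketch below is not Spark code: it uses a plain Scala `Iterator` as an analogy, where `map` plays the role of a transformation (nothing runs yet) and `take(...).toList` plays the role of an action (work happens on demand). The object name `LazyEvalSketch` and its members are hypothetical.

```scala
object LazyEvalSketch {
  // Counts how many times the mapping function has actually run.
  var evaluations = 0

  // Analogous to a Spark transformation: this only describes the work.
  def doubled(xs: Seq[Int]): Iterator[Int] =
    xs.iterator.map { x => evaluations += 1; x * 2 }

  def main(args: Array[String]): Unit = {
    val pipeline = doubled(Seq(1, 2, 3, 4, 5))
    println(s"after defining the pipeline: $evaluations evaluations")

    // Analogous to a Spark action: asking for results forces the computation,
    // and only for the elements that are actually requested.
    val firstTwo = pipeline.take(2).toList
    println(s"firstTwo = $firstTwo, evaluations = $evaluations")
  }
}
```

Only the elements that are requested get computed, which is the same reason Spark can skip or combine work until an action is called.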
Spark Ecosystem
Spark Core Engine: the core engine for the entire Spark framework. Provides utilities and architecture for the other components.
Spark SQL (SQL): used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data.
MLlib (Machine Learning): machine learning libraries built on top of Spark.
GraphX (Graph Computation): graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts.
SparkR (R on Spark): package for the R language that lets R users leverage Spark from the R shell.
Spark Features
Speed: up to 100x faster than Hadoop MapReduce for large-scale data processing.
Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities.
Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager.
Polyglot: can be programmed in Scala, Java, Python and R.
Spark Use Cases
Twitter Sentiment Analysis With Spark: trending topics can be used to create campaigns and attract a larger audience. Sentiment helps in crisis management, service adjusting and target marketing.
NYSE: Real Time Analysis of Stock Market Data
Banking: Credit Card Fraud Detection
Genomic Sequencing
Hadoop Overview
What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
HDFS (Storage): allows us to dump any kind of data across the cluster.
MapReduce (Processing): allows parallel processing of the data stored in HDFS.
Figure: Hadoop Cluster (master and slaves)
Hadoop Ecosystem
Hadoop Features
Flexibility: flexible with all kinds of data.
Scalability: in-built capability of integrating seamlessly with cloud-based services.
Reliability: Hadoop infrastructure has in-built fault tolerance features.
Economical: usage of commodity hardware minimizes the cost of ownership.
Hadoop Use Cases
E-Commerce Data Analytics
Politics: US Presidential Election
Banking: Credit Card Fraud Detection
Healthcare
Spark vs Hadoop
Spark vs Hadoop
Use cases for real-time analytics: Healthcare, Stock Market, Telecommunications, Banking, Government.
Our requirements:
1) Process data in real time
2) Handle input from multiple sources
3) Easy to use
4) Faster processing
Spark vs Hadoop
Spark runs up to 100 times faster than Hadoop MapReduce. In-memory processing is what makes Spark faster than MapReduce. Spark is not considered a replacement for Hadoop but an extension to it.
Figure: PageRank performance, iteration time (s) for Hadoop, Basic Spark, and Spark + Controlled Partitioning
The best case as per our chart is when Spark is used alongside Hadoop. Let us dive in and use Hadoop with Spark.
Why use Spark with Hadoop?
Why Spark Hadoop?
Using Spark and Hadoop together lets us combine Spark's processing with the best of Hadoop: HDFS for storage and YARN for resource allocation.
Figure: Spark Hadoop data flow. Sources of input data: Spark Streaming, CSV, Sequence File, Avro, Parquet. Storage: HDFS. Resource allocation: YARN. Processing: Spark, with MapReduce as optional processing. Output data.
Using Hadoop with Spark
Spark can be used along with MapReduce in the same Hadoop cluster, or separately as a processing framework.
Spark applications can also be run on YARN (Hadoop NextGen).
Spark can run on top of HDFS to leverage the distributed, replicated storage.
MapReduce and Spark can be used together, with MapReduce for batch processing and Spark for real-time processing.
YARN Deployment With Spark
YARN Cluster Mode: the Spark driver runs inside an application master process which is managed by YARN. The client can go away after initiating the application.
YARN Client Mode: the Spark driver runs in the client process. The application master is only used for requesting resources from YARN.
Figure: Cluster Deployment Mode
Figure: Client Deployment Mode
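The two modes are chosen at submission time with `spark-submit`'s `--master` and `--deploy-mode` flags. The sketch below only assembles that command line as a list of arguments; it does not launch anything. The object name `YarnSubmitSketch`, the helper `submitCommand` and the jar name `basketball.jar` are hypothetical, and the main class `basketball` is taken from the use case later in this deck. Older Spark 1.x releases also accepted `yarn-cluster` and `yarn-client` as master values, which this sketch does not cover.

```scala
object YarnSubmitSketch {
  // Builds the argument list for a spark-submit call against YARN.
  // deployMode must be "cluster" (driver inside the YARN application master)
  // or "client" (driver inside the submitting client process).
  def submitCommand(deployMode: String, mainClass: String, appJar: String): Seq[String] = {
    require(deployMode == "cluster" || deployMode == "client",
      s"unknown deploy mode: $deployMode")
    Seq("spark-submit",
      "--master", "yarn",
      "--deploy-mode", deployMode,
      "--class", mainClass,
      appJar)
  }

  def main(args: Array[String]): Unit = {
    // Print both variants so they can be compared side by side.
    println(submitCommand("cluster", "basketball", "basketball.jar").mkString(" "))
    println(submitCommand("client", "basketball", "basketball.jar").mkString(" "))
  }
}
```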
Use Case – Sports Analysis Using Spark Hadoop
Use Case
Problem Statement: to build a sports analysis system using Spark Hadoop for predicting game results and player rankings for sports like Basketball, Football, Cricket, Soccer, etc. We will demonstrate the same using Basketball for our use case.
Figure: Stephen Curry, NBA MVP 2015 & 2016
Figure: Kevin Durant, NBA MVP 2014
Figure: Joe Hassett, Highest 3 Pt Normalized
Figure: LeBron James, NBA MVP '10, '12 & '13
Use Case – Flow Diagram
1) Huge amount of sports data
2) Data stored in HDFS
3) Using Spark processing for analysis
4) Query 1: predict the NBA Most Valuable Player (MVP). Query 2: calculate top scorers per season. Query 3: compare teams to predict winners.
Use Case – Dataset
Figure: Dataset from http://www.basketball-reference.com/leagues/NBA_2016_per_game.html
Use Case – Initializing Spark Packages
    //Importing the necessary packages
    import org.apache.spark.rdd._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import org.apache.spark.util.StatCounter
    import org.apache.spark.sql.Row
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import scala.collection.mutable.ListBuffer
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import scala.io.Source
    import scala.collection.mutable.HashMap
    import java.io.File
Use Case – Reading Data From HDFS
    //Creating an object basketball containing our main() method
    object basketball {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("basketball").setMaster("local[2]")
        val sc = new SparkContext(sparkConf)
        //Tag every data line of each season with its year and save it back to HDFS
        for (i <- 1980 to 2016) {
          println(i)
          val yearStats = sc.textFile(s"hdfs://localhost:9000/basketball/BasketballStats/leagues_NBA_$i*")
          yearStats.filter(x => x.contains(",")).map(x => (i,x)).saveAsTextFile(s"hdfs://localhost:9000/basketball/BasketballStatsWithYear/$i/")
        }
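Because `saveAsTextFile` writes the `toString` of each element, every `(year, line)` pair ends up on disk as a single string of the form `(year,line)`. The sketch below shows that representation and one way to split it back apart in plain Scala. The object name `YearTagSketch`, the helpers `tag` and `untag`, and the sample row are all hypothetical; the deck does not show how its own parser handles the prefix.

```scala
object YearTagSketch {
  // What ends up in the text file for one (year, line) element.
  def tag(year: Int, line: String): String = (year, line).toString

  // Recover the year and the original line from a tagged record.
  // Returns None when the record does not start with a numeric year.
  def untag(record: String): Option[(Int, String)] = {
    val body = record.stripPrefix("(").stripSuffix(")")
    body.split(",", 2) match {
      case Array(year, rest) if year.nonEmpty && year.length <= 4 && year.forall(_.isDigit) =>
        Some((year.toInt, rest))
      case _ => None
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up row, not real NBA data.
    val record = tag(1980, "1,Player A,PG,25")
    println(record)
    println(untag(record))
  }
}
```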
Use Case – Parsing Data And Broadcasting
    //Read in all the statistics
    val stats = sc.textFile("hdfs://localhost:9000/basketball/BasketballStatsWithYear/*/*").repartition(sc.defaultParallelism)
    //Filter out the junk rows and clean up data for errors
    val filteredStats = stats.filter(line => !line.contains("FG%")).filter(line => line.contains(",")).map(line => line.replace("*","").replace(",,",",0,"))
    filteredStats.cache()
    //Parse statistics and save as Map
    val txtStat = Array("FG","FGA","FG%","3P","3PA","3P%","2P","2PA","2P%","eFG%","FT","FTA","FT%","ORB","DRB","TRB","AST","STL","BLK","TOV","PF","PTS")
    val aggStats = processStats(filteredStats,txtStat).collectAsMap
    //Collect RDD into map and broadcast it into 'broadcastStats'
    val broadcastStats = sc.broadcast(aggStats)
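The cleaning step can be tried on plain strings before running it on an RDD. The sketch below reuses the same filter and replacements on made-up rows. One thing it makes visible: `replace(",,", ",0,")` does not rescan the text it has just inserted, so a run of two consecutive empty fields is only partly filled. `cleanFields` is a hypothetical alternative that fills every empty field (including a leading or trailing one, which differs from the original behaviour); the object name `LineCleaningSketch` and all helper names are assumptions, not part of the deck.

```scala
object LineCleaningSketch {
  // Header rows repeat the column names, so they contain "FG%";
  // data rows are comma separated.
  def isDataRow(line: String): Boolean =
    !line.contains("FG%") && line.contains(",")

  // Same replacements as the slide: drop "*" markers, fill an empty field with 0.
  def cleanLine(line: String): String =
    line.replace("*", "").replace(",,", ",0,")

  // Alternative: split on commas (keeping trailing empties) and fill every empty field.
  def cleanFields(line: String): String =
    line.replace("*", "").split(",", -1).map(f => if (f.isEmpty) "0" else f).mkString(",")

  def main(args: Array[String]): Unit = {
    // Made-up rows, not real NBA data.
    val rows = Seq("Rk,Player,Pos,FG%,3P", "1,Player A*,PG,,12", "2,Player B,SG,,,7")
    rows.filter(isDataRow).foreach { r =>
      println(s"${cleanLine(r)}  |  ${cleanFields(r)}")
    }
  }
}
```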
Use Case – Player Statistics Transformations
    //Parse stats and track weights
    val txtStatZ = Array("FG","FT","3P","TRB","AST","STL","BLK","TOV","PTS")
    val zStats = processStats(filteredStats,txtStatZ,broadcastStats.value).collectAsMap
    //Collect RDD into Map and broadcast into 'zBroadcastStats'
    val zBroadcastStats = sc.broadcast(zStats)
    //Parse stats and normalize (needs both broadcast maps, so it comes after them)
    val nStats = filteredStats.map(x => bbParse(x,broadcastStats.value,zBroadcastStats.value))
    //Map RDD to RDD[Row] so that we can turn it into a DataFrame
    val nPlayer = nStats.map(x => Row.fromSeq(Array(x.name,x.year,x.age,x.position,x.team,x.gp,x.gs,x.mp) ++ x.stats ++ x.statsZ ++ Array(x.valueZ) ++ x.statsN ++ Array(x.valueN)))
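The deck does not show the bodies of `processStats` or `bbParse`, so the exact normalization is not known. The `Z` suffix on names such as `statsZ` and `zTot` suggests z-scores, and the sketch below illustrates that idea under this assumption: each value is expressed as its distance from the season mean in units of standard deviation. The population standard deviation is an arbitrary choice here, and the object name `ZScoreSketch`, its helpers and the sample values are hypothetical.

```scala
object ZScoreSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population standard deviation (divides by n, not n - 1).
  def stdDev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }

  // z = (x - mean) / stdDev; a constant series maps to all zeros.
  def zScores(xs: Seq[Double]): Seq[Double] = {
    val m = mean(xs)
    val s = stdDev(xs)
    xs.map(x => if (s == 0.0) 0.0 else (x - m) / s)
  }

  def main(args: Array[String]): Unit = {
    // Made-up per-game values for one statistic in one season.
    val threePointers = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    println(zScores(threePointers))
  }
}
```

Normalizing within a season is what makes players from different eras comparable: a value is judged against the league of its own year rather than against absolute totals.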
Use Case – Querying through Spark SQL
Use Case – Getting All Player Statistics
    //Create schema for the data frame
    val schemaN = StructType(
      StructField("name", StringType, true) ::
      StructField("year", IntegerType, true) ::
      ...
      StructField("nTOT", DoubleType, true) :: Nil
    )
    //Create DataFrame 'dfPlayersT' and register as 'tPlayers'
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val dfPlayersT = sqlContext.createDataFrame(nPlayer,schemaN)
    dfPlayersT.registerTempTable("tPlayers")
    //Create DataFrame 'dfPlayers' with each player's experience (age minus age at first season) and register as 'Players'
    val dfPlayers = sqlContext.sql("select age-min_age as exp, tPlayers.* from tPlayers join (select name, min(age) as min_age from tPlayers group by name) as t1 on tPlayers.name=t1.name order by tPlayers.name, exp")
    dfPlayers.registerTempTable("Players")
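The join in the `Players` query derives an experience column: for every season row, `exp` is the player's age in that season minus the youngest age recorded for the same name. The sketch below reproduces that logic on plain Scala collections so it can be checked on a few rows without a cluster. It is an analogy to the SQL, not a replacement for it; the case classes, the object name `ExperienceSketch` and the sample rows are hypothetical.

```scala
object ExperienceSketch {
  final case class Season(name: String, year: Int, age: Int)
  final case class SeasonWithExp(exp: Int, name: String, year: Int, age: Int)

  def withExperience(rows: Seq[Season]): Seq[SeasonWithExp] = {
    // Inner query: select name, min(age) as min_age ... group by name
    val minAge: Map[String, Int] =
      rows.groupBy(_.name).map { case (name, rs) => name -> rs.map(_.age).min }

    // Outer query: join on name, compute exp, order by name then exp
    rows
      .map(r => SeasonWithExp(r.age - minAge(r.name), r.name, r.year, r.age))
      .sortBy(r => (r.name, r.exp))
  }

  def main(args: Array[String]): Unit = {
    // Made-up rows, not real NBA data.
    val rows = Seq(
      Season("Player B", 2015, 25),
      Season("Player A", 2016, 24),
      Season("Player A", 2014, 22))
    withExperience(rows).foreach(println)
  }
}
```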
Use Case – Storing Best Players Into HDFS
    //Calculate the best players of 2016
    val mvp = sqlContext.sql("Select name, zTot from Players where year=2016 order by zTot desc").cache
    mvp.show
    //Storing the best players of 2016 into HDFS
    mvp.write.format("csv").save("hdfs://localhost:9000/basketball/output.csv")
    //Listing the full numbers of LeBron James
    sqlContext.sql("Select * from Players where year=2016 and name='LeBron James'").collect.foreach(println)
    //Ranking the top 10 players on the average 3 pointers scored per game in 2016
    sqlContext.sql("select name, 3p, z3p from Players where year=2016 order by z3p desc").take(10).foreach(println)
Use Case – Storing Best Players Into HDFS
Figure: Best Player Of 2016
Figure: Most 3 Pointers In 2016
Figure: All Stats Of LeBron James
Use Case – Sample Result File in HDFS
Figure: Output file containing top NBA players of 2016
Figure: Output directory in HDFS file system (showing the output directory path)
Use Case – Highest 3 Point Shooters
    //All time 3 point shooting ranking
    sqlContext.sql("select name, 3p, z3p from Players order by 3p desc").take(10).foreach(println)
    //All time 3 point shooting ranking normalized to their leagues
    sqlContext.sql("select name, 3p, z3p from Players order by z3p desc").take(10).foreach(println)
    //Look up the average number of 3 pointers per game in 2016
    broadcastStats.value("2016_3P_avg")
    //Look up the average number of 3 pointers per game in 1981
    broadcastStats.value("1981_3P_avg")
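The lookups `broadcastStats.value("2016_3P_avg")` and `broadcastStats.value("1981_3P_avg")` imply a map keyed by year, statistic name and an `avg` suffix. How `processStats` builds it is not shown in the deck, so the sketch below is only one plausible way to produce such a map from `(year, stat, value)` samples using plain Scala collections. The object name `SeasonAverageSketch`, the helper `averages` and the sample values are hypothetical.

```scala
object SeasonAverageSketch {
  // Each sample is (year, statistic name, one player's per-game value).
  // The result maps keys such as "2016_3P_avg" to the season average.
  def averages(samples: Seq[(Int, String, Double)]): Map[String, Double] =
    samples
      .groupBy { case (year, stat, _) => s"${year}_${stat}_avg" }
      .map { case (key, vs) => key -> vs.map(_._3).sum / vs.size }

  def main(args: Array[String]): Unit = {
    // Made-up values, not real NBA data.
    val samples = Seq((2016, "3P", 1.0), (2016, "3P", 3.0), (1981, "3P", 0.5))
    val stats = averages(samples)
    println(stats("2016_3P_avg"))
    println(stats("1981_3P_avg"))
  }
}
```

A map this small is a natural fit for broadcasting, since every executor needs the same season averages while parsing player rows.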
Use Case – Highest 3 Point Shooters
Figure: Best All Time 3 Point Shooter
Figure: Best All Time 3 Point Shooter Normalized To Their Season
Use Case – Prediction Analysis Results
Use Case – Who Will Be The 2016 NBA MVP?
    sqlContext.sql("select name, zTot from Players where year=2016 order by zTot desc").take(10).foreach(println)
Figure: LeBron James, James Harden, Dwyane Wade, Kobe Bryant, Russell Westbrook, Stephen Curry
Use Case – Predicting MVP 2016
Our model predicts Stephen Curry as the NBA MVP of 2016. Hell yeah! That matches the actual NBA MVP of 2016.
Summary
Summary
1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) YARN Spark Deployment
6) Sport Analysis
Conclusion
Conclusion
Congrats! We have demonstrated the power of Spark and Hadoop together in predictive analytics. The hands-on examples should give you the confidence to work on future projects in Apache Spark and Hadoop.
Thank You … Questions/Queries/Feedback