We will see an overview of Spark in Big Data. We will start with an introduction to Apache Spark programming, then move on to Spark's history. Moreover, we will learn why Spark is needed. Afterward, we will cover all the fundamental Spark components. Furthermore, we will learn about Spark's core abstraction, the Spark RDD. For more detailed insights, we will also cover Spark's features, limitations, and use cases.
https://data-flair.training/blogs/spark-tutorial/
Introduction to Apache Spark Certified Apache Spark and Scala Training – DataFlair
Agenda
– Before Spark
– Need for Spark
– What is Apache Spark?
– Goals
– Why Spark?
– RDD & its Operations
– Features of Spark
Before Spark
– Batch processing
– Stream processing
– Interactive processing
– Graph processing
– Machine learning
Need for Spark
– Need for a powerful engine that can process data in real time (streaming) as well as in batch mode.
– Need for a powerful engine that can respond in sub-second time and perform in-memory analytics.
– Need for a powerful engine that can handle diverse workloads: batch, streaming, interactive, graph, and machine learning.
What is Apache Spark?
Apache Spark is a powerful open-source engine that can handle:
– Batch processing
– Real-time (stream) processing
– Interactive queries
– Graph processing
– Machine learning (iterative)
– In-memory processing
Introduction to Apache Spark
– Lightning-fast cluster computing tool
– General-purpose distributed system
– Provides APIs in Scala, Java, Python, and R
History
– 2009: Introduced by UC Berkeley
– 2010: Open sourced
– 2013: Donated to the Apache Software Foundation
– 2014: Became an Apache top-level project; set the world record in large-scale sorting
– 2015: One of the most active projects at Apache
Sort Record (source: Databricks)

                          Hadoop MapReduce      Spark
Data size                 102.5 TB              100 TB
Time taken                72 min                23 min
Number of nodes           2,100                 206
Number of cores           50,400 physical       6,592 virtualized
Cluster disk throughput   3,150 GB/s            618 GB/s
Network                   Dedicated 10 Gbps     Virtualized 10 Gbps
Goals
– Easy to combine batch, streaming, and interactive computations ("one stack to rule them all").
– Easy to develop sophisticated algorithms.
– Compatible with the existing open-source ecosystem.
Why Spark?
– Up to 100x faster than Hadoop MapReduce: intermediate results stay in memory instead of being written to disk between operations.
– In-memory computation.
– Language support: Scala, Java, Python, and R.
– Supports both real-time and batch processing: Spark Streaming divides the input data stream into batches of input data, which the Spark engine processes into batches of results.
– Lazy operations: the job is optimized before execution.
– Support for multiple transformations and actions, e.g. RDD1 → map() → RDD2 → filter() → RDD3 → collect() → result.
– Compatible with Hadoop: can process existing Hadoop data.
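The micro-batch streaming model (input stream → batches of input data → Spark engine → batches of processed data) can be sketched by chunking a stream into fixed-size batches. This is a conceptual stand-in in plain Python, not the Spark Streaming API:

```python
def micro_batches(stream, batch_size):
    """Group a stream of records into fixed-size batches
    (a toy analogue of how Spark Streaming discretises an input stream)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch

# Each batch is handed to the "engine" (here: a plain sum) as it fills up.
results = [sum(batch) for batch in micro_batches(range(7), batch_size=3)]
print(results)  # [3, 12, 6]
```

In real Spark Streaming the batch boundary is a time interval rather than a record count, but the shape of the computation is the same: a continuous stream becomes a sequence of small batch jobs.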
Spark Architecture
Spark Nodes
– Master node: runs the Master.
– Slave nodes: run the Workers.
Basic Spark Architecture: the master splits the work into sub-work units that are distributed across the worker nodes for parallel execution.
Resilient Distributed Dataset (RDD)
– An RDD is a simple, immutable collection of objects (Obj1, Obj2, …, Obj n).
– An RDD can contain any type of Scala, Java, Python, or R objects.
– Each RDD is split into partitions, which may be computed on different nodes of the cluster.
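How a dataset is split into partitions can be illustrated with a small helper. This is a local sketch only; in Spark the split happens automatically (e.g. when creating an RDD from a file or collection), and the partition count is configurable:

```python
def partition(data, num_partitions):
    """Split a dataset into roughly equal contiguous partitions
    (a conceptual sketch of RDD partitioning, not the Spark API)."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions get one extra element each.
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

print(partition(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each of these slices corresponds to a partition that Spark could schedule on a different node of the cluster.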
Resilient Distributed Dataset (RDD): creating an RDD from a file. The blocks of Employee-data.txt stored across the Hadoop cluster become the partitions of the RDD.
RDD Operations
– Transformations
– Actions
– Persistence
RDD Operations – Transformations
– A set of operations that define how an RDD should be transformed.
– A transformation creates a new RDD from an existing one to process the data.
– Lazily evaluated: computation does not start until an action is invoked.
– E.g. map, flatMap, filter, union, groupBy.
RDD Operations – Actions
– An action triggers job execution.
– Returns the result to the driver or writes it to storage.
– E.g. count, collect, reduce, take.
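Lazy evaluation, where transformations are only recorded and computation is triggered by an action, can be mimicked with Python generators. This is a conceptual stand-in, not the Spark API; the `log` list is an illustrative device to show when work actually happens:

```python
log = []  # records each element as it is actually processed

def lazy_map(f, data):
    # Generator: nothing runs until the result is consumed.
    for x in data:
        log.append(f"map({x})")
        yield f(x)

def lazy_filter(pred, data):
    for x in data:
        log.append(f"filter({x})")
        if pred(x):
            yield x

# Build the pipeline: like chaining Spark transformations, this does no work.
pipeline = lazy_filter(lambda x: x % 2 == 0, lazy_map(lambda x: x + 1, [1, 2, 3]))
print(log)              # [] -- transformations are lazy, nothing computed yet

result = list(pipeline)  # the "action": consuming the pipeline triggers the work
print(result)            # [2, 4]
```

Because Spark sees the whole chain of transformations before any work starts, it can optimize the job as a whole, which is exactly the benefit of laziness noted earlier.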
RDD Operations – Persistence
– Spark allows caching/persisting an entire dataset in memory.
– A cached RDD is kept in memory (backed by primary storage) for future operations.
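The effect of caching, computing a dataset once and reusing it thereafter, can be sketched with `functools.lru_cache` as a local analogy. In Spark you would call `rdd.cache()` or `rdd.persist()` instead; the `calls` counter here is an illustrative device:

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the dataset is actually computed

@lru_cache(maxsize=None)
def expensive_dataset(n):
    # Stands in for recomputing an RDD from its lineage from scratch.
    calls["count"] += 1
    return tuple(x * x for x in range(n))

expensive_dataset(5)   # first use: computed
expensive_dataset(5)   # second use: served from the cache
print(calls["count"])  # 1 -- the dataset was computed only once
```

Without persistence, every action on an RDD recomputes its lineage; caching pays the computation cost once and makes every later reuse cheap, which is why it matters for iterative algorithms.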
RDD Operations
– Transformations (map(), flatMap(), …): create a new RDD from a parent RDD based on custom business logic; the chain of transformations forms the RDD's lineage.
– Actions (count(), saveAsTextFile(), …): return output to the driver or export data to a storage system after computation, producing the result.
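The transformation/action chain can be sketched with a hypothetical `ToyRDD` class whose method names mirror Spark's RDD API but which runs locally on a plain list; it is a conceptual sketch, not pyspark:

```python
class ToyRDD:
    """A local, immutable stand-in for an RDD (conceptual sketch only)."""

    def __init__(self, data):
        self._data = list(data)

    # Transformations return a NEW ToyRDD, mirroring RDD immutability.
    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    # Action: materialise the result on the "driver".
    def collect(self):
        return list(self._data)

rdd1 = ToyRDD([1, 2, 3, 4, 5])
rdd2 = rdd1.map(lambda x: x * 10)      # Transformation 1: map()
rdd3 = rdd2.filter(lambda x: x > 20)   # Transformation 2: filter()
result = rdd3.collect()                # Action: collect()
print(result)  # [30, 40, 50]
```

Note that each transformation produces a new object and leaves its parent untouched: `rdd1` still holds the original data, which is the property that lets Spark rebuild lost partitions from lineage.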
Features of Spark
– Speed: 100x faster than Hadoop.
– Duplicate elimination: every record is processed exactly once.
– Memory management: automatic memory management.
– Processing window criteria: time-based window criteria.
– Fault tolerance: recovers automatically.
– Diverse processing platform.
Thank You – DataFlair