We will see an overview of Spark in Big Data. We will start with an introduction to Apache Spark programming, then move on to Spark's history. Moreover, we will learn why Spark is needed. Afterward, we will cover all the fundamental Spark components. Furthermore, we will learn about Spark's core abstraction, the Spark RDD. For more detailed insights, we will also cover Spark's features, limitations, and use cases.
https://data-flair.training/blogs/spark-tutorial/
Introduction to Apache Spark Certified Apache Spark and Scala Training – DataFlair
Agenda
– Before Spark
– Need for Spark
– What is Apache Spark?
– Goals
– Why Spark?
– RDD & its Operations
– Features of Spark
Before Spark
– Batch processing
– Stream processing
– Interactive processing
– Graph processing
– Machine learning
Need for Spark
– Need for a powerful engine that can process data in real time (streaming) as well as in batch mode.
– Need for a powerful engine that can respond in sub-second time and perform in-memory analytics.
– Need for a powerful engine that can handle diverse workloads: batch, streaming, interactive, graph, and machine learning.
What is Apache Spark?
Apache Spark is a powerful open-source engine that can handle:
– Batch processing
– Real-time (stream) processing
– Interactive queries
– Graph processing
– Machine learning (iterative)
– In-memory processing
Introduction to Apache Spark
– Lightning-fast cluster computing tool
– General-purpose distributed system
– Provides APIs in Scala, Java, Python, and R
History
– 2009: Introduced by UC Berkeley
– 2010: Open sourced
– 2013: Donated to the Apache Software Foundation
– 2014: Became an Apache top-level project; set the world record in large-scale sorting
– 2015: One of the most active projects at Apache
Sort Record (source: Databricks)

                          Hadoop MapReduce      Spark
Data size                 102.5 TB              100 TB
Time taken                72 min                23 min
Number of nodes           2,100                 206
Number of cores           50,400 physical       6,592 virtualized
Cluster disk throughput   3,150 GB/s            618 GB/s
Network                   Dedicated 10 Gbps     Virtualized 10 Gbps
Goals
– Easy to combine batch, streaming, and interactive computations ("one stack to rule them all").
– Easy to develop sophisticated algorithms.
– Compatible with the existing open-source ecosystem.
Why Spark?
– Up to 100x faster than Hadoop MapReduce: intermediate results stay in memory instead of being written to disk between operations.
– In-memory computation.
– Language support: Scala, Java, Python, and R.
– Supports both real-time and batch processing: Spark Streaming divides the input data stream into batches of input data, which the Spark engine processes into batches of results.
– Lazy operations: the job is optimized before execution.
– Support for multiple transformations and actions, e.g. RDD1 → map() → RDD2 → filter() → RDD3 → collect() → result.
– Compatible with Hadoop: can process existing Hadoop data.
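The micro-batch streaming model (input stream → batches of input data → Spark engine → batches of processed data) can be sketched by chunking a stream into fixed-size batches. This is a conceptual stand-in in plain Python, not the Spark Streaming API:

```python
def micro_batches(stream, batch_size):
    """Group a stream of records into fixed-size batches
    (a toy analogue of how Spark Streaming discretises an input stream)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch

# Each batch is handed to the "engine" (here: a plain sum) as it fills up.
results = [sum(batch) for batch in micro_batches(range(7), batch_size=3)]
print(results)  # [3, 12, 6]
```

In real Spark Streaming the batch boundary is a time interval rather than a record count, but the shape of the computation is the same: a continuous stream becomes a sequence of small batch jobs.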
Spark Architecture
Spark Nodes
– Master node: runs the Master.
– Slave nodes: run the Workers.
Basic Spark Architecture: the master splits the work into sub-work units that are distributed across the worker nodes for parallel execution.
Resilient Distributed Dataset (RDD)
– An RDD is a simple, immutable collection of objects (Obj1, Obj2, …, Obj n).
– An RDD can contain any type of Scala, Java, Python, or R objects.
– Each RDD is split into partitions, which may be computed on different nodes of the cluster.
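How a dataset is split into partitions can be illustrated with a small helper. This is a local sketch only; in Spark the split happens automatically (e.g. when creating an RDD from a file or collection), and the partition count is configurable:

```python
def partition(data, num_partitions):
    """Split a dataset into roughly equal contiguous partitions
    (a conceptual sketch of RDD partitioning, not the Spark API)."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions get one extra element each.
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

print(partition(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each of these slices corresponds to a partition that Spark could schedule on a different node of the cluster.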
Resilient Distributed Dataset (RDD): creating an RDD from a file. The blocks of Employee-data.txt stored across the Hadoop cluster become the partitions of the RDD.
RDD Operations
– Transformations
– Actions
– Persistence
RDD Operations – Transformations
– A set of operations that define how an RDD should be transformed.
– A transformation creates a new RDD from an existing one to process the data.
– Lazily evaluated: computation does not start until an action is invoked.
– E.g. map, flatMap, filter, union, groupBy.
RDD Operations – Actions
– An action triggers job execution.
– Returns the result to the driver or writes it to storage.
– E.g. count, collect, reduce, take.
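Lazy evaluation, where transformations are only recorded and computation is triggered by an action, can be mimicked with Python generators. This is a conceptual stand-in, not the Spark API; the `log` list is an illustrative device to show when work actually happens:

```python
log = []  # records each element as it is actually processed

def lazy_map(f, data):
    # Generator: nothing runs until the result is consumed.
    for x in data:
        log.append(f"map({x})")
        yield f(x)

def lazy_filter(pred, data):
    for x in data:
        log.append(f"filter({x})")
        if pred(x):
            yield x

# Build the pipeline: like chaining Spark transformations, this does no work.
pipeline = lazy_filter(lambda x: x % 2 == 0, lazy_map(lambda x: x + 1, [1, 2, 3]))
print(log)              # [] -- transformations are lazy, nothing computed yet

result = list(pipeline)  # the "action": consuming the pipeline triggers the work
print(result)            # [2, 4]
```

Because Spark sees the whole chain of transformations before any work starts, it can optimize the job as a whole, which is exactly the benefit of laziness noted earlier.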
RDD Operations – Persistence
– Spark allows caching/persisting an entire dataset in memory.
– A cached RDD is kept in memory (backed by primary storage) for future operations.
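The effect of caching, computing a dataset once and reusing it thereafter, can be sketched with `functools.lru_cache` as a local analogy. In Spark you would call `rdd.cache()` or `rdd.persist()` instead; the `calls` counter here is an illustrative device:

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the dataset is actually computed

@lru_cache(maxsize=None)
def expensive_dataset(n):
    # Stands in for recomputing an RDD from its lineage from scratch.
    calls["count"] += 1
    return tuple(x * x for x in range(n))

expensive_dataset(5)   # first use: computed
expensive_dataset(5)   # second use: served from the cache
print(calls["count"])  # 1 -- the dataset was computed only once
```

Without persistence, every action on an RDD recomputes its lineage; caching pays the computation cost once and makes every later reuse cheap, which is why it matters for iterative algorithms.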
RDD Operations
– Transformations (map(), flatMap(), …): create a new RDD from a parent RDD based on custom business logic; the chain of transformations forms the RDD's lineage.
– Actions (count(), saveAsTextFile(), …): return output to the driver or export data to a storage system after computation, producing the result.
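The transformation/action chain can be sketched with a hypothetical `ToyRDD` class whose method names mirror Spark's RDD API but which runs locally on a plain list; it is a conceptual sketch, not pyspark:

```python
class ToyRDD:
    """A local, immutable stand-in for an RDD (conceptual sketch only)."""

    def __init__(self, data):
        self._data = list(data)

    # Transformations return a NEW ToyRDD, mirroring RDD immutability.
    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    # Action: materialise the result on the "driver".
    def collect(self):
        return list(self._data)

rdd1 = ToyRDD([1, 2, 3, 4, 5])
rdd2 = rdd1.map(lambda x: x * 10)      # Transformation 1: map()
rdd3 = rdd2.filter(lambda x: x > 20)   # Transformation 2: filter()
result = rdd3.collect()                # Action: collect()
print(result)  # [30, 40, 50]
```

Note that each transformation produces a new object and leaves its parent untouched: `rdd1` still holds the original data, which is the property that lets Spark rebuild lost partitions from lineage.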
Features of Spark
– Speed: 100x faster than Hadoop.
– Duplicate elimination: every record is processed exactly once.
– Memory management: automatic memory management.
– Processing window criteria: time-based window criteria.
– Fault tolerance: recovers automatically.
– Diverse processing platform.
Thank You – DataFlair