
Introduction to Apache Spark

This presentation gives an overview of Apache Spark in Big Data. We will start with an introduction to Apache Spark programming, then move on to Spark's history. Moreover, we will learn why Spark is needed. Afterward, we will cover the fundamental Spark components. Furthermore, we will learn about Spark's core abstraction, the RDD. For more detailed insights, we will also cover Spark's features, limitations, and use cases. https://data-flair.training/blogs/spark-tutorial/


Presentation Transcript


  1. Introduction to Apache Spark Certified Apache Spark and Scala Training – DataFlair

  2. Agenda
     • Before Spark
     • Need for Spark
     • What is Apache Spark?
     • Goals
     • Why Spark?
     • RDD & its Operations
     • Features of Spark

  3. Before Spark
     • Batch Processing
     • Stream Processing
     • Interactive Processing
     • Graph Processing
     • Machine Learning

  4. Need for Spark
     • Need for a powerful engine that can process data in real time (streaming) as well as in batch mode
     • Need for a powerful engine that can respond in sub-second time and perform in-memory analytics
     • Need for a powerful engine that can handle diverse workloads: batch, streaming, interactive, graph, and machine learning

  5. What is Apache Spark?
     Apache Spark is a powerful open-source engine which can handle:
     • Batch processing
     • Real-time (stream) processing
     • Interactive processing
     • Graph processing
     • Machine learning (iterative)
     • In-memory processing

  6. Introduction to Apache Spark
     • Lightning-fast cluster computing tool
     • General-purpose distributed system
     • Provides APIs in Scala, Java, Python, and R

  7. History
     • 2009 – Introduced by UC Berkeley
     • 2010 – Open sourced
     • 2013 – Donated to Apache
     • 2014 – Became a top-level Apache project; set a world record in sorting
     • 2015 – Most active project at Apache

  8. Sort Record (Src: Databricks)
                                Hadoop MapReduce     Spark
     Data size                  102.5 TB             100 TB
     Time taken                 72 min               23 min
     No. of nodes               2100                 206
     No. of cores               50400 physical       6592 virtualized
     Cluster disk throughput    3150 GB/s            618 GB/s
     Network                    Dedicated 10 Gbps    Virtualized 10 Gbps

  9. Goals
     • Easy to combine batch, streaming, and interactive computations – "One stack to rule them all"

  10. Goals
     • Easy to combine batch, streaming, and interactive computations
     • Easy to develop sophisticated algorithms

  11. Goals
     • Easy to combine batch, streaming, and interactive computations
     • Easy to develop sophisticated algorithms
     • Compatible with the existing open-source ecosystem

  12. Why Spark?
     • 100x faster than Hadoop

  13. Why Spark?
     • 100x faster than Hadoop
     • In-memory computation
     [Diagram: a chain of operations reading from and writing to disk]

  14. Why Spark?
     • 100x faster than Hadoop
     • In-memory computation
     [Diagram: disk-based processing writes to disk between every operation, while in-memory processing touches disk only at the start and end of the chain of operations]

  15. Why Spark?
     • 100x faster than Hadoop
     • In-memory computation
     • Language support for Scala, Java, Python, and R

  16. Why Spark?
     • 100x faster than Hadoop
     • In-memory computation
     • Language support for Scala, Java, Python, and R
     • Supports real-time and batch processing
     [Diagram: input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data]
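Spark Streaming supports real-time processing by micro-batching: the input stream is chopped into small batches, and the ordinary Spark engine processes each batch with the same logic. A toy sketch of that idea in plain Python (no Spark required; the batch size and word-count logic are illustrative assumptions, and real Spark Streaming batches by time interval rather than record count):

```python
# Toy illustration of micro-batching: chop a stream into small
# fixed-size batches and run the same batch logic on each one.

def micro_batches(stream, batch_size):
    """Group an input stream into lists of at most batch_size items."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def process_batch(batch):
    """Batch logic reused unchanged for every micro-batch: count words."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["spark", "hadoop", "spark", "spark", "flink"]
results = [process_batch(b) for b in micro_batches(stream, batch_size=2)]
print(results)  # one word-count result per micro-batch
```

Because each micro-batch goes through the normal batch engine, the same code path serves both batch and streaming workloads.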

  17. Why Spark?
     • 100x faster than Hadoop
     • In-memory computation
     • Language support for Scala, Java, Python, and R
     • Supports real-time and batch processing
     • Lazy operations – the job is optimized before execution

  18. Why Spark?
     • 100x faster than Hadoop
     • In-memory computation
     • Language support for Scala, Java, Python, and R
     • Supports real-time and batch processing
     • Lazy operations – the job is optimized before execution
     • Support for multiple transformations and actions
     [Diagram: RDD1 → map() (Transformation 1) → RDD2 → filter() (Transformation 2) → RDD3 → collect() (Action) → Result]
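The lazy transformation/action split can be sketched with a tiny RDD-like class in plain Python (a conceptual toy, not Spark's actual API): calls like map() and filter() only record what should happen, and nothing executes until an action such as collect() is invoked.

```python
# Minimal sketch of lazy transformations vs. eager actions
# (toy stand-in for Spark's RDD; not the real API).

class MiniRDD:
    def __init__(self, data, ops=None):
        self.data = data          # source data
        self.ops = ops or []      # recorded, not-yet-run transformations

    def map(self, f):             # transformation: just records the step
        return MiniRDD(self.data, self.ops + [("map", f)])

    def filter(self, pred):       # transformation: just records the step
        return MiniRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):            # action: only now does the pipeline run
        result = list(self.data)
        for kind, f in self.ops:
            if kind == "map":
                result = [f(x) for x in result]
            else:
                result = [x for x in result if f(x)]
        return result

rdd1 = MiniRDD([1, 2, 3, 4, 5])
rdd2 = rdd1.map(lambda x: x * 10)      # Transformation 1: nothing runs yet
rdd3 = rdd2.filter(lambda x: x > 20)   # Transformation 2: still nothing
print(rdd3.collect())                  # Action: [30, 40, 50]
```

Because the full pipeline is known before anything runs, an engine like Spark can optimize the whole job (e.g. fuse steps, skip unneeded work) before execution.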

  19. Why Spark?
     • 100x faster than Hadoop
     • In-memory computation
     • Language support for Scala, Java, Python, and R
     • Supports real-time and batch processing
     • Lazy operations – the job is optimized before execution
     • Support for multiple transformations and actions
     • Compatible with Hadoop; can process existing Hadoop data

  20. Spark Architecture

  21. Spark Nodes
     • Master Node (Master)
     • Slave Nodes (Workers)

  22. Basic Spark Architecture
     [Diagram: each unit of work is split into sub-work units distributed across the cluster's worker nodes]

  23. Resilient Distributed Dataset (RDD)
     • An RDD is a simple and immutable collection of objects (Obj1, Obj2, Obj3, …, Obj n)

  24. Resilient Distributed Dataset (RDD)
     • An RDD is a simple and immutable collection of objects
     • An RDD can contain any type of Scala, Java, Python, or R objects

  25. Resilient Distributed Dataset (RDD)
     • An RDD is a simple and immutable collection of objects
     • An RDD can contain any type of Scala, Java, Python, or R objects
     • Each RDD is split into partitions, which may be computed on different nodes of the cluster
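Partitioning can be pictured as splitting the collection into chunks, each of which can be processed independently (in Spark's case, on different nodes in parallel). A plain-Python sketch, with the partition count chosen arbitrarily for illustration:

```python
# Toy illustration of RDD partitioning: split a dataset into chunks
# that could each be computed on a different node, then combine results.

def partition(data, num_partitions):
    """Split data into num_partitions roughly equal chunks."""
    size = -(-len(data) // num_partitions)   # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(10))
parts = partition(data, 3)
print(parts)            # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Each partition is computed independently ("per node"), then combined:
partial_sums = [sum(p) for p in parts]
print(sum(partial_sums))   # 45, same as sum(data)
```

The combine step mirrors how Spark merges per-partition results when an action runs over a partitioned RDD.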

  26. Resilient Distributed Dataset (RDD)
     [Diagram: Employee-data.txt stored as blocks (B1–B12) across a Hadoop cluster; creating an RDD maps its partitions (Partition-1 … Partition-5) to those blocks]

  27. RDD Operations
     • Transformations
     • Actions
     • Persistence

  28. RDD Operations – Transformation
     • A set of operations that define how an RDD should be transformed
     • Creates a new RDD from an existing one to process the data
     • Lazy evaluation: computation doesn't start until an action is invoked
     • E.g. map, flatMap, filter, union, groupBy, etc.
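The difference between these transformations is easy to show with plain-Python equivalents: map produces exactly one output per input, flatMap produces zero or more outputs per input (flattened), and filter keeps a subset. This is a conceptual sketch, not Spark code, and the sample lines are made up for illustration:

```python
# Plain-Python equivalents of common RDD transformations
# (conceptual sketch; real Spark runs these lazily and distributed).

lines = ["to be", "or not", "to be"]

# map: exactly one output element per input element
lengths = [len(line) for line in lines]
print(lengths)          # [5, 6, 5]

# flatMap: zero or more output elements per input, flattened into one list
words = [w for line in lines for w in line.split()]
print(words)            # ['to', 'be', 'or', 'not', 'to', 'be']

# filter: keeps only the elements matching a predicate
short = [w for w in words if len(w) == 2]
print(short)            # ['to', 'be', 'or', 'to', 'be']
```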

  29. RDD Operations – Action
     • Triggers job execution
     • Returns the result or writes it to storage
     • E.g. count, collect, reduce, take, etc.
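Each listed action corresponds to a familiar eager operation; a plain-Python sketch of what each one returns (conceptual only; in Spark these trigger a distributed job over the RDD's partitions, and the sample data is made up):

```python
# Plain-Python equivalents of common RDD actions
# (conceptual sketch; in Spark these trigger distributed job execution).

from functools import reduce

data = [4, 8, 15, 16, 23, 42]

print(len(data))                          # count   -> 6
print(list(data))                         # collect -> [4, 8, 15, 16, 23, 42]
print(reduce(lambda a, b: a + b, data))   # reduce  -> 108
print(data[:3])                           # take(3) -> [4, 8, 15]
```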

  30. RDD Operations – Persistence
     • Spark allows caching/persisting an entire dataset in memory
     • Caches the RDD in memory for future operations
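Why persistence matters: without it, every action re-runs the whole chain of transformations; with caching, the computed dataset is kept in memory and reused. A toy sketch counting recomputations (plain Python; the counter and function names are illustrative, not Spark's API):

```python
# Toy illustration of persistence: without caching, every action
# recomputes the pipeline; with caching, it is computed only once.

recomputations = 0

def expensive_pipeline():
    """Stand-in for a chain of transformations over a large dataset."""
    global recomputations
    recomputations += 1
    return [x * x for x in range(5)]

# Without caching: each action re-runs the whole computation.
count = len(expensive_pipeline())
total = sum(expensive_pipeline())
runs_uncached = recomputations
print(runs_uncached)    # 2 -> pipeline ran twice

# With "caching": compute once, reuse the in-memory result.
recomputations = 0
cached = expensive_pipeline()   # like rdd.cache() followed by one action
count = len(cached)
total = sum(cached)
runs_cached = recomputations
print(runs_cached)      # 1 -> pipeline ran only once
```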

  31. RDD Operations
     • Transformations (map(), flatMap(), …) create a new RDD from a parent RDD based on custom business logic; the chain of parent RDDs forms the lineage
     • Actions (saveAsTextFile(), count(), …) return output to the driver or export data to a storage system after computation

  32. Features of Spark
     • Speed – 100x faster than Hadoop
     • Duplicate elimination – processes every record exactly once
     • Memory management – automatic memory management
     • Diverse processing platform
     • Fault tolerance – recovers automatically
     • Processing window criteria – time-based window criteria

  33. Thank You
     DataFlair: /DataFlairWS, /c/DataFlairWS
