Learn how Spark overcomes MapReduce's limitations by providing fast, in-memory processing in a unified platform for scalable, parallel data processing. Discover key features, framework benefits, and essential entities in Spark architecture.
About Hadoop Hadoop was one of the first popular open-source big data technologies. It is a scalable, fault-tolerant system for processing large datasets across a cluster of commodity servers. Its core components are HDFS for storage and YARN with MapReduce for processing.
What is HDFS HDFS is a file system that stores data reliably. It consists of two types of nodes, the NameNode and DataNodes, which store metadata and the actual data respectively. HDFS is a block-structured file system: just like Linux file systems, HDFS splits a file into fixed-size blocks, also known as partitions or splits. The default block size is 128 MB.
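As an illustration, here is a minimal spark-shell sketch (the HDFS path is hypothetical) showing how the block count surfaces as Spark's partition count:

```scala
// Paste into spark-shell, where `spark` (a SparkSession) is predefined.
// The HDFS path below is hypothetical.
val lines = spark.sparkContext.textFile("hdfs:///data/events.log")

// Spark creates roughly one input partition per HDFS block, so a
// 1 GB file stored as 128 MB blocks yields about 8 partitions.
println(s"partitions = ${lines.getNumPartitions}")
```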
YARN YARN is a distributed operating system, also called a cluster manager, for processing huge amounts of data quickly and in parallel. It can run different types of workloads at the same time, such as batch, streaming, and iterative jobs and more. It is a unified stack.
What is MapReduce? MapReduce is the processing engine in Hadoop. It can process only batch data, meaning bounded datasets. Internally it works disk to disk, writing intermediate results to disk between stages, so it is very slow. Everything must be optimized manually, though it allows ecosystem tools such as Hive, Pig, and more to process the data.
HDFS is No. 1 for storing data in parallel • There is no competitor for storing data reliably and scalably at low cost. • The problem is processing that data quickly. • How do we overcome this and process quickly? • The problem with MapReduce is that it is very slow. • How do we resolve it?
Problem - Solution Problem: disk-to-disk processing is very slow, so MapReduce jobs take a lot of time, and moving data from framework to framework creates new processing problems. Solution: in-memory processing keeps everything in RAM, so processing is very fast.
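A minimal spark-shell sketch of the idea, assuming a hypothetical ratings file: cache the data once, then reuse it from RAM across passes instead of re-reading disk each time:

```scala
// spark-shell sketch: cache() keeps data in executor RAM after the
// first action, so repeated passes skip the disk reads that
// MapReduce would repeat. Path and format are hypothetical.
val ratings = spark.sparkContext
  .textFile("hdfs:///data/ratings.csv")
  .map(_.split(","))
  .map(fields => (fields(0), fields(1).toDouble))
  .cache()

println(ratings.count())                        // pass 1: reads HDFS, fills the cache
println(ratings.values.sum() / ratings.count()) // later passes: served from memory
```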
Why switch to Spark? The key features of Spark include the following: • Easy to use (programmer friendly; see the word-count sketch below) • Fast (in-memory) • General-purpose • Scalable: processes data in parallel • Optimized • Fault tolerant • Unified platform
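For the first point, here is a word-count sketch in spark-shell (the input path is hypothetical); the equivalent classic MapReduce job needs dozens of lines of Java plus job setup:

```scala
// spark-shell sketch: word count in a handful of lines.
val counts = spark.sparkContext.textFile("hdfs:///data/books.txt")
  .flatMap(_.split("\\s+"))   // split lines into words
  .map(word => (word, 1))     // pair each word with a count of 1
  .reduceByKey(_ + _)         // sum counts per word

counts.take(10).foreach(println)
```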
Different types of data • Batch processing -- Hadoop • Streaming -- Storm • Iterative -- MLlib or GraphX • Interactive -- SQL/BI Spark handles all of these on one engine, as the sketch below shows.
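A sketch of the unified alternative, assuming a hypothetical JSON file and column names: one SparkSession serves both batch reads and interactive SQL, with Structured Streaming, MLlib, and GraphX available on the same session:

```scala
// spark-shell sketch: one SparkSession covers batch and interactive SQL.
val sales = spark.read.json("hdfs:///data/sales.json")  // batch read
sales.createOrReplaceTempView("sales")

// Interactive / BI-style query over the same data, no separate engine.
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```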
Key entities 1) driver program, 2) cluster manager, 3) worker nodes, 4) executors, 5) tasks
What is the Driver Program? The Spark driver is the program that declares/defines the transformations and actions on RDDs of data and submits those requests to the master. The node where the driver program runs is called the driver node; it may be either inside or outside the cluster.
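A minimal spark-shell sketch of this division of labor: transformations are only recorded by the driver, and the action is what triggers a job submission:

```scala
// spark-shell sketch: the driver only records this chain of
// transformations; nothing executes on the cluster yet (they are lazy).
val nums   = spark.sparkContext.parallelize(1 to 1000000)
val evens  = nums.filter(_ % 2 == 0)  // transformation: recorded, not run
val scaled = evens.map(_ * 10)        // transformation: recorded, not run

// An action makes the driver submit the job to the master/cluster manager.
println(scaled.count())
```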
Cluster manager (YARN) It is a distributed OS. It schedules tasks and allocates resources in the cluster, assigning RAM and CPUs to executors based on node manager requests.
Worker nodes / node manager In Hadoop terminology, a worker node is also called a node manager. It manages the executors; if an executor exceeds its resource limits, the node manager kills it (see the configuration sketch below).
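A configuration sketch, with illustrative values only, of how a Spark application asks YARN for executor resources; the same properties map to the spark-submit flags --num-executors, --executor-memory, and --executor-cores:

```scala
import org.apache.spark.sql.SparkSession

// An executor that exceeds memory + memoryOverhead is killed by the
// NodeManager, which is why the overhead setting matters on YARN.
val spark = SparkSession.builder()
  .appName("resource-demo")
  .master("yarn")                                  // normally set via spark-submit
  .config("spark.executor.instances", "4")         // executors YARN launches
  .config("spark.executor.memory", "4g")           // heap per executor
  .config("spark.executor.cores", "2")             // task slots per executor
  .config("spark.executor.memoryOverhead", "512m") // off-heap headroom per container
  .getOrCreate()
```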
Tasks • A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs some computation to either return a result to the driver program or write output to S3/HDFS. • Spark creates one task per data partition. An executor runs one or more tasks concurrently. The amount of parallelism is determined by the number of partitions: more partitions mean more tasks processing data in parallel, as the sketch below shows.
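A spark-shell sketch of the partition/task relationship, using a hypothetical log file:

```scala
// spark-shell sketch: one task per partition, so the partition count
// sets the parallelism of each stage.
val logs = spark.sparkContext.textFile("hdfs:///data/access.log")
println(logs.getNumPartitions)   // tasks Spark will launch per stage

// More partitions -> more tasks running in parallel, bounded by the
// total number of executor cores in the cluster.
val wider = logs.repartition(16)
println(wider.getNumPartitions)  // now 16 tasks per stage
```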