( Hadoop Training: https://www.edureka.co/hadoop )
This Edureka "What is Hadoop" tutorial ( Hadoop Blog series: https://goo.gl/LFesy8 ) explains how Big Data emerged as a problem and how Hadoop solved that problem. The tutorial discusses Hadoop architecture, HDFS and its architecture, YARN, and MapReduce in detail. Below are the topics covered in this tutorial:
1) 5 V's of Big Data
2) Problems with Big Data
3) Hadoop-as-a solution
4) What is Hadoop?
5) HDFS
6) YARN
7) MapReduce
8) Hadoop Ecosystem
Agenda
1. 5 V’s of Big Data
2. Problems with Big Data
3. Hadoop-as-a solution
4. What is Hadoop?
5. HDFS
6. YARN
7. MapReduce
8. Hadoop Ecosystem
EDUREKA HADOOP CERTIFICATION TRAINING www.edureka.co/big-data-and-hadoop
5 V’s of Big Data
• Volume: the amount of data being generated keeps growing at an accelerating pace
• Velocity: data is being generated at an alarming rate
• Variety: different kinds of data are being generated from various sources
• Veracity: uncertainty and inconsistencies in the data
• Value: a mechanism to bring the correct meaning out of the data
Problems with Big Data Processing
Problems with Big Data
• Storing huge and exponentially growing datasets
• Processing data that has a complex structure (structured, unstructured, semi-structured)
• Bringing a huge amount of data to the computation unit becomes a bottleneck
Hadoop emerged as a solution to this Big Data problem statement. What is Hadoop?
Hadoop
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
• HDFS (Storage): allows dumping any kind of data across the cluster
• MapReduce (Processing): allows parallel processing of the data stored in HDFS
Hadoop
1. Storing exponentially growing huge datasets: HDFS, the storage unit of Hadoop, is a Distributed File System
2. Storing unstructured data: HDFS allows storing any kind of data, be it structured, semi-structured, or unstructured
3. Processing data faster: Hadoop provides parallel processing of the data present in HDFS and allows processing data locally, i.e. each node works with the part of the data that is stored on it
Hadoop Distributed File System (HDFS)
HDFS
▪ Storage unit of Hadoop
▪ Distributed File System
▪ Divides files (input data) into smaller chunks and stores them across the cluster
▪ Horizontal scaling as per requirement
▪ Stores any kind of data
▪ No schema validation is done while dumping data
Master node: NameNode
Slave nodes: DataNodes
HDFS Block
• HDFS stores data in the form of blocks
• Block size can be configured based on requirements
[Figure: file.xml is split into 128 MB HDFS blocks when it is moved to the HDFS cluster]
Note: The default block size is 128 MB
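The block splitting described on this slide can be sketched as simple arithmetic. The following is only an illustration, not HDFS code: the function name `split_into_blocks` and the example file sizes are assumptions made for this sketch. It relies on the fact that the last block of a file holds only the remaining bytes and is not padded to the full block size.

```python
BLOCK_SIZE_MB = 128  # default block size stated on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would occupy.

    Hypothetical helper for illustration; HDFS does this internally.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The final block only stores what is left of the file.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(512))  # [128, 128, 128, 128]
print(split_into_blocks(380))  # [128, 128, 124]
```

A 512 MB file fills four full blocks, while a 380 MB file needs two full blocks and a smaller third one.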
NameNode
• Master daemon
• Maintains and manages the DataNodes
• Records metadata, e.g. location of the stored blocks, size of the files, permissions, hierarchy, etc.
• Receives heartbeats and block reports from all the DataNodes
DataNode
▪ Slave daemons
▪ Store the actual data
▪ Serve read and write requests
Secondary NameNode
• Checkpointing is the process of combining the edit logs with the FsImage
• Allows faster failover because a backup of the metadata is available
• Checkpointing happens periodically (default: 1 hour)
Hadoop Distributed File System
[Figure: checkpointing between the NameNode and the Secondary NameNode. Labels: editLog, fsImage (first-time copy), editLog (new), FsImage (final), temporary during checkpoint]
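The idea of checkpointing, combining the edit log with the FsImage, can be sketched with a toy model. This is an assumption-laden illustration and not how HDFS stores its metadata: here the FsImage is just a set of paths, the edit log is a list of hypothetical `("create", path)` and `("delete", path)` records, and `checkpoint` is a made-up helper name.

```python
def apply_edit(namespace, edit):
    """Apply one toy edit record to a set of file paths."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the FsImage.

    Returns the new (final) FsImage and a fresh, empty edit log.
    """
    new_image = set(fsimage)  # work on a copy, as a separate node would
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/data/a.txt"}
edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(sorted(fsimage))  # ['/data/b.txt']
print(edit_log)         # []
```

After the merge, the edit log starts out empty again, which is why a recent checkpoint keeps recovery short: there are fewer edits left to replay.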
YARN (Yet Another Resource Negotiator)
YARN
ResourceManager
• Receives the processing requests
• Passes parts of the requests to the corresponding NodeManagers
NodeManagers
• Installed on every DataNode
• Responsible for executing tasks on every single DataNode
YARN Architecture
• ResourceManager has two components: Scheduler and ApplicationsManager
• NodeManager has two components: ApplicationMaster and Container
[Figure: YARN architecture. Labels: Client, Resource Manager (App Manager), Node Managers each with an App Master and a container, Node Status, Resource Request, MapReduce Status]
YARN Architecture
ApplicationsManager
• Accepts the job submission
• Negotiates a container for executing the application-specific ApplicationMaster and monitors its progress
ApplicationMaster
• ApplicationMasters are the daemons that reside on the DataNodes
• Communicates with containers to execute tasks on each DataNode
Hadoop Architecture Bigger Picture
Hadoop Architecture
[Figure: HDFS (NameNode, Secondary NameNode, DataNodes) shown together with YARN (ResourceManager, JobHistory Server, NodeManagers); each DataNode is paired with a NodeManager that hosts an App Master and a container]
MapReduce
MapReduce
MapReduce is a software framework that helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
MapReduce Job Workflow
INPUT: IND ENG AUS NZ / NZ ENG AUS IND / AUS IND SL NZ
SPLITTING: (IND, ENG, AUS, NZ) | (NZ, ENG, AUS, IND) | (AUS, IND, SL, NZ)
MAPPING: (IND, 1) (ENG, 1) (AUS, 1) (NZ, 1) | (NZ, 1) (ENG, 1) (AUS, 1) (IND, 1) | (AUS, 1) (IND, 1) (SL, 1) (NZ, 1)
SHUFFLING: IND, (1,1,1) | ENG, (1,1) | AUS, (1,1,1) | NZ, (1,1,1) | SL, (1)
REDUCING: IND, 3 | ENG, 2 | AUS, 3 | NZ, 3 | SL, 1
FINAL RESULT: IND, 3; ENG, 2; AUS, 3; NZ, 3; SL, 1
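The workflow on this slide can be imitated in a few lines of plain Python. This is a single-process sketch of the map, shuffle, and reduce stages, not Hadoop's MapReduce API; the function names are made up for the illustration, and the three input splits are assumed to be the ones shown in the slide's example.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Group all emitted values by key, e.g. IND -> [1, 1, 1]."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    # In Hadoop each split would be mapped on a different node in parallel;
    # here the splits are simply processed one after another.
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle_phase(mapped))


splits = ["IND ENG AUS NZ", "NZ ENG AUS IND", "AUS IND SL NZ"]
print(word_count(splits))
# {'IND': 3, 'ENG': 2, 'AUS': 3, 'NZ': 3, 'SL': 1}
```

The printed counts match the final result on the slide.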
Hadoop Ecosystem