What Is Hadoop | Hadoop Tutorial For Beginners | Edureka

Agenda
1. 5 V’s of Big Data
2. Problems with Big Data
3. Hadoop as a solution
4. What is Hadoop?
5. HDFS
6. YARN
7. MapReduce
8. Hadoop Ecosystem
EDUREKA HADOOP CERTIFICATION TRAINING www.edureka.co/big-data-and-hadoop

5 V’s of Big Data

5 V’s of Big Data
• Volume: the amount of data being generated is growing at an accelerating pace
• Variety: different kinds of data are being generated from various sources
• Velocity: data is being generated at an alarming rate
• Veracity: uncertainty and inconsistencies in the data
• Value: mechanism to bring the correct meaning out of the data

Problems with Big Data Processing

Problems with Big Data
• Storing huge and exponentially growing datasets (storage must be highly scalable)
• Processing data that has a complex structure (structured, unstructured, semi-structured)
• Bringing a huge amount of data to the computation unit becomes a bottleneck

So, for the Big Data problem statement, Hadoop emerged as a solution. What is Hadoop?

Hadoop
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
• HDFS (Storage): allows dumping any kind of data across the cluster
• MapReduce (Processing): allows parallel processing of the data stored in HDFS

Hadoop
1. Storing exponentially growing huge datasets: HDFS, the storage unit of Hadoop, is a Distributed File System
2. Storing unstructured data: HDFS allows storing any kind of data, be it structured, semi-structured or unstructured
3. Processing data faster: Hadoop provides parallel processing of the data present in HDFS and allows processing data locally, i.e. each node works on the part of the data that is stored on it

Hadoop Distributed File System (HDFS)

HDFS
▪ Storage unit of Hadoop
▪ Distributed File System
▪ Divides files (input data) into smaller chunks and stores them across the cluster
▪ Horizontal scaling as per requirement
▪ Stores any kind of data
▪ No schema validation is done while dumping data
Master node: NameNode
Slave nodes: DataNodes

HDFS Block
• HDFS stores data in the form of blocks
• Block size can be configured based on requirements
Note: The default block size is 128 MB
[Figure: file.xml being split into 128 MB HDFS blocks while moving to the HDFS cluster]
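
To make the block idea concrete, here is a minimal sketch of how a file's size maps onto blocks. It is only arithmetic, not HDFS code: the function name split_into_blocks and the 500 MB example file are assumptions chosen for illustration, and the 128 MB default is the value stated on the slide.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 128 * MB  # default block size stated on the slide


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be divided into.

    Hypothetical helper for illustration; it does not talk to a real cluster.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


# Example: a hypothetical 500 MB file with the default block size.
sizes_in_mb = [size // MB for size in split_into_blocks(500 * MB)]
print(sizes_in_mb)  # [128, 128, 128, 116]
```

A file that is not an exact multiple of the block size ends with one smaller block, which is why the last entry in the example is 116 MB rather than 128 MB.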

NameNode
• Master daemon
• Maintains and manages the DataNodes
• Records metadata, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
• Receives heartbeats and block reports from all the DataNodes
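
The NameNode's bookkeeping can be pictured with a small in-memory model. This is a toy sketch, not the real NameNode: the class ToyNameNode, its method names, the DataNode ids and the 30-second timeout are all assumptions made up for the example.

```python
class ToyNameNode:
    """Hypothetical in-memory model of heartbeat and block-report tracking."""

    def __init__(self, heartbeat_timeout):
        # Seconds without a heartbeat before a DataNode is treated as dead
        # (the value is supplied by the caller; it is an assumption here).
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {}   # DataNode id -> time of last heartbeat
        self.block_locations = {}  # block id -> set of DataNode ids

    def receive_heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def receive_block_report(self, datanode, block_ids):
        # A block report replaces what was previously known about this DataNode.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def live_datanodes(self, now):
        return sorted(
            dn for dn, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        )

    def locate(self, block_id, now):
        """Return the live DataNodes that hold the given block."""
        live = set(self.live_datanodes(now))
        return sorted(self.block_locations.get(block_id, set()) & live)


nn = ToyNameNode(heartbeat_timeout=30)
nn.receive_heartbeat("dn1", now=0)
nn.receive_heartbeat("dn2", now=0)
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
nn.receive_heartbeat("dn1", now=40)  # dn2 stays silent
print(nn.locate("blk_1", now=50))    # ['dn1']
```

Only metadata lives in this model; the block contents themselves would sit on the DataNodes.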

DataNode
▪ Slave daemons
▪ Store the actual data
▪ Serve read and write requests

Secondary NameNode
• Checkpointing is the process of combining the edit logs with the FsImage
• Allows faster recovery of the NameNode, as a recent backup of the metadata is available
• Checkpointing happens periodically (default: 1 hour)

Hadoop Distributed File System
[Figure: checkpointing. The Secondary NameNode first copies the fsImage and editLog from the NameNode. During the checkpoint, the NameNode writes to a temporary editLog (new) while the copied editLog and fsImage are combined into the FsImage (final).]
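
Checkpointing, as described above, amounts to replaying the edit log on top of the last FsImage. The sketch below illustrates that idea with plain Python dictionaries; the edit record format ("create", "delete", "rename") and the function names are assumptions for this example and do not reflect the real on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one hypothetical edit-log record to an in-memory namespace."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"size": args[0]}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[args[0]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown edit operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the FsImage.

    Returns the new FsImage and a fresh, empty edit log.
    """
    merged = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


fsimage = {"/a.txt": {"size": 10}}
edit_log = [
    ("create", "/b.txt", 20),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
print(new_fsimage)   # {'/c.txt': {'size': 10}}
print(new_edit_log)  # []
```

After a checkpoint, a restart only has to load the merged image instead of replaying every edit since the previous one, which is why recovery gets faster.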

YARN (Yet Another Resource Negotiator)

YARN
ResourceManager
• Receives the processing requests
• Passes the parts of the requests to the corresponding NodeManagers
NodeManagers
• Installed on every DataNode
• Responsible for the execution of tasks on every single DataNode

YARN Architecture
• ResourceManager has two components: Scheduler & ApplicationsManager
• NodeManager has two components: ApplicationMaster & Container
[Figure: Client, Resource Manager (with App Manager) and three Node Managers, each with an App Master and a container; arrows labeled Node Status, Resource Request and MapReduce Status]

YARN Architecture
ApplicationsManager
• Accepts the job submission
• Negotiates the first container for executing the application-specific ApplicationMaster and monitors its progress
ApplicationMaster
• ApplicationMasters are the daemons that reside on the DataNodes
• Communicates with containers for the execution of tasks on each DataNode
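
The division of labour described above can be sketched as a toy simulation. Nothing here is the YARN API: the classes ToyNodeManager and ToyResourceManager, the slot counts, and the "most free slots first" placement rule are assumptions invented for this example (real YARN schedulers use their own policies).

```python
class ToyNodeManager:
    """Hypothetical worker that can run a fixed number of containers."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.running = []  # labels of the containers launched on this node

    def free_slots(self):
        return self.capacity - len(self.running)

    def launch(self, label):
        if self.free_slots() <= 0:
            raise RuntimeError(f"{self.name} has no free container slot")
        self.running.append(label)


class ToyResourceManager:
    """Hypothetical master combining the scheduler and job-acceptance roles."""

    def __init__(self, node_managers):
        self.node_managers = list(node_managers)

    def allocate(self, label):
        # Scheduler role (assumed policy): pick the node with the most free slots.
        candidates = [nm for nm in self.node_managers if nm.free_slots() > 0]
        if not candidates:
            return None
        best = max(candidates, key=lambda nm: nm.free_slots())
        best.launch(label)
        return best.name

    def submit(self, app_id, num_tasks):
        # ApplicationsManager role: accept the job and start its ApplicationMaster
        # in a first container; task containers are then requested one by one.
        am_node = self.allocate(f"{app_id}/app-master")
        if am_node is None:
            raise RuntimeError("no capacity for the ApplicationMaster")
        placements = {"app-master": am_node}
        for i in range(num_tasks):
            placements[f"task-{i}"] = self.allocate(f"{app_id}/task-{i}")
        return placements


rm = ToyResourceManager([
    ToyNodeManager("nm1", capacity=2),
    ToyNodeManager("nm2", capacity=2),
    ToyNodeManager("nm3", capacity=1),
])
print(rm.submit("app1", num_tasks=3))
```

In this model a task whose container cannot be placed is recorded as None; a real cluster would keep the request pending until resources free up.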

Hadoop Architecture Bigger Picture

Hadoop Architecture
[Figure: HDFS and YARN side by side. Master daemons: NameNode and Secondary NameNode (HDFS), ResourceManager and JobHistory Server (YARN). Each worker node runs a DataNode and a NodeManager, which hosts App Masters and containers.]

MapReduce

MapReduce
MapReduce is a software framework that helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.

MapReduce Job Workflow
INPUT: IND ENG AUS NZ NZ ENG AUS IND AUS IND SL NZ
SPLITTING:
  Split 1: IND, ENG, AUS, NZ
  Split 2: NZ, ENG, AUS, IND
  Split 3: AUS, IND, SL, NZ
MAPPING:
  Split 1: (IND, 1) (ENG, 1) (AUS, 1) (NZ, 1)
  Split 2: (NZ, 1) (ENG, 1) (AUS, 1) (IND, 1)
  Split 3: (AUS, 1) (IND, 1) (SL, 1) (NZ, 1)
SHUFFLING: IND, (1,1,1); ENG, (1,1); AUS, (1,1,1); NZ, (1,1,1); SL, (1)
REDUCING: IND, 3; ENG, 2; AUS, 3; NZ, 3; SL, 1
FINAL RESULT: IND, 3; ENG, 2; AUS, 3; NZ, 3; SL, 1
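
The word-count workflow above can be reproduced on a single machine to see what each phase contributes. This is a plain Python sketch of the map, shuffle and reduce steps, not Hadoop's MapReduce API; the function names are assumptions chosen for the example, and the three input splits are taken from the slide.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Group the emitted counts by word."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(splits):
    # On a cluster each split would be mapped on a different node;
    # here the splits are simply processed one after another.
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle_phase(mapped))


splits = ["IND ENG AUS NZ", "NZ ENG AUS IND", "AUS IND SL NZ"]
print(word_count(splits))
# {'IND': 3, 'ENG': 2, 'AUS': 3, 'NZ': 3, 'SL': 1}
```

Because each split is mapped independently, the mapping step is the part that parallelises across nodes; only the shuffle needs to bring pairs with the same key together.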

Hadoop Ecosystem
