This Hadoop tutorial on Hadoop Interview Questions and Answers (Hadoop Interview blog series: https://goo.gl/ndqlss) will help you prepare for Big Data and Hadoop interviews. Learn the most important Hadoop interview questions and answers and know what will set you apart in the interview process. Below are the topics covered in this Hadoop Interview Questions and Answers tutorial:

Hadoop Interview Questions on:
1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop

Check our complete Hadoop playlist here: https://goo.gl/4OyoTW

#HadoopInterviewQuestions #BigDataInterviewQuestions #HadoopInterview
Big Data & Hadoop Market
According to Forrester, the Hadoop market will grow at a rate of 13% for the next 5 years, which is more than twice the predicted growth of general IT. U.S. and international operations (29%) and enterprises (27%) lead the adoption of Big Data globally. Asia Pacific is expected to be the fastest growing Hadoop market, with a CAGR of 59.2%. Companies are focusing on improving customer relationships (55%) and making the business more data-focused (53%).
(Chart: Hadoop market growth, 2013–2016, CAGR of 58.2%.)
Hadoop Job Trends
Agenda for Today
Hadoop Interview Questions on:
1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop
Big Data & Hadoop Interview Questions
“The harder I practice, the luckier I get.” – Gary Player
Big Data & Hadoop
Q. What are the five V’s associated with Big Data?
The five V’s of Big Data are Volume, Velocity, Variety, Veracity and Value.
Big Data & Hadoop
Q. Differentiate between structured, semi-structured and unstructured data.
Structured: organized data format; the data schema is fixed. Example: RDBMS data, etc.
Semi-structured: partially organized data; lacks the formal structure of a data model. Example: XML & JSON files, etc.
Unstructured: unorganized data with an unknown schema. Example: multimedia files, etc.
Big Data & Hadoop
Q. How does Hadoop differ from a traditional processing system using an RDBMS?
- An RDBMS relies on structured data and the schema of the data is always known. Any kind of data can be stored in Hadoop, be it structured, semi-structured or unstructured.
- An RDBMS provides limited or no processing capabilities, whereas Hadoop allows us to process the data in a distributed, parallel fashion.
- An RDBMS is based on a ‘schema on write’ policy, where schema validation is done before loading the data. Hadoop, on the contrary, follows a ‘schema on read’ policy.
- In an RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.
- An RDBMS is suitable for OLTP (Online Transaction Processing); Hadoop is suitable for OLAP (Online Analytical Processing).
- An RDBMS is licensed software; Hadoop is an open-source framework.
Big Data & Hadoop
Q. Explain the components of Hadoop and their services.
Hadoop has two main layers: HDFS for storage, whose services are the NameNode, Secondary NameNode and DataNodes, and YARN for resource management and processing, whose services are the ResourceManager and NodeManagers.
Big Data & Hadoop
Q. What are the main Hadoop configuration files?
hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, masters, slaves
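A quick way to see these files in action is Hadoop's Configuration class: a plain Configuration object loads core-site.xml from the classpath (HADOOP_CONF_DIR), while the HDFS, YARN and MapReduce clients layer in the remaining *-site.xml files. A minimal sketch, with an illustrative value:

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml found on the classpath
        Configuration conf = new Configuration();
        // fs.defaultFS is defined in core-site.xml, e.g. hdfs://namenode:9000 (illustrative value)
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}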
HDFS Interview Questions
“A person who never made a mistake never tried anything new.” – Albert Einstein
HDFS
Q. HDFS stores data on commodity hardware, which has a higher chance of failure. So how does HDFS ensure the fault tolerance of the system?
HDFS replicates each block and stores the replicas on different DataNodes. The default replication factor is set to 3, so the data remains available even if a DataNode fails.
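The replication factor can also be inspected or changed per file through the FileSystem API. A minimal sketch, assuming a running HDFS cluster and a hypothetical file /data/sales.csv:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sales.csv");   // hypothetical path

        // Each block of the file is stored on this many different DataNodes (usually 3)
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);

        // Ask HDFS to keep 2 replicas of every block of this file instead
        fs.setReplication(file, (short) 2);
    }
}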
HDFS
Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this problem.
Problem: too many small files means too many blocks, and too many blocks means too much metadata. Managing this huge amount of metadata on the NameNode is difficult, and it increases the cost of seeks.
Solution: Hadoop Archive. It clubs small HDFS files into a single .har archive:
> hadoop archive -archiveName edureka_archive.har /input/location /output/location
HDFS
Q. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. How many blocks will be created in total, and what will be the size of each block?
Default block size = 128 MB
514 MB / 128 MB ≈ 4.02, so 5 blocks are created: four blocks of 128 MB each and one block of 2 MB.
Replication factor = 3
Total blocks = 5 * 3 = 15
Total size = 514 MB * 3 = 1542 MB
HDFS
Q. How do you copy a file into HDFS with a block size different from the existing block size configuration?
Pass the block size on the command line. For example, to copy a local file with a 32 MB block size (33554432 bytes) instead of the default 128 MB:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /local/test.txt /sample_hdfs
Check the block size of test.txt:
hadoop fs -stat %o /sample_hdfs/test.txt
HDFS
Q. What is a block scanner in HDFS?
Note: this question is generally asked for Hadoop Admin positions.
The block scanner maintains the integrity of the data blocks. It runs periodically on every DataNode to verify whether the stored data blocks are correct or not. When a corrupted block is found:
1. The DataNode reports the corrupted block to the NameNode.
2. The NameNode schedules the creation of new replicas using the good replicas.
3. Once the replication factor (count of uncorrupted replicas) reaches the required level, the corrupted block is deleted.
HDFS
Q. Can multiple clients write into an HDFS file concurrently?
No. HDFS follows a single-writer, multiple-reader model. The client which opens a file for writing is granted a lease by the NameNode, and the NameNode rejects write requests from other clients for a file that is currently being written by someone else.
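A small sketch of the lease behaviour, assuming a running HDFS cluster and a hypothetical existing file /data/log.txt: while one client holds the write lease, an append attempt from a second client is rejected by the NameNode with an IOException.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem client1 = FileSystem.get(conf);
        FileSystem client2 = FileSystem.newInstance(conf); // separate client instance, separate lease holder
        Path file = new Path("/data/log.txt");             // hypothetical path

        FSDataOutputStream writer1 = client1.append(file); // NameNode grants the write lease to client1
        try {
            client2.append(file);                           // concurrent writer on the same file
        } catch (IOException e) {
            // Rejected while client1 still holds the lease
            System.err.println("Concurrent write rejected: " + e.getMessage());
        }
        writer1.close();
    }
}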
HDFS
Q. What do you mean by High Availability of a NameNode? How is it achieved?
The NameNode used to be a single point of failure in Hadoop 1.x. High Availability refers to the condition where a NameNode remains available to the cluster at all times. The HDFS HA architecture in Hadoop 2.x allows us to have two NameNodes in an Active/Passive configuration.
MapReduce Interview Questions
“Never tell me the sky’s the limit when there are footprints on the moon.” – Author Unknown
MapReduce
Q. Explain the process of spilling in MapReduce.
The output of a map task is written into a circular memory buffer (RAM) on the NodeManager. The default buffer size is 100 MB, as specified by mapreduce.task.io.sort.mb. Spilling is the process of copying the data from the memory buffer to the local disk once a certain threshold is reached. The default spill threshold is 0.80 (80% of the buffer), as specified by mapreduce.map.sort.spill.percent.
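Both properties mentioned above can be tuned per job from the driver. A minimal sketch, with illustrative values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // sort buffer of 256 MB instead of the 100 MB default
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // spill at 90% full instead of 80%
        Job job = Job.getInstance(conf, "spill-tuning-example");
        // ... set mapper, reducer and input/output paths as usual
    }
}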
MapReduce
Q. What is the difference between blocks, input splits and records?
Blocks (physical division): data in HDFS is physically stored as blocks.
Input splits (logical division): logical chunks of data to be processed by an individual mapper.
Records: each input split is made up of records; e.g. in a text file, each line is a record.
MapReduce
Q. What is the role of the RecordReader in Hadoop MapReduce?
The RecordReader converts the data present in a file into (key, value) pairs suitable for reading by the Mapper task. The RecordReader instance to use is defined by the InputFormat. For a text file, for instance, the keys are byte offsets (0, 57, 122, 171, …) and the values are the lines themselves (1 David, 2 Cassie, 3 Remo, 4 Ramesh, …).
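With the default TextInputFormat, the underlying LineRecordReader hands the mapper exactly such pairs: the key is the byte offset of the line and the value is the line itself. A minimal mapper sketch illustrating those types (the word-count-style output is just an example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // offset = byte offset of the record in the split, line = the record produced by the RecordReader
        context.write(new Text(line.toString()), new IntWritable(1));
    }
}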
MapReduce
Q. What is the significance of counters in MapReduce?
Counters are used for gathering statistics about the job, either for quality control or for application-level statistics. For a large distributed job it is easier to retrieve counters than to dig through log messages. For example, a counter can count the number of invalid records encountered while processing the input.
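A minimal sketch of such an application-level counter (the enum name and the record format are hypothetical): the mapper bumps the counter for every malformed record, and the framework aggregates it across all tasks.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordValidatorMapper extends Mapper<LongWritable, Text, Text, Text> {
    public enum RecordQuality { INVALID_RECORDS }   // hypothetical counter

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length != 2) {
            context.getCounter(RecordQuality.INVALID_RECORDS).increment(1); // count the bad record
            return;                                                          // and skip it
        }
        context.write(new Text(fields[0]), new Text(fields[1]));
    }
}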
MapReduce
Q. Why is the output of the map tasks stored (spilled) on the local disk and not in HDFS?
The outputs of a map task are intermediate key-value pairs which are then processed by the reducer. This intermediate output is not required once the job completes, so storing it in HDFS and replicating it would create unnecessary overhead. It is therefore spilled to the local disk of the NodeManager instead, and only the final reducer output is written to HDFS.
MapReduce
Q. Define speculative execution.
If a task is detected to be running slower than expected, an equivalent duplicate task is launched on another NodeManager so as to maintain the critical path of the job. The scheduler tracks the progress of all tasks (map and reduce) and launches speculative duplicates for the slower ones. Once a task completes, all of its running duplicates are killed.
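Speculative execution can be switched on or off per job. A minimal driver sketch using the standard Hadoop 2.x property names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);     // allow duplicates of slow map tasks
        conf.setBoolean("mapreduce.reduce.speculative", false); // but not of reduce tasks
        Job job = Job.getInstance(conf, "speculative-execution-example");
        // ... set mapper, reducer and input/output paths as usual
    }
}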
MapReduce
Q. How will you prevent a file from being split, in case you want the whole file to be processed by the same mapper?
Method 1: In the driver, increase the minimum split size so that it is larger than the largest input file:
i. conf.set("mapreduce.input.fileinputformat.split.minsize", "size_larger_than_file_size"); // Hadoop 2.x name of the old mapred.min.split.size property
ii. Input split computation formula: max(minimumSize, min(maximumSize, blockSize))
Method 2: Modify the InputFormat class that you want to use:
i. Subclass the concrete subclass of FileInputFormat and override the isSplitable() method to return false, as shown below:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
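The custom format is then wired into the job in the driver (a one-line sketch, assuming a Job object named job):

job.setInputFormatClass(NonSplittableTextInputFormat.class);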
MapReduce
Q. Is it legal to set the number of reduce tasks to zero? Where will the output be stored in this case?
Yes, it is legal to set the number of reduce tasks to zero. It is done when there is no need for a reducer, for example when the input only needs to be transformed into a particular format, or in a map-side join. In that case the map output is stored directly in HDFS, at the output path specified by the client.
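A minimal map-only driver sketch (the paths and job name are hypothetical): with zero reducers, whatever the mapper emits is written straight to the output directory in HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-example");
        job.setJarByClass(MapOnlyJob.class);
        job.setNumReduceTasks(0);                                       // legal: no reduce phase at all
        FileInputFormat.addInputPath(job, new Path("/input/data"));     // hypothetical input directory
        FileOutputFormat.setOutputPath(job, new Path("/output/data"));  // map output lands here in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}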
MapReduce
Q. What is the role of the ApplicationMaster in a MapReduce job?
The ApplicationMaster (AM) is launched by the ResourceManager for each job and acts as a helper process that manages that job. It initializes the job and keeps track of its progress, retrieves the input splits computed by the client, negotiates with the ResourceManager for the resources needed to run the job, and creates a map task object for each split. (Flow: the client submits the job, the ResourceManager launches the AM, the AM asks for resources, the tasks run on NodeManagers, and the AM reports status and finally unregisters.)
MapReduce
Q. What do you mean by a MapReduce task running in uber mode?