Apache Hadoop Daniel Lust, Anthony Taliercio
What is Apache Hadoop? • Allows applications to utilize thousands of nodes and process thousands of terabytes of data to complete a task • Supports distributed applications under a free license • Used by many popular companies • Such as: Facebook, Twitter, eBay, IBM, Apple, Microsoft, Hewlett-Packard, and many others…
Continued… • Written in Java • Scales well • Can run on thousands of nodes • Can also run on just a few nodes of inexpensive hardware • A typical Hadoop cluster consists of two major parts • A single master node and multiple worker nodes • The master node runs four processes: the JobTracker, TaskTracker, NameNode, and DataNode • A worker node, also known as a slave node, acts as both a DataNode and a TaskTracker, or as just one of the two
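As a rough illustration (not part of the original slides), the sketch below shows how a Java client might be pointed at the NameNode running on the master node; the hostname master and port 9000 are assumed placeholders, not values from our cluster:

```java
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: point a Hadoop client at the cluster's master node.
// "master:9000" is a placeholder hostname/port.
public class ClusterConfigSketch {
    public static Configuration clusterConfig() {
        Configuration conf = new Configuration();
        // fs.default.name (called fs.defaultFS in newer releases) tells
        // clients where the NameNode on the master node lives.
        conf.set("fs.default.name", "hdfs://master:9000");
        return conf;
    }
}
```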
Overview of Hadoop • Hadoop uses what's called HDFS • The Hadoop Distributed File System • HDFS splits files into blocks and stores them redundantly across the nodes of a cluster • The redundancy helps eliminate possible data loss
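To make this concrete, here is a minimal, hypothetical sketch of copying a local file into HDFS through Hadoop's Java FileSystem API; the file paths are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: put one file into HDFS. Paths are illustrative.
public class HdfsCopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        // HDFS splits the file into blocks and replicates each block across
        // DataNodes; that replication is where the redundancy comes from.
        fs.copyFromLocalFile(new Path("/tmp/book.txt"),
                             new Path("/books/book.txt"));
        fs.close();
    }
}
```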
MapReduce • MapReduce • A programming model introduced by Google to process massive amounts of unstructured data in parallel across a distributed cluster of processors
MapReduce • Offers a clean abstraction for data analysis tasks, organizing jobs over the data stored in HDFS so that no job is unnecessarily repeated • If a node fails, its task may be handed to a different node to complete
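The word count used later in this presentation is the classic MapReduce example. The sketch below is a generic version of its map and reduce steps (not our exact code), written against Hadoop's org.apache.hadoop.mapreduce API:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the classic word-count map and reduce steps.
public class WordCountSketch {

    // Map: emit (word, 1) for every word in a line of input text.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```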
Running Hadoop • Hadoop is first started on the master computer • Various processes are started, including: • TaskTracker • JobTracker • DataNode • SecondaryNameNode • NameNode • The master also connects through SSH to the slave computers to start a DataNode and TaskTracker on each of them
Running Hadoop • We used Hadoop to do a word count on six different books • HDFS copied the books to the different nodes of the cluster, and a pre-written program ran the word count on them • Each node returned data, using the DataNode process to save its results • When a node failed, its job was issued to another node
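A small driver program ties the pieces together and submits the job to the cluster. The sketch below reuses the mapper and reducer classes from the WordCountSketch example above; the HDFS input and output paths are placeholders, not the ones used in our project:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver that configures and submits the word-count job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // On older (JobTracker-era) releases this would be: new Job(conf, "word count")
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountSketch.TokenMapper.class);
        job.setReducerClass(WordCountSketch.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/books"));         // input: the books in HDFS
        FileOutputFormat.setOutputPath(job, new Path("/books-counts")); // output: per-word counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```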
Tested on 1-3 Nodes • 1 node: job completion in 00:01:45 • 2 nodes: job completion in 00:01:28 • 3 nodes: job completion in 00:01:00
Conclusion • Our guide covered everything you need to get started with Apache Hadoop • There are, however, many problems you may run into along the way • Troubleshooting was a large part of our project