Hadoop Joshua Nester, Garrison Vaughan, Calvin Sauerbier, Jonathan Pingilley, and Adam Albertson
Overview – What is Hadoop? • Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce. • It provides a distributed filesystem (HDFS) that can store data across thousands of servers, and a means of running work (Map/Reduce jobs) across those machines, running the work near the data. • It runs on Java 1.6.x or higher, with support for both Linux and Windows.
Overview – What does it do? • Hadoop provides a way to solve problems using big data. • It can be used to interact with and analyze data that doesn't fit neatly in a database (e.g., financial portfolios, targeted product advertising).
Overview – Brief History • Hadoop was built by Doug Cutting and Michael J. Cafarella. The name Hadoop came from Doug's son's stuffed toy elephant, which is also where Hadoop got its logo. • Hadoop was originally built as an infrastructure for the Nutch project, which crawls the web and builds a search-engine index for the crawled pages.
Overview – How does it work? • MapReduce expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. • A Map/Reduce computation has two phases: a map phase and a reduce phase.
Overview – How does it work? • Map • The framework splits the input data set into a large number of fragments and assigns each fragment to a map task. • Each map task consumes key/value pairs from its assigned fragment and produces a set of intermediate key/value pairs. • Following the map phase, the framework sorts the intermediate data set by key and produces a set of tuples so that all the values associated with a particular key appear together. • The set of tuples is partitioned into a number of fragments equal to the number of reduce tasks that will be performed.
Overview – How does it work? • Reduce • Each reduce task consumes the fragment of tuples assigned to it and transforms each tuple into an output key/value pair as described by the user-defined reduce function. • Once again, the framework distributes the many reduce tasks across the cluster of nodes and handles shipping the appropriate fragment of intermediate data to each reduce task. • Tasks in each phase are executed in a fault-tolerant manner: if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes, and any data that can be recovered is also redistributed. • Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead. (A minimal word-count sketch of the two phases follows.)
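To make the two phases concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API. It mirrors the canonical Hadoop example; the class names are our own and not taken from the slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: each map task consumes (offset, line) pairs from its input
  // fragment and emits intermediate (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups intermediate pairs by key, so each
  // reduce call sees one word with all of its counts; we emit (word, total).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```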
Overview - Architecture • Hadoop has a master/slave architecture. • There is a single master server, the jobtracker, and several slave servers, the tasktrackers, one per node in the cluster. • The jobtracker is the point of interaction between users and the framework. • Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. • The jobtracker manages the assignment of map and reduce tasks to the tasktrackers. • The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.
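A hedged sketch of the driver that submits such a job to the framework (and, in Hadoop 1.x, to the jobtracker). It assumes the TokenizerMapper/IntSumReducer classes sketched above; the input and output paths are hypothetical command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");          // Hadoop 1.x style constructor
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    // waitForCompletion() submits the job to the framework's queue and blocks
    // until all map and reduce tasks have finished.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```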
Overview - Hardware • Hadoop is designed to run on a large number of machines that don't share any memory or disks (i.e., most commodity PCs and servers). • Scaling is best with nodes that have dual processors/cores with 4-8 GB of RAM. • The best cost/performance places the machines at 1/2 to 1/3 of the cost of application servers but above the cost of standard desktop machines. • Bandwidth needed depends on the jobs being run and the number of nodes. Most typical jobs produce around 100 MB/s of data.
Overview - Hadoop DFS • Designed to reliably store very large files across machines in a large cluster. • Inspired by the Google File System. • Each file is stored as a sequence of blocks; all blocks in a file except the last block are the same size. • Blocks belonging to a file are replicated for fault tolerance. • The block size and replication factor are configurable per file. • Files in HDFS are "write once" and have strictly one writer at any time.
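A minimal sketch of the per-file block size and replication factor through the HDFS client API. The path and the values are illustrative assumptions, not part of the slides; the configuration is picked up from the cluster's core-site.xml/hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads the cluster's site config
    FileSystem fs = FileSystem.get(conf);            // client handle to HDFS

    Path file = new Path("/user/demo/sample.txt");   // hypothetical path
    short replication = 3;                           // replicas kept per block
    long blockSize = 64L * 1024 * 1024;              // 64 MB blocks (the 1.x default)

    // Files are "write once": a single writer creates the file and appends bytes.
    FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize);
    out.writeBytes("hello hdfs\n");
    out.close();

    fs.close();
  }
}
```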
Overview - HDFS Architecture • HDFS also follows a master/slave architecture. • An installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. • There are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on. • The Namenode executes filesystem namespace operations such as opening, closing, and renaming files and directories, and also determines the mapping of blocks to Datanodes. • The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
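A hedged sketch of how a client exercises this split: namespace operations are handled by the Namenode, while the bytes of a read are streamed by the Datanodes. The paths are hypothetical and continue the example above.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Namespace operations (served by the Namenode): make a directory,
    // rename a file, and list the directory's contents.
    fs.mkdirs(new Path("/user/demo/reports"));                      // hypothetical paths
    fs.rename(new Path("/user/demo/sample.txt"),
              new Path("/user/demo/reports/sample.txt"));
    for (FileStatus status : fs.listStatus(new Path("/user/demo/reports"))) {
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }

    // Reading the file: the Namenode supplies block locations, and the
    // Datanodes stream the actual data back to the client.
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/demo/reports/sample.txt"))));
    System.out.println(reader.readLine());
    reader.close();

    fs.close();
  }
}
```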
Overview - Developing for Hadoop • A Hadoop adopter must be more sophisticated than a relational database adopter. • There are not many "turn-key" applications that are ready to use. • Each company that uses Hadoop will likely have to be adept enough to create its own programs in-house. • Hadoop is written in Java, but APIs are available for multiple languages, and many vendors now provide tools to assist.
What Hadoop is not… • A silver bullet that will solve all application/datacenter problems. • A replacement for a database or SAN. • A replacement for an existing NFS. • Always the fastest solution: if operations depend on the output of preceding operations, Hadoop/MapReduce will likely provide no benefit. • A place to learn Java. • A place to learn networking. • A place to learn Unix/Linux system administration.
Proposal - Purpose • The purpose of this assignment is to set up and configure a Hadoop cluster so that we can effectively distribute processes between different machines. • MapReduce algorithms can then be run on the cluster to process extremely large amounts of data. • Hadoop will also allow us to observe fault tolerance in action.
Proposal - Outline • Hardware • Network Topology • OS/System Software • Hadoop • Benchmark Tool • MapReduce • Applying Distributed Environment
Proposal - Hardware • Utilizing 2 pods + possibly servers or other computers • 2 compatible computers at each pod • 64-bit architecture machines • 1 control system; if access to the network is wanted, we can set up one node with access • All computers are potential slaves • 1x 24-port Cisco switch • Hadoop can also run on a single-node cluster
Proposal – Network Topology • High Level View
Proposal – OS/Software • Supported operating systems • Linux and Windows; BSD, Mac OS X, and OpenSolaris would also work. • Since Linux is an open-source OS, we will use it for our installation. • Needs Java 1.6.x or higher to run • Platform specifics • Most of Hadoop is built with Java, but a growing amount is written in C and C++ • This makes it harder to port because of functionality issues • Successfully tested setups • Ubuntu Linux 10.04 LTS, 8.10, 8.04 LTS, 7.10 • Hadoop 1.0.3, released May 2012
Proposal – OS/Software Cont. • Relies heavily on Unix/Linux system knowledge • Linux provides configurability for the following (must know): • SSH, how to use both ssh and scp • ifconfig, nslookup, and other network programs • Knowing where the logs are • Setting up and mounting filesystems • Basically, we must have a good working knowledge of a Linux system already (or know how to use Google) • A working knowledge of Java programming is also needed to work through possible errors.
Proposal - Releases • Hadoop Releases: http://hadoop.apache.org/releases.html • Software library framework allows for distributed processing • Library is designed to detect and handle failures at an application level. • Delivers high availability • Includes common utilities to support Hadoop modules • Contains framework for jobs • Scheduling and cluster resource management • Implementation of MapReduce • Hadoop Distributed File System (HDFS) • Runs on commodity hardware • Fault tolerance • Restarting tasks • Data replication
Proposal – Why Hadoop • Companies that implement Hadoop • IBM • LinkedIn • Adobe • Twitter • Facebook • Amazon • Yahoo! • Rackspace • We can clearly see that Hadoop is worth the hype!
Proposal - Benchmarking • Hadoop MapReduce • Hadoop Streaming allows shell commands or scripts to be used as map or reduce functions. • Operations can be run in parallel on different lists of data. • Pushes the program out to the machines • Output is saved to the distributed filesystem • The job tracker keeps track of MapReduce jobs • Success and failure • Works to complete the entire job • Provides its own distributed filesystem and runs jobs near the data stored on each filesystem
Proposal – Benchmarking Cont • Maximum parallelism • Maps and Reduces must be stateless • You can't control the order in which the maps or the reductions run • You won't get data back until the entire mapping has completed • Nodes do report back periodically, just not with full detail • Used in several different environments • Multi-core and many-core systems, desktop grids, dynamic cloud environments • Example: • grep -Eh <regex> <inDir>/* | sort | uniq -c | sort -nr • Used to count lines in all files in a directory that match a regex condition (a MapReduce analogue is sketched below). • We can see both powerful and not-so-powerful applications of MapReduce
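A hedged MapReduce analogue of the grep | sort | uniq -c pipeline above: the mapper emits (matched text, 1) for every regex match, and a summing reducer like the IntSumReducer sketched earlier totals the matches per pattern. The "match.regex" configuration key is a hypothetical name a driver would set, not a Hadoop built-in.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RegexMatchMapper extends Mapper<Object, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // "match.regex" is a hypothetical configuration key set by the job driver.
    pattern = Pattern.compile(context.getConfiguration().get("match.regex", ".*"));
  }

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit one (match, 1) pair per regex match in the input line; the
    // framework's sort/group step plays the role of the shell's sort | uniq -c.
    Matcher matcher = pattern.matcher(value.toString());
    while (matcher.find()) {
      context.write(new Text(matcher.group()), ONE);
    }
  }
}
```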
Proposal – Useful Applications • Word Counting projects. • Generating PDF files for many articles as scanned images • Google implementation-locate roads connected to given intersection • Rendering maps • Finding nearest features • Page Ranking • Machine Translation