Hadoop Carson Gallimore, Chris Zingraf, Jonathan Light
Contents • Hadoop Overview • MapReduce • HDFS • History • Architecture • Applications
What is Hadoop? • Open-source software project • Used to distribute the processing of large data sets over clusters of servers • Resilient because it is designed to detect and handle failures at the application layer http://tinyurl.com/m33wgcw
Overview • The Hadoop ecosystem includes many related Apache projects (e.g. Pig, Hive, ZooKeeper) • Hadoop itself relies mainly on MapReduce and HDFS (the Hadoop Distributed File System) • MapReduce is a framework that assigns work to the nodes in a cluster • HDFS is a file system that spans all of the nodes in the cluster to store data http://www.ibmbigdatahub.com/sites/default/files/public_images/hadoop.jpg
MapReduce • “MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster”. http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ http://people.apache.org/~rdonkin/hadoop-talk/diagrams/map-reduce.png
Example: http://www-01.ibm.com/software/ebusiness/jstart/graphics/hadoopDiagram.png
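To make the paradigm concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names WordCountMapper and WordCountReducer are illustrative, not from the slides:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map step: emit (word, 1) for every word in this node's slice of the input.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // the framework groups these pairs by word
            }
        }
    }

    // Reduce step: sum the counts the framework grouped under each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) sum += count.get();
            context.write(word, new IntWritable(sum));
        }
    }

The map tasks run in parallel across the cluster, one per input split, which is what gives the paradigm its scalability.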
HDFS • HDFS breaks the data into small blocks and distributes them across the nodes of the cluster • This aids scalability: the map and reduce functions can each work on a small subset of a much larger data set • The goal of Hadoop is to use commodity servers with inexpensive internal disk drives in large clusters
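As a sketch of how that block layout can be inspected, the program below uses Hadoop's Java FileSystem API to print where each block of a file lives; the path /data/sample.txt is a hypothetical example, and a configured HDFS client is assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);      // connects to the configured file system
            Path file = new Path("/data/sample.txt");  // hypothetical file path
            FileStatus status = fs.getFileStatus(file);
            // One BlockLocation per block, with the hosts that store a copy of it.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }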
HDFS, Cont. • More machines mean a potentially higher failure rate • Hadoop was developed with high failure rates in mind • Hadoop has built-in fault-tolerance and compensation capabilities, and so does HDFS
HDFS, Cont. • The data is divided into blocks, and copies of these blocks are made • The copied blocks are stored on other servers throughout the cluster • This way, if a server fails, the file can be rebuilt by combining the copied blocks from the remaining servers
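A minimal sketch of how that replication can be controlled through the same Java API; the file path and the factor of 3 (HDFS's usual default) are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask HDFS to keep 3 copies of each block of this (hypothetical) file;
            // if a server holding a copy fails, the missing blocks are re-replicated.
            boolean scheduled = fs.setReplication(new Path("/data/sample.txt"), (short) 3);
            System.out.println("Replication change scheduled: " + scheduled);
            fs.close();
        }
    }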
History • Underlying technology invented by Google in order to index the rich textual and structural information it was collecting • Designed to solve large-data problems involving a mixture of structured and complex data
History, Cont. • Uses a MapReduce engine and HDFS • Written in Java • Continuously developed and used by a global community of contributors
Architecture • Designed to run on many machines that do not share memory or disks • The software breaks the data into pieces and spreads them across all of the machines • To achieve this, Hadoop implements MapReduce
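As a sketch of how the pieces are wired together, the driver below uses the standard Hadoop Job API to submit a MapReduce job; it reuses the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch, and the input/output paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);     // illustrative class from earlier
            job.setCombinerClass(WordCountReducer.class);  // pre-aggregate on each mapper node
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The driver only describes the job; the framework handles splitting the input, scheduling map and reduce tasks near the data, and rerunning tasks on failure.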
Architecture, Cont. • Hadoop keeps track of where all the data resides and keeps copies in case of a server failure. • There are many different ways to customize Hadoop to fit specific needs.
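One common customization point is the Configuration object, which overrides the cluster's XML config files on a per-job basis; the property values below are illustrative assumptions, not recommendations:

    import org.apache.hadoop.conf.Configuration;

    public class CustomConf {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Override defaults from core-site.xml / hdfs-site.xml for this job only.
            conf.set("dfs.replication", "2");         // keep 2 copies instead of the default 3
            conf.set("mapreduce.job.reduces", "10");  // run 10 reduce tasks
            System.out.println(conf.get("dfs.replication"));
        }
    }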
Applications • Hadoop can be applied to multiple markets, including: • Risk analysis for financial corporations • Online retail and product suggestions
References • Turner, James. "Hadoop: What It Is, How It Works, and What It Can Do." January 12, 2011. <http://strata.oreilly.com/2011/01/what-is-hadoop.html> • Wikipedia. "Apache Hadoop." September 18, 2013. <http://en.wikipedia.org/wiki/Hadoop>
References, cont. • IBM. "What is Hadoop?" <http://www-01.ibm.com/software/data/infosphere/hadoop/> • IBM. "What is MapReduce?" <http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/> • IBM. "What is HDFS?" <http://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/>