
MapReduce & Hadoop




  1. MapReduce & Hadoop IT332 Distributed Systems

  2. Outline • MapReduce • Hadoop • Cloudera Hadoop • Tutorial

  3. MapReduce • MapReduce is a programming model for data processing • The power of MapReduce lies in its ability to scale to 100s or 1000s of computers, each with several processor cores • How large is the workload? • Web-scale data on the order of 100s of GBs to TBs or PBs • It is likely that the input data set will not fit on a single computer’s hard drive • Hence, a distributed file system (e.g., the Google File System, GFS) is typically required

  4. MapReduce Characteristics • MapReduce ties smaller, more reasonably priced machines together into a single cost-effective commodity cluster • MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes • The work performed by each task is done in isolation from the others

  5. Data Distribution • In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in • An underlying distributed file system (e.g., GFS) splits large data files into chunks, which are managed by different nodes in the cluster • Even though the file chunks are distributed across several machines, they form a single namespace [Figure: a large input file split into chunks, with each chunk managed by a different node (Node 1, Node 2, Node 3)]

  6. MapReduce: A Bird’s-Eye View • In MapReduce, chunks are processed in isolation by tasks called Mappers • The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers • The process of bringing together IOs into a set of Reducers is known as the shuffling process • The Reducers produce the final outputs (FOs) • Overall, MapReduce breaks the data flow into two phases, the map phase and the reduce phase [Figure: chunks C0–C3 are consumed by mappers M0–M3 in the map phase; their intermediate outputs IO0–IO3 are shuffled to reducers R0–R1 in the reduce phase, which produce final outputs FO0–FO1]
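
To make the data flow concrete, here is a small worked example (not from the original slides) tracing a word count through the two phases:

    Input chunk:          "the cat sat on the mat"
    Map output (IOs):     (the, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1)
    After shuffling:      (cat, [1]) (mat, [1]) (on, [1]) (sat, [1]) (the, [1, 1])
    Reduce output (FOs):  (cat, 1) (mat, 1) (on, 1) (sat, 1) (the, 2)

Shuffling guarantees that all intermediate pairs sharing a key end up at the same Reducer, which is what allows each Reducer to compute its portion of the final output independently.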

  7. Keys and Values • A MapReduce programmer has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer of a MapReduce program • In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs • The map and reduce functions receive and emit (K, V) pairs, as the sketch below illustrates [Figure: the map function turns input splits of (K, V) pairs into intermediate (K’, V’) pairs; the reduce function turns those into final (K’’, V’’) pairs]
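
As a concrete illustration of the two functions, below is a minimal word count Mapper and Reducer written against the Hadoop Java API (org.apache.hadoop.mapreduce). This is a sketch with illustrative class names, not code taken from the slides:

    // WordCountMapper.java -- map: (byte offset, line of text) -> (word, 1)
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit an intermediate (K', V') pair
            }
        }
    }

    // WordCountReducer.java -- reduce: (word, [1, 1, ...]) -> (word, count)
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // add up the 1s emitted for this word
            }
            context.write(key, new IntWritable(sum));  // emit a final (K'', V'') pair
        }
    }

Note how the generic type parameters make the key-value contract explicit: the Mapper’s output key and value types (Text, IntWritable) must match the Reducer’s input key and value types.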

  8. Hadoop • Hadoop is an open-source implementation of MapReduce and is currently enjoying wide popularity • Hadoop presents MapReduce as an analytics engine and under the hood uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS) • HDFS mimics the Google File System (GFS)
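
As a hedged illustration of HDFS as a storage layer, the sketch below copies a local file into HDFS through Hadoop’s FileSystem Java API; the paths are hypothetical, and the Configuration is assumed to point at a running cluster (e.g., via core-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUploadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);      // handle to the cluster's default file system
            // Copy a local file into HDFS's single namespace; HDFS transparently
            // splits it into blocks and distributes them across the cluster's nodes.
            fs.copyFromLocalFile(new Path("input.txt"),                       // hypothetical local source
                                 new Path("/user/cloudera/input/input.txt")); // hypothetical HDFS destination
            fs.close();
        }
    }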

  9. Cloudera Hadoop

  10. Cloudera Virtual Machine • The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop. • Requirements: • A 64-bit host OS • Virtualization software: VMware Player, KVM, or VirtualBox. • The virtualization software requires a machine that supports virtualization; if you are unsure, check whether virtualization is enabled in your BIOS. • 4 GB of total RAM. • The total system memory required varies depending on the size of your data set and on the other processes that are running.

  11. Installation • Step #1: Download & run VMware • Step #2: Download the Cloudera VM • Step #3: Extract it to the Cloudera folder. • Step #4: Open the "cloudera-quickstart-vm-4.4.0-1-vmware"

  12. Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see an initial screen similar to the one below:

  13. WordCount Tutorial • This example computes the occurrence frequency of each word in a text file. • Steps: • Set up the Hadoop environment • Upload files into HDFS • Execute the Java MapReduce functions in Hadoop (a driver sketch follows below) • Tutorial: http://edataanalyst.com/2013/08/the-definitive-cloudera-hadoop-wordcount-tutorial/
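
To give the execution step some shape, here is a minimal job-submission driver for WordCount. It is a sketch rather than the linked tutorial’s exact code, and it assumes the WordCountMapper and WordCountReducer classes sketched earlier:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // optional map-side pre-aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, this would typically be launched from inside the VM with something like "hadoop jar wordcount.jar WordCountDriver input output", where input and output are HDFS paths.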
