CS525: Big Data Analytics. MapReduce Computing Paradigm & Apache Hadoop Open Source. Fall 2013, Elke A. Rundensteiner
Large-Scale Data Analytics • Many enterprises turn to the Hadoop computing paradigm for big data applications. Hadoop vs. Database:
• Hadoop: scalability (petabytes of data, thousands of machines); flexibility in accepting all data formats (no schema); efficient fault tolerance support; commodity, inexpensive hardware
• Database: performance (indexing, tuning, data organization techniques); focus on read + write, concurrency, correctness, convenience, high-level access; advanced features such as full query support, clever optimizers, views and security, data consistency, …
What is Hadoop? • Hadoop is a simple software framework for distributed processing of large datasets across huge clusters of (commodity-hardware) computers: • Large datasets: terabytes or petabytes of data • Large clusters: hundreds or thousands of nodes • Open-source implementation of Google's MapReduce • Simple programming model: MapReduce • Simple data model: flexible for any data
Hadoop Framework • Two main layers: • Distributed file system (HDFS) • Execution engine (MapReduce) Hadoop is designed as a master-slave shared-nothing architecture
Key Ideas of Hadoop • Automatic parallelization & distribution • Hidden from end-user • Fault tolerance and automatic recovery • Failed nodes/tasks recover automatically • Simple programming abstraction • Users provide two functions “map” and “reduce”
Who Uses Hadoop? • Google: invented the MapReduce computing paradigm • Yahoo: developed Hadoop, the open-source implementation of MapReduce • Integrators: IBM, Microsoft, Oracle, Greenplum • Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn • Many others …
Hadoop Distributed File System (HDFS) • Centralized namenode: maintains metadata info about files • Many datanodes (1000s): store the actual data • Files are divided into blocks (64 MB), e.g. a file F into blocks 1–5 • Each block is replicated N times (default N = 3)
HDFS File System Properties • Large space: an HDFS instance may consist of thousands of server machines for storage • Replication: each data block is replicated on multiple datanodes • Failure: failure is the norm rather than the exception • Fault tolerance: automated detection of faults and recovery
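The following is a minimal Java sketch (not from the slides) of writing a file to HDFS through the Hadoop FileSystem API; the path and the explicit dfs.replication setting are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask HDFS to keep 3 replicas of each block (the default mentioned above).
        conf.setInt("dfs.replication", 3);

        // Connects to the filesystem named in the cluster configuration (the namenode for HDFS).
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        try (FSDataOutputStream out = fs.create(file)) {
            // The written data is split into blocks and replicated across datanodes by HDFS.
            out.writeUTF("Hello HDFS");
        }
    }
}

Reading follows the same style via fs.open(path); the namenode is consulted only for block locations, while the data itself flows to and from the datanodes.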
Map-Reduce Execution Engine (Example: Color Count) • Input: blocks on HDFS • Map: consumes input records, produces (k, v) pairs, e.g. (color, 1) • Shuffle & sorting based on k • Reduce: consumes (k, [v]) pairs, e.g. (color, [1,1,1,1,1,1,…]), produces (k’, v’) pairs, e.g. (color, 100) • Users only provide the “Map” and “Reduce” functions (a minimal sketch follows)
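A minimal Java sketch of the two user-provided functions for the color-count job, assuming each input line holds one color name; the class names (ColorCount, ColorMapper, ColorReducer) are illustrative and not from the slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ColorCount {

    // Map: consume (offset, line) pairs, produce (color, 1) for every input line.
    public static class ColorMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text color = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            color.set(line.toString().trim());
            context.write(color, ONE);                    // e.g. (green, 1)
        }
    }

    // Reduce: consume (color, [1,1,1,...]) after shuffle/sort, produce (color, total).
    public static class ColorReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text color, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(color, new IntWritable(sum));   // e.g. (green, 100)
        }
    }
}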
MapReduce Engine • Job Tracker is the master node (runs with the namenode) • Receives the user’s job • Decides how many tasks will run (number of mappers) • Decides where to run each mapper (locality) • Example: a file with 5 blocks → run 5 map tasks; run the task reading block “1” on Node 1 or Node 3, wherever that block is stored
MapReduce Engine • Task Tracker is the slave node (runs on each datanode) • Receives the task from the Job Tracker • Runs the task to completion (either a map or a reduce task) • Communicates with the Job Tracker to report its progress • Example: one map-reduce job may consist of 4 map tasks and 3 reduce tasks (a minimal driver sketch follows)
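A minimal driver sketch showing how such a job could be submitted with Hadoop's Job class; the input/output paths and the choice of 3 reduce tasks (to match the example above) are assumptions. Note that the number of map tasks is not set here: it follows from the number of input blocks, as described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color count");
        job.setJarByClass(ColorCountDriver.class);

        // The two user-provided functions from the sketch above.
        job.setMapperClass(ColorCount.ColorMapper.class);
        job.setReducerClass(ColorCount.ColorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Reduce tasks are chosen by the user; map tasks follow the input blocks.
        job.setNumReduceTasks(3);

        FileInputFormat.addInputPath(job, new Path("/user/demo/colors"));       // hypothetical input dir
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/colors-out")); // hypothetical output dir

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}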
About Key-Value Pairs • Developer provides Mapper and Reducer functions • Developer decides what is the key and what is the value • Developer must follow the key-value pair interface • Mappers: consume <key, value> pairs, produce <key, value> pairs • Shuffling and sorting: groups all values with the same key from all mappers, sorts them, and passes them to a particular reducer in the form of <key, <list of values>> • Reducers: consume <key, <list of values>>, produce <key, value>
Another Example: Word Count • Job: count the occurrences of each word in a data set • Map tasks emit (word, 1) pairs; reduce tasks sum the counts per word (a sketch in the style of the color-count example follows)
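Word count follows the same map/reduce pattern as the color-count sketch above, except that the mapper splits each line into words; a minimal, illustrative version (class names assumed) might look like this.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every word occurrence in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);                 // (word, 1) per occurrence
            }
        }
    }

    // Reduce: sum all the 1s for a word to get its total count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));    // (word, total occurrences)
        }
    }
}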