CS525: Big Data Analytics. MapReduce Computing Paradigm & Apache Hadoop Open Source. Fall 2013, Elke A. Rundensteiner
Large-Scale Data Analytics • Many enterprises turn to the Hadoop computing paradigm for big data applications. Hadoop vs. Database:
• Hadoop: scalability (petabytes of data, thousands of machines); flexibility in accepting all data formats (no schema); efficient fault tolerance support; commodity, inexpensive hardware
• Database: performance (indexing, tuning, data organization techniques); focus on read + write, concurrency, correctness, convenience, high-level access; advanced features such as full query support, clever optimizers, views and security, data consistency, …
What is Hadoop? • Hadoop is a simple software framework for distributed processing of large datasets across huge clusters of (commodity-hardware) computers: • Large datasets: terabytes or petabytes of data • Large clusters: hundreds or thousands of nodes • Open-source implementation of Google's MapReduce • Simple programming model: MapReduce • Simple data model: flexible for any data
Hadoop Framework • Two main layers: • Distributed file system (HDFS) • Execution engine (MapReduce) Hadoop is designed as a master-slave shared-nothing architecture
Key Ideas of Hadoop • Automatic parallelization & distribution • Hidden from end-user • Fault tolerance and automatic recovery • Failed nodes/tasks recover automatically • Simple programming abstraction • Users provide two functions “map” and “reduce”
Who Uses Hadoop? • Google: invented the MapReduce computing paradigm • Yahoo: developed Hadoop, the open-source implementation of MapReduce • Integrators: IBM, Microsoft, Oracle, Greenplum • Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn • Many others …
Hadoop Distributed File System (HDFS) • Centralized namenode: maintains metadata info about files • Many datanodes (1000s): store the actual data • Files are divided into blocks (64 MB), e.g. a file F into blocks 1–5 • Each block is replicated N times (default N = 3)
HDFS File System Properties • Large space: an HDFS instance may consist of thousands of server machines for storage • Replication: each data block is replicated on multiple datanodes • Failure: failure is the norm rather than the exception • Fault tolerance: automated detection of faults and recovery
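The following is a minimal Java sketch (not from the slides) of writing a file to HDFS through the Hadoop FileSystem API; the path and the explicit dfs.replication setting are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask HDFS to keep 3 replicas of each block (the default mentioned above).
        conf.setInt("dfs.replication", 3);

        // Connects to the filesystem named in the cluster configuration (the namenode for HDFS).
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        try (FSDataOutputStream out = fs.create(file)) {
            // The written data is split into blocks and replicated across datanodes by HDFS.
            out.writeUTF("Hello HDFS");
        }
    }
}

Reading follows the same style via fs.open(path); the namenode is consulted only for block locations, while the data itself flows to and from the datanodes.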
Map-Reduce Execution Engine (Example: Color Count) • Input: blocks on HDFS • Map: consumes input records, produces (k, v) pairs, e.g. (color, 1) • Shuffle & sorting based on k • Reduce: consumes (k, [v]) pairs, e.g. (color, [1,1,1,1,1,1,…]), produces (k’, v’) pairs, e.g. (color, 100) • Users only provide the “Map” and “Reduce” functions (a minimal sketch follows)
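A minimal Java sketch of the two user-provided functions for the color-count job, assuming each input line holds one color name; the class names (ColorCount, ColorMapper, ColorReducer) are illustrative and not from the slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ColorCount {

    // Map: consume (offset, line) pairs, produce (color, 1) for every input line.
    public static class ColorMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text color = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            color.set(line.toString().trim());
            context.write(color, ONE);                    // e.g. (green, 1)
        }
    }

    // Reduce: consume (color, [1,1,1,...]) after shuffle/sort, produce (color, total).
    public static class ColorReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text color, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(color, new IntWritable(sum));   // e.g. (green, 100)
        }
    }
}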
MapReduce Engine • Job Tracker is the master node (runs with the namenode) • Receives the user’s job • Decides how many tasks will run (number of mappers) • Decides where to run each mapper (locality) • Example: a file with 5 blocks → run 5 map tasks; run the task reading block “1” on Node 1 or Node 3, wherever that block is stored
MapReduce Engine • Task Tracker is the slave node (runs on each datanode) • Receives the task from the Job Tracker • Runs the task to completion (either a map or a reduce task) • Communicates with the Job Tracker to report its progress • Example: one map-reduce job may consist of 4 map tasks and 3 reduce tasks (a minimal driver sketch follows)
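A minimal driver sketch showing how such a job could be submitted with Hadoop's Job class; the input/output paths and the choice of 3 reduce tasks (to match the example above) are assumptions. Note that the number of map tasks is not set here: it follows from the number of input blocks, as described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color count");
        job.setJarByClass(ColorCountDriver.class);

        // The two user-provided functions from the sketch above.
        job.setMapperClass(ColorCount.ColorMapper.class);
        job.setReducerClass(ColorCount.ColorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Reduce tasks are chosen by the user; map tasks follow the input blocks.
        job.setNumReduceTasks(3);

        FileInputFormat.addInputPath(job, new Path("/user/demo/colors"));       // hypothetical input dir
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/colors-out")); // hypothetical output dir

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}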
About Key-Value Pairs • Developer provides Mapper and Reducer functions • Developer decides what is the key and what is the value • Developer must follow the key-value pair interface • Mappers: consume <key, value> pairs, produce <key, value> pairs • Shuffling and sorting: groups all values with the same key from all mappers, sorts them, and passes them to a particular reducer in the form of <key, <list of values>> • Reducers: consume <key, <list of values>>, produce <key, value>
Another Example: Word Count • Job: count the occurrences of each word in a data set • Map tasks emit (word, 1) pairs; reduce tasks sum the counts per word (a sketch in the style of the color-count example follows)
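Word count follows the same map/reduce pattern as the color-count sketch above, except that the mapper splits each line into words; a minimal, illustrative version (class names assumed) might look like this.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every word occurrence in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);                 // (word, 1) per occurrence
            }
        }
    }

    // Reduce: sum all the 1s for a word to get its total count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));    // (word, total occurrences)
        }
    }
}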