A BigData Tour – HDFS, Ceph and MapReduce These slides are possible thanks to these sources – Jonathan Dursi - SciNet Toronto – Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing – SICS; Yahoo! Developer Network MapReduce Tutorial
Data Management and Processing • Data intensive computing • Concerned with the production, manipulation and analysis of data in the range of hundreds of megabytes (MB) to petabytes (PB) and beyond • A range of supporting parallel and distributed computing technologies to deal with the challenges of data representation, reliable shared storage, efficient algorithms and scalable infrastructure for performing analysis
Challenges Ahead • Challenges with data intensive computing • Scalable algorithms that can search and process massive datasets • New metadata management technologies that can scale to handle complex, heterogeneous and distributed data sources • Support for accessing in-memory multi-terabyte data structures • High performance, highly reliable petascale distributed file system • Techniques for data reduction and rapid processing • Software mobility to move computation to where the data is located • Hybrid interconnects with support for multi-gigabyte data streams • Flexible and high performance software integration techniques • Hadoop • A family of related projects, best known for MapReduce and the Hadoop Distributed File System (HDFS)
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Why Hadoop • Drivers • 500M+ unique users per month • Billions of interesting events per day • Data analysis is key • Need massive scalability • PBs of storage, millions of files, 1000s of nodes • Need to do this cost effectively • Use commodity hardware • Share resources among multiple projects • Provide scale when needed • Need reliable infrastructure • Must be able to deal with failures – hardware, software, networking • Failure is expected rather than exceptional • Transparent to applications • Very expensive to build reliability into each application • The Hadoop infrastructure provides these capabilities
Introduction to Hadoop • Apache Hadoop • Based on the 2004 Google MapReduce paper • Originally composed of HDFS (distributed F/S), a core runtime and an implementation of MapReduce • Open source – Apache Software Foundation project • Yahoo! is Apache Platinum Sponsor • History • Started in 2005 by Doug Cutting • Yahoo! became the primary contributor in 2006 • Yahoo! scaled it from 20 node clusters to 4000 node clusters today • Portable • Written in Java • Runs on commodity hardware • Linux, Mac OS X, Windows, and Solaris
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
HPC vs Hadoop • HPC attitude – “The problem of disk-limited, loosely-coupled data analysis was solved by throwing more disks and using weak scaling” • Flip-side: A single novice developer can write real, scalable, 1000+ node data-processing tasks in Hadoop-family tools in an afternoon • MPI... less so Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Everything is converging – 1/2 Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Everything is converging – 2/2 Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Storage (sans POSIX) Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data - Databases Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Resource Management Amir Payberah https://www.sics.se/~amir/dic.htm
YARN – 1/3 • To address Hadoop v1 deficiencies with scalability, memory usage and synchronization, the Yet Another Resource Negotiator (YARN) Apache sub-project was started • Previously, a single JobTracker service managed the whole cluster. Its roles were then split into separate daemons for • Resource management • Job scheduling/monitoring Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
YARN – 2/3 • YARN splits the JobTracker’s responsibilities into • Resource management – the global Resource Manager daemon • A per-application Application Master • The resource manager and per-node slave Node Managers allow generic node management • The resource manager has a pluggable scheduler Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
YARN – 3/3 • The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk and network • The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager • The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress. From the system perspective, the ApplicationMaster itself runs as a normal container. Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
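To make the container-negotiation flow above concrete, here is a minimal sketch of an ApplicationMaster using the standard YARN AMRMClient API. It is illustrative only: the 1024 MB / 1 vcore request, the empty host and tracking-URL arguments and the simple polling loop are assumptions, and in practice this code would itself be launched by the ResourceManager inside a container.

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // The ApplicationMaster talks to the ResourceManager through AMRMClient
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();

    // Register this ApplicationMaster (itself running as a normal container)
    rmClient.registerApplicationMaster("", 0, "");

    // Describe the Resource Container we need: memory (MB) and virtual cores
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

    // Heartbeat/allocate loop: the Scheduler eventually grants containers
    boolean granted = false;
    while (!granted) {
      AllocateResponse response = rmClient.allocate(0.0f);
      for (Container c : response.getAllocatedContainers()) {
        System.out.println("Got container " + c.getId() + " on node " + c.getNodeId());
        granted = true;
        // A real AM would now launch its work in the container via NMClient
      }
      Thread.sleep(1000);
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```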
Big Data – Execution Engine Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Query/Scripting Languages Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Stream Processing Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Graph Processing Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Machine Learning Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm
Spark Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Ecosystem Hortonworks http://hortonworks.com/industry/manufacturing/
Hadoop Ecosystem • 2008 onwards – usage exploded • Creation of many tools on top of Hadoop infrastructure
The Need For Filesystems Amir Payberah https://www.sics.se/~amir/dic.htm
Distributed Filesystems Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Distributed File System (HDFS) • A distributed file system designed to run on commodity hardware • HDFS was originally built as infrastructure for the Apache Nutch web search engine project, with the aims of achieving fault tolerance, running on low-cost hardware and handling large datasets • It is now an Apache Hadoop subproject • Shares similarities with existing distributed file systems and supports traditional hierarchical file organization • Provides reliable data replication and is accessible via a web interface and shell commands • Benefits: fault tolerance, high throughput, streaming data access, robustness and handling of large data sets • HDFS is not a general-purpose F/S
Assumptions and Goals • Hardware failures • Detection of faults, quick and automatic recovery • Streaming data access • Designed for batch processing rather than interactive use by users • Large data sets • Applications that run on HDFS have large data sets, typically in gigabytes to terabytes in size • Optimized for batch reads rather than random reads • Simple coherency model • Applications need a write-once, read-many times access model for files • Computation migration • Computation is moved closer to where data is located • Portability • Easily portable between heterogeneous hardware and software platforms
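As a concrete illustration of the write-once, read-many access model, the sketch below uses the standard Hadoop FileSystem Java API; the file path is hypothetical, and it assumes the client can reach a running HDFS NameNode (via fs.defaultFS from the cluster configuration).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);           // handle to the configured HDFS

    Path file = new Path("/user/demo/events.txt");  // hypothetical path

    // Write once: the file is created, written sequentially, then closed
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("event-1\nevent-2\n");
    }

    // Read many: later jobs stream the file; there are no in-place updates
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```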
What HDFS is not good for Amir Payberah https://www.sics.se/~amir/dic.htm
HDFS Architecture • The Hadoop Distributed File System (HDFS) • Offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to or greater than the summed total size of the files • HDFS is designed to be fault-tolerant • Using data replication and distribution of data • When a file is loaded into HDFS, it is broken up into "blocks" of data, each of which is replicated • These blocks are stored across the cluster nodes designated for storage, a.k.a. DataNodes. http://www.revelytix.com/?q=content/hadoop-ecosystem
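The block placement described above can be observed from a client: the following sketch (standard FileSystem API; the file path is an assumption) asks the NameNode for the block-to-DataNode mapping of an existing file and prints which hosts store each block.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/big-input.dat");  // hypothetical file already in HDFS

    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block; each lists the DataNodes holding a replica
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
  }
}
```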
Files and Blocks – 1/3 Amir Payberah https://www.sics.se/~amir/dic.htm
Files and Blocks – 2/3 Amir Payberah https://www.sics.se/~amir/dic.htm
Files and Blocks – 3/3 Amir Payberah https://www.sics.se/~amir/dic.htm
HDFS Daemons • An HDFS cluster is managed by three types of processes • Namenode • Manages the filesystem, e.g., namespace, metadata, and file blocks • Metadata is stored in memory • Datanode • Stores and retrieves data blocks • Reports to Namenode • Runs on many machines • Secondary Namenode • Only for checkpointing • Not a backup for Namenode Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Server Roles http://www.revelytix.com/?q=content/hadoop-ecosystem
NameNode – 1/3 • The HDFS namespace is a hierarchy of files and directories • These are represented in the NameNode using inodes • Inodes record attributes • permissions, modification and access times; • namespace and disk space quotas. • The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file) • The NameNode maintains the namespace tree and the mapping of blocks to DataNodes • A Hadoop cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently http://www.revelytix.com/?q=content/hadoop-ecosystem
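Because block size and replication factor are user-selectable file-by-file, a client may override the cluster defaults when a file is created. A minimal sketch, where the 64 MB block size, replication factor of 2 and file path are arbitrary assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSettings {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/custom.dat");   // hypothetical path

    short replication = 2;                           // instead of the typical 3
    long blockSize = 64L * 1024 * 1024;              // 64 MB instead of the typical 128 MB
    int bufferSize = 4096;

    // Per-file block size and replication are passed at creation time
    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.writeBytes("data written with per-file settings\n");
    }

    // Replication can also be changed after the fact; block size cannot
    fs.setReplication(file, (short) 3);
  }
}
```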