A BigData Tour – HDFS, Ceph and MapReduce These slides are possible thanks to these sources – Jonathan Dursi - SciNet Toronto – Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing – SICS; Yahoo! Developer Network MapReduce Tutorial
Data Management and Processing • Data intensive computing • Concerned with the production, manipulation and analysis of data in the range of hundreds of megabytes (MB) to petabytes (PB) and beyond • A range of supporting parallel and distributed computing technologies to deal with the challenges of data representation, reliable shared storage, efficient algorithms and scalable infrastructure for performing analysis
Challenges Ahead • Challenges with data intensive computing • Scalable algorithms that can search and process massive datasets • New metadata management technologies that can scale to handle complex, heterogeneous and distributed data sources • Support for accessing in-memory multi-terabyte data structures • High performance, highly reliable petascale distributed file system • Techniques for data reduction and rapid processing • Software mobility to move computation to where the data is located • Hybrid interconnects with support for multi-gigabyte data streams • Flexible and high performance software integration techniques • Hadoop • A family of related projects, best known for MapReduce and the Hadoop Distributed File System (HDFS)
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Why Hadoop • Drivers • 500M+ unique users per month • Billions of interesting events per day • Data analysis is key • Need massive scalability • PBs of storage, millions of files, 1000s of nodes • Need to do this cost effectively • Use commodity hardware • Share resources among multiple projects • Provide scale when needed • Need reliable infrastructure • Must be able to deal with failures – hardware, software, networking • Failure is expected rather than exceptional • Transparent to applications • Very expensive to build reliability into each application • The Hadoop infrastructure provides these capabilities
Introduction to Hadoop • Apache Hadoop • Based on the 2004 Google MapReduce paper • Originally composed of HDFS (distributed F/S), a core runtime and an implementation of MapReduce • Open source – Apache Software Foundation project • Yahoo! is Apache Platinum Sponsor • History • Started in 2005 by Doug Cutting • Yahoo! became the primary contributor in 2006 • Yahoo! scaled it from 20 node clusters to 4000 node clusters today • Portable • Written in Java • Runs on commodity hardware • Linux, Mac OS X, Windows, and Solaris
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
HPC vs Hadoop • HPC attitude – “The problem of disk-limited, loosely-coupled data analysis was solved by throwing more disks and using weak scaling” • Flip-side: A single novice developer can write real, scalable, 1000+ node data-processing tasks in Hadoop-family tools in an afternoon • MPI... less so Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Everything is converging – 1/2 Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Everything is converging – 2/2 Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Storage (sans POSIX) Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data - Databases Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Resource Management Amir Payberah https://www.sics.se/~amir/dic.htm
YARN – 1/3 • To address Hadoop v1 deficiencies with scalability, memory usage and synchronization, the Yet Another Resource Negotiator (YARN) Apache sub-project was started • Previously, a single JobTracker service managed the whole cluster. Its roles were then split into separate daemons for • Resource management • Job scheduling/monitoring Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
YARN – 2/3 • YARN splits the JobTracker’s responsibilities into • Resource management – the global Resource Manager daemon • A per-application Application Master • The resource manager and per-node slave Node Managers allow generic node management • The resource manager has a pluggable scheduler Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
YARN – 3/3 • The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk and network • The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager • The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress. From the system perspective, the ApplicationMaster itself runs as a normal container. Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
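To make the container-negotiation flow above concrete, here is a minimal sketch of an ApplicationMaster using the standard YARN AMRMClient API. It is illustrative only: the 1024 MB / 1 vcore request, the empty host and tracking-URL arguments and the simple polling loop are assumptions, and in practice this code would itself be launched by the ResourceManager inside a container.

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // The ApplicationMaster talks to the ResourceManager through AMRMClient
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();

    // Register this ApplicationMaster (itself running as a normal container)
    rmClient.registerApplicationMaster("", 0, "");

    // Describe the Resource Container we need: memory (MB) and virtual cores
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

    // Heartbeat/allocate loop: the Scheduler eventually grants containers
    boolean granted = false;
    while (!granted) {
      AllocateResponse response = rmClient.allocate(0.0f);
      for (Container c : response.getAllocatedContainers()) {
        System.out.println("Got container " + c.getId() + " on node " + c.getNodeId());
        granted = true;
        // A real AM would now launch its work in the container via NMClient
      }
      Thread.sleep(1000);
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```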
Big Data – Execution Engine Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Query/Scripting Languages Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Stream Processing Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Graph Processing Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Machine Learning Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm
Spark Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Ecosystem Hortonworks http://hortonworks.com/industry/manufacturing/
Hadoop Ecosystem • 2008 onwards – usage exploded • Creation of many tools on top of Hadoop infrastructure
The Need For Filesystems Amir Payberah https://www.sics.se/~amir/dic.htm
Distributed Filesystems Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Distributed File System (HDFS) • A distributed file system designed to run on commodity hardware • HDFS was originally built as infrastructure for the Apache Nutch web search engine project, with the aims of achieving fault tolerance, running on low-cost hardware and handling large datasets • It is now an Apache Hadoop subproject • Shares similarities with existing distributed file systems and supports traditional hierarchical file organization • Provides reliable data replication and is accessible via a web interface and shell commands • Benefits: fault tolerance, high throughput, streaming data access, robustness and handling of large data sets • HDFS is not a general-purpose F/S
Assumptions and Goals • Hardware failures • Detection of faults, quick and automatic recovery • Streaming data access • Designed for batch processing rather than interactive use by users • Large data sets • Applications that run on HDFS have large data sets, typically in gigabytes to terabytes in size • Optimized for batch reads rather than random reads • Simple coherency model • Applications need a write-once, read-many times access model for files • Computation migration • Computation is moved closer to where data is located • Portability • Easily portable between heterogeneous hardware and software platforms
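As a concrete illustration of the write-once, read-many access model, the sketch below uses the standard Hadoop FileSystem Java API; the file path is hypothetical, and it assumes the client can reach a running HDFS NameNode (via fs.defaultFS from the cluster configuration).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);           // handle to the configured HDFS

    Path file = new Path("/user/demo/events.txt");  // hypothetical path

    // Write once: the file is created, written sequentially, then closed
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("event-1\nevent-2\n");
    }

    // Read many: later jobs stream the file; there are no in-place updates
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```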
What HDFS is not good for Amir Payberah https://www.sics.se/~amir/dic.htm
HDFS Architecture • The Hadoop Distributed File System (HDFS) • Offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to or greater than the summed total size of the files • HDFS is designed to be fault-tolerant • Using data replication and distribution of data • When a file is loaded into HDFS, it is broken up into "blocks" of data, each of which is replicated • These blocks are stored across the cluster nodes designated for storage, a.k.a. DataNodes. http://www.revelytix.com/?q=content/hadoop-ecosystem
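The block placement described above can be observed from a client: the following sketch (standard FileSystem API; the file path is an assumption) asks the NameNode for the block-to-DataNode mapping of an existing file and prints which hosts store each block.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/big-input.dat");  // hypothetical file already in HDFS

    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block; each lists the DataNodes holding a replica
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
  }
}
```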
Files and Blocks – 1/3 Amir Payberah https://www.sics.se/~amir/dic.htm
Files and Blocks – 2/3 Amir Payberah https://www.sics.se/~amir/dic.htm
Files and Blocks – 3/3 Amir Payberah https://www.sics.se/~amir/dic.htm
HDFS Daemons • An HDFS cluster is managed by three types of processes • Namenode • Manages the filesystem, e.g., namespace, metadata, and file blocks • Metadata is stored in memory • Datanode • Stores and retrieves data blocks • Reports to Namenode • Runs on many machines • Secondary Namenode • Only for checkpointing • Not a backup for Namenode Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Server Roles http://www.revelytix.com/?q=content/hadoop-ecosystem
NameNode – 1/3 • The HDFS namespace is a hierarchy of files and directories • These are represented in the NameNode using inodes • Inodes record attributes • permissions, modification and access times; • namespace and disk space quotas. • The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file) • The NameNode maintains the namespace tree and the mapping of blocks to DataNodes • A Hadoop cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently http://www.revelytix.com/?q=content/hadoop-ecosystem
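Because block size and replication factor are user-selectable file-by-file, a client may override the cluster defaults when a file is created. A minimal sketch, where the 64 MB block size, replication factor of 2 and file path are arbitrary assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSettings {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/custom.dat");   // hypothetical path

    short replication = 2;                           // instead of the typical 3
    long blockSize = 64L * 1024 * 1024;              // 64 MB instead of the typical 128 MB
    int bufferSize = 4096;

    // Per-file block size and replication are passed at creation time
    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.writeBytes("data written with per-file settings\n");
    }

    // Replication can also be changed after the fact; block size cannot
    fs.setReplication(file, (short) 3);
  }
}
```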