COS 497 - Cloud Computing
Hadoop and Friends
A review of the Cloud's major software applications
Hadoop
One of Hadoop's developers named it after his son's toy elephant!
Hadoop
▪ Hadoop is the practical implementation in Java of Google's MapReduce model.
- Best known and most widely used MapReduce implementation.
▪ Hadoop is open-source, developed by Yahoo!, now distributed by the Apache Foundation.
▪ A software "industry" has grown up around Hadoop - Hadoop is now supplemented by a range of Cloud software projects, such as Pig, Hive and ZooKeeper. Most of these are also distributed by Apache.
Hadoop
▪ Hadoop is a MapReduce software platform.
- Provides a framework for running MapReduce applications.
- This framework understands and assigns work to the nodes in a cluster.
- Handles the mapping and reducing logistics; the programmer just provides the custom functionality.
▪ Currently takes custom functionality written in Java or Python.
▪ An open-source Eclipse plug-in can be used to interface with Hadoop.
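To make that division of labor concrete, here is a minimal sketch of the custom functionality a programmer supplies - the classic word-count example, written against Hadoop's Java MapReduce API (the class name WordCount is illustrative; input and output paths come from the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in this task's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts gathered for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything else - splitting the input, scheduling Map and Reduce tasks across the cluster, shuffling the intermediate (word, count) pairs - is handled by the framework.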
• The Hadoop framework consists of two main layers:
- Distributed file system (HDFS)
- Distributed execution engine (MapReduce)
Both the compute and storage layers of Hadoop work with the master/slave model.
HDFS
▪ Hadoop makes use of HDFS for data storage - the file system that spans all the nodes in a Hadoop cluster.
▪ It links together the file systems on many local nodes to make them into one big file system.
▪ HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
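A quick sketch of how an application sees HDFS through Hadoop's Java FileSystem API - one big file system addressed by paths, regardless of which nodes actually hold the blocks (the path and file contents here are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (e.g. hdfs://namenode:9000) from the cluster config.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // illustrative path

    // Write: the client streams blocks to DataNodes; HDFS replicates them.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("Hello, HDFS");
    }

    // Read: one logical file, wherever its replicated blocks actually live.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}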
Prominent Hadoop Users
Yahoo!
▪ In February, 2008, Yahoo! launched what it claimed was the world's largest Hadoop production application.
- The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000-core Linux cluster, and produces data that is used in every Yahoo! Web search query.
▪ There are multiple Hadoop clusters at Yahoo!
▪ In 2009, Yahoo! made the source code of the version of Hadoop it runs in production available to the public. ▪ Yahoo! still contributes all work it does on Hadoop to the open-source community. - The company's developers also fix bugs and provide stability improvements internally, and release this patched source code so that other users may benefit from their effort.
Facebook ▪ In 2010, Facebook claimed that they had the largest Hadoop cluster in the world with 21 PB of storage. ▪ In July, 2011 they announced the data had grown to 30 PB. ▪ In June, 2012 they announced the data had grown to 100 PB. ▪ In November, 2012 they announced their data warehouse grows by roughly half a PB a day. Other users ▪ Besides Facebook and Yahoo!, many other (big) organizations are using Hadoop to run large distributed computations.
Building Blocks of Hadoop
▪ In a fully configured cluster, "running Hadoop" means running a set of tasks on the different servers in the network.
▪ These tasks have specific roles - some exist only on one server, some exist across multiple servers.
▪ The tasks include:
- NameNode
- Secondary NameNode
- DataNode
- JobTracker
- TaskTracker
JobTracker
▪ The JobTracker task is the liaison between the application and Hadoop.
▪ Once code is submitted to the cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they are running.
▪ Should a task fail, the JobTracker will automatically re-launch the task, possibly on a different node, up to a predefined limit of retries.
▪ There is only one JobTracker per Hadoop cluster. It is typically run on a server as a master node of the cluster.
TaskTracker
▪ The JobTracker is the master overseeing the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node.
▪ Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.
▪ Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple copies to handle many Map or Reduce tasks in parallel.
▪ One responsibility of the TaskTracker is to constantly communicate with the JobTracker.
▪ If the JobTracker fails to receive a "heartbeat" from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
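The "predefined limit of retries" mentioned above is configurable per job. A minimal sketch using the classic (Hadoop 1.x) mapred API - the helper class RetryConfig and the job class passed in are hypothetical:

import org.apache.hadoop.mapred.JobConf;

public class RetryConfig {
  // Returns a job configuration with explicit task-retry limits.
  public static JobConf configure(Class<?> jobClass) {
    JobConf conf = new JobConf(jobClass);
    // How many times a failed task may be re-launched (possibly on another
    // node) before the whole job is declared failed; 4 is the default.
    conf.setMaxMapAttempts(4);
    conf.setMaxReduceAttempts(4);
    return conf;
  }
}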
Pig and Pig Latin
Pig was developed at Yahoo! to allow people using Hadoop to focus more on analyzing large data sets, and spend less time having to write Map and Reduce programs.
Like actual pigs, which eat almost anything, the Pig programming language is designed to handle any kind of data - hence the name!
Pig now runs about 30% of Yahoo!'s jobs.
Features
- Expresses sequences of MapReduce jobs
- Provides relational (SQL-like) operators
- Easy to plug in Java functions
Pig is made up of two components:
- The first is the language itself, which is called Pig Latin, and
- The second is a runtime environment where Pig Latin programs are executed.
(Think of the analogous relationship between a Java Virtual Machine (JVM) and a Java program.)
Pig Latin
Pig Latin is the language ...
It is a hybrid between a high-level declarative query language, such as SQL, and a low-level procedural language, such as C++, Java or Python.
Like Hadoop, it is Apache open source - mostly built by Yahoo!.
Example: Average score per category (i.e. domain)
Input table: pages(url, category, score)
Problem: find, for each sufficiently large category, the average score of high-scoring web pages in that category.
SQL solution:

SELECT category, AVG(score)
FROM pages
WHERE score > 0.5
GROUP BY category
HAVING COUNT(*) > 1000000

Pig Latin solution:

topPages = FILTER pages BY score > 0.5;
groups = GROUP topPages BY category;
largeGroups = FILTER groups BY COUNT(topPages) > 1000000;
output = FOREACH largeGroups GENERATE category, AVG(topPages.score);
An Example Problem
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.

The dataflow:
Load Users / Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5
In Pig Latin ...

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Ease of Translation
Each step of the dataflow maps almost directly onto one Pig Latin statement:

Load Users → Users = load ...
Load Pages → Pages = load ...
Filter by age → Fltrd = filter ...
Join on name → Joined = join ...
Group on url → Grouped = group ...
Count clicks → Summed = ... COUNT() ...
Order by clicks → Sorted = order ...
Take top 5 → Top5 = limit ...
Ease of Translation
Pig compiles the dataflow into a pipeline of MapReduce jobs:

Job 1: Load Users / Load Pages → Filter by age → Join on name
Job 2: Group on url → Count clicks
Job 3: Order by clicks → Take top 5
Hive
Hive is a data warehouse infrastructure built on top of Hadoop that can compile SQL-style queries into MapReduce jobs and run these jobs on a Hadoop cluster
- MapReduce for execution, HDFS for storage.
Key principles of Hive's design:
- SQL syntax familiar to data analysts
- Data that does not fit traditional RDBMS systems
- To process terabytes and petabytes of data
- Scalability and performance
Hive
Although Pig can be quite a powerful and simple language to use, the downside is that it is something extra new to learn and master. So, Facebook developed a runtime Hadoop support structure that allows anyone who is already fluent with SQL to leverage the Hadoop platform directly.
Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. HQL supports only a subset of standard SQL commands, but it is still pretty useful.
HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
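A sketch of submitting HQL from a Java program through Hive's JDBC driver (the host name, database, credentials and table are illustrative; the query is the category-average example from the Pig section, and 10000 is HiveServer2's default port):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "demo", "");
         Statement stmt = conn.createStatement()) {

      // An ordinary-looking SQL statement; the Hive service turns it into
      // MapReduce jobs, so expect minutes of latency, not milliseconds.
      ResultSet rs = stmt.executeQuery(
          "SELECT category, AVG(score) FROM pages " +
          "WHERE score > 0.5 GROUP BY category");

      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
      }
    }
  }
}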
Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences.
1. Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect from an RDBMS.
2. Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
ZooKeeper ZooKeeper is an open-source Apache package that provides a centralized infrastructure and services that enable synchronization across a cluster of servers. ZooKeeper maintains common objects needed in large cluster environments. Examples of these objects include configuration information, hierarchical naming space, and so on. Applications can leverage these services to coordinate distributed processing across large clusters.
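A sketch of how an application might store and read shared configuration through ZooKeeper's Java client API (the connection string, znode path and data are illustrative):

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigDemo {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (host:port list is illustrative).
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {});

    // Publish a piece of configuration as a znode in the hierarchical namespace.
    byte[] data = "max.workers=42".getBytes(StandardCharsets.UTF_8);
    if (zk.exists("/app-config", false) == null) {
      zk.create("/app-config", data,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any node in the cluster can now read the same, centrally coordinated value.
    byte[] read = zk.getData("/app-config", false, null);
    System.out.println(new String(read, StandardCharsets.UTF_8));

    zk.close();
  }
}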
Mahout
Scalable machine learning and data mining.
The name comes from India: a mahout is a person who rides an elephant.
Apache Mahout has implementations of a wide range of machine learning and data mining algorithms:
- Clustering, classification, collaborative filtering (CF) and frequent pattern mining
Mahout's algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the MapReduce paradigm.
Mahout allows the development of intelligent Cloud applications that learn from data and user input.
Machine Learning - 101
Machine learning uses run the range from game playing to fraud detection to stock-market analysis.
It is used to build systems like those at Netflix and Amazon that recommend products to users based on past purchases, or systems that find all of the similar news articles on a given day.
It can also be used to categorize Web pages automatically according to genre (sports, economy, and so on) or to mark e-mail messages as spam.
http://www.ibm.com/developerworks/library/j-mahout/
Collaborative Filtering
Used in recommender systems, e.g. Amazon.
Helps users find items they might like, based on historical preferences.
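A sketch of user-based collaborative filtering with Mahout's "Taste" recommender API - the non-distributed, in-memory flavor, for brevity (the file name and parameter values are illustrative; the file holds lines of userID,itemID,rating):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    // Historical preferences: CSV lines of userID,itemID,rating.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Users are "similar" if their past ratings correlate.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Consider each user's 10 most similar neighbors.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Recommend 3 items for user 1, based on what similar users liked.
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " (score " + item.getValue() + ")");
    }
  }
}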
Classification
- Assigning data to discrete categories
- Train a model on labeled data
- Run the model on new, unlabeled data
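To make the train-then-classify split concrete, here is a toy nearest-centroid classifier in plain Java - not Mahout's API, just an illustration of training a model on labeled data and then running it on new, unlabeled data:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NearestCentroid {
  private final Map<String, double[]> centroids = new HashMap<>();

  // Training: average the feature vectors of each labeled category.
  public void train(Map<String, List<double[]>> labeled) {
    for (Map.Entry<String, List<double[]>> e : labeled.entrySet()) {
      int dims = e.getValue().get(0).length;
      double[] centroid = new double[dims];
      for (double[] point : e.getValue())
        for (int i = 0; i < dims; i++)
          centroid[i] += point[i] / e.getValue().size();
      centroids.put(e.getKey(), centroid);
    }
  }

  // Prediction: assign an unlabeled point to the closest centroid's category.
  public String classify(double[] point) {
    String best = null;
    double bestDist = Double.MAX_VALUE;
    for (Map.Entry<String, double[]> e : centroids.entrySet()) {
      double d = 0;
      for (int i = 0; i < point.length; i++) {
        double diff = point[i] - e.getValue()[i];
        d += diff * diff;
      }
      if (d < bestDist) { bestDist = d; best = e.getKey(); }
    }
    return best;
  }
}

At Hadoop scale you would instead use one of Mahout's classifiers (e.g. its naive Bayes implementation), so that training can run as MapReduce jobs over data in HDFS.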