Hadoop on Azure 101: What is the Big Deal? Dennis Mulder, Solution Architect – Global Windows Azure Center of Excellence, Microsoft Corporation
Agenda Why Big Data? Understanding the Basics Microsoft and Hadoop
1.8 ZETTABYTES • of information will be created in 2011 • Source: CenturyLink resource center, as reported in ReadWriteWeb, Nov 17, 2011
7.9 ZETTABYTES by 2015 • Source: CenturyLink resource center, as reported in ReadWriteWeb, Nov 17, 2011
Bing ingests more than 7 petabytes a month. The Twitter community generates over 1 terabyte of tweets every day. Cisco predicts that by 2013, annual internet traffic will reach 667 exabytes. Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
The Potential: Solving Specific Industry Problems eCommerce: mining web logs: collaborative filtering, user experience optimisation… Manufacturing: detecting trends and anomalies in sensor data: predicting and understanding faults Capital Markets: joining market and external data: correlation detection for investment strategy identification, risk calculations… Retail Banking: historical transaction mining: fraud detection, customer segmentation… Industry-specific data-sets leveraged to improve decision making and generate new revenue streams
Traditional E-Commerce Data Flow: operational data (new user registry, new purchase, new product) passes through an ETL process into the Data Warehouse. Only some data is kept; logs and excess data are discarded.
New E-Commerce Big Data Flow: operational data (new user registry, new purchase, new product) still feeds the Data Warehouse, but logs and raw data now also land in a “store it all” cluster. That makes new questions answerable, e.g.: how much do views for certain products increase when our TV ads run?
So How Does It Work? FIRST, STORE THE DATA: files are split up and distributed across the servers in the cluster.
So How Does It Work? SECOND, TAKE THE PROCESSING TO THE DATA: the runtime ships the code to the servers that hold the data.
// Map and Reduce functions in JavaScript (word count)
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") context.write(words[i].toLowerCase(), 1);
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
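These two functions never run on their own; the framework splits the input, calls map on each record, shuffles the intermediate pairs by key, and calls reduce once per key. A minimal local harness sketches that flow in plain Node.js. This is not the Hadoop JavaScript runtime: runJob, the context object, and the values iterator are stand-ins invented here for what the framework would supply.

```javascript
// Word-count map/reduce in the same shape as the slide's example.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") context.write(words[i].toLowerCase(), 1);
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local driver: map each line, group intermediate
// pairs by key (the "shuffle"), then reduce each group.
function runJob(lines) {
    var groups = {};
    lines.forEach(function (line, i) {
        map(i, line, { write: function (k, v) {
            (groups[k] = groups[k] || []).push(v);
        }});
    });
    var output = {};
    Object.keys(groups).forEach(function (k) {
        var idx = 0, vals = groups[k];
        var iter = {
            hasNext: function () { return idx < vals.length; },
            next: function () { return vals[idx++]; }
        };
        reduce(k, iter, { write: function (key, v) { output[key] = v; }});
    });
    return output;
}

console.log(runJob(["Big Data", "big deal"]));
// { big: 2, data: 1, deal: 1 }
```

On a real cluster the shuffle happens over the network between machines; the in-memory `groups` object here only mimics that grouping.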
MapReduce – Workflow A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. Input Domain -> Map -> Intermediate Domain -> Reduce -> Output Domain
Map Scenario: get the sum of sales grouped by zipCode, over (custId, zipCode, amount) records. Blocks of the Sales file in HDFS are read in parallel by map tasks on DataNode1, DataNode2, and DataNode3; each Group By mapper emits one output bucket per reduce task. [Diagram: sample zip codes and dollar amounts flowing through the three mappers.]
Reduce: the mappers’ output buckets are shuffled to the reduce tasks; each reducer sorts its input by zip code and sums the amounts, producing one total per zip code. Done! [Diagram: per-zip-code amounts being sorted and summed into totals for 02115, 10025, 44313, 53705, and 54235.]
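The zip-code scenario above can be sketched end to end in the same JavaScript style. This is an in-memory illustration, not Hadoop: the sales records are made-up sample data, and the three phases are marked explicitly so the map/shuffle/reduce boundaries are visible.

```javascript
// Invented sample (custId, zipCode, amount) records for illustration.
var sales = [
    { custId: 1, zipCode: "02115", amount: 75 },
    { custId: 2, zipCode: "53705", amount: 55 },
    { custId: 3, zipCode: "02115", amount: 15 },
    { custId: 4, zipCode: "54235", amount: 95 }
];

// Map phase: emit one (zipCode, amount) pair per record.
var pairs = sales.map(function (r) { return [r.zipCode, r.amount]; });

// Shuffle phase: group amounts by zip code.
var groups = {};
pairs.forEach(function (p) {
    (groups[p[0]] = groups[p[0]] || []).push(p[1]);
});

// Reduce phase: sum each group to get one total per zip code.
var totals = {};
Object.keys(groups).forEach(function (zip) {
    totals[zip] = groups[zip].reduce(function (a, b) { return a + b; }, 0);
});

console.log(totals); // { '02115': 90, '53705': 55, '54235': 95 }
```

Because each zip code's group is independent, the reduce step parallelizes naturally: Hadoop assigns each key range to a reduce task on a different machine.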
Hadoop Architecture: a MapReduce layer (one Job Tracker coordinating Task Trackers on each node) sits on top of an HDFS layer (one Name Node coordinating Data Nodes). Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png
Traditional RDBMS vs. MapReduce • Reference: Tom White’s Hadoop: The Definitive Guide
The Hadoop Ecosystem: ETL tools, BI reporting, and RDBMS integration sit on top of ZooKeeper (coordination), Pig (data flow), Hive (SQL), Sqoop, Avro (serialization), MapReduce (job scheduling/execution system), and HBase (column DB), all built over HDFS (Hadoop Distributed File System). • Reference: Tom White’s Hadoop: The Definitive Guide
Hadoop on Azure: the Name Node and Data Nodes run in Azure, with HDFS alongside Azure Blob Storage and SQL Azure. Data sources: on-premise enterprise content (transactional DBs, on-prem logs, internal sensors); cloud enterprise content (generated in Azure, application end points, generated/stored elsewhere, e.g. S3); 3rd-party content (Azure DataMarket, public content, delivered online). Key questions: What does Hadoop in the cloud mean? Where is HDFS? Where is my data stored? Azure Blob Storage vs. HDFS?
Detailed Offerings INSIGHTS: Hive ODBC Driver & Hive Add-in for Excel; integration with Microsoft PowerPivot. ENTERPRISE READY: Hadoop-based distribution for Windows Server & Azure; strategic partnership with Hortonworks. BROADER ACCESS: JavaScript framework for Hadoop; RTM of Hadoop connectors for SQL Server and PDW.
Microsoft Big Data Solution FAMILIAR END USER TOOLS: Excel with PowerPivot, Power View, Predictive Analytics, Embedded BI. BI PLATFORM: SSAS, SSRS, Microsoft EDW, Connectors, Hadoop on Windows Azure, Hadoop on Windows Server. UNSTRUCTURED & STRUCTURED DATA: sensors, devices, bots, crawlers, ERP, CRM, LOB apps.
Demo: Deploying and Interacting With a Hadoop Cluster on Azure
Hadoop on Windows: insights to all users by activating new types of data. Differentiation INSIGHTS: integrate with Microsoft Business Intelligence. ENTERPRISE READY: choice of deployment on Windows Server + Windows Azure; integrate with Windows components (AD, System Center). BROADER ACCESS: easy installation and configuration of Hadoop on Windows; simplified programming with .NET & JavaScript integration; integrate with SQL Server data warehousing. • Contributions proposed back to the community distribution
Summary Hadoop is about massive compute and massive data. The code is brought to the data. Map -> split the work; Reduce -> combine the results. Relational databases vs. Hadoop? Wrong question: they serve different needs.
Resources http://www.hadooponazure.com/ http://hadoop.apache.org/