Hadoop on Azure 101: What is the Big Deal? • Dennis Mulder, Solution Architect – Global Windows Azure Center of Excellence, Microsoft Corporation
Agenda • Why Big Data? • Understanding the Basics • Microsoft and Hadoop
Why Big Data? • 1.8 ZETTABYTES of information will be created in 2011 • Source: CenturyLink resource center, as reported by ReadWriteWeb, Nov 17, 2011
7.9 ZETTABYTES by 2015 • Source: CenturyLink resource center, as reported by ReadWriteWeb, Nov 17, 2011
Bing ingests more than 7 petabytes a month • The Twitter community generates over 1 terabyte of tweets every day • Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes • Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
The Potential: Solving Specific Industry Problems • eCommerce: mining web logs (collaborative filtering, user experience optimisation…) • Manufacturing: detecting trends and anomalies in sensor data (predicting and understanding faults) • Capital Markets: joining market and external data (correlation detection for investment strategy identification, risk calculations…) • Retail Banking: historical transaction mining (fraud detection, customer segmentation…) • Industry-specific data sets are leveraged to improve decision making and generate new revenue streams
Traditional E-Commerce Data Flow • [Diagram: operational data (new user registry, new purchase, new product) flows through ETL into the data warehouse; some data from the logs is loaded as well, and the rest is labelled excess data.]
New E-Commerce Big Data Flow • [Diagram: operational data (new user registry, new purchase, new product) flows into the data warehouse; the logs are kept as raw data in a “Store it All” cluster.] • Example question: How much do views for certain products increase when our TV ads run?
So How Does It Work? • FIRST, STORE THE DATA • [Diagram: files are spread across several servers.]
So How Does It Work? • SECOND, TAKE THE PROCESSING TO THE DATA • [Diagram: the runtime sends the code to each of the servers.]

// MapReduce functions in JavaScript

// map: split one line of text into words and emit (word, 1) for each word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// reduce: add up all the counts emitted for one word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
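To try the word-count functions without a cluster, a small in-memory harness can be sketched as follows. This is only an illustration: `makeIterator` and `runLocally` are hypothetical helpers, and the `context.write` and `hasNext()`/`next()` shapes are assumed from the slide code rather than taken from any documented API.

```javascript
// Word-count map and reduce functions, as on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical stand-in for the iterator that is passed to reduce.
function makeIterator(items) {
    var index = 0;
    return {
        hasNext: function () { return index < items.length; },
        next: function () { return items[index++]; }
    };
}

// Hypothetical local runner: map every line, group values by key,
// then reduce each group in sorted key order.
function runLocally(lines) {
    var grouped = {};
    var mapContext = {
        write: function (k, v) {
            if (!Object.prototype.hasOwnProperty.call(grouped, k)) {
                grouped[k] = [];
            }
            grouped[k].push(v);
        }
    };
    lines.forEach(function (line, offset) {
        map(offset, line, mapContext);
    });

    var result = {};
    var reduceContext = {
        write: function (k, v) { result[k] = v; }
    };
    Object.keys(grouped).sort().forEach(function (k) {
        reduce(k, makeIterator(grouped[k]), reduceContext);
    });
    return result;
}

// Illustrative input lines, not taken from the deck.
var counts = runLocally([
    "Hadoop on Azure",
    "hadoop stores data",
    "Azure stores blobs"
]);
console.log(counts);
```

On a real cluster the grouping step is done by the framework across machines; here it is a plain object in memory, which is enough to check the logic of `map` and `reduce`.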
MapReduce – Workflow • A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner • The framework sorts the outputs of the maps, which are then input to the reduce tasks • The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks • [Diagram: input domain → map tasks → intermediate domains → reduce tasks → output domain]
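The workflow above can be imitated in a single process. The sketch below is a simplified model, not Hadoop's implementation: `splitIntoChunks`, `runWithRetry` and `runJob` are hypothetical names, the chunks run one after another rather than in parallel, and the product names are made-up sample data.

```javascript
// Split the input into independent chunks of at most chunkSize records.
function splitIntoChunks(records, chunkSize) {
    var chunks = [];
    for (var i = 0; i < records.length; i += chunkSize) {
        chunks.push(records.slice(i, i + chunkSize));
    }
    return chunks;
}

// Run a task and re-execute it if it throws, up to maxAttempts times.
function runWithRetry(task, maxAttempts) {
    var lastError;
    for (var attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return task();
        } catch (err) {
            lastError = err;
        }
    }
    throw lastError;
}

function runJob(records, chunkSize, mapFn, reduceFn) {
    // Map phase: each chunk is processed independently of the others.
    // Output of a failed attempt is discarded because pairs is rebuilt.
    var mapOutputs = splitIntoChunks(records, chunkSize).map(function (chunk) {
        return runWithRetry(function () {
            var pairs = [];
            chunk.forEach(function (record) {
                mapFn(record, function (k, v) { pairs.push([k, v]); });
            });
            return pairs;
        }, 3);
    });

    // Sort phase: merge all map outputs and order them by key.
    var sorted = [].concat.apply([], mapOutputs).sort(function (a, b) {
        return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
    });

    // Reduce phase: call reduceFn once per run of equal keys.
    var output = [];
    var i = 0;
    while (i < sorted.length) {
        var key = sorted[i][0];
        var values = [];
        while (i < sorted.length && sorted[i][0] === key) {
            values.push(sorted[i][1]);
            i++;
        }
        output.push([key, reduceFn(key, values)]);
    }
    return output;
}

// Count product views; one map task fails once to exercise the retry.
var failedOnce = false;
var views = runJob(
    ["tv", "laptop", "tv", "phone", "tv", "laptop"],
    2,
    function (product, emit) {
        if (product === "phone" && !failedOnce) {
            failedOnce = true;
            throw new Error("simulated task failure");
        }
        emit(product, 1);
    },
    function (key, values) {
        return values.reduce(function (a, b) { return a + b; }, 0);
    }
);
console.log(views);
```

Because each attempt builds its own `pairs` array, the partial output of the failed attempt never reaches the sort phase, which mirrors the idea that a re-executed task replaces the failed one.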
Map • Scenario: get the sum of sales grouped by zipCode • Input rows: (custId, zipCode, amount) • [Diagram: blocks of the Sales file in HDFS sit on DataNode1, DataNode2 and DataNode3; each block is read by a map task whose Group By mapper writes one output bucket per reduce task.]
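A map task for this scenario could look roughly like the sketch below. The `partition` hash and the `mapTask` helper are assumptions made for illustration, and the sample rows only reuse zip codes that appear on the slide; the custId and amount pairings are invented.

```javascript
// Hypothetical partitioner: maps a key to one of numReducers buckets,
// so that every pair with the same zipCode lands in the same bucket.
function partition(key, numReducers) {
    var hash = 0;
    for (var i = 0; i < key.length; i++) {
        hash = (hash * 31 + key.charCodeAt(i)) % 1000003;
    }
    return hash % numReducers;
}

// Map task: read (custId, zipCode, amount) rows from one block and
// write (zipCode, amount) pairs into one output bucket per reduce task.
function mapTask(block, numReducers) {
    var buckets = [];
    for (var r = 0; r < numReducers; r++) {
        buckets.push([]);
    }
    block.forEach(function (row) {
        var target = partition(row.zipCode, numReducers);
        buckets[target].push([row.zipCode, row.amount]);
    });
    return buckets;
}

// Illustrative block of the Sales file (values are made up).
var block = [
    { custId: 4, zipCode: "53705", amount: 55 },
    { custId: 7, zipCode: "02115", amount: 75 },
    { custId: 2, zipCode: "44313", amount: 25 },
    { custId: 5, zipCode: "53705", amount: 22 },
    { custId: 9, zipCode: "10025", amount: 15 },
    { custId: 1, zipCode: "54235", amount: 95 }
];
var buckets = mapTask(block, 3);
console.log(buckets);
```

The custId is dropped at this point because the scenario only needs the zip code as the key and the amount as the value.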
Reduce • [Diagram: the mappers' output buckets are shuffled to the reduce tasks; each reducer sorts its input by zipCode and applies SUM, giving one total per zipCode (02115, 10025, 44313, 53705, 54235).] • Done!
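The reduce side of the scenario can be sketched in the same spirit. `reduceTask` is a hypothetical helper, and the input pairs are illustrative values rather than the figures from the slide.

```javascript
// Reduce task: merge the buckets received from each mapper (shuffle),
// sort the pairs by zipCode, then sum the amounts for each zip code.
function reduceTask(bucketsFromMappers) {
    var pairs = [].concat.apply([], bucketsFromMappers);
    pairs.sort(function (a, b) {
        return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
    });

    var totals = [];
    pairs.forEach(function (pair) {
        var last = totals[totals.length - 1];
        if (last && last[0] === pair[0]) {
            // Same zipCode as the previous pair: keep accumulating.
            last[1] += pair[1];
        } else {
            // New zipCode: start a new running total.
            totals.push([pair[0], pair[1]]);
        }
    });
    return totals;
}

var totals = reduceTask([
    [["53705", 55], ["02115", 75], ["53705", 22]], // from mapper 1
    [["02115", 15], ["53705", 30]],                // from mapper 2
    [["53705", 10], ["02115", 65]]                 // from mapper 3
]);
console.log(totals); // [["02115", 155], ["53705", 117]]
```

Sorting first means the reducer only has to compare each pair with the previous one, so it never needs to hold more than one running total at a time.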
Hadoop Architecture • MapReduce layer: job tracker, task trackers • HDFS layer: name node, data nodes • Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png
Traditional RDBMS vs. MapReduce • Reference: Tom White’s Hadoop: The Definitive Guide
The Hadoop Ecosystem • ETL Tools • BI Reporting • RDBMS • ZooKeeper (Coordination) • Pig (Data Flow) • Hive (SQL) • Sqoop • Avro (Serialization) • MapReduce (Job Scheduling/Execution System) • HBase (Column DB) • HDFS (Hadoop Distributed File System) • Reference: Tom White’s Hadoop: The Definitive Guide
Hadoop on Azure • What does Hadoop in the Cloud mean? • Where is HDFS? • Where is my data stored? • Azure Blob Storage vs. HDFS • On Premise Enterprise Content: transactional DBs, on-prem logs, internal sensors • Cloud Enterprise Content: generated in Azure, generated/stored elsewhere • 3rd Party Content: Azure Datamarket, public content, delivered online • [Diagram labels: Name Node, Data Nodes, HDFS, Azure Blob Storage, SQL Azure, Application end point, S3]
Detailed Offerings • INSIGHTS • Hive ODBC Driver & Hive Add-in for Excel • Integration with Microsoft PowerPivot • Hadoop-based distribution for Windows Server & Azure • Strategic Partnership with Hortonworks • ENTERPRISE READY • JavaScript framework for Hadoop • RTM of Hadoop connectors for SQL Server and PDW • BROADER ACCESS
Microsoft Big Data Solution • FAMILIAR END USER TOOLS: Excel with PowerPivot, Power View, Predictive Analytics, Embedded BI • BI PLATFORM: SSAS, SSRS • Microsoft EDW • Connectors • Hadoop on Windows Azure • Hadoop on Windows Server • UNSTRUCTURED & STRUCTURED DATA: Sensors, Devices, Bots, Crawlers, ERP, CRM, LOB Apps
Demo: Deploying and Interacting With a Hadoop Cluster on Azure
Hadoop on Windows: Insights to all users by activating new types of data • Differentiation • INSIGHTS • Integrate with Microsoft Business Intelligence • Choice of deployment on Windows Server + Windows Azure • Integrate with Windows Components (AD, System Center) • ENTERPRISE READY • Easy installation and configuration of Hadoop on Windows • Simplified programming with .NET & JavaScript integration • Integrate with SQL Server Data Warehousing • BROADER ACCESS • Contributions proposed back to community distribution
Summary • Hadoop is about massive compute and massive data • The code is brought to the data • Map -> Split the work • Reduce -> Combine the results • Relational databases vs. Hadoop? Wrong question: they serve different needs
Resources • http://www.hadooponazure.com/ • http://hadoop.apache.org/