
Hadoop on Azure 101 What is the Big Deal?

Hadoop on Azure 101: What is the Big Deal? Dennis Mulder, Solution Architect – Global Windows Azure Center of Excellence, Microsoft Corporation. Agenda: Why Big Data? Understanding the Basics. Microsoft and Hadoop. Why Big Data? 1.8 zettabytes of information will be created in 2011.


Presentation Transcript


  1. Hadoop on Azure 101: What is the Big Deal? • Dennis Mulder, Solution Architect – Global Windows Azure Center of Excellence, Microsoft Corporation

  2. Agenda • Why Big Data? • Understanding the Basics • Microsoft and Hadoop

  3. Why Big Data?

  4. 1.8 ZETTABYTES of information will be created in 2011 • Source: CenturyLink resource center, as reported in ReadWriteWeb, Nov 17, 2011

  5. 7.9 ZETTABYTES by 2015 • Source: CenturyLink resource center, as reported in ReadWriteWeb, Nov 17, 2011

  6. Bing ingests > 7 petabytes a month • The Twitter community generates over 1 terabyte of tweets every day • Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes • Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp

  7. Example Scenarios

  8. The Potential: Solving Specific Industry Problems • eCommerce: mining web logs (collaborative filtering, user experience optimisation…) • Manufacturing: detecting trends and anomalies in sensor data (predicting and understanding faults) • Capital Markets: joining market and external data (correlation detection for investment strategy identification, risk calculations…) • Retail Banking: historical transaction mining (fraud detection, customer segmentation…) • Industry-specific data sets leveraged to improve decision making and generate new revenue streams

  9. Traditional E-Commerce Data Flow • [Diagram labels: Operational Data (New User Registry, New Purchase, New Product); ETL; Data Warehouse; Logs; Some Data; Excess Data]

  10. New E-Commerce Big Data Flow • [Diagram labels: Operational Data (New User Registry, New Purchase, New Product); Data Warehouse; Logs; Raw Data; “Store it All” Cluster] • Example question: How much do views for certain products increase when our TV ads run?

  11. Understanding the Basics: Move the Compute to the Data

  12. So How Does It Work? First, store the data • [Diagram labels: Files; four Servers]

  13. So How Does It Work? Second, take the processing to the data • [Diagram labels: Runtime; Code; four Servers]

      // Map Reduce function in JavaScript
      // map: emit (word, 1) for every non-empty word in the input line
      var map = function (key, value, context) {
        var words = value.split(/[^a-zA-Z]/);
        for (var i = 0; i < words.length; i++) {
          if (words[i] !== "") context.write(words[i].toLowerCase(), 1);
        }
      };
      // reduce: add up all the counts emitted for one word
      var reduce = function (key, values, context) {
        var sum = 0;
        while (values.hasNext()) {
          sum += parseInt(values.next(), 10);
        }
        context.write(key, sum);
      };
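The word-count functions on this slide can be tried without a cluster. The sketch below repeats them (with the transcript's typos repaired) and adds `runLocally`, a hypothetical in-memory stand-in for the Hadoop runtime. It is not part of any Hadoop or Azure API. It only assumes what the slide's code itself uses: a `context.write(key, value)` method and a `values` iterator with `hasNext()` and `next()`.

```javascript
// Word-count map and reduce functions, as on the slide (typos repaired).
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") context.write(words[i].toLowerCase(), 1);
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local driver (an assumption for illustration, not a real API).
function runLocally(lines) {
  // Map phase: collect every emitted value under its key.
  var grouped = new Map();
  lines.forEach(function (line, offset) {
    map(offset, line, {
      write: function (k, v) {
        if (!grouped.has(k)) grouped.set(k, []);
        grouped.get(k).push(v);
      }
    });
  });

  // Reduce phase: visit keys in sorted order, one reduce call per key.
  var output = {};
  Array.from(grouped.keys()).sort().forEach(function (k) {
    var vals = grouped.get(k);
    var pos = 0;
    var iterator = {
      hasNext: function () { return pos < vals.length; },
      next: function () { return vals[pos++]; }
    };
    reduce(k, iterator, {
      write: function (outKey, outValue) { output[outKey] = outValue; }
    });
  });
  return output;
}

var counts = runLocally(["Move the compute", "to the data"]);
console.log(counts);
```

Everything here runs in a single process, so it shows only the shape of the programming model, not the distribution of work across servers.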

  14. MapReduce – Workflow • A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner • The framework sorts the outputs of the maps, which are then input to the reduce tasks • The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks • [Diagram: Input Domain -> Map tasks -> Intermediate Domain -> Reduce tasks -> Output Domain]
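The three responsibilities listed on this slide (splitting input into chunks, sorting map output before the reduce, and re-executing failed tasks) can be sketched in a few lines. All names below (`runWithRetry`, `splitIntoChunks`, `runJob`) are hypothetical and chosen for illustration. The chunks run one after another in a single process, whereas a real cluster would run them in parallel on different nodes, and the retry limit of 3 is an arbitrary assumption.

```javascript
// Hypothetical helper: run a task, re-executing it if it throws.
function runWithRetry(task, maxAttempts) {
  for (let attempt = 1; ; attempt++) {
    try {
      return task();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
    }
  }
}

// Split the input into independent chunks of at most `size` records.
function splitIntoChunks(records, size) {
  const chunks = [];
  for (let i = 0; i < records.length; i += size) {
    chunks.push(records.slice(i, i + size));
  }
  return chunks;
}

// mapFn(record) returns a list of [key, value] pairs;
// reduceFn(key, values) returns one result for that key.
function runJob(records, mapFn, reduceFn, chunkSize) {
  // Map phase: one task per chunk, each retried independently.
  const mapOutputs = splitIntoChunks(records, chunkSize).map((chunk) =>
    runWithRetry(() => chunk.flatMap((record) => mapFn(record)), 3)
  );

  // Sort phase: order intermediate pairs by key so equal keys are adjacent.
  const pairs = mapOutputs
    .flat()
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));

  // Reduce phase: one reduce call per run of equal keys.
  const results = [];
  let i = 0;
  while (i < pairs.length) {
    const key = pairs[i][0];
    const values = [];
    while (i < pairs.length && pairs[i][0] === key) {
      values.push(pairs[i][1]);
      i++;
    }
    results.push([key, runWithRetry(() => reduceFn(key, values), 3)]);
  }
  return results;
}

// Example: the first map call fails once and succeeds when its chunk is re-run.
let failedOnce = false;
const flakyMap = (n) => {
  if (!failedOnce) {
    failedOnce = true;
    throw new Error("simulated task failure");
  }
  return [[n % 2 === 0 ? "even" : "odd", n]];
};
const sumValues = (key, values) => values.reduce((a, b) => a + b, 0);

const jobTotals = runJob([1, 2, 3, 4, 5], flakyMap, sumValues, 2);
console.log(jobTotals);
```

Retrying a whole chunk is safe here only because the map function has no side effects other than its return value, which is the same property that lets a framework re-run failed tasks.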

  15. Map • Scenario: Get sum sales grouped by zipCode (custId, zipCode, amount) • [Figure: blocks of the Sales file in HDFS on DataNode1, DataNode2 and DataNode3; a Group By Mapper map task runs on each block and writes one output bucket per reduce task]

  16. Reduce • [Figure: mapper output buckets are shuffled to the reduce tasks; each Reducer sorts its input by zipCode and computes a SUM per zipCode] • Done!
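The sum-of-sales-by-zipCode scenario on slides 15 and 16 can be sketched as follows. The sample rows are made up (the figure's actual values did not survive in the transcript), and `partition` is a hypothetical hash function chosen for illustration, not Hadoop's partitioner. The point it shows is the "one output bucket per reduce task" idea: every record for a given zip code goes to the same bucket, so each reduce task can sum its keys independently.

```javascript
// Made-up sample rows in the slide's (custId, zipCode, amount) shape.
const sales = [
  { custId: 4, zipCode: "53705", amount: 55 },
  { custId: 7, zipCode: "02115", amount: 75 },
  { custId: 2, zipCode: "53705", amount: 22 },
  { custId: 9, zipCode: "10025", amount: 15 },
  { custId: 5, zipCode: "02115", amount: 15 },
];

// Hypothetical partitioner: a simple string hash picks the reduce task,
// so all records with the same zip code land in the same bucket.
function partition(key, numReducers) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) % 1000003;
  }
  return hash % numReducers;
}

function sumSalesByZip(rows, numReducers) {
  // Map: emit (zipCode, amount) into one output bucket per reduce task.
  const buckets = Array.from({ length: numReducers }, () => []);
  for (const row of rows) {
    buckets[partition(row.zipCode, numReducers)].push([row.zipCode, row.amount]);
  }

  // Reduce: each task sorts its bucket by key (mirroring the Sort step),
  // then accumulates a SUM per zip code.
  const totals = new Map();
  for (const bucket of buckets) {
    bucket.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    for (const [zip, amount] of bucket) {
      totals.set(zip, (totals.get(zip) || 0) + amount);
    }
  }
  return totals;
}

const salesTotals = sumSalesByZip(sales, 3);
console.log(salesTotals);
```

Zip codes are kept as strings so that leading zeros, as in 02115, are preserved.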

  17. Hadoop

  18. Hadoop Architecture • MapReduce Layer: Job tracker, Task trackers • HDFS Layer: Name node, Data nodes • Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png

  19. Traditional RDBMS vs. MapReduce • Reference: Tom White’s Hadoop: The Definitive Guide

  20. The Hadoop Ecosystem • ETL Tools • BI Reporting • RDBMS • ZooKeeper (Coordination) • Pig (Data Flow) • Hive (SQL) • Sqoop • Avro (Serialization) • MapReduce (Job Scheduling/Execution System) • HBase (Column DB) • HDFS (Hadoop Distributed File System) • Reference: Tom White’s Hadoop: The Definitive Guide

  21. Microsoft and Hadoop

  22. Hadoop on Azure • What does Hadoop in the Cloud mean? • Where is HDFS? • Where is my data stored? • Azure Blob Storage vs. HDFS • [Diagram labels: Name Node; Data Nodes; HDFS; Azure Blob Storage; SQL Azure; S3; Application end point] • On Premise Enterprise Content: Transactional DBs, On Prem logs, Internal sensors • Cloud Enterprise Content: Generated in Azure • S3: Generated/stored elsewhere • 3rd Party Content: Azure Datamarket, Public content, Delivered online

  23. Detailed Offerings • Hive ODBC Driver & Hive Add-in for Excel • Integration with Microsoft PowerPivot • Hadoop based distribution for Windows Server & Azure • Strategic Partnership with Hortonworks • JavaScript framework for Hadoop • RTM of Hadoop connectors for SQL Server and PDW • [Pillar labels: Insights; Enterprise Ready; Broader Access]

  24. Microsoft Big Data Solution • Familiar end user tools: Excel with PowerPivot, Power View, Predictive Analytics, Embedded BI • BI platform: SSAS, SSRS • Microsoft EDW, Connectors, Hadoop On Windows Azure, Hadoop On Windows Server • Unstructured & structured data: Sensors, Devices, Bots, Crawlers, ERP, CRM, LOB Apps

  25. Demo: Deploying and Interacting With a Hadoop Cluster on Azure

  26. Hadoop on Windows: Insights to all users by activating new types of data • Differentiation • Integrate with Microsoft Business Intelligence • Choice of deployment on Windows Server + Windows Azure • Integrate with Windows Components (AD, System Center) • Easy installation and configuration of Hadoop on Windows • Simplified programming with .NET & JavaScript integration • Integrate with SQL Server Data Warehousing • Contributions proposed back to community distribution • [Pillar labels: Insights; Enterprise Ready; Broader Access]

  27. Summary • Hadoop is about massive compute and massive data • The code is brought to the data • Map -> Split the work • Reduce -> Combine the results • Relational databases vs. Hadoop? Wrong question: they serve different needs

  28. Resources • http://www.hadooponazure.com/ • http://hadoop.apache.org/
