HDInsight on Azure and Map-Reduce
Richard Conway, Windows Azure MVP, Elastacloud Limited
Agenda: Introduction; Big Data with HDInsight
Introduction
Solving problems through distribution
Some challenges become bound by hardware capacity; 24 hours on 1 machine can ideally become 1 hour on 24 machines.
These 24 machines require orchestration; jobs are divided into tasks, and tasks are distributed across a cluster.
Systems of software are required to facilitate the distribution; examples are Hadoop and HPC Server.
We will now provision a Hadoop cluster on Windows Azure.
Big Data vs Big Compute
Hadoop, HPC Server, Open MPI
All distributed compute works by taking a large JOB and breaking it into many smaller TASKS, which are then parallelised.
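The job-to-tasks idea can be sketched in a few lines of JavaScript. This is only an illustration under simple assumptions, not how Hadoop or HPC Server actually schedule work: `splitIntoTasks` and `assignRoundRobin` are hypothetical helper names, the "cluster" is just an array of node names, and every work item is assumed to take one hour.

```javascript
// Hypothetical helpers for illustration only; real schedulers (Hadoop, HPC Server)
// also deal with data locality, failures and retries.

// Break one large JOB (an array of work items) into TASKS of at most taskSize items.
function splitIntoTasks(job, taskSize) {
  const tasks = [];
  for (let i = 0; i < job.length; i += taskSize) {
    tasks.push(job.slice(i, i + taskSize));
  }
  return tasks;
}

// Hand tasks out to nodes in turn, so each node gets a roughly equal share.
function assignRoundRobin(tasks, nodes) {
  const plan = {};
  nodes.forEach(function (node) { plan[node] = []; });
  tasks.forEach(function (task, i) {
    plan[nodes[i % nodes.length]].push(task);
  });
  return plan;
}

// A job of 24 one-hour work items spread over 24 nodes: one item per node.
const job = Array.from({ length: 24 }, function (_, i) { return "item-" + i; });
const nodes = Array.from({ length: 24 }, function (_, i) { return "node-" + i; });
const plan = assignRoundRobin(splitIntoTasks(job, 1), nodes);

// Ideal wall-clock time is the largest share on any one node
// (orchestration overhead is ignored in this sketch).
const hoursOnBusiestNode = Math.max.apply(null, nodes.map(function (n) {
  return plan[n].reduce(function (sum, task) { return sum + task.length; }, 0);
}));
console.log(hoursOnBusiestNode); // 1
```

With fewer nodes than tasks the same plan simply gives each node several tasks, and the busiest node sets the overall running time.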
HPC: Head Node, Broker Node, Worker Nodes
Hadoop: Name Node, Name Node, Data Nodes
Understanding Big Data
KEY TRENDS
Device Explosion: >5.5 billion (70+% of global population)
Social Networks: >2 billion users
Cheap Storage: $100 gets you 3 million times more storage in 30 years
Ubiquitous Connection: web traffic 130 exabytes (10^18) in 2010, 1.6 zettabytes (10^21) in 2015
Sensor Networks: >10 billion
Inexpensive Computing: 10 MIPS/$ in 1980, 10M MIPS/$ in 2005
What is Big Data?
(Diagram: data volume plotted against velocity, variety and variability)
Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Tiers: ERP / CRM, WEB 2.0, Internet of things
Data types shown: Social Sentiment, Wikis / Blogs, Sensors / RFID / Devices, Click Stream, Audio / Video, Log Files, Mobile, Advertising, eCommerce, Collaboration, Spatial & GPS Coordinates, Digital Marketing, Data Market Feeds, Search Marketing, eGov Feeds, Contacts, Payables, Web Logs, Weather, Deal Tracking, Payroll, Sales Pipeline, Inventory, Recommendations, Text/Image
Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07
Big Data, BIG OPPORTUNITY
Big Data is a top priority for institutions
Software growth: 34% compound annual growth rate [2]
Services growth: 39% compound annual growth rate [2]
49% of CEOs and CIOs are planning big data projects
Sources:
[1] McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012
[2] IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012
Devices: Internet and Internet of things
Internet: 6+ billion people, 1.5 billion use the net; US: 4.3 devices per adult; Cable: 10 Mbps+, Fiber: 50-100 Mbps; laptops / tablets / smartphones; high-bandwidth access; billions of networked devices; global addressing; user-centric; communication-focus
Internet of things: trillions of computer-enabled devices which are part of the IoT; 100 kbit/sec; low-bandwidth last-mile connection; trillions of networked nodes; invisible devices; mostly addressed by local schemes; machine-centric; sensing-focus
Big Data Scenarios
Short History of Hadoop
Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale
Hadoop started as a part of the Nutch project
In Jan 2006 Doug Cutting started working on Hadoop at Yahoo
Factored out of Nutch in Feb 2006
First release of Apache Hadoop in September 2007
Jan 2008: Hadoop became a top-level Apache project
Hadoop Distributed Architecture
MapReduce layer: Job tracker, Task trackers
HDFS layer: Name node, Data nodes
Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png
MapReduce: Move Code to the Data
FIRST, STORE THE DATA
(Diagram: files spread across four servers)
So How Does It Work?
SECOND, TAKE THE PROCESSING TO THE DATA
RUNTIME
    // Map Reduce function in JavaScript

    // Map: split each line into words and emit (word, 1) for every non-empty word.
    var map = function (key, value, context) {
      var words = value.split(/[^a-zA-Z]/);
      for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
          context.write(words[i].toLowerCase(), 1);
        }
      }
    };

    // Reduce: add up all the counts emitted for one word.
    var reduce = function (key, values, context) {
      var sum = 0;
      while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
      }
      context.write(key, sum);
    };
(Diagram: the code is sent to each of the four servers)
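To try the word-count functions without a cluster, a small local stand-in for the runtime can be sketched as follows. `runLocally` is a hypothetical helper, not part of HDInsight; it only assumes what the slide's code itself relies on: a `context` object with a `write` method, and a `values` iterator with `hasNext` and `next`.

```javascript
// The map and reduce functions from the slide (with the stray brace removed).
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local stand-in for the cluster runtime; NOT the HDInsight API.
function runLocally(lines) {
  // Map phase: group every value the mapper writes under its key.
  var groups = Object.create(null);
  var mapContext = {
    write: function (k, v) {
      if (!(k in groups)) groups[k] = [];
      groups[k].push(v);
    }
  };
  lines.forEach(function (line, offset) { map(offset, line, mapContext); });

  // Shuffle & sort, then reduce: one reduce call per key, in sorted key order.
  var output = Object.create(null);
  var reduceContext = { write: function (k, v) { output[k] = v; } };
  Object.keys(groups).sort().forEach(function (k) {
    var vals = groups[k];
    var i = 0;
    var iterator = {
      hasNext: function () { return i < vals.length; },
      next: function () { return vals[i++]; }
    };
    reduce(k, iterator, reduceContext);
  });
  return output;
}

var counts = runLocally(["Move code to the data", "The data stays put"]);
console.log(counts.the, counts.data, counts.move); // 2 2 1
```

On a real cluster the map calls run on the servers that hold the data and only the grouped pairs move across the network; the stand-in above keeps everything in one process.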
Traditional RDBMS vs. NoSQL
Reference: Tom White's Hadoop: The Definitive Guide
Windows Azure HDInsight Service
Creating an HDInsight Cluster: Demo
HDINSIGHT / HADOOP Eco-System
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
Components shown:
JavaScript
C#, F#, .NET
Data Integration (ODBC / SQOOP / REST)
Relational (SQL Server)
Stats processing (RHadoop)
Machine Learning (Mahout)
Pipeline / workflow (Oozie)
Graph (Pegasus)
PDW Polybase
Metadata (HCatalog)
Event Driven Processing
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Event Pipeline (Flume)
Distributed Processing (MapReduce)
Business Intelligence (Excel, Power View, SSAS)
Distributed Storage (HDFS)
Active Directory (Security)
Monitoring & Deployment (System Center)
World's Data (Azure Data Marketplace)
Azure Storage Vault (ASV)
Storing Data with HDInsight
HDFS on Azure: Tale of two File Systems
(Diagram: the HDFS API in front of two file systems)
DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Nodes
Azure Storage (ASV) on Azure Blob Storage: Front ends, Partition Layer, Stream Layer
Azure Storage (ASV)
Default file system for HDInsight Service
Provides sharable, persistent, highly scalable storage with high availability (Azure Blob Store)
Azure Storage itself does not provide compute
Fast access from compute nodes to data in the same data center
Several file systems, addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
Requires storage key in core-site.xml:
    <property>
      <name>fs.azure.account.key.accountname</name>
      <value>enterthekeyvaluehere</value>
    </property>
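The addressing pattern and the key property can be put together with two small helpers. This is a sketch under assumptions: `asvUri` and `storageKeyProperty` are hypothetical names, the URI is assumed to use the usual `scheme://authority/path` form, and the container, account and path values are placeholders rather than real resources.

```javascript
// Hypothetical helper: fills in the asv[s] address pattern from the slide.
// secure = true selects the "asvs" scheme, otherwise "asv".
function asvUri(container, account, path, secure) {
  var scheme = secure ? "asvs" : "asv";
  var cleanPath = String(path).replace(/^\/+/, ""); // avoid a double slash
  return scheme + "://" + container + "@" + account +
    ".blob.core.windows.net/" + cleanPath;
}

// Hypothetical helper: renders the core-site.xml property shown on the slide
// for a given account name and storage key.
function storageKeyProperty(account, key) {
  return "<property>\n" +
    "  <name>fs.azure.account.key." + account + "</name>\n" +
    "  <value>" + key + "</value>\n" +
    "</property>";
}

// Placeholder values only.
var uri = asvUri("logs", "myaccount", "/2013/04/web.log", false);
console.log(uri); // asv://logs@myaccount.blob.core.windows.net/2013/04/web.log
console.log(storageKeyProperty("myaccount", "enterthekeyvaluehere"));
```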
Map Reduce
Examples in C#
Map/Reduce
Map/Reduce is a programming model for efficient distributed computing.
Input > Map > Shuffle & Sort > Reduce > Output
Efficiency comes from streaming through data, reducing seeks.
A good fit for a lot of applications:
  Log processing
  Web index building
  Data mining and machine learning
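The Input > Map > Shuffle & Sort > Reduce > Output pipeline can be illustrated in memory with the log-processing case. This is a single-process sketch, not Hadoop's implementation: `mapReduce` is a hypothetical function, and the log line format ("METHOD path status") is an assumption made for the example.

```javascript
// Minimal in-memory sketch of Input > Map > Shuffle & Sort > Reduce > Output.
function mapReduce(input, mapper, reducer) {
  // Map: each input record yields zero or more [key, value] pairs.
  const pairs = [];
  input.forEach(function (record) {
    mapper(record).forEach(function (pair) { pairs.push(pair); });
  });

  // Shuffle & Sort: order pairs by key so equal keys sit next to each other.
  pairs.sort(function (a, b) {
    return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
  });

  // Reduce: stream through the sorted pairs once, one reducer call per run of equal keys.
  const output = [];
  let start = 0;
  for (let i = 1; i <= pairs.length; i++) {
    if (i === pairs.length || pairs[i][0] !== pairs[start][0]) {
      const values = pairs.slice(start, i).map(function (p) { return p[1]; });
      output.push([pairs[start][0], reducer(pairs[start][0], values)]);
      start = i;
    }
  }
  return output;
}

// Log processing: count requests per HTTP status code (assumed log format).
const logLines = [
  "GET /index.html 200",
  "GET /missing 404",
  "POST /login 200",
  "GET /about 200"
];

const statusCounts = mapReduce(
  logLines,
  function (line) {
    const fields = line.split(" ");
    return [[fields[fields.length - 1], 1]]; // emit (status, 1)
  },
  function (status, ones) {
    return ones.reduce(function (sum, n) { return sum + n; }, 0);
  }
);
console.log(JSON.stringify(statusCounts)); // [["200",3],["404",1]]
```

Sorting before reducing is what lets the reduce step read its input sequentially, which is the "streaming through data, reducing seeks" point made above.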
Hadoop SDK
C# integration
Remote Data & Jobs
Hive in C#
Serialization