Learn how to build a data lake using Microsoft Azure Data Lake and gain valuable insights through advanced analytics techniques. Discover the benefits of storing and analyzing data in its native format without pre-defined structures.
Microsoft Ignite NZ 25-28 October 2016 SKYCITY, Auckland
Azure Data Lake: Building a Data Lake and Gaining Insight [M368]
Julian Lee, Advanced Analytics Technical Specialist, Global Black Belt - ANZ
Traditionally, analytics have been over pre-defined structures
• Data characteristics: pre-defined logical models and schemas; limits to size and scale
• Questions answered with BI and visualizations: What were my product sales in Q2, FY2015? What is my highest grossing product?
• Diagram: a traditional data warehouse feeding dashboards and reporting, with tables Customer (Name, Address, Credit Card), Product (Name, SKU #, Unit price), and Sales (Date, Unit, Discount)
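The two questions on this slide are the kind a fixed schema answers directly. A minimal sketch, assuming an in-memory SQLite database as a stand-in for the warehouse: the tables follow the slide's Product and Sales boxes, but the column names, the sample rows, and the reading of Q2 FY2015 as October to December 2014 (a July to June fiscal year) are assumptions made for illustration.

```python
import sqlite3

# Hypothetical stand-in for the warehouse: the schema is fixed before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT NOT NULL, unit_price REAL NOT NULL);
CREATE TABLE sales (sale_date TEXT NOT NULL, sku TEXT NOT NULL REFERENCES product(sku),
                    units INTEGER NOT NULL, discount REAL NOT NULL);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("A1", "Widget", 10.0), ("B2", "Gadget", 25.0)])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("2014-10-15", "A1", 3, 0.0),  # inside the assumed Q2 FY2015 window
    ("2014-12-01", "B2", 2, 0.1),  # inside the window, 10% discount
    ("2015-01-20", "B2", 4, 0.0),  # outside the window, excluded
])

def product_sales(start, end):
    """Revenue per product for sale dates in [start, end), highest first."""
    return conn.execute("""
        SELECT p.name, SUM(s.units * p.unit_price * (1 - s.discount)) AS revenue
        FROM sales s JOIN product p ON p.sku = s.sku
        WHERE s.sale_date >= ? AND s.sale_date < ?
        GROUP BY p.name
        ORDER BY revenue DESC
    """, (start, end)).fetchall()

q2_fy2015 = product_sales("2014-10-01", "2015-01-01")
highest_grossing = q2_fy2015[0][0]
print(q2_fy2015, highest_grossing)
```

Both questions reduce to one aggregate query because the shape of the data was decided up front; a question about a field nobody modelled cannot be asked without changing the schema and reloading.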
To innovate, new types of data and analytics are needed
• Data characteristics: data volume explosion; new unstructured and semi-structured data; data that arrives in real time; no schema on write
• Questions from exploratory analytics: What product would Tim buy in the future? What promotions should we run for Tim to purchase? What new SKUs should we offer?
• Exploratory analytics: batch + interactive, machine learning, streaming, big data
• New data sources: log files, ecommerce website clicks, sensors, IoT, social sentiment, files, video, RFIDs, mobile, transactions, market data, online searching
• Data complexity: variety and velocity, at petabyte volumes
• Diagram: the traditional data warehouse (Customer, Product, Sales) with BI and visualizations (dashboards, reporting) sits alongside the new sources
Data Warehousing Uses a Top-Down Approach
• Understand corporate strategy
• Gather requirements: business requirements, technical requirements
• Implement data warehouse: reporting & analytics design and development; dimension modelling and physical design; ETL design and development; set up infrastructure, install and tune
• Layers in the diagram: data sources (OLTP, ERP, CRM, LOB), ETL, data warehouse, BI and analytics (dashboards, reporting)
The “data lake” Uses a Bottom-Up Approach
• Store all data in native format without schema definition
• Ingest all data regardless of requirements
• Do analysis using analytic engines like Hadoop
• Sources: web, social, sensors, devices, LOB applications, relational, video, clickstream
• Analysis: batch queries, interactive queries, real-time analytics, machine learning, data warehouse
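Storing data in its native format means the schema is applied when the data is read, not when it is written. A minimal sketch of that idea in plain Python, assuming raw JSON lines as the stored format; the sample records and the helper `read_with_schema` are hypothetical and are not part of any Azure API.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data);
# nothing was validated or reshaped on write.
RAW_LINES = [
    '{"source": "web", "user": "tim", "page": "/sku/42"}',
    '{"source": "sensor", "device": "d-7", "temp_c": 21.5}',
    '{"source": "web", "user": "ann", "page": "/sku/7", "referrer": "search"}',
    'not json at all',
]

def read_with_schema(lines, required, optional=()):
    """Schema on read: project each record onto the fields this analysis needs,
    skipping lines that do not parse or that lack a required field."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or any(key not in record for key in required):
            continue
        row = {key: record[key] for key in required}
        row.update({key: record.get(key) for key in optional})
        yield row

# Two different analyses impose two different schemas on the same stored data.
clicks = list(read_with_schema(RAW_LINES, required=("user", "page"), optional=("referrer",)))
readings = list(read_with_schema(RAW_LINES, required=("device", "temp_c")))
print(clicks)
print(readings)
```

The same raw lines serve a clickstream analysis and a sensor analysis without either being planned at ingest time, which is the trade the bottom-up approach makes: cheaper ingestion, more work at query time.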
Data Lake + Data Warehouse: Better Together
• What happened? What is happening? Why did it happen? What are key relationships?
• What will happen? What if? How risky is it? What should happen? What is the best option? How can I optimize?
• Diagram: data lake sources (web, social, sensors, devices, LOB applications, relational, video, clickstream) alongside a data warehouse loaded by ETL from data sources (OLTP, ERP, CRM, LOB), both serving BI and analytics (dashboards, reporting)
Introducing Microsoft Azure Data Lake
• Analytics on any data, any size
• All users productive on day one
• Ready for your enterprise
• Components: Analytics Service, HDInsight, U-SQL, YARN, HDFS, Store
Any type of analytics: batch, streaming, interactive
• Azure Data Lake analytics: batch, interactive, streaming, machine learning
• Allows for exploratory analytics over your data
• Do analytics with Hadoop and Microsoft solutions
• Diagram labels: Batch, MapReduce, Hive, U-SQL Script, NoSQL, Machine Learning, PySpark, SparklyR, Interactive, Streaming, Cortana Analytics Suite
Azure Data Lake Store (in preview): a hyper-scale repository for big data analytics workloads
• No limits to scale
• Store any data in its native format
• Hadoop file system (HDFS) for the cloud
• Enterprise ready: access control, encryption at rest
• Optimized for analytic workload performance
Customer Challenges
• Challenge areas: data silos, perf & scale, analytics, security
• Issues raised: open interfaces to data; variety of analytics tools; compliance challenges; effectively control access; corporate policies; data spans sources; inefficiency in colocation; storage bottlenecks; IoT sources with small writes; price-performance; data grows independently
Hadoop Distributed File System (HDFS) For The Cloud
• Built from the ground up as a Hadoop file system
• Support for file/folder objects and operations
• Integrated with HDInsight, Hortonworks, Cloudera
• Accessible to all HDFS-compliant projects (Spark, Storm, Flume, Sqoop, Kafka, R, etc.)
• Built using open standards
• Diagram: web, social, sensors, devices, LOB applications, relational, video, and clickstream data flowing into Microsoft Azure Data Lake, used by HDInsight
Manage and Secure Your Data Assets
• Azure Active Directory integration
• File and folder level access control
• Audit data access
• Encryption of data at rest
• Protect your data assets
Durable and Highly Available
• Automatically replicates your data
• 3 copies within a single region
• Highly available
• Peace of mind from high data durability
Unlimited Storage, Petabyte Files
• Unlimited account sizes
• Individual file sizes from GBs to PBs
• No limits to scale
• Useful for scenarios with very large data
Optimized for Analytic Workload Performance
• Built for running large analytic systems that require massive throughput
• Optimized for massively parallel computation over PBs of data
• Automatically optimizes for any throughput
• Focus only on writing application logic
High Frequency, Low Latency, Read Immediately
• High volumes of small writes at low latency
• Immediate read-after-write
• Optimized for scenarios such as IoT, real-time fraud detection, clickstream analysis, etc.
• Good for transaction + IoT workloads
Demo – Starting Up Azure Data Lake Storage Julian Lee
Azure Data Lake Analytics (in preview): an elastic analytics service built on Apache YARN that processes all data, at any size
• No limits to scale
• Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
• Optimized to work with ADL Store
• Federated query across Azure data sources
• Enterprise ready: role-based access control and auditing
• Pay per job and scale per job
Analytics that dynamically scale to match your needs
• Architected for cloud scale and performance
• With a few clicks, provision any amount of resources
• Dynamically provisions and winds down resources
• Focus only on your business logic
• Dynamically scales to match the business
On any type of data
• Easily incorporate any form of data
• Distributed query over structured and unstructured data
• Query over Data Lake Store, Azure Storage, Azure SQL Database, Azure SQL Data Warehouse, and SQL Server in an Azure virtual machine
• Diagram: U-SQL in the analytics service querying Azure Data Lake Store, Azure Blobs, SQL Server in an Azure VM, and Azure SQL Database
Be productive without worrying about infrastructure
• Deploy big data projects within minutes or seconds
• No hardware to install, tune, configure or deploy
• No infrastructure or software to manage
• Scale from tens to thousands of machines instantly
• Up and running instantly
Be productive with a robust development environment
• Deep integration with Visual Studio
• Easy for novices to write simple queries
• Robust environment for experts to also be productive
• Integrated with U-SQL, Hive, and Storm
• Playback that visualizes performance to identify bottlenecks and areas for optimization
• Productive for novices and experts
Be productive with U-SQL, a simple and powerful language
• Simple and familiar, easily extensible
• Unifies the declarative nature of SQL with the expressive power of C#
• Familiar syntax to millions of .NET developers
• Empower SQL/.NET developers with big data
Introducing Microsoft Azure Data Lake
• Analytics on any data, any size
• All users productive on day one
• Open, reliable, secure, integrated
Reliable: managed, monitored, and supported by Microsoft
• Enterprise-leading SLA: 99.9% uptime
• No IT resources needed for upgrades/patching
• Microsoft monitors your deployment so you don’t have to
• Peace of mind that you’ll run continuously
Manage and secure your data assets
• Auditing, alerting, access control, all from within a single web-based portal
• Azure Active Directory integration for identity and access management
• Use existing IT investment for security
Lower Total Cost of Ownership
• No hardware, licenses, or service-specific support agreements
• Pay only for what you use and not more than you need
• Independently scale storage and compute
• No need to hire a specialized operations team to do big data
• 63% lower total cost of ownership than on-premises*
*A pending IDC study found that, on a per-TB basis, Microsoft customers using cloud-based Hadoop in Data Lake have a 63% lower TCO than on-premises
Proven at Microsoft
• Used at Microsoft for some of the largest big data projects across Office, Xbox Live, Azure, Windows, Bing and Skype
• Over ten thousand developers
• Exabytes of data under management
• Using the same technology that powers Microsoft
Demo – Azure Data Lake Analytics Julian Lee
Integrated as part of an end-to-end suite: Cortana Analytics
• Flow: DATA, then INTELLIGENCE, then ACTION
• Data sources: business apps, custom apps, sensors and devices
• Information Management: Azure Data Factory, Azure Data Catalog, Azure Event Hub
• Big Data Stores: Azure Data Lake store, Azure SQL Data Warehouse
• Machine Learning and Analytics: Azure Machine Learning, Azure Data Lake analytics service, Azure HDInsight, a Data Lake service (Hadoop and Spark), Azure Stream Analytics
• Dashboards and Visualizations: Power BI
• Personal Digital Assistant: Cortana
• Perceptual Intelligence: face, vision, speech, text
• Business Scenarios: recommendations, customer churn, forecasting, etc.
• Action: people, automated systems
As an example, Virginia Tech is processing data from DNA sequencers generating 15 PB/year
Using big data to find treatments for cancer
• Challenge: previously used a network of supercomputers to process data from 2,000 DNA sequencers generating 15 PB of genome data
• Solution: created a big data solution in the cloud (Azure HDInsight, a Data Lake service)
• Business Transformation: process PBs of data in a pay-for-compute model, saving millions of dollars. Their goal is to find a treatment for cancer.
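To put the 15 PB/year figure in perspective, a rough back-of-envelope calculation follows. It is only a sketch: it assumes decimal units (1 PB = 10^15 bytes), a 365-day year, and output spread evenly across the 2,000 sequencers and over time, none of which the slide states.

```python
# Figures from the slide: 2,000 DNA sequencers, 15 PB of genome data per year.
# Assumptions (not in the source): decimal units, a 365-day year, and an even
# spread of output across sequencers and over time.
SEQUENCERS = 2_000
BYTES_PER_YEAR = 15 * 10**15
SECONDS_PER_YEAR = 365 * 24 * 3600

tb_per_sequencer_per_year = BYTES_PER_YEAR / SEQUENCERS / 10**12
aggregate_mb_per_second = BYTES_PER_YEAR / SECONDS_PER_YEAR / 10**6

print(f"{tb_per_sequencer_per_year:.1f} TB per sequencer per year")
print(f"{aggregate_mb_per_second:.0f} MB/s sustained across all sequencers")
```

Under these assumptions each sequencer contributes several terabytes a year and the combined stream is a few hundred megabytes per second, sustained all year.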
Demo – Running HDInsight/R on ADL Julian Lee