An introduction to HDInsight. Edinson Medina, SR PFE for Data and AI, Microsoft Services.
Who Am I? Edinson Medina, SR PFE, Data and AI Domain, Microsoft Services UK. Venezuelan. @sqldixitox. https://www.linkedin.com/in/edinsonmedina/
Roles in the room?
What is Big Data?
• Data that is too large or complex for analysis in traditional relational databases
• Typified by the “3 V’s”:
  • Volume: Huge amounts of data to process
    • Could be TBs, PBs or EBs
  • Variety: A mixture of structured and unstructured data
    • Structured, Semi-structured, Unstructured
  • Velocity: New data generated extremely frequently
    • Stream Processing, Real Time, Batch
• Examples: Sensor and IoT processing, Web server click-streams, Social media sentiment analysis
• Batch Processing: Filter, cleanse, and shape data for analysis
• Real-Time Processing: Capture, filter, and aggregate streams of data for low-latency querying
• Predictive Analytics: Apply statistical algorithms for classification, regression, clustering, and prediction
What is Hadoop?
• Big Data is not the same as Hadoop
• What is the MapReduce process?
• What is HDFS?
• MapReduce Engine vs Tez Engine
[Diagram: a Hadoop cluster (Head Node, Worker Nodes, HDFS) counting the words in the sentence "Map Reduce can Map and Reduce data"]
Map output: Map:1 Reduce:1 can:1 Map:1 and:1 Reduce:1 data:1
Reduce output: Map:2 Reduce:2 can:1 and:1 Data:1
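The word count on this slide can be sketched in plain Python to make the three steps (map, shuffle, reduce) visible. This is only a single-process illustration, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and a real cluster would run the map and reduce steps in parallel on worker nodes over data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# The sentence used on the slide; matching is case-sensitive in this sketch.
counts = word_count(["Map Reduce can Map and Reduce data"])
print(counts)
```

For the slide's sentence this gives a count of 2 for "Map" and "Reduce" and 1 for each of the other words, matching the diagram.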
set hive.execution.engine=mr; SELECT…
set hive.execution.engine=tez; SELECT…
[Diagram: Map and Reduce stages of the query under each execution engine]
What is HDInsight?
• Microsoft’s Hadoop distribution
• Powered by the cloud
• 100% Apache Hadoop
• Immersive insights
Hadoop ecosystem in HDInsight
• Spark
• Streaming (Storm)
• Metadata (HCatalog)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Business Intelligence (Excel, Power View, SSAS…)
• Active Directory (Ranger)
• Pipeline / workflow (Oozie)
• NoSQL Database (HBase)
• Data Integration (ODBC / Sqoop / REST)
• Scripting (Pig)
• Query (Hive)
• Machine Learning (Mahout)
• Distributed Processing (MapReduce or Tez)
• System Center (Future)
• Log file aggregation (Flume)
• YARN
• Distributed Storage (HDFS)
Hive
• A metadata service that projects tabular schemas over folders
• Enables the contents of folders to be queried as tables, using SQL-like query semantics
• Queries are translated into jobs
• Execution engine can be Tez or MapReduce
SELECT…
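The idea of projecting a table schema over a folder of files can be illustrated with a small Python analogy. This is not Hive and does not use any Hive API: the schema, the file names, the tab delimiter and the helper functions `read_table` and `total_hits_by_page` are all assumptions made for the sketch. The point is only that the files stay as plain text and the column names and types are applied when the data is read.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical schema: names and types are applied only at read time.
SCHEMA = [("page", str), ("hits", int)]


def read_table(folder, schema, delimiter="\t"):
    """Yield one typed dict per row from every file found in the folder."""
    for path in sorted(Path(folder).iterdir()):
        with path.open(newline="") as handle:
            for fields in csv.reader(handle, delimiter=delimiter):
                yield {name: cast(value)
                       for (name, cast), value in zip(schema, fields)}


def total_hits_by_page(rows):
    """Rough analogue of a SELECT page, SUM(hits) ... GROUP BY page query."""
    totals = {}
    for row in rows:
        totals[row["page"]] = totals.get(row["page"], 0) + row["hits"]
    return totals


# Two plain text files in one folder stand in for the table's data.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "part-0.txt").write_text("home\t3\nabout\t1\n")
    Path(folder, "part-1.txt").write_text("home\t2\n")
    totals = total_hits_by_page(read_table(folder, SCHEMA))

print(totals)
```

Adding another file to the folder would add rows to the "table" without any load step, which is the behaviour the slide describes.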
Pig
• Pig performs a series of transformations to data relations based on Pig Latin statements
• Relations are loaded using schema-on-read semantics to project table structure at runtime
• You can run Pig Latin statements interactively in the Grunt shell, or save a script file and run them as a batch
Oozie
• A workflow engine for actions in a Hadoop cluster
  • MapReduce
  • Hive
  • Pig
  • Others
• Supports parallel workstreams and conditional branching
Sqoop
• Sqoop is a database integration service
• Built on open source Hadoop technology
• Enables bi-directional data transfer between Hadoop clusters and databases via JDBC
HBase
• A low-latency, NoSQL database built on Hadoop
• Modeled on Google’s BigTable
• HBase stores data in StoreFiles on HDFS
What is NoSQL?
• A type of database
• Does not use the relational model
• Good fit for distributed environments
• NoSQL has very little to do with SQL (Structured Query Language); it should have been called "Not Only Relational Databases"
• Schema-less / schema-free
• Focus on performance over consistency
What is a Stream of data?
• An unbounded sequence of event data
• Stream processing is continuous
• Aggregation is based on temporal windows
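Aggregation over temporal windows can be sketched with a simple tumbling (fixed, non-overlapping) window in Python. This is a minimal illustration under several assumptions: events are `(timestamp_seconds, payload)` pairs, they arrive in timestamp order, and a short list stands in for a stream that would really be unbounded. The function name `tumbling_windows` is hypothetical.

```python
def tumbling_windows(events, window_seconds):
    """Yield (window_start, event_count) each time a window closes.

    Assumes events arrive ordered by timestamp. Windows that contain
    no events are skipped in this sketch.
    """
    current_start = None
    count = 0
    for timestamp, _payload in events:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        if start != current_start:
            # An event from a later window arrived: emit the closed window.
            yield current_start, count
            current_start, count = start, 0
        count += 1
    if current_start is not None:
        # The input ended (a real stream would not): emit the last window.
        yield current_start, count


events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (14, "e"), (21, "f")]
windows = list(tumbling_windows(events, 10))
print(windows)
```

Because the function is a generator, results are produced as each window closes rather than after all the data has been seen, which is the practical difference between stream and batch processing.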
Storm
• An event processor for data streams
• Defines a streaming topology that consists of:
  • Spouts: Consume data sources and emit streams that contain tuples
  • Bolts: Operate on tuples in streams
• Storm topologies run continuously on streams of data
  • Real-time monitoring
  • Event aggregation and logging
[Diagram: Spout → Bolt]
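The spout and bolt roles can be mimicked with Python generators to show how tuples flow through a topology. This is an analogy only and does not use the Storm API; a real topology runs distributed across a cluster. The names `sensor_spout`, `threshold_bolt` and `count_bolt`, the sample readings and the limit of 70 are all invented for the sketch.

```python
def sensor_spout(readings):
    """Spout: consume a data source and emit a stream of tuples."""
    for sensor_id, value in readings:
        yield (sensor_id, value)


def threshold_bolt(stream, limit):
    """Bolt: pass on only the tuples whose value exceeds the limit."""
    for sensor_id, value in stream:
        if value > limit:
            yield (sensor_id, value)


def count_bolt(stream):
    """Bolt: keep a running alert count per sensor and emit each update."""
    counts = {}
    for sensor_id, _value in stream:
        counts[sensor_id] = counts.get(sensor_id, 0) + 1
        yield (sensor_id, counts[sensor_id])


# A short list stands in for a continuous source of sensor readings.
readings = [("s1", 20), ("s2", 75), ("s1", 80), ("s2", 90), ("s1", 10)]

# Wire the topology: spout -> threshold bolt -> counting bolt.
topology = count_bolt(threshold_bolt(sensor_spout(readings), limit=70))
alerts = list(topology)
print(alerts)
```

Each stage processes one tuple at a time and hands it on immediately, so the chain would keep running for as long as the source keeps producing readings.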
Spark
• A fast, general-purpose computation engine that supports in-memory operations
• A unified stack for interactive, streaming, and predictive analysis
• Can run in Hadoop clusters
So, do you need big data?
• Are your data volumes truly “big”?
  • Many times we regulate how much data we save
  • Are you collecting enough?
• Is it needed?
  • Are you required to constantly accommodate new data?
  • Is your business transactional only?
• How will you benefit from it?
• Are you ready for it?
  • You will need to filter through the noise
  • Skills and expertise
Demo
• Create a Spark cluster in Azure HDInsight
• Processing Big Data with Hive
• Connect using Power BI Desktop
• Predictive analysis with Spark
Just like Jimi Hendrix … We love to get feedback. Please complete the session feedback forms.
SQLBits - It's all about the community... Please visit Community Corner. This year we are trying to get more people to learn about the SQL community, so if you would be happy to visit Community Corner, we’d really appreciate it.