An Introduction to HDInsight

An Introduction to HDInsight June 27th, 2013 cschmidt@pragmaticworks.com @sqlbischmidt http://intelligentsql.wordpress.com

Big Data

Structured or Unstructured? • Structured data is identifiable • Organized by columns and rows, as in databases • Unstructured data has no such identifiable structure

HDInsight • Getting Started • Apache Hadoop-based service • Modern, cloud-based solution platform that manages data of any type and/or size • Big data does not provide value on its own; it must be ETL’d

HDInsight (continued) • An HDInsight Azure instance consists of a head node (also called a namenode) and one or more data nodes • Benefits: • Integration into Social Media • Advanced Analytics • “Live” Changes • What’s the weather like right now?

MapReduce • MapReduce takes a large, unstructured data set and breaks it down by mapping, shuffling, and sorting the data to generate an output file • HDFS: Hadoop Distributed File System • Data gets distributed over multiple drives on multiple servers • JAR files: compiled MapReduce code, bundled so that it can be executed
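The map, shuffle/sort, and reduce steps can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not Hadoop code: a real job would be packaged as a JAR and run across the cluster's data nodes. The word-count task and all function names here are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return [(key, sum(values)) for key, values in grouped]


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_and_sort(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    for word, count in word_count(sample):
        print(word, count)
```

On a cluster, the map and reduce functions run in parallel on different nodes, and the framework performs the shuffle and sort between them.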

MapReduce Data Flow

Pig • Pig is an alternative to writing Java code for creating and running MapReduce jobs. • The language is called Pig Latin • Using Pig is a good way to reduce the time needed to create MapReduce programs • Many algorithms can be written in fewer than 5 lines of Pig Latin code!

Pig • Pig Latin statements follow a general flow of: • LOAD • TRANSFORM • DUMP or STORE • Pig Latin can be written in either grunt mode (interactive) or script mode (batch)
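The LOAD, TRANSFORM, then DUMP or STORE shape can be mirrored in plain Python to show how the stages hand data to one another. This is an analogy only, not Pig Latin and not how Pig executes: Pig compiles its statements into MapReduce jobs. The sample data, column names, and function names below are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample input; a real Pig job would LOAD a file from HDFS.
SAMPLE = "city,temp_f\nJacksonville,91\nSeattle,68\nJacksonville,88\nSeattle,71\n"


def load(text):
    """LOAD: read delimited text into a list of records."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records, min_temp):
    """TRANSFORM: filter the records, then group and count by city."""
    hot = (r for r in records if int(r["temp_f"]) >= min_temp)
    return Counter(r["city"] for r in hot)


def store(counts):
    """STORE: serialize the result as delimited text (DUMP would print it)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for city, n in sorted(counts.items()):
        writer.writerow([city, n])
    return out.getvalue()


if __name__ == "__main__":
    print(store(transform(load(SAMPLE), min_temp=70)), end="")
```

Each stage takes the previous stage's output as its input, which is the same chaining that a sequence of Pig Latin statements expresses.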

Hive • Hive is the “SQL-like” language that sits on top of Hadoop • Commonly referred to as Hive Query Language (HiveQL or HQL) • Structure without modeling • Hive can handle larger data sets than a traditional SQL database because it queries data in parallel across multiple nodes using MapReduce

Data Explorer • Data Explorer is currently in Preview mode from Microsoft • Excel can connect directly to our HDInsight cluster to bring data in for analysis • We can then join this data with other relational sources to “mash” the data together

Additional Resources • Apache Homepage • https://cwiki.apache.org/confluence/display/Hive/GettingStarted • HDInsight • http://www.windowsazure.com/en-us/manage/services/hdinsight/ • Hortonworks • http://hortonworks.com/
