An Introduction to HDInsight June 27th, 2013 cschmidt@pragmaticworks.com @sqlbischmidt http://intelligentsql.wordpress.com
Big Data: Structured or Unstructured? • Structured data has an identifiable structure • Organized by columns and rows • Typically stored in databases • Unstructured data has no such identifiable structure
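To make the distinction concrete, here is a minimal Python sketch that turns a line of free text into a structured row with named columns. The sample log line, the field names, and the `parse_line` helper are assumptions made up for illustration; they are not from the slides.

```python
import re

# Assumed layout for illustration only: "<date> <time> <LEVEL> <message>"
LINE_PATTERN = re.compile(r"^(\S+ \S+) (\w+) (.*)$")

def parse_line(line):
    """Turn one free-text line into a row with named columns, or None."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None  # no identifiable structure could be recovered
    timestamp, level, message = match.groups()
    return {"timestamp": timestamp, "level": level, "message": message}

if __name__ == "__main__":
    print(parse_line("2013-06-27 10:15:02 ERROR Disk quota exceeded"))
```

Once every line has the same columns, the data can be loaded into a table and queried like any other structured source.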
HDInsight • Getting Started • Apache Hadoop-based service • Modern, cloud-based platform that manages data of any type and size • Big data does not provide value on its own; it must first be extracted, transformed, and loaded (ETL)
HDInsight (continued) • An HDInsight instance on Azure consists of a head node (also called a namenode) and one or more data nodes • Benefits: • Integration with social media • Advanced analytics • “Live” changes, e.g. what’s the weather like right now?
MapReduce • MapReduce takes a large, unstructured data set and breaks it down by mapping, shuffling, and sorting the data, then reduces it to generate an output file • HDFS: Hadoop Distributed File System • Data gets distributed over multiple drives on multiple servers • JAR files: packaged, compiled MapReduce code that can be executed on the cluster
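The map, shuffle, sort, and reduce stages can be sketched in a few lines of Python. This is a single-process illustration of the idea using a word count, not Hadoop's implementation; the function names and sample input are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_and_sort(pairs):
    """Shuffle: group values by key. Sort: order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return [(key, sum(values)) for key, values in grouped]

def word_count(lines):
    """Run the three stages in order over an iterable of text lines."""
    return reduce_phase(shuffle_and_sort(map_phase(lines)))

if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    for word, count in word_count(sample):
        print(f"{word}\t{count}")
```

On a real cluster the map and reduce steps run on many nodes at once, and the shuffle moves each key's values to the node that reduces it.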
Pig • Pig is an alternative to writing Java code for creating and running MapReduce jobs • The language is called Pig Latin • Using Pig is a good way to reduce the time needed to create MapReduce programs • Many algorithms can be written in fewer than five lines of Pig Latin code!
Pig (continued) • Pig Latin statements follow a general flow of: • LOAD • TRANSFORM • DUMP or STORE • Pig Latin can be run either interactively from the Grunt shell or in script (batch) mode
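The LOAD, TRANSFORM, STORE flow can be mirrored in plain Python to show what each stage does. This is a sketch only: the weather data, the file names in the comments, and the helper functions are hypothetical, and the Pig Latin in the comments is a rough equivalent rather than a script from the presentation.

```python
import csv
import io

# Rough Pig Latin equivalent (file and field names are hypothetical):
#   A = LOAD 'weather.csv' USING PigStorage(',') AS (city:chararray, temp:double);
#   B = FILTER A BY temp >= 20.0;
#   STORE B INTO 'warm_cities';

def load(text):
    """LOAD: read comma-separated records into (city, temp) tuples."""
    reader = csv.reader(io.StringIO(text))
    return [(row[0], float(row[1])) for row in reader if row]

def transform(records, min_temp):
    """TRANSFORM: keep only records at or above min_temp."""
    return [record for record in records if record[1] >= min_temp]

def store(records):
    """STORE: serialize the records back to comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(records)
    return out.getvalue()

if __name__ == "__main__":
    raw = "Tampa,31.5\nOslo,12.0\n"
    print(store(transform(load(raw), 20.0)), end="")
```

In Pig, DUMP would print the result to the console instead of writing it out, which is the usual choice while exploring data interactively.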
Hive • Hive provides a SQL-like language that sits on top of Hadoop • Commonly referred to as Hive Query Language (HiveQL or HQL) • Structure without modeling • Hive can handle larger data sets than a typical single-server SQL database, as it queries data in parallel across multiple nodes using MapReduce
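The idea of querying data in parallel can be sketched in Python: each partition of the data is aggregated independently, and the partial results are then merged. This illustrates the principle behind a query such as `SELECT level, COUNT(*) FROM logs GROUP BY level`; it is not how Hive itself is implemented, and the table name, column name, and functions are assumptions for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_count(partition):
    """Count rows per level within one partition (one node's share of data)."""
    return Counter(row["level"] for row in partition)

def parallel_group_count(partitions):
    """Aggregate every partition concurrently, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_count, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

if __name__ == "__main__":
    partitions = [
        [{"level": "INFO"}, {"level": "ERROR"}],
        [{"level": "INFO"}],
    ]
    print(parallel_group_count(partitions))
```

Because each partition is counted on its own, adding nodes lets the same query cover more data without any single machine holding the whole table.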
Data Explorer • Data Explorer is currently in preview from Microsoft • Excel can connect directly to an HDInsight cluster to bring data in for analysis • That data can then be joined with other relational sources to “mash” the data together
Additional Resources • Apache Hive Getting Started • https://cwiki.apache.org/confluence/display/Hive/GettingStarted • HDInsight • http://www.windowsazure.com/en-us/manage/services/hdinsight/ • Hortonworks • http://hortonworks.com/