
An Introduction to HDInsight

June 27th, 2013. cschmidt@pragmaticworks.com, @sqlbischmidt, http://intelligentsql.wordpress.com


Presentation Transcript


  1. An Introduction to HDInsight June 27th, 2013 cschmidt@pragmaticworks.com @sqlbischmidt http://intelligentsql.wordpress.com

  2. Big Data

  3. Structured or Unstructured? • Structured data has an identifiable structure • Organized by columns and rows • Example: databases • Unstructured data has no such identifiable structure

  4. HDInsight • Getting Started • Apache Hadoop-based service • Modern, cloud-based solution platform that manages data of any type or size • Big data does not provide value on its own; it must first be extracted, transformed, and loaded (ETL’d)

  5. HDInsight (continued) • An HDInsight Azure instance consists of a head node (also called a namenode) and one or more data nodes • Benefits: • Integration into Social Media • Advanced Analytics • “Live” Changes • What’s the weather like right now?

  6. MapReduce • MapReduce takes a large, unstructured data set and breaks it down by mapping, shuffling, and sorting the data to generate an output file • HDFS: Hadoop Distributed File System • Data gets distributed over multiple drives on multiple servers • JAR files: bundled, compiled MapReduce code that can be submitted to the cluster and executed

  7. MapReduce Data Flow
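
The map, shuffle/sort, and reduce steps can be illustrated in a few lines of plain Python. This is only a single-process sketch of the data flow, not Hadoop code: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and a real job would run these phases in parallel across the data nodes.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle/sort: gather all values for the same key, then order by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return [(key, sum(values)) for key, values in grouped]


def word_count(lines):
    """Chain the three phases together, as a MapReduce job would."""
    return reduce_phase(shuffle_and_sort(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical input; in Hadoop these lines would come from files in HDFS.
    sample = ["big data is big", "data is everywhere"]
    for word, count in word_count(sample):
        print(f"{word}\t{count}")
```

Each phase only needs the output of the previous one, which is what lets Hadoop spread the map and reduce work over many machines.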

  8. Pig • Pig is an alternative to writing Java code for creating and running MapReduce jobs • The language is called Pig Latin • Using Pig is a good way to reduce the time needed to create MapReduce programs • Many algorithms can be written in fewer than five lines of Pig Latin code!

  9. Pig • Pig Latin statements follow a general flow of: • LOAD • TRANSFORM • DUMP or STORE • Pig Latin can be written in either grunt mode (interactive) or script mode (batch)
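
The LOAD, TRANSFORM, DUMP-or-STORE flow can be mirrored in Python to show how the stages hand data to one another. This is an analogy only, not Pig Latin and not a Pig API: the functions `load`, `transform`, `dump`, and `store`, and the comma-separated log records used as input, are hypothetical names and data invented for this sketch.

```python
import csv
import io
from collections import Counter


def load(text, delimiter=","):
    """LOAD: read delimited text into a list of field tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [tuple(row) for row in reader if row]


def transform(records):
    """TRANSFORM: group records by their second field and count each group."""
    counts = Counter(rec[1] for rec in records if len(rec) > 1)
    return sorted(counts.items())


def dump(relation):
    """DUMP: print the result to the screen (handy in interactive use)."""
    for row in relation:
        print(row)


def store(relation, out, delimiter="\t"):
    """STORE: write the result to a file-like object (typical for batch use)."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(relation)


if __name__ == "__main__":
    # Hypothetical input records: date, level, message.
    raw = (
        "2013-06-27,INFO,started\n"
        "2013-06-27,WARN,disk low\n"
        "2013-06-27,INFO,finished\n"
    )
    result = transform(load(raw))
    dump(result)
```

In Pig itself each stage is a single statement, which is why short scripts can replace much longer hand-written MapReduce programs.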

  10. Hive • Hive provides a SQL-like language that sits on top of Hadoop • Commonly referred to as Hive Query Language (HiveQL or HQL) • Structure without modeling • Hive can handle larger data sets than a traditional single-server SQL database because it queries data in parallel across multiple nodes using MapReduce
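
"Structure without modeling" can be read as schema-on-read: the raw files stay as they are, and column names are applied only when the data is queried. The Python sketch below illustrates that idea under stated assumptions. It is not Hive or HiveQL; `read_with_schema` and `count_by` are hypothetical helpers, the tab-delimited sample rows are invented, and the padding of short rows with `None` is a choice made for this sketch.

```python
from collections import namedtuple


def read_with_schema(raw_lines, columns, delimiter="\t"):
    """Apply column names to raw delimited lines at read time."""
    Row = namedtuple("Row", columns)
    width = len(columns)
    for line in raw_lines:
        fields = line.rstrip("\n").split(delimiter)
        # Pad short rows with None and ignore extra fields (sketch behaviour).
        fields = (fields + [None] * width)[:width]
        yield Row(*fields)


def count_by(rows, column):
    """Roughly what 'SELECT column, COUNT(*) ... GROUP BY column' computes."""
    counts = {}
    for row in rows:
        key = getattr(row, column)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical raw file contents: no schema is stored with the data.
    raw = ["alice\tseattle", "bob\ttampa", "carol\tseattle"]
    rows = read_with_schema(raw, ["name", "city"])
    for city, total in sorted(count_by(rows, "city").items()):
        print(city, total)
```

The same raw lines could be read again with a different column list, which is the flexibility that storing structure separately from the data provides.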

  11. Data Explorer • Data Explorer is currently in Preview from Microsoft • Excel can connect directly to an HDInsight cluster and bring its data in for analysis • This data can then be joined with other relational sources to “mash” the data together

  12. Additional Resources • Apache Hive (Getting Started) • https://cwiki.apache.org/confluence/display/Hive/GettingStarted • HDInsight • http://www.windowsazure.com/en-us/manage/services/hdinsight/ • Hortonworks • http://hortonworks.com/
