
What’s all the Buzz about Hadoop and Hive?

Why it Matters for SQL Server Peeps. Cindy Gross, Microsoft SQLCAT PM. @SQLCindy | cgross@microsoft.com | http://blogs.msdn.com/cindygross





Presentation Transcript


  1. What’s all the Buzz about Hadoop and Hive? Why it Matters for SQL Server Peeps Cindy Gross, Microsoft SQLCAT PM @SQLCindy | cgross@microsoft.com | http://blogs.msdn.com/cindygross

  2. The Plan • Big Data and SQLCAT/CX at PASS Summit • Overview of Big Data, Hadoop, Hive • Why SQL Pros Care • Next Steps

  3. PASS Summit - SQLCAT: Big Data – All Abuzz About Hive [BIA-305-A] • Speakers – Dipti Sangani and Cindy Gross • Gain BI insights with HiveQL over HDFS/Hadoop • How HiveQL generates MapReduce and outputs data • Related, familiar tools • How and when to use Hive

  4. Microsoft Big Data at PASS Summit • Harnessing Big Data with Hadoop – Mike Flasko • SQLCAT: Big Data Warehousing – Len Wyatt, James Podgorski • How Klout Changed the Landscape of Social Media with Hadoop and BI – Denny Lee, Dave Mariani • MAD About Data: Solve Problems and Develop a “Data Driven Mindset” – Darwin Schweitzer

  5. SQLCAT – Customer Experience (CX) • Implement leading edge features • Share lessons learned with the community • Change the product based on real customer experiences

  6. Microsoft SQLCAT at PASS Summit – HA/DR • SQLCAT: Real-World Case Study of Mission-Critical Active/Active Remote DCs - Lindsey Allen, Prem Mehra • SQLCAT: AlwaysOn Unplugged: Everything You Want to Know About AlwaysOn - Sanjay Mishra • SQLCAT: AlwaysOn HA/DR Design Patterns, Architectures and Best Practices - Sanjay Mishra • SQLCAT: SQL Server 2012 AlwaysOn HA/DR Customer Panel - Sanjay Mishra, Ayad Shammout, David Smith, Michael Steineke, Thomas Grohser, Wolfgang Kutschera

  7. Microsoft SQLCAT at PASS Summit - Azure • SQLCAT: How Do I Troubleshoot My Database Now that It Is in the Cloud? - Silvano Coriani, Ewan Fairweather • SQLCAT: SQL Azure Design Patterns and Best Practices - Gus Apostol • SQLCAT: How SQL Azure Supports Large-Scale Customer Deployments - Silvano Coriani, Ewan Fairweather, Mark Simms, Michael Thomassy, Nicholas Dritsas • SQLCAT: What Are the Largest Azure Projects in the World? - Kevin Cox • SQLCAT: Best Practices for SQL Server in Azure VMs: config & performance - Steven Howard

  8. Microsoft SQLCAT at PASS Summit – More! • SQLCAT: Configuring Kerberos for SharePoint 2010 BI in 7 Steps - Chuck Heinzelman • SQLCAT: What Are the Largest SQL Server Projects in the World? - Kevin Cox, Ewan Fairweather, Mark Souza • SQLCAT: SQLOS Memory Manager Changes in SQL Server 2012 - Gus Apostol, Jerome Halmans • SQLCAT: How Does Microsoft Run Its SAP Landscape on Windows and SQL Server? - Juergen Thomas • SQLCAT: Many-Core Processors, SSDs, Large Memory: How to Benefit SQL Server - Juergen Thomas • SQLCAT: Running Reporting Services in SharePoint Integrated Mode: How and Why – Chuck Heinzelman • SQLCAT: Case Study of Big Data in the Real World – Lindsey Allen, Lou Sawyer, Robert Abbott, Shep Sheppard

  9. SQL Server Clinic • Got a Burning SQL Server Architecture Question? • Want to talk to someone about a problem you’re seeing on your servers? • Stop by the SQL Server Clinic at the PASS Summit this fall! • Members of SQLCAT and CSS will be on hand to talk to you about your issues! • NEW EXPANDED LOCATION: 4th Floor, Across from the PASS Booth

  10. What is Big Data • Find Insights - Explore, test, eliminate noise • Schema on Read, not Schema on Write • Structure may not be fully pre-defined • Scale out on commodity hardware – pay as you go • BASE instead of ACID • More programmer, lone wolf focused • MapReduce, streaming, machine learning, massively parallel processing • Something too big or complex for your current environment and resources to handle in a cost effective manner
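
The "Schema on Read, not Schema on Write" bullet can be illustrated with a small sketch. This is only one possible illustration, not code from the presentation: the log lines, the column names in `SCHEMA`, and the `read_with_schema` helper are all hypothetical. The raw text is stored untouched, and structure is applied only when someone reads it.

```python
import csv
import io

# Raw data lands in storage untouched; nothing validates it at write time.
# (Sample lines are made up for illustration.)
RAW_LOG = """2012-10-01,alice,/home,200
2012-10-01,bob,/cart,500
malformed line without enough fields
2012-10-02,alice,/checkout,200
"""

# The schema lives with the reader, not with the data (hypothetical columns).
SCHEMA = [("day", str), ("user", str), ("page", str), ("status", int)]


def read_with_schema(raw_text, schema):
    """Apply a schema while reading; rows that do not fit are skipped."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        if len(fields) != len(schema):
            continue  # wrong number of fields: treat as noise, do not fail the read
        try:
            rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
        except ValueError:
            continue  # a field could not be cast to the declared type
    return rows


rows = read_with_schema(RAW_LOG, SCHEMA)
errors = [r for r in rows if r["status"] >= 500]
```

A different reader could apply a different schema to the same raw text, which is the point of deferring structure until read time.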

  11. Why Use Big Data – Use Cases • IT Management • SLA Monitoring • Cyber Security • Forensic Analysis • Financial Services • Risk Modeling • Threat Analysis • Fraud Detection • Credit Scoring • Telemetry Management • Clickstream and Application Log Analysis • Sensor Data • Online Commerce • Sentiment Analysis • Recommendation Engines • Search Indexing / Quality

  12. VVVVroom! • Volume – beyond what environment can handle • Velocity – Need decisions fast • Variety – Many formats • Variability – Multiple interpretations

  13. What is Hadoop • Most common Big Data technology • Powerful tool leading to insights • Open Source • Core – HDFS (storage) and MapReduce (send compute to data) • Hadoop Ecosystem • Trivia – Where did the name Hadoop come from?
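
To make the MapReduce half of the Hadoop core more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using the classic word-count example. It is an assumption-laden illustration in plain Python, not Hadoop code: the function names are hypothetical, and a real cluster would run the map step on the nodes that hold the data blocks.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hive runs on Hadoop", "Hadoop stores data in HDFS"])
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread across many machines.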

  14. Hadoop Ecosystem Snapshot • ETL Tools • BI Reporting • RDBMS • Mahout (ML) • Lucene/Solr (search indexing) • HCatalog • ZooKeeper (Coordination) • Pig (Data Flow) • Hive (SQL / DW) • Sqoop (SSIS) • Serialization (Thrift, Protobuf, Writable) • MapReduce (Job Scheduling / Execution System) • HBase (Column DB) • Cassandra (Column DB) • HDFS (Hadoop Distributed File System) • External Stores (S3, Azure Blobs, Azure Data Market, etc.) • Inspired by Tom White’s Hadoop: The Definitive Guide

  15. What is Hive • Direct queries to Hadoop file system • Data warehousing framework on top of Hadoop • Structure without full relational modeling • Familiar-looking HiveQL using metadata • Generates/runs MapReduce code (not faster than MR!)
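
The idea that a HiveQL statement uses table metadata and turns into MapReduce work can be sketched roughly as follows. Everything here is hypothetical: the `clicks` table, its columns, the sample lines, and the `mapper`/`reducer` helpers. The plans Hive really generates are more elaborate; this only shows the general shape of a filtered `GROUP BY`.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical HiveQL over a hypothetical `clicks` table.
QUERY = "SELECT page, COUNT(*) FROM clicks WHERE status = 200 GROUP BY page"

# Table metadata: column order for the tab-delimited files behind `clicks`.
COLUMNS = ["user", "page", "status"]

# Made-up file contents.
LINES = ["alice\t/home\t200", "bob\t/home\t200", "bob\t/cart\t500", "carol\t/cart\t200"]


def mapper(line):
    """Map side: parse with the metadata, apply WHERE, emit the GROUP BY key."""
    row = dict(zip(COLUMNS, line.split("\t")))
    if row["status"] == "200":
        yield row["page"], 1


def reducer(page, ones):
    """Reduce side: COUNT(*) for one group."""
    return page, sum(ones)


def run(lines):
    # Sorting by key stands in for the shuffle between map and reduce.
    emitted = sorted(kv for line in lines for kv in mapper(line))
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(emitted, key=itemgetter(0))
    ]


result = run(LINES)
```

The query text is a single line, while the equivalent hand-written job needs a parser, a filter, a key choice, and an aggregate, which is the convenience the slide describes.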

  16. Why Use Hive • Easy to use if you know SQL! • Makes Hadoop cross-correlations, joins, filters easier • Allows storage of intermediate results for faster/easier querying • Still slower than a relational database • Limited indexing, basically no statistics, caching or query optimizer • Append only

  17. Who Plays with Big Data? • Data Scientists, Data Teams (DBAs, Devs, End Users, Statistics Experts) • Data Stewards, Data Curators (DBAs, specialists) • Infrastructure Admins – Hardware, Network, Windows, Database • Business/Data Analysts • BI Developers and BI Solution Architects • IT Pros

  18. Big Data Plus SQL Server • Extract / Import between SQL, AS, Hadoop (especially Hive) • Tools like PowerPivot, Power View, Excel can mashup data from many sources such as SQL + Hive + DB2 • Explore in Hadoop, Productionalize in SQL Server or AS • Only put full structuring, cleansing effort into the most valuable data • Refine algorithms • Quick prototyping of data hypotheses • Use AS as index into Hadoop data • Archive SQL data into Hadoop (never lose data, store cheaply)

  19. SQL Server is a Great Fit If…. • Updates • Filters, Joins, Subsets – Indexes and Optimizer! • You’ve already put effort into structure • You know what you need to know • Fast responses to individual queries • Not looking at entire data collection • ACID matters • Many, many, many existing and future applications

  20. A Day in the Life - Developer • Write HQL, Pig, MapReduce • Rapid development • Lots of ad hoc code • Data Cleansing

  21. A Day in the Life – DBA / Infrastructure • Backups – probably none • Data loads (in and out) / ETL – frequent, often changing • Archive • Data Curation, Cleansing • Cloud / Elasticity management • System management (installs, troubleshooting, performance, monitoring, trending, planning, hardware) • Write HQL

  22. A Day in the Life – BI Expert • Explore data, especially unknown unknowns • Mashup data from many systems including Hive • Visualize data for Insights that change the business • Integrate with other systems • Write HQL • Bring Hive data into apps, reports • Do statistical analysis, modeling with R, Mahout

  23. Why Get Involved Now • It’s the cool kid on the block – don’t underestimate this! • Help design the future • Cutting edge • Rare skill - few experts, you’ll stand out • Shows initiative • Understand when SQL or AS really is the better solution and/or a complementary solution • PBs and EBs and ZBs and YBs!

  24. Microsoft Big Data Roadmap • Hadoop, Hive, PowerPivot, Power View, Hive ODBC Driver, Analysis Services, PDW, StreamInsight, SQL Server, Excel, Sqoop, JavaScript • CTP - HadoopOnAzure now • Plans for Azure and on-premises Windows-based Hadoop • Adapt existing code, add to the ecosystem • Look for exciting announcements soon!

  25. Demo

  26. Next Steps • Read a bit • http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx • http://blogs.msdn.com/cindygross • Play around http://HadoopOnAzure.com • Think about how you can fit Big Data into your company data strategy, and when it’s not a good fit • Get involved - Suggest uses, be prepared to combat misuses • Sign up for PASS Summit 2012! • Then sign up for SQLCAT: Big Data – All Abuzz About Hive [BIA-305-A]

  27. Summary • Big Data and SQLCAT/CX at PASS Summit • Overview of Big Data, Hadoop, Hive • Why SQL Pros Care • Next Steps

  28. Questions? What’s all the Buzz about Hadoop and Hive? Why it Matters for SQL Server Peeps Cindy Gross, Microsoft SQLCAT PM @SQLCindy | cgross@microsoft.com | http://blogs.msdn.com/cindygross

  29. Thank You for Attending
