What’s all the Buzz about Hadoop and Hive? Why it Matters for SQL Server Peeps Cindy Gross, Microsoft SQLCAT PM @SQLCindy | cgross@microsoft.com | http://blogs.msdn.com/cindygross
The Plan • Big Data and SQLCAT/CX at PASS Summit • Overview of Big Data, Hadoop, Hive • Why SQL Pros Care • Next Steps
PASS Summit - SQLCAT: Big Data – All Abuzz About Hive [BIA-305-A] • Speakers – Dipti Sangani and Cindy Gross • Gain BI insights with HiveQL over HDFS/Hadoop • How HiveQL generates MapReduce and outputs data • Related, familiar tools • How and when to use Hive
Microsoft Big Data at PASS Summit • Harnessing Big Data with Hadoop – Mike Flasko • SQLCAT: Big Data Warehousing – Len Wyatt, James Podgorski • How Klout Changed the Landscape of Social Media with Hadoop and BI – Denny Lee, Dave Mariani • MAD About Data: Solve Problems and Develop a “Data Driven Mindset” – Darwin Schweitzer
SQLCAT – Customer Experience (CX) • Implement leading edge features • Share lessons learned with the community • Change the product based on real customer experiences
Microsoft SQLCAT at PASS Summit – HA/DR • SQLCAT: Real-World Case Study of Mission-Critical Active/Active Remote DCs - Lindsey Allen, Prem Mehra • SQLCAT: AlwaysOn Unplugged: Everything You Want to Know About AlwaysOn - Sanjay Mishra • SQLCAT: AlwaysOn HA/DR Design Patterns, Architectures and Best Practices - Sanjay Mishra • SQLCAT: SQL Server 2012 AlwaysOn HA/DR Customer Panel - Sanjay Mishra, Ayad Shammout, David Smith, Michael Steineke, Thomas Grohser, Wolfgang Kutschera
Microsoft SQLCAT at PASS Summit - Azure • SQLCAT: How Do I Troubleshoot My Database Now that It Is in the Cloud? - Silvano Coriani, Ewan Fairweather • SQLCAT: SQL Azure Design Patterns and Best Practices - Gus Apostol • SQLCAT: How SQL Azure Supports Large-Scale Customer Deployments - Silvano Coriani, Ewan Fairweather, Mark Simms, Michael Thomassy, Nicholas Dritsas • SQLCAT: What Are the Largest Azure Projects in the World? - Kevin Cox • SQLCAT: Best Practices for SQL Server in Azure VMs: config & performance - Steven Howard
Microsoft SQLCAT at PASS Summit – More! • SQLCAT: Configuring Kerberos for SharePoint 2010 BI in 7 Steps - Chuck Heinzelman • SQLCAT: What Are the Largest SQL Server Projects in the World? - Kevin Cox, Ewan Fairweather, Mark Souza • SQLCAT: SQLOS Memory Manager Changes in SQL Server 2012 - Gus Apostol, Jerome Halmans • SQLCAT: How Does Microsoft Run Its SAP Landscape on Windows and SQL Server? - Juergen Thomas • SQLCAT: Many-Core Processors, SSDs, Large Memory: How to Benefit SQL Server - Juergen Thomas • SQLCAT: Running Reporting Services in SharePoint Integrated Mode: How and Why - Chuck Heinzelman • SQLCAT: Case Study of Big Data in the Real World - Lindsey Allen, Lou Sawyer, Robert Abbott, Shep Sheppard
SQL Server Clinic • Got a burning SQL Server architecture question? • Want to talk to someone about a problem you’re seeing on your servers? • Stop by the SQL Server Clinic at the PASS Summit this fall! • Members of SQLCAT and CSS will be on hand to talk to you about your issues! • NEW EXPANDED LOCATION: 4th Floor, across from the PASS Booth
What is Big Data • Find Insights - Explore, test, eliminate noise • Schema on Read, not Schema on Write • Structure may not be fully pre-defined • Scale out on commodity hardware – pay as you go • BASE instead of ACID • More programmer, lone wolf focused • MapReduce, streaming, machine learning, massively parallel processing • Something too big or complex for your current environment and resources to handle in a cost effective manner
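To make "Schema on Read, not Schema on Write" concrete, here is a minimal Python sketch. It is an illustration only, not code from the talk: the log lines, field names, and the `read_with_schema` helper are all hypothetical. The raw records are stored untouched, and each question brings its own schema at query time.

```python
import json

# Hypothetical raw log lines, stored exactly as they arrived
# (nothing was validated or reshaped on write).
RAW_LINES = [
    '{"user": "ann", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/cart"}',                          # no "ms" field
    '{"user": "cy", "page": "/home", "ms": "95", "ref": "ad"}',  # "ms" as text, extra field
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields and cast each one.

    `schema` maps field name -> (cast function, default value).
    Fields that are absent and have no default come back as None.
    """
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            row[field] = cast(value) if value is not None else None
        yield row

# Two different questions, two different schemas over the same stored data.
latency_schema = {"page": (str, None), "ms": (int, None)}
visitor_schema = {"user": (str, None), "ref": (str, "direct")}

latency_rows = list(read_with_schema(RAW_LINES, latency_schema))
visitor_rows = list(read_with_schema(RAW_LINES, visitor_schema))
```

With schema on write, the second and third records would have been rejected or cleaned at load time. Here that decision is deferred until someone actually needs the field.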
Why Use Big Data – Use Cases • IT Management • SLA Monitoring • Cyber Security • Forensic Analysis • Financial Services • Risk Modeling • Threat Analysis • Fraud Detection • Credit Scoring • Telemetry Management • Clickstream and Application Log Analysis • Sensor Data • Online Commerce • Sentiment Analysis • Recommendation Engines • Search Indexing / Quality
VVVVroom! • Volume – beyond what your environment can handle • Velocity – need decisions fast • Variety – many formats • Variability – multiple interpretations
What is Hadoop • Most common Big Data technology • Powerful tool leading to insights • Open Source • Core – HDFS (storage) and MapReduce (send compute to data) • Hadoop Ecosystem • Trivia – Where did the name Hadoop come from?
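The MapReduce half of the core can be sketched in a few lines of plain Python. This is a single-process toy under stated assumptions, not Hadoop's API: the `BLOCKS` list stands in for file blocks spread across nodes, and `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names for the three stages.

```python
from collections import defaultdict

# Hypothetical "blocks" of one file, as if each lived on a different node.
BLOCKS = [
    "big data big insights",
    "hadoop stores big data",
]

def map_phase(block):
    """Mapper: would run on the node holding the block; emits (word, 1) pairs."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: combine all values collected for one key."""
    return key, sum(values)

mapped = [pair for block in BLOCKS for pair in map_phase(block)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the mappers run in parallel next to the data, and only the small (key, value) pairs travel over the network. That is the "send compute to data" idea.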
Hadoop Ecosystem Snapshot • ETL Tools • BI Reporting • RDBMS • Mahout (ML) • Lucene/Solr (search indexing) • HCatalog • ZooKeeper (Coordination) • Pig (Data Flow) • Hive (SQL / DW) • Sqoop (SSIS) • Serialization (Thrift, Protobuf, Writable) • MapReduce (Job Scheduling / Execution System) • HBase (Column DB) • Cassandra (Column DB) • HDFS (Hadoop Distributed File System) • External Stores (S3, Azure Blobs, Azure Data Market, etc.) • Inspired by Tom White’s Hadoop: The Definitive Guide
What is Hive • Direct queries to Hadoop file system • Data warehousing framework on top of Hadoop • Structure without full relational modeling • Familiar-looking HiveQL using metadata • Generates/runs MapReduce code (not faster than MapReduce!)
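A rough Python sketch of how a HiveQL-style query might break down into map and reduce steps. This illustrates the general idea only, not Hive's actual generated plan. The `clicks` table, its columns, and the file contents are hypothetical.

```python
from collections import defaultdict

# The query being imitated (hypothetical table and columns):
#   SELECT page, COUNT(*) FROM clicks WHERE status = 200 GROUP BY page;

# Hypothetical table metadata: column names and types for a tab-delimited file.
CLICKS_COLUMNS = [("page", str), ("status", int)]

# Hypothetical file contents as they might sit in HDFS.
CLICKS_FILE = "/home\t200\n/cart\t500\n/home\t200\n/cart\t200\n"

def parse(line):
    """Use the metadata to turn one raw line into a typed row."""
    fields = line.split("\t")
    return {name: cast(raw) for (name, cast), raw in zip(CLICKS_COLUMNS, fields)}

def mapper(line):
    """The WHERE filter runs in the map step; the GROUP BY column becomes the key."""
    row = parse(line)
    if row["status"] == 200:
        yield row["page"], 1

def reducer(page, ones):
    """COUNT(*) runs in the reduce step, once per group."""
    return page, sum(ones)

# Shuffle: collect mapper output by key before reducing.
groups = defaultdict(list)
for line in CLICKS_FILE.splitlines():
    for key, value in mapper(line):
        groups[key].append(value)

result = sorted(reducer(page, ones) for page, ones in groups.items())
```

The metadata is what lets a plain delimited file be queried like a table. Because the work still runs as MapReduce jobs, the query cannot run faster than MapReduce itself.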
Why Use Hive • Easy to use if you know SQL! • Makes Hadoop cross-correlations, joins, filters easier • Allows storage of intermediate results for faster/easier querying • Still slower than a relational database • Limited indexing, basically no statistics, caching or query optimizer • Append only
Who Plays with Big Data? • Data Scientists, Data Teams (DBAs, Devs, End Users, Statistics Experts) • Data Stewards, Data Curators (DBAs, specialists) • Infrastructure Admins – Hardware, Network, Windows, Database • Business/Data Analysts • BI Developers and BI Solution Architects • IT Pros
Big Data Plus SQL Server • Extract / Import between SQL, AS, Hadoop (especially Hive) • Tools like PowerPivot, Power View, Excel can mashup data from many sources such as SQL + Hive + DB2 • Explore in Hadoop, Productionalize in SQL Server or AS • Only put full structuring, cleansing effort into the most valuable data • Refine algorithms • Quick prototyping of data hypotheses • Use AS as index into Hadoop data • Archive SQL data into Hadoop (never lose data, store cheaply)
SQL Server is a Great Fit If…. • Updates • Filters, Joins, Subsets – Indexes and Optimizer! • You’ve already put effort into structure • You know what you need to know • Fast responses to individual queries • Not looking at entire data collection • ACID matters • Many, many, many existing and future applications
A Day in the Life - Developer • Write HQL, Pig, MapReduce • Rapid development • Lots of ad hoc code • Data Cleansing
A Day in the Life – DBA / Infrastructure • Backups – probably none • Data loads (in and out) / ETL – frequent, often changing • Archive • Data Curation, Cleansing • Cloud / Elasticity management • System management (installs, troubleshooting, performance, monitoring, trending, planning, hardware) • Write HQL
A Day in the Life – BI Expert • Explore data, especially unknown unknowns • Mashup data from many systems including Hive • Visualize data for Insights that change the business • Integrate with other systems • Write HQL • Bring Hive data into apps, reports • Do statistical analysis, modeling with R, Mahout
Why Get Involved Now • It’s the cool kid on the block – don’t underestimate this! • Help design the future • Cutting edge • Rare skill - few experts, you’ll stand out • Shows initiative • Understand when SQL or AS really is the better solution and/or a complementary solution • PBs and EBs and ZBs and YBs!
Microsoft Big Data Roadmap • Hadoop, Hive, PowerPivot, Power View, Hive ODBC Driver, Analysis Services, PDW, StreamInsight, SQL Server, Excel, Sqoop, JavaScript • CTP - HadoopOnAzure now • Plans for Azure and on-premises Windows-based Hadoop • Adapt existing code, add to the ecosystem • Look for exciting announcements soon!
Next Steps • Read a bit • http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx • http://blogs.msdn.com/cindygross • Play around http://HadoopOnAzure.com • Think about how you can fit Big Data into your company data strategy, and when it’s not a good fit • Get involved - Suggest uses, be prepared to combat misuses • Sign up for PASS Summit 2012! • Then sign up for SQLCAT: Big Data – All Abuzz About Hive [BIA-305-A]
Summary • Big Data and SQLCAT/CX at PASS Summit • Overview of Big Data, Hadoop, Hive • Why SQL Pros Care • Next Steps
Questions?