530 likes | 844 Views
DBI321. Building BI Solutions with SQL Server PDW AU3. Matt Peebles, Artin Avanes Microsoft Corporation. Agenda. Trends in the DW space How does SQL Server PDW fit in? SQL Server PDW AU3 – What’s new? Building BI Solutions with SQL Server PDW Customer Successes
E N D
DBI321 Building BI Solutions with SQL Server PDW AU3 Matt Peebles, Artin Avanes Microsoft Corporation
Agenda • Trends in the DW space • How does SQL Server PDW fit in? • SQL Server PDW AU3 – What’s new? • Building BI Solutions with SQL Server PDW • Customer Successes • Using SQL Server PDW with Microsoft BI solutions • Using SQL Server PDW with third party BI solutions • BI solutions leveraging Hadoop integration • What’s coming next in SQL Server PDW?
Trends in the Data Warehousing Space Understanding the Opportunity • Performance at scale: ability to analyze massive amounts of data • DW systems continue to grow at a fast pace, scalabilityis a key concern, growing a system from 10s of TBs, to 100s of TB, to PBs Data Warehousing has shifted almost entirely towards the appliance model due to speed of the balanced appliance and scalability of scale out (MPP) solutions. Jim Cobelius, Forrester Research Source: TDWI Report – Next Generation DW
Trends in the Data Warehousing Space Understanding the Opportunity • Appliances are the key trend in the next 4 years (4 Billion market by ‘15) • Cloud DW longer-term • Box is a slow decline CAGR Share(‘15) 7.1% 4.6% 7.1% 5.0% 30.0% 26.2% -0.3% 60.4% Source: MS internal analysis, DBSMIT Cloud Market Opportunity Forecast
Agenda • Trends in the DW space • How does SQL Server PDW fit in? • SQL Server PDW AU3 – What’s new? • Building BI Solutions with SQL Server PDW • Customer Successes • Using SQL Server PDW with Microsoft BI solutions • Using SQL Server PDW with third party BI solutions • BI solutions leveraging Hadoop integration • What’s coming next in SQL Server PDW?
What is Parallel Data Warehouse (PDW)? SQL Server Data Warehousing in Appliance Model Scale out Scalable Standards Based Flexible Cost Effective
PDW Hardware Architecture Control Rack Compute Racks (1/2 to 4) Storage Nodes Compute Nodes Control Nodes Active / Passive Client Drivers Management Nodes Data Center Monitoring Dual Fiber Channel Dual Infiniband Landing Zone ETL Load Interface Backup Node Corporate Backup Solution Spare Compute Node Corporate Network Appliance Network
Time Dim Product Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Prod Dim ID Prod Category Prod Sub Cat Prod Desc Store Dim Store Dim ID Store Name Store Mgr Store Size PDW Data Example PDW Compute Nodes SQL SQL SQL SQL Sales Facts Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold Mktg Campaign Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End
Time Dim Product Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Prod Dim ID Prod Category Prod Sub Cat Prod Desc Store Dim Store Dim ID Store Name Store Mgr Store Size PDW Data Example Smaller Dimension Tables are Replicated on Every Compute Node SQL SQL SQL SQL PD TD MD SD PD TD MD SD Sales Facts Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold PD TD SD MD PD TD MD SD Mktg Campaign Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End
Time Dim Product Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Prod Dim ID Prod Category Prod Sub Cat Prod Desc Store Dim Store Dim ID Store Name Store Mgr Store Size PDW Data Example Larger Fact Table is Hash Distributed Across All Compute Nodes SQL SQL SQL SQL PD SF-1 TD MD SD SF-2 PD TD MD SD Sales Facts SF-3 SF-4 SF-1 SF-2 Date Dim ID Store Dim ID Prod Dim ID Mktg Camp Id Qty Sold Dollars Sold SF-3 PD TD SD MD SF-4 PD TD MD SD Mktg Campaign Dim Mktg Camp ID Camp Name Camp Mgr Camp Start Camp End
SQL Server Parallel Data WarehouseA quick look at MPP query execution The actual user data resides on compute nodes, and stepsof the global execution plan are executed on each compute node SQL Server PDW Appliance Client Control Node Compute Node 1 SQL Server PDW is a shared nothing MPP system, meaning user data is distributed across the nodes*. Data Movement Serviceis responsible for moving data around so that individual nodes can satisfy queries that need data from other nodes. Compute Node 2 The user connects to ‘the appliance’likehe would to a ‘normal’ SQL Server, and sends his request The control node handles global query execution, and generates a distributed execution plan . . . Compute Node N
Dealing with Distributions - Shuffling Example: Select [color], SUM([qty]) from [Store Sales] group by [color]; Return Distributed Table Compute Node 2 Compute Node 1 Temp_1 Shuffle Movement DMS Redistributes the data by color values in parallel. Store Sales color qty Ss_id color qty Hash Red 5 Red 5 1 Red 8 Blue 11 3 Parallel Merge and Aggregate Red 12 Red 12 5 color qty Green 7 Green 7 7 Blue 21 Hash Hash Green 7 Red 25 Temp_1 Store Sales Yellow 12 color qty Ss_id color qty Blue 11 Red 8 2 Hash Blue 10 Blue 10 4 Yellow 12 Yellow 12 6
SQL Server Parallel Data WarehouseOverall Architecture Control Rack Data Rack (up to 4) Control Node Compute Node 1 DMS Core PDW Engine Client Interface (JDBC, ODBC, OLE-DB, ADO.NET) PDW Agent DMS Manager Compute Node 2 PDW Agent DMS Core PDW Agent … Landing Zone Node ETL Interface Bulk Data Loader PDW Agent Management Node Compute Node 10 Active Directory DMS Core Legend: PDW Agent PDW Agent PDW service PDW = Parallel Data Warehouse DMS = Data Movement Service
Agenda • Trends in the DW space • How does SQL Server PDW fit in? • SQL Server PDW AU3 – What’s new? • Building BI Solutions with SQL Server PDW • Customer Successes • Using SQL Server PDW with Microsoft BI solutions • Using SQL Server PDW with third party BI solutions • BI solutions leveraging Hadoop integration • What’s coming next in SQL Server PDW?
SQL Server Parallel Data Warehouse AU3 Release Themes Performance At Scale BI, Analytics, & ETL Integration SQL Server Compatibility Less work for the same results Do the same work more efficiently • Native Support for • Analysis Services • Reporting Services • PowerPivot • Lay the foundation for broad connectivity support Broader functionality Full Alignment
SQL Server PDW Architecture How did it work before? • Problem • Basic RDBMS functionality, that already exists in SQL Server, was re-built in PDW • Challenge for PDW AU3 release • Can we leverage SQL Server and focus on MPP related challenges? Control Node
SQL Server PDW AU3 Architecture PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer SQL Server runs a ‘Shell Appliance’ Every database exists as an empty ‘shell’ • All objects, no user data DDL executes against both the shell and the compute nodes Large parts of basic RDBMS functionality now provided by the shell • Authentication and authorization • Schema binding • Metadata catalog SELECT SELECT Engine Service Shell Appliance (SQL Server) foo Control Node Plan Steps Plan Steps Plan Steps Compute Node (SQL Server) Compute Node (SQL Server) Compute Node (SQL Server) foo foo foo
SQL Server PDW AU3 Architecture PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer • User issues a query • Query is sent to the Shell through sp_showmemo_xml stored procedure • SQL Server performs parsing, binding, authorization • SQL optimizer generates execution alternatives • MEMO containing candidate plans, histograms, data types is generated • Parallel execution plan generated • Parallel plan executes on compute nodes • Result returned to the user SELECT SELECT Engine Service Shell Appliance (SQL Server) Control Node MEMO Return Plan Steps Plan Steps Plan Steps Compute Node (SQL Server) Compute Node (SQL Server) Compute Node (SQL Server)
PDW Cost-Based OptimizerOptimizer lifecycle… 1. Simplification and space exploration • Query standardization and simplification (e.g. column reduction, predicates push-down) • Logical space exploration (e.g. join re-ordering, local/global aggregation) • Space expansion (e.g. bushy trees – dealing with intermediate resultsets) • Physical space exploration • Serializing MEMO into binary XML (logical plans) • De-serializing binary XML into PDW Memo 2. Parallel optimization and pruning • Injecting data move operations (expansion) • Costing different alternatives • Pruning and selecting lowest cost distributed plan 3. SQL Generation • Generating SQL Statements to be executed
PDW Cost-Based Optimizer… And Cost Model Details • PDW cost model assumptions: • Costing only data movement operations (relational operations excluded) • Sequential step execution (no pipelined and independent parallelism) • Data movement operations consist of multiple tasks • Each task has Fixed and Variable overhead • Uniform data distribution assumed (no data skew)
PDW Sales Test WorkloadAU2 to AU3 • 5x improvement in terms of total elapsed time out of the box Seconds Queries
Theme: Performance at ScaleZero data conversions in data movement Goal • Eliminate CPU utilization spent on data conversions • Further parallelize operations during data moves Functionality • Using ODBC instead of ADO.NET for reading and writing data • Minimizing applianceresource utilizationfor data moves Benefits • Betterresource, CPU, utilization • 6x or more faster moveoperations • Increased concurrency • Mixed workload (loads + queries)
Theme: SQL Server CompatibilitySQL Server Security and Metadata Security • SQL Server security syntax and semantics • Supporting user, rolesandlogins • Fixed database roles • Allows script re-use • Allows well-known security methods Metadata • PDW metadata stored in SQL Server • ExistingSQL Server metadata tables/views (e.g. security views) • PDW distribution info as extended propertiesin SQL Servermetadata • Existing means and technology for persisting metadata • Improved 3rdpartytool compatibility(BI, ETL)
Theme: SQL Server CompatibilitySupport for SQL Server (Native) Client Goal • ‘Look’ just like a normal SQL Server • Better integration with other BI tools Functionality • Useexisting SQL Server drivers to connect to SQL Server PDW • ImplementSQL Server TDSprotocol • Named Parameter support • SQLCMD connectivityto PDW Benefits • Use known tools and proven technology stack • ExistingSQL Server ’eco-system’ • 2x performance improvementfor return operations • 5x reduction of connection time Server: 10.217.165.13, 17001 TDS SQL Server Clients (ADO.NET, ODBC, OLE-DB, JDBC) SequeLink SQL PDW Clients (ODBC, OLE-DB, ADO.NET) Server: 10.217.165.13, 17000
Theme: SQL Server CompatibilityStored Procedure Support (Subset) Syntax Goal • Support common scenarios of code encapsulation and reuse in Reporting and ETL Functionality • System and user-defined stored procedures • Invocation using RPC or EXECUTE • Control flow logic, input parameters Benefits • Enables common logic re-use • Big impact for Reporting Services scenarios • Allows portingexisting scripts • Increases compatibilitywith SQL Server CREATE { PROC | PROCEDURE } [dbo.]procedure_name [ { @parameter data_type } [ = default ] ] [ ,...n ] AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;] ALTER { PROC | PROCEDURE } [dbo.]procedure_name [ { @parameter data_type } [ = default ] ] [ ,...n ] AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;] DROP { PROC | PROCEDURE } { [dbo.]procedure_name } [;] [ { EXEC | EXECUTE } ] { { [database_name.][schema_name.]procedure_name } [{ value | @variable }] [ ,...n ] } [;] { EXEC | EXECUTE } ( { @string_variable | [ N ]'tsql_string' } [ + ...n ] ) [;] Unsupported Functionality Stored Proc Nesting Output Params Return Try-Catch
Theme: SQL Server CompatibilityCollations Syntax Goal • Support local and international data Functionality • Fixed server level collation • User-defined column level collation • Supporting all Windows collations • Allow COLLATE clauses in Queries and DML Benefits • Storeall the data in PDW w/ additional querying flexibility • ExistingT-SQL DDLand Query scripts • SQL Server alignmentand functionality CREATE TABLE T ( c1 varchar(3) COLLATEtraditional_Spanish_ci_ai, c2 varchar(10) COLLATE …) SELECT c1 COLLATE Latin1_General_Bin2 FROM T SELECT * FROM T ORDER BY c1 COLLATE Latin1_General_Bin2 • Unsupported Functionality • Cannot specify DB collation during DB creation • Cannot alter column collations for existing tables
Theme: Improved IntegrationSQL Server PDW Connectors Connector for Hadoop • Bi-directional(import/export) interfacebetween MSFT Hadoop and PDW • Delimited file support • Adapter uses existing PDW tools (bulk loader, dwsql) • Low cost solution that handles all the data: structured and unstructured • Additional agility, flexibilityand choice Connector for Informatica • Connector providing PDW source and target (mappings, transformations) • InformaticausesPDW bulk loader for fast loads • Leverage existing toolset and knowledge Connector for Business Objects
Agenda • Trends in the DW space • How does SQL Server PDW fit in? • SQL Server PDW AU3 – What’s new? • Building BI Solutions with SQL Server PDW • Customer Successes • Using SQL Server PDW with Microsoft BI solutions • Using SQL Server PDW with third party BI solutions • BI solutions leveraging Hadoop integration • What’s coming next in SQL Server PDW?
PDW Retail POS WorkloadOriginal Customer SMP solution vs. PDW AU3 (with cost-based query optimizer) Seconds Queries
Customer SuccessesHow are customers using PDW & BI ? CUSTOMER EXAMPLE: Stock Exchange in the US Data Volume • 80 TB data warehouse analyzing data from exchanges • Existing system based on SQL SMP farm • 2 different clusters of 6 serverseach Requirement • Linear scalability with additional hardware • Support hourly loads with SSIS – 300GB/day • BI Integration: SSRS, SSAS and PowerPivot AU3 Feedback • SP and increased T-SQL support was great • Migrating SMP SSRS to PDW was painless • 142x for scan heavy queries & no summary tables • Enabled queries that do not run on existing system Reports Portal Dashboards ETL Scorecards PDW Operational DB’s
Customer Successes – cont’dHow are customers using PDW & BI ? CUSTOMER EXAMPLE: Major Retailer in the US Data Volume • 36 TB data warehouse analyzing data from transactional and clickstream sources • Business need to expand to 7 year data window (currently 1 year data) Requirements • Scalability - growing data volume does not affect performance • Performance and ad-hoc analysis for interactive querying by users • BI Integration with Microsoft BI stack - SSAS and SSRS AU3 Feedback • SSAS cubes worked ‘out-of-box’ • Performance an order of magnitude faster than existing system (~30x on an expanded data set)
Agenda • Trends in the DW space • How does SQL Server PDW fit in? • SQL Server PDW AU3 – What’s new? • Building BI Solutions with SQL Server PDW • Customer Successes • Using SQL Server PDW with Microsoft BI solutions • Using SQL Server PDW with third party BI solutions • BI solutions leveraging Hadoop integration • What’s coming next in SQL Server PDW?
Role of PDW within the BI stack PDW role as fast ‘data hub’ • Fast and parallel feeding of data marts (DMs) via Infiniband • CREATE REMOTE TABLE AS SELECT • Aggregation abilities avoids ETL overhead in existing systems • No need for indexes • No need to maintain indexed/materialized views (summary tables) GBit link SSAS / SSRS DM DM SSAS / SSRS Infiniband PDW SSAS / SSRS DM 3rd party BI
SSAS with SQL Server PDWUnderstanding the differences compared to ‘SMP world’ Specific to PDW • PDW does not support foreign key constraints • Shared nothing model requires careful data design and retrieval planning • Design cubes for parallel processing – via MOLAP & ROLAP storage model Specific to the nature of large data • Parallel cube processing/deployment has its limits • Cautious about parallel loads of SSAS - query timeout settings • Query design crucial - only include required data • BI tools traditionally not designed for handling huge amount of data
demo PowerPivot Connection Comparison with SQL Server PDW
Agenda • Trends in the DW space • How does SQL Server PDW fit in? • SQL Server PDW AU3 – What’s new? • Building BI Solutions with SQL Server PDW • Customer Successes • Using SQL Server PDW with Microsoft BI solutions • Using SQL Server PDW with third party BI solutions • BI solutions leveraging Hadoop integration • What’s coming next in SQL Server PDW?
Supported Third Party BI Solutions • AU3 T-SQL compatibility allows for common access for multiple tools • Current support on PDW drivers includes • Microstrategy • BusinessObjects • Cognos • Other tools have ‘mixed experience’ • Cognossupport required : CURRENT_TIMESTAMP , @@DATEFIRST, SET OPTION … • Core connectivity enhancements planned for the next 2 releases
Agenda • Trends in the DW space • How does SQL Server PDW fit in? • SQL Server PDW AU3 – What’s new? • Building BI Solutions with SQL Server PDW • Customer Successes • Using SQL Server PDW with Microsoft BI solutions • Using SQL Server PDW with third party BI solutions • BI solutions leveraging Hadoop integration • What’s coming next in SQL Server PDW?
New Challenges for Business Analytics • Huge amount of data born ‘unstructured’ • Increasing demand for (near) real-time business analytics • Pre-filtering of important from less relevant raw data required Applications • Sensor networks & RFID • Social networks & Mobile Apps • Biological & Genomics HADOOP Sensor/RFID Data Web Data Blogs, Docs
Hadoop as a Platform SolutionIn the context of ETL , BI , and DW • Platform to accelerate ETL processes (not competing with current ETL software tools!) • Flexible and fast development of ‘hand-written’ refining requests of raw data • Active & cost effective data archive to let (historical) data ‘live forever’ • Co-existence with a relational DW (not completely replacing it !)
Importing HDFS data into PDW for advanced BI Application Programmers DBMS Admin Power BI Users SQOOP HADOOP SQL Server PDW Sensor/RFID Data Web Data Blogs, Docs Interactive BI/Data Visualization
Hadoop - PWD Integration via SQOOP (export) SQOOP export with source (HDFS path) & target (PDW DB & table) Copies incoming data on Landing Zone • PDW HadoopConnector … Read HDFS data via mappers Compute Node 1 Compute Node 8 Invokes‘DWLoader’ … • FTP Server • Telnet Server • 1. • 3. • 2. • 4. • 5. HDFS Landing Zone • PDW-configuration file Compute Nodes Control Node Linux/Hadoop Windows/PDW
demo HadoopSqoop Connector with SQL Server PDW
Agenda • Trends in the DW space • How does SQL Server PDW fit in? • SQL Server PDW AU3 – What’s new? • Building BI Solutions with SQL Server PDW • Customer Successes • Using SQL Server PDW with Microsoft BI solutions • Using SQL Server PDW with third party BI solutions • BI solutions leveraging Hadoop integration • What’s coming next in SQL Server PDW?
SQL Server PDW Roadmap What is coming next? CALENDAR YEAR 2011 CALENDAR YEAR 2012 Q3 Q1 Q2 Q3 Q4 Q2 Q4 Q1 Shipped Shipped Shipped Appliance Update 1 Appliance Update 2 V-Next Appliance Update 3 • Improved node manageability • Better performance and reduced overhead • OEM requests • Cost based optimizer • Native SQL Server drivers, including JDBC • Collations • More expressive query language • Data Movement Services performance • SCOM pack • Stored procedures (subset) • Half-rack • Columnar store index • Stored procedures • Integrated Authentication • PowerView integration • Workload management • LZ/BU redundancy • Windows 8 • SQL Server 2012 • Hardware refresh • Programmability • Batches • Control flow • Variables • Temp tables • QDR infiniband switch • Onboard Dell • 3rd party integration (Informatica, MicroStrategy, Business Objects, HADOOP)
In Review • Session Objectives • Provide an overview of SQL Server PDW • Introduce PDW AU3 and share details regarding the new features and their impact on BI scenarios • Key Takeaways • PDW is the SQL Server DW Appliance for 10-100s TB • AU3 enables you to use your existing BI solutions on Microsoft & 3rd Party BI Tools • Expect at least 5x performanceimprovements over PDW AU2 • Specific workloads can see much more
Related Content • DBI209 – Big Data, Big Deal Lots of BI Tool Specific Related Sessions (PowerPivot, Analysis services, Etc.) Breakthrough Insights: Big Data Analytics & Data Warehousing Demo Station PDW Deep Dive Session Online from TechEd 2010
Track Resources Hands-On Labs @sqlserver @ms_teched SQL Server 2012 Eval Copy Get Certified! mva Microsoft Virtual Academy
Resources Learning TechNet • Connect. Share. Discuss. • Microsoft Certification & Training Resources http://northamerica.msteched.com www.microsoft.com/learning • Resources for IT Professionals • Resources for Developers • http://microsoft.com/technet http://microsoft.com/msdn