250 likes | 436 Views
SQL Server: A Data Platform for Large-Scale Applications. José Blakeley, Software Architect Database Systems Group, Microsoft Corporation. SQL Server Design Philosophy. To solve 90+% of the problems automatically Self Tuning Dynamic Optimization Self Configuration Self Management
E N D
SQL Server: A Data Platform for Large-Scale Applications José Blakeley, Software Architect Database Systems Group, Microsoft Corporation
SQL Server Design Philosophy • To solve 90+% of the problems automatically • Self Tuning • Dynamic Optimization • Self Configuration • Self Management • Self Healing • The remaining percentage takes higher touch • Requires careful design for scalability • Good knowledge of how the RDBMS platform works • This talk describes how we are codifying our solution for large-scale DW into SQL Server UCI ISG Seminar
SQL Data PlatformAny Place, Any Type, Any Scale Enterprise Data Platform Dynamic Development Beyond Relational Pervasive Insight UCI ISG Seminar
Outline • SQL Server Data Platform • Basic concepts – common ground • Assumptions • Workloads • SQL Server Parallel DW DBMS • Philosophy • System Architecture • Software Architecture • Summary UCI ISG Seminar
Workload Types • Online Transaction Processing (OLTP) • Balanced read-update ratio (60%-40%) • Fine-grained inserts and updates • High transaction throughput e.g., 10s K/s • Usually very short transactions e.g., 1-3 tables • Sometimes multi-step e.g., financial • Relatively small data sizes e.g., few TBs • Data Warehousing and Business Analysis (DW) • Read-mostly (90%-10%) • Few updates in place, high-volume bulk inserts • Concurrent query throughput e.g., 10s K / hr • Per query response time < 2 s • Snowflake, star schemas are common e.g., 5-10 tables • Complex queries (filter, join, group-by, aggregation) • Very large data sizes e.g., 10s TB - PB OLTP = Day-to-day business DW = Analysis over recorded data UCI ISG Seminar
436,892,631 rows Star Join Schema 41 rows 34 rows SELECTFE0.RegionName, FL0.FiscalYearName, SUM (A.ActualRevenueAmt) FROMTECSPURSL00 A JOINSalesDate L ONA.SalesDateID = L.SalesDateID JOINUpperGeography UG ONA.TRCreditedSubsidiaryId = UG.CreditedSubsidiaryID JOINRegion FE0 ONUG.CreditedRegionID = FE0.RegionID JOINFiscalYear FL0 ONL.FiscalYearID = FL0.FiscalYearID GROUP BY FE0.RegionName, FL0.FiscalYearName 118 rows 13,517 rows UCI ISG Seminar 1/8/2010
Why look at workloads? $B • DBMS architectures of 30 years ago designed for mixed workloads • OLTP and DW workloads have grown into big market segments • Workload-specific engines can be profitable • Innovations are driven by deeper understanding of workloads 8 6 OLTP Mixed DW 4 2 1995 2000 2005 2010 UCI ISG Seminar
Changing TPC-H Landscape 68x perf 12% cost • New DW startups disrupting TPC-H • Column-store, compression, and scale-out UCI ISG Seminar
Data WarehousingSQL Server 2008 • Provides out-of-the-box SMP scale for ~10 TB • Data compression – row and page • Star join, bitmap filters, partitioned table parallelism • Minimally Logged INSERT • MERGE • Resource governor – CPU and memory • Many successful apps for 10-100 TB running today via custom scale-out • E.g., Danske Bank, Clalit Health, US Dep. of Agriculture, Pan-STARRS • Building out-of-the-box scale-out for 10s-100s of TB • This is the focus of this talk UCI ISG Seminar
Outline • SQL Server Data Platform • Basic concepts – common ground • Assumptions • Workloads • SQL Server Parallel DW DBMS • Design Philosophy • Hardware Architecture • Software Architecture • Summary UCI ISG Seminar
Design Philosophy • Shared-nothing, distributed, parallel DBMS • Implicit data and function partitioning • Appliance concept • Software + hardware solution • Optimized for DW workloads • Scalable to Petabytes UCI ISG Seminar
Appliance Concept DBMS System Software Parallel DW Reference Hardware Platforms INDUSTRY STANDARD SERVERS INDUSTRY STANDARD NETWORKING INDUSTRY STANDARD STORAGE UCI ISG Seminar
Hardware Architecture Compute Nodes Storage Nodes SQL SQL SQL SQL SQL SQL SQL SQL SQL Control Nodes Active / Passive Client Drivers (ODBC, OLE-DB, ADO.NET) Management Servers Data Center Monitoring Dual Infiniband Dual Fiber Channel Landing Zone ETL Load Interface Backup Nodes Spare Compute Node Corporate Backup Solution 1 Rack Server Appliance UCI ISG Seminar
Node Types • Control node: • Where clients apps connect • Parallel engine runs here • Contains system metadata • Compute nodes: • Store user data • Perform query execution • Not accessible to outside world • Landing Zone: • Staging place for data loading • Accessible to outside world • Can be augmented with 3rd party HW and SW • Backup node: • Backup file storage • Accessible to outside world • Can be augmented with 3rd party HW and SW • Management node: • Windows domain controller (Active Directory) • SW upgrades staging place • Holds SW images in case a node needs reimaging
Software Architecture Internet Explorer Query Tool MS BI (AS, RS) DWSQL Other 3rd Party Tools IIS Data Access (OLEDB, ODBC, ADO.NET, JDBC) Admin Console Compute Node Compute Nodes Data Movement Service Compute Nodes MPP Engine Coordinator Data Movement Service SQL Server User Data Landing Zone Node SQL Server Data Movement Service DW Authentication DW Configuration DW Schema TempDB Control Node UCI ISG Seminar
Key Components • MPP Engine Coordinator • Provides single system image • SQL compilation • Global metadata and appliance configuration • Global query optimization and plan generation • Global query execution coordination • Global transaction coordination • Authentication and authorization • Supportability (HW and SW status info) • Data Movement Service • Data movement across the appliance • Distributed query execution operators • Parallel Loader • Run from the Landing Zone • SSIS or command line tool • Parallel Database Copy • High performance data export • Enables Hub-Spoke scenarios • Parallel Backup/Restore • Backup files stored on Backup Nodes • Backup files may be archived into external device/system UCI ISG Seminar
Query Processing • SQL statement compilation • Parsing, validation, optimization • Builds an MPP execution plan • A set of parallel QE steps • Executes the plan • Coordinates workflow among steps • Assembles the resultset • Returns resultset to client
Example – Schema TPCH -------------------------------------------------------------------------------- -- Customer Table -- distributiond on c_custkey -------------------------------------------------------------------------------- CREATE TABLE customer ( c_custkeybigint, c_namevarchar(25), c_addressvarchar(40), c_nationkey integer, c_phone char(15), c_acctbal decimal(15,2), c_mktsegment char(10), c_commentvarchar(117)) WITH (distribution=hash(c_custkey)) ; -------------------------------------------------------------------------------- -- Orders Table -------------------------------------------------------------------------------- CREATE TABLE orders ( o_orderkeybigint, o_custkeybigint, o_orderstatus char(1), o_totalprice decimal(15,2), o_orderdate date, o_orderpriority char(15), o_clerk char(15), o_shippriority integer, o_commentvarchar(79)) WITH (distribution=hash(o_orderkey)) ; -------------------------------------------------------------------------------- -- LineItem Table -- distributiond on l_orderkey -------------------------------------------------------------------------------- CREATE TABLE lineitem ( l_orderkeybigint, l_partkeybigint, l_suppkeybigint, l_linenumberbigint, l_quantity decimal(15,2), l_extendedprice decimal(15,2), l_discount decimal(15,2), l_tax decimal(15,2), l_returnflag char(1), l_linestatus char(1), l_shipdate date, l_commitdate date, l_receiptdate date, l_shipinstruct char(25), l_shipmode char(10), l_commentvarchar(44)) WITH (distribution=hash(l_orderkey)) ; UCI ISG Seminar
Example - Query SELECT TOP 10 L_ORDERKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE, O_ORDERDATE, O_SHIPPRIORITY FROM CUSTOMER, ORDERS, LINEITEM WHERE C_MKTSEGMENT = 'BUILDING' AND C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE < '1995-03-05' AND O_ORDERDATE >= '1994-09-15 00:00:00.000' GROUP BY L_ORDERKEY, O_ORDERDATE, O_SHIPPRIORITY ORDER BY REVENUE DESC, O_ORDERDATE ; UCI ISG Seminar
Example – Execution Plan ------------------------------ -- Step 1: create temp table at control node ------------------------------ CREATE TABLE [tempdb].[dbo].[Q_[TEMP_ID_664]] ( [l_orderkey] BIGINT, [REVENUE] DECIMAL(38, 4), [o_orderdate] DATE, [o_shippriority] INTEGER ); ------------------------------ -- Step 2: create temp tables at all compute nodes ------------------------------ CREATE TABLE [tempdb].[dbo].[Q_[TEMP_ID_665]_[PARTITION_ID]] ( [l_orderkey] BIGINT, [l_extendedprice] DECIMAL(15, 2), [l_discount] DECIMAL(15, 2), [o_orderdate] DATE, [o_shippriority] INTEGER, [o_custkey] BIGINT, [o_orderkey] BIGINT ) WITH ( DISTRIBUTION = HASH([o_custkey]) ); ------------------------------- -- Step 3: SHUFFLE_MOVE -------------------------------- SELECT [l_orderkey], [l_extendedprice], [l_discount], [o_orderdate], [o_shippriority], [o_custkey], [o_orderkey] FROM [dwsys].[dbo].[orders] JOIN [dwsys].[dbo].[lineitem] ON ([l_orderkey] = [o_orderkey]) WHERE ([o_orderdate] < '1995-03-05' AND [o_orderdate] >= '1994-09-15 00:00:00.000') INTO Q_[TEMP_ID_665]_[PARTITION_ID] SHUFFLE ON (o_custkey); ------------------------------ -- Step 4: PARTITION_MOVE ------------------------------ SELECT [l_orderkey], sum(([l_extendedprice] * (1 - [l_discount]))) AS REVENUE, [o_orderdate], [o_shippriority] FROM [dwsys].[dbo].[customer] JOINtempdb.Q_[TEMP_ID_665]_[PARTITION_ID] ON ([c_custkey] = [o_custkey]) WHERE [c_mktsegment] = 'BUILDING' GROUP BY [l_orderkey], [o_orderdate], [o_shippriority] INTO Q_[TEMP_ID_664]; ------------------------------ -- Step 5: Drop temp tables at all compute nodes ------------------------------ DROP TABLE tempdb.Q_[TEMP_ID_665]_[PARTITION_ID]; ------------------------------- -- Step 6: RETURN result to client -------------------------------- SELECT TOP 10 [l_orderkey], sum([REVENUE]) AS REVENUE, [o_orderdate], [o_shippriority] FROMtempdb.Q_[TEMP_ID_664] GROUP BY [l_orderkey], [o_orderdate], [o_shippriority] ORDER BY [REVENUE] DESC, [o_orderdate] ; ------------------------------- -- Step 7: Drop temp table at control node -------------------------------- DROP TABLE tempdb.Q_[TEMP_ID_664]; UCI ISG Seminar
Other Important Functionality • Fault tolerance • All HW components have redundancy: • CPUs, Disks, networks, power, storage processors • Control and Compute nodes use failover clustering • Management nodes have active & standby • Integration with SQL Server BI tools • SS Integration Services (ETL) has PDW as a destination • SS Analysis Services (OLAP) has PDW as a source • SS Reporting Services • Excel UCI ISG Seminar
Future Challenges • Richer DW analysis • Map-reduce-like functionality inside the cluster • Data mining, embedded analytics • Richer hub-and-spoke configurations • MPP to MPP • Distributed transactions • Isolation levels, deadlocks, logs • Distributed, parallel query processing • Execution strategies (DW) • Optimization techniques (DW) • Integration with other data services • Streaming • Increased HW architecture choices • Low-power clusters UCI ISG Seminar
Summary • SQL Server Data Platform overview • Workload types • SQL Server Parallel DW DBMS • Design Philosophy • System Architecture • Software Architecture • Query Example • Future challenges UCI ISG Seminar
Resources • Microsoft SQL Server 2008 case studies • http://www.microsoft.com/sqlserver/2008/en/us/case-studies.aspx • SQL Server best practices • http://technet.microsoft.com/en-us/sqlserver/bb671430.aspx • Dave Salch, Eric Kraemer, Umair Waheed, Paul Dyke, Fast Track Data Warehouse 2.0 Architecture, SQL Server Technical Article, MSDN, Nov. 2009. • SQL Server Customer Advisory Team • http://sqlcat.com/Default.aspx • Research on balanced HW architectures • Yogesh Simmhan, et. al. GrayWulf: Scalable Software Architecture for Data Intensive Computing, HICSS pp.1-10, 42nd Hawaii International Conference on System Sciences, 2009 • Alexander S. Szalay, et al., Low Power Amdahl-Balanced Blades for Data Intensive Computing, SC 2009. UCI ISG Seminar
THANKS! UCI ISG Seminar