SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq

SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq.com

Agenda • DW Problems • Bitmap Filtered Hash Joins • Table Partitioning • Filtered Indexes • Indexed Views • Data Compression • Window Functions • Columnstore Indexes

Algorithms Complexity Forever* = about 40 billion billionyears!

SSAS Dimensional Addressing Axis(1) Axis(1).Position(3) Axis(1).Position(1).Members(2) Every cell has an address

SSAS Tabular Problems • SSAS address space: mncells • Maximum number of possible combinations200 * 5000 * 1095 = 109,500,000 • SSAS address space grows exponentially! • Can run out of address space – limited scalability

RDBMS Joins • Merge: complexity ~ O(n) • Needs sorted inputs, equijoin • Hash: complexity ~ O(n) / ~O(n2) • Needs equijoin • Nested Loops: complexity ~ O(n)(indexed), ~ O(n2)(not indexed) • Works always, can become quadratic • Non-equijoins are frequently quadratic • E.g., running totals

Linearize Joins

Bitmap Filtered Star Joins • Optimized bitmap filtering for star schema joins • Bitmap representation of a set of values from a dim table to pre-filter rows to join from a fact table • Enables filtering rows early in the plan, allowing subsequent operators to operate on fewer rows

Bloom Filter (1)* • Bloom filter is a bit array of mbits • Start with all bits set to 0 • kdifferent hash functions defined • Each of which maps some set element to one of the mpositions with a uniform random distribution • To add an element, feed it to each of the k hash functions to get k array positions • Set the bits at all these positions to 1 Source: Wikipedia

Bloom Filter (2) • To test whether and element it is in the set, feed it to each of the k hash functions to get k array positions • If any of the bits at these positions are 0, the element is not in the set • If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements

Table Partitioning • Partition function • Partition scheme • Aligned indexes • Partition elimination • Partition switching

Filtered Indexes • Where clause in the Create Index statement • Small B-trees on subset of data only • Useful when some values are selective, while others dense • Index on selective values only

Indexed Views • Useful for queries that aggregate data • Can also reduce number of joins • Depending on edition of SQL Server can be used automatically • No need to change reporting queries • Many limitations

Data Compression • Pre-SQL 2005: variable-length data types • SQL 2005: vardecimal • SQL 2008 • Row compression • Page compression • SQL 2008 R2 • Unicode compression

SQL 2008 Compression • Row compression • Fixed-width data type values stored in variable format • Page compression • Prefix compression • Dictionary compression

Unicode Compression • Works on nchar(n) and nvarchar(n) • Automatically with row or page compression • Savings depends on language • Up to 50% in English, German • Only 15% in Japanese • Very low performance penalty

Window Functions • Functions operating on a window (set) of rows defined by an OVER clause • Types of functions: • Ranking • Aggregate • Distribution SELECT empid, ordermonth, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runqty FROM Sales.EmpOrders;

Window Functions in SQL Server • SQL Server 2005: • Ranking calculations • Aggregates with only window partitioning • SQL Server 2012: • Aggregates with also window ordering and framing • Offset functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE • Distribution functions: PERCENT_RANK, CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC

SQL Server DW / OLAP Offerings VertiPaq • Personal and team level • PowerPivot for Excel (client) • PowerPivot for SharePoint (server) • Corporate level • SQL Server • SSAS Tabular • SSAS Dimensional • Fast Track Data Warehouse • Parallel Data Warehouse

Trans-Relational Model • Not “beyond” relational • Transformation between logical and physical layer • Steve Tarin, Required Technologies Inc. (1999) • All columns stored in sorted order • All joins become merge joins • Can condense storage • Of course, updates suffer • Logically, this is a pure relational model • SQL Server uses own variant • Order of columns not preserved – optimized for compression • Leverages parallel hash joins rather than merge joins

Columnar Storage (1)

Columnar Storage (2)

Row Reconstruction Table

SQL Server Solution (1)* • Converting rows to column segments Source: SQL Server Column Store Indexes by Per-Åke Larson, et al., Microsoft SIGMOD’10, June 12–16, 2011

SQL Server Solution (2) • Storing column segments as BLOBs • Leverages existing BLOB storage • Additional segment metadata • Multiple compression algorithms

Columnstore Compression • Encoding values to 32-bit or 64-bit integer • Dictionary-based encoding • Value-based (prefix) encoding • Optimal row ordering with VertiPaq™ algorithm to rearrange rows • Optimal ordering for Run-Length Encoding (RLE) for best overall compression • Compression • RLE - data stored as <value, count> pairs • Bit-Pack– use min number of bits for a value

Result: Reduced I/O • Fetches only needed columns from disk • Columns are compressed • Less IO • Better buffer hit rates SELECT region, sum (sales) … C2 C3 C6 C4 C5 C1

Result: Reading Segments • Column segment contains values from one column for a set of about 1M rows • Column segment is unit of transfer from disk • Storage engine can eliminate segments early in the process • Because of additional column segment metadata C1 C4 C5 C6 C3 C2 Set of about 1M rows Column Segment

Reducing CPU Usage • Columnstore indexes reduce disk IO • Bitmap-filtered hash joins can be executed in parallel • Problem: CPU becomes a bottleneck • Solution: reduce CPU usage by processing large numbers of rows • Iterators that do not process row-at-a-time • Process batch-at-a-time

Batch Processing • Orthogonal to columnstore indices • Can support other storage • However, best results with columnstore indices • Sometimes can perform batch operations directly on compressed data • Can mix batch and row operators • Can dynamically switch from batch to row mode

Batch Operators • The following operators support batch mode processing: • Filter • Project • Scan • Local hash (partial) aggregation • Hash inner join • Batch hash table build Source: http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-index-faq.aspx#Batch_mode_processing

Columnstore Indexes Constraints • Base table must be clustered B-tree or heap • Columnstore index: • Nonclustered • One per table • Must be partition-aligned • Not allowed on indexed view • Can’t be a filtered index

Data Type Restrictions • Unsupported types • Decimal > 18 digits • Binary • BLOB • (n)varchar(max) • Uniqueidentifier • Date/time types > 8 bytes • CLR

Query Performance Restrictions • Outer joins • Unions • Consider modifying queries to hit “sweet spot” • Inner joins • Star joins • Aggregation

Loading New Data Columnstore index makes table read-only Partition switching allowed INSERT, UPDATE, DELETE, and MERGE not allowed Two recommended methods for loading data Disable, update, rebuild Partition switching

Columnstore Indexes Usage • Use when: • Read-mostly workload • Most updates are appending new data • Workflow permits partitioning or index drop/rebuild • Queries often scan & aggregate lots of data • Use on fact (and large dimensions) tables • Do not use when: • Frequent updates • Partition switching or rebuilding index doesn’t fit workflow • Frequent small look up queries • VertiPaq cannot handle your data model

Review • DW Problems • Bitmap Filtered Hash Joins • Table Partitioning • Filtered Indexes • Indexed Views • Data Compression • Windows Functions • Columnstore Indexes

Q & A • Questions? • Thank you for coming to this conference… • …and this presentation!

References • Books: • SQL Server Books OnLine • Dejan Sarka, Grega Jerkičand Matija Lah: MCTS Self-Paced Training Kit (Exam 70-463): Building Data Warehouses with Microsoft SQL Server 2012 • Courses and Seminars • SQL Server 2012 and SharePoint BI Immersion • Advanced Transact-SQL

SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq

SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq

Presentation Transcript

M icrosoft Data Warehousing - SQL Server 7.0 -

SQL Server Integration Services Deep Dive

SQL Server 2005: Deep Dive On XML And XQuery

Developing with SQL Server Spatial: Deep Dive into Spatial Indexing

DATA WAREHOUSING IN SQL SERVER 2005/2008 BUSINESS INTELLIGENCE

DATA WAREHOUSING IN SQL SERVER 2005/2008 BUSINESS INTELLIGENCE

SQL Server 2012 Big Data Overview

Deep Dive on Windows Server 2008

Deep Dive into ETL Implementation with SQL Server Integration Services

SQL Server 2012

SQL Server 2012 Parallel Data Warehouse

V12 Group Data Deep Dive

SQL Server 2008 R2 Data Warehousing

UM Data Deep Dive

Deep Dive into XML and XQuery in SQL Server 2005

SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq

Data Warehousing Features in SQL Server 2008

SQL Server 2012 Parallel Data Warehouse