390 likes | 526 Views
SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq.com. Agenda. DW Problems Bitmap Filtered Hash Joins Table Partitioning Filtered Indexes Indexed Views Data Compression Window Functions Columnstore Indexes. Algorithms Complexity.
E N D
SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq.com
Agenda • DW Problems • Bitmap Filtered Hash Joins • Table Partitioning • Filtered Indexes • Indexed Views • Data Compression • Window Functions • Columnstore Indexes
Algorithms Complexity Forever* = about 40 billion billionyears!
SSAS Dimensional Addressing Axis(1) Axis(1).Position(3) Axis(1).Position(1).Members(2) Every cell has an address
SSAS Tabular Problems • SSAS address space: mncells • Maximum number of possible combinations200 * 5000 * 1095 = 109,500,000 • SSAS address space grows exponentially! • Can run out of address space – limited scalability
RDBMS Joins • Merge: complexity ~ O(n) • Needs sorted inputs, equijoin • Hash: complexity ~ O(n) / ~O(n2) • Needs equijoin • Nested Loops: complexity ~ O(n)(indexed), ~ O(n2)(not indexed) • Works always, can become quadratic • Non-equijoins are frequently quadratic • E.g., running totals
Bitmap Filtered Star Joins • Optimized bitmap filtering for star schema joins • Bitmap representation of a set of values from a dim table to pre-filter rows to join from a fact table • Enables filtering rows early in the plan, allowing subsequent operators to operate on fewer rows
Bloom Filter (1)* • Bloom filter is a bit array of mbits • Start with all bits set to 0 • kdifferent hash functions defined • Each of which maps some set element to one of the mpositions with a uniform random distribution • To add an element, feed it to each of the k hash functions to get k array positions • Set the bits at all these positions to 1 Source: Wikipedia
Bloom Filter (2) • To test whether and element it is in the set, feed it to each of the k hash functions to get k array positions • If any of the bits at these positions are 0, the element is not in the set • If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements
Table Partitioning • Partition function • Partition scheme • Aligned indexes • Partition elimination • Partition switching
Filtered Indexes • Where clause in the Create Index statement • Small B-trees on subset of data only • Useful when some values are selective, while others dense • Index on selective values only
Indexed Views • Useful for queries that aggregate data • Can also reduce number of joins • Depending on edition of SQL Server can be used automatically • No need to change reporting queries • Many limitations
Data Compression • Pre-SQL 2005: variable-length data types • SQL 2005: vardecimal • SQL 2008 • Row compression • Page compression • SQL 2008 R2 • Unicode compression
SQL 2008 Compression • Row compression • Fixed-width data type values stored in variable format • Page compression • Prefix compression • Dictionary compression
Unicode Compression • Works on nchar(n) and nvarchar(n) • Automatically with row or page compression • Savings depends on language • Up to 50% in English, German • Only 15% in Japanese • Very low performance penalty
Window Functions • Functions operating on a window (set) of rows defined by an OVER clause • Types of functions: • Ranking • Aggregate • Distribution SELECT empid, ordermonth, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runqty FROM Sales.EmpOrders;
Window Functions in SQL Server • SQL Server 2005: • Ranking calculations • Aggregates with only window partitioning • SQL Server 2012: • Aggregates with also window ordering and framing • Offset functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE • Distribution functions: PERCENT_RANK, CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC
SQL Server DW / OLAP Offerings VertiPaq • Personal and team level • PowerPivot for Excel (client) • PowerPivot for SharePoint (server) • Corporate level • SQL Server • SSAS Tabular • SSAS Dimensional • Fast Track Data Warehouse • Parallel Data Warehouse
Trans-Relational Model • Not “beyond” relational • Transformation between logical and physical layer • Steve Tarin, Required Technologies Inc. (1999) • All columns stored in sorted order • All joins become merge joins • Can condense storage • Of course, updates suffer • Logically, this is a pure relational model • SQL Server uses own variant • Order of columns not preserved – optimized for compression • Leverages parallel hash joins rather than merge joins
SQL Server Solution (1)* • Converting rows to column segments Source: SQL Server Column Store Indexes by Per-Åke Larson, et al., Microsoft SIGMOD’10, June 12–16, 2011
SQL Server Solution (2) • Storing column segments as BLOBs • Leverages existing BLOB storage • Additional segment metadata • Multiple compression algorithms
Columnstore Compression • Encoding values to 32-bit or 64-bit integer • Dictionary-based encoding • Value-based (prefix) encoding • Optimal row ordering with VertiPaq™ algorithm to rearrange rows • Optimal ordering for Run-Length Encoding (RLE) for best overall compression • Compression • RLE - data stored as <value, count> pairs • Bit-Pack– use min number of bits for a value
Result: Reduced I/O • Fetches only needed columns from disk • Columns are compressed • Less IO • Better buffer hit rates SELECT region, sum (sales) … C2 C3 C6 C4 C5 C1
Result: Reading Segments • Column segment contains values from one column for a set of about 1M rows • Column segment is unit of transfer from disk • Storage engine can eliminate segments early in the process • Because of additional column segment metadata C1 C4 C5 C6 C3 C2 Set of about 1M rows Column Segment
Reducing CPU Usage • Columnstore indexes reduce disk IO • Bitmap-filtered hash joins can be executed in parallel • Problem: CPU becomes a bottleneck • Solution: reduce CPU usage by processing large numbers of rows • Iterators that do not process row-at-a-time • Process batch-at-a-time
Batch Processing • Orthogonal to columnstore indices • Can support other storage • However, best results with columnstore indices • Sometimes can perform batch operations directly on compressed data • Can mix batch and row operators • Can dynamically switch from batch to row mode
Batch Operators • The following operators support batch mode processing: • Filter • Project • Scan • Local hash (partial) aggregation • Hash inner join • Batch hash table build Source: http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-index-faq.aspx#Batch_mode_processing
Columnstore Indexes Constraints • Base table must be clustered B-tree or heap • Columnstore index: • Nonclustered • One per table • Must be partition-aligned • Not allowed on indexed view • Can’t be a filtered index
Data Type Restrictions • Unsupported types • Decimal > 18 digits • Binary • BLOB • (n)varchar(max) • Uniqueidentifier • Date/time types > 8 bytes • CLR
Query Performance Restrictions • Outer joins • Unions • Consider modifying queries to hit “sweet spot” • Inner joins • Star joins • Aggregation
Loading New Data Columnstore index makes table read-only Partition switching allowed INSERT, UPDATE, DELETE, and MERGE not allowed Two recommended methods for loading data Disable, update, rebuild Partition switching
Columnstore Indexes Usage • Use when: • Read-mostly workload • Most updates are appending new data • Workflow permits partitioning or index drop/rebuild • Queries often scan & aggregate lots of data • Use on fact (and large dimensions) tables • Do not use when: • Frequent updates • Partition switching or rebuilding index doesn’t fit workflow • Frequent small look up queries • VertiPaq cannot handle your data model
Review • DW Problems • Bitmap Filtered Hash Joins • Table Partitioning • Filtered Indexes • Indexed Views • Data Compression • Windows Functions • Columnstore Indexes
Q & A • Questions? • Thank you for coming to this conference… • …and this presentation!
References • Books: • SQL Server Books OnLine • Dejan Sarka, Grega Jerkičand Matija Lah: MCTS Self-Paced Training Kit (Exam 70-463): Building Data Warehouses with Microsoft SQL Server 2012 • Courses and Seminars • SQL Server 2012 and SharePoint BI Immersion • Advanced Transact-SQL