DAT353 Analysis Services: Server Internals Tom Conlon, Program Manager SQL Server Business Intelligence Unit Microsoft Corporation
Purpose of this Session • Remove some of the mystery • Explain how we do some things so much better than our competitors • Things are easier to understand once the internals are understood • Requirements: • You already know the basics – this is for the experienced
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Architecture – Single Server (diagram): an Application queries the Analysis Server through ADO MD and OLE DB for OLAP via the PivotTable Service; Analysis Manager administers the server through DSO; during processing, the server reads from the SQL Server Data Warehouse or other OLE DB providers into the OLAP Store
Component Architecture – Query (diagram): the client (MSOLAP80.DLL) contains a Formula Engine, Metadata Manager, Agent, and Cache; MDX flows to the server (MSMDSRV.EXE), which has its own Formula Engine, Metadata Manager, Storage Engine, and Cache
Component Architecture - Management MSMDGD80.DLL MSMDCB80.DLL DCube Storage Engine METADATA MANAGER DCube PARSER Server Storage Engine METADATA MANAGER METADATA MANAGER FORMULA ENGINE AGENT FORMULA ENGINE MDX DDL CACHE CACHE MMSMDSRV.EXE MSOLAP80.DLL
Component Architecture – Distributed (diagram): the same components as the management picture (MSMDGD80.DLL, MSMDCB80.DLL, MSOLAP80.DLL, MSMDSRV.EXE), plus a second server instance of MSMDSRV.EXE with its own Storage Engine, Metadata Manager, and Cache
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Why Aggregations? • Aggregations can result in orders-of-magnitude improvements in performance • Don’t have to access every fact table record to determine a query result • Further savings with data compression • Biggest savings: reduced disk scans
Aggregations – Overview (diagram): Customers hierarchy (All Customers → Country → State → City → Name) crossed with Product hierarchy (All Products → Category → Brand → Name → SKU); the highest-level aggregation sits at the top, intermediate aggregations in between, and the facts at the bottom
Partial Aggregation • Don’t want to create all possible aggregations • Data explosion! • What if a query is made to a combination of levels where no aggregation exists? • Can compute from lower-level aggregations • Don’t need to compute every possible aggregation • Example: queries combining Country and Brand can be answered if a Country-by-Name aggregation exists (Customers: All Customers → Country → State → City → Name; Product: All Products → Category → Brand → Name → SKU)
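The roll-up the slide describes can be sketched in a few lines. The data and the name-to-brand mapping below are hypothetical, and Python is used purely for illustration; the point is that an existing Country-by-Name aggregation answers a Country-by-Brand query without touching the fact table:

```python
from collections import defaultdict

# Hypothetical aggregation already materialized at (Country, Product Name):
country_by_name = {
    ("USA", "Widget X"): 500,
    ("USA", "Gadget Y"): 300,
    ("Canada", "Widget X"): 200,
}

# Hypothetical mapping of product names up to the Brand level.
brand_of = {"Widget X": "Acme", "Gadget Y": "Globex"}

def rollup_to_brand(agg):
    """Answer a (Country, Brand) query by summing the finer
    (Country, Name) aggregation -- no fact-table scan needed."""
    result = defaultdict(int)
    for (country, name), sales in agg.items():
        result[(country, brand_of[name])] += sales
    return dict(result)

print(rollup_to_brand(country_by_name))
# {('USA', 'Acme'): 500, ('USA', 'Globex'): 300, ('Canada', 'Acme'): 200}
```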
Aggregations (diagram): a pyramid from the Fact Table at the bottom, through the most detailed aggregations (m,m,m,…), up to the highest level of aggregation (1,1,1,1,…) – “Show me all sales for all products for all …”
Partial Aggregation (diagram): the same pyramid, but with only some aggregations materialized between the Fact Table and the highest level of aggregation
Aggregation Design (diagram): candidate aggregations sitting above the Fact Table, combining levels such as Product Family by Quarter and Product by Month
Aggregation Design Results • Result: aggregations designed in waves from the top of the pyramid • At 100% aggregation, the ‘waves’ all touch: overkill • 20–30% is generally adequate (0% for smaller cubes) • (diagram: waves above the Fact Table)
Aggregation Design • Which aggregations are more important than others? • All are equal • From a design perspective, select the ones that result in overall improved query performance • Usage-Based Optimization: weightings on each aggregation based on usage frequency
Flexible and Rigid Aggregations • Changing dimensions allow members to be moved, added, and deleted • After members move, only an incremental process of the dimension is required • ‘Flexible’ aggregations are deleted when a changing dimension is incrementally processed • ‘Rigid’ aggregations remain valid • Example (diagram): when member X is moved from a child of A to a child of C, all aggregations involving A or C are invalidated
Aggregation Data Storage • No impact on fact data or rigid aggregation data when a changing dimension is incrementally processed • Flexible aggregations are invalidated when a changing dimension is incrementally processed • Data is in three files: • partitionName.fact.data • partitionName.agg.rigid.data • [partitionName.agg.flex.data] • Diagram: aggregations including the (All) level are rigid (if all other levels in the aggregation are rigid); aggregations including an intermediate level of a changing dimension are always flexible; aggregations at the leaf level are rigid (if all other levels in the aggregation are rigid)
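The rigid/flexible distinction above can be sketched as a per-aggregation rule. The rule and the level names here are assumptions drawn from the slide's diagram, not the server's actual logic:

```python
# Hypothetical intermediate levels of a changing dimension: an aggregation
# that includes any of these is flexible; aggregations built only from the
# (All) level or leaf keys of unchanged levels stay rigid.
FLEXIBLE_LEVELS = {"State", "City"}

def aggregation_kind(levels):
    """levels: the level chosen from each dimension for this aggregation.
    Returns 'flexible' if any chosen level belongs to a changing
    dimension's intermediate levels, else 'rigid'."""
    return "flexible" if any(l in FLEXIBLE_LEVELS for l in levels) else "rigid"

print(aggregation_kind(("All", "Brand")))    # rigid
print(aggregation_kind(("State", "Brand")))  # flexible
```

A rigid aggregation survives an incremental dimension process (it lives in partitionName.agg.rigid.data); a flexible one is deleted and must be lazily rebuilt.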
Incremental Dimension Processing (Changing Dimension) • Steps: query and process dimension data (keys, member names, member properties); then for each cube using this dimension: delete flexible aggregations and indexes, start lazy aggregating, start lazy indexing • Potential resource competition during lazy processing after a changing dimension is incrementally processed • Fewer aggregations! • Result: query performance degradation
Demo: Flexible Aggregations
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Data Storage • No data stored for empty member combinations • With compression, data storage is approximately 1/3 of the space required in the RDBMS source • Data is stored by record in pages • Each record contains all measures at an intersection of dimension members • Page layout: Record 1: mbrd1, mbrd2, … mbrdn, m1, m2, … mn; Record 2: mbrd1, mbrd2, … mbrdn, m1, m2, … mn; … Record 256: mbrd1, mbrd2, … mbrdn, m1, m2, … mn
Data Structures • Partition data is stored in a file divided into segments • Each segment contains 256 pages (each with 256 records) = 64K records • Data file: Segment 1, Segment 2, … Segment n • Only the last segment has fewer than 256 pages
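The 256 × 256 layout above implies a simple addressing scheme. This sketch maps a record number to its (segment, page, slot) triple; it is illustrative arithmetic only, not the actual on-disk format:

```python
RECORDS_PER_PAGE = 256
PAGES_PER_SEGMENT = 256
RECORDS_PER_SEGMENT = RECORDS_PER_PAGE * PAGES_PER_SEGMENT  # 65,536 = 64K

def locate(record_number):
    """Map a 0-based record number to 0-based (segment, page, slot)."""
    segment, within_segment = divmod(record_number, RECORDS_PER_SEGMENT)
    page, slot = divmod(within_segment, RECORDS_PER_PAGE)
    return segment, page, slot

print(locate(0))      # (0, 0, 0)   -- first record of the file
print(locate(70000))  # (1, 17, 112) -- spills into the second segment
```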
Clustering • Physical order of the records in each page and segment is organized to improve performance • Keeps records with the same or close members together • Similar in concept to a SQL clustered index, where data is sorted by key values • Tries to minimize distribution of records with the same member across segments and pages • Optimized, but no algorithm can keep all records for the same member together (unless the cube contains a single dimension) • Similarly, SQL can have only a single clustered index • Records with identical dimension members can be in multiple segments • Data is read and processed in chunks (more on this later…)
Indexing • How is the data retrieved? • Cubes can be in the terabyte range • Scanning data files not an option • Need an index by dimension member • Answers question “Where is the data associated with this combination of dimension members?” • Map files provide this
Map Files (diagram) • There is a map for each dimension indicating the pages on which a member is included in a data record (e.g., a Dimension 1 Map and a Dimension 2 Map pointing into pages 1–256 of a segment) • To resolve a query containing a member from each dimension, get the list of pages containing all the members
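Resolving a query via map files amounts to intersecting per-dimension page lists. A minimal sketch with hypothetical maps (sets stand in for whatever page-list encoding the server actually uses):

```python
# Hypothetical per-dimension maps: member -> pages containing
# records for that member.
dim1_map = {"USA": {1, 3, 5, 8}, "Canada": {2, 3}}
dim2_map = {"Acme": {3, 5, 9}, "Globex": {1, 2}}

def candidate_pages(member1, member2):
    """Only pages in every member's page list can hold records for
    this member combination, so scan just the intersection."""
    return dim1_map[member1] & dim2_map[member2]

print(candidate_pages("USA", "Acme"))  # {3, 5}
```

An empty intersection means the combination has no data at all, so no pages are read.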
Other Approaches • Array based • Normally allocates a cell for every combination • Result: data explosion – much more disk space and longer processing times • Mix of record and array • ‘Dense’ dimensions are record-like • Sparse dimensions are array-like • A bit is used per empty cell – sparsity explodes database sizes • User decides whether a dimension is dense or sparse
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Processing Buffer Memory Settings • ‘Read-Ahead Buffer Size’ is the buffer containing data read from the source database • Defined in the Server Properties dialog. Default: 4 MB • Rarely important – little effect when changed • Data is processed in chunks of ‘Process Buffer Size’ • Defined in the Server Properties dialog • Data is clustered within the Process Buffer Size • The bigger the Process Buffer Size, the better – make it as big as possible • Data for dimension members is clustered to keep data for ‘close’ members close together • The larger these memory settings are, the more effective the clustering
Incremental Processing • Two-step process: • First, a partition is created with the incremental data • Second, the partition is merged with the original • Complete segments of both partitions are left intact – incomplete ones are merged • After many incremental processes, data becomes distributed: degraded performance • Reprocessing (if you have a large Process Buffer Size) can restore performance
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Querying a Cube • CLIENT: SELECT {[North America], [USA], [Canada]} ON ROWS, Measures.MEMBERS ON COLUMNS FROM myCube • Need two things: • Getting the dimension members – the axes • Getting the data
Resolve Axis • Dimension members are cached on the client in the ‘Client Member Cache’ • Levels with fewer members than the Large Level Threshold are sent as a group • Levels with more members than the Large Level Threshold are retrieved as needed • Large Level Threshold default value: 1000; can be changed in a server property and in the connection string • Where members are not cached, members and descendants are retrieved to the client until the needed member is retrieved • Levels whose members have 1000s of siblings result in degraded performance • The member cache is not cleaned except on disconnect or when the cube structure changes • Diagram: cached members form ‘Christmas trees’ around the requested member, surrounded by non-cached members
Client Data Cache • Client retains data from previous queries in the client data cache • The Client Cache Size property controls how much data is in the client cache: • 0: unlimited • 1–99 (inclusive): percent of physical memory • >99: use up to the value in KB • Default value: 25 • When exceeded, the client cache is cleaned at cube granularity
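The three-way interpretation of Client Cache Size can be expressed directly. This is a sketch of the rule as stated above, not the actual client code:

```python
def client_cache_limit_kb(setting, physical_memory_kb):
    """Interpret the Client Cache Size property:
    0 = unlimited, 1-99 = percent of physical memory, >99 = limit in KB."""
    if setting == 0:
        return None  # unlimited
    if 1 <= setting <= 99:
        return physical_memory_kb * setting // 100
    return setting  # absolute limit in KB

# The default of 25 on a machine with 1 GB of physical memory:
print(client_cache_limit_kb(25, 1024 * 1024))  # 262144 KB = 256 MB
print(client_cache_limit_kb(0, 1024 * 1024))   # None (unlimited)
print(client_cache_limit_kb(500, 1024 * 1024)) # 500 KB
```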
How Cubes Are Queried (diagram): the client’s Client Data Cache sends queries to the service’s Query Processor, which draws on Dimension Memory and Cache Memory, and on data on disk organized as partitions (e.g., Canada, Mexico, USA), each containing Segments 1–6
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Service Start Up • Minimum Allocated Memory defines the amount of memory completely dedicated to the server • All dimensions in the database are retained in memory (Dimension Memory) • Tip: invalidate a dimension if it is not used in a cube • Dimension requirements: ~125 bytes per member, plus member properties • 1M members: 125 MB • With a 25-character member property (e.g., Address): 175 MB • Large dimensions can migrate to a separate process space
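The slide's arithmetic (125 bytes per member; 1M members = 125 MB; adding a 25-character property brings it to 175 MB) is reproduced below. The 2-bytes-per-character figure is an assumption chosen because it makes the slide's numbers work out:

```python
BYTES_PER_MEMBER = 125  # approximate per-member cost from the slide

def dimension_memory_bytes(members, property_chars=0, bytes_per_char=2):
    """Rough dimension-memory estimate: ~125 bytes per member plus
    member-property storage (bytes_per_char=2 reproduces the slide's
    175 MB figure; the exact per-character cost is an assumption)."""
    return members * (BYTES_PER_MEMBER + property_chars * bytes_per_char)

print(dimension_memory_bytes(1_000_000))      # 125000000 (~125 MB)
print(dimension_memory_bytes(1_000_000, 25))  # 175000000 (~175 MB)
```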
During Processing • Shadow dimensions: two copies of dimensions are stored in memory while processing • Processing buffers: Read-Ahead Buffer Size and Process Buffer Size • If dimension and processing-buffer memory requirements exceed the Memory Conservation Threshold, there is no room for the data cache • (diagram: Minimum Allocated Memory, Dimension Memory, Shadow Dimensions, Processing Buffers, Available Cache, Memory Conservation Threshold)
During Querying • Data cache stores query data for reuse – faster than retrieving from storage • If Dimension Memory requirements exceed the Memory Conservation Threshold, there is no data cache • A ‘Cleaner’ wakes up periodically to reclaim memory from the data cache • BackgroundInterval registry setting; default value: 30 seconds • Cleaning policy: • Below 0.5 × (Minimum Allocated Memory + Memory Conservation Threshold): no cleaning • Between 0.5 × (Minimum Allocated Memory + Memory Conservation Threshold) and the Memory Conservation Threshold: mild cleaning • Above the Memory Conservation Threshold: aggressive cleaning
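One plausible reading of the cleaner's three-band policy above, as a sketch; which exact quantity the server compares against these thresholds is an assumption here:

```python
def cleaning_level(memory_in_use, min_allocated, conservation_threshold):
    """Three-band cleaner policy: no cleaning below the midpoint of
    Minimum Allocated Memory and the Memory Conservation Threshold,
    mild cleaning between the midpoint and the threshold, and
    aggressive cleaning above the threshold."""
    midpoint = 0.5 * (min_allocated + conservation_threshold)
    if memory_in_use <= midpoint:
        return "none"
    if memory_in_use < conservation_threshold:
        return "mild"
    return "aggressive"

# With Minimum Allocated Memory = 200 and threshold = 800, midpoint = 500:
print(cleaning_level(300, 200, 800))  # none
print(cleaning_level(600, 200, 800))  # mild
print(cleaning_level(900, 200, 800))  # aggressive
```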
Demo: Setting Server Properties
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Distinct Count • Business problem – the Sales Manager wants to know: • “How many customers are buying Computers?” • “How many active customers do I have?”

Product        Sales   Number of Customers
All products    8000   200
Hardware        3300    80
  Computers     2000    70
  Monitors       800    60
  Printers       500    30
Software        4700   150
  Home          1500   100
  Business      2500   100
  Games          700    80
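Notice in the table that Sales adds up (2000 + 800 + 500 = 3300) but customer counts do not (70 + 60 + 30 ≠ 80): a customer who buys both a computer and a monitor must be counted once at the Hardware level. A tiny illustration with hypothetical purchases:

```python
# Hypothetical purchase records: (customer, product_group)
purchases = [
    ("alice", "Computers"), ("alice", "Monitors"),
    ("bob", "Computers"), ("carol", "Printers"),
]

def distinct_customers(group=None):
    """COUNT(DISTINCT customer) -- unlike SUM, children's distinct
    counts cannot simply be added, since one customer can appear
    under several children."""
    return len({c for c, g in purchases if group is None or g == group})

print(distinct_customers("Computers"))  # 2
print(distinct_customers("Monitors"))   # 1
print(distinct_customers("Printers"))   # 1
# Sum of the children is 4, but the overall distinct count is only 3:
print(distinct_customers())             # 3
```

This non-additivity is why distinct count needs the special storage and query handling described on the next slides.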
Distinct Count: Changes to Data Structure • DC measure stored with each fact and aggregation record • Just like a new dimension • Data ordered by the DC measure • “ORDER BY” included in the SQL statement during processing • Number of records can increase by orders of magnitude • Dependent on the number of distinct values per record • (diagram: a sample aggregate record without distinct count vs. the increased record count with distinct count on customers)
Distinct Count: Changes to Query • Single thread per partition instead of per segment • Unlike regular cubes, results from each segment cannot simply be aggregated, since a single value of the DC measure can cross segments • Consequently – performance impact • A dimension slice requires much more disk scanning than before • Segments are clustered by the DC measure • Expensive
Distinct Count Tips • Keep DC measures in their own cube • All measures are retrieved on query – even if some are not asked for • Create a virtual cube to merge DC with other measures • Incremental processing of DC cubes is very expensive • Segments are restructured and reordered to keep records ordered by the DC measure • Time and memory intensive
Distinct Count Tips • Unlike regular cubes, it is best to distribute DC values evenly across each partition • Most effective use of multiple threads for query processing • If you have a dimension that corresponds to the distinct count measure, aggregations are recommended only on the lowest level • (Example: Customer dimension in the cube, Customer as the distinct count measure)
Summary • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Don’t forget to complete the online Session Feedback form on the Attendee Web site: https://web.mseventseurope.com/teched/