DAT353 Analysis Services: Server Internals Tom Conlon, Program Manager SQL Server Business Intelligence Unit Microsoft Corporation
Purpose of this Session • Remove some of the mystery • Explain how we do some things so much better than our competitors • Things are easier to understand once the internals are understood • Requirements: • You already know the basics – this is for the experienced
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Architecture – Single Server (diagram): an Application queries the Analysis Server through ADO MD and OLE DB for OLAP via the PivotTable Service; Analysis Manager administers the server through DSO; during processing, the server reads from the SQL Server Data Warehouse or other OLE DB providers into the OLAP Store
Component Architecture – Query (diagram): the client (MSOLAP80.DLL) contains a Formula Engine, Metadata Manager, Agent, and Cache; MDX flows to the server (MSMDSRV.EXE), which has its own Formula Engine, Metadata Manager, Storage Engine, and Cache
Component Architecture - Management MSMDGD80.DLL MSMDCB80.DLL DCube Storage Engine METADATA MANAGER DCube PARSER Server Storage Engine METADATA MANAGER METADATA MANAGER FORMULA ENGINE AGENT FORMULA ENGINE MDX DDL CACHE CACHE MMSMDSRV.EXE MSOLAP80.DLL
Component Architecture – Distributed (diagram): the same components as the management picture (MSMDGD80.DLL, MSMDCB80.DLL, MSOLAP80.DLL, MSMDSRV.EXE), plus a second server instance of MSMDSRV.EXE with its own Storage Engine, Metadata Manager, and Cache
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Why Aggregations? • Aggregations can result in orders-of-magnitude improvements in performance • Don’t have to access every fact table record to determine a query result • Further savings with data compression • Biggest savings: reduced disk scans
Aggregations – Overview (diagram): Customers hierarchy (All Customers → Country → State → City → Name) crossed with Product hierarchy (All Products → Category → Brand → Name → SKU); the highest-level aggregation sits at the top, intermediate aggregations in between, and the facts at the bottom
Partial Aggregation • Don’t want to create all possible aggregations • Data explosion! • What if a query is made to a combination of levels where no aggregation exists? • Can compute from lower-level aggregations • Don’t need to compute every possible aggregation • Example: queries combining Country and Brand can be answered if a Country-by-Name aggregation exists (Customers: All Customers → Country → State → City → Name; Product: All Products → Category → Brand → Name → SKU)
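The roll-up the slide describes can be sketched in a few lines. The data and the name-to-brand mapping below are hypothetical, and Python is used purely for illustration; the point is that an existing Country-by-Name aggregation answers a Country-by-Brand query without touching the fact table:

```python
from collections import defaultdict

# Hypothetical aggregation already materialized at (Country, Product Name):
country_by_name = {
    ("USA", "Widget X"): 500,
    ("USA", "Gadget Y"): 300,
    ("Canada", "Widget X"): 200,
}

# Hypothetical mapping of product names up to the Brand level.
brand_of = {"Widget X": "Acme", "Gadget Y": "Globex"}

def rollup_to_brand(agg):
    """Answer a (Country, Brand) query by summing the finer
    (Country, Name) aggregation -- no fact-table scan needed."""
    result = defaultdict(int)
    for (country, name), sales in agg.items():
        result[(country, brand_of[name])] += sales
    return dict(result)

print(rollup_to_brand(country_by_name))
# {('USA', 'Acme'): 500, ('USA', 'Globex'): 300, ('Canada', 'Acme'): 200}
```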
Aggregations (diagram): a pyramid from the Fact Table at the bottom, through the most detailed aggregations (m,m,m,…), up to the highest level of aggregation (1,1,1,1,…) – “Show me all sales for all products for all …”
Partial Aggregation (diagram): the same pyramid, but with only some aggregations materialized between the Fact Table and the highest level of aggregation
Aggregation Design (diagram): candidate aggregations sitting above the Fact Table, combining levels such as Product Family by Quarter and Product by Month
Aggregation Design Results • Result: aggregations designed in waves from the top of the pyramid • At 100% aggregation, the ‘waves’ all touch: overkill • 20–30% is generally adequate (0% for smaller cubes) • (diagram: waves above the Fact Table)
Aggregation Design • Which aggregations are more important than others? • All are equal • From a design perspective, select the ones that result in overall improved query performance • Usage-Based Optimization: weightings on each aggregation based on usage frequency
Flexible and Rigid Aggregations • Changing dimensions allow members to be moved, added, and deleted • After members move, only an incremental process of the dimension is required • ‘Flexible’ aggregations are deleted when a changing dimension is incrementally processed • ‘Rigid’ aggregations remain valid • Example (diagram): when member X is moved from a child of A to a child of C, all aggregations involving A or C are invalidated
Aggregation Data Storage • No impact on fact data or rigid aggregation data when a changing dimension is incrementally processed • Flexible aggregations are invalidated when a changing dimension is incrementally processed • Data is in three files: • partitionName.fact.data • partitionName.agg.rigid.data • [partitionName.agg.flex.data] • Diagram: aggregations including the (All) level are rigid (if all other levels in the aggregation are rigid); aggregations including an intermediate level of a changing dimension are always flexible; aggregations at the leaf level are rigid (if all other levels in the aggregation are rigid)
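The rigid/flexible distinction above can be sketched as a per-aggregation rule. The rule and the level names here are assumptions drawn from the slide's diagram, not the server's actual logic:

```python
# Hypothetical intermediate levels of a changing dimension: an aggregation
# that includes any of these is flexible; aggregations built only from the
# (All) level or leaf keys of unchanged levels stay rigid.
FLEXIBLE_LEVELS = {"State", "City"}

def aggregation_kind(levels):
    """levels: the level chosen from each dimension for this aggregation.
    Returns 'flexible' if any chosen level belongs to a changing
    dimension's intermediate levels, else 'rigid'."""
    return "flexible" if any(l in FLEXIBLE_LEVELS for l in levels) else "rigid"

print(aggregation_kind(("All", "Brand")))    # rigid
print(aggregation_kind(("State", "Brand")))  # flexible
```

A rigid aggregation survives an incremental dimension process (it lives in partitionName.agg.rigid.data); a flexible one is deleted and must be lazily rebuilt.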
Incremental Dimension Processing (Changing Dimension) • Steps: query and process dimension data (keys, member names, member properties); then for each cube using this dimension: delete flexible aggregations and indexes, start lazy aggregating, start lazy indexing • Potential resource competition during lazy processing after a changing dimension is incrementally processed • Fewer aggregations! • Result: query performance degradation
Demo: Flexible Aggregations
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Data Storage • No data stored for empty member combinations • With compression, data storage is approximately 1/3 of the space required in the RDBMS source • Data is stored by record in pages • Each record contains all measures at an intersection of dimension members • Page layout: Record 1: mbrd1, mbrd2, … mbrdn, m1, m2, … mn; Record 2: mbrd1, mbrd2, … mbrdn, m1, m2, … mn; … Record 256: mbrd1, mbrd2, … mbrdn, m1, m2, … mn
Data Structures • Partition data is stored in a file divided into segments • Each segment contains 256 pages (each with 256 records) = 64K records • Data file: Segment 1, Segment 2, … Segment n • Only the last segment has fewer than 256 pages
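The 256 × 256 layout above implies a simple addressing scheme. This sketch maps a record number to its (segment, page, slot) triple; it is illustrative arithmetic only, not the actual on-disk format:

```python
RECORDS_PER_PAGE = 256
PAGES_PER_SEGMENT = 256
RECORDS_PER_SEGMENT = RECORDS_PER_PAGE * PAGES_PER_SEGMENT  # 65,536 = 64K

def locate(record_number):
    """Map a 0-based record number to 0-based (segment, page, slot)."""
    segment, within_segment = divmod(record_number, RECORDS_PER_SEGMENT)
    page, slot = divmod(within_segment, RECORDS_PER_PAGE)
    return segment, page, slot

print(locate(0))      # (0, 0, 0)   -- first record of the file
print(locate(70000))  # (1, 17, 112) -- spills into the second segment
```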
Clustering • Physical order of the records in each page and segment is organized to improve performance • Keeps records with the same or close members together • Similar in concept to a SQL clustered index, where data is sorted by key values • Tries to minimize distribution of records with the same member across segments and pages • Optimized, but no algorithm can keep all records for the same member together (unless the cube contains a single dimension) • Similarly, SQL can have only a single clustered index • Records with identical dimension members can be in multiple segments • Data is read and processed in chunks (more on this later…)
Indexing • How is the data retrieved? • Cubes can be in the terabyte range • Scanning data files not an option • Need an index by dimension member • Answers question “Where is the data associated with this combination of dimension members?” • Map files provide this
Map Files (diagram) • There is a map for each dimension indicating the pages on which a member is included in a data record (e.g., a Dimension 1 Map and a Dimension 2 Map pointing into pages 1–256 of a segment) • To resolve a query containing a member from each dimension, get the list of pages containing all the members
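Resolving a query via map files amounts to intersecting per-dimension page lists. A minimal sketch with hypothetical maps (sets stand in for whatever page-list encoding the server actually uses):

```python
# Hypothetical per-dimension maps: member -> pages containing
# records for that member.
dim1_map = {"USA": {1, 3, 5, 8}, "Canada": {2, 3}}
dim2_map = {"Acme": {3, 5, 9}, "Globex": {1, 2}}

def candidate_pages(member1, member2):
    """Only pages in every member's page list can hold records for
    this member combination, so scan just the intersection."""
    return dim1_map[member1] & dim2_map[member2]

print(candidate_pages("USA", "Acme"))  # {3, 5}
```

An empty intersection means the combination has no data at all, so no pages are read.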
Other Approaches • Array based • Normally allocates a cell for every combination • Result: data explosion – much more disk space and longer processing times • Mix of record and array • ‘Dense’ dimensions are record-like • Sparse dimensions are array-like • A bit is used per empty cell – sparsity explodes database sizes • User decides whether a dimension is dense or sparse
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Processing Buffer Memory Settings • ‘Read-Ahead Buffer Size’ is the buffer containing data read from the source database • Defined in the Server Properties dialog. Default: 4 MB • Rarely important – little effect when changed • Data is processed in chunks of ‘Process Buffer Size’ • Defined in the Server Properties dialog • Data is clustered within the Process Buffer Size • The bigger the Process Buffer Size, the better – make it as big as possible • Data for dimension members is clustered to keep data for ‘close’ members close together • The larger these memory settings are, the more effective the clustering
Incremental Processing • Two-step process: • First, a partition is created with the incremental data • Second, the partition is merged with the original • Complete segments of both partitions are left intact – incomplete ones are merged • After many incremental processes, data becomes distributed: degraded performance • Reprocessing (if you have a large Process Buffer Size) can restore performance
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Querying a Cube • CLIENT: SELECT {[North America], [USA], [Canada]} ON ROWS, Measures.MEMBERS ON COLUMNS FROM myCube • Need two things: • Getting the dimension members – the axes • Getting the data
Resolve Axis • Dimension members are cached on the client in the ‘Client Member Cache’ • Levels with fewer members than the Large Level Threshold are sent as a group • Levels with more members than the Large Level Threshold are retrieved as needed • Large Level Threshold default value: 1000; can be changed in a server property and in the connection string • Where members are not cached, members and descendants are retrieved to the client until the needed member is retrieved • Levels whose members have 1000s of siblings result in degraded performance • The member cache is not cleaned except on disconnect or when the cube structure changes • Diagram: cached members form ‘Christmas trees’ around the requested member, surrounded by non-cached members
Client Data Cache • Client retains data from previous queries in the client data cache • The Client Cache Size property controls how much data is in the client cache: • 0: unlimited • 1–99 (inclusive): percent of physical memory • >99: use up to the value in KB • Default value: 25 • When exceeded, the client cache is cleaned at cube granularity
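The three-way interpretation of Client Cache Size can be expressed directly. This is a sketch of the rule as stated above, not the actual client code:

```python
def client_cache_limit_kb(setting, physical_memory_kb):
    """Interpret the Client Cache Size property:
    0 = unlimited, 1-99 = percent of physical memory, >99 = limit in KB."""
    if setting == 0:
        return None  # unlimited
    if 1 <= setting <= 99:
        return physical_memory_kb * setting // 100
    return setting  # absolute limit in KB

# The default of 25 on a machine with 1 GB of physical memory:
print(client_cache_limit_kb(25, 1024 * 1024))  # 262144 KB = 256 MB
print(client_cache_limit_kb(0, 1024 * 1024))   # None (unlimited)
print(client_cache_limit_kb(500, 1024 * 1024)) # 500 KB
```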
How Cubes Are Queried (diagram): the client’s Client Data Cache sends queries to the service’s Query Processor, which draws on Dimension Memory and Cache Memory, and on data on disk organized as partitions (e.g., Canada, Mexico, USA), each containing Segments 1–6
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Service Start Up • Minimum Allocated Memory defines the amount of memory completely dedicated to the server • All dimensions in the database are retained in memory (Dimension Memory) • Tip: invalidate a dimension if it is not used in a cube • Dimension requirements: ~125 bytes per member, plus member properties • 1M members: 125 MB • With a 25-character member property (e.g., Address): 175 MB • Large dimensions can migrate to a separate process space
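The slide's arithmetic (125 bytes per member; 1M members = 125 MB; adding a 25-character property brings it to 175 MB) is reproduced below. The 2-bytes-per-character figure is an assumption chosen because it makes the slide's numbers work out:

```python
BYTES_PER_MEMBER = 125  # approximate per-member cost from the slide

def dimension_memory_bytes(members, property_chars=0, bytes_per_char=2):
    """Rough dimension-memory estimate: ~125 bytes per member plus
    member-property storage (bytes_per_char=2 reproduces the slide's
    175 MB figure; the exact per-character cost is an assumption)."""
    return members * (BYTES_PER_MEMBER + property_chars * bytes_per_char)

print(dimension_memory_bytes(1_000_000))      # 125000000 (~125 MB)
print(dimension_memory_bytes(1_000_000, 25))  # 175000000 (~175 MB)
```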
During Processing • Shadow dimensions: two copies of dimensions are stored in memory while processing • Processing buffers: Read-Ahead Buffer Size and Process Buffer Size • If dimension and processing-buffer memory requirements exceed the Memory Conservation Threshold, there is no room for the data cache • (diagram: Minimum Allocated Memory, Dimension Memory, Shadow Dimensions, Processing Buffers, Available Cache, Memory Conservation Threshold)
During Querying • Data cache stores query data for reuse – faster than retrieving from storage • If Dimension Memory requirements exceed the Memory Conservation Threshold, there is no data cache • A ‘Cleaner’ wakes up periodically to reclaim memory from the data cache • BackgroundInterval registry setting; default value: 30 seconds • Cleaning policy: • Below 0.5 × (Minimum Allocated Memory + Memory Conservation Threshold): no cleaning • Between 0.5 × (Minimum Allocated Memory + Memory Conservation Threshold) and the Memory Conservation Threshold: mild cleaning • Above the Memory Conservation Threshold: aggressive cleaning
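One plausible reading of the cleaner's three-band policy above, as a sketch; which exact quantity the server compares against these thresholds is an assumption here:

```python
def cleaning_level(memory_in_use, min_allocated, conservation_threshold):
    """Three-band cleaner policy: no cleaning below the midpoint of
    Minimum Allocated Memory and the Memory Conservation Threshold,
    mild cleaning between the midpoint and the threshold, and
    aggressive cleaning above the threshold."""
    midpoint = 0.5 * (min_allocated + conservation_threshold)
    if memory_in_use <= midpoint:
        return "none"
    if memory_in_use < conservation_threshold:
        return "mild"
    return "aggressive"

# With Minimum Allocated Memory = 200 and threshold = 800, midpoint = 500:
print(cleaning_level(300, 200, 800))  # none
print(cleaning_level(600, 200, 800))  # mild
print(cleaning_level(900, 200, 800))  # aggressive
```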
Demo: Setting Server Properties
Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Distinct Count • Business problem – the Sales Manager wants to know: • “How many customers are buying Computers?” • “How many active customers do I have?”

Product        Sales   Number of Customers
All products    8000   200
Hardware        3300    80
  Computers     2000    70
  Monitors       800    60
  Printers       500    30
Software        4700   150
  Home          1500   100
  Business      2500   100
  Games          700    80
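Notice in the table that Sales adds up (2000 + 800 + 500 = 3300) but customer counts do not (70 + 60 + 30 ≠ 80): a customer who buys both a computer and a monitor must be counted once at the Hardware level. A tiny illustration with hypothetical purchases:

```python
# Hypothetical purchase records: (customer, product_group)
purchases = [
    ("alice", "Computers"), ("alice", "Monitors"),
    ("bob", "Computers"), ("carol", "Printers"),
]

def distinct_customers(group=None):
    """COUNT(DISTINCT customer) -- unlike SUM, children's distinct
    counts cannot simply be added, since one customer can appear
    under several children."""
    return len({c for c, g in purchases if group is None or g == group})

print(distinct_customers("Computers"))  # 2
print(distinct_customers("Monitors"))   # 1
print(distinct_customers("Printers"))   # 1
# Sum of the children is 4, but the overall distinct count is only 3:
print(distinct_customers())             # 3
```

This non-additivity is why distinct count needs the special storage and query handling described on the next slides.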
Distinct Count: Changes to Data Structure • DC measure stored with each fact and aggregation record • Just like a new dimension • Data ordered by the DC measure • “ORDER BY” included in the SQL statement during processing • Number of records can increase by orders of magnitude • Dependent on the number of distinct values per record • (diagram: a sample aggregate record without distinct count vs. the increased record count with distinct count on customers)
Distinct Count: Changes to Query • Single thread per partition instead of per segment • Unlike regular cubes, results from each segment cannot simply be aggregated, since a single value of the DC measure can cross segments • Consequently – performance impact • A dimension slice requires much more disk scanning than before • Segments are clustered by the DC measure • Expensive
Distinct Count Tips • Keep DC measures in their own cube • All measures are retrieved on query – even if some are not asked for • Create a virtual cube to merge DC with other measures • Incremental processing of DC cubes is very expensive • Segments are restructured and reordered to keep records ordered by the DC measure • Time and memory intensive
Distinct Count Tips • Unlike regular cubes, it is best to distribute DC values evenly across each partition • Most effective use of multiple threads for query processing • If you have a dimension that corresponds to the distinct count measure, aggregations are recommended only on the lowest level • (Example: Customer dimension in the cube, Customer as the distinct count measure)
Summary • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count
Don’t forget to complete the online Session Feedback form on the Attendee Web site: https://web.mseventseurope.com/teched/