300 likes | 422 Views
Lecture 3. Themes in this session Basics of the multidimensional data model and star-join schemata The process of, and specific design issues in, multidimensional data modelling. Why use Multidimensional modelling. Simple, intuitive design basis easy to create a multidimensional model
E N D
Lecture 3 Themes in this session • Basics of the multidimensional data model and star-join schemata • The process of, and specific design issues in, multidimensional data modelling
Why use Multidimensional modelling • Simple, intuitive design basis • easy to create a multidimensional model • easy to communicate the meaning of the model • easy to gain an overview of the model • A logical model which can be implemented in a variety of databases • relational databases • multidimensional databases • object-oriented databases • Supports the reporting and analytical needs of business users • Tried and tested
General structure for a multidimensional model • A central fact table, referred to as a multidimensional data subject • Surrounding dimension tables, referred to as single dimensional data subjects • Joins connecting the fact table and its surrounding dimension tables. Only one join per dimension table • A concatenated or multipart key in the fact table which is comprised of one key from each dimension table. NOTE: the multidimensional model represents a n-dimensional matrix, with n being the number of dimensions
The fact table • Measures or facts • reflect focal events or snapshots of states of being • vary continuously over time • All the facts have a specific granularity • Facts should ideally be additive but this is not always the case • A set of foreign keys constituting a concatenated key • Contains the major volume of data
The dimension tables • Dimensions are often referred to as causal dimensions, they contain the causal factors responsible for the collected measures • The time dimension is not a causal dimension, it is however one of the most important dimensions for structuring and analysing data • The dimension tables contain dimensional attributes, these are usually textual and discrete and must have a relevant business meaning • Good dimensional attributes are stable across time • If the attributes are connected in one or more hierarchies then these are usually captured in the dimension table
Skinny fact tables • As the fact table contains the vast volume of records it is important that it is memory space efficient. • Foreign keys are usually represented in integer from and do not require much memory space • Facts too are often numeric properties and can usually be represented as integers (contrast to dimensional attributes which are usually long text strings) • This space efficiency is critical to the memory space consumption of the data warehouse
Aggregation • Lowest level of aggregation is determined by the granularity of the fact table. • Aggregations can be created on-the-fly or by the process of pre-aggregation • Pre-aggregation demands more storage space but provides better query performance • Aggregation is easier when facts are all additive
Sparsity • The matrices, represented by multidimensional models are often 99% sparse. • Sparsity is dealt with by simply not creating records for the cells that are not filled in the matrix. If nothing has happened no record is created. • Pre-aggregation and storage of aggregates can however lead to sparsity failure which places large demands on data storage
De-normalisation In 3rd normal from all mutually independent and fully dependent on the primary key • The fact table is by nature highly normalised in a star-join schema • Dimensions are however usually not normalised
Snowflakes and normalised dimension tables • “Any attempts to normalise dimension tables in order to save disk space are a waste of time” • Affects the intuitive understandability of the diagram • Normalised dimension tables destroy the ability to “browse” • Normalised tables demand extra joins and the querying of snowflakes take s longer than the querying of standard star-join schemata
What to focus on in MDM • Query optimised database • Whole business entities • Key business activities and influences • Transaction history • People, places and things • Time • Dimension and rollup
Basic steps in modelling • Select a business subject area • Identify which business process(es) is being modelled • Identify the basic measures or facts • Determine at what level of detail (granularity) active analysis is conducted • Determine what the measures have in common (identify the dimensions) • Identify the relevant attributes in the various dimensions • Determine if the attributes are stable or variable over time and if their cardinality is bounded or unbounded
Identifying facts • The purpose of the analyst must be supported by the facts in the fact table, there must be measures which have relevance to the business goals which the organisation seeks to fulfil • Facts are by nature dynamic and variable over time • They do not have a limited cardinality • Facts have their origins in the working of the organisation and the activities it performs
Identifying dimensional attributes • Dimensional attributes are those predictive variables business users believe are of significance to the measures in the fact table • Dimensional attributes are often present in hierarchies in the causal dimensions • Dimensional attributes usually have a limited cardinality and are non-variable across time.
Supporting multiple hierarchies in dimensions • Dimensions should be able to support multiple independent hierarchies • Alternate hierarchies are easily supported • Cyclic paths in two hierarchies demand that the hierarchies be split into two entirely separate hierarchies or even two separate dimensions • A dimension can also contain attributes that do not have any hierarchical relationships to the other attributes in the dimension NOTE: Any of the attributes, whether in a hierarchy or not, can be used in the drill down process
Factless fact tables • Some fact tables quite simply have no measured facts! • Often used to represent many-to-many relationships • The only thing they contain is a concatenated key, they do still however represent a focal event which is identified by the combination of conditions referenced in the dimension tables • There are two main types of factless fact tables: • event tracking tables • coverage tables
Dealing with many-to-many relationships among dimensional attributes • Many to many relationships are difficult to deal with in a any database design situation. Great efforts should be taken to identify any in the data model • When creating a MDM it is necessary to separate the two entities and capture their relationship in a factless fact table
Dealing with semi-additive and non-additive facts • Semi-additive facts are those which are not additive across all dimensions • warn users • prohibit the addition of these facts across the relevant dimensions • Non-additive facts are not additive across any dimensions • most ratios and all measures depicting snapshots of a state fall into this class • in some cases other calculatory methods can be used to aggregate these measures • average over the number of time periods • calculate the ratio of the sums not the sum of the ratios
Degenerate dimensions • A dimension which has been cannibalised by other dimensions • Represented as a dimension key without a corresponding dimension • The key to this dimension still serves as a vital element for the grouping of facts • This often happens when the grain in the fact tables represents actual working documents
Special mention - handling unit prices and costs • Could usually be considered to be dimension attributes of a product • However, every price change would then result in a change in one of the dimensions (remember these are supposed to be stable) • Instead they are included as non-additive facts in the fact table. These are often used in a lot of derived measures which are presented to the business user in a specially created view • This is once again a snapshot of a state, the state being the cost or price of the goods at a specific point in time
Depicting processes and value chains as families of star-join schemas • Kimball speaks of their being two sides to the value chain • the demand side - the steps needed to satisfy the customers’ demand for the product • the supply side - the steps needed to manufacture the products from original ingredients or parts • The chain consists of a sequence of inventory and flow star-join schemata • joining the different star-join schemata is only possible when two sequential schemata have a common, identical dimension • Sometimes the represented chain can be extended beyond the bounds of the business itself
Creating mini-dimensions for really large dimensions • Many dimension attributes are used very frequently as browsing constraints, in big dimensions these constraints can be hard to “find” among the lesser used ones • Logical groups of often used constraints can be separated into small dimensions which are very well indexed and easily accessible for browsing • All variables in these mini-dimensions must be presented as distinct bands or classes • The key to the mini-dimension can be places as a foreign key in both the fact table and dimension table from which it has been broken off • Mini-dimensions, as their name suggests, should be kept small and compact
Slowly changing dimensions • Most dimensions are not constant over time • Most dimensions are however almost constant over time • Almost constant dimensions are referred to as slowly-changing dimensions • There are three main methods of handling slowly changing dimensions: • Overwriting • Creating additional dimensional records • Creating new current fields within the original dimension Note: One of the key functions of the data warehouse is to track events over extended periods of time. The validity of the data warehouse is thus dependent on how well changes in the its dimensions are tracked.
Slowly changing dimensions (1) • The dimensional attribute record is overwritten with the new value • No changes are needed elsewhere in the dimension record • No keys are affected anywhere in the database • Very easy to implement but the historical data is now inconsistent • Two basic questions need to be asked before overwriting a dimension attribute: • How important is the value to the end-users analysis needs? • How important is the tracking of history?
Slowly changing dimensions (2) • Introduce a new record for the same dimensional entity in order to reflect its changed state • A new instance of the dimensional key is created which references the new record • In order to is best dealt with by using version digits at the end of the key. These allow up to 100 snapshots of a changing dimensional entity • All these keys need to be created, maintained and managed by someone and tracked in the metadata • The database maintains its consistency and the versions can be said to partition history
Slowly changing dimensions (3) • Use slightly different design of dimension table which has fields for: • original status of dimensional attribute • current status of dimensional attribute • an effective date of change field • This allows the analyst to compare the as-is and as-was states against each other • Only two states can be traced, the current and the original • Some inconsistencies are created in the data as time is not properly partitioned
Special mention - Heterogeneous products Some products have many, many distinguishing attributes and many possible permutations (usually on the basis of some customised offer). This results in immense product dimensions and bad browsing performance • In order to deal with this, fact tables with accompanying product dimensions can be created for each product type - these are known as custom fact tables • Primary core facts on the products types are kept in a core fact table • The core facts are copied in each of the customer fact tables
Pre-aggregated data in SJS • How to deal with aggregates is one of the biggest issues in the design of a DW • Choice of pre-aggregation/on-the fly aggregation has great relvance to data storage and query performance • Two main strategies exist for the creation of aggregates: • the creation of new fact tables for aggregates • the creation of new level fields for aggregates