430 likes | 464 Views
Data Warehouse : Modeling and Design. N. L. Sarda. Outline. Introduction Warehouse structure A case study Dimensional analysis. Introduction. DW is a single, complete and consistent store of data from different sources to understand & analyze the business Contains history data
E N D
Data Warehouse : Modeling and Design N. L. Sarda NLS/IITB/DWH
Outline • Introduction • Warehouse structure • A case study • Dimensional analysis NLS/IITB/DWH
Introduction • DW is a single, complete and consistent store of data from different sources to understand & analyze the business • Contains history data • Warehouse to facilitate browsing, navigating, aggregating and visualization of related data to understand performance, problems, customer preferences, trends, etc. • Warehouse data organized by important business subjects (customer, product, etc…) NLS/IITB/DWH
Warehouse Structure • Organized to facilitate ease of access and aggregation • warehouse structure decomposed into dimensions and facts • Dimensions like ‘independent variables’, represent entities for analysis • Fact represents business data; relates to a set of dimensions • Eg : customer, time, type of account are dimensions, and balances are facts NLS/IITB/DWH
Warehouse Structure... • The complex network of business entities and their relationships as depicted in an operational DB (using, say, ER model) is difficult for navigation and analysis • A ‘2-level’ structure defined by ‘star schema’ is performed where a fact is at the center and dimensions form ‘spokes’ • Data not stored in ‘normalized’ form NLS/IITB/DWH
Star Schema • Contains a fact table and for each dimension one dimension table date, custno, prodno, cityname, ... Time Prod f a c t Cust City NLS/IITB/DWH
Dimensions • Stored as a database table • Contains many descriptive attributes for analysis • Small and slowly changing data • Data often group-able for analysis • Customers by age, occupation, income level • Time by weeks, months, years • Branches as rural, suburban or by size • Thus, dimension data viewable as a hierarchy NLS/IITB/DWH
Facts • Contain business activity data • May be at detailed level or status level; called transaction-oriented or snap-shot oriented • Deciding on granularity : every sale or total sales of a day ? • Often contain numeric attributes for aggregation (additive, semi-additive,…) • Contain dimensional table keys also NLS/IITB/DWH
Snowflake Schema • Hierarchies not captured explicitly in a star schema • Snowflake schema represents hierarchy directly • Saves on storage but requires more join NLS/IITB/DWH
Snowflake Schema • Represent dimensional hierarchy directly by normalizing tables. p r o d T i m e date, custno, prodno, cityname, ... f a c t r e g i o n c i t y c u s t NLS/IITB/DWH
Conformed Dimensions and Facts • Goal is to produce a master suite of conformed dimensions and to standardize facts • conformed dimension means same thing with every fact table (eg., customer, time, geography) • it may contain data brought together from many sources • ensures same units and meaning, same time durations and geographies across marts NLS/IITB/DWH
Financial Services : A Case Study • A bank offers various products/services like saving/checking accounts, mortgage loans, personal loans, TD, credit cards, etc… • Purpose : track various a/c, customer profiles, etc…, for marketing and offering new services • Requirements: • Get end-of-month summary of a/c for last 5 years • Valid snapshot as of yesterday for current month (with full details) • Ability to group a/c in various ways & compare balances • demographic behavior NLS/IITB/DWH
Case Study ... • Each account type has some unique attributes (requiring customized dimension and facts for each) • Old data (a/c & customers ) may be incomplete or even different • The warehouse data may come from multiple sources : • Loan processing system(customer,loan,dues,payment) • Fixed deposit system(customer,TD,…) • Front-office system(customer, account, transaction,..) • Credit-card system customer, transactions, interest,..) NLS/IITB/DWH
Case Study ... • Must plan extraction, correlation, consistent representation,… • Let us consider a possible warehouse design for the indicated requirements • Core fact table : balance in each account, # of transactions, grain : month • Dimensions : a/c, household, branch, product, status, time • A/c and household separate : many accounts per family; household definitions change NLS/IITB/DWH
Case Study ... • Product dimension permits hierarchy and defining specific attributes; separate because it changes • Status : active or not, closed, etc. with reasons • Account contains customer’s data; for historical reasons, customer to accounts relationship not well maintained NLS/IITB/DWH
The household data warehouse account key primary_name secondary_name account_address account_city account_state account_zip date_opened primary_age primary_sex primary_marital branch key branch-name branch_address branch_city branch_state branch_zip branch_type Household Facts account_key household_key branch_key product_key status_key time_key primary_balance transaction_count product key product_description type category household key household_head_name household_address household_city household_state household_zip household_income household_type status key status_description status_reason new_account_flag closed_account_flag time key month year fiscal_quarter NLS/IITB/DWH
Case Study ... • Balance is semi-additive : can not be added across time • Products highly heterogeneous : different attributes characterize different accounts (balance, deposit options, interest rate, over draft limit,..) • Can’t combine all in a dimension as many not applicable to all products NLS/IITB/DWH
Case Study ... • Solution: create many facts, customized for each product, and one core fact with a product dimension having common attributes; leads to 100% replication, but facilitates clarifications, browsing, etc. and avoids join of customized and core facts • When many facts are to be stored together go for snapshots (eg. monthly) NLS/IITB/DWH
Case Study ... • Transaction-grained facts usually have a single fact (eg. amount) that is directly involved in the transaction; we need a transaction dimension to represent these amounts • In transaction grained fact table, we do not need customized fact tables per product; instead we create customized dimension tables NLS/IITB/DWH
Data Warehouse Life Cycle Business Requirement Definition Technical Architecture Design Product Selection & Installation Project planning Deploy- ment Main- tenence & Growth Dimensional Modeling Physical Design Data Staging Design & Development End-User Application Specification End-User Application Development Project Management NLS/IITB/DWH
Life Cycle : summary • Project planning • Business requirements definition • Data track • Dimensional modeling • Physical design • Data staging design and development • Technology track • Technical architectural design • Product selection and and installation NLS/IITB/DWH
Life Cycle... • Application track • End user application specification • End user application development • Deployment • Maintenance and growth • Project management NLS/IITB/DWH
Collecting Requirements... • Interviews/write-ups • Requirements findings document • Project overview • review of business objectives • analytic and information requirements • preliminary source systems analysis • Preliminary success criteria • Prepare and publish the requirements NLS/IITB/DWH
Collecting Data about Existing Systems • Understanding the candidate data sources • Detailed criteria for selecting the data sources • Data accessibility • Longevity of the feed • Data accuracy • Project scheduling • Customer matching and house-holding • Browsing and data content • Mapping data from source to target NLS/IITB/DWH
Designing the Data Warehouse / Data Marts • Identifying marts and dimensions • identify marts based on facts likely to be used together, as a mart is a kind of subject area or application (divide-and-conquer strategy) • often based on a single business process or a single source • 10 to 30 marts common for a large organization • build a matrix of marts versus dimensions NLS/IITB/DWH
Designing a Fact • Define fact grain based on the basic business facts stored in legacy systems • Choose dimensions and match them with granularity of facts • Combine as many facts as possible with the context of defined granularity NLS/IITB/DWH
Detailed Design Tips • Names for dimensions and attributes should be chosen carefully to refer to corresponding business entities • An attribute (in a dimension) is not replicated, but a fact may be present in many fact tables • If a dimension occurs multiple times (eg, time), it is playing multiple roles; name them uniquely • Every fact should have a default aggregation rule so that it is not aggregated wrongly NLS/IITB/DWH
Dimension Attributes • The quality of the data warehouse is measured by the quality of the dimension attributes • The user interface responses and final reports are restricted to the precise contents of the dimension table attributes • Properties • Verbose, descriptive, complete • Quality assured, indexed • Equally available, documented NLS/IITB/DWH
Time Dimension • Every data warehouse fact table is a time series of some observations • We always seems to have one or more time dimensions in our fact table designs • Provides useful hierarchies : week, month, quarter, year, etc • Represents calendar with many useful attributes like day of week, day of month, week#, day#, quarter, weekday-flag, last-day-of-month-flag, holiday flag, etc. NLS/IITB/DWH
Slowly Changing Dimensions • The product key or customer key does not change, but the description of the product or customer does • The data warehouse has three options for above changes • Overwrite the dimension record with the new values, thereby losing history • It is used whenever the old value of the attribute has no significance • The corrections of any error falls into this category NLS/IITB/DWH
Slowly Changing Dimensions... • Create a new additional dimension record using a new value of the surrogate key • is primary technique for accurately tracking a change in an attribute within a dimension • requires use of a surrogate key • a slowly changing dimension is used when a true physical change to the dimension entity has taken place • Create an “old” field in the dimension record to store the immediate previous attribute value • It is used when a change is tentative NLS/IITB/DWH
Time Stamping the Changes • The design of slowly changing dimension may be established by adding begin and end time stamps and a transaction description in each instance of a dimension record • This design allows very precise time slicing of the dimension by itself NLS/IITB/DWH
Large Dimensions • Data warehouses that store extremely granular data may require some extremely large dimensions • To support large dimensions we must choose the indexing technologies and data design approaches that: • supports rapid browsing of the unconditional dimension, especially for low cardinality attributes • Supports efficient browsing of cross-constrained values in the dimension table • Find and suppress duplicate entries in the dimension NLS/IITB/DWH
Foreign Key, Primary Key, Surrogate Key • All dimensional tables have single keys, which, by definition, are primary keys • All data warehouse keys must be meaningless surrogate keys; you must not use the original production keys • A four byte integer makes a good surrogate key • Surrogate date keys • Avoid smart keys • Avoid production keys NLS/IITB/DWH
Heterogeneous Product Schemas • Multiple fact tables are needed when a business has heterogeneous products • The global view needs a single core fact table crossing all lines of business, whereas local view focuses on specific product • There are many attributes and facts which apply only to a specific product; a single fact table is not feasible • create customized fact and (product) dimension table for each product, and build a core fact table with attributes that make sense across all lines of business; this allows to create a single portfolio (of products) for each customer NLS/IITB/DWH
Transaction Schema • Every data mart needs two separate models • Transaction version • Periodic snapshot version • ‘rolling’ snapshot containing averages across time • Snapshots allow us to quickly measure the status of the enterprise • The Transaction schema • low level transactions in the organization makes for a good dimensional frame work • The fact record for an individual transaction frequently contains only a single value NLS/IITB/DWH
Transaction Schema.. • The transaction-based WH commonly used in • Time of day analysis • Queue analysis • Fraud detection • Basket analysis • Current status NLS/IITB/DWH
Factless Fact Tables • useful to describe events and their coverage • an event fact table records occurrence of an event; has only flag and dimension keys (eg, student attendance) • coverage fact table is frequently needed when a primary fact table in a dimensional data warehouse is sparse; eg, primary fact table will not provide items which were on promotion but did not sale; the coverage table, containing only dimension keys, lists all items on sale NLS/IITB/DWH
Facts of Different Granularity • The dimensional model gains power as the individual fact records become more and more atomic • At the lowest level of individual transactions, the design is most powerful because • More of the descriptive attributes have single values • The design withstands surprise in the form of new facts, new dimensions, or new attributes within existing dimensions • More expressiveness at the lowest levels of granularity NLS/IITB/DWH
Metadata Catalog • It is an integral part of the overall architecture • It contains information that describes the warehouse and plays an active role in its creation, use, and maintenance • Contains source system metadata (data and processes), data staging metadata (dimensions, transformations, aggregations), DBMS metadata (tables, indexes, stored procedures), and front-room metadata (users, applications) NLS/IITB/DWH
Technical Architecture • Metadata driven • Metadata provides flexibility by buffering the various components of the system from each other • The metadata catalog provides parameters and information that allow the application to perform their task NLS/IITB/DWH
Conclusion • Building a corporate-wide data warehouse is a challenging task • A systematic methodology essential • Plan the architecture globally but build it incrementally • Keep user requirements at the core of all development activities NLS/IITB/DWH