1 / 42

Data Warehouse : Modeling and Design

Data Warehouse : Modeling and Design. N. L. Sarda. Outline. Introduction Warehouse structure A case study Dimensional analysis. Introduction. DW is a single, complete and consistent store of data from different sources to understand & analyze the business Contains history data

johannar
Download Presentation

Data Warehouse : Modeling and Design

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Warehouse : Modeling and Design N. L. Sarda NLS/IITB/DWH

  2. Outline • Introduction • Warehouse structure • A case study • Dimensional analysis NLS/IITB/DWH

  3. Introduction • DW is a single, complete and consistent store of data from different sources to understand & analyze the business • Contains history data • Warehouse to facilitate browsing, navigating, aggregating and visualization of related data to understand performance, problems, customer preferences, trends, etc. • Warehouse data organized by important business subjects (customer, product, etc…) NLS/IITB/DWH

  4. Warehouse Structure • Organized to facilitate ease of access and aggregation • warehouse structure decomposed into dimensions and facts • Dimensions like ‘independent variables’, represent entities for analysis • Fact represents business data; relates to a set of dimensions • Eg : customer, time, type of account are dimensions, and balances are facts NLS/IITB/DWH

  5. Warehouse Structure... • The complex network of business entities and their relationships as depicted in an operational DB (using, say, ER model) is difficult for navigation and analysis • A ‘2-level’ structure defined by ‘star schema’ is performed where a fact is at the center and dimensions form ‘spokes’ • Data not stored in ‘normalized’ form NLS/IITB/DWH

  6. Star Schema • Contains a fact table and for each dimension one dimension table date, custno, prodno, cityname, ... Time Prod f a c t Cust City NLS/IITB/DWH

  7. Dimensions • Stored as a database table • Contains many descriptive attributes for analysis • Small and slowly changing data • Data often group-able for analysis • Customers by age, occupation, income level • Time by weeks, months, years • Branches as rural, suburban or by size • Thus, dimension data viewable as a hierarchy NLS/IITB/DWH

  8. Facts • Contain business activity data • May be at detailed level or status level; called transaction-oriented or snap-shot oriented • Deciding on granularity : every sale or total sales of a day ? • Often contain numeric attributes for aggregation (additive, semi-additive,…) • Contain dimensional table keys also NLS/IITB/DWH

  9. Snowflake Schema • Hierarchies not captured explicitly in a star schema • Snowflake schema represents hierarchy directly • Saves on storage but requires more join NLS/IITB/DWH

  10. Snowflake Schema • Represent dimensional hierarchy directly by normalizing tables. p r o d T i m e date, custno, prodno, cityname, ... f a c t r e g i o n c i t y c u s t NLS/IITB/DWH

  11. Conformed Dimensions and Facts • Goal is to produce a master suite of conformed dimensions and to standardize facts • conformed dimension means same thing with every fact table (eg., customer, time, geography) • it may contain data brought together from many sources • ensures same units and meaning, same time durations and geographies across marts NLS/IITB/DWH

  12. Financial Services : A Case Study • A bank offers various products/services like saving/checking accounts, mortgage loans, personal loans, TD, credit cards, etc… • Purpose : track various a/c, customer profiles, etc…, for marketing and offering new services • Requirements: • Get end-of-month summary of a/c for last 5 years • Valid snapshot as of yesterday for current month (with full details) • Ability to group a/c in various ways & compare balances • demographic behavior NLS/IITB/DWH

  13. Case Study ... • Each account type has some unique attributes (requiring customized dimension and facts for each) • Old data (a/c & customers ) may be incomplete or even different • The warehouse data may come from multiple sources : • Loan processing system(customer,loan,dues,payment) • Fixed deposit system(customer,TD,…) • Front-office system(customer, account, transaction,..) • Credit-card system customer, transactions, interest,..) NLS/IITB/DWH

  14. Case Study ... • Must plan extraction, correlation, consistent representation,… • Let us consider a possible warehouse design for the indicated requirements • Core fact table : balance in each account, # of transactions, grain : month • Dimensions : a/c, household, branch, product, status, time • A/c and household separate : many accounts per family; household definitions change NLS/IITB/DWH

  15. Case Study ... • Product dimension permits hierarchy and defining specific attributes; separate because it changes • Status : active or not, closed, etc. with reasons • Account contains customer’s data; for historical reasons, customer to accounts relationship not well maintained NLS/IITB/DWH

  16. The household data warehouse account key primary_name secondary_name account_address account_city account_state account_zip date_opened primary_age primary_sex primary_marital branch key branch-name branch_address branch_city branch_state branch_zip branch_type Household Facts account_key household_key branch_key product_key status_key time_key primary_balance transaction_count product key product_description type category household key household_head_name household_address household_city household_state household_zip household_income household_type status key status_description status_reason new_account_flag closed_account_flag time key month year fiscal_quarter NLS/IITB/DWH

  17. Case Study ... • Balance is semi-additive : can not be added across time • Products highly heterogeneous : different attributes characterize different accounts (balance, deposit options, interest rate, over draft limit,..) • Can’t combine all in a dimension as many not applicable to all products NLS/IITB/DWH

  18. Case Study ... • Solution: create many facts, customized for each product, and one core fact with a product dimension having common attributes; leads to 100% replication, but facilitates clarifications, browsing, etc. and avoids join of customized and core facts • When many facts are to be stored together go for snapshots (eg. monthly) NLS/IITB/DWH

  19. Case Study ... • Transaction-grained facts usually have a single fact (eg. amount) that is directly involved in the transaction; we need a transaction dimension to represent these amounts • In transaction grained fact table, we do not need customized fact tables per product; instead we create customized dimension tables NLS/IITB/DWH

  20. Data Warehouse Life Cycle Business Requirement Definition Technical Architecture Design Product Selection & Installation Project planning Deploy- ment Main- tenence & Growth Dimensional Modeling Physical Design Data Staging Design & Development End-User Application Specification End-User Application Development Project Management NLS/IITB/DWH

  21. Life Cycle : summary • Project planning • Business requirements definition • Data track • Dimensional modeling • Physical design • Data staging design and development • Technology track • Technical architectural design • Product selection and and installation NLS/IITB/DWH

  22. Life Cycle... • Application track • End user application specification • End user application development • Deployment • Maintenance and growth • Project management NLS/IITB/DWH

  23. Collecting Requirements... • Interviews/write-ups • Requirements findings document • Project overview • review of business objectives • analytic and information requirements • preliminary source systems analysis • Preliminary success criteria • Prepare and publish the requirements NLS/IITB/DWH

  24. Collecting Data about Existing Systems • Understanding the candidate data sources • Detailed criteria for selecting the data sources • Data accessibility • Longevity of the feed • Data accuracy • Project scheduling • Customer matching and house-holding • Browsing and data content • Mapping data from source to target NLS/IITB/DWH

  25. Designing the Data Warehouse / Data Marts • Identifying marts and dimensions • identify marts based on facts likely to be used together, as a mart is a kind of subject area or application (divide-and-conquer strategy) • often based on a single business process or a single source • 10 to 30 marts common for a large organization • build a matrix of marts versus dimensions NLS/IITB/DWH

  26. Designing a Fact • Define fact grain based on the basic business facts stored in legacy systems • Choose dimensions and match them with granularity of facts • Combine as many facts as possible with the context of defined granularity NLS/IITB/DWH

  27. Detailed Design Tips • Names for dimensions and attributes should be chosen carefully to refer to corresponding business entities • An attribute (in a dimension) is not replicated, but a fact may be present in many fact tables • If a dimension occurs multiple times (eg, time), it is playing multiple roles; name them uniquely • Every fact should have a default aggregation rule so that it is not aggregated wrongly NLS/IITB/DWH

  28. Dimension Attributes • The quality of the data warehouse is measured by the quality of the dimension attributes • The user interface responses and final reports are restricted to the precise contents of the dimension table attributes • Properties • Verbose, descriptive, complete • Quality assured, indexed • Equally available, documented NLS/IITB/DWH

  29. Time Dimension • Every data warehouse fact table is a time series of some observations • We always seems to have one or more time dimensions in our fact table designs • Provides useful hierarchies : week, month, quarter, year, etc • Represents calendar with many useful attributes like day of week, day of month, week#, day#, quarter, weekday-flag, last-day-of-month-flag, holiday flag, etc. NLS/IITB/DWH

  30. Slowly Changing Dimensions • The product key or customer key does not change, but the description of the product or customer does • The data warehouse has three options for above changes • Overwrite the dimension record with the new values, thereby losing history • It is used whenever the old value of the attribute has no significance • The corrections of any error falls into this category NLS/IITB/DWH

  31. Slowly Changing Dimensions... • Create a new additional dimension record using a new value of the surrogate key • is primary technique for accurately tracking a change in an attribute within a dimension • requires use of a surrogate key • a slowly changing dimension is used when a true physical change to the dimension entity has taken place • Create an “old” field in the dimension record to store the immediate previous attribute value • It is used when a change is tentative NLS/IITB/DWH

  32. Time Stamping the Changes • The design of slowly changing dimension may be established by adding begin and end time stamps and a transaction description in each instance of a dimension record • This design allows very precise time slicing of the dimension by itself NLS/IITB/DWH

  33. Large Dimensions • Data warehouses that store extremely granular data may require some extremely large dimensions • To support large dimensions we must choose the indexing technologies and data design approaches that: • supports rapid browsing of the unconditional dimension, especially for low cardinality attributes • Supports efficient browsing of cross-constrained values in the dimension table • Find and suppress duplicate entries in the dimension NLS/IITB/DWH

  34. Foreign Key, Primary Key, Surrogate Key • All dimensional tables have single keys, which, by definition, are primary keys • All data warehouse keys must be meaningless surrogate keys; you must not use the original production keys • A four byte integer makes a good surrogate key • Surrogate date keys • Avoid smart keys • Avoid production keys NLS/IITB/DWH

  35. Heterogeneous Product Schemas • Multiple fact tables are needed when a business has heterogeneous products • The global view needs a single core fact table crossing all lines of business, whereas local view focuses on specific product • There are many attributes and facts which apply only to a specific product; a single fact table is not feasible • create customized fact and (product) dimension table for each product, and build a core fact table with attributes that make sense across all lines of business; this allows to create a single portfolio (of products) for each customer NLS/IITB/DWH

  36. Transaction Schema • Every data mart needs two separate models • Transaction version • Periodic snapshot version • ‘rolling’ snapshot containing averages across time • Snapshots allow us to quickly measure the status of the enterprise • The Transaction schema • low level transactions in the organization makes for a good dimensional frame work • The fact record for an individual transaction frequently contains only a single value NLS/IITB/DWH

  37. Transaction Schema.. • The transaction-based WH commonly used in • Time of day analysis • Queue analysis • Fraud detection • Basket analysis • Current status NLS/IITB/DWH

  38. Factless Fact Tables • useful to describe events and their coverage • an event fact table records occurrence of an event; has only flag and dimension keys (eg, student attendance) • coverage fact table is frequently needed when a primary fact table in a dimensional data warehouse is sparse; eg, primary fact table will not provide items which were on promotion but did not sale; the coverage table, containing only dimension keys, lists all items on sale NLS/IITB/DWH

  39. Facts of Different Granularity • The dimensional model gains power as the individual fact records become more and more atomic • At the lowest level of individual transactions, the design is most powerful because • More of the descriptive attributes have single values • The design withstands surprise in the form of new facts, new dimensions, or new attributes within existing dimensions • More expressiveness at the lowest levels of granularity NLS/IITB/DWH

  40. Metadata Catalog • It is an integral part of the overall architecture • It contains information that describes the warehouse and plays an active role in its creation, use, and maintenance • Contains source system metadata (data and processes), data staging metadata (dimensions, transformations, aggregations), DBMS metadata (tables, indexes, stored procedures), and front-room metadata (users, applications) NLS/IITB/DWH

  41. Technical Architecture • Metadata driven • Metadata provides flexibility by buffering the various components of the system from each other • The metadata catalog provides parameters and information that allow the application to perform their task NLS/IITB/DWH

  42. Conclusion • Building a corporate-wide data warehouse is a challenging task • A systematic methodology essential • Plan the architecture globally but build it incrementally • Keep user requirements at the core of all development activities NLS/IITB/DWH

More Related