1 / 23

TYPICAL DATA WAREHOUSING NEEDS

Business Analysis Professional Development Day – Sep 2014 How to understand and deliver requirements to your Business Intelligence Peers Neelam Mohanty. TYPICAL DATA WAREHOUSING NEEDS. Identify when changes are made to the source systems Uniquely identify a source record

Download Presentation

TYPICAL DATA WAREHOUSING NEEDS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Business Analysis Professional Development Day – Sep 2014How to understand and deliver requirements to your Business Intelligence Peers Neelam Mohanty

  2. TYPICAL DATA WAREHOUSING NEEDS • Identify when changes are made to the source systems • Uniquely identify a source record • Be usable – allow users to drill up , drill down or across • Handle overlapping data from multiple systems • Allocate header level facts to detail line-item data • Handle semantic complexity • Eliminate mismatches when integrating data from multiple sources • Store historical and current data

  3. REQUIREMENTS FROM SOURCE SYSTEMS • Change Data Capture (CDC) • Natural Key • Data Quality • Hierarchies • Grain • History • Conflict Resolution • Semantic Complexity

  4. CHANGE DATA CAPTURE (CDC) • Why the need to identify changes? • Data Warehouses need to identify slowly changing dimensions • Master Data Management (MDM) needs to identify when changes were made to master data attributes • Capture changes as they happen in source systems for improving Data Quality • Data warehouses want to transfer only the relevant changes to the source data since the last transfer (also called incremental loading). Doing a ‘full refresh’ is usually undesirable other than for smaller tables • Capture deletions, edits and inserts

  5. CHANGE DATA CAPTURE (CDC) • Why is CDC important? • Reduces load time, required resources and associated costs • Provides a solution for the continued and accelerating growth in data volumes • Supports lower delivery latencies for real-time data warehousing • Mitigates the risk of failure in long-running ETL jobs. • Four ways of implementing CDC: • Database Log Scraping • Audit Columns • Timed Extracts • Full database “diff compare”

  6. UNIQUE IDENTIFIER/NATURAL KEY • Why is a unique identifier important? • Data Warehouses typically store years of history –the source system may only store the last three months of history and reuse the unique identifier • Data Warehouses typically need to integrate multiple source systems • Knowing a birthdate, gender and ZIP code is enough to identify 87 percent of the people living in the United States • Fuzzy vs Exact matches

  7. DATA QUALITY • Why does it happen? • Bad businesses processes • Inadequate software training • Bad software design and implementation • Consequences on BI • Bad or delayed decisions • Revenue Impacts • Money spent in standardizing or cleansing data downstream • Affects ability to reach customers, cross-sell and up-sell • Efficiency Impacts • Time and resources spent on fixing data quality issues

  8. HIERARCHIES • A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels collectively makes up a dimension or a tree. • Why important? • Usability - Hierarchies are the key to navigating dimensions • Query Performance • Level • A position in a hierarchy • Hierarchies are not always officially part of the source system

  9. THE HOLY GRAIN(L) • Most of the basic control documents that transfer money and goods (or services) from place to place take the parent child form. • An invoice (the parent) has many line items (the children). • Other obvious examples besides invoices include orders, bills of lading, insurance policies, and retail sales tickets. • Parent-child transaction databases commonly have facts of different granularity • Data Warehouses often try to allocate header level facts down to the child line item detail level • This provides the ability to slice and dice and roll up the facts to the child line-item detail

  10. HISTORY • Data Warehouses typically need to do a one-time history load when they go live • Also called initial load or initial/seeding load • The data in the operational systems may often be in multiple formats • It is believed that data quality gets poorer and poorer as the data gets older • Missing data value issues often must be dealt with when loading history

  11. SURVIVORSHIP (CONFLICT RESOLUTION) • Overlapping data from multiple systems • Survivorship (Conflict resolution) – Rules to define how to assemble a single record from two or more records with overlapping attributes that may contain conflicting values • Resolution strategies can be • Instance based – rely on the actual data values • Deciding versus Mediating strategies • Metadata based – rely on freshness of data or reliability of source or which data element was most recently updated

  12. SEMANTIC COMPLEXITY • Different users of a database have different conceptions of what the data represents • Take the example of a table of mobile phone numbers – there are values of all 0’s, nulls, all 9’s and valid digits • What does null mean? • The record is of someone who does not have a mobile phone • The record is of someone who has a mobile phone, and chose not to supply the number • The record is of someone who has a mobile phone, but forgot to supply the number or the number was hard to decipher and recorded as null • What do the all 0’s and all 9’s mean?

  13. QUESTIONS? • Contact me at nmohanty@navmp.com

  14. CHANGE DATA CAPTURE (CDC) • Why the need to identify changes? • Data Warehouses need to identify slowly changing dimensions • Master Data Management (MDM) needs to identify when changes were made to master data attributes • Capture changes as they happen in source systems for improving Data Quality • Data warehouses want to transfer only the relevant changes to the source data since the last transfer (also called incremental loading). Doing a ‘full refresh’ is usually undesirable other than for smaller tables • Capture deletions, edits and inserts

  15. CHANGE DATA CAPTURE (CDC) • Why is CDC important? • Reduces load time, required resources and associated costs • Provides a solution for the continued and accelerating growth in data volumes • Supports lower delivery latencies for real-time data warehousing • Mitigates the risk of failure in long-running ETL jobs. • Four ways of implementing CDC: • Database Log Scraping • Audit Columns • Timed Extracts • Full database “diff compare”

  16. UNIQUE IDENTIFIER/NATURAL KEY • Why is a unique identifier important? • Data Warehouses typically store years of history –the source system may only store the last three months of history and reuse the unique identifier • Data Warehouses typically need to integrate multiple source systems • Knowing a birthdate, gender and ZIP code is enough to identify 87 percent of the people living in the United States • Fuzzy vs Exact matches

  17. SURVIVORSHIP (CONFLICT RESOLUTION) • Overlapping data from multiple systems • Survivorship (Conflict resolution) – Rules to define how to assemble a single record from two or more records with overlapping attributes that may contain conflicting values • Resolution strategies can be • Instance based – rely on the actual data values • Deciding versus Mediating strategies • Metadata based – rely on freshness of data or reliability of source or which data element was most recently updated

  18. HIERARCHIES • A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels collectively makes up a dimension or a tree. • Why important? • Usability - Hierarchies are the key to navigating dimensions • Query Performance • Level • A position in a hierarchy • Hierarchies are not always officially part of the source system

  19. THE HOLY GRAIN(L) • Most of the basic control documents that transfer money and goods (or services) from place to place take the parent child form. • An invoice (the parent) has many line items (the children). • Other obvious examples besides invoices include orders, bills of lading, insurance policies, and retail sales tickets. • Parent-child transaction databases commonly have facts of different granularity • Data Warehouses often try to allocate header level facts down to the child line item detail level • This provides the ability to slice and dice and roll up the facts to the child line-item detail

  20. DATA QUALITY • Why does it happen? • Bad businesses processes • Inadequate software training • Bad software design and implementation • Consequences on BI • Bad or delayed decisions • Revenue Impacts • Money spent in standardizing or cleansing data downstream • Affects ability to reach customers, cross-sell and up-sell • Efficiency Impacts • Time and resources spent on fixing data quality issues

  21. HISTORY • Data Warehouses typically need to do a one-time history load when they go live • Also called initial load or initial/seeding load • The data in the operational systems may often be in multiple formats • It is believed that data quality gets poorer and poorer as the data gets older • Missing data value issues often must be dealt with when loading history

  22. SEMANTIC COMPLEXITY • Different users of a database have different conceptions of what the data represents • Take the example of a table of mobile phone numbers – there are values of all 0’s, nulls, all 9’s and valid digits • What does null mean? • The record is of someone who does not have a mobile phone • The record is of someone who has a mobile phone, and chose not to supply the number • The record is of someone who has a mobile phone, but forgot to supply the number or the number was hard to decipher and recorded as null • What do the all 0’s and all 9’s mean?

  23. QUESTIONS? • Contact me at nmohanty@navmp.com

More Related