230 likes | 300 Views
Business Analysis Professional Development Day – Sep 2014 How to understand and deliver requirements to your Business Intelligence Peers Neelam Mohanty. TYPICAL DATA WAREHOUSING NEEDS. Identify when changes are made to the source systems Uniquely identify a source record
E N D
Business Analysis Professional Development Day – Sep 2014How to understand and deliver requirements to your Business Intelligence Peers Neelam Mohanty
TYPICAL DATA WAREHOUSING NEEDS • Identify when changes are made to the source systems • Uniquely identify a source record • Be usable – allow users to drill up , drill down or across • Handle overlapping data from multiple systems • Allocate header level facts to detail line-item data • Handle semantic complexity • Eliminate mismatches when integrating data from multiple sources • Store historical and current data
REQUIREMENTS FROM SOURCE SYSTEMS • Change Data Capture (CDC) • Natural Key • Data Quality • Hierarchies • Grain • History • Conflict Resolution • Semantic Complexity
CHANGE DATA CAPTURE (CDC) • Why the need to identify changes? • Data Warehouses need to identify slowly changing dimensions • Master Data Management (MDM) needs to identify when changes were made to master data attributes • Capture changes as they happen in source systems for improving Data Quality • Data warehouses want to transfer only the relevant changes to the source data since the last transfer (also called incremental loading). Doing a ‘full refresh’ is usually undesirable other than for smaller tables • Capture deletions, edits and inserts
CHANGE DATA CAPTURE (CDC) • Why is CDC important? • Reduces load time, required resources and associated costs • Provides a solution for the continued and accelerating growth in data volumes • Supports lower delivery latencies for real-time data warehousing • Mitigates the risk of failure in long-running ETL jobs. • Four ways of implementing CDC: • Database Log Scraping • Audit Columns • Timed Extracts • Full database “diff compare”
UNIQUE IDENTIFIER/NATURAL KEY • Why is a unique identifier important? • Data Warehouses typically store years of history –the source system may only store the last three months of history and reuse the unique identifier • Data Warehouses typically need to integrate multiple source systems • Knowing a birthdate, gender and ZIP code is enough to identify 87 percent of the people living in the United States • Fuzzy vs Exact matches
DATA QUALITY • Why does it happen? • Bad businesses processes • Inadequate software training • Bad software design and implementation • Consequences on BI • Bad or delayed decisions • Revenue Impacts • Money spent in standardizing or cleansing data downstream • Affects ability to reach customers, cross-sell and up-sell • Efficiency Impacts • Time and resources spent on fixing data quality issues
HIERARCHIES • A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels collectively makes up a dimension or a tree. • Why important? • Usability - Hierarchies are the key to navigating dimensions • Query Performance • Level • A position in a hierarchy • Hierarchies are not always officially part of the source system
THE HOLY GRAIN(L) • Most of the basic control documents that transfer money and goods (or services) from place to place take the parent child form. • An invoice (the parent) has many line items (the children). • Other obvious examples besides invoices include orders, bills of lading, insurance policies, and retail sales tickets. • Parent-child transaction databases commonly have facts of different granularity • Data Warehouses often try to allocate header level facts down to the child line item detail level • This provides the ability to slice and dice and roll up the facts to the child line-item detail
HISTORY • Data Warehouses typically need to do a one-time history load when they go live • Also called initial load or initial/seeding load • The data in the operational systems may often be in multiple formats • It is believed that data quality gets poorer and poorer as the data gets older • Missing data value issues often must be dealt with when loading history
SURVIVORSHIP (CONFLICT RESOLUTION) • Overlapping data from multiple systems • Survivorship (Conflict resolution) – Rules to define how to assemble a single record from two or more records with overlapping attributes that may contain conflicting values • Resolution strategies can be • Instance based – rely on the actual data values • Deciding versus Mediating strategies • Metadata based – rely on freshness of data or reliability of source or which data element was most recently updated
SEMANTIC COMPLEXITY • Different users of a database have different conceptions of what the data represents • Take the example of a table of mobile phone numbers – there are values of all 0’s, nulls, all 9’s and valid digits • What does null mean? • The record is of someone who does not have a mobile phone • The record is of someone who has a mobile phone, and chose not to supply the number • The record is of someone who has a mobile phone, but forgot to supply the number or the number was hard to decipher and recorded as null • What do the all 0’s and all 9’s mean?
QUESTIONS? • Contact me at nmohanty@navmp.com
CHANGE DATA CAPTURE (CDC) • Why the need to identify changes? • Data Warehouses need to identify slowly changing dimensions • Master Data Management (MDM) needs to identify when changes were made to master data attributes • Capture changes as they happen in source systems for improving Data Quality • Data warehouses want to transfer only the relevant changes to the source data since the last transfer (also called incremental loading). Doing a ‘full refresh’ is usually undesirable other than for smaller tables • Capture deletions, edits and inserts
CHANGE DATA CAPTURE (CDC) • Why is CDC important? • Reduces load time, required resources and associated costs • Provides a solution for the continued and accelerating growth in data volumes • Supports lower delivery latencies for real-time data warehousing • Mitigates the risk of failure in long-running ETL jobs. • Four ways of implementing CDC: • Database Log Scraping • Audit Columns • Timed Extracts • Full database “diff compare”
UNIQUE IDENTIFIER/NATURAL KEY • Why is a unique identifier important? • Data Warehouses typically store years of history –the source system may only store the last three months of history and reuse the unique identifier • Data Warehouses typically need to integrate multiple source systems • Knowing a birthdate, gender and ZIP code is enough to identify 87 percent of the people living in the United States • Fuzzy vs Exact matches
SURVIVORSHIP (CONFLICT RESOLUTION) • Overlapping data from multiple systems • Survivorship (Conflict resolution) – Rules to define how to assemble a single record from two or more records with overlapping attributes that may contain conflicting values • Resolution strategies can be • Instance based – rely on the actual data values • Deciding versus Mediating strategies • Metadata based – rely on freshness of data or reliability of source or which data element was most recently updated
HIERARCHIES • A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels collectively makes up a dimension or a tree. • Why important? • Usability - Hierarchies are the key to navigating dimensions • Query Performance • Level • A position in a hierarchy • Hierarchies are not always officially part of the source system
THE HOLY GRAIN(L) • Most of the basic control documents that transfer money and goods (or services) from place to place take the parent child form. • An invoice (the parent) has many line items (the children). • Other obvious examples besides invoices include orders, bills of lading, insurance policies, and retail sales tickets. • Parent-child transaction databases commonly have facts of different granularity • Data Warehouses often try to allocate header level facts down to the child line item detail level • This provides the ability to slice and dice and roll up the facts to the child line-item detail
DATA QUALITY • Why does it happen? • Bad businesses processes • Inadequate software training • Bad software design and implementation • Consequences on BI • Bad or delayed decisions • Revenue Impacts • Money spent in standardizing or cleansing data downstream • Affects ability to reach customers, cross-sell and up-sell • Efficiency Impacts • Time and resources spent on fixing data quality issues
HISTORY • Data Warehouses typically need to do a one-time history load when they go live • Also called initial load or initial/seeding load • The data in the operational systems may often be in multiple formats • It is believed that data quality gets poorer and poorer as the data gets older • Missing data value issues often must be dealt with when loading history
SEMANTIC COMPLEXITY • Different users of a database have different conceptions of what the data represents • Take the example of a table of mobile phone numbers – there are values of all 0’s, nulls, all 9’s and valid digits • What does null mean? • The record is of someone who does not have a mobile phone • The record is of someone who has a mobile phone, and chose not to supply the number • The record is of someone who has a mobile phone, but forgot to supply the number or the number was hard to decipher and recorded as null • What do the all 0’s and all 9’s mean?
QUESTIONS? • Contact me at nmohanty@navmp.com