Data Warehousing 3. ETL and Data Quality. Ponniah, Chapters 12 and 13
Objectives • Various aspects of data extraction, transformation and loading • Data extraction techniques and challenges • Data integration and consolidation • The data load function • Why ETL is critical, time-consuming and arduous • Data quality challenges • The benefits of data quality
The bridge • ETL is the bridge between operational data and your data warehouse • A set of processes that: • Extract data from operational systems • Transform and cleanse data • Index the data • Summarise the data • Load data into the data warehouse
Continued... • Make any changes to source data required by the data warehouse • Restructure keys • Maintain metadata • Refresh the warehouse
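The two slides above describe ETL as a chain of discrete steps. A minimal sketch of that chain in Python follows; the stage functions and the in-memory list-of-dicts representation are purely illustrative, not a prescribed API.

```python
# Minimal ETL pipeline sketch. The stage functions and the in-memory
# list-of-dicts representation are illustrative, not a real warehouse API.

def extract(source_rows):
    # Pull raw records from an operational source (here, already in memory).
    return list(source_rows)

def transform(rows):
    # Cleanse and standardise: trim text fields, normalise the status code.
    cleaned = []
    for row in rows:
        cleaned.append({
            "customer_id": row["customer_id"],
            "name": row["name"].strip().title(),
            "status": row["status"].upper(),
        })
    return cleaned

def load(rows, warehouse_table):
    # Append the prepared load image to the target table (a list stand-in).
    warehouse_table.extend(rows)

if __name__ == "__main__":
    operational_data = [{"customer_id": 1, "name": "  smith, j ", "status": "a"}]
    dim_customer = []
    load(transform(extract(operational_data)), dim_customer)
    print(dim_customer)
```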
Challenges • Source systems are diverse and disparate • Need to deal with systems on different platforms and operating systems • Older legacy systems running obsolete DBMS environments • Historical data – changes in data are not preserved in source systems, yet historical data is critical in the DW • Quality of data is dubious in legacy systems that have evolved over time
Challenges cont. • Source systems continue to change over time due to changes in the business environment – ETL functions need to be modified accordingly • Gross lack of consistency among systems is prevalent • Data may be redundant / replicated across systems • Inconsistent data creates questions over how to resolve mismatches • Source systems may represent data in formats that are cryptic or ambiguous
ETL steps • Determine target data – keep in mind, this may not be all operational data • Determine sources • Prepare data mapping for target elements from sources • Establish complete extraction rules • Determine transformation and cleansing rules • Plan aggregate tables • Organise staging area and test tools • Write procedures for loading
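Step 3 above (prepare data mapping for target elements from sources) is often captured as a simple source-to-target mapping document. Below is a sketch of one way to represent and inspect such a mapping; all table and column names are invented for illustration.

```python
# Hypothetical source-to-target mapping: each target column records where it
# comes from and which transformation rule applies to it.
MAPPING = {
    "customer_key":  {"source": "crm.cust_id",   "rule": "surrogate key lookup"},
    "customer_name": {"source": "crm.cust_name", "rule": "trim and title-case"},
    "region_code":   {"source": "orders.regn",   "rule": "decode to standard region list"},
}

for target, spec in MAPPING.items():
    print(f"{target:15s} <- {spec['source']:15s} ({spec['rule']})")
```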
Extraction • Identify source applications and structures • For each data source define whether extraction is tool-based or manual • For each data source decide frequency of extraction – daily, weekly etc. • Time window – how long do we have to extract while minimising impact on the operational environment • Sequencing – in what order will extraction happen; this could be critical • Exception handling – how will we handle errors or failures
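One common way to make these per-source decisions explicit is a small extraction plan reviewed alongside the mapping document. A sketch is shown below; the source systems, windows and error policies are assumptions, not real systems.

```python
# Illustrative extraction plan: one entry per source system, recording the
# decisions listed above (technique, frequency, window, sequence, error policy).
EXTRACTION_PLAN = [
    {"source": "billing_db",  "technique": "transaction log capture",
     "frequency": "daily",  "window": "01:00-03:00", "sequence": 1,
     "on_error": "retry twice, then alert"},
    {"source": "legacy_flat", "technique": "file comparison",
     "frequency": "weekly", "window": "Sun 02:00-06:00", "sequence": 2,
     "on_error": "skip file, log and continue"},
]

# Run extractions in their agreed sequence.
for job in sorted(EXTRACTION_PLAN, key=lambda j: j["sequence"]):
    print(f"{job['sequence']}. {job['source']}: {job['technique']} ({job['frequency']})")
```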
Techniques • Static data capture • Flexible • No revision of legacy system required • Can use vendor products (SQL) • Can be used on file-oriented systems • Performance of source systems probably not affected
Techniques • Capture through transaction logs • Less flexibility • Performance of source systems probably not affected • Can't be used on file-oriented systems (because they don't have transaction logs) • Vendor products available
Techniques • Triggers • Can be customised to suit • Performance of source systems will be somewhat affected • Can't be used on legacy and file-based systems • No revision to applications, but some internal DB development required • Vendor products available
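A minimal sketch of trigger-based capture, using SQLite's trigger support via Python's standard sqlite3 module. The table and trigger names are invented, and a production DBMS would use its own trigger dialect; the point is only that each insert/update also lands in a change-capture table the ETL job can read later.

```python
import sqlite3

# Trigger-based change capture sketch: every insert/update on the source table
# also writes a row into a change-capture table for later extraction.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE customer_changes (id INTEGER, name TEXT, status TEXT,
                                   change_type TEXT,
                                   captured_at TEXT DEFAULT CURRENT_TIMESTAMP);

    CREATE TRIGGER trg_customer_ins AFTER INSERT ON customer
    BEGIN
        INSERT INTO customer_changes (id, name, status, change_type)
        VALUES (NEW.id, NEW.name, NEW.status, 'I');
    END;

    CREATE TRIGGER trg_customer_upd AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_changes (id, name, status, change_type)
        VALUES (NEW.id, NEW.name, NEW.status, 'U');
    END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Smith', 'A')")
conn.execute("UPDATE customer SET status = 'B' WHERE id = 1")
print(conn.execute("SELECT * FROM customer_changes").fetchall())
```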
Techniques • Source application capture • Requires major revisions to applications • Performance affected • Can be used on most legacy and file-based systems • High cost – the cost of revising application code • High risk
Techniques • A couple of other, less commonly used techniques • Capture based on date/time stamp • File comparison
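A sketch of the date/time-stamp technique mentioned above: only rows changed since the previous extraction run are picked up. The column name, sample data and high-water-mark variable are assumptions for illustration.

```python
from datetime import datetime

# Date/time-stamp capture sketch: each source row carries a last_updated
# column, and the ETL job remembers the high-water mark of its previous run.
source_rows = [
    {"id": 1, "name": "Smith", "last_updated": datetime(2024, 1, 10, 9, 0)},
    {"id": 2, "name": "Jones", "last_updated": datetime(2024, 1, 12, 14, 30)},
]

last_run = datetime(2024, 1, 11, 0, 0)   # high-water mark from the previous load

changed = [row for row in source_rows if row["last_updated"] > last_run]
print(changed)                            # only the Jones row is extracted

# The new high-water mark is the latest timestamp seen in this batch.
new_high_water_mark = max(row["last_updated"] for row in changed)
```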
Transformation • Basic tasks • Selection • Splitting / joining • Conversion • Summarisation • Enrichment
Transformation Types • A combination of the above tasks will be required – now consider some specific types of transformations • Format revisions • Decoding of fields • Calculated and derived values • Splitting single fields • Merging information • Character set conversion • Date/time conversions • Summarisation • Key restructuring
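A sketch applying a few of these transformation types to one record: a format revision, decoding of a cryptic field, a calculated value, splitting a single field, and a date conversion. All field names and code values are invented for illustration.

```python
from datetime import datetime

# Illustrative transformation of one source record, covering several of the
# transformation types listed above. Field names and codes are hypothetical.
GENDER_DECODE = {"1": "Male", "2": "Female", "0": "Unknown"}   # decoding of fields

source = {
    "cust_name": "SMITH, JOHN",
    "gender_cd": "1",
    "unit_price": 19.99,
    "quantity": 3,
    "order_dt": "20240115",        # legacy YYYYMMDD string
}

last, first = [p.strip() for p in source["cust_name"].split(",")]  # splitting a single field

target = {
    "first_name": first.title(),                                   # format revision
    "last_name": last.title(),
    "gender": GENDER_DECODE.get(source["gender_cd"], "Unknown"),
    "order_amount": source["unit_price"] * source["quantity"],     # calculated value
    "order_date": datetime.strptime(source["order_dt"], "%Y%m%d").date(),  # date conversion
}
print(target)
```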
Data Loading • Transformation processes end with a "load image" being created. This next stage takes the prepared data and applies it to the data warehouse. • Three load (refresh) types • Initial load • Incremental load • Full refresh
Load Modes • Initial load – either creates or inserts data into pre-defined warehouse tables • Append mode – existing data is maintained, new data is appended to the tables • Destructive merge – if the PK of an incoming record matches an existing record, perform an UPDATE of the row • Constructive merge – if the PK matches, leave the existing row but mark the new row as superseding the old row
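A sketch contrasting append, destructive merge and constructive merge on a small in-memory target table. The list-of-dicts table and the "current" flag are stand-ins for illustration, not the actual warehouse load mechanism.

```python
# Load-mode sketch on an in-memory target table (list of rows with a primary
# key "pk" and a "current" flag). Layout is illustrative only.

def append_mode(target, incoming):
    # Existing rows are kept; new rows are simply added.
    target.extend({**r, "current": True} for r in incoming)

def destructive_merge(target, incoming):
    # Matching primary key -> UPDATE (overwrite) the existing row in place.
    by_pk = {row["pk"]: row for row in target}
    for new in incoming:
        if new["pk"] in by_pk:
            by_pk[new["pk"]].update(new)
        else:
            target.append({**new, "current": True})

def constructive_merge(target, incoming):
    # Matching primary key -> keep the old row but mark it as superseded,
    # then add the new row as the current version.
    for new in incoming:
        for old in target:
            if old["pk"] == new["pk"] and old["current"]:
                old["current"] = False
        target.append({**new, "current": True})

warehouse = []
append_mode(warehouse, [{"pk": 100, "status": "A"}])
constructive_merge(warehouse, [{"pk": 100, "status": "B"}])
print(warehouse)   # old row now current=False, new row current=True
```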
Data Quality – The Key to Success • Why is it critical? • What is data quality? • Types of problems and challenges • Some practical tips
Why is it critical? • Boosts confidence in decision making • Enables better customer service • Increases opportunity to add value • Reduces risk • Reduces costs – especially with marketing campaigns • Enhances strategic decision making • Improves productivity • Avoids compounding effects of data pollution
What is data quality? • A data item is fit for purpose – it accurately reflects what it is supposed to, i.e. the thing it represents in the real world • Relates and contributes to the system as a whole • No data without relata – all data is relational • The form and content of the data is consistent across the whole system • Essential in a corporate-wide data warehouse for business users
Characteristics indicating data quality • Accuracy • Domain integrity • Data type • Consistency • Redundancy (i.e. minimised or controlled) • Completeness • Conformance to business rules • Clarity • Usefulness • Adherence to data integrity rules
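A sketch of how a few of these characteristics (completeness, data type, domain integrity, and a simple business rule) could be checked mechanically for a record. The field names, domain values and the business rule are invented examples.

```python
# Illustrative record-level quality checks for a few of the characteristics
# listed above. Field names, domains and the business rule are invented.
VALID_STATUSES = {"ACTIVE", "DORMANT", "CLOSED"}      # domain integrity

def quality_issues(record):
    issues = []
    if record.get("customer_id") is None:                          # completeness
        issues.append("missing customer_id")
    if not isinstance(record.get("balance"), (int, float)):        # data type
        issues.append("balance is not numeric")
    if record.get("status") not in VALID_STATUSES:                 # domain integrity
        issues.append(f"status {record.get('status')!r} outside domain")
    if record.get("status") == "CLOSED" and record.get("balance", 0) != 0:
        issues.append("closed account with non-zero balance")      # business rule
    return issues

print(quality_issues({"customer_id": 7, "balance": "12.5", "status": "CLOSD"}))
```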
Types of Problems • Dummy fields – what do they mean? • Absence of values • Unofficial use of fields • Cryptic values (typical in legacy systems) • Contradicting values • Violation of business rules • Reused PKs (sometimes done when old data is archived) • Non-unique identifiers
Types of problems, continued • Inconsistent values • Incorrect values • Multipurpose fields (reuse of fields in legacy systems)
Data Quality Challenges • A good strategy is to deal with it at its source • This is often easier said than done • Poor database design – many of the problems discussed previously relate to this issue • System conversions – I have personally been involved in moving binary data (that is, data with no assigned data type) from a proprietary system to SQL Server • Heterogeneous system integration – strong possibility of corrupted and polluted data – be cautious.
Quality, cont. • Incomplete data entry – input fields not completed, often associated with legacy systems where NOT NULL was never enforced. • Plain old input errors – legacy systems often did not have code tables; verification was sometimes required by another person, although often not. • Internationalisation/localisation – changes in company structures, mergers, acquisitions etc. can all affect quality. • Fraud – not uncommon • Lack of policy – not uncommon
The quality initiative • Consider • You may be limited by budget and resources • What data to cleanse • Where to cleanse • How to cleanse – tools, manual scripts... • You cannot eyeball-cleanse millions of rows of data – don't even think about it. • Try to discover the extent of data pollution – this will shape your strategy: test scripts, regular expressions (I used to use Perl; there are better tools now)
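A small sketch of the "discover the extent of pollution" idea using Python's re module in place of the Perl scripts mentioned above. The dummy-value list, the postcode pattern and the sample data are illustrative assumptions.

```python
import re
from collections import Counter

# Profiling sketch: scan a column of raw values and count the kinds of
# pollution found, to size the cleansing effort. Patterns are illustrative.
DUMMY_VALUES = {"", "N/A", "NA", "UNKNOWN", "9999", "XXX"}
POSTCODE_RE = re.compile(r"^\d{4,5}$")        # assumed valid postcode shape

raw_postcodes = ["3000", "N/A", "30 00", "9999", "ABCDE", "3051", ""]

profile = Counter()
for value in raw_postcodes:
    v = value.strip().upper()
    if v in DUMMY_VALUES:
        profile["dummy or missing"] += 1
    elif not POSTCODE_RE.match(v):
        profile["malformed"] += 1
    else:
        profile["looks valid"] += 1

print(profile)   # e.g. Counter({'dummy or missing': 3, 'malformed': 2, 'looks valid': 2})
```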
Practical tips • Identify high-impact pollution sources and begin purification there • Do not try to do everything manually • Tools are good and useful – find good ones • Agree on standards and reconfirm these • Link data quality to business objectives • Get senior management involvement • Get users involved – keep them informed • Bring in outside experts for specific assignments
Next week ... • The jewel in the crown of the Data Warehouse • Knowledge discovery / Data Mining