
Data Warehousing


Presentation Transcript


  1. Data Warehousing 3. ETL and Data Quality (Ponniah, Chapters 12, 13)

  2. Objectives • Various aspects of data extraction, transformation and loading • Data extraction techniques and challenges • Data integration and consolidation • The data load function • Why ETL is critical, time-consuming and arduous • Data quality challenges • The benefits of data quality

  3. The bridge • ETL is the bridge between operational data and your data warehouse • A set of processes that: • Extract data from operational systems • Transform and cleanse data • Index the data • Summarise the data • Load data into the data warehouse (a minimal end-to-end sketch follows below)
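To make the bridge concrete, here is a minimal end-to-end sketch in Python, assuming a hypothetical orders table in a SQLite source and a fact_orders table in the warehouse; all names are illustrative, not from Ponniah.

```python
import sqlite3

# A minimal, illustrative ETL pass: extract rows from an operational
# table, cleanse/transform them, and load them into a warehouse table.
def etl(source_db: str, warehouse_db: str) -> None:
    src = sqlite3.connect(source_db)
    dwh = sqlite3.connect(warehouse_db)

    # Extract: pull raw rows from the operational system.
    rows = src.execute("SELECT order_id, customer, amount FROM orders").fetchall()

    # Transform and cleanse: standardise names, drop rows with missing amounts.
    cleaned = [
        (order_id, customer.strip().upper(), float(amount))
        for order_id, customer, amount in rows
        if amount is not None
    ]

    # Load: apply the prepared rows to the warehouse table.
    dwh.executemany(
        "INSERT INTO fact_orders (order_id, customer, amount) VALUES (?, ?, ?)",
        cleaned,
    )
    dwh.commit()
    src.close()
    dwh.close()
```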

  4. Continued... • ... • Make any changes to source data required by the data warehouse • Restructure keys • Maintain metadata • Refresh the warehouse

  5. Challenges • Source systems are diverse and disparate • Need to deal with systems on different platforms and operating systems • Older legacy systems running obsolete DBMS environments • Historical data – changes in data are not preserved in source systems, yet historical data is critical in the DW • Quality of data is dubious in legacy systems that have evolved

  6. Challenges cont. • Source systems continue to change over time due to changes in business environments – ETL functions need to be modified accordingly • Gross lack of consistency among systems is prevalent • Data may be redundant / replicated across systems • Inconsistent data creates questions over how to resolve mismatches • Source systems may represent data in formats that are cryptic or ambiguous

  7. ETL steps • Determine target data – keep in mind, this may not be all operational data • Determine sources • Prepare data mapping for target elements from sources (a mapping sketch follows below) • Establish complete extraction rules • Determine transformation and cleansing rules • Plan aggregate tables • Organise staging area and test tools • Write procedures for loading
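One way to record the data-mapping step is a simple source-to-target map; this sketch uses hypothetical system and field names.

```python
# A hedged sketch of a source-to-target data map: each warehouse
# element records where it comes from and how it is derived.
DATA_MAP = {
    "dim_customer.customer_name": {
        "source": "CRM.customers.cust_nm",
        "rule": "trim and title-case",
    },
    "fact_sales.net_amount": {
        "source": "ERP.invoices.gross_amt, ERP.invoices.tax_amt",
        "rule": "net_amount = gross_amt - tax_amt",
    },
}
```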

  8. Extraction • Identify source applications and structures • For each data source define whether extraction is tool based or manual • For each data source decide frequency of extraction – daily, weekly etc. • Time window – how long do we have to extract while minimising impact on the operational environment • Sequencing – what order will the extraction happen in; this could be critical • Exception handling – how will we handle errors or failures (a configuration sketch follows below)
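These per-source decisions are often captured as configuration before any extraction code is written. A hedged sketch with illustrative values:

```python
from dataclasses import dataclass

# One record per data source, capturing the extraction decisions above.
@dataclass
class ExtractionPlan:
    source: str     # source application / structure
    method: str     # "tool" or "manual"
    frequency: str  # e.g. "daily", "weekly"
    window: str     # time window that minimises operational impact
    sequence: int   # extraction order, where ordering is critical
    on_error: str   # exception-handling policy

plans = [
    ExtractionPlan("orders_db", "tool", "daily", "01:00-03:00", 1, "retry_then_alert"),
    ExtractionPlan("crm_extract", "manual", "weekly", "Sun 02:00", 2, "skip_and_log"),
]
```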

  9. Techniques • Static data capture • Flexible • No revision of legacy system required • Can use vendor products (SQL) • Can be used on file-oriented systems • Performance of source systems probably not affected (a snapshot sketch follows below)
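A minimal sketch of static data capture against a SQLite source: snapshot a whole table into a flat extract file for the staging area; table and file names are assumptions.

```python
import csv
import sqlite3

# Static data capture: take a full point-in-time snapshot of a source
# table and write it to a flat extract file for staging.
def static_capture(source_db: str, table: str, out_path: str) -> None:
    con = sqlite3.connect(source_db)
    cur = con.execute(f"SELECT * FROM {table}")  # full snapshot, no deltas
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)
    con.close()
```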

  10. Techniques • Capture through transaction logs • Less flexibility • Performance of source systems probably not affected • Can't be used on file-oriented systems (because they don't have transaction logs) • Vendor products available

  11. Techniques • Triggers • Can customise to suit • Performance of source systems will be somewhat affected • Can't be used on legacy and file-based systems • No revision to applications, but some internal DB development required • Vendor products available (a trigger sketch follows below)
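A hedged illustration of trigger-based capture, using SQLite only because it is easy to run (the slide's caveat about legacy and file-based systems still applies); the customers table and change table are hypothetical.

```python
import sqlite3

# Trigger-based change capture: every UPDATE on the customers table
# writes a row into a change table that the ETL job later drains.
# This is internal DB development only; application code is untouched.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_changes (
    id INTEGER, name TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER trg_customer_update AFTER UPDATE ON customers
BEGIN
    INSERT INTO customer_changes (id, name) VALUES (NEW.id, NEW.name);
END;
""")
con.execute("INSERT INTO customers VALUES (1, 'Smith')")
con.execute("UPDATE customers SET name = 'Smyth' WHERE id = 1")
print(con.execute("SELECT id, name FROM customer_changes").fetchall())
# -> [(1, 'Smyth')]
```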

  12. Techniques • Source application capture • Requires major revisions to apps • Performance affected • Can be used on most legacy and file based systems • High costs – cost of revising app code • High risk

  13. Techniques • A couple of other lesser-used techniques • Capture based on date / time stamp (sketched below) • File comparison
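A sketch of capture based on a date / time stamp, assuming the source table carries a hypothetical updated_at column: extract only rows touched since the last run, then advance the watermark.

```python
import sqlite3

# Timestamp-based incremental capture: select rows changed since the
# last extraction and return a new watermark for the next run.
def incremental_capture(con: sqlite3.Connection, last_run: str):
    rows = con.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_run,),
    ).fetchall()
    new_watermark = max((r[2] for r in rows), default=last_run)
    return rows, new_watermark
```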

  14. Transformation • Basic tasks • Selection • Splitting / joining • Conversion • Summarisation • Enrichment

  15. Transformation Types • A combination of the above tasks will be required – now consider some specific types of transformation • Format revisions • Decoding of fields • Calculated and derived values • Splitting single fields • Merging information • Character set conversion • Date / time conversions • Summarisation • Key restructuring (a few of these are sketched below)
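A few of these transformation types sketched in Python against one hypothetical source record (decoding, splitting a single field, a derived value and a format revision):

```python
CODE_TABLE = {"M": "Male", "F": "Female"}  # decoding of cryptic fields

def transform(record: dict) -> dict:
    first, last = record["full_name"].split(" ", 1)  # splitting a single field
    return {
        "first_name": first,
        "last_name": last,
        "gender": CODE_TABLE.get(record["sex"], "Unknown"),  # decoding
        "net_price": record["gross"] - record["discount"],   # derived value
        "order_date": record["order_dt"].replace("/", "-"),  # format revision
    }

print(transform({"full_name": "Ada Lovelace", "sex": "F",
                 "gross": 100.0, "discount": 15.0, "order_dt": "2024/01/31"}))
```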

  16. Data Loading • Transformation processes end with a “load image” being created. This next stage takes the prepared data and applies it to the data warehouse. • Three load (refresh) types • Initial load • Incremental load • Full refresh

  17. Load Modes • Initial load – either creates or inserts data into pre-defined warehouse tables • Append mode – existing data is maintained, new data is appended to tables • Destructive merge – if the PK of an incoming record matches an existing record, perform an UPDATE of the row • Constructive merge – if the PK matches, leave the existing row but mark the new row as superseding the old row (these modes are sketched below)
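The merge modes sketched in Python with SQLite; the dim_product table and its business_key / is_current columns are illustrative assumptions. Keeping the incoming PK as a plain business_key column (with SQLite's implicit rowid as surrogate) lets the constructive merge retain both the old and the superseding row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE dim_product (business_key INTEGER, name TEXT, is_current INTEGER DEFAULT 1)"
)

def append(key, name):
    # Append mode: existing data is maintained, new data is simply added.
    con.execute("INSERT INTO dim_product (business_key, name) VALUES (?, ?)", (key, name))

def destructive_merge(key, name):
    # Matching key: UPDATE the existing row in place.
    cur = con.execute("UPDATE dim_product SET name = ? WHERE business_key = ?", (name, key))
    if cur.rowcount == 0:
        append(key, name)

def constructive_merge(key, name):
    # Matching key: keep the old row but flag it as superseded,
    # then insert the new row as the current version.
    con.execute("UPDATE dim_product SET is_current = 0 WHERE business_key = ?", (key,))
    append(key, name)

append(1, "Widget")
constructive_merge(1, "Widget v2")
print(con.execute("SELECT * FROM dim_product").fetchall())
# -> [(1, 'Widget', 0), (1, 'Widget v2', 1)]
```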

  18. Data Quality – The Key to Success • Why is it critical? • What is data quality? • Types of problems and challenges • Some practical tips

  19. Why is it critical? • Boosts confidence in decision making • Enables better customer service • Increases opportunity to add value • Reduces risk • Reduces costs – especially with marketing campaigns • Enhances strategic decision making • Improves productivity • Avoids compounding effects of data pollution

  20. What is data quality? • Data item is fit for purpose – it accurately reflects what it is supposed to, i.e. the thing it represents in the real world • Relates and contributes to the system as a whole • No data without relata – all data is relational • The form and content of the data is consistent across the whole system • Essential in a corporate-wide data warehouse for business users

  21. Characteristics indicating data quality • Accuracy • Domain integrity • Data type • Consistency • Redundancy (i.e. minimised or controlled) • Completeness • Conformance to business rules • Clarity • Usefulness • Adherence to data integrity rules (a few are expressed as executable checks below)
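A few of these characteristics can be expressed as executable checks; this sketch uses hypothetical field names and rules.

```python
import re

# Completeness, domain integrity and conformance to business rules,
# checked over a hypothetical customer record.
def quality_issues(rec: dict) -> list:
    issues = []
    if rec.get("age") is None:
        issues.append("completeness: age is missing")
    elif not 0 <= rec["age"] <= 120:
        issues.append("domain integrity: age outside 0-120")
    if not re.fullmatch(r"[A-Z]{2}\d{4}", rec.get("cust_code", "")):
        issues.append("conformance: cust_code violates the coding rule")
    return issues

print(quality_issues({"age": 200, "cust_code": "XX12"}))
# -> both the domain-integrity and conformance issues are reported
```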

  22. Types of Problems • Dummy fields – what do they mean? • Absence of values • Unofficial use of fields • Cryptic values (typical in legacy systems) • Contradicting values • Violation of business rules • Reused PKs (sometimes done when old data is archived) • Non-unique identifiers

  23. Types of problems, continued • Inconsistent values • Incorrect values • Multipurpose fields (reuse of fields in legacy systems) (a short profiling sketch follows below)
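A short profiling sketch covering two of these problems (reused keys with contradicting values, and dummy values); the records are hypothetical.

```python
from collections import Counter

records = [
    {"id": 1, "dob": "1980-05-01"},
    {"id": 1, "dob": "1975-03-12"},  # reused key with contradicting values
    {"id": 2, "dob": "9999-99-99"},  # dummy value
]

# Non-unique identifiers: any id appearing more than once.
dupes = [k for k, n in Counter(r["id"] for r in records).items() if n > 1]
# Dummy values: a sentinel pattern masquerading as a real date.
dummies = [r["id"] for r in records if r["dob"].startswith("9999")]
print("non-unique ids:", dupes)   # -> [1]
print("dummy dates:", dummies)    # -> [2]
```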

  24. Data Quality Challenges • A good strategy is to deal with it at its source • This is often easier said than done • Poor database design – many of the problems discussed previously relate to this issue • System conversions – I have personally been involved in moving binary data (that is, data with no assigned data type) from a proprietary system to SQL Server • Heterogeneous system integration – strong possibility of corrupted and polluted data – be cautious

  25. Quality, cont. • Incomplete data entry – input fields not completed, often associated with legacy systems where NOT NULL was never enforced • Plain old input errors – legacy systems often did not have code tables; verification by another person was often required, although sometimes not • Internationalisation / localisation – changes in company structures, mergers, acquisitions etc. can all affect quality • Fraud – not uncommon • Lack of policy – not uncommon

  26. The quality initiative • Consider • You may be limited by budget and resources • What data to cleanse • Where to cleanse • How to cleanse – tools, manual scripts... • You cannot eyeball-cleanse millions of rows of data; don't even think about it • Try to discover the extent of data pollution – this will help your strategy. Use test scripts and regular expressions (I used to use Perl; there are better tools now) – a pattern-profiling sketch follows below
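One way to size the pollution before cleansing is pattern profiling with regular expressions, sketched here in Python rather than Perl; the phone-number values are hypothetical sample data.

```python
import re
from collections import Counter

# Reduce each value to a shape (digits -> 9, letters -> A) and count
# the shapes: the distribution shows how polluted the column is
# before any cleansing effort is committed.
values = ["0412 345 678", "N/A", "0412345678", "", "call later"]

def shape(v: str) -> str:
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) or "<empty>"

print(Counter(shape(v) for v in values))
# -> Counter({'9999 999 999': 1, 'A/A': 1, '9999999999': 1,
#             '<empty>': 1, 'AAAA AAAAA': 1})
```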

  27. Practical tips • Identify high-impact pollution sources and begin purification there • Do not try to do everything manually • Tools are good and useful – find good ones • Agree on standards and reconfirm these • Link data quality to business objectives • Get senior management involvement • Get users involved – keep them informed • Bring in outside experts for specific assignments

  28. Next week ... • The jewel in the crown of the Data Warehouse • Knowledge discovery / Data Mining
