1 / 62

Recent Developments in Data Warehousing

Recent Developments in Data Warehousing. Hugh J. Watson Terry College of Business University of Georgia hwatson@terry.uga.edu http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt. Tutorial Objectives. Provide an overview of data warehousing

torie
Download Presentation

Recent Developments in Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Recent Developments in Data Warehousing Hugh J. Watson Terry College of Business University of Georgia hwatson@terry.uga.edu http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt

  2. Tutorial Objectives • Provide an overview of data warehousing • Provide materials to support the teaching of data warehousing • Discuss recent developments in data warehousing

  3. The Importance of Data Warehousing • Provide a “single version of the truth” • Improve decision making • Support key corporate initiatives such as performance management, B2C and B2B e-commerce, and customer relationship management • Estimated to be a $113.5 billion market in 2002 for systems, software, services, and in-house expenditures (Palo Alto Management Group)

  4. Data Warehouse Characteristics • Subject oriented -- data are organized around sales, products, etc. • Integrated -- data are integrated to provide a comprehensive view • Time variant -- historical data are maintained • Nonvolatile -- data are not updated by users

  5. Topics Covered • Definitions and concepts • Two case studies: Harrah’s Entertainment (first) and Owens&Minor (last) • The data mart and enterprise-wide data warehouse strategies • Data extraction, cleansing, transformation and loading • Meta data • Data stores • Online analytical processing (OLAP) • Warehouse users, tools, and applications

  6. Harrah’s Entertainment • Harrah’s Entertainment -- data warehousing supported a successful shift to a CRM oriented corporate strategy. Winner of the 2000 TDWI Leadership Award • Operates 21 casinos across the country • In 1993, the gaming laws changed, which allowed Harrah’s to expand • Harrah’s decided to compete using a brand strategy supported by information technology • Needed to know their customers exceptionally well

  7. Harrah’s Data Warehousing Architecture • WINet sources data from the casino, hotel, and event systems • The patron data base serves as an operational data store • The marketing workbench serves as the data warehouse

  8. Sample Applications • Operational personnel use PDB to check the preferences, history, and value of customers • Analysts use PDB and MWB to create offers to visit a Harrah’s casino • Analysts use MWB to support predictive modeling efforts

  9. Two Data Warehousing Strategies • Enterprise-wide warehouse, top down, the Inmon methodology • Data mart, bottom up, the Kimball methodology • When properly executed, both result in an enterprise-wide data warehouse

  10. The Data Mart Strategy • The most common approach • Begins with a single mart and architected marts are added over time for more subject areas • Relatively inexpensive and easy to implement • Can be used as a proof of concept for data warehousing • Can perpetuate the “silos of information” problem • Can postpone difficult decisions and activities • Requires an overall integration plan

  11. The Enterprise-wide Strategy • A comprehensive warehouse is built initially • An initial dependent data mart is built using a subset of the data in the warehouse • Additional data marts are built using subsets of the data in the warehouse • Like all complex projects, it is expensive, time consuming, and prone to failure • When successful, it results in an integrated, scalable warehouse

  12. Data Sources and Types • Primarily from legacy, operational systems • Almost exclusively numerical data at the present time • External data may be included, often purchased from third-party sources • Technology exists for storing unstructured data and expect this to become more important over time

  13. Extraction, Transformation, and Loading (ETL) Processes • The “plumbing” work of data warehousing • Data are moved from source to target data bases • A very costly, time consuming part of data warehousing

  14. Recent Development:More Frequent Updates • Updates can be done in bulk and trickle modes • Business requirements, such as trading partner access to a Web site, requires current data • For international firms, there is no good time to load the warehouse

  15. Recent Development: Clickstream Data • Results from clicks at web sites • A dialog manager handles user interactions. An ODS helps to custom tailor the dialog • The clickstream data is filtered and parsed and sent to a data warehouse where it is analyzed • Software is available to analyze the clickstream data

  16. Recent Development:Further Automation of ETL Processes • MetaRecon from Metagenix reverse engineers data into information • Analyzes and profiles source systems • Uncovers problems in source systems • Recommends primary and secondary keys, dimensions and measures, etc. • Generates ETL scripts

  17. Data Extraction • Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data) • Sometimes source data is copied to the target database using the replication capabilities of standard RDMS (not recommended because of “dirty data” in the source systems) • Increasing performed by specialized ETL software

  18. Sample ETL Tools • DataStage from Ascential Software • SAS System from SAS Institute • Power Mart/Power Center from Informatica • Sagent Solution from Sagent Software • Hummingbird Genio Suite from Hummingbird Communications

  19. Reasons for “Dirty” Data • Dummy Values • Absence of Data • Multipurpose Fields • Cryptic Data • Contradicting Data • Inappropriate Use of Address Lines • Violation of Business Rules • Reused Primary Keys, • Non-Unique Identifiers • Data Integration Problems

  20. Data Cleansing • Source systems contain “dirty data” that must be cleansed • ETL software contains rudimentary data cleansing capabilities • Specialized data cleansing software is often used. Important for performing name and address correction and householding functions • Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)

  21. Steps in Data Cleansing • Parsing • Correcting • Standardizing • Matching • Consolidating

  22. Parsing • Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. • Examples include parsing the first, middle, and last name; street number and street name; and city and state.

  23. Correcting • Corrects parsed individual data components using sophisticated data algorithms and secondary data sources. • Example include replacing a vanity address and adding a zip code.

  24. Standardizing • Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules. • Examples include adding a pre name, replacing a nickname, and using a preferred street name.

  25. Matching • Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications. • Examples include identifying similar names and addresses.

  26. Consolidating • Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.

  27. Data Staging • Often used as an interim step between data extraction and later steps • Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes • At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse • There is usually no end user access to the staging file • An operational data store may be used for data staging

  28. Data Transformation • Transforms the data in accordance with the business rules and standards that have been established • Example include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates

  29. Data Loading • Data are physically moved to the data warehouse • The loading takes place within a “load window” • The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications

  30. Meta Data • Data about data • Needed by both information technology personnel and users • IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc. • Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.

  31. Recent Development:Meta Data Integration • A growing realization that meta data is critical to data warehousing success • Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools • Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings

  32. Database Vendors • High end (i.e., terabyte plus) vendors include IBM (DB2) and NCR-Teradata (Teradata) • Oracle (8i) and Microsoft (SQL Server 7) are major players for smaller databases

  33. On-line Analytical Processing (OLAP) • A set of functionality that facilitates multidimensional analysis • Allows users to analyze data in ways that are natural to them • Comes in many varieties -- ROLAP, MOLAP, DOLAP, etc.

  34. ROLAP • Relational OLAP • Uses a RDBMS to implement and OLAP environment • Typically involves a star schema to provide the multidimensional capabilities • OLAP tool manipulates RDBMS star schema data • Called slowlap by MOLAP vendors

  35. MOLAP • Multidimensional OLAP • Uses a MDDBS (e.g., Essbase) to store and access data • Usually requires proprietary (non SQL) data access tools • Provides exceptionally fast response times

  36. Star Schema • Creates non-normalized data structures • Easier for users to understand • Optimized for OLAP • Uses fact (facts or measures in the business) and dimension (establishes the context of the facts) tables

  37. OLAP Tools • Products come from vendors such as Brio, Cognos, Hyperion, and BusinessObjects • Typically available as a fat or thin (i.e., browser) client • In a web environment, the browser communicates with a web server, which talks to an application server, which connects to backend databases • The application server provides query, reporting, and OLAP analysis functionality over the web • Java applets or downloaded components augment the thin client • A broadcast server may be used to schedule, run, publish, and broadcast reports, alerts, and responses over the LAN, email, or personal digital assistant.

  38. Dimension Table Examples • Retail -- store name, zip code, product name, product category, day of week • Telecommunications -- call origin, call destination • Banking -- customer name, account number, branch, account officer • Insurance -- policy type, insured party

  39. Fact Table Examples • Retail -- number of units sold, sales amount • Telecommunications -- length of call in minutes, average number of calls • Banking -- average monthly balance • Insurance -- claims amount

  40. The Fact Table Key Concatenates the Dimension Keys Assume that you want to know the number of television sets sold to Best Buys on January 15, 2001. The query might be: SELECT CLIENT.CUSNAME, SALES.NOSOLD FROM CLIENT, PRODUCT, TIME, SALES WHERE CLIENT.CUSNAME=SALES.CUSNAME AND PRODUCT.PRODNAME=SALES.PRODNAME AND TIME.DATE=SALES.DATE AND CLIENT.CUSNAME=“BEST BUYS” AND PRODUCT.PRODNAME=“TELEVISION” AND TIME.DATE=#01/15/2001#

  41. Warehouse Users • Analysts • Managers • Executives • Operational personnel • Customers and suppliers

More Related