CS 345: Topics in Data Warehousing

CS 345:Topics in Data Warehousing Tuesday, October 12, 2004

Review of Thursday’s Class • Facts • Semi-additive facts • “Factless” fact tables • Slowly Changing Dimensions • Overwrite history • Preserve history • Hybrid schemes • More dimension topics • Dimension roles • Junk dimension • More fact topics • Multiple currencies • Master/Detail facts and fact allocation • Accumulating Snapshot fact tables

Outline of Today’s Class • Customer Relationship Management (CRM) • Dimension-focused queries • Drill-across • Conformed dimensions • Customer dimension • Behavioral attributes • Auxiliary tables • Techniques for very large dimensions • Outriggers • Mini-dimensions • Hierarchies • Bridge tables

Customer Relationship Management (CRM) • Currently a hot topic in business data analysis • Idea: Gain better understanding of customer behavior by integrating data from various sources • Multiple interaction types • Orders • Returns • Customer support • Billing • Service / repairs • Multiple interaction channels • Retail store • E-mail • Call center (Inbound / Outbound) • Web site

CRM questions • Customer profitability • Identify most / least profitable customers • 80/20 rule • Customer retention • Which customers are most likely to defect to a competitor? • Which retention measures work best? • Customer acquisition • Which prospects are most promising? • What offers will entice them to become customers? • Up-sell / Cross-sell • Gain additional business from existing customers • Provide targeted offers during “inbound” communications

Dimension-focused Queries • Standard OLAP queries are fact-focused • Query touches one fact table and its associated dimensions • Some types of analysis are dimension-focused • Bring together data from different fact tables that have a dimension in common • Common dimension used to coordinate facts • Sometimes referred to as “drilling across”

Drill-Across Example • Example scenario: • Sales fact with dimensions (Date, Customer, Product, Store) • CustomerSupport fact with dimensions (Date, Customer, Product, ServiceRep) • Question: How does frequency of support calls by California customers affect their purchases of Product X? • Step 1: Query CustomerSupport fact • Group by Customer SSN • Filter on State = California • Compute COUNT • Query result has schema (Customer SSN, SupportCallCount) • Step 2: Query Sales fact • Group by Customer SSN • Filter on State = California, Product Name = Product X • Compute SUM(TotalSalesAmt) • Query result has schema (Customer SSN, TotalSalesAmt) • Step 3: Combine query results • Join Result 1 and Result 2 based on Customer SSN • Group by SupportCallCount • Compute COUNT, AVG(TotalSalesAmt)

A Problem with the Example • What if some customers don’t make any support calls? • No rows for these customers in CustomerSupport fact • No rows for these customers in result of Step 1 • No data for these customers in result of Step 3 • Solution: use outer join in Step 3 • Customers who are in Step 2 but not Step 1 will be included in result of Step 3 • Attributes from Step 1 result table will be NULL for these customers • Convert these NULLs to an appropriate value before presenting results • Using SQL NVL() function

Conformed Dimensions • Bottom-up data warehousing approach builds one data mart at a time • Drill-across between data marts requires common dimension tables • Common dimensions and attributes should be standardized across data marts • Create master copy of each common dimension table • Three types of “conformed” dimensions: • Dimension table identical to master copy • Dimension table has subset of rows from the master copy • Can improve performance when many dimension rows are not relevant to a particular process • Dimension table has subset of attributes from master copy • Allows for roll-up dimensions at different grains

Conformed Dimension Example • Monthly sales forecasts • Predicted sales for each brand in each district in each month • POS Sales fact recorded at finer-grained detail • Product SKU vs. Brand • Date vs. Month • Store vs. District • Use roll-up dimensions • Brand dimension is rolled-up version of master Product dimension • One row per brand • Only include attributes relevant at brand level or higher • Month dimension is rolled-up Date • District dimension is rolled-up Store • Schema • Sales (Date, Product, Store, Promotion, Transaction ID) • Forecast (Month, Brand, District)

Drill-Across Example • Question: How did actual sales diverge from forecasted sales in Sept. ‘04? • Drill-across between Forecast and Sales • Step 1: Query Forecast fact • Group by Brand Name, District Name • Filter on MonthAndYear =‘Sept 04’ • Calculate SUM(ForecastAmt) • Query result has schema (Brand Name, District Name, ForecastAmt) • Step 2: Query Sales fact • Group by Brand Name, District Name • Filter on MonthAndYear =‘Sept 04’ • Calculate SUM(TotalSalesAmt) • Query result has schema (Brand Name, District Name, TotalSalesAmt) • Step 3: Combine query results • Join Result 1 and Result 2 on Brand Name and District Name • Result has schema (Brand Name, District Name, ForecastAmt, TotalSalesAmt) • Outer join unnecessary assuming: • Forecast exists for every brand, district, and month • Every brand has some sales in every district during every month

The Customer Dimension • Customer dimensions can be very wide • Often dozens or even hundreds of attributes • Contact information (name, address, phone, e-mail) • Demographics (age, ethnicity, gender, education, profession, income, household size, etc.) • Psychographics (interests, values, beliefs, attitudes) • Dates (birthday, first purchase, last purchase, online reg. date) • Behavioral scores (RFM, churn propensity, etc.) • Data available from many sources • Information provided directly by customers • Prospect lists acquired from partners or vendors • Syndicated data • Market research • Customs data • Data derived from warehouse analysis

Behavioral Attributes • Customers can be segmented based on past behavior • Aggregated fact data converted to dimensional attributes • Examples: • RFM scoring • Recency of last purchase • Frequency of purchases • Monetary value of purchases • Scores based on predictive models • Propensity to churn • Probability of default • Segmentation based on clustering algorithms • Raw aggregated data • Total dollar sales in past year

Behavioral Attributes • Two techniques for handling behavioral attributes • Dimension attributes generated during ETL process • Stored in dimension table • Good query performance • Limited flexibility • Preserving history possible (but may be expensive) • Virtual attributes created “on demand” via user queries • Stored in auxiliary tables • Very flexible and customizable • Increased management complexity • Increased query complexity • Query performance may suffer • Can’t easily preserve history

Auxiliary Tables for User-Defined Attributes User-CreatedAuxiliary Table Customer Dimension Natural keyof customerdimension User-definedattribute fromquery result Name is natural key • Join dimension and auxiliary table using natural key • Join result looks like “expanded” customer dimension

Another Use of Auxiliary Tables • Track a set of customers over time • For example, a focus group or pre-selected sample • Set of customers may be defined based on a query • Query results may change over time as customer attributes slowly change • How to preserve the initial set? • Create single-column auxiliary table containing natural key of customers in the set • Join to the auxiliary table to filter based on the initial customer set

Continuous vs. Discrete Values • Some attributes have large number of possible values on a continuous scale • Income • Age • Most simple behavioral attributes are of this type • TotalSalesOfProductXIn2003 • Disadvantages of continuous attributes: • Grouping by continuous attribute produces huge, meaningless report • Number of unique attribute combinations explodes • Greater number of rows in dimension • More frequent changes to value of dimension attributes • Group continuous attributes into discrete bands • Like a histogram • Instead of Salary = 47540, use Salary = $40K-$50K • Avoids above disadvantages • Downside = loss of information

Very Large Dimensions • Customer dimensions can be very wide • Dozens or hundreds of attributes • Customer dimensions can be very large • Tens of millions of rows in some warehouses • Sometimes includes prospects as well as actual customers • Size can lead to performance challenges • One case when performance concerns can trump simplicity • Can we reduce width of dimension table? • Can we reduce number of rows caused by preserving history for slowly changing dimension?

Outrigger Tables • Limited normalization of large dimension table to save space • Identify attribute sets with these properties: • Highly correlated • Low in cardinality (compared to # of customers) • Change in unison • Example: • External data provider computes demographic data for each county • 100 demographic attributes are provided • Updates are supplied every six months • Follow these steps for each attribute set: • Create a separate “outriggerdimension” for each attribute set • Remove the attributes from the customer dimension • Replace with a foreign key to the outrigger table • No foreign key from fact row to outrigger • Outrigger attributes indirectly associated with facts via customer dim.

Outrigger Example Customer Dimension County Demographics Outrigger

Outrigger Tables • Advantages: • Space savings • Customer dimension table becomes narrower • Outrigger table has relatively few rows • One copy per county vs. one copy per customer • Disadvantages: • Additional tables introduced • Accessing outrigger attributes requires an extra join • Users must remember which attributes are in outrigger vs. main customer dimension • Creating a view can solve this problem

Mini-Dimensions • Some attributes change relatively frequently • Behavior-based scores • Certain demographic attributes • Age, Income, Marital Status, # of children • How to preserve history without row explosion? • Some attributes are queried relatively frequently • Queries using huge customer dimension are slowed • How to improve query performance? • Create a mini-dimension: • Remove frequently-changing or frequently-queried attributes from the customer dimension • Add them to a separate mini-dimension table instead • Discretize mini-dimension attributes to reduce cardinality • Group continuously-valued attributes into buckets or bands • Example: Age < 20, Age 20-29, Age 30-39, Age 40-49, Age 50+ • Include foreign keys to both customer dimension & mini-dimension in fact table

Mini-Dimensions • Advantages: • History preserved without space blow-up • Fact table captures historical record of attribute values • Mini-dimension has small number of rows • # of unique combinations of mini-dimension attributes is small • Consequence of discretization • Limit number of attributes in a single mini-dimension! • Improved performance for queries that use mini-dimension • At least for those queries that can avoid the main customer dimension • Disadvantages: • Fact table width increases • Due to increased number of dimension foreign keys • Information lost due to discretization • Less detail is available • Impractical to change bucket / band boundaries • Additional tables introduced • Users must remember which attributes are in mini-dimension vs. main customer dimension

Outrigger vs. Mini-Dimension CustomerDimension CustomerDimension Fact Fact Outrigger Mini-dimension

Outrigger vs. Mini-Dimension • Mini-dimension approach • Explicit link between fact table and mini-dimension • No explicit link between customer dimension and mini-dimension • Difficult to express queries that group past customer behavior based on current demographic values • Implicit association via fact table • Outrigger approach • No explicit link between fact table and outrigger • Explicit link between customer dimension and outrigger • Associating facts with outrigger requires join through customer dimension • Preserving history for rapidly-changing attributes leads to customer dimension blow-up • Hybrid approach • Separate some attributes into their own mini-dimension • Add foreign key to mini-dimension to both fact table and customer dimension • Customer dimension foreign key updated using “overwrite history” semantics • Queries based on historically accurate attribute values use fact table foreign key • Queries based on latest attribute values use customer dimension foreign key • Greater expressive power, and greater risk of confusing users!

More Outriggers / Mini-Dims • Lots of information about some customers, little info about others • A common scenario • Example: web site browsing behavior • Web User dimension (= Customer dimension) • Unregistered users • User identity tracked over time via cookies • Limited information available • First active date, Latest active date, Behavioral attributes • Possibly ZIP code through IP lookup • Registered users • Lots of data provided by user during registration • Many more unregistered users than registered users • Most attribute values are unknown for unregistered users • Split registered user attributes into a separate table • Either an outrigger or a mini-dimension • For unregistered users, point to special “Unregistered” row

Handling Hierarchies • Hierarchical relationships among dimension attributes are common • There are various ways to handle hierarchies • Store all levels of hierarchy in denormalized dimension table • The preferred solution in almost all cases! • Create “snowflake” schema with hierarchy captured in separate outrigger table • Only recommended for huge dimension tables • Storage savings have negligible impact in most cases • What about variable-depth hierarchies? • Examples: • Corporate organization chart • Parts composed of subparts • Previous two solutions assumed fixed-depth • Creating recursive foreign key to parent row is a possibility • Employee dimension has “boss” attribute which is FK to Employee • The CEO has NULL value for boss • This approach is not recommended • Cannot be queried effectively using SQL • Alternative approach: bridge table

Fact Bridge Customer parent_id cust_id cust_id child_id Bridge Tables Customer 1 • Customer dimension has one row for each customer entity at any level of the hierarchy • Separate bridge table has schema: • Parent customer key • Subsidiary customer key • Depth of subsidiary • Bottom flag • Top flag • One row in bridge table for every (ancestor, descendant) pair • Customer counts as its own Depth-0 ancestor • 16 rows for the hierarchy at right • Fact table can join: • Directly to customer dimension • Through bridge table to customer dimension Customer 2 Customer 3 Customer 4 Customer 5 Customer 6 Customer 7

Using Bridge Tables in Queries • Two join directions • Navigate up the hierarchy • Fact joins to subsidiary customer key • Dimension joins to parent customer key • Navigate down the hierarchy • Fact joins to parent customer key • Dimension joins to subsidiary customer key • Safe uses of the bridge table: • Filter on customer dimension restricts query to a single customer • Use bridge table to combine data about that customer’s subsidiaries or parents • Filter on bridge table restricts query to a single level • Require Top Flag = Y • Require Depth = 1 • For immediate parent / child organizations • Require (Depth = 1 OR Top Flag = Y) • Generalizes the previous example to properly treat top-level customers • Other uses of the bridge table risk over-counting • Bridge table is many-to-many between fact and dimension

CS 345: Topics in Data Warehousing

CS 345: Topics in Data Warehousing

Presentation Transcript

Data Warehousing/Mining Comp 150 DW Chapter 8. Cluster Analysis

Chapter 7 Using Data Flow Diagrams

Chapter 3: Data Mining and Data Visualization

Introduction to GIS

Topics of the Lecture

Some Topics in Statistical Data Analysis

Data

Midterm Review

Fall 2004, CIS, Temple University CIS527: Data Warehousing, Filtering, and Mining Lecture 6

Types of Warehousing

Data Warehousing and OLAP

Data Mining: Data Preprocessing

Analysis of Complex Survey Data and Survival Analysis

Chapter 2 Java Fundamentals

Data Warehousing : and OLAP — BIS 541 — — Chapter 3 —

Topics for Today

Topics for Today

Topics for Today

Topics in Security Testing

Topics Covered Introduction to Networks Data Communication Terminologies Transmission Media

Create Actionable Data for Schools