CS 345: Topics in Data Warehousing

CS 345: Topics in Data Warehousing Tuesday, November 16, 2004

Review of Thursday’s Class • Dimension Key Mapping Revisited • Comments on Assignment #2 • Updating the Data Warehouse • Incremental maintenance vs. drop & rebuild • Self-maintainable views • Approximate Query Processing • Sampling-based techniques • Computing confidence intervals • Online vs. pre-computed samples • Sampling and joins • Alternative techniques

Outline of Today’s Class • Data Mining • What is data mining? • Types of data mining • Data mining pitfalls • Decision Tree Classifiers • What is a decision tree? • Learning decision trees • Entropy • Information Gain • Cross-Validation

Data Mining • What is data mining? • Many definitions… • Basically: identify interesting patterns in data • Most often, the term “data mining” refers to automatic detection of patterns through machine learning • Data mining is one part of the broader process of knowledge discovery in databases (KDD) • KDD: “the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” • This is what data warehousing is all about. • Data mining is a field of research • Draws from databases, artificial intelligence, statistics • Relatively new research community • Several conferences and journals • ACM KDD, SIAM Data Mining, IEEE ICDM

Knowledge Discovery in Databases • Stages of the process, from raw data to knowledge: • Data • Preprocessing: Selection, Cleaning, Transformation, Feature Extraction • Data Mining: Identify Patterns, Generate Models • Interpretation/Evaluation: Validation Tests, Visualization • Knowledge

Types of Data Mining • OLAP • Group-by aggregation queries are a simple type of data mining • Summarize the data set • Classification • Build predictive model to categorize records into discrete classes • Examples: • Classify mortgage applicants as “will default” or “will not default” • Face recognition in image database • Identify likely terrorists vs. unlikely terrorists • Regression • Build predictive model to predict real-valued function • Examples: • Predict how much revenue each customer will generate • Predict profitability of planned marketing campaign • Clustering • Separate data records into groups of similar items • Clustering vs. Classification • Classification is supervised, clustering is unsupervised • Classification uses pre-defined class labels, clustering doesn’t. • Classification has a “right answer”, clustering doesn’t.

Types of Data Mining • Outlier detection • Identify unusual or atypical data records • Sometimes to investigate them further • Sometimes to exclude them from a broad analysis • Trend analysis / forecasting • Identify changes in patterns of data over time • Example: What will be next month’s revenue? • Dependency detection • Which attributes are correlated with one another? • Which attribute values are likely to occur together? • Popular technique: Association rule mining • Also known as market basket analysis • Find products that are often bought together as part of same transaction • Temporal pattern detection / time series mining • Recognize commonly recurring patterns in time series data • Example: “Technical analysis” of financial markets

Data Mining Pitfalls • Overfitting • Spurious patterns may emerge by chance • Don’t mistake coincidence for causality • Example: ESP experiment • Ask 10,000 test subjects to predict whether each of 10 face-down playing cards is red or black • 10 subjects predicted all 10 cards correctly! • “Conclusion”: 1 out of every 1000 people has ESP • In fact, about 10,000 / 2^10 ≈ 10 subjects are expected to guess all 10 cards correctly by pure chance • Can be a particular concern in datasets with • Lots of attributes • Not too many records • Reporting “obvious” patterns • Learning cancer risk factors • Women are more likely than men to have breast cancer • Men are more likely than women to have prostate cancer • These patterns are not “novel”
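The ESP example can be checked with a line of arithmetic. The sketch below assumes every red/black guess is an independent fair coin flip; the function name is illustrative and not part of the course material.

```python
def expected_lucky_subjects(num_subjects: int, num_cards: int) -> float:
    """Expected number of subjects who guess every card correctly by pure chance."""
    # Each red/black guess is right with probability 1/2, independently of the others.
    p_all_correct = 0.5 ** num_cards
    return num_subjects * p_all_correct


print(expected_lucky_subjects(10_000, 10))  # 9.765625
```

Observing 10 perfect scores is therefore what chance alone predicts, so it is no evidence of ESP.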

Data Mining Pitfalls • Confusing correlation and causation • Data mining can identify attributes that are correlated • Correlation doesn’t necessarily imply causation • Example: Studying causes of obesity • Overweight people are more likely to drink diet soda • “Conclusion”: Drinking diet soda causes obesity • Moral of the story: Interpretation and evaluation of patterns is crucial • Data mining algorithms are not magical • Patterns they identify must be examined carefully to avoid drawing inappropriate conclusions

Decision Tree Classifiers • Decision trees are one type of classification model • Internal nodes of decision tree labeled with attributes • Each internal node represents a test • Edges labeled with attribute values • Edges represent the results of the tests • Leaves labeled with class values • Leaves represent the classifier’s predictions • To classify a record, walk down the tree starting at the root • The path that is followed depends on the attribute values of the record being classified • [Figure: example decision tree. The root tests “Employed?” (Yes / No); the second level tests “Credit Score?” and “Income?” (High / Low); the leaves are Approve or Reject.]

Decision Tree Learning • We’re given a data set with unknown values for an attribute of interest • Example: • Data set is Customer records • Attribute of interest is “Will Close Account in Next 3 Months” • Unknown attribute referred to as target attribute • This data set is referred to as the test set • We also have a second data set where the values of the target attribute are known • Referred to as the training set • We would like to build a decision tree classifier to predict the value of the target attribute • Construct a decision tree that accurately classifies the records in the training set • Use the decision tree to predict the value of the target attribute for the records from the test set • Hopefully a classifier that works well on the training set will also work well on the other data set!

Decision Tree Learning • When does decision tree learning work well? • Training set and test set are similar • Patterns in the training set are also present in the test set • Rules learned from one data set apply to the other • Decision tree identifies general, globally valid patterns • And not specific, idiosyncratic properties of the training records • Need to avoid overfitting the model to the training set • Occam’s razor: simple explanations are usually the best • Simple (small) decision trees are usually preferable • Easier for humans to interpret • Usually less prone to overfitting • Finding the smallest accurate decision tree is NP-Hard • Decision trees are usually built top-down using greedy heuristic • Idea: First test attributes that do best job of separating the classes

Decision Tree Learning • Basic decision tree learning algorithm • Do all records in training set belong to same class? • Yes → Return leaf node with that class. • Do all records in training set have the same values for all attributes (other than target)? • Yes → Return leaf node with most common class. • Otherwise: • Pick the single attribute that best separates records from different classes • Use that attribute for the root of the decision tree • Children of root node are decision trees • Build them recursively using same algorithm
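The recursive procedure on this slide can be sketched as follows. This is a minimal illustration, not the course's reference implementation: it assumes records are dictionaries, uses weighted average entropy (the criterion introduced later in these slides) as the measure of "best separates", and runs on a small made-up loan data set. All names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def expected_entropy(records, attr, target):
    """Weighted average entropy of the subsets produced by splitting on attr."""
    subsets = {}
    for r in records:
        subsets.setdefault(r[attr], []).append(r[target])
    return sum(len(s) / len(records) * entropy(s) for s in subsets.values())


def build_tree(records, attrs, target):
    """Return a class label (leaf) or a tuple (attribute, {value: subtree})."""
    labels = [r[target] for r in records]
    # Case 1: all records belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: the records agree on every attribute, so no test can separate them.
    usable = [a for a in attrs if len({r[a] for r in records}) > 1]
    if not usable:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise: pick the attribute that best separates the classes ...
    best = min(usable, key=lambda a: expected_entropy(records, a, target))
    # ... and build one child subtree per attribute value, recursively.
    children = {}
    for value in sorted({r[best] for r in records}):
        subset = [r for r in records if r[best] == value]
        children[value] = build_tree(subset, usable, target)
    return (best, children)


# Made-up training set, for illustration only.
training = [
    {"employed": "yes", "credit": "high", "label": "approve"},
    {"employed": "yes", "credit": "high", "label": "approve"},
    {"employed": "yes", "credit": "low", "label": "reject"},
    {"employed": "no", "credit": "high", "label": "reject"},
    {"employed": "no", "credit": "high", "label": "reject"},
    {"employed": "no", "credit": "low", "label": "reject"},
]
tree = build_tree(training, ["employed", "credit"], "label")
print(tree)
```

On this data the root tests `employed`; the unemployed branch is a pure "reject" leaf, and the employed branch is split once more on `credit`.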

Splitting Criterion • How to decide which attribute is best to test first? • Each attribute splits data into subsets • Ideally, each subset should be as homogeneous as possible • Need metric for homogeneity of a data set • Example: • Two classes, +/- • 100 records overall (50 +s and 50 -s) • A and B are two binary attributes • Records with A=0: 48+, 2- • Records with A=1: 2+, 48- • Records with B=0: 26+, 24- • Records with B=1: 24+, 26- • Splitting on A is better than splitting on B • A does a good job of separating +s and -s • B does a poor job of separating +s and -s

Entropy • Entropy is a good way to measure homogeneity (lower entropy = more homogeneous) • Measures minimum number of bits per record needed to optimally encode class values • Entropy example: • Three classes (A, B, C) • A occurs ½ of the time • B and C each occur ¼ of the time • Optimal encoding: A = 0, B = 10, C = 11 • Entropy = Average bits / record = (½)(1) + (¼)(2) + (¼)(2) = 1.5 • Entropy formula: $H(S) = -\sum_i p_i \log_2 p_i$ • Entropy of data set S is denoted H(S) • The $c_i$ are the possible classes • $p_i$ = fraction of records from S that have class $c_i$

Entropy Examples • Example: • 10 records have class A • 20 records have class B • 30 records have class C • 40 records have class D • Entropy = -[(.1 log .1) + (.2 log .2) + (.3 log .3) + (.4 log .4)], with logs taken base 2 • Entropy = 1.85 • Earlier example revisited • Two classes, +/- • 100 records overall (50 +s and 50 -s) • A and B are two binary attributes • Records with A=0: 48+, 2- • Entropy = 0.24 • Records with A=1: 2+, 48- • Entropy = 0.24 • Records with B=0: 26+, 24- • Entropy = 0.999 • Records with B=1: 24+, 26- • Entropy = 0.999 • A is better than B because average entropy is less after splitting on A
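The entropy figures in these examples can be recomputed from the class counts. The helper below is a small sketch (the function name and the count-list input format are assumptions made for illustration); it uses base-2 logarithms so that the result is in bits.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as per-class record counts."""
    counts = list(counts)
    total = sum(counts)
    # Classes with zero records contribute nothing (p * log p tends to 0 as p -> 0).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


print(round(entropy([2, 1, 1]), 2))         # classes A, B, C at 1/2, 1/4, 1/4
print(round(entropy([10, 20, 30, 40]), 2))  # four-class example
print(round(entropy([48, 2]), 2))           # records with A=0
print(round(entropy([26, 24]), 4))          # records with B=0
```

The first three calls print 1.5, 1.85 and 0.24; the last prints 0.9988, just under the 1-bit maximum for two classes.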

Information Gain • Information gain = Expected reduction in entropy = H(S) − H(S|A) • Expected entropy after splitting on attribute A: H(S|A) • H(S|A) = Sum [(fraction of records with A=ai)*(Entropy of records with A=ai)] • Sum is taken over all possible values of attribute A • Computes weighted average entropy across all subsets • Weight of subset = fraction of all records that fall in the subset • Always split on attribute with greatest information gain • This is one possible splitting rule for building decision trees • However, other splitting criteria are also used sometimes • Gain ratio, Gini index, etc. • Alternative methods of measuring homogeneity
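As a worked check, the information gain of the two binary attributes A and B from the earlier splitting example can be computed directly from their (+, -) counts. This is a sketch under the assumption that each subset lists its class counts in the same class order; the function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as per-class record counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def expected_entropy(subsets):
    """H(S|A): subsets holds one list of per-class counts for each value of A."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)


def information_gain(subsets):
    """H(S) - H(S|A); the class counts of S are the column sums of the subsets."""
    overall = [sum(column) for column in zip(*subsets)]
    return entropy(overall) - expected_entropy(subsets)


split_on_a = [[48, 2], [2, 48]]    # (+, -) counts for A=0 and A=1
split_on_b = [[26, 24], [24, 26]]  # (+, -) counts for B=0 and B=1
gain_a = information_gain(split_on_a)
gain_b = information_gain(split_on_b)
print(round(gain_a, 3), round(gain_b, 3))
```

Splitting on A removes roughly three quarters of a bit of uncertainty per record, while splitting on B removes almost none, which is why A is the better first test.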

Decision Tree Example • Predicting the weather • Target attribute = Weather • Source attributes = State, Season, Barometer

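Decision Tree Example • Entropy of each subset, and expected (weighted average) entropy after each candidate split:
State: AK: 2 Snow, 1 Sun → 0.92 • HI: 3 Sun → 0.00 • CA: 2 Rain, 1 Sun → 0.92 • Expected entropy = 0.61
Season: Winter: 2 Snow, 2 Sun, 1 Rain → 1.52 • Summer: 3 Sun, 1 Rain → 0.81 • Expected entropy = 1.21
Barometer: Down: 1 Snow, 4 Sun → 0.72 • Up: 1 Snow, 1 Sun, 2 Rain → 1.50 • Expected entropy = 1.07
State has the lowest expected entropy, so it is tested at the root.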
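The comparison of the three candidate root attributes can be reproduced from the class counts listed on this slide. The sketch below takes those counts as given (the underlying table of records is not reproduced here) and ranks the attributes by weighted average entropy; the dictionary layout and function names are assumptions made for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as per-class record counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def expected_entropy(split):
    """Weighted average entropy; split maps attribute value -> {class: count}."""
    total = sum(sum(classes.values()) for classes in split.values())
    return sum(
        sum(classes.values()) / total * entropy(classes.values())
        for classes in split.values()
    )


# Class counts per attribute value, copied from the slide.
splits = {
    "State": {
        "AK": {"Snow": 2, "Sun": 1},
        "HI": {"Sun": 3},
        "CA": {"Rain": 2, "Sun": 1},
    },
    "Season": {
        "Winter": {"Snow": 2, "Sun": 2, "Rain": 1},
        "Summer": {"Sun": 3, "Rain": 1},
    },
    "Barometer": {
        "Down": {"Snow": 1, "Sun": 4},
        "Up": {"Snow": 1, "Sun": 1, "Rain": 2},
    },
}

# Rank attributes from best (lowest expected entropy) to worst.
ranking = sorted(splits, key=lambda attr: expected_entropy(splits[attr]))
for attr in ranking:
    print(attr, round(expected_entropy(splits[attr]), 3))
```

State ranks first, followed by Barometer and then Season, so State is the attribute chosen for the root.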

Decision Tree Example • State = AK: Split on Season • Winter = Snow • Summer = Sun • State = HI: Leaf node = Sun • State = CA: Split on Barometer • Up = Rain • Down = Sun

Decision Tree Example • [Figure: final decision tree. The root tests State. CA → test Barometer (Down → Sun, Up → Rain). AK → test Season (Summer → Sun, Winter → Snow). HI → Sun.]
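Classifying a record with the finished weather tree (State at the root, Season under AK, Barometer under CA, and a Sun leaf for HI) amounts to following one edge per level until a leaf is reached. The nested-tuple encoding below is one possible representation chosen for this sketch, not something prescribed by the slides.

```python
# Internal node: (attribute, {edge value: child}); leaf: a class label string.
weather_tree = (
    "State",
    {
        "AK": ("Season", {"Winter": "Snow", "Summer": "Sun"}),
        "HI": "Sun",
        "CA": ("Barometer", {"Up": "Rain", "Down": "Sun"}),
    },
)


def classify(tree, record):
    """Walk from the root to a leaf, following the edge that matches the record."""
    node = tree
    while not isinstance(node, str):
        attribute, children = node
        node = children[record[attribute]]
    return node


print(classify(weather_tree, {"State": "AK", "Season": "Winter", "Barometer": "Up"}))
print(classify(weather_tree, {"State": "CA", "Season": "Summer", "Barometer": "Down"}))
```

The two calls print Snow and Sun. A record whose attribute value has no matching edge raises `KeyError` in this sketch; a practical classifier would need a fallback such as the most common class at that node.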

Overfitting and Pruning • Performance graph at right exhibits typical phenomenon • Accuracy on training data increases as the decision tree grows • Accuracy on test data initially increases, then decreases • Why does this happen? • Highly predictive attributes near root of decision tree capture general patterns • Less predictive attributes added later are mostly capturing statistical noise • Goal: Stop building the decision tree before overfitting kicks in • Pruning → eliminate lower portions of the decision tree • Replace sub-tree with a leaf node • [Figure: accuracy vs. decision tree size, with one curve for training set accuracy and one for test set accuracy; the optimal tree size is marked on the size axis.]

Pruning via Cross-Validation • Cross-validation • Separate training set into two parts • Most of the training set is used to build tree • Small holdout set is used to validate accuracy • Post-pruning approach • Build decision tree with training data (less holdout set) • Traverse tree in bottom-up fashion • For each sub-tree: • Consider pruning sub-tree, replacing with leaf node • If pruned tree is more accurate on holdout set, then use it • Otherwise, stick with original sub-tree • Idea behind pruning • Portion of tree that models general patterns works well on holdout set • Portion of tree that fits random noise works poorly on holdout set
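The post-pruning pass can be sketched as a bottom-up recursion. The version below rests on several assumptions that go beyond the slide: each internal node stores the majority class of its training records (used as the replacement leaf), a sub-tree is judged only on the holdout records that reach it (which gives the same decision as re-scoring the whole tree), and the tree and holdout data are made up for illustration.

```python
def classify(node, record):
    """Follow matching edges to a leaf; unseen values fall back to the majority class."""
    while not isinstance(node, str):
        node = node["children"].get(record[node["attr"]], node["majority"])
    return node


def accuracy(tree, records, target):
    return sum(classify(tree, r) == r[target] for r in records) / len(records)


def prune(node, holdout, target):
    """Return a pruned copy of node, judged on the holdout records that reach it."""
    if isinstance(node, str):
        return node
    # Bottom-up: prune each child first, on its own share of the holdout set.
    children = {}
    for value, child in node["children"].items():
        reaching = [r for r in holdout if r[node["attr"]] == value]
        children[value] = prune(child, reaching, target)
    subtree = {**node, "children": children}
    leaf = node["majority"]
    subtree_correct = sum(classify(subtree, r) == r[target] for r in holdout)
    leaf_correct = sum(leaf == r[target] for r in holdout)
    # Replace the sub-tree with a leaf only if the leaf is strictly more accurate.
    return leaf if leaf_correct > subtree_correct else subtree


# Hypothetical tree whose "zodiac" test has fit noise in the training data.
tree = {
    "attr": "employed", "majority": "reject",
    "children": {
        "yes": {"attr": "credit", "majority": "approve",
                "children": {"high": "approve", "low": "reject"}},
        "no": {"attr": "zodiac", "majority": "reject",
               "children": {"aries": "approve", "leo": "reject"}},
    },
}
holdout = [
    {"employed": "no", "credit": "low", "zodiac": "aries", "label": "reject"},
    {"employed": "no", "credit": "high", "zodiac": "leo", "label": "reject"},
    {"employed": "yes", "credit": "high", "zodiac": "leo", "label": "approve"},
    {"employed": "yes", "credit": "low", "zodiac": "aries", "label": "reject"},
]
pruned = prune(tree, holdout, "label")
print(accuracy(tree, holdout, "label"), accuracy(pruned, holdout, "label"))
```

In this example the noisy `zodiac` sub-tree is replaced by a "reject" leaf, while the `credit` sub-tree survives because it classifies its holdout records correctly.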

Sufficient Statistics • What information is needed to determine what attribute to split on? • Need to compute expected entropy of each attribute • To compute expected entropy after splitting on attribute A: • How many records are there with each value of A? • Among the records with each A value, how many belong to each class? • These counts are called sufficient statistics • Computing sufficient statistics via SQL • Use a simple group-by SQL query (one per attribute): SELECT A, Class, COUNT(*) FROM Table GROUP BY A, Class • For non-root nodes, need a WHERE clause for earlier splits: SELECT A, Class, COUNT(*) FROM Table WHERE B=x AND C=y GROUP BY A, Class • Full data cube contains all sufficient statistics for entire decision tree
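The group-by queries on this slide can be tried out against an in-memory SQLite database using Python's standard `sqlite3` module. The table name `customers`, its columns, and the sample rows are all made up for this sketch; `class_label` stands in for the slide's `Class` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (employed TEXT, credit TEXT, class_label TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("yes", "high", "approve"),
        ("yes", "high", "approve"),
        ("yes", "low", "reject"),
        ("no", "high", "reject"),
        ("no", "high", "reject"),
        ("no", "low", "reject"),
    ],
)

ATTRIBUTES = {"employed", "credit"}


def sufficient_statistics(conn, attribute, earlier_splits=None):
    """Record counts per (attribute value, class), restricted by earlier splits."""
    earlier_splits = earlier_splits or {}
    # Column names cannot be bound as parameters, so validate them explicitly.
    if attribute not in ATTRIBUTES or not set(earlier_splits) <= ATTRIBUTES:
        raise ValueError("unknown column name")
    sql = f"SELECT {attribute}, class_label, COUNT(*) FROM customers"
    if earlier_splits:
        # One equality condition per split made on the path from the root.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in earlier_splits)
    sql += f" GROUP BY {attribute}, class_label ORDER BY {attribute}, class_label"
    return conn.execute(sql, list(earlier_splits.values())).fetchall()


# Root node: statistics for a candidate split on "employed".
print(sufficient_statistics(conn, "employed"))
# Non-root node: candidate split on "credit" below the employed = 'yes' edge.
print(sufficient_statistics(conn, "credit", {"employed": "yes"}))
```

Each returned row is an (attribute value, class, count) triple, which is exactly the input needed to compute the expected entropy of that split.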

Decision Trees and Data Warehouses • Generally building a decision tree involves dimension-focused queries • As opposed to typical fact-focused queries • Records for which predictions are made are dimension rows (e.g. Customers, Accounts) • Sometimes queries just involve the dimension table • Other times dimension attributes may be supplemented by virtual behavioral attributes • Two approaches for gathering sufficient statistics • Compute entire data cube (including subtotals) in one query • Issue a series of small group-by queries
