Efficient Determination of Dynamic Split Points in a Decision Tree
Max Chickering, Chris Meek, Robert Rounthwaite
Outline • Probabilistic decision trees • Simple algorithm for choosing split points • Empirical evaluation
Probabilistic Decision Trees • Input variables: X1, …, Xn • Target variable: Y (variable = feature = attribute) • A probabilistic decision tree is a mapping from X1, …, Xn to p(Y | X1, …, Xn)
Example • X1 binary, X2 continuous, Y binary • [Tree diagram: the root splits on X1 (0/1); the X1 = 1 child splits on X2 at 0; leaf distributions 0.5/0.5, 0.2/0.8, 0.4/0.6] • p(Y=0 | X1 = 1, X2 = -15.6) = 0.2
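A minimal sketch (not the authors' code) of the example tree as a function; the assignment of the 0.5/0.5 leaf to the X1 = 0 branch is inferred from the listed probabilities.

```python
def p_y_given_x(x1, x2):
    """Return (p(Y=0), p(Y=1)) for the example tree on this slide."""
    if x1 == 0:
        return (0.5, 0.5)          # leaf reached when X1 = 0 (inferred)
    if x2 < 0:
        return (0.2, 0.8)          # X1 = 1 and X2 < 0
    return (0.4, 0.6)              # X1 = 1 and X2 >= 0

print(p_y_given_x(1, -15.6)[0])    # 0.2, matching p(Y=0 | X1=1, X2=-15.6)
```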
Applications Using Probabilities • Which advertisements to show: p(click ad | pages visited) • Recommending TV shows: p(watch show i | other viewing preferences) • Predicting demographics from Web usage: p(Age > 34 | web sites visited)
Learning Trees from Data • [Example data table with columns Record, X, Y, Z] • [Tree for p(X | Y, Z): split on Z (red vs. green, blue) with leaf distributions 0.2/0.8 and 0.4/0.6] • Automatically learn a tree for p(X | Y, Z)
Greedy Algorithm • [Diagram: from the current tree, candidate splits on X1, X2, …, Xn are each scored (Score1(Data), Score2(Data), …, Scoren(Data)) and the best-scoring split is applied]
Learning Trees From Data • DL: data relevant to leaf L • [Diagram: a tree with leaves labeled 0/1; only DL reaches leaf L] • For most scoring functions, the increase from splitting a leaf can be scored locally from DL
Candidate Splits: Discrete Predictors • Complete split: one child per state (Z → 0 | 1 | 2) • Hold-one-out: one state vs. the rest (0 | 1,2; 1 | 0,2; 2 | 0,1) • Arbitrary subsets of states
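A hedged sketch of how these three families of discrete candidate splits could be enumerated; the function and variable names are illustrative, not from the paper.

```python
from itertools import combinations

def candidate_splits(values):
    """Enumerate candidate splits on a discrete predictor.

    values: the distinct states of the predictor, e.g. [0, 1, 2] for Z.
    Returns a dict mapping split type to a list of partitions, where each
    partition is a tuple of frozensets of states sent to each child."""
    values = list(values)
    splits = {}

    # Complete split: one child per state.
    splits["complete"] = [tuple(frozenset([v]) for v in values)]

    # Hold-one-out: one state vs. all the others, e.g. {0} vs. {1, 2}.
    splits["hold_one_out"] = [
        (frozenset([v]), frozenset(values) - {v}) for v in values
    ]

    # Arbitrary subsets: every binary partition of the states.
    subsets = []
    for r in range(1, len(values)):
        for left in combinations(values, r):
            left, right = frozenset(left), frozenset(values) - set(left)
            if (right, left) not in subsets:   # skip mirrored duplicates
                subsets.append((left, right))
    splits["arbitrary_subsets"] = subsets
    return splits

print(candidate_splits([0, 1, 2]))
```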
Continuous Predictors • Static discretization: (+) simple, reduces the problem to all-discrete; (-) not context-dependent • Dynamic discretization: binary splits (Z < c vs. Z ≥ c), choose a single split point c • Which candidate split points do we consider during search?
Continuous Predictors • [Diagram: data for predictor Z sorted on a number line from -19.3 to 49.1, with candidate split points c1, c2, c3] • Most scoring functions (e.g. Bayesian, entropy) cannot distinguish candidate c1 from candidate c2 when both fall between the same pair of adjacent data values
Continuous Predictors • Try all midpoints (or all values) as candidate splits • [Diagram: one candidate split between each pair of adjacent data points] • This is the approach taken by many decision-tree learning systems
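A small illustrative helper (names are ours, not the paper's) for the all-midpoints scheme.

```python
def midpoint_candidates(z_values):
    """Candidate split points: midpoints between consecutive distinct
    sorted values of a continuous predictor (the all-midpoints scheme)."""
    zs = sorted(set(z_values))
    return [(a + b) / 2.0 for a, b in zip(zs, zs[1:])]

print(midpoint_candidates([-19.3, 4.0, 4.0, 7.5, 49.1]))
# -> [-7.65, 5.75, 28.3]
```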
Continuous Predictors • Optimization for a discrete target (Fayyad and Irani, 1993), applicable to entropy-based and Bayesian scores: only split points at boundaries where the target label changes between adjacent sorted values need to be considered • [Diagram: sorted data points labeled with target values 0/1; candidate splits only where the label changes]
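A sketch of the boundary-point reduction under the simplifying assumption that only midpoints between adjacent values with different target labels are kept; ties and the formal conditions of the original result are not handled here.

```python
def boundary_candidates(pairs):
    """Keep only midpoints between adjacent sorted values whose target
    labels differ (the boundary-point idea for a discrete target).

    pairs: iterable of (z_value, target_label)."""
    pairs = sorted(pairs)
    cands = []
    for (z1, y1), (z2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and z1 != z2:
            cands.append((z1 + z2) / 2.0)
    return cands

print(boundary_candidates([(1.0, 0), (2.0, 1), (3.0, 1), (4.0, 0)]))
# -> [1.5, 3.5]
```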
Complexity Analysis • nL = records in the data relevant to leaf L; n = total data records; m = continuous predictor variables • Sorting the data: Approach 1 takes O(m nL log nL) time per leaf; Approach 2 takes O(m n log n) space per tree • Scoring the candidate splits (the majority of learning time): O(m nL) scores per leaf • Expensive for large datasets!
Can we do something more scalable? • Can we choose candidate split points without sorting? • How many candidate split points do we need to learn good models?
A Simple Solution: Quantile Method • Fit a distribution to the predictor points • Find quantiles: for 2 split points (c1, c2), divide the distribution into 3 equal-probability regions • [Diagram: data points with c1 and c2 splitting the fitted distribution into thirds]
A Simple Solution: Quantile Method For every leaf node in the tree: • Use the same type of distribution • Use the same number of split points k
Gaussian Distribution • Need the mean and SD for every continuous predictor • Constant-time calculation for each quantile point
Uniform Distribution • Need the min and max for every continuous predictor • Constant-time calculation for each quantile point
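A minimal sketch of the constant-time quantile calculations on this and the previous slide, using Python's standard-library inverse normal CDF; the function names are illustrative.

```python
from statistics import NormalDist

def gaussian_split_points(mean, sd, k):
    """k split points dividing a Gaussian(mean, sd) into k + 1
    equal-probability regions (constant time per point)."""
    g = NormalDist(mean, sd)
    return [g.inv_cdf(i / (k + 1)) for i in range(1, k + 1)]

def uniform_split_points(lo, hi, k):
    """k split points dividing Uniform(lo, hi) into k + 1 equal parts."""
    return [lo + (hi - lo) * i / (k + 1) for i in range(1, k + 1)]

print(gaussian_split_points(0.0, 1.0, 3))   # roughly [-0.674, 0.0, 0.674]
print(uniform_split_points(-19.3, 49.1, 2))
```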
Empirical Distribution: K-Tiles • K-tile approach: k = 1 yields the median • O(k n) algorithm for a small number of split points k; sorting is better for large k
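A hedged sketch of the O(k n) idea: find each k-tile by an expected-linear-time selection rather than a full sort. This is an illustration, not the paper's implementation.

```python
import random

def quickselect(data, i):
    """Return the i-th smallest element (0-based) in expected O(n) time."""
    data = list(data)
    lo, hi = 0, len(data) - 1
    while True:
        if lo == hi:
            return data[lo]
        pivot = data[random.randint(lo, hi)]
        # three-way partition of the active range around the pivot
        less = [x for x in data[lo:hi + 1] if x < pivot]
        equal = [x for x in data[lo:hi + 1] if x == pivot]
        greater = [x for x in data[lo:hi + 1] if x > pivot]
        data[lo:hi + 1] = less + equal + greater
        if i < lo + len(less):
            hi = lo + len(less) - 1
        elif i < lo + len(less) + len(equal):
            return pivot
        else:
            lo = lo + len(less) + len(equal)

def ktile_split_points(values, k):
    """k empirical split points: the elements at ranks 1/(k+1), 2/(k+1), ...
    of the data, each found by selection.  Expected O(k n) overall;
    k = 1 gives the median."""
    n = len(values)
    return [quickselect(values, (j * n) // (k + 1)) for j in range(1, k + 1)]

data = [49.1, -19.3, 4.0, 7.5, 2.2, 0.3, 11.0, -3.8, 5.5]
print(ktile_split_points(data, 2))
```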
Experiments • Varied k for all three distributions • Datasets: three small datasets from the UCI (Irvine) repository and two real-world datasets • Bayesian score • Models: learned a decision tree for every variable; the potential predictors were all other variables
Evaluation • Split the data into Dtrain and Dtest • Learn a tree for target xi using Dtrain • Evaluate on Dtest via the log score: the sum over test records of log p(xi | other variables)
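A small sketch of the log-score evaluation; `tree.predict_distribution` is a hypothetical accessor standing in for whatever the learned tree exposes, not the paper's API.

```python
import math

def log_score(tree, test_records, target):
    """Sum of log p(target value | other variables) over held-out records,
    using the learned tree's leaf distributions (larger is better)."""
    total = 0.0
    for record in test_records:
        dist = tree.predict_distribution(record)   # hypothetical accessor
        total += math.log(dist[record[target]])
    return total
```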
Method Scores • Gaus(xi, k), Uniform(xi, k), KTile(xi, k): log score of the tree learned for xi using k split points with the given method • All(xi): log score of the tree learned for xi using all split points
Evaluation • Relative improvement of Gaussian method • Relative improvement of Uniform method • Relative improvement of KTile method
Census Results • 37 variables • ~300,000 records • 31 trees containing a continuous split using at least one method • Report average of the 31 relative improvements
Census Results: Average Relative Improvement • [Chart: average relative improvement vs. number of candidate splits]
Census Results: Learning Time • [Chart: learning time vs. number of candidate splits] • Gaussian and KTile times are about the same: the savings is in the scoring
Census Results: k = 15 • [Chart: relative improvement by tree index] • With 15 points, within 1.5% of the all-points method
Media Metrix Results • “Nielsen on the web”: demographics / internet-use • 37 variables • ~5,000 records • 16 trees containing a continuous split using at least one method • Report simple average of the 16 relative improvements
Media Metrix Results: Average Relative Improvement • [Chart: average relative improvement vs. number of candidate splits]
Media Metrix Results: Learning Time • [Chart: learning time vs. number of candidate splits]
Media Metrix Results: k = 15 • [Chart: relative improvement by tree index]
Summary • The Gaussian and K-tile approaches yield good results with very few points (k = 15) • The Uniform method's failure is probably due to outliers • The Gaussian approach is easy to implement • Fewer split points lead to faster learning
Future Work • Consider k as a function of the number of records at the leaf • More experiments
For More Info… • My home page: http://research.microsoft.com/~dmax • Relevant papers • WinMine Toolkit
Evaluation • Log score for a tree: the sum of log p(xi | other variables) over the test records • s*j: log predictive score of the “all splits” method for tree j • sm(k, j): log predictive score of method m using k split points • Relative increase for method m: Incm(k, j), the increase of sm(k, j) over s*j • Took a simple average of the increases over those trees containing a continuous split for at least one method
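One plausible reading of the relative increase Incm(k, j), written as code; the exact normalization is not shown on the slide, so treat this as an assumption.

```python
def relative_increase(s_method, s_all):
    """Assumed form of Inc_m(k, j): the gain of method m's log predictive
    score s_m(k, j) over the all-splits score s*_j, normalized by |s*_j|."""
    return (s_method - s_all) / abs(s_all)
```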
Results • Gaussian and Empirical Distributions work well • Only about 10 split points are needed to be as accurate as using all points
Example: Gaussian, k = 3 • Estimate a Gaussian for each continuous predictor: accumulate sums and sums of squares for each predictor • Calculate the 3 points that divide the Gaussian into 4 equal-probability regions (25% each)
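A one-pass sketch of this example with assumed helper names: accumulate the sums, estimate the Gaussian, and read off the three quartile points.

```python
from math import sqrt
from statistics import NormalDist

def gaussian_quartile_points(values):
    """One pass over the data: accumulate n, the sum, and the sum of
    squares, estimate the Gaussian, and return the 3 points that divide
    it into 4 equal-probability (25%) regions.  No sorting is needed."""
    n, s, ss = 0, 0.0, 0.0
    for x in values:
        n += 1
        s += x
        ss += x * x
    mean = s / n
    sd = sqrt(ss / n - mean * mean)   # population standard deviation
    g = NormalDist(mean, sd)
    return [g.inv_cdf(q) for q in (0.25, 0.50, 0.75)]

print(gaussian_quartile_points([49.1, -19.3, 4.0, 7.5, 2.2, 0.3, 11.0, -3.8, 5.5]))
```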
Distributions • Gaussian: need the mean and SD for every continuous predictor; constant-time lookup for quantile points • Uniform: need the min and max for every continuous predictor; constant-time lookup for quantile points • Empirical (K-tile): O(k n) algorithm; sorting is better for large k
Learning Trees: A Bayesian Approach • [Diagram: two candidate tree structures, T1 splitting on X1 alone and T2 splitting on X1 and then on X2 at 0] • Choose the most probable tree structure given the data (choose the most likely parameters after the structure is determined)
Local Scoring • Typical scoring functions decompose • [Diagram: a tree that splits on A and then B; a candidate split on X at the leaf reached by A=0, B=0] • Increase in score = f(Data with A=0, B=0) • The score of a split on a leaf depends only on that leaf's relevant records
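A sketch of local scoring under these assumptions: `split` partitions the leaf's records into children and `score` is any decomposable scoring function; both names are illustrative placeholders.

```python
def split_score_increase(leaf_records, split, score):
    """Local-scoring sketch: the gain from splitting a leaf is computed
    only from the records that reach that leaf, never from the rest of
    the data."""
    children = split(leaf_records)            # e.g. {0: [...], 1: [...]}
    return sum(score(child) for child in children.values()) - score(leaf_records)
```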
Greedy Algorithm • Start with empty tree (single leaf node) • Repeatedly replace a leaf node by the split that increases the score the most • Stop when no replacement increases the score
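A recursive sketch of this greedy loop with illustrative names; `candidate_splits`, `score_increase`, and the `partition` method on split objects are assumptions, not the paper's API, and for simplicity each leaf is expanded with its own best split.

```python
def learn_tree(records, candidate_splits, score_increase):
    """Greedy sketch: start from a single leaf and repeatedly replace a
    leaf with the candidate split that raises the score the most; stop
    when no candidate split gives a positive increase."""
    splits = list(candidate_splits(records))
    if not splits:
        return {"leaf": records}
    best = max(splits, key=lambda s: score_increase(records, s))
    if score_increase(records, best) <= 0:
        return {"leaf": records}               # no split improves the score
    return {
        "split": best,
        "children": [learn_tree(part, candidate_splits, score_increase)
                     for part in best.partition(records)],
    }
```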
Example: p(Y | X, Z) • Current state T0: a single split on X (0/1) • Consider all candidate splits on Z in the left child, e.g. Z = 1 vs. Z ∈ {0, 2} and Z = 0 vs. Z ∈ {1, 2} • If one of the splits improves the score, apply the best one