
Efficient Determination of Dynamic Split Points in a Decision Tree


Presentation Transcript


  1. Efficient Determination of Dynamic Split Points in a Decision Tree. Max Chickering, Chris Meek, Robert Rounthwaite

  2. Outline • Probabilistic decision trees • Simple algorithm for choosing split points • Empirical evaluation

  3. Probabilistic Decision Trees • Input variables: X1, …, Xn • Target variable: Y (variable = feature = attribute = …) • A probabilistic decision tree is a mapping from X1, …, Xn to p(Y | X1, …, Xn)

  4. Example [tree diagram: X1 binary, X2 continuous, Y binary; the root splits on X1 (0 / 1), with a further split on X2 (< 0 / ≥ 0); the leaves hold distributions (0.5, 0.5), (0.2, 0.8), and (0.4, 0.6) over Y] Example lookup: p(Y=0 | X1 = 1, X2 = -15.6) = 0.2
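
As a minimal sketch (not from the slides), the example tree can be written as a small lookup function. Only the (0.2, 0.8) leaf is pinned down by the slide's lookup; the placement of the other two leaf distributions is illustrative.

```python
def p_y_given_x(x1, x2):
    """Sketch of the slide-4 tree: returns (p(Y=0), p(Y=1))."""
    if x1 == 0:
        return (0.4, 0.6)      # illustrative leaf assignment
    if x2 < 0:
        return (0.2, 0.8)      # matches p(Y=0 | X1=1, X2=-15.6) = 0.2
    return (0.5, 0.5)          # illustrative leaf assignment

print(p_y_given_x(1, -15.6))   # (0.2, 0.8)
```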

  5. Applications Using Probabilities • Which advertisements to show: p(click ad | pages visited) • Recommending TV shows: p(watch show i | other viewing preferences) • Predicting demographics from Web usage: p(Age > 34 | web sites visited)

  6. Learning Trees from Data [example dataset with columns Record, X, Y, Z; learned tree splits on Z (red vs. green, blue) with leaf distributions (0.2, 0.8) and (0.4, 0.6)] Goal: automatically learn a tree for p(X | Y, Z)

  7. Greedy Algorithm [search diagram: each candidate split on X1, …, Xn at the current tree is evaluated with its score Score1(Data), …, Scoren(Data), and the best-scoring split is applied]

  8. Learning Trees From Data [tree diagram highlighting a leaf L] DL: the data relevant to leaf L. For most scoring functions, the increase in score from a split can be computed locally from DL.

  9. Candidate Splits: Discrete Predictors (Z with values 0, 1, 2) • Complete split: Z → 0 | 1 | 2 • Hold-one-out: Z → 0 | 1,2; Z → 1 | 0,2; Z → 2 | 0,1 • Arbitrary subsets

  10. Continuous Predictors • Static discretization (+ simple: reduces the problem to all-discrete; - not context-dependent) • Dynamic discretization: binary splits (Z < c vs. Z ≥ c), so a single split point c must be chosen. Which candidate split points do we consider during search?

  11. Continuous Predictors [figure: the data for predictor Z sorted from -19.3 to 49.1, with candidate split points c1, c2, c3 placed between data points] Most scoring functions (e.g. Bayesian, entropy) cannot distinguish candidate c1 from candidate c2 when no data point falls between them.

  12. Continuous Predictors [figure: one candidate split at every midpoint between consecutive sorted data values] Try all midpoints (or all values) as candidate splits. This is the approach taken by many decision-tree learning systems.
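
A sketch of this baseline (names are illustrative): generate one candidate at each midpoint between consecutive distinct sorted values.

```python
import numpy as np

def midpoint_candidates(z):
    """One candidate split point at each midpoint between consecutive
    distinct values of predictor z (requires sorting)."""
    v = np.unique(z)                  # sorted distinct values
    return (v[:-1] + v[1:]) / 2.0

print(midpoint_candidates(np.array([3.0, 1.0, 2.0, 2.0])))   # [1.5 2.5]
```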

  13. Continuous Predictors: Optimization for a Discrete Target (Fayyad and Irani, 1993) [figure: sorted data values labeled with target values 0 and 1; only the boundary points where the target changes are candidates] For maximum-entropy and Bayesian scores, only these boundary points need to be considered.
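
A sketch of the boundary-point idea, assuming the simple case with no tied predictor values: keep only the midpoints where the target label changes along the sort order.

```python
import numpy as np

def boundary_candidates(z, y):
    """Candidate split points only at boundaries where the target changes
    along the predictor's sort order (in the spirit of Fayyad & Irani, 1993)."""
    order = np.argsort(z, kind="stable")
    zs, ys = z[order], y[order]
    change = ys[:-1] != ys[1:]                        # label changes here
    return (zs[:-1][change] + zs[1:][change]) / 2.0

z = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 1, 0, 0])
print(boundary_candidates(z, y))                      # [2.5 4.5]
```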

  14. Complexity Analysis nL = number of records relevant to leaf L; n = total number of records; m = number of continuous predictor variables • Sorting the data: Approach 1: O(m nL log nL) time / leaf; Approach 2: O(m n log n) space / tree • Scoring the candidate splits (the majority of learning time): O(m nL) scores / leaf. Expensive for large datasets!

  15. Can we do something more scalable? • Can we choose candidate split points without sorting? • How many candidate split points do we need to learn good models?

  16. A Simple Solution: Quantile Method • Fit a distribution to the predictor points • Find quantiles [figure: split points c1, c2 dividing the fitted distribution into three regions of probability 1/3 each] For 2 split points, divide the distribution into 3 equal-probability regions.

  17. A Simple Solution: Quantile Method For every leaf node in the tree: • Use the same type of distribution • Use the same number of split points k

  18. Gaussian Distribution • Need the mean and SD of every continuous predictor • Constant-time calculation for each quantile point

  19. Uniform Distribution • Need the min and max of every continuous predictor • Constant-time calculation for each quantile point

  20. Empirical Distribution: K-Tiles • k = 1 gives the median • O(k n) selection algorithm for a small number of split points k; sorting is better for large k
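
A sketch of the three candidate generators from slides 18-20, assuming the k split points sit at the 1/(k+1), …, k/(k+1) quantiles of the fitted distribution so that they bound k+1 equal-probability regions; the numpy/scipy calls stand in for the constant-time quantile calculations described on the slides.

```python
import numpy as np
from scipy.stats import norm

def gaussian_points(z, k):
    """k split points at equal-probability quantiles of a fitted Gaussian."""
    q = np.arange(1, k + 1) / (k + 1)
    return norm.ppf(q, loc=z.mean(), scale=z.std())

def uniform_points(z, k):
    """k split points at equal-probability quantiles of a fitted Uniform(min, max)."""
    q = np.arange(1, k + 1) / (k + 1)
    return z.min() + q * (z.max() - z.min())

def ktile_points(z, k):
    """k empirical quantiles (K-tiles); k = 1 gives the median.  np.quantile
    sorts internally; slide 20's O(k n) selection avoids the full sort."""
    q = np.arange(1, k + 1) / (k + 1)
    return np.quantile(z, q)

z = np.random.default_rng(0).normal(size=1000)
print(gaussian_points(z, 3), ktile_points(z, 3))
```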

  21. Experiments • Varied k for all three distributions • Datasets: three small datasets from Irvine and two real-world datasets • Bayesian score • Models: learned a decision tree for every variable; the potential predictors were all other variables

  22. Evaluation • Split the data into Dtrain and Dtest • Learn a tree for target xi using Dtrain • Evaluate on Dtest via the log score: the sum over test cases of log p(xi | other variables)
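
A sketch of this evaluation, assuming (per slide 36) that the log score is the sum of log p(xi | other variables) over the test cases; `tree_predict` is a hypothetical callable returning that probability for one row.

```python
import numpy as np

def log_score(tree_predict, test_rows):
    """Sum over Dtest of log p(x_i | other variables) under the learned tree.
    `tree_predict(row)` is assumed to return the tree's probability of the
    row's observed target value given the row's predictor values."""
    return float(sum(np.log(tree_predict(row)) for row in test_rows))
```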

  23. Method Scores • Gaus(xi, k), Uniform(xi, k), KTile(xi, k): log score of the tree learned for xi using k split points with the given method • All(xi): log score of the tree learned for xi using all split points

  24. Evaluation • Relative improvement of Gaussian method • Relative improvement of Uniform method • Relative improvement of KTile method

  25. Census Results • 37 variables • ~300,000 records • 31 trees containing a continuous split using at least one method • Report average of the 31 relative improvements

  26. Census Results: Average Relative Improvement [chart: average relative improvement vs. number of candidate splits]

  27. Census Results: Learning Time [chart: learning time vs. number of candidate splits] Gaussian and KTile are about the same: the savings is in the scoring

  28. Census Results: k = 15 [chart: relative improvement vs. tree index] With 15 points, the results are within 1.5% of the all-points method

  29. Media Metrix Results • “Nielsen on the web”: demographics / internet-use • 37 variables • ~5,000 records • 16 trees containing a continuous split using at least one method • Report simple average of the 16 relative improvements

  30. Media Metrix Results: Average Relative Improvement [chart: average relative improvement vs. number of candidate splits]

  31. Media Metrix Results: Learning Time [chart: learning time vs. number of candidate splits]

  32. Media Metrix Results: k = 15 [chart: relative improvement vs. tree index]

  33. Summary • Gaussian and K-tile approaches yield good results with very few points (k = 15) • Uniform failure probably due to outliers • Gaussian approach easy to implement • Fewer split points lead to faster learning

  34. Future Work • Consider k as a function of the number of records at the leaf • More experiments

  35. For More Info… • My home page: http://research.microsoft.com/~dmax • Relevant papers • WinMine Toolkit

  36. Evaluation • Log score for a tree: the sum over the test cases of log p(xi | other variables) • S*j: log predictive score of the "all splits" method on tree j • sm(k, j): log predictive score of method m using k split points • IncM(k, j): relative increase for method m • Took a simple average of the increases over those trees containing a continuous split for at least one method
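
The slide names the quantities but not the formula for IncM(k, j); a natural reading (an assumption, not taken from the slides) is the change in log score relative to the all-splits baseline.

```python
def relative_increase(s_m_kj, s_star_j):
    """Assumed form of Inc_M(k, j): change in log score relative to the
    all-splits baseline S*_j (log scores are negative, hence the abs)."""
    return (s_m_kj - s_star_j) / abs(s_star_j)

def average_relative_increase(method_scores, baseline_scores):
    """Simple average over the trees j that contain a continuous split."""
    vals = [relative_increase(s, b) for s, b in zip(method_scores, baseline_scores)]
    return sum(vals) / len(vals)
```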

  37. Results • Gaussian and Empirical Distributions work well • Only about 10 split points are needed to be as accurate as using all points

  38. Example: Gaussian, k = 3 [figure: the fitted Gaussian divided into four 25% regions by three split points] Estimate a Gaussian for each continuous predictor: accumulate the sum and sum of squares for each predictor, then calculate the points that divide the Gaussian into 4 equal-probability regions.
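
A sketch of this single-pass version: only a running sum and sum of squares are needed per predictor, after which the three quartile points come from the Gaussian inverse CDF (scipy's norm.ppf stands in here for that calculation).

```python
import numpy as np
from scipy.stats import norm

def gaussian_quartile_points(z):
    """Accumulate sum and sum of squares in one pass, fit a Gaussian, and
    return the 3 points that cut it into 4 equal-probability regions."""
    n, s, ss = 0, 0.0, 0.0
    for v in z:                       # single pass; two accumulators suffice
        n += 1
        s += v
        ss += v * v
    mean = s / n
    sd = np.sqrt(max(ss / n - mean * mean, 0.0))
    return norm.ppf([0.25, 0.50, 0.75], loc=mean, scale=sd)

print(gaussian_quartile_points([1.0, 2.0, 3.0, 4.0]))
```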

  39. Distributions • Gaussian: need the mean and SD of every continuous predictor; constant-time lookup for quantile points • Uniform: need the min and max of every continuous predictor; constant-time lookup for quantile points • Empirical: O(k n) selection algorithm; sorting is better for large k

  40. Learning Trees: A Bayesian Approach [figure: two candidate tree structures, T1 splitting only on X1 and T2 splitting on X1 and then on X2 (< 0 / >= 0)] Choose the most probable tree structure given the data (choose the most likely parameters after the structure is determined).

  41. Local Scoring Typical scoring functions decompose. [figure: a tree whose path A = 0, B = 0 leads to a leaf that is split on X] Increase in score = f(Data with A = 0, B = 0). The score of a split on a leaf depends only on that leaf's relevant records.

  42. Greedy Algorithm • Start with empty tree (single leaf node) • Repeatedly replace a leaf node by the split that increases the score the most • Stop when no replacement increases the score

  43. Example: p(Y | X, Z) [figure: current state T0 splits on X (0 / 1); candidate refinements split the left child on Z, e.g. Z → 1 | 0,2 and Z → 0 | 1,2] Consider all splits on Z in the left child. If one of the splits improves the score, apply the best one.
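
A minimal sketch of the greedy growth (a recursive variant of slides 42-43), assuming binary splits on continuous predictors at pre-computed candidate points and using entropy reduction as a stand-in for the paper's Bayesian score; all names here are illustrative.

```python
import numpy as np

def entropy(y):
    """Empirical entropy of a discrete target (stand-in for the Bayesian score)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def best_split(X, y, rows, candidates):
    """Best (predictor, split point, gain) over the candidate points at a leaf."""
    best = (None, None, 0.0)
    for j, points in candidates.items():
        for c in points:
            left = rows[X[rows, j] < c]
            right = rows[X[rows, j] >= c]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = entropy(y[rows]) - (len(left) * entropy(y[left])
                                       + len(right) * entropy(y[right])) / len(rows)
            if gain > best[2]:
                best = (j, c, gain)
    return best

def grow_tree(X, y, rows, candidates, min_gain=1e-3):
    """Start with a single leaf; keep replacing leaves by score-improving splits."""
    j, c, gain = best_split(X, y, rows, candidates)
    if j is None or gain < min_gain:
        vals, counts = np.unique(y[rows], return_counts=True)
        return dict(zip(vals.tolist(), (counts / counts.sum()).tolist()))  # leaf: p(Y | path)
    return {"split": (j, c),
            "left":  grow_tree(X, y, rows[X[rows, j] < c], candidates, min_gain),
            "right": grow_tree(X, y, rows[X[rows, j] >= c], candidates, min_gain)}

# Usage sketch: the candidate points per predictor could come from the quantile
# methods above (e.g. gaussian_points); here empirical quartiles are used.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] < 0.3).astype(int)
cands = {j: np.quantile(X[:, j], [0.25, 0.5, 0.75]) for j in range(2)}
print(grow_tree(X, y, np.arange(200), cands))
```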

  44. Census Results: Average Relative Improvement [chart]
