Wei Fan, IBM T.J.Watson Joe McCloskey, US Department of Defense Philip Yu, IBM T.J.Watson

A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees

Three Data Mining Problems
• Classification:
  • Label: one of a given set of labels in the training data.
• Probability Estimation:
  • Similar to the above setting: estimate the probability that x is an example of class y.
  • Difference: no ground truth is given, i.e., the training data contains no true probabilities.
• Regression:
  • Target value: continuous.

Model Approximation
• True model (or correct model):
  • Generates y for each x with probability P(y|x).
  • Normally never known in reality.
• Perfect model: never makes mistakes, or always makes the same prediction as the true model.
  • Not always possible due to:
    • Stochastic nature of the problem
    • Noise in the training data
    • Insufficient data

Optimal Model
• A loss function L(t,y) evaluates performance, where t is the true value and y is the prediction.
• The optimal decision y* is the label that minimizes the expected loss when x is sampled repeatedly.
• Examples:
  • 0-1 loss: y* is the label that appears most often, e.g., if P(fraud|x) > 0.5, predict fraud.
  • Cost-sensitive loss: y* is the label that minimizes the “empirical risk”, e.g., if P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
  • MSE (mean squared error): predict the average.
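The decision rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the example posterior, and the cost model (a flat $90 overhead for investigating, a $1000 loss for a missed fraud) are assumptions chosen to match the numbers on the slide.

```python
def optimal_decision(posterior, loss, candidates):
    """Return the candidate y minimizing the expected loss sum_t P(t|x) * L(t, y)."""
    def expected_loss(y):
        return sum(p * loss(t, y) for t, p in posterior.items())
    return min(candidates, key=expected_loss)


def zero_one_loss(t, y):
    """0-1 loss: 0 for a correct prediction, 1 otherwise."""
    return 0.0 if t == y else 1.0


def fraud_cost(t, y, amount=1000.0, overhead=90.0):
    """Hypothetical cost model: investigating always costs the overhead;
    a fraud that is not investigated loses the transaction amount."""
    if y == "fraud":
        return overhead
    return amount if t == "fraud" else 0.0


# Assumed posterior for one transaction x: P(fraud|x) = 0.10.
posterior = {"fraud": 0.10, "normal": 0.90}
labels = ["fraud", "normal"]

print(optimal_decision(posterior, zero_one_loss, labels))  # 0.10 < 0.5
print(optimal_decision(posterior, fraud_cost, labels))     # 0.10 > 0.09
```

The same posterior leads to different decisions under the two losses, which is why a loss function has to be fixed before a probability estimate can be turned into a label.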

How do we look for optimal models?
• Don’t impose “exact forms”:
  • Decision trees, classification based on association rules, production rules
  • Learners estimate the structure as well as the parameters.
  • Finding the optimal structure is NP-hard for most model representations.
• Impose “exact forms”:
  • Logistic regression functions, linear regression models, etc.
  • Learners estimate the parameters ONLY; the structure is fixed in advance.
• Inductive bias.
• A decision tree is a rather flexible, efficient, yet powerful representation.

Consider Decision Trees
• Compromise between accuracy and model complexity.
• We assume that the simplest-structured hypothesis that fits the data is the best.
• We employ all kinds of heuristics to look for it:
  • Splitting criteria: info gain, Gini index, Kearns-Mansour, etc.
  • Pruning: MDL pruning, reduced-error pruning, cost-based pruning.
• Reality: tractable, but still pretty expensive.
• Truth: none of the purity-check functions guarantees accuracy on test data.

Random Decision Tree: classification, regression, probability estimation
• Key characteristics:
  • The structure is picked randomly.
  • Statistics are summarized from the training data.
• At each node, an unused feature is chosen randomly.
  • A discrete feature is unused if it has not yet been chosen on the decision path from the root to the current node.
  • A continuous feature can be chosen multiple times on the same decision path, but each time with a different threshold value.

Continued
• We stop when one of the following happens:
  • A node becomes too small.
  • The height of the tree exceeds some limit, such as the total number of features.

Node Statistics
• Classification and probability estimation:
  • Each node of the tree keeps the number of examples belonging to each class.
• Regression:
  • Each node of the tree keeps the mean value of the examples sorted into the node.
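The construction and the node statistics can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the data layout (a list of `(x, y)` pairs with `x` a dict), the `kinds` mapping, the node dictionaries, and all function names are hypothetical. Thresholds are drawn from the observed values of the feature, so a repeated threshold on one path is possible here, which is a simplification of the rule above.

```python
import random
from collections import Counter, defaultdict


def branch_key(node, x):
    """Which child of `node` the example x falls into."""
    value = x[node["feature"]]
    if node["threshold"] is None:        # discrete feature: one branch per value
        return value
    return value < node["threshold"]     # continuous feature: binary split


def build_tree(rows, kinds, rng, regression=False, used=frozenset(),
               depth=0, min_size=2, max_depth=None):
    """rows: non-empty list of (x, y); kinds: feature -> "discrete" | "continuous"."""
    if max_depth is None:
        max_depth = len(kinds)           # height limit: total number of features
    ys = [y for _, y in rows]
    node = {"n": len(rows), "feature": None, "threshold": None, "children": {}}
    # Statistics are summarized from the training data at every node.
    if regression:
        node["mean"] = sum(ys) / len(ys)
    else:
        node["counts"] = Counter(ys)
    # Discrete features may be used once per path; continuous ones repeatedly.
    candidates = [f for f in kinds if kinds[f] == "continuous" or f not in used]
    if len(rows) < min_size or depth >= max_depth or not candidates:
        return node
    feature = rng.choice(candidates)     # structure is picked at random
    node["feature"] = feature
    if kinds[feature] == "continuous":
        node["threshold"] = rng.choice([x[feature] for x, _ in rows])
    groups = defaultdict(list)
    for x, y in rows:
        groups[branch_key(node, x)].append((x, y))
    for key, subset in groups.items():
        node["children"][key] = build_tree(subset, kinds, rng, regression,
                                           used | {feature}, depth + 1,
                                           min_size, max_depth)
    return node


def find_leaf(node, x):
    """Follow x down the tree; stop early if its branch was never seen in training."""
    while node["feature"] is not None:
        child = node["children"].get(branch_key(node, x))
        if child is None:
            return node
        node = child
    return node


# Made-up toy data, for illustration only.
kinds = {"age": "continuous", "edu": "discrete"}
rows = [({"age": 25, "edu": "BS"}, "P1"), ({"age": 41, "edu": "PhD"}, "P2"),
        ({"age": 33, "edu": "MS"}, "P1"), ({"age": 58, "edu": "PhD"}, "P2"),
        ({"age": 29, "edu": "BS"}, "P2"), ({"age": 47, "edu": "MS"}, "P1")]
tree = build_tree(rows, kinds, random.Random(0))
print(find_leaf(tree, {"age": 30, "edu": "BS"})["counts"])
```

Note that no purity function is evaluated anywhere: the data is only used to fill in counts or means, which is where the training efficiency comes from.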

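Classification/Prob Estimation
[Figure: an example random decision tree. Internal nodes test B1 < 0.5, B2 > 0.7, and B1 > 0.3, with Y/N branches; two of the leaves hold the class counts (P1: 200, P2: 10) and (P1: 30, P2: 70).]
• During classification, each tree outputs a posterior probability, e.g., P(P1|x) = 0.3 at the leaf with P1: 30, P2: 70.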

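Regression
[Figure: an example random decision tree. Internal nodes test Age > 30, Capt > 70%, and Edu = PhD, with Y/N branches; two of the leaves hold Avg AGI = 100K and Avg AGI = 150K.]
• During prediction, each tree outputs the average value of the training examples that fall within the leaf reached by x.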

Classification
• The predictions from multiple random trees are averaged to form the final output.
• Classification: a loss function is needed to turn the averaged probability into a label.
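A minimal sketch of the averaging step, assuming each tree has already routed x to a leaf. The helper names and the second tree's counts are hypothetical; the first leaf reuses the counts from the classification figure, and the two regression values reuse the leaf averages from the regression figure as if they came from two different trees.

```python
def leaf_posterior(counts):
    """Turn the class counts stored at a leaf into a posterior estimate."""
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}


def average_posteriors(posteriors):
    """Average the per-tree posteriors; a class missing from a leaf counts as 0."""
    labels = set().union(*posteriors)
    return {label: sum(p.get(label, 0.0) for p in posteriors) / len(posteriors)
            for label in labels}


def average_regression(leaf_means):
    """Regression output: the mean of the per-tree leaf averages."""
    return sum(leaf_means) / len(leaf_means)


# Leaves reached by the same x in two trees (second tree is made up).
tree_outputs = [leaf_posterior({"P1": 30, "P2": 70}),
                leaf_posterior({"P1": 10, "P2": 30})]
print(average_posteriors(tree_outputs))
print(average_regression([100_000.0, 150_000.0]))
```

For classification, the averaged posterior would then be passed to whatever loss function the application defines in order to pick a label.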

A few words about some of its advantages
• Training can be very efficient, particularly for very large datasets.
• Natural multi-class probability.
• Natural multi-label classification and probability estimation.
• Imposes very few assumptions about the structure of the model.

Number of trees
• Sampling theory:
  • A random decision tree can be thought of as a sample from a large (infinite when continuous features exist) population of trees.
  • Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance. In most cases, 10 are enough.
• Worst scenario:
  • Only one feature is relevant; all the rest are noise.
  • Probability:
  • Variance reduction:
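A rough illustration of why a few dozen trees can suffice. It rests on two assumptions that are not stated in the slides: each tree independently tests the one relevant feature on the path of x with some probability p, and the per-tree estimates are independent with a common variance. The function names and the values p = 0.1 and variance 1.0 are hypothetical; this is the textbook calculation, not necessarily the formula the slide showed.

```python
def prob_at_least_one(p, n_trees):
    """P(at least one of n independent trees tests the relevant feature)."""
    return 1.0 - (1.0 - p) ** n_trees


def variance_of_average(sigma2, n_trees):
    """Variance of the mean of n independent estimates, each with variance sigma2."""
    return sigma2 / n_trees


for n in (10, 30, 50):
    print(n, round(prob_at_least_one(0.1, n), 4), variance_of_average(1.0, n))
```

Under these assumptions the chance of missing the relevant feature shrinks geometrically with the number of trees, while the variance of the averaged estimate shrinks only linearly.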

Donation Dataset: classification and probability estimation
• Decide to whom to send a charity solicitation letter.
• It costs $0.68 to send a letter.
• Loss function

Result

Credit Card Fraud: classification and probability estimation
• Detect whether a transaction is a fraud.
• There is an overhead to detect a fraud: one of {$60, $70, $80, $90}.
• Loss function

Result

Comparing with Boosting
• Boosting does not handle multi-class problems naturally; it needs ECOC (error-correcting output codes).
• It does not output probabilities.
• It is inefficient.
• Choosing the number of boosting rounds is tricky; sometimes more rounds lead to overfitting.
• Implementation needs careful numerical manipulation.

Comparing with Bagging
• Bagging can be very inefficient, particularly for very large datasets.
  • i.e., bootstrap sampling needs a linear scan of the data.
• It does not output reliable probabilities.

Probability Estimation

Overfitting

Non-overfitting of RDT

Selectivity

Tolerance to data insufficiency

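GUIDE
[Figure: an example tree. Internal nodes test Age > 30, Capt > 70%, and Edu = PhD, with Y/N branches; each leaf shown holds an MLR (multiple linear regression) model.]
• MLR: y = a + a1*x1 + a2*x2 + … + ak*xk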

Regression: single independent variable

RDT

Depends on a combination of 5 independent variables

RDT

It grows like …

Comparing with GUIDE
• One needs to decide on the grouping variables and the independent variables, a non-trivial task.
• If all variables are categorical, GUIDE becomes a single CART regression tree.
• It relies on strong assumptions and greedy search, which can sometimes lead to very unexpected results, like the one given earlier.

Conclusion
• Imposing a particular form of model is not a good way to train highly accurate models.
• It may not even be efficient for some forms of models.
• RDT has been shown to solve all three major problems in data mining (classification, probability estimation, and regression) simply, efficiently, and accurately.

Selected Bibliography of RDT
• ICDM’03: “Is random model better? On its accuracy and efficiency” (Fan, Wang, Yu, and Ma)
• AAAI’04: “On the Optimality of Posterior Probability Estimation by Random Decision Tree” (Fan)
• ICDM’05: “Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches” (Fan, Greengrass, McCloskey, Yu, and Drummey)
• ICDM’05: “Learning through Changes: An Empirical Study of Dynamic Behaviors of Probability Estimation Trees” (Zhang, Buckles, Peng, and Xu)
• Master’s thesis by Tony Liu, supervised by Kai Ming Ting: “The Utility of Randomness in Decision Tree Construction”, Monash University, 2005
• KDD’06: “A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees”
