Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches
Wei Fan, Ed Greengrass, Joe McCloskey, Philip S. Yu, Kevin Drummey
Example Simple Decision Tree Method
• Construction: at each node, a feature is chosen randomly.
• Discrete feature:
  • chosen only if it has never been chosen previously on the decision path from the root to that node.
  • every example on the same path has the same discrete feature value.
• Continuous feature:
  • can be chosen multiple times on the same decision path.
  • each time, a random threshold value is chosen.
Construction (continued)
• Stop splitting when:
  • the number of examples at the node is too small, or
  • the total height of the tree exceeds some limit.
• Each node of the tree keeps the number of training examples belonging to each class (for example, 10 positive and 5 negative).
• Construct at least 10 trees; usually no more than 30 are needed.
A minimal construction sketch follows below.
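The construction described above is short enough to write down. A minimal sketch, assuming plain Python lists for examples and a set `discrete` of discrete-feature indices; names such as `build_random_tree` are illustrative, not the authors' implementation:

```python
import random
from collections import Counter

class Node:
    """A tree node: either a leaf (only class_counts) or a split."""
    def __init__(self):
        self.feature = None            # index of the splitting feature
        self.threshold = None          # threshold, for continuous splits only
        self.children = {}             # branch key -> child Node
        self.class_counts = Counter()  # class label -> number of training examples

def build_random_tree(X, y, discrete, used=frozenset(),
                      depth=0, max_depth=10, min_examples=4):
    """Grow one random tree: the split feature (and any threshold) is chosen
    at random, with no gain function and no feature selection."""
    node = Node()
    node.class_counts = Counter(y)
    if len(X) <= min_examples or depth >= max_depth:
        return node                                   # stop: leaf node

    # Discrete features may appear at most once per path; continuous may repeat.
    candidates = [f for f in range(len(X[0]))
                  if f not in discrete or f not in used]
    if not candidates:
        return node
    f = random.choice(candidates)
    node.feature = f

    if f in discrete:
        groups = {}                                   # feature value -> (xs, ys)
        for xi, yi in zip(X, y):
            groups.setdefault(xi[f], ([], []))
            groups[xi[f]][0].append(xi)
            groups[xi[f]][1].append(yi)
        for v, (xs, ys) in groups.items():
            node.children[v] = build_random_tree(xs, ys, discrete, used | {f},
                                                 depth + 1, max_depth, min_examples)
    else:
        vals = [xi[f] for xi in X]
        node.threshold = random.uniform(min(vals), max(vals))  # random threshold
        for key, keep in (("<=", lambda v: v <= node.threshold),
                          (">", lambda v: v > node.threshold)):
            xs = [xi for xi in X if keep(xi[f])]
            ys = [yi for xi, yi in zip(X, y) if keep(xi[f])]
            if xs:
                node.children[key] = build_random_tree(xs, ys, discrete, used,
                                                       depth + 1, max_depth,
                                                       min_examples)
    return node
```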
Classification
• Each tree outputs an estimated posterior probability:
  • a node with 10 + and 5 - outputs P(+|x,t) = 10/15 ≈ 0.67.
• Multiple trees average their probability estimates to form the final output.
• Use the estimated probability and the given loss function to choose the label that minimizes expected loss:
  • 0-1 loss (traditional accuracy): choose the most probable label.
  • cost-sensitive loss: choose the label that minimizes risk.
A probability-averaging sketch follows below.
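Continuing the construction sketch above (and reusing its `Node` class), a minimal sketch of how the per-tree estimates are averaged and how the loss-minimizing label is picked; the `loss[pred][true]` dictionary layout is an assumption made here for illustration:

```python
def tree_posterior(node, x):
    """Route x down one tree and return the class distribution stored at the
    deepest reachable node (leaf counts normalized to probabilities)."""
    while node.children:
        if node.threshold is None:                       # discrete split
            child = node.children.get(x[node.feature])
        else:                                            # continuous split
            child = node.children.get("<=" if x[node.feature] <= node.threshold else ">")
        if child is None:
            break                                        # unseen branch value
        node = child
    total = sum(node.class_counts.values())
    return {c: n / total for c, n in node.class_counts.items()}

def ensemble_posterior(trees, x):
    """Average the per-tree probability estimates (the ensemble's output)."""
    avg = {}
    for t in trees:
        for c, p in tree_posterior(t, x).items():
            avg[c] = avg.get(c, 0.0) + p / len(trees)
    return avg

def predict(trees, x, loss):
    """Choose the label minimizing expected loss; loss[pred][true] is the cost
    of predicting `pred` when the true label is `true` (0-1 loss uses 0 / 1)."""
    probs = ensemble_posterior(trees, x)
    return min(loss, key=lambda pred: sum(cost * probs.get(true, 0.0)
                                          for true, cost in loss[pred].items()))
```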
Differences from Traditional Decision Trees
• No gain function:
  • information gain
  • Gini index
  • Kearns-Mansour criterion
  • others
• No feature selection: the feature with the highest "gain" is not sought.
• Multiple trees are built.
• Relies on probability estimates.
How Well Does It Work?
• Credit card fraud detection:
  • Each transaction has a transaction amount.
  • There is a $90 overhead to challenge a fraud.
  • Predict fraud iff P(fraud|x) × $1000 > $90 (for a $1000 transaction).
  • P(fraud|x) × $1000 is the expected loss.
  • When the expected loss exceeds the overhead, challenge the transaction.
  • A small decision-rule sketch follows below.
• Three models are compared:
  • traditional unpruned decision tree
  • traditional pruned decision tree
  • RDT
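The fraud decision rule amounts to one comparison; a minimal sketch, with the helper name `challenge` assumed here:

```python
def challenge(p_fraud, amount, overhead=90.0):
    """Cost-sensitive rule from the fraud example: challenge a transaction
    when its expected loss exceeds the overhead of a challenge."""
    return p_fraud * amount > overhead

# For a $1000 transaction, any fraud probability above 90/1000 = 0.09 triggers a challenge.
print(challenge(0.12, 1000.0))   # True
print(challenge(0.05, 1000.0))   # False
```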
Sources of Randomization
• Feature selection randomization:
  • RDT: completely random.
  • Random Forest: consider a random subset at each node.
  • etc.
• Feature subset randomization:
  • a fixed random subset of features.
• Data randomization:
  • bootstrap sampling (Bagging and Random Forest)
  • data partitioning
• Feature combination.
Methods Included
• RDT:
  • choose each feature randomly.
  • choose thresholds for continuous features randomly.
• RF and RF+ (variations of Random Forest):
  • choose k features randomly at each node.
  • split on the one among the k with the highest information gain.
  • Variation I: use the original dataset rather than a bootstrap sample.
  • Variation II: output probabilities instead of voting.
A feature-subset split sketch follows below.
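A minimal sketch of the "best of k random features" split choice used by RF/RF+, assuming numeric features and a simplified median threshold (a full Random Forest implementation also searches over candidate thresholds):

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_of_k_random_split(X, y, k=3):
    """Draw k candidate features at random and keep the one whose (binary,
    median-threshold) split has the highest information gain."""
    candidates = random.sample(range(len(X[0])), min(k, len(X[0])))
    best = None
    for f in candidates:
        t = sorted(xi[f] for xi in X)[len(X) // 2]          # simplified threshold
        left = [yi for xi, yi in zip(X, y) if xi[f] <= t]
        right = [yi for xi, yi in zip(X, y) if xi[f] > t]
        if not left or not right:
            continue
        gain = entropy(y) - (len(left) * entropy(left) +
                             len(right) * entropy(right)) / len(y)
        if best is None or gain > best[0]:
            best = (gain, f, t)
    return best    # (information gain, feature index, threshold) or None
```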
More Methods
• Bagged Probabilistic Trees:
  • bootstrap samples of the data
  • compute probability estimates
  • traditional tree learner
• Disjoint Subset Trees:
  • shuffle the data
  • split into equal-sized disjoint subsets
  • traditional tree learner
A sketch of the two data-randomization schemes follows below.
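A minimal sketch of the two data-randomization schemes, using scikit-learn's `DecisionTreeClassifier` as a stand-in for the "traditional tree"; function names are illustrative, and the probability averaging assumes every subset contains examples of every class:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # stand-in "traditional tree"

def bagged_trees(X, y, n_trees=10, seed=0):
    """Bagged probabilistic trees: one traditional tree per bootstrap sample."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def disjoint_subset_trees(X, y, n_trees=10, seed=0):
    """Disjoint subset trees: shuffle, split into equal parts, one tree per part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    return [DecisionTreeClassifier().fit(X[part], y[part])
            for part in np.array_split(order, n_trees)]

def averaged_proba(trees, X):
    """Both ensembles output the average of the per-tree probability estimates
    (assumes every tree saw all classes, so the columns line up)."""
    return np.mean([t.predict_proba(X) for t in trees], axis=0)
```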
Some Concepts
• True posterior probability P(y|x):
  • the probability that an example belongs to class y, conditioned on its feature vector x.
  • generated by some unknown function F.
• Given a loss function, the optimal decision y* is the class label that minimizes the expected loss.
  • 0-1 loss: the most probable label.
    • Binary problem with classes + and -: if P(+|x) = 0.7 and P(-|x) = 0.3, predict +.
  • Cost-sensitive loss: choose the class label that minimizes expected risk.
    • e.g., predict fraud when P(fraud|x) × $1000 > $90.
• The optimal label y* may not always be the true label.
  • For example, under 0-1 loss with P(+|x) = 0.6, the true label may still be - with probability 0.4.
An expected-loss sketch follows below.
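A worked version of the expected-loss computation with the slide's numbers; a minimal sketch, where `loss_of_pred[y]` gives the loss of the chosen prediction when the true label is y:

```python
def expected_loss(probs, loss_of_pred):
    """Expected loss of one prediction: sum over true labels y of
    P(y|x) * loss(prediction, y)."""
    return sum(p * loss_of_pred[y] for y, p in probs.items())

probs = {"+": 0.7, "-": 0.3}
# 0-1 loss: predicting "+" risks 0.3, predicting "-" risks 0.7, so "+" is optimal.
print(expected_loss(probs, {"+": 0, "-": 1}))   # 0.3  (predict "+")
print(expected_loss(probs, {"+": 1, "-": 0}))   # 0.7  (predict "-")
```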
Estimated Probability
• We use a model M to "approximate" the true function F.
  • We almost never know F.
• The probability estimated by a model M is written P(y|x,M).
• The dependency on M is non-trivial:
  • a decision tree uses its tree structure and the parameters within that structure to approximate P(y|x).
  • a mixture model uses basis functions such as naïve Bayes or Gaussians.
• What is the relation between P(y|x,M) and P(y|x)?
Important Observation
• If P(y|x,M) = P(y|x), the expected loss is minimal for any loss function.
• Interesting cases:
  • Does P(y|x,M) = P(y|x) with 0-1 loss imply 100% accuracy?
    • Yes, but only if the problem is deterministic, i.e., P(y|x) = 1 for the true label and 0 for all others.
    • Otherwise, you can only choose the most likely label, which can still be wrong for some examples.
  • Can M beat the accuracy of the true P(y|x), even if P(y|x,M) ≠ P(y|x)?
    • Yes, for some specific example or specific test set.
    • But not in general, i.e., not in expected loss.
Reality
• Class labels are given; however,
• P(y|x) is not given in any dataset unless the dataset is synthesized.
• Next question: how do we set the "true" P(y|x) for a realistic dataset?
Choosing P(y|x): Naïve Approach
• Assume that P(y|x) is 1 for the true class label of x and 0 for all other class labels.
  • For example, in a two-class problem with classes + and -:
  • if x's label is +, assume P(+|x) = 1 and P(-|x) = 0.
• Only true if the problem is:
  • deterministic, and
  • noise free.
• This is a rather strong assumption and may cause problems:
  • x has true class label +.
  • M1: P(+|x,M1) = 1, P(-|x,M1) = 0.
  • M2: P(+|x,M2) = 0.8, P(-|x,M2) = 0.2.
  • Both M1 and M2 are correct (both predict +), yet the naïve assumption penalizes M2.
Utility-Based Choice of P(y|x)
• Definition: v is the probability threshold at which model M correctly predicts the optimal label y* of x:
  • if P(y*|x,M) > v, predict y*.
  • assume y* is also the true class label of the example.
• Example: binary class, 0-1 loss:
  • v = 0.5, i.e., if P(y|x,M) > 0.5, predict y.
• Example: credit card fraud, cost-sensitive loss:
  • predict fraud when P(y|x,M) × $1000 > $90,
  • so v = 90/1000 = 0.09.
• In summary, we take [v, 1] as the range of the true probability P(y|x).
  • This is weaker than assuming P(y|x) = 1 for the true class label.
A short derivation of the fraud threshold is given below.
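The fraud threshold follows directly from the decision rule, using the slide's $1000 transaction amount and $90 overhead:

```latex
P(\text{fraud}\mid x,M)\cdot \$1000 > \$90
\;\Longleftrightarrow\;
P(\text{fraud}\mid x,M) > \frac{\$90}{\$1000} = 0.09
\;\Longrightarrow\; v = 0.09 .
```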
Example
• Two-class problem:
  • Naïve assumption: P(y|x) = 1 for the correct label and 0 for all others.
  • We instead assume P(y|x) ∈ (0.5, 1].
    • This range includes the naïve assumption P(y|x) = 1.
• We re-define some measurements to fix the "penalty" problem.
Desiderata
• If P(y|x,M) ∈ [v, 1], the exact value does not matter, since we already predict the true label.
• When P(y|x,M) < v(x,M), the difference is important:
  • it measures how far off we are from making the right decision.
• The measure should take the loss function into account, since the goal is to minimize its expected value.
Evaluating P(y|x,M)
• Improved MSE:
  • squared error defined with the truncation [[a]] = min(a, 1), so that estimates at or above the threshold v incur no error.
• Cross-entropy:
  • undefined when either P(y|x,M) = 0 or the true probability P(y|x) = 0.
  • has no relation to the loss function.
• Reliability plots, previously proposed and used, e.g., Zadrozny and Elkan '02 (explained later).
A hedged sketch of the truncated squared error follows below.
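The slides give only the truncation [[a]] = min(a, 1), not the full squared-error formula. The sketch below is one plausible reading, assuming the per-example error is the squared shortfall of P(y*|x,M)/v below 1 (zero once the estimate reaches the threshold v, consistent with the desiderata above); it is not necessarily the paper's exact definition:

```python
def truncated_squared_error(p_opt, v):
    """Plausible truncated squared error: zero once p_opt >= v, otherwise the
    squared shortfall of p_opt / v below 1. Assumed form, for illustration only."""
    a = min(p_opt / v, 1.0)     # [[a]] = min(a, 1)
    return (1.0 - a) ** 2

# Fraud example: v = 0.09. An estimate of 0.12 is "good enough"; 0.03 is penalized.
print(truncated_squared_error(0.12, 0.09))   # 0.0
print(truncated_squared_error(0.03, 0.09))   # about 0.44
```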
Synthetic Dataset
• The true probability P(y|x) is known and can be used to measure the exact MSE.
• Standard bias and variance decomposition of the MSE; the decomposition is recalled below.
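For reference, the standard decomposition referred to here, with the expectation taken over models M trained on different samples:

```latex
\mathbb{E}_M\big[(P(y\mid x,M)-P(y\mid x))^2\big]
= \underbrace{\big(\mathbb{E}_M[P(y\mid x,M)]-P(y\mid x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_M\big[(P(y\mid x,M)-\mathbb{E}_M[P(y\mid x,M)])^2\big]}_{\text{variance}}
```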
Binary Dataset
• Donation dataset:
  • send a letter to solicit a donation.
  • it costs 68¢ to send a letter.
• Cost-sensitive loss: mail when P(donate|x) × amt(x) > 68¢.
• MLR was used to estimate amt(x); better results could be obtained with Heckman's two-step procedure (Zadrozny and Elkan '02).
Reliability Plot
• Divide the score or output probability into bins:
  • either equal-width bins, such as 10 or 100 of them,
  • or bins with an equal number of examples.
• For the examples in the same bin:
  • average their predicted probabilities; call it bin_x.
  • divide the number of examples with label y by the total number of examples in the bin; call it bin_y.
• Plot (bin_x, bin_y).
A binning sketch follows below.
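A minimal sketch of the equal-width variant of this binning, assuming `pred_prob` holds the predicted probabilities for class y and `is_label_y` holds 0/1 indicators of whether each example's label is y:

```python
import numpy as np

def reliability_points(pred_prob, is_label_y, n_bins=10):
    """Per bin: (mean predicted probability, observed fraction of label-y examples)."""
    pred_prob = np.asarray(pred_prob, dtype=float)
    is_label_y = np.asarray(is_label_y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(pred_prob, edges[1:-1])   # bin index 0 .. n_bins-1
    points = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            points.append((pred_prob[mask].mean(), is_label_y[mask].mean()))
    return points  # a well-calibrated model puts these points near the diagonal
```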
Multi-Class Dataset
• Artificial Character dataset from UCI.
• Class labels: 10 letters.
• Three loss functions:
  • Top 1: the true label is the most probable letter.
  • Top 2: the true label is among the two most probable letters.
  • Top 3: the true label is among the three most probable letters.
A Top-k evaluation sketch follows below.
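A minimal sketch of the Top-k criterion, assuming class probabilities are given as a dictionary:

```python
def top_k_correct(probs, true_label, k):
    """Top-k criterion: the true label is among the k most probable classes."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return true_label in ranked[:k]

probs = {"a": 0.05, "b": 0.40, "c": 0.35, "d": 0.20}
print(top_k_correct(probs, "c", 1))   # False: "b" is most probable
print(top_k_correct(probs, "c", 2))   # True:  "c" is in the top two
```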
What We Learned
• On probability approximation:
  • assuming P(y|x) = 1 is a very strong assumption and causes problems.
  • suggested a relaxed choice of P(y|x).
  • improved the definition of MSE so that it takes the loss function into account.
• On methodology:
  • proposed a variation of Random Forest.
Summary of Experiments
• Various experiments:
  • synthetic data with known true probability P(y|x),
  • binary and multi-class problems.
• Reliability plots and MSE show that randomized approaches approximate P(y|x) significantly more closely.
• Bias and variance decomposition of the probability estimates (rather than of the loss):
  • the reduction comes mainly from variance,
  • but bias is reduced as well.
What Next
• We traditionally think that probability estimation is a harder problem than predicting class labels:
  • simplified approach: naïve Bayes, with its independence assumption.
  • finite mixture models: still based on assumptions about the basis functions.
  • logistic regression: sensitive to the layout of the examples, and the treatment of categorical features is subjective.
  • Bayesian networks: need knowledge about causal relations; finding the optimal network is NP-hard.
What Next (continued)
• We show that rather simple randomized approaches approximate probability very well.
• Next step: is it time to design simpler and better algorithms that approximate probability even more accurately?