Discovering Unrevealed Properties of Probability Estimation Trees: on Algorithm Selection and Performance Explanation
Kun Zhang, Wei Fan, Bill Buckles, Xiaojing Yuan, and Zujia Xu
Dec. 21, 2006
What this Paper Offers
• Preference of a probability estimation tree (PET)
• Many important and previously unrevealed properties of PETs
• A practical guide for choosing the most appropriate PET algorithm
Why Probability Estimation? - Theoretical Necessity
• Statistical Supervised Learning
  - Unknown distribution P(x,y), y = F(x)
  - LS = [definition lost in extraction]
  - A loss/cost function L
  - A learning algorithm produces the model f(x,θ)
• Bayesian Decision Rule
  - 0-1 loss: [formula lost in extraction]
  - Cost-sensitive loss: [formula lost in extraction]
• Challenge: since P(x,y) is unknown, so is P(y|x)!
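The decision-rule formulas on this slide did not survive extraction. For orientation only, the textbook forms of the Bayes decision rule are given below; the notation (f* for the optimal decision, L(y', y) for the cost of predicting y' when the truth is y) is an assumption and may differ from what the slide showed.

```latex
% 0-1 loss: predict the most probable label
f^{*}(x) = \arg\max_{y} \; P(y \mid x)

% Cost-sensitive loss: predict the label with the smallest expected cost
f^{*}(x) = \arg\min_{y'} \; \sum_{y} P(y \mid x) \, L(y', y)
```

Both rules need P(y|x), which is why the slide treats probability estimation as a theoretical necessity.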
Why Probability Estimation? - Practical Necessity
• Direct estimation of probability
• Decision threshold determination
• Non-static, skewed distribution
• Unequal loss (Yang, ICDM05)
• Example domains: ozone level prediction, medical domain, direct marketing
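A minimal sketch of decision threshold determination under unequal loss, assuming a binary problem in which correct decisions cost nothing. The function names and the two cost parameters are illustrative and not from the slides. Predicting "positive" is cheaper in expectation when (1 - p) * cost_fp <= p * cost_fn, which gives the threshold below.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'positive' has lower expected cost.

    Assumes correct decisions cost 0; cost_fp and cost_fn are the costs of a
    false positive and a false negative (hypothetical parameters).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide(p_positive, cost_fp, cost_fn):
    """Return True (predict positive) when p_positive reaches the threshold."""
    return p_positive >= cost_threshold(cost_fp, cost_fn)


# With equal costs the threshold is 0.5; when a miss is 9x as costly as a
# false alarm, an estimated probability of 0.2 is already enough to act on.
equal_cost_decision = decide(0.2, cost_fp=1.0, cost_fn=1.0)
unequal_cost_decision = decide(0.2, cost_fp=1.0, cost_fn=9.0)
```

The same estimated probability leads to different decisions once the costs change, which is the reason a calibrated probability is more useful here than a bare class label.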
Posterior Probability Estimation
• Parametric Methods
  - The true and unknown distribution follows a “particular form”
  - Via maximum likelihood estimation
  - E.g. Naïve Bayes, logistic regression
• Non-Parametric Approaches
  - Directly calculated without making any assumption
  - E.g. decision trees, nearest neighbors
  - A rather unbiased, flexible and convenient solution
PETs - Probabilistic View of Decision Trees
• [formula lost in extraction], e.g. C4.5, CART
• Confidences in the predicted labels
• Appropriate thresholding for classification w.r.t. different loss functions
• The dependence of P(y|x,θ) on θ is non-trivial
Problems of Traditional PETs
• Probability estimates through frequency tend to be too close to the extremes of 1 and 0
• Additional inaccuracies result from the small number of examples within a leaf
• The same probability is assigned to the entire region of space defined by a given leaf
• Algorithms listed alongside these problems: C4.4 (Provost, 03); CFT (Ling, 03); BPET (Breiman, 96); RDT (Fan, 03)
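The first two problems can be seen on a single small leaf. The sketch below contrasts the raw frequency estimate with the Laplace correction, the smoothing usually associated with C4.4-style trees. The function names and the example counts are illustrative assumptions.

```python
def leaf_frequency(n_pos, n_total):
    """Raw frequency estimate of P(y=1 | leaf)."""
    return n_pos / n_total


def leaf_laplace(n_pos, n_total, n_classes=2):
    """Laplace-corrected estimate: shrinks small-leaf estimates away from 0 and 1."""
    return (n_pos + 1) / (n_total + n_classes)


# A pure leaf holding only 3 training examples, all positive:
raw = leaf_frequency(3, 3)      # extreme estimate
smoothed = leaf_laplace(3, 3)   # pulled toward 0.5 because the leaf is small
```

Smoothing addresses the extreme-value problem but not the third one: every example that falls into the leaf still receives the same probability.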
Popular PET Algorithms
• Which one to choose?
• What performance should be expected?
• Why should one PET be preferred over another?
Contributions
• A large-scale learning curve study using multiple evaluation metrics
• Preference of a PET: signal-noise separability of datasets
• Many important and previously unrevealed properties of PETs:
  - In ensembles, RDT is preferable on low-signal separability datasets, while BPET is favorable when the signal separability is high.
• A practical guide for choosing the most appropriate PET algorithm
Analytical Tool #1: AUC - Index of Signal-Noise Separability
• Signal-noise separability
  - Correct identification of the information of interest in the presence of noise factors that may interfere with this identification
  - A good analogy for the two different populations present in every learning domain with uncertainty
• A synthetic scenario: tumor diagnosis
  - Tumor: signal present
  - No tumor: signal absent
  - Based on a yes/no decision:
    * P(yes|tumor): hit (TP)
    * P(yes|no tumor): false alarm (FP)
    * P(no|tumor): miss (FN)
    * P(no|no tumor): correct reject (TN)
Analytical Tool #1: AUC - Index of Signal-Noise Separability
• An illustration
[Figure: densities f(x|signal), f(x|noise1) and f(x|noise2) with a decision criterion separating the hit, miss, false alarm and correct reject regions; ROC curves (TPR vs. FPR) for LowSepDist and HighSepDist]
• AUC: an index for the separability of signal from noise
• The relative areas of the four outcomes vary as the decision criterion moves; the separation of the two distributions does not!
• Domains: high/low degree of signal separability
  - High: deterministic / little noise
  - Low: stochastic / noisy
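The illustration can be reproduced numerically. The sketch below assumes unit-variance Gaussian noise and signal distributions whose means differ by a chosen separation; these distributions, the sample size, and all function names are assumptions rather than the slide's actual settings. It computes the four outcome rates at a decision criterion, and AUC as the probability that a random signal draw outranks a random noise draw.

```python
import random


def outcome_rates(signal, noise, criterion):
    """Say 'yes' when an observation exceeds the criterion; return the four rates."""
    hit = sum(s > criterion for s in signal) / len(signal)
    false_alarm = sum(n > criterion for n in noise) / len(noise)
    return {"hit": hit, "miss": 1 - hit,
            "false_alarm": false_alarm, "correct_reject": 1 - false_alarm}


def auc(signal, noise):
    """P(random signal draw > random noise draw), ties counted as 1/2."""
    wins = 0.0
    for s in signal:
        for n in noise:
            if s > n:
                wins += 1.0
            elif s == n:
                wins += 0.5
    return wins / (len(signal) * len(noise))


def sample(separation, n=500, seed=0):
    """Draw signal ~ N(separation, 1) and noise ~ N(0, 1) (assumed distributions)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    signal = [rng.gauss(separation, 1.0) for _ in range(n)]
    return signal, noise


low_signal, low_noise = sample(0.5)    # heavily overlapping distributions
high_signal, high_noise = sample(3.0)  # well-separated distributions
low_auc = auc(low_signal, low_noise)
high_auc = auc(high_signal, high_noise)

# Moving the criterion changes the four rates but leaves AUC untouched.
lenient = outcome_rates(high_signal, high_noise, criterion=0.0)
strict = outcome_rates(high_signal, high_noise, criterion=3.0)
```

Because AUC is computed from the ranking alone, it depends only on how far apart the two distributions are, which is what makes it usable as a separability index.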
Analytical Tool #2: Learning Curves
• Used instead of CV or training-test splitting based on a fixed data set size
• Generalization performance of different models as a function of the size of the training set
• Correlation between performance metrics and training set sizes can be observed and possibly generalized over different data sets
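A learning curve can be sketched as a loop over training-set sizes. The version below is a generic helper under stated assumptions: `fit` and `score` are caller-supplied callables, and the toy learner (a smoothed base-rate estimator) and the synthetic labels are stand-ins, not the PET algorithms or datasets studied in the paper.

```python
import random


def learning_curve(train, test, sizes, fit, score, repeats=10, seed=0):
    """Average test score for each training-set size over random subsamples."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        total = 0.0
        for _ in range(repeats):
            subset = rng.sample(train, n)   # random training subset of size n
            total += score(fit(subset), test)
        curve.append((n, total / repeats))
    return curve


def fit_base_rate(sample):
    """Toy 'model': Laplace-smoothed estimate of P(y=1), ignoring features."""
    positives = sum(y for _, y in sample)
    return (positives + 1) / (len(sample) + 2)


def brier(model, test):
    """Mean squared error between the constant estimate and the 0/1 labels."""
    return sum((model - y) ** 2 for _, y in test) / len(test)


# Synthetic data: no real features, labels drawn with P(y=1) = 0.3.
data_rng = random.Random(1)
data = [(None, 1 if data_rng.random() < 0.3 else 0) for _ in range(600)]
train, test = data[:400], data[400:]
curve = learning_curve(train, test, [5, 50, 400], fit_base_rate, brier)
```

Swapping in a real PET learner for `fit` and AUC, MSE, or error rate for `score` gives the kind of per-metric curves the study compares across datasets.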
Analytical Tool #3: Multiple Evaluation Metrics
• Area Under ROC Curve (AUC)
  - Summarizes the “ranking capability” of a learning algorithm in ROC space
• MSE (Brier Score)
  - A proper assessment of the “accuracy” of probability estimation
  - Calibration-refinement decomposition
    * Calibration measures the absolute precision of probability estimation
    * Refinement indicates how confident the estimator is in its estimates
    * Visualization tools: reliability plots and sharpness graphs
• Error Rate
  - Inappropriate criterion for evaluating probability estimates
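The calibration-refinement decomposition can be checked numerically for binary labels. The sketch below groups test points by their predicted probability, which is natural for trees since each leaf emits one value. The function names and toy data are illustrative assumptions. For each group with prediction p, size n_k and observed positive rate o, the group contributes n_k (p - o)^2 to calibration and n_k o (1 - o) to refinement.

```python
from collections import defaultdict


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def calibration_refinement(probs, labels):
    """Split the Brier score into calibration and refinement terms.

    Points are grouped by their exact predicted probability.
    """
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)
    n = len(labels)
    calibration = 0.0
    refinement = 0.0
    for p, ys in groups.items():
        observed = sum(ys) / len(ys)              # empirical positive rate in group
        calibration += len(ys) * (p - observed) ** 2
        refinement += len(ys) * observed * (1 - observed)
    return calibration / n, refinement / n


# Toy predictions with three distinct probability values.
probs = [0.9, 0.9, 0.9, 0.2, 0.2, 0.6]
labels = [1, 1, 0, 0, 0, 1]
mse = brier_score(probs, labels)
cal, ref = calibration_refinement(probs, labels)
```

With this grouping the two terms add up to the Brier score exactly, so a model can be diagnosed as poorly calibrated, poorly refined, or both.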
Conjectures in Summary
• RDT and CFT are better on AUC
• RDT is preferable on low-signal separability datasets, while BPET is favorable on high-signal separability datasets
• High separability categorical datasets with limited feature values hurt RDT
• Among single trees, CFT is preferable on low-signal separability datasets
Behind the Scenes - Why RDT and CFT better on AUC?
• Superior capability on unique probability generation
• AUC calculations:
  - Trapezoidal integration (Fawcett, 03)
  - [formula lost in extraction] (Hand, 01)
• For larger AUC, P(y|x,θ) should vary from one test point to another
• The number of unique probabilities is maximized as a result
• RDT > BPET > CFT > C4.4 > C4.5
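The link between unique probabilities and AUC can be illustrated with trapezoidal integration over the ROC curve. In the sketch below, the function name and the toy scores are assumptions chosen to make the effect visible. The same six test points are scored once with six distinct probabilities and once with only two, as if they had fallen into two leaves.

```python
def roc_auc_trapezoid(scores, labels):
    """Area under the ROC curve by trapezoidal integration.

    Points with tied scores are added together before a ROC point is emitted,
    so a tie group produces one sloped segment instead of separate steps.
    """
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


labels = [1, 1, 1, 0, 0, 0]
fine_scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.1]         # six unique probabilities
coarse_scores = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25]  # two unique probabilities
fine_auc = roc_auc_trapezoid(fine_scores, labels)
coarse_auc = roc_auc_trapezoid(coarse_scores, labels)
```

On this toy data the coarse scores tie one negative with all three positives, so those pairs earn only half credit and the area drops. Ties do not always lower AUC, but fewer unique values leave fewer opportunities to rank test points apart.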
Behind the Scenes - Why RDT (BPET) preferable on low (high) signal separability datasets?
• The reasons:
  - RDT discards any criterion for optimal feature selection; it is more like a structure for data summarization.
  - When the signal separability is low, this property protects RDT from identifying noise as signal or overfitting on noise, which is very likely to be caused by the massive searches or optimization adopted by BPET.
  - RDT provides an average of probability estimates, which approaches the mean of the true probabilistic values as more individual trees are added.
Behind the Scenes - Why RDT (BPET) preferable on low (high) signal separability datasets? • The evidence (I) – Spect and Sonar, low-signal separability domains
Behind the Scenes - Why RDT (BPET) preferable on low (high) signal separability datasets?
• The evidence (II): Pima, a low-signal separability domain
[Figure: plots for RDT and BPET]
Behind the Scenes - Why RDT (BPET) preferable on low (high) signal separability datasets?
• The evidence (III): Spam, a high-signal separability domain
[Figure: plots for RDT and BPET]
Behind the Scenes- Why high separability categorical datasets with limited feature values hurt RDT? • The observations – Tic-tac-toe and Chess
Behind the Scenes - Why high separability categorical datasets with limited feature values hurt RDT?
• The reason:
  - High separability categorical datasets with limited values tend to restrict the degree of diversity that RDT’s random feature selection can explore
• Random feature selection mechanism of RDT:
  - Categorical features: used once
  - Continuous features: used multiple times, but with a different splitting value each time
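The selection rule can be sketched as a function that picks the next split at random. This is an illustration under stated assumptions, not the RDT implementation: "once" is read as once per root-to-leaf path, thresholds for continuous features are drawn uniformly from a known range, and the function name and the `features` layout are hypothetical.

```python
import random


def choose_random_split(features, used_categorical, rng):
    """Pick a random feature (and threshold, if continuous) for the next split.

    features: dict mapping name -> ("categorical", None)
                               or ("continuous", (low, high))
    used_categorical: names of categorical features already used on this path
    Returns (name, threshold) or None when no feature is eligible.
    """
    candidates = [name for name, (kind, _) in features.items()
                  if kind == "continuous" or name not in used_categorical]
    if not candidates:
        return None   # the path cannot be grown any further
    name = rng.choice(candidates)
    kind, bounds = features[name]
    if kind == "categorical":
        return name, None
    low, high = bounds
    return name, rng.uniform(low, high)   # fresh random threshold each time


rng = random.Random(0)
only_categorical = {"a": ("categorical", None), "b": ("categorical", None)}
mixed = {"a": ("categorical", None), "x": ("continuous", (0.0, 1.0))}
exhausted = choose_random_split(only_categorical, {"a", "b"}, rng)
still_growing = choose_random_split(mixed, {"a"}, rng)
```

With only a handful of categorical features, each with few values, the candidate list empties quickly and different random trees end up with similar structures, which matches the limited diversity described above.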
Behind the Scenes - Why CFT preferable on low-signal separability datasets?
• The reasons:
  - Low-signal separability domains
    * Good performance benefits from the probability aggregation mechanism
    * Aggregation rectifies errors introduced into the probability estimates by attribute noise
  - High-signal separability domains
    * Aggregating the estimated probabilities from other, irrelevant leaves adversely affects the final probability estimates
Behind the Scenes - Why CFT preferable on low-signal separability datasets ? • The evidence (I) – Spect and Pima, low-signal separability domains
Behind the Scenes - Why CFT preferable on low-signal separability datasets?
• The evidence (II): Liver, a low-signal separability domain
[Figure: plots for CFT and C4.4]
Choosing the Appropriate PET Algorithm Given a New Problem
[Flowchart, flattened in extraction; the recoverable structure is listed below]
• Given a dataset, estimate its signal-noise separability through the AUC score of RDT or BPET
• AUC < 0.9: low signal-noise separability; AUC >= 0.9: high signal-noise separability
• Further branches: ensemble or single trees; feature types and value characteristics (categorical features with limited values vs. continuous features or categorical features with a large number of values); evaluation metric (AUC, MSE, error rate)
• Recommendations at the leaves, in extraction order: BPET; CFT; RDT; RDT (BPET); CFT; C4.5 or C4.4
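Since the flowchart's exact branch-to-leaf mapping is not recoverable, the sketch below encodes only what the "Conjectures in Summary" slide states, combined with the 0.9 AUC cut-off from this slide. The function name and parameters are hypothetical, and the single-tree, high-separability case is left undecided on purpose.

```python
def choose_pet(auc_score, ensemble):
    """Suggest a PET algorithm from the separability estimate.

    auc_score: AUC obtained by RDT or BPET on the dataset (separability index)
    ensemble:  True if an ensemble is acceptable, False for a single tree
    Returns an algorithm name, or None where the conjectures give no preference.
    """
    high_separability = auc_score >= 0.9
    if ensemble:
        # RDT on low-separability data, BPET on high-separability data.
        return "BPET" if high_separability else "RDT"
    if not high_separability:
        # Among single trees, CFT is preferable on low-separability data.
        return "CFT"
    return None   # not settled by the conjectures; consult the original flowchart


suggestion_noisy = choose_pet(0.75, ensemble=True)
suggestion_clean = choose_pet(0.95, ensemble=True)
```

The original guide also branches on feature type and on the target metric; those refinements are omitted here rather than guessed.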
Summary
• AUC: index of signal-noise separability
• Preference of a PET on multiple evaluation metrics depends on:
  - the “signal-noise separability” of the dataset
  - other observable statistics
• Many important and unrevealed properties of PETs are analyzed
• A practical guide for choosing the most appropriate PET algorithm
Thank you! Questions?