Reliable Probability Forecasting – a Machine Learning Perspective David Lindsay Supervisors: Zhiyuan Luo, Alex Gammerman, Volodya Vovk
Overview • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
Probability Forecasting • Qualified predictions are important in many applications (especially medicine). • Most machine learning algorithms make “bare” predictions. • Those that do make qualified predictions make no claims about how effective those measures are!
Probability Forecasting: Generalisation of Pattern Recognition • Goal of pattern recognition = find the “best” label for each new test object. • Example Abdominal Pain Dataset: [Figure: a training set of patient details (Sian, Daniil, Mark, Wilma) with diagnoses (Appendicitis, Dyspepsia, Dyspepsia, Non-specific) to “learn” from, and a test patient (David) whose true label is unknown or withheld from the learner]
Probability Forecasting: Generalisation of Pattern Recognition • Probability forecast – estimate the conditional probability of a label given an observed object. • We want the learner to estimate probabilities for all possible class labels: [Figure: given a training set and a test object, the learner outputs a probability for each possible label, e.g. 0.7, 0.1, 0.2, etc.]
Probability forecasting more formally… • X object space, Y label space, Z = X × Y example space • Our learner makes probability forecasts P̂(y | x) for all possible labels y ∈ Y • Use the probability forecasts to predict the most likely label: ŷ = arg max_{y ∈ Y} P̂(y | x)
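The prediction rule on this slide can be sketched in a few lines of Python (an illustrative sketch, not code from the thesis; the labels and probabilities are hypothetical):

```python
# A probability forecast assigns a probability to every label in Y for a
# test object; the predicted label is the one with the highest forecast.

def predict_label(forecast):
    """forecast: dict mapping each label to its forecast probability.
    Returns the most likely label (the arg max)."""
    return max(forecast, key=forecast.get)

# Hypothetical forecast for an abdominal-pain style example:
forecast = {"Appendicitis": 0.7, "Dyspepsia": 0.2, "Non-specific": 0.1}
print(predict_label(forecast))  # -> Appendicitis
```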
Back to the plan… • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
Studies of Probability Forecasting • Probability forecasting has been a well-studied area since the 1970s: • Psychology • Statistics • Meteorology • These studies assessed two criteria of probability forecasts: • Reliability = the probability forecasts should not lie • Resolution = the probability forecasts are practically useful
Reliability • When an event is predicted with probability p, it should have approximately a 1 – p chance of being incorrect • a.k.a. well calibrated • Considered an asymptotic property. • Dawid (1985) proved no deterministic learner can be reliable for all data – still interesting to investigate • This property is often overlooked in practical studies!
Resolution • Probability forecasts are practically useful, e.g. they can be used to rank the labels in order of likelihood! • Closely related to classification accuracy - common focus of machine learning. • Separate from reliability, i.e. do not go “hand in hand” (Lindsay, 2004)
Back to the plan… • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
Experimental design • Tested several learners on many datasets in the online setting: • ZeroR = Control • K-Nearest Neighbour • Neural Network • C4.5 Decision Tree • Naïve Bayes • Venn Probability Machine Meta Learner (see later…)
The Online Learning Setting • Before: the learning machine makes a prediction for the new example (label withheld) • After: the true label is revealed and the training data is updated for the learning machine’s next trial • Repeat the process for all examples [Figure: a training sequence before and after one trial, with the new example’s label revealed and appended]
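The online protocol can be sketched as follows (an illustrative Python sketch, not the WEKA implementation; `zero_r` is a toy majority-vote learner used only for the demonstration):

```python
# Online learning protocol: predict for the new example with the label
# withheld, then reveal the label and add the example to the training data.

def online_protocol(examples, fit_predict):
    """examples: list of (object, label); fit_predict(training, x) -> label.
    Returns the total number of prediction errors over all trials."""
    training, errors = [], 0
    for x, y in examples:
        guess = fit_predict(training, x)   # label withheld at prediction time
        errors += (guess != y)
        training.append((x, y))            # reveal label, update training set
    return errors

def zero_r(training, x):
    """Toy majority-vote learner: ignores x, predicts the commonest label."""
    if not training:
        return None
    labels = [y for _, y in training]
    return max(set(labels), key=labels.count)

errs = online_protocol([(0, "a"), (1, "a"), (2, "b"), (3, "a")], zero_r)
print(errs)  # 2 mistakes across 4 online trials
```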
Lots of benchmark data • Tested on data available from the UCI Machine Learning repository: • Abdominal Pain: 6387 examples, 135 features, 9 classes, Noisy • Diabetes: 768 examples, 8 features, 2 classes • Heart-Statlog: 270 examples, 13 features, 2 classes • Wisconsin Breast Cancer: 685 examples, 10 features, 2 classes • American Votes: 435 examples, 16 features, 2 classes • Lymphography: 148 examples, 18 features, 4 classes • Credit Card Applications: 690 examples, 15 features, 2 classes • Iris Flower: 150 examples, 4 features, 3 classes • And many more…
Programs • Extended the WEKA data mining system implemented in Java: • Added the VPM meta-learner to the existing library of algorithms • Allowed learners to be tested in the online setting • Created Matlab scripts to easily create plots (see later)
Results, papers and website • All results that I discuss today can be found in my 3 tech reports: • The Probability Calibration Graph - a useful visualisation of the reliability of probability forecasts, Lindsay (2004), CLRC-TR-04-01 • Multi-class probability forecasting using the Venn Probability Machine - a comparison with traditional machine learning methods, Lindsay (2004), CLRC-TR-04-02 • Rapid implementation of Venn Probability Machines, Lindsay (2004), CLRC-TR-04-03 • And on my web site: • http://www.david-lindsay.co.uk/research.html
Back to the plan… • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
Loss Functions • Square loss: Σ_{y ∈ Y} (p̂(y) – 1{y = y_true})² • Log loss: –log p̂(y_true) • There are many other possible loss functions… • DeGroot and Fienberg (1982) showed that all loss functions measure a mixture of reliability and resolution • Log loss punishes mistakes more harshly: the learner is forced to spread its bets
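The two losses can be sketched as follows (a minimal Python sketch of the standard definitions, assuming forecasts are dicts over the label space; illustrative, not the thesis code):

```python
import math

def square_loss(forecast, actual):
    """Brier-style loss: sum of squared differences between the forecast
    probability and the 0/1 indicator of the true label."""
    return sum((p - (label == actual)) ** 2 for label, p in forecast.items())

def log_loss(forecast, actual):
    """Negative log probability of the true label; punishes over-confident
    mistakes much more harshly than square loss."""
    return -math.log(forecast[actual])

f = {"a": 0.9, "b": 0.1}                 # confident forecast that is wrong
print(square_loss(f, "b"))               # 0.81 + 0.81 = 1.62
print(log_loss(f, "b"))                  # -log(0.1), roughly 2.30 -- harsher
```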
ROC Curves [Figure: ROC curve of Naïve Bayes on the Abdominal Pain data set] • Graph shows the trade-off between false and true positive predictions • Want the curve to be as close to the upper-left corner as possible (away from the diagonal) • My results show that this graph tests resolution. • The area under the curve provides a measure of the quality of the probability forecasts.
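The area under the ROC curve can be computed via its rank-statistic interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count half). A minimal sketch with toy scores, not the thesis data:

```python
def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs ranked correctly (ties score 0.5).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # perfect ranking -> 1.0
print(auc([0.2, 0.3, 0.8, 0.9], [1, 1, 0, 0]))  # reversed ranking -> 0.0
```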
Table comparing traditional scores
Algorithm          Error      Sqr Loss   Log Loss   ROC Area
ZeroR              55.6 (10)  0.74 (9)   1.1 (6)    0.49 (11)
1-NN               34.6 (6)   0.73 (8)   2.1 (8)    0.59 (7)
VPM 1-NN           41.6 (9)   0.58 (6)   0.9 (5)    0.61 (6)
10-NN              33.4 (4)   1.0 (11)   2.6 (10)   0.54 (10)
20-NN              33.4 (4)   0.96 (10)  2.2 (9)    0.55 (9)
30-NN              34.3 (5)   0.47 (3)   0.73 (3)   0.74 (4)
Neural Net         30.5 (3)   0.45 (2)   0.72 (2)   0.75 (3)
C4.5               39.6 (7)   0.67 (7)   3.3 (11)   0.57 (8)
VPM C4.5           40.7 (8)   0.54 (5)   0.8 (4)    0.76 (1)
Naïve Bayes        29.2 (2)   0.50 (4)   1.3 (7)    0.72 (5)
VPM Naïve Bayes    28.9 (1)   0.44 (1)   0.6 (1)    0.75 (2)
Problems with Traditional Assessment • Loss functions and ROC curves give more information than error rate about the quality of probability forecasts. • But… • loss functions = a mixture of resolution and reliability • ROC curve = measures resolution • We have no method of solely assessing reliability • We have no method of telling whether probability forecasts are over- or under-estimated
Back to the plan… • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
Inspiration for PCG (Meteorology) • Murphy & Winkler (1977): calibration data for precipitation forecasts • Reliable points lie close to the diagonal [Figure: Murphy & Winkler’s calibration plot]
Reliability • The PCG plots predicted probability against the empirical frequency of being correct; its coordinates should lie close to the line of calibration (the diagonal) [Figure: a PCG plot of ZeroR on Abdominal Pain] • The plot may not span the whole axis – ZeroR makes no predictions with high probability • ZeroR’s PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate but it is reliable!
PCG: a visualisation tool and measure of reliability [Figure: PCG plots of Naïve Bayes and VPM Naïve Bayes] • Naïve Bayes is unreliable: a forecast of 0.9 only has a 0.55 chance of being right (over-estimate), while a forecast of 0.1 has a 0.3 chance of being right (under-estimate) • It over- and under-estimates its probabilities – much like real doctors! • VPM is reliable, as its PCG follows the diagonal!
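One plausible way to compute PCG coordinates is by binning (the tech report may use a different smoothing scheme, so treat this as an assumed sketch): bucket the probability attached to each predicted label, and compare the mean forecast in a bucket with the empirical frequency of those predictions being correct.

```python
def pcg_points(forecasts, correct, n_bins=10):
    """forecasts: probability given to the predicted label at each trial;
    correct: whether that prediction was right. Returns (x, y) pairs."""
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(forecasts, correct):
        i = min(int(p * n_bins), n_bins - 1)   # which probability bucket
        bins[i].append((p, ok))
    points = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)   # predicted probability
            freq = sum(ok for _, ok in b) / len(b)   # empirical frequency
            points.append((mean_p, freq))
    return points  # reliable forecasts lie near the diagonal y = x

pts = pcg_points([0.95, 0.9, 0.9, 0.9], [True, False, True, False])
print(pts)  # one point: mean forecast about 0.91, only half correct
```

A point well above or below the diagonal reveals under- or over-estimated forecasts, which is exactly the information the loss functions and ROC curves hide.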
Learners predicting like people! [Figure: PCG of Naïve Bayes alongside calibration data for people] • Lots of psychological research shows that people make unreliable probability forecasts
Back to the plan… • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
Table comparing scores with PCG Algorithm Error Sqr Loss Log Loss ROC Area PCG ZeroR 55.6 (10) 0.74 (9) 1.1 (6) 0.49 (11) 678.6 (3) 1-NN 34.6 (6) 0.73 (8) 2.1 (8) 0.59 (7) 4307.5 (9) VPM 1-NN 41.6 (9) 0.58 (6) 0.9 (5) 0.61 (6) 554.6 (2) 10-NN 33.4 (4) 1.0 (11) 2.6 (10) 0.54 (10) 5062.9 (11) 20-NN 33.4 (4) 0.96 (10) 2.2 (9) 0.55 (9) 4492.7 (10) 30-NN 34.3 (5) 0.47 (3) 0.73 (3) 0.74 (4) 921.2 (5) Neural Net 30.5 (3) 0.45 (2) 0.72 (2) 0.75 (3) 1320.5 (6) C4.5 39.6 (7) 0.67 (7) 3.3 (11) 0.57 (8) 3481.2 (8) VPM C4.5 40.7 (8) 0.54 (5) 0.8 (4) 0.76 (1) 838.1 (4) Naïve Bayes 29.2 (2) 0.50 (4) 1.3 (7) 0.72 (5) 2764.5 (7) VPM Naïve Bayes 28.9 (1) 0.44 (1) 0.6 (1) 0.75 (2) 496.7 (1)
Correlations of scores Scores Corr. Coeff. Interpretation ROC vs. Error -0.52 Inverse Moderate ROC vs. Sqr Resolution 0.67 Direct Strong ROC vs. Sqr Reliability -0.1 Inverse No PCG vs. Error 0.26 Direct Weak PCG vs. Sqr Resolution 0.04 Direct No PCG vs. Sqr Reliability 0.76 Direct Strong
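The coefficients in the table above are presumably Pearson’s r; for reference, a minimal sketch of its computation on toy numbers (not the thesis data):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # covariance term
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5               # std-dev terms
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly direct -> 1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly inverse -> -1.0
```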
Back to the plan… • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
What is the VPM meta-learner? • The VPM “sits on top” of an existing learner Γ to complement its predictions with probability estimates [Figure: the VPM meta-learning framework] • Volodya’s VPM: • Predicts a label • Produces upper u and lower l bounds for the predicted label only • My VPM extension: • Extracts more information • Produces probability forecasts for all possible labels • Predicts a label using these probability forecasts • Produces Volodya’s bounds as well!
Volodya’s original use of VPM [Figure: errors against online trial number – the upper (red) and lower (green) bounds lie above and below the actual number of errors (black) made on the data: Up Error 2216.5 (34.7%), Error 1835 (28.9%), Low Error 1414.1 (22.1%)]
Output from VPM compared with that of the original underlying learner
Columns: probability forecast for each class label, then the bounds (Up, Low)
Trial #  Appx     Div.    Perf. Pept.  Non. Spec  Choli    Intest obstr  Pancr    Renal.   Dysp.    Up    Low
VPM Naïve Bayes
1653     0.03     0.0     0.03         0.08       0.73     0.0           0.04     0.01     0.09     0.82  0.08
2490     0.02     0.03    0.10         0.07       0.05     0.15          0.08     0.09     0.4      0.71  0.07
5831     0.53     0.01    0.0          0.42       0.01     0.01          0.0      0.01     0.01     0.68  0.41
Naïve Bayes
1653     3.08e-9  4.5e-6  3.3e-6       4.4e-5     0.99     4.2e-3        3.4e-3   4.1e-10  1.3e-4   NA    NA
2490     9.4e-5   0.01    0.17         2.3e-5     0.16     0.46          0.2      2.2e-7   2.2e-4   NA    NA
5831     0.93     2.9e-9  1.7e-13      0.07       1.3e-9   2.2e-9        4.0e-11  6.3e-10  7.6e-9   NA    NA
Key: Predicted = underlined , Actual =
Back to the plan… • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
ZeroR [Figure: PCG plots of ZeroR on Heart Disease, Lymphography and Diabetes] • ZeroR outputs probability forecasts which are mere label frequencies • ZeroR predicts the majority class label at each trial. • It uses no information about the objects in its learning – the simplest of all learners. • Accuracy is poor, but reliability is good.
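ZeroR’s forecasts can be sketched in a couple of lines (an illustrative sketch; the labels are hypothetical):

```python
from collections import Counter

def zero_r_forecast(training_labels, label_space):
    """ZeroR ignores the object entirely: its probability forecast is just
    the frequency of each label seen so far in the training data."""
    counts = Counter(training_labels)
    n = len(training_labels)
    return {y: counts[y] / n for y in label_space}

f = zero_r_forecast(["flu", "flu", "cold"], ["flu", "cold"])
print(f)  # flu has frequency 2/3, cold 1/3
```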
K-NN [Figure: PCG plots of 10-NN, 20-NN and 30-NN] • K-NN finds the subset of the K closest (nearest neighbouring) examples in the training data using a distance metric, then counts the label frequencies amongst this subset. • Acts like a more sophisticated version of ZeroR that uses the information held in the object. • An appropriate choice of K must be made to obtain reliable probability forecasts (depends on the data).
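A sketch of K-NN as a probability forecaster (assuming a simple one-dimensional Euclidean distance; the objects and labels are toy values, not a dataset from the talk):

```python
def knn_forecast(training, x, k, label_space):
    """training: list of (object, label) with numeric objects.
    Finds the k nearest neighbours of x, then returns their label
    frequencies as the probability forecast."""
    nearest = sorted(training, key=lambda t: abs(t[0] - x))[:k]
    return {y: sum(lab == y for _, lab in nearest) / k for y in label_space}

training = [(1.0, "a"), (1.1, "a"), (1.2, "b"), (9.0, "b")]
f = knn_forecast(training, 1.0, 3, ["a", "b"])
print(f)  # neighbours are a, a, b: forecast roughly 0.67 for a, 0.33 for b
```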
Traditional Learners and VPM [Figure: PCG plots of Neural Net, Naïve Bayes, C4.5 and 1-NN alongside VPM Neural Net, VPM Naïve Bayes, VPM C4.5 and VPM 1-NN] • Traditional learners can be very unreliable (yet accurate) – depends on the data. • My research shows empirically that VPM is reliable. • It can also recalibrate a learner’s original probability forecasts to make them more reliable! • The improvement in reliability often comes without detriment to classification accuracy.
Back to the plan… • What is probability forecasting? • Reliability and resolution criteria • Experimental design • Problems with traditional assessment methods: square loss, log loss and ROC curves • Probability Calibration Graph (PCG) • Traditional learners are unreliable yet accurate! • Extension of Venn Probability Machine (VPM) • Which learners are reliable? • Psychological and theoretical viewpoint
Psychological Heuristics • When faced with the difficult task of judging probability, people employ a limited number of heuristics which reduce the judgements to simpler ones: • Availability – an event is predicted more likely to occur if it has occurred frequently in the past • Representativeness – one compares the essential features of the event to those of the structure of previous events • Simulation – the ease with which the simulation of a system of events reaches a particular state can be used to judge the propensity of the (real) system to produce that state.
Interpretation of reliable learners using heuristics • ZeroR, K-NN and VPM learners are reliable probability forecasters. • We can identify these heuristics in the learning algorithms • Remember, psychological research states: more heuristics → more reliable forecasts
Psychological Interpretation of ZeroR • The simplest of all reliable probability forecasters uses 1 heuristic: • The learner merely counts labels it has observed so far, and uses the frequencies of labels as its forecasts (Availability)
Psychological Interpretation of K-NN • More sophisticated than the ZeroR learner, the K-NN learner uses 2 heuristics: • Uses the distance metric to find the subset of the K closest examples in the training set (Representativeness) • Then counts the label frequencies in the subset of K nearest neighbours to make its forecasts (Availability)
Psychological Interpretation of VPM • More sophisticated still, the VPM meta-learner uses all 3 heuristics: • The VPM tries each new test example with all possible classifications (Simulation) • Then, under each tentative simulation, clusters similar training examples into groups (Representativeness) • Finally, the VPM calculates the frequency of labels in each of these groups to make its forecasts (Availability)
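The three-step recipe above can be sketched as a toy Venn predictor (a hedged illustration of the idea, not Volodya’s or the thesis implementation; the `taxonomy` grouping function passed in below is a trivial stand-in for the real underlying learner):

```python
from collections import Counter

def venn_forecasts(training, x, label_space, taxonomy):
    """For each tentative label, simulate adding (x, label) to the training
    set, group the examples with the taxonomy, and read off the label
    frequencies in the test example's group: one distribution per label."""
    dists = []
    for tentative in label_space:                      # Simulation
        extended = training + [(x, tentative)]
        groups = {}
        for obj, lab in extended:                      # Representativeness
            groups.setdefault(taxonomy(extended, obj), []).append(lab)
        cell = groups[taxonomy(extended, x)]
        counts = Counter(cell)                         # Availability
        dists.append({y: counts[y] / len(cell) for y in label_space})
    return dists

# Toy taxonomy: group examples simply by their object value.
dists = venn_forecasts([(1, "a"), (1, "a"), (2, "b")], 1, ["a", "b"],
                       taxonomy=lambda ext, obj: obj)
print(dists)  # one probability distribution per tentative classification
```

The spread between the distributions across tentative labels is what yields the upper and lower bounds on the predicted label’s probability.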
Theoretical justifications • ZeroR can be proven to be asymptotically reliable (and experiments show it performs well on finite data) • K-NN has plenty of theory, e.g. Stone (1977), to support its convergence to the true probability distribution • VPM has a lot of theoretical justification for finite data, using martingales
Take home points • Probability forecasting is useful for real-life applications, especially medicine. • We want learners to be reliable and accurate. • The PCG can be used to check reliability. • ZeroR, K-NN and VPM provide consistently reliable probability forecasts. • Traditional learners – Naïve Bayes, Neural Net and Decision Tree – can provide unreliable forecasts. • VPM can be used to improve the reliability of probability forecasts without detriment to classification accuracy.
Acknowledgments • Supervision: Alex Gammerman, Volodya Vovk, Zhiyuan Luo • Mathematical Advice: Daniil Riabko, Volodya Vovk, Teo Sharia • Proofreading: Zhiyuan Luo, Siân Cox • Graphics & Design: Siân Cox • Catering: Siân Cox • Fin