Evaluation of Probability Forecasts. Shulamith T Gross, City University of New York, Baruch College. Joint work with Tze Leung Lai, David Bo Shen (Stanford University) and Catherine Huber (Université René Descartes). Rutgers, 4.30.2012.
Quality of Prediction
• Repeated probability predictions are commonly made in meteorology, banking and finance, medicine, and epidemiology.
• Define: p_i = true event probability; p'_i = (model-)predicted future event probability; Y_i = observed event indicator.
• Prediction quality: Calibration (accuracy); Discrimination (resolution, precision).
OUTLINE
• Evaluation of probability forecasts in meteorology & banking (accuracy)
• Scores (Brier, Good), proper scores, Winkler's skill scores (with loss or utility): L'_n = n^-1 Σ_{1≤i≤n} (Y_i − p'_i)²
• Reliability diagrams: bin data according to predicted risk p'_i and plot the observed relative frequency of events against each bin center.
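As a concrete illustration (not from the slides; the data are hypothetical), the empirical score L'_n with squared-error loss can be computed as:

```python
def brier_score(y, p):
    """Empirical score L'_n = n^-1 * sum (Y_i - p'_i)^2 for event
    indicators y and predicted probabilities p."""
    assert len(y) == len(p) and len(y) > 0
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

# Hypothetical forecasts, for illustration only
y = [1, 0, 1, 1, 0]
p = [0.8, 0.2, 0.6, 0.9, 0.4]
print(round(brier_score(y, p), 3))  # -> 0.082
```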
OUTLINE 2: Measures of Predictiveness of Models
• Medicine: scores are increasingly used. Epidemiology: curves (ROC, predictiveness).
• Plethora of indices of reliability/discrimination:
• AUC = P[p'_1 > p'_2 | Y_1 = 1, Y_2 = 0] (concordance; one predictor)
• Net Reclassification Index: NRI = P[p'' > p' | Y=1] − P[p'' > p' | Y=0] − {P[p'' < p' | Y=1] − P[p'' < p' | Y=0]}
• Improved Discrimination Index (note the minus sign): IDI = E[p'' − p' | Y=1] − E[p'' − p' | Y=0] (two predictors)
• R² = 1 − Σ(Y_i − p'_i)² / Σ(Y_i − Ȳ)², and R² differences
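A sketch of how the empirical NRI and IDI might be computed from two sets of predicted risks (function names and data are illustrative, not from the slides):

```python
def idi(y, p_old, p_new):
    """Empirical IDI = E[p'' - p' | Y=1] - E[p'' - p' | Y=0]."""
    up1 = [pn - po for yi, po, pn in zip(y, p_old, p_new) if yi == 1]
    up0 = [pn - po for yi, po, pn in zip(y, p_old, p_new) if yi == 0]
    return sum(up1) / len(up1) - sum(up0) / len(up0)

def nri(y, p_old, p_new):
    """Empirical NRI: net proportion of upward risk moves among events
    minus the net proportion among non-events."""
    def net(label):
        moves = [(pn > po) - (pn < po)
                 for yi, po, pn in zip(y, p_old, p_new) if yi == label]
        return sum(moves) / len(moves)
    return net(1) - net(0)

# Hypothetical data: the new model raises risk for events, lowers it for the control
y     = [1, 1, 0]
p_old = [0.5, 0.6, 0.4]
p_new = [0.7, 0.8, 0.3]
print(round(idi(y, p_old, p_new), 3))  # -> 0.3
print(nri(y, p_old, p_new))            # -> 2.0
```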
Evaluation of Probability Forecasts Using Scoring Rules
• Reliability, or precision, is measured using "scoring rules" (loss or utility): L'_n = n^-1 Σ_{1≤i≤n} L(Y_i, p'_i)
• The first score, Brier (1950), used squared-error loss; it remains the most commonly used.
• Review: Gneiting & Raftery (2007)
• Problem: very few inference tools are available.
Earlier Contributions
• Proper scores (using loss functions): E_p[L(Y,p)] ≤ E_p[L(Y,p')]
• Inference: tests of H0: p_i = p'_i for all i = 1,…,n
• Cox (1958): in the model logit(p_i) = b_1 + b_2 logit(p'_i)
• Too restrictive: we need to estimate the true score, not test for perfect prediction.
• Related work in econometrics: evaluation of linear models, with emphasis on predicting Y. Giacomini and White (Econometrica, 2006)
Statistical Considerations
• L'_n = n^-1 Σ_{1≤i≤n} L(Y_i, p'_i) attempts to estimate the 'population parameter' L_n = n^-1 Σ_{1≤i≤n} L(p_i, p'_i)
• Squared-error loss: L(p,p') = (p − p')²
• Kullback–Leibler divergence (Good, 1952): L(p,p') = p log(p/p') + (1−p) log((1−p)/(1−p'))
• How good is L'_n as an estimator of L_n?
Linear Equivalent Loss
• A loss function L~(p,p') is a linear equivalent of the loss function L(p,p') if
• it is a linear function of p, and
• L(p,p') − L~(p,p') does not depend on p'.
• E.g., L~(p,p') = −2pp' + p'² is a linear equivalent of the squared-error loss L(p,p') = (p − p')².
• A linear equivalent L~ of the Kullback–Leibler divergence is given by the Good (1952) score L~(p,p') = −{p log(p') + (1−p) log(1−p')}.
• L(p,p') = |p − p'| has no linear equivalent.
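Both linear-equivalence claims can be checked numerically (a sketch; the choice p = 0.3 and the p' grid are arbitrary): the difference L − L~ should not change as p' varies.

```python
import math

def sq_loss(p, q):  return (p - q) ** 2
def sq_lin(p, q):   return -2 * p * q + q ** 2   # linear in p
def kl(p, q):       return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
def good(p, q):     return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p = 0.3
# Each set should collapse to a single value: the difference is free of p'
diff_sq = {round(sq_loss(p, q) - sq_lin(p, q), 10) for q in (0.1, 0.5, 0.9)}
diff_kl = {round(kl(p, q) - good(p, q), 10) for q in (0.1, 0.5, 0.9)}
print(diff_sq)  # -> {0.09}  (= p^2, independent of p')
print(diff_kl)  # single element: p log p + (1-p) log(1-p)
```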
LINEAR EQUIVALENTS 2
• We allow each forecast p'_k to depend on an information set F_{k−1} consisting of all event and forecast histories and other covariates available before Y_k is observed, as well as the true p_1,…,p_k. The conditional distribution of Y_i given F_{i−1} is Bernoulli(p_i), with P(Y_i = 1 | F_{i−1}) = p_i.
• Dawid (1982, 1993) used martingales in a hypothesis-testing context.
• Suppose L(p,p') is linear in p, as in the case of linear equivalents of general loss functions. Then E[L(Y_i, p'_i) | F_{i−1}] = L(p_i, p'_i), so L(Y_i, p'_i) − L(p_i, p'_i) is a martingale difference sequence with respect to {F_{i−1}}.
Theorem 1. Suppose L(p,p') is linear in p. Let σ²_n = Σ_{1≤i≤n} {L(1,p'_i) − L(0,p'_i)}² p_i(1−p_i). If σ²_n converges in probability to some non-random positive constant, then √n (L'_n − L_n)/σ_n has a limiting standard normal distribution.
________________________________________________
We can set confidence intervals for the 'true' score if we can estimate p_i(1−p_i). The upper bound 1/4 leads to a conservative CI. Alternatively, create buckets (bins) of cases with close event probability values (banking) or use categorical covariates (epidemiology).
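A sketch of the conservative interval using the 1/4 bound (the function name and the normalization σ²_n = n^-1 Σ d_i² p_i(1−p_i) are my assumptions; for the linear equivalent of squared-error loss, d_i = L(1,p'_i) − L(0,p'_i) = −2p'_i):

```python
import math

def conservative_brier_ci(y, p, z=1.96):
    """Conservative CI for the 'true' score L_n, bounding p_i(1-p_i) by 1/4.
    Sketch: half-width z * sigma_n / sqrt(n), with sigma_n^2 the average of
    d_i^2 * p_i(1-p_i) and d_i = -2 p'_i (assumed normalization)."""
    n = len(y)
    score = sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / n
    var_bound = sum((2 * pi) ** 2 * 0.25 for pi in p) / n   # p_i(1-p_i) <= 1/4
    half = z * math.sqrt(var_bound / n)
    return score - half, score + half

# Hypothetical data; the interval is centered at the empirical Brier score
lo, hi = conservative_brier_ci([1, 0, 1, 1, 0], [0.8, 0.2, 0.6, 0.9, 0.4])
print(lo, hi)
```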
Difference of Scores to Replace Skill Scores
• When the score has a linear equivalent, the score difference D'_n = n^-1 Σ_{1≤i≤n} {L(Y_i, p''_i) − L(Y_i, p'_i)} is linear in p_i, and D'_n − D_n is a martingale with respect to {F_n}; the part of the loss that does not depend on p'_i or p''_i cancels in the difference.
• Theorem 2. Suppose the score L has a linear equivalent. Let D_n = n^-1 Σ_{1≤i≤n} {L(p_i, p''_i) − L(p_i, p'_i)}, d_i = L(1,p''_i) − L(0,p''_i) − [L(1,p'_i) − L(0,p'_i)], and s²_n = Σ_{1≤i≤n} d_i² p_i(1−p_i). If s²_n converges in probability to a positive constant, then √n (D'_n − D_n)/s_n converges in law to N(0,1). If the score has no linear equivalent, the theorem holds with D_n = n^-1 Σ_{1≤i≤n} {d_i p_i + [L(0,p''_i) − L(0,p'_i)]}.
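For the squared-error case, D'_n is simply the difference of the two empirical Brier scores; a sketch (function name and data are mine):

```python
def brier_diff(y, p1, p2):
    """D'_n = n^-1 * sum { L(Y_i, p''_i) - L(Y_i, p'_i) } with squared-error
    loss, for forecasts p2 (playing p'') versus p1 (playing p')."""
    n = len(y)
    return sum((yi - b) ** 2 - (yi - a) ** 2 for yi, a, b in zip(y, p1, p2)) / n

# Hypothetical forecasts: p2 is sharper than the constant forecast p1
y  = [1, 0, 1]
p1 = [0.5, 0.5, 0.5]
p2 = [0.9, 0.1, 0.8]
print(round(brier_diff(y, p1, p2), 2))  # -> -0.23 (negative: p2 scores better)
```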
A typical application in climatology is the following, in which p_t is the predicted probability of precipitation t days ahead. The confidence intervals, which are centered at Δ(k) = BR_k − BR_{k−1}, are given in the table below.

                BR_1   Δ(2)   Δ(3)   Δ(4)   Δ(5)   Δ(6)   Δ(7)
________________________________________________________________
Queens, NY      .125   .021   .012   .020   .010   .015   .007
                      ±.010  ±.011  ±.012  ±.011  ±.011  ±.010
Jefferson       .159   .005   .005   .007   .024   .000   .008
City, MO              ±.010  ±.011  ±.011  ±.010  ±.010  ±.008

Brier scores BR_1 and conservative 95% confidence intervals for Δ(k).
Additional Results Useful for Variance Estimation and Binning
• If a better estimate of p_i(1−p_i) is needed, use binning or bucketing of the data. Notation:
• Predictions are made at times t in 1:T.
• At time t, there are J_t buckets of predictions.
• Bucket sizes n_jt.
• p(j) = true probability common to all i in bucket j, for j in 1:J_t.
• In statistical applications, buckets would be formed by combinations of categorical variables.
Buckets for One or Two Scores
• Theorem 3 (for constant-probability buckets). Suppose n_jt ≥ 2 for all j in 1:J_t and p_i = p_jt for all i in I_jt. Then:
• Under the conditions of Theorem 1, σ'²_n − σ²_n = o_p(1), where σ'²_n = Σ_{t in 1:T, j in 1:J_t, i in I_jt} {L(1,p'_i) − L(0,p'_i)}² v'²_t(j) and v'²_t(j) is the sample variance of the Y's in bucket I_jt.
• Under the conditions of Theorem 2, s'²_n − s²_n = o_p(1), where s'²_n = Σ_{t in 1:T, j in 1:J_t, i in I_jt} d_i² v'²_t(j).
Notation for Quasi-Buckets (Bins)
• Bins B_1, B_2, …, B_Jt; usually J_t = J.
• Quasi-buckets: a bin defined by grouping cases on their predicted probability p' (not on the true p): I_jt = {k : p'_{k,t} in B_j}; i = (k,t); n_j = Σ_t n_jt.
• Recall, simple buckets: the p_i for all i in I_jt are equal.
• Ybar_t(j) = Σ_{i in I_jt} Y_i / n_jt
• Ybar(j) = Σ_{t in 1:T} Σ_{i in I_jt} Y_i / n_j for quasi-bucket j
• pbar(j) = Σ_{t in 1:T} Σ_{i in I_jt} p_i / n_j
• v'(j) = Σ_{t in 1:T} n_jt v'_t(j) / n_j, the estimated variance in bin j, where v'_t(j) = Ybar_t(j) (1 − Ybar_t(j)) · n_jt/(n_jt − 1)
Quasi-Buckets and the Reliability Diagram
Application of quasi-buckets:
• The reliability diagram is a graphical display of calibration: observed % of cases in each bin vs. the bin center.
• Bin = all subjects within the same decile (B_j) of predicted risk.
• Note: bin I_jt is measurable w.r.t. F_{t−1}.
• Define, for one and two predictors:
• s'²_n = n^-1 Σ_{t in 1:T} Σ_{j in 1:J_t} Σ_{i in I_jt} {L(1,p'_i) − L(0,p'_i)}² (Y_i − Ybar_t(j))² n_jt/(n_jt − 1)
• s'²_n = n^-1 Σ_{t in 1:T} Σ_{j in 1:J_t} Σ_{i in I_jt} d_i² (Y_i − Ybar_t(j))² n_jt/(n_jt − 1)
• Note the difference between these quantities and the bucket estimators above: p_i(1−p_i), for all i in the same bucket, is estimated by a squared distance from a single bucket mean. Since the p_i's vary within a quasi-bucket, a wrong centering is used, leading to Theorem 5.
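A minimal sketch of the quasi-bucket construction behind a reliability diagram (equal-width bins rather than deciles; function name and data are mine):

```python
def reliability_bins(y, p, n_bins=10):
    """Group cases by predicted risk p' into equal-width bins and return,
    for each non-empty bin, (bin center, observed event frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        j = min(int(pi * n_bins), n_bins - 1)   # clamp p' = 1.0 into the last bin
        bins[j].append(yi)
    return [((j + 0.5) / n_bins, sum(b) / len(b), len(b))
            for j, b in enumerate(bins) if b]

# Hypothetical data: well-calibrated forecasts cluster near the diagonal
pts = reliability_bins([0, 1, 1, 1], [0.1, 0.3, 0.9, 0.95], n_bins=5)
print(pts)  # -> [(0.1, 0.0, 1), (0.3, 1.0, 1), (0.9, 1.0, 2)]
```

Plotting observed frequency against bin center (with the CIs of Theorem 5) gives the reliability diagram.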
The simulation • t=1,2 Jt=5 • Bin(j) = ((j-1)/5, j/5) j=1,…,5 • True probabilities: pjt ~ Unif(Bin(j)) • nj=30 for j=1,…,5 • Simulate: • 1. Table of true and estimated bucket variances • 2. Reliability diagram using confidence intervals from Theorem 5
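The simulation design above can be sketched as follows (the seed, helper name, and single-time-point simplification are my assumptions):

```python
import random

def simulate_buckets(J=5, n_per=30, seed=2012):
    """Per bucket j, draw true p ~ Unif((j-1)/J, j/J) and Y ~ Bernoulli(p);
    return (pbar, Ybar) for each bucket, as in the slide's setup."""
    rng = random.Random(seed)
    out = []
    for j in range(1, J + 1):
        ps = [rng.uniform((j - 1) / J, j / J) for _ in range(n_per)]
        ys = [1 if rng.random() < pi else 0 for pi in ps]
        out.append((sum(ps) / n_per, sum(ys) / n_per))
    return out

for j, (pbar, ybar) in enumerate(simulate_buckets(), start=1):
    print(f"bucket {j}: pbar={pbar:.3f}  Ybar={ybar:.3f}")
```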
Theorem 5 (for quasi-buckets)
• Assume that bucket I_jt is F_{t−1}-measurable for each bucket at time t (j in 1:J_t).
• Then under the assumptions of Theorem 1, σ'²_n ≥ σ²_n + o(1) a.s., with equality if all p_i's in a bucket are equal.
• Under the assumptions of Theorem 2, s'²_n ≥ s²_n + o(1) a.s., with equality if all p_i's in a bucket are equal.
Quasi-Buckets, Theorem 5 Continued
• If n_j/n converges in probability to a positive constant, and v(j) = Σ_{t in 1:T} Σ_{i in I_jt} p_i(1−p_i)/n_j,
• then (n_j/v(j))^{1/2} (Ybar(j) − pbar(j)) converges in distribution to N(0,1), and
• v'(j) ≥ v(j) + o_p(1).
Simulation Results for a 5-Bucket Example

          Min    Q1     Q2     Q3     Max    Mean   SD
pbar(2)   0.101  0.300  0.300  0.355  0.515  0.320  0.049
Ybar(2)   0.050  0.267  0.317  0.378  0.633  0.319  0.089
v(2)      0.087  0.207  0.207  0.207  0.247  0.209  0.015
v'(2)     0.048  0.208  0.221  0.239  0.259  0.213  0.034
pbar(5)   0.769  0.906  0.906  0.906  0.906  0.895  0.026
Ybar(5)   0.733  0.867  0.900  0.933  1.000  0.892  0.049
v(5)      0.082  0.082  0.082  0.082  0.164  0.088  0.016
v'(5)     0.077  0.084  0.093  0.120  0.202  0.096  0.037
Buckets Help Resolve a Bias Problem
• Bias for the Brier score with L(p,p') = (p − p')²: E[L'_n − L_n | F_{n−1}] = n^-1 Σ_{1≤i≤n} p_i(1−p_i)
• Using the within-bucket sample variances we
• estimate the bias,
• fix the variance of the resulting estimate,
• obtain asymptotic normality, and
• estimate the new asymptotic variance.
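A sketch of the bias correction for constant-probability buckets (the estimator form is my reading of the slide; each bucket must contain at least 2 cases):

```python
def debiased_brier(buckets):
    """Estimate L_n = n^-1 * sum (p_i - p'_i)^2 by subtracting from the raw
    Brier score an estimate of the bias n^-1 * sum p_i(1-p_i), using the
    within-bucket sample variance ybar(1-ybar) * m/(m-1).
    `buckets` is a list of (ys, ps) pairs sharing a common true p."""
    n = sum(len(ys) for ys, _ in buckets)
    raw = sum((yi - pi) ** 2
              for ys, ps in buckets for yi, pi in zip(ys, ps)) / n
    bias = 0.0
    for ys, _ in buckets:
        m = len(ys)
        ybar = sum(ys) / m
        bias += m * ybar * (1 - ybar) * m / (m - 1)
    return raw - bias / n

# One bucket, forecast 0.5 throughout; the correction can overshoot in
# small samples, so a negative estimate is possible
print(round(debiased_brier([([1, 0, 1, 0], [0.5] * 4)]), 4))  # -> -0.0833
```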
Indices of Reliability and Discrimination in Epidemiology
Our work applies to 'in-the-sample' prediction, e.g.:
• A 'best' model based on standard covariates for predicting disease development.
• A genetic or other marker becomes available and is tested on a 'training' sample; the estimated models are used to predict disease development in a test sample.
• The marker improves model fit significantly.
• Does it sufficiently improve prediction of disease to warrant the expense?
Goodness of Prediction in Epidemiology
• Discrimination: compare the predictive prowess of two models by how much better model 2 assigns higher risk to cases and lower risk to controls than model 1. Pencina et al. (2008, Stat in Med) call these indices
• IDI = Improved Discrimination Index
• NRI = Net Reclassification Index
• Note: both in-sample and out-of-sample prediction arise here.
• Much related work: Gu & Pepe (2009); Uno et al., Biometrics (2011)
Current Work and Simulations
• The martingale approach applies to out-of-sample evaluation of indices like NRI, IDI, AUC, and R² differences.
• For epidemiological applications (in-sample model comparison for prediction), assuming:
• (Y, Z) i.i.d. (response indicator, covariate vector),
• neither model is necessarily the true model,
• the models differ by one or more covariates,
• the larger model is significantly better than the smaller model,
• logistic or similar models, with the parameters of both models estimated by MLE,
we proved asymptotic normality, and provided variance estimates, for the IDI and the Brier score difference: IDI_{2/1} = E[p''−p'|Y=1] − E[p''−p'|Y=0]; BRI = BR(1) − BR(2).
• Simulation results
• Real data from a French dementia study.
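The bootstrap comparison reported on the later slides could be sketched as follows (the resampling scheme, names, and data are my assumptions; resamples lacking either outcome class are redrawn):

```python
import random

def idi(rows):
    """Empirical IDI from (y, p', p'') triples."""
    up1 = [pn - po for y, po, pn in rows if y == 1]
    up0 = [pn - po for y, po, pn in rows if y == 0]
    return sum(up1) / len(up1) - sum(up0) / len(up0)

def bootstrap_se(stat, rows, B=500, seed=0):
    """Nonparametric bootstrap standard error of stat(rows)."""
    rng = random.Random(seed)
    n, vals = len(rows), []
    while len(vals) < B:
        s = [rows[rng.randrange(n)] for _ in range(n)]
        if any(r[0] == 1 for r in s) and any(r[0] == 0 for r in s):
            vals.append(stat(s))
    m = sum(vals) / B
    return (sum((v - m) ** 2 for v in vals) / (B - 1)) ** 0.5

# Hypothetical data: 10 events and 10 controls with two models' predicted risks
rng = random.Random(1)
rows = [(1, rng.uniform(0.4, 0.8), rng.uniform(0.5, 0.9)) for _ in range(10)] + \
       [(0, rng.uniform(0.2, 0.6), rng.uniform(0.1, 0.5)) for _ in range(10)]
print(idi(rows), bootstrap_se(idi, rows))
```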
On the Three City Study
• Cohort study.
• Purpose: identify variables that help predict dementia in people over 65.
• Our sample: n = 4214 individuals; 162 developed dementia within 4 years.
• http://www.three-city-study.com/baseline-characteristics-of-the-3c-cohort.php
• Ages: 65–74: 55%; 75–79: 27%; 80+: 17%
• Education: primary school 33%; high school 43%; beyond HS 24%
• Original data: 3650 men, 5644 women at start
Three City Occupation & Income

Occupation        Men (%)  Women (%)
Senior exec          33       11
Mid exec             24       18
Office worker        14       38
Skilled worker       29       17
Housewife             –       16

Annual income
> 2300 Euro          45       24
1000–2300            29       25
750–1000             19       36
< 750                 2        8
The Predictive Model with Marker

TABLE V. Logistic Model 1, including the genetic marker APOE4.

              Estimate  Std. Error  Pr(>|z|)
(Intercept)    -2.944     0.176     < 2e-16
age.fac.31     -2.089     0.330     2.3e-10
age.fac.32     -0.984     0.191     2.5e-07
Education      -0.430     0.180     0.0167
Cardio          0.616     0.233     0.0081
Depress         0.786     0.201     9.5e-05
incap           1.180     0.206     1.1e-08
APOE4           0.634     0.195     0.0012

AIC = 1002. Age groups: Age1 = 65 to <71; Age2 = 71 to <78.
The Predictive Model without Marker

TABLE VI. Model 2: logistic model without the genetic marker APOE4.

              Estimate  Std. Error  p-value
(Intercept)    -2.797     0.168     < 2e-16
age.fac.31     -2.060     0.330     4.0e-10
age.fac.32     -0.963     0.190     4.2e-07
Education      -0.434     0.179     0.0155
card            0.677     0.231     0.0034
depress         0.805     0.201     6.2e-05
incap           1.124     0.206     4.6e-08

AIC = 1194. Age groups: Age1 = 65 to <71; Age2 = 71 to <78.
The Bootstrap and Our Estimates

Sample and bootstrap estimates for IDI:
  Sample mean  Bootstrap mean  Sample std err  Bootstrap std err
   -0.00298      -0.00374        0.00303          0.00328

Sample and bootstrap estimates for BRI:
  Sample mean  Bootstrap mean  Sample std err  Bootstrap std err
   -6.09e-05     -9.02e-05       0.000115         0.000129
Asymptotic and Bootstrap Confidence Intervals for IDI and BRI (Three City Study)

95% confidence interval for IDI:
  Asymptotic: (-0.008911, +0.002958)
  Bootstrap:  (-0.012012, +0.000389)

95% confidence interval for BRI:
  Asymptotic: (-2.87e-04, +1.65e-04)
  Bootstrap:  (-4.08e-04, +9.03e-05)

Additional QQ normal plots for the bootstrap distribution.
Discussion and Summary
• We have provided a probabilistic setting for inference on prediction scores, and asymptotic results for the discrimination and accuracy measures prevalent in the life sciences, for within-sample prediction.
• In-sample prediction: the chosen models need not coincide with the true model. We obtain asymptotics for IDI and BRI. Convergence is slow for models whose true coefficients are small.
• We considered predicting P[event] only. Different problems arise in predicting a full discrete probability distribution; our setting can still be used, but the scores must be carefully chosen.
• What about predicting ranks? (Sports; internet search engines.)