Thoughts on model assessment Porco, DAIDD, December 2012
Assessment • Some models are so wrong as to be useless. • Not really possible to argue that a model is “true” • Treacherous concept of “validation”: has a model that has been “validated” been shown to be “valid”? • Philosophy of science
Face Validity • Knowledge representation • Analogy • Does the model represent in some approximate sense facts about the system?
Verification • Does your simulation actually simulate what you think it simulates? • How do you know your program is not generating nonsense? • Unit testing • Boundary cases • “Test harnesses” • Formal software testing techniques
Thought • “…the gold standard of model performance is predictive power on data that wasn’t used to fit your model…” --Conway and White, Machine Learning for Hackers
Prediction • Not everything is predictable • Predict individuals? Communities? Counterfactuals? • Does the model somehow correspond to reality?
Probabilistic forecasting • Weather • Forecast expressed as a probability • Correctly quantify uncertainty • Uncertainty is needed to make the best use of a forecast • If you don’t know, don’t say you do • Not bet-hedging, but honesty
Two approaches • Assessment of probabilistic forecasts • Brier Score • Information Score
Assessing simple forecasts • Binary event: 0 for no, 1 for yes • Forecast: p is the forecast probability that the event occurs • If p=1, you forecast it with certainty • If p=0, you forecast that it was certain not to occur • If 0<p<1, you were somewhere in the middle
Brier score • Squared error of a probabilistic forecast • For an event that occurred: BS = (p-1)^2 + ((1-p)-0)^2 • BS = 2(1-p)^2 • Brier score has a negative orientation (like in golf, smaller is better) • Some authors do not use the factor of 2
Brier score • Example. Suppose I compute a forecast that trachoma will be eliminated in village A after two years as measured by pooled PCR for DNA in 50 randomly selected children. Suppose I say the elimination probability is 0.8. • If trachoma is in fact eliminated, the Brier score for this is (0.8-1)^2 + (0.2-0)^2 = [you work it out]
Brier score • What is the smallest possible Brier score? • What is the largest possible Brier score?
Brier score • Suppose we now forecast elimination in five villages. We can compute an overall Brier score by just adding up separate scores for each village.
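As a minimal R sketch of adding up the scores, using made-up forecast probabilities and outcomes for five hypothetical villages (none of these numbers come from the slides):

p <- c(0.8, 0.6, 0.9, 0.3, 0.5)    # hypothetical forecast probabilities of elimination
y <- c(1,   1,   1,   0,   1  )    # hypothetical outcomes: 1 = eliminated, 0 = not
sum(2*(p - y)^2)                   # overall Brier score with the factor of two: 1.1 here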
Murphy decomposition • Classical form applies to binary predictions • Predictions are probabilities • Finite set of possible probabilities • Example: chance of a meteorological event happening might be predicted as 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%.
Reliability (sensu Murphy) • Similar to Brier score – less is better • Looking at all identical forecasts, how similar is what really happened to what was forecast?
Example • Twelve forecasts: six made at level 0.2, of which two came out YES, and six made at level 0.8, of which three came out YES. • Here, we get a total Brier score of 6.96. We have to add up (0.2-0)^2 + (0.2-1)^2 + … over all twelve forecasts (and not forget the factor of two).
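A quick check of the 6.96 in R, grouping the identical terms (this assumes the twelve forecasts described above):

2 * (4*(0.2-0)^2 + 2*(0.2-1)^2 + 3*(0.8-1)^2 + 3*(0.8-0)^2)   # 6.96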
Some terminology • Forecast levels: the different values possible for the forecast. Notation: f_k for forecast level k • Observed means at the forecast levels: the average of the observed outcomes among all forecasts made at a given level. Notation: ō_k (computed as obark in the script later). • We have two forecast levels in the example, 0.2 and 0.8. For forecast level 0.2, we have an observed mean of 2/6. For forecast level 0.8, we have an observed mean of 3/6.
Examples • We had 2/6 that were YES when we gave a 20% chance. Let’s compare the prediction to the successes just for the 20% predictions. For each one we have the same thing: (0.2 - 2/6)^2 + (0.8 - 4/6)^2, which is about 0.0356. • We had 3/6 that were YES when we gave an 80% chance. We do a similar computation and we get (0.8 - 3/6)^2 + (0.2 - 3/6)^2 = 0.18.
Reliability • So the total reliability score is computed by summing over the forecast levels: REL = Σ_k n_k*2*(f_k - ō_k)^2 = 6*2*(0.2 - 2/6)^2 + 6*2*(0.8 - 3/6)^2. This yields a total score of approximately 1.2933 for the reliability component REL.
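The same arithmetic as a one-line check in R:

6*2*(0.2 - 2/6)^2 + 6*2*(0.8 - 3/6)^2   # about 1.2933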
Resolution • When we said different things would happen, how different really were they? We’re going to contrast the observed means at different forecast levels. • Specifically, we want the variance of the distribution of the observed means at different forecast levels.
Example • Using the simple example: the overall mean is 5/12. Then compute 2*(2/6 - 5/12)^2 for every observation in the first group, and 2*(3/6 - 5/12)^2 for every observation in the second group. • The total resolution component RES is 6*2*(2/6 - 5/12)^2 + 6*2*(3/6 - 5/12)^2 = 1/6.
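And the corresponding one-line check for the resolution component:

6*2*(2/6 - 5/12)^2 + 6*2*(3/6 - 5/12)^2   # 1/6, about 0.1667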
Uncertainty • In the classical Murphy formula, this is computed by calculating the mean observation times one minus the mean and adding this up for each observation (here we also double it, to match the factor-of-two convention used throughout). • For our example, the uncertainty is 5.8333 or so.
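A one-line check (twelve observations, overall mean 5/12, factor of two):

12 * 2 * (5/12) * (1 - 5/12)   # about 5.8333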
Murphy decomposition • Murphy (1973) showed that the Brier score can be decomposed as follows: • BS=REL-RES+UNC • N.B. the negative sign in front of resolution • High uncertainty contributes to high Brier score (all else being equal) • High discrepancy of the observed means at each level from the forecasts raises the Brier score • But having those observed means at each level separate from each other lowers the Brier score
Script

library(plyr)   # for ddply()

# per-forecast Brier contribution, with the factor of two
brier.component <- function(forecast,observation) {
  2*(forecast-observation)^2
}

# per-forecast reliability contribution: forecast vs. observed mean at that level
reliability <- function(fk,ok) {
  2*(fk-ok)^2
}

# per-forecast resolution contribution: observed mean at the level vs. overall mean
resolution <- function(obar,ok) {
  2*(ok-obar)^2
}

# observed mean at each forecast level, returned in the original observation order
gen.obark <- function(predictions,outcome,key) {
  tmp <- data.frame(key=key,outcome=outcome,pred=predictions)
  f1 <- merge(tmp,
              ddply(tmp, .(pred), function(x){mean(x$outcome)}),
              by="pred")
  f1 <- f1[order(f1$key),]
  f1$V1
}
Example

ds <- c(0,1,0,0,0,1,1,0,0,0,1,1)    # outcomes for the twelve forecasts
fs <- c(rep(0.2,6),rep(0.8,6))      # forecast levels
obark <- gen.obark(fs,ds,1:12)      # observed mean at each forecast's level
rel <- sum(reliability(fs,obark))
res <- sum(resolution(mean(ds),obark))
unc <- 12*2*(mean(ds)*(1-mean(ds)))
brs <- sum(brier.component(fs,ds))
(rel-res+unc)-brs                   # zero, up to rounding error
[1] -1.776357e-15
General form • BS = Σ_k n_k*2*(f_k - ō_k)^2 - Σ_k n_k*2*(ō_k - ō)^2 + N*2*ō*(1 - ō), the sums running over the K forecast levels • BS: Brier score; K: number of forecast levels; n_k: the number of forecasts at level k; N: the total number of forecasts; ō: the overall observed mean • Other quantities as defined earlier
More general forms • More general decompositions (Stephenson, five terms) • Continuous forms
Alternatives to Brier score • Decomposition into Reliability, Resolution, Uncertainty works for information measures (Weijs et al.)
Information measures • Surprise • Expected surprise • Surprising a model
Surprisal • How much do you learn if two equally likely alternatives are disclosed to you? • Heads vs tails; 1 vs 2
Surprisal • How much do you learn if 3 are disclosed? • A, B, or C • How much if I disclose one of 26 equally likely outcomes? You should have learned more
Surprisal • Standard example: • Now combine two independent things. If we had 1 vs 2, and A, B, or C: • A/1 • C/2 • B/1 • B/2 • …
Surprisal • Six equally likely things • s(2 x 3) = s(2) + s(3) • Uniquely useful way to do this: define surprisal as log(1/p), the log of the reciprocal probability • Stevens information tutorial online
Surprisal • Different bases can be used for the logarithm • Log base 2 is traditional • How much information is transmitted if you learn which of two equally likely outcomes happened? • Log2(1/(1/2)) = 1 • If we use base two, then the value is 1, referred to as 1 bit. • Disclosure of one of two equally likely outcomes reveals one bit of information. • Notation: log base 2 often written lg
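A small R illustration of surprisal in bits (the function name surprisal is just for this sketch):

surprisal <- function(p) log2(1/p)   # surprisal in bits
surprisal(1/2)    # 1 bit: one of two equally likely outcomes
surprisal(1/26)   # about 4.7 bits: one of 26 equally likely outcomes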
Example • Sequence of values of flips of a bent coin with P(heads)=1/3, P(tails)=2/3. The average surprisal for this sequence was 1.085 or so.
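The particular sequence behind the 1.085 is not shown here, but a sketch along these lines simulates a bent-coin sequence and computes its average surprisal; for a long sequence the average approaches the entropy of about 0.918 bits, while a short sequence can come out higher or lower:

set.seed(1)
flips <- sample(c("H","T"), 20, replace=TRUE, prob=c(1/3, 2/3))
p     <- ifelse(flips == "H", 1/3, 2/3)   # probability of the outcome actually seen
mean(log2(1/p))                           # average surprisal of this sequence, in bits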
Expected surprisal • Every time an event with probability p_i happens, the surprisal is lg(1/p_i). • What is the expected surprisal?
Expected surprisal • H = Σ_i p_i lg(1/p_i) = -Σ_i p_i lg(p_i) • This expected surprisal is the Shannon entropy
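A short R sketch of the expected surprisal for a discrete distribution (the function name entropy is just for illustration; the probabilities are assumed to sum to 1):

entropy <- function(p) sum(p * log2(1/p))   # Shannon entropy in bits
entropy(c(1/2, 1/2))    # 1 bit
entropy(c(1/3, 2/3))    # about 0.918 bits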
Shannon entropy • Shannon entropy can be used to quantify the amount of information provided by a model for a categorical outcome, in much the same way that the squared multiple correlation coefficient can quantify the amount of variance explained by a continuous model • Use of entropy and related measures to assess probabilistic predictions is sometimes recommended (Weijs)
Estimated entropy • If you have a Bernoulli trial, what is the entropy? The success probability is p. • H = -(1-p) lg(1-p) - p lg(p) • If we estimate p from data and plug this in, we get an estimator of the entropy. • Example: if we observe 8 successes in 20 trials, the observed frequency is 0.4, and the estimated entropy would be about 0.971.
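Checking that example in R with the plug-in estimate:

p.hat <- 8/20
-(1 - p.hat)*log2(1 - p.hat) - p.hat*log2(p.hat)   # about 0.971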
What does a model do? • If we now have a statistical model that provides a better prediction, we can compute the conditional entropy. • Without the model, our best estimate of the chance of something is just the observed relative frequency. • But with the model, maybe we have a better estimate.
Example • Logistic regression: use data from the Mycotic Ulcer Treatment Trial (Prajna et al. 2012, Lietman group). • Using just the chance of having a transplant or perforation, the estimated entropy is 0.637. About 16% of patients had such an event.
Example • Now, let’s use a covariate, such as what treatment a person got. • Now, we have a better (we hope) estimate of the chance of a bad outcome for each person. These are 11% for treatment 1 and 21% for treatment 2.
Example • For group 1, the chance of a bad outcome was 11%. The entropy given membership in this group is about 0.503. • For group 2, the chance of a bad outcome was 21% or so. The entropy given membership in this group is about 0.744. • The weighted average of these is the conditional entropy given the model. This gives us just 0.623.
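A sketch of that weighted average, using the rounded percentages from the slides and assuming, purely for illustration, equal numbers of patients in the two treatment groups; the result is therefore close to, but not exactly, the 0.623 quoted above:

H <- function(p) -p*log2(p) - (1-p)*log2(1-p)   # entropy of a Bernoulli outcome
w <- c(0.5, 0.5)                                # assumed group weights
sum(w * H(c(0.11, 0.21)))                       # conditional entropy given treatment, about 0.62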
Example • So the entropy was 0.637, and now it’s 0.623. The treatment model does not predict very much. • If we look at the proportional reduction, we get 2.2%; this model explained 2.2% of the uncertainty. • This is exactly the McFadden R^2 you get in logistic regression output.
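To see the connection, here is a sketch on a constructed dataset (100 patients per treatment group, with bad-outcome counts matching the rounded 11% and 21%; this is not the actual MUTT data). McFadden's pseudo-R^2 from the logistic regression output matches the proportional reduction in entropy:

grp <- factor(rep(c("trt1","trt2"), each = 100))
bad <- c(rep(1,11), rep(0,89),      # 11% bad outcomes in group 1
         rep(1,21), rep(0,79))      # 21% bad outcomes in group 2

fit  <- glm(bad ~ grp, family = binomial)
null <- glm(bad ~ 1,   family = binomial)
1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))   # McFadden pseudo-R^2, about 0.021

H <- function(p) -p*log2(p) - (1-p)*log2(1-p)
1 - (0.5*H(0.11) + 0.5*H(0.21)) / H(0.16)   # proportional reduction in entropy, same value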
Summary • Probabilistic forecasts can be assessed using the Brier score. • The Brier score can be decomposed into reliability, resolution, and uncertainty components. • Information theoretic measures can also be used in assessment. • Information theoretic measures are useful in other biostatistical applications as well.