Why I am a Bayesian (and why you should become one, too), or: Classical statistics considered harmful. Kevin Murphy, UBC CS & Stats, 9 February 2005
Where does the title come from? • “Why I am not a Bayesian”, Glymour, 1981 • “Why Glymour is a Bayesian”, Rosenkrantz, 1983 • “Why isn’t everyone a Bayesian?”, Efron, 1986 • “Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001 Many other such philosophical essays…
Frequentist vs Bayesian
• Frequentist: Prob = objective relative frequencies; params are fixed unknown constants, so one cannot write e.g. P(θ = 0.5 | D); estimators should be good when averaged across many trials
• Bayesian: Prob = degrees of belief (uncertainty); can write P(anything | D); estimators should be good for the available data
Source: “All of Statistics”, Larry Wasserman
Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach?
Coin flipping HHTHT HHHHH What process produced these sequences? The following slides are from Tenenbaum & Griffiths
Hypotheses in coin flipping
Hypotheses are statistical (generative) models: they describe processes by which D = HHTHT could have been generated
• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...
Representing generative models
Graphical model notation (Pearl 1988; Jordan 1998): variables are nodes, edges indicate dependency, and directed edges show the causal process of data generation.
[Diagrams: fair coin, P(H) = 0.5, with independent nodes d1 d2 d3 d4 d5; Markov model with edges d1 → d2 → d3 → d4 → d5; both generating HHTHT]
Models with latent structure
• Not all nodes in a graphical model need to be observed
• Some variables reflect latent structure, used in generating D but unobserved
[Diagrams: P(H) = p model with latent parameter p as parent of d1...d4; hidden Markov model with latent states s1...s4 emitting d1...d4; both generating HHTHT]
How do we select the “best” model?
Bayes’ rule
P(h|D) = P(D|h) P(h) / Σ_h' P(D|h') P(h')
Posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses
The origin of Bayes’ rule
• A simple consequence of using probability to represent degrees of belief
• For any two random variables: P(h, D) = P(h|D) P(D) = P(D|h) P(h), hence P(h|D) = P(D|h) P(h) / P(D)
Why represent degrees of belief with probabilities? • Good statistics • consistency, and worst-case error bounds. • Cox Axioms • necessary to cohere with common sense • “Dutch Book” + Survival of the Fittest • if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord. • Provides a theory of incremental learning • a common currency for combining prior knowledge and the lessons of experience.
Hypotheses in Bayesian inference • Hypotheses H refer to processes that could have generated the data D • Bayesian inference provides a distribution over these hypotheses, given D • P(D|H) is the probability of D being generated by the process identified by H • Hypotheses H are mutually exclusive: only one process could have generated D
Coin flipping • Comparing two simple hypotheses • P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses • P(H) = 0.5 vs. P(H) = p
Comparing two simple hypotheses
• Contrast simple hypotheses:
  • H1: “fair coin”, P(H) = 0.5
  • H2: “always heads”, P(H) = 1.0
• Bayes’ rule: P(Hi|D) ∝ P(D|Hi) P(Hi)
• With two hypotheses, use the odds form
Bayes’ rule in odds form
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Posterior odds = Bayes factor (likelihood ratio) × prior odds
Data = HHTHT
• D: HHTHT;  H1: “fair coin”, H2: “always heads”
• P(D|H1) = 1/2^5 = 1/32,  P(H1) = 999/1000
• P(D|H2) = 0,  P(H2) = 1/1000
• Posterior odds P(H1|D) / P(H2|D) = infinity (a single tail rules out “always heads”)
Data = HHHHH
• D: HHHHH;  H1: “fair coin”, H2: “always heads”
• P(D|H1) = 1/2^5 = 1/32,  P(H1) = 999/1000
• P(D|H2) = 1,  P(H2) = 1/1000
• Posterior odds P(H1|D) / P(H2|D) ≈ 30
Data = HHHHHHHHHH
• D: HHHHHHHHHH;  H1: “fair coin”, H2: “always heads”
• P(D|H1) = 1/2^10 = 1/1024,  P(H1) = 999/1000
• P(D|H2) = 1,  P(H2) = 1/1000
• Posterior odds P(H1|D) / P(H2|D) ≈ 1
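To make the odds form concrete, here is a minimal Python sketch (my own, not from the original slides) that reproduces the three cases above, using the 999:1 prior odds from the slides:

```python
def posterior_odds(data, prior_h1=0.999, prior_h2=0.001):
    """Posterior odds P(H1|D)/P(H2|D): H1 = fair coin, H2 = always heads."""
    lik_h1 = 0.5 ** len(data)                             # each flip has probability 1/2 under H1
    lik_h2 = 1.0 if all(c == 'H' for c in data) else 0.0  # any tail is impossible under H2
    if lik_h2 == 0.0:
        return float('inf')
    return (lik_h1 / lik_h2) * (prior_h1 / prior_h2)      # Bayes factor x prior odds

for d in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(d, posterior_odds(d))   # inf, then roughly 31, then roughly 0.98
```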
Coin flipping • Comparing two simple hypotheses • P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses • P(H) = 0.5 vs. P(H) = p
Comparing simple and complex hypotheses
• Which provides a better account of the data: the simple hypothesis of a fair coin, P(H) = 0.5, or the complex hypothesis that P(H) = p for some unknown p?
[Diagrams: fair coin model with independent flips d1...d4 vs. model with latent parameter p as parent of d1...d4]
Comparing simple and complex hypotheses • P(H) = p is more complex than P(H) = 0.5 in two ways: • P(H) = 0.5 is a special case of P(H) = p • for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
Comparing simple and complex hypotheses
[Plots: likelihood of the data as a function of p. For D = HHHHH the likelihood is maximized at p = 1.0; for D = HHTHT it is maximized at p = 0.6]
Comparing simple and complex hypotheses • P(H) = p is more complex than P(H) = 0.5 in two ways: • P(H) = 0.5 is a special case of P(H) = p • for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5 • How can we deal with this? • frequentist: hypothesis testing • information theorist: minimum description length • Bayesian: just use probability theory!
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• Computing P(D|H1) is easy: P(D|H1) = 1/2^N
• Compute P(D|H2), the marginal likelihood, by averaging over p:
  P(D|H2) = ∫ P(D|p) P(p) dp   (likelihood × prior, integrated over all p)
Likelihood and prior
• Likelihood: P(D|p) = p^NH (1-p)^NT
  • NH: number of heads
  • NT: number of tails
• Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1) ... what are FH and FT?
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a set of previous experiences
  • strategy often used with neural networks
• e.g., F = {1000 heads, 1000 tails} expresses a strong expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...
Likelihood and prior
• Likelihood: P(D|p) = p^NH (1-p)^NT
  • NH: number of heads
  • NT: number of tails
• Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT)
  • FH: fictitious observations of heads
  • FT: fictitious observations of tails (pseudo-counts)
Posterior ∝ prior × likelihood
• Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)
• Likelihood: P(D|p) = p^NH (1-p)^NT
• Posterior: P(p|D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1)
Same form as the prior!
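A small sketch (mine, not in the original slides) of the conjugate update just described: the posterior pseudo-counts are simply the prior pseudo-counts plus the observed counts.

```python
def beta_posterior(data, FH=1.0, FT=1.0):
    """Prior Beta(FH, FT) plus observed flips gives posterior Beta(NH + FH, NT + FT)."""
    NH = data.count('H')   # observed heads
    NT = data.count('T')   # observed tails
    return NH + FH, NT + FT

print(beta_posterior("HHTHT"))  # (4.0, 3.0): the posterior is Beta(4, 3)
```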
Conjugate priors
• Exist for many standard distributions
  • there is a general formula for exponential-family conjugacy
• Define the prior in terms of fictitious observations
• Beta is conjugate to the Bernoulli (coin flipping)
[Plots: Beta densities for FH = FT = 1, FH = FT = 3, and FH = FT = 1000]
Normalizing constants
• Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT)
• Normalizing constant for the Beta distribution: B(FH, FT) = Γ(FH) Γ(FT) / Γ(FH+FT)
• Posterior: P(p|D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH+FH, NT+FT)
• Hence the marginal likelihood is P(D) = B(NH+FH, NT+FT) / B(FH, FT)
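As a sanity check on the marginal-likelihood formula above, the following sketch (my own, assuming numpy and scipy are available) compares the closed form B(NH+FH, NT+FT) / B(FH, FT) with direct numerical integration of likelihood × prior over p.

```python
import numpy as np
from scipy.special import betaln
from scipy.integrate import quad
from scipy.stats import beta as beta_dist

NH, NT = 3, 2        # e.g. D = HHTHT
FH, FT = 1.0, 1.0    # uniform Beta(1, 1) prior

# Closed form: P(D) = B(NH + FH, NT + FT) / B(FH, FT), computed in log space
log_marglik = betaln(NH + FH, NT + FT) - betaln(FH, FT)

# Numerical check: integrate p^NH (1-p)^NT * Beta(p; FH, FT) over [0, 1]
integrand = lambda p: p**NH * (1 - p)**NT * beta_dist.pdf(p, FH, FT)
numeric, _ = quad(integrand, 0.0, 1.0)

print(np.exp(log_marglik), numeric)   # both roughly 1/60 for HHTHT with a uniform prior
```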
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
• Likelihood for H1: P(D|H1) = 1/2^N
• Marginal likelihood (“evidence”) for H2: P(D|H2) = ∫ P(D|p) P(p) dp = B(NH+FH, NT+FT) / B(FH, FT)
Marginal likelihood for H1 and H2
[Plot: P(D|p) as a function of p; the likelihood for H1 is its value at p = 0.5, while the marginal likelihood for H2 is an average over all values of p]
Bayesian model selection
• Simple and complex hypotheses can be compared directly using Bayes’ rule
  • requires summing (or integrating) over latent variables
• Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor”
• Maximum likelihood cannot be used for model selection: it always prefers the hypothesis with the most parameters, since the extra flexibility can only increase the maximized likelihood (see the sketch below)
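To see the Bayesian Occam’s razor at work on the coin example, here is a short sketch (mine, not from the slides) comparing H1 (fair coin) against H2 (unknown p with a Beta(1, 1) prior) on the two sequences used earlier.

```python
from math import lgamma, exp, log

def log_beta_fn(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor_fair_vs_unknown(data, FH=1.0, FT=1.0):
    """Return P(D|H1) / P(D|H2): H1 = fair coin, H2 = Beta(FH, FT) prior on p."""
    NH, NT = data.count('H'), data.count('T')
    log_lik_h1 = (NH + NT) * log(0.5)                                    # 1 / 2^N
    log_marglik_h2 = log_beta_fn(NH + FH, NT + FT) - log_beta_fn(FH, FT)
    return exp(log_lik_h1 - log_marglik_h2)

print(bayes_factor_fair_vs_unknown("HHTHT"))  # about 1.9: the simpler fair-coin model wins
print(bayes_factor_fair_vs_unknown("HHHHH"))  # about 0.19: the flexible model wins
```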
Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach?
Example: the Belgian euro coin
• A Belgian euro coin, spun N = 250 times, came up heads X = 140 times.
• “It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in The Guardian, 2002)
Source: MacKay, exercise 3.15
Classical hypothesis testing
• Null hypothesis H0, e.g. θ = 0.5 (unbiased coin)
• For the classical analysis we do not need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5
• Need a decision rule that maps data D to accept/reject of H0
• Define a scalar measure of deviance d(D) from the null hypothesis, e.g. NH or χ²
P-values
• Define the p-value of a threshold d* as pval(d*) = P(d(D) ≥ d* | H0)
• Intuitively, the p-value of the observed data is the probability of getting data at least that extreme given H0
• Usually choose the threshold so that the false rejection rate of H0 is below the significance level α = 0.05
• Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞
[Figure: sampling distribution of d(D) under H0, with the rejection region R in the upper tail]
P-value for the euro coin
• N = 250 trials, X = 140 heads
• P-value is “less than 7%”: in MATLAB, pval = (1 - binocdf(139, N, 0.5)) + binocdf(110, N, 0.5), i.e. P(X ≥ 140) + P(X ≤ 110) under H0
• If N = 250 and X = 141, pval = 0.0497, so we can reject the null hypothesis at the 5% significance level
• This does not mean P(H0|D) = 0.07!
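The MATLAB one-liner has a direct Python equivalent; this is a sketch of mine (assuming scipy), not part of the original slides.

```python
from scipy.stats import binom

N = 250
# Two-sided tail probability under H0 (p = 0.5) for X = 140 heads:
# P(X >= 140) + P(X <= 110), mirroring the MATLAB expression on the slide
pval_140 = (1 - binom.cdf(139, N, 0.5)) + binom.cdf(110, N, 0.5)
# Same computation for X = 141 heads (the slide reports 0.0497)
pval_141 = (1 - binom.cdf(140, N, 0.5)) + binom.cdf(109, N, 0.5)
print(pval_140, pval_141)
```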
Bayesian analysis of the euro coin
• Assume P(H0) = P(H1) = 0.5
• Under H1, assume P(p) ~ Beta(α, α)
• Setting α = 1 yields a uniform (non-informative) prior
Bayesian analysis of the euro coin
• If α = 1, the Bayes factor is B = P(D|H1) / P(D|H0) ≈ 0.48, so H0 (unbiased) is (slightly) more probable than H1 (biased)
• By varying α over a large range, the best we can do is make B ≈ 1.9, which does not strongly support the biased-coin hypothesis
• Other priors yield similar results
• The Bayesian analysis contradicts the classical analysis
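A sketch (mine, not from the slides, assuming numpy and scipy) of the computation behind these numbers: the marginal likelihood under H1 uses the Beta-function formula from earlier, while H0 fixes p = 0.5.

```python
import numpy as np
from scipy.special import betaln

N, X = 250, 140   # spins and heads for the Belgian euro

def bayes_factor_biased_vs_fair(alpha):
    """B = P(D|H1) / P(D|H0), with a Beta(alpha, alpha) prior on p under H1."""
    log_m0 = N * np.log(0.5)                                          # P(D|H0): p fixed at 0.5
    log_m1 = betaln(X + alpha, N - X + alpha) - betaln(alpha, alpha)  # marginal likelihood under H1
    return np.exp(log_m1 - log_m0)

print(bayes_factor_biased_vs_fair(1.0))   # roughly 0.48: the data slightly favour the unbiased coin
for a in [0.1, 1.0, 10.0, 100.0]:         # sweeping alpha; per the slides the best achievable is about 1.9
    print(a, bayes_factor_biased_vs_fair(a))
```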
Outline
• Hypothesis testing – Bayesian approach
• Hypothesis testing – classical approach
• What’s wrong with the classical approach?
  • Violates the likelihood principle
  • Violates the stopping rule principle
  • Violates common sense