
  1. Why I am a Bayesian (and why you should become one, too), or: Classical statistics considered harmful. Kevin Murphy, UBC CS & Stats, 9 February 2005

  2. Where does the title come from? • “Why I am not a Bayesian”, Glymour, 1981 • “Why Glymour is a Bayesian”, Rosenkrantz, 1983 • “Why isn’t everyone a Bayesian?”, Efron, 1986 • “Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001 • Many other such philosophical essays…

  3. Frequentist vs Bayesian • Frequentist: prob = objective relative frequencies; params are fixed unknown constants, so cannot write e.g. P(θ=0.5|D); estimators should be good when averaged across many trials • Bayesian: prob = degrees of belief (uncertainty); can write P(anything|D); estimators should be good for the available data • Source: “All of statistics”, Larry Wasserman

  4. Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach?

  5. Coin flipping HHTHT HHHHH What process produced these sequences? The following slides are from Tenenbaum & Griffiths

  6. Hypotheses in coin flipping • Hypotheses are statistical models: they describe processes by which D could be generated • Fair coin, P(H) = 0.5 • Coin with P(H) = p • Markov model • Hidden Markov model • ... (D = HHTHT)

  7. Hypotheses in coin flipping • Hypotheses are generative models: they describe processes by which D could be generated • Fair coin, P(H) = 0.5 • Coin with P(H) = p • Markov model • Hidden Markov model • ... (D = HHTHT)

  8. Representing generative models • Graphical model notation (Pearl 1988, Jordan 1998): variables are nodes, edges indicate dependency • Directed edges show the causal process of data generation [Figures: independent nodes d1…d4 for the fair coin, P(H) = 0.5; a chain over d1…d4 for the Markov model; observed data D = HHTHT as nodes d1…d5]

  9. Models with latent structure • Not all nodes in a graphical model need to be observed • Some variables reflect latent structure, used in generating D but unobserved [Figures: P(H) = p, with a latent parameter node p pointing to d1…d4; a hidden Markov model with latent states s1…s4 generating d1…d4; observed data D = HHTHT as nodes d1…d5] • How do we select the “best” model?

  10. Bayes’ rule • P(H|D) = P(D|H) P(H) / Σ_H′ P(D|H′) P(H′) • posterior probability ∝ likelihood × prior probability, normalized by a sum over the space of hypotheses
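The rule on this slide is easy to check numerically. Below is a minimal sketch (not from the talk; the function name posterior is mine) that normalizes likelihood × prior over a discrete hypothesis space:

```python
import numpy as np

def posterior(priors, likelihoods):
    """P(h|D) proportional to P(D|h) P(h), normalized by summing over hypotheses."""
    unnorm = np.asarray(priors) * np.asarray(likelihoods)
    return unnorm / unnorm.sum()

# Two hypotheses for D = HHTHT: "fair coin" vs. "always heads"
print(posterior(priors=[0.999, 0.001], likelihoods=[0.5**5, 0.0]))  # -> [1.0, 0.0]
```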

  11. The origin of Bayes’ rule • A simple consequence of using probability to represent degrees of belief • For any two random variables: P(A|B) = P(B|A) P(A) / P(B)

  12. Why represent degrees of belief with probabilities? • Good statistics • consistency, and worst-case error bounds. • Cox Axioms • necessary to cohere with common sense • “Dutch Book” + Survival of the Fittest • if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord. • Provides a theory of incremental learning • a common currency for combining prior knowledge and the lessons of experience.

  13. Hypotheses in Bayesian inference • Hypotheses H refer to processes that could have generated the data D • Bayesian inference provides a distribution over these hypotheses, given D • P(D|H) is the probability of D being generated by the process identified by H • Hypotheses H are mutually exclusive: only one process could have generated D

  14. Coin flipping • Comparing two simple hypotheses • P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses • P(H) = 0.5 vs. P(H) = p

  15. Coin flipping • Comparing two simple hypotheses • P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses • P(H) = 0.5 vs. P(H) = p

  16. Comparing two simple hypotheses • Contrast simple hypotheses: • H1: “fair coin”, P(H) = 0.5 • H2: “always heads”, P(H) = 1.0 • Bayes’ rule: with two hypotheses, use the odds form

  17. Bayes’ rule in odds form • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • posterior odds = Bayes factor (likelihood ratio) × prior odds

  18. Data = HHTHT • D: HHTHT • H1, H2: “fair coin”, “always heads” • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • P(D|H1) = 1/2^5, P(H1) = 999/1000 • P(D|H2) = 0, P(H2) = 1/1000 • P(H1|D) / P(H2|D) = infinity

  19. Data = HHHHH • D: HHHHH • H1, H2: “fair coin”, “always heads” • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • P(D|H1) = 1/2^5, P(H1) = 999/1000 • P(D|H2) = 1, P(H2) = 1/1000 • P(H1|D) / P(H2|D) ≈ 30

  20. Data = HHHHHHHHHH • D: HHHHHHHHHH • H1, H2: “fair coin”, “always heads” • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • P(D|H1) = 1/2^10, P(H1) = 999/1000 • P(D|H2) = 1, P(H2) = 1/1000 • P(H1|D) / P(H2|D) ≈ 1
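The three worked examples above can be reproduced with a short script. This is an illustrative sketch, not code from the talk; the priors 999/1000 and 1/1000 are taken from the slides:

```python
import numpy as np

def posterior_odds(data, prior_h1=0.999, prior_h2=0.001):
    """Posterior odds P(H1|D) / P(H2|D) for H1 = fair coin, H2 = always heads."""
    n = len(data)
    lik_h1 = 0.5 ** n                                  # P(D|H1): fair coin
    lik_h2 = 1.0 if data.count("H") == n else 0.0      # P(D|H2): always heads
    if lik_h2 == 0.0:
        return np.inf
    return (lik_h1 / lik_h2) * (prior_h1 / prior_h2)

for d in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(d, posterior_odds(d))   # inf, about 31, about 0.98
```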

  21. Coin flipping • Comparing two simple hypotheses • P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses • P(H) = 0.5 vs. P(H) = p

  22. Comparing simple and complex hypotheses • Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p? [Figures: fair coin, P(H) = 0.5, with nodes d1…d4, vs. P(H) = p, with a parameter node p pointing to d1…d4]

  23. Comparing simple and complex hypotheses • P(H) = p is more complex than P(H) = 0.5 in two ways: • P(H) = 0.5 is a special case of P(H) = p • for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

  24. Comparing simple and complex hypotheses • [Plot: probability of the data as a function of p]

  25. Comparing simple and complex hypotheses • [Plot: probability of D = HHHHH as a function of p; maximised at p = 1.0]

  26. Comparing simple and complex hypotheses • [Plot: probability of D = HHTHT as a function of p; maximised at p = 0.6]

  27. Comparing simple and complex hypotheses • P(H) = p is more complex than P(H) = 0.5 in two ways: • P(H) = 0.5 is a special case of P(H) = p • for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5 • How can we deal with this? • frequentist: hypothesis testing • information theorist: minimum description length • Bayesian: just use probability theory!

  28. Comparing simple and complex hypotheses • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • Computing P(D|H1) is easy: P(D|H1) = 1/2^N • Compute P(D|H2) by averaging over p: P(D|H2) = ∫ P(D|p) P(p|H2) dp

  29. Comparing simple and complex hypotheses • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • Computing P(D|H1) is easy: P(D|H1) = 1/2^N • Compute P(D|H2) by averaging over p: P(D|H2) = ∫ P(D|p) P(p|H2) dp (marginal likelihood = average of likelihood over the prior)

  30. Likelihood and prior • Likelihood: P(D | p) = p^NH (1-p)^NT • NH: number of heads, NT: number of tails • Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1) ?

  31. A simple method of specifying priors • Imagine some fictitious trials, reflecting a set of previous experiences • strategy often used with neural networks • e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair • In fact, this is a sensible statistical idea...

  32. Likelihood and prior • Likelihood: P(D | p) = p^NH (1-p)^NT • NH: number of heads, NT: number of tails • Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT) • FH: fictitious observations of heads, FT: fictitious observations of tails (pseudo-counts)

  33. Posterior ∝ prior × likelihood • Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1) • Likelihood: P(D | p) = p^NH (1-p)^NT • Posterior: P(p | D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1): same form as the prior!
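Because the posterior has the same Beta form, updating is just adding counts. A minimal sketch of the update (my own illustration using scipy; the helper name posterior_params is not from the talk):

```python
from scipy.stats import beta

def posterior_params(n_heads, n_tails, f_heads, f_tails):
    """Beta posterior hyper-parameters: observed counts plus pseudo-counts."""
    return n_heads + f_heads, n_tails + f_tails

# D = HHTHT (3 heads, 2 tails) with a uniform Beta(1, 1) prior
a, b = posterior_params(3, 2, 1, 1)
print(beta(a, b).mean())   # posterior mean of p = 4/7, about 0.57
```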

  34. Conjugate priors • Exist for many standard distributions • formula for exponential family conjugacy • Define prior in terms of fictitious observations • Beta is conjugate to Bernoulli (coin-flipping) [Plots of Beta priors with FH = FT = 1, FH = FT = 3, FH = FT = 1000]

  35. Normalizing constants • Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT) • Normalizing constant for the Beta distribution: B(FH, FT) = Γ(FH) Γ(FT) / Γ(FH+FT) • Posterior: P(p | D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH+FH, NT+FT) • Hence the marginal likelihood is P(D|H2) = B(NH+FH, NT+FT) / B(FH, FT)
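The last formula can be evaluated directly with the log-Beta function. A sketch of the computation (my own code, assuming the Beta(FH, FT) prior from the previous slides):

```python
import numpy as np
from scipy.special import betaln

def log_marginal_likelihood(n_heads, n_tails, f_heads, f_tails):
    """log P(D|H2) = log B(NH+FH, NT+FT) - log B(FH, FT)."""
    return betaln(n_heads + f_heads, n_tails + f_tails) - betaln(f_heads, f_tails)

# D = HHTHT: fair-coin likelihood vs. marginal likelihood under a Beta(1, 1) prior
print(np.exp(5 * np.log(0.5)))                      # P(D|H1) = 1/32, about 0.031
print(np.exp(log_marginal_likelihood(3, 2, 1, 1)))  # P(D|H2) = 1/60, about 0.017
```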

  36. Comparing simple and complex hypotheses • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • Computing P(D|H1) is easy: P(D|H1) = 1/2^N (likelihood for H1) • Compute P(D|H2) by averaging over p: P(D|H2) = ∫ P(D|p) P(p|H2) dp (marginal likelihood, or “evidence”, for H2)

  37. Marginal likelihood for H1 and H2 • [Plot: probability of the data as a function of p, with P(D|H1) and P(D|H2) marked] • The marginal likelihood is an average over all values of p
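“An average over all values of p” can be taken literally: a Monte Carlo estimate that draws p from the prior and averages the likelihood agrees with the closed form above. A quick sanity check (my own illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_t = 3, 2                        # D = HHTHT
p = rng.uniform(size=1_000_000)        # samples from the uniform Beta(1, 1) prior
print(np.mean(p**n_h * (1 - p)**n_t))  # about 1/60, matching B(4, 3) / B(1, 1)
```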

  38. Sensitivity to hyper-parameters

  39. Bayesian model selection • Simple and complex hypotheses can be compared directly using Bayes’ rule • requires summing over latent variables • Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor” • Maximum likelihood cannot be used for model selection (always prefers hypothesis with largest number of parameters)

  40. Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach?

  41. Example: Belgian euro-coins • A Belgian euro spun N=250 times came up heads X=140. • “It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in Guardian, 2002) • Source: MacKay, exercise 3.15

  42. Classical hypothesis testing • Null hypothesis H0, e.g. θ = 0.5 (unbiased coin) • For classical analysis, don’t need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5 • Need a decision rule that maps data D to accept/reject of H0 • Define a scalar measure of deviance d(D) from the null hypothesis, e.g., NH or χ²

  43. P-values • Define the p-value of threshold τ as pval(τ) = P(d(D) ≥ τ | H0) • Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0

  44. P-values • Define the p-value of threshold τ as pval(τ) = P(d(D) ≥ τ | H0) • Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0 • Usually choose the threshold (rejection region R) so that the false rejection rate of H0 is below the significance level α = 0.05

  45. P-values • Define the p-value of threshold τ as pval(τ) = P(d(D) ≥ τ | H0) • Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0 • Usually choose the threshold (rejection region R) so that the false rejection rate of H0 is below the significance level α = 0.05 • Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞

  46. P-value for euro coins • N = 250 trials, X = 140 heads • P-value is “less than 7%” • If N = 250 and X = 141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5% • This does not mean P(H0|D) = 0.07! • Matlab: pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5), with n = 250
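The Matlab one-liner on this slide translates directly to scipy. A sketch (my code, not from the talk) that reproduces the two-sided tail probability under H0:

```python
from scipy.stats import binom

n = 250
# Two-sided p-value for X = 140 heads under H0: p = 0.5
print(binom.sf(139, n, 0.5) + binom.cdf(110, n, 0.5))   # a bit under 0.07
# Same calculation for X = 141 (about 0.0497, per the slide)
print(binom.sf(140, n, 0.5) + binom.cdf(109, n, 0.5))
```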

  47. Bayesian analysis of euro-coin • Assume P(H0) = P(H1) = 0.5 • Assume P(p) ~ Beta(α, α) • Setting α = 1 yields a uniform (non-informative) prior

  48. Bayesian analysis of euro-coin • If α = 1, the Bayes factor B = P(D|H1) / P(D|H0) is slightly below 1, so H0 (unbiased) is (slightly) more probable than H1 (biased) • By varying α over a large range, the best we can do is make B = 1.9, which does not strongly support the biased coin hypothesis • Other priors yield similar results • Bayesian analysis contradicts classical analysis
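The Bayes factor on this slide can be computed from the marginal likelihood formula of slide 35. A sketch under the slide's assumptions (Beta(α, α) prior on p for H1, H0: p = 0.5); the function name is mine:

```python
import numpy as np
from scipy.special import betaln

def log_bayes_factor(n_heads, n_tails, alpha):
    """log [P(D|H1) / P(D|H0)]: Beta(alpha, alpha) prior on p vs. fixed p = 0.5."""
    log_p_h1 = betaln(n_heads + alpha, n_tails + alpha) - betaln(alpha, alpha)
    log_p_h0 = (n_heads + n_tails) * np.log(0.5)
    return log_p_h1 - log_p_h0

# Euro-coin data: 140 heads, 110 tails
print(np.exp(log_bayes_factor(140, 110, alpha=1.0)))   # below 1: slightly favours H0
for a in np.logspace(-1, 3, 9):   # sweep alpha; the slide reports the best B is only about 1.9
    print(a, np.exp(log_bayes_factor(140, 110, a)))
```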

  49. Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach?

  50. Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach? • Violates likelihood principle • Violates stopping rule principle • Violates common sense
