
Presentation Transcript


  1. Why I am a Bayesian (and why you should become one, too), or, Classical statistics considered harmful • Kevin Murphy, UBC CS & Stats, 9 February 2005

  2. Where does the title come from? • “Why I am not a Bayesian”, Glymour, 1981 • “Why Glymour is a Bayesian”, Rosenkrantz, 1983 • “Why isn’t everyone a Bayesian?”, Efron, 1986 • “Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001 • Many other such philosophical essays…

  3. Frequentist vs Bayesian • Frequentist: probability = objective relative frequencies; parameters are fixed unknown constants, so cannot write e.g. P(θ = 0.5|D); estimators should be good when averaged across many trials • Bayesian: probability = degrees of belief (uncertainty); can write P(anything|D); estimators should be good for the available data • Source: “All of statistics”, Larry Wasserman

  4. Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach?

  5. Coin flipping • HHTHT • HHHHH • What process produced these sequences? • (The following slides are from Tenenbaum & Griffiths)

  6. Hypotheses in coin flipping: statistical models • Describe processes by which D could be generated • Fair coin, P(H) = 0.5 • Coin with P(H) = p • Markov model • Hidden Markov model • ... • D = HHTHT

  7. Hypotheses in coin flipping: generative models • Describe processes by which D could be generated • Fair coin, P(H) = 0.5 • Coin with P(H) = p • Markov model • Hidden Markov model • ... • D = HHTHT

  8. Representing generative models • Graphical model notation: Pearl (1988), Jordan (1998) • Variables are nodes, edges indicate dependency • Directed edges show causal process of data generation • [Figure: fair coin, P(H) = 0.5, with independent nodes d1, ..., d4; Markov model with chained nodes; data such as HHTHT]
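  As an illustrative sketch of the two generative models on this slide (the helper names and the Markov model’s persistence parameter p_stay are assumptions, not from the deck), they could be simulated as follows:

```python
import random

def sample_fair_coin(n, p_heads=0.5):
    """Fair-coin model: each flip is independent with P(H) = p_heads."""
    return "".join("H" if random.random() < p_heads else "T" for _ in range(n))

def sample_markov(n, p_stay=0.8):
    """Markov model: each flip depends only on the previous flip;
    it repeats the previous outcome with probability p_stay."""
    seq = [random.choice("HT")]
    for _ in range(n - 1):
        prev = seq[-1]
        stays = random.random() < p_stay
        seq.append(prev if stays else ("T" if prev == "H" else "H"))
    return "".join(seq)

print(sample_fair_coin(5))  # e.g. 'HHTHT'
print(sample_markov(5))     # e.g. 'HHHHH': long runs are more likely
```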

  9. Models with latent structure • Not all nodes in a graphical model need to be observed • Some variables reflect latent structure, used in generating D but unobserved • [Figure: coin with latent bias p generating d1, ..., d4 with P(H) = p; hidden Markov model with hidden states s1, ..., s4 emitting observations d1, ..., d4; data HHTHT] • How do we select the “best” model?

  10. Bayes’ rule • P(H|D) = P(D|H) P(H) / Σ_H' P(D|H') P(H') • P(D|H): likelihood • P(H): prior probability • P(H|D): posterior probability • Denominator: sum over the space of hypotheses

  11. The origin of Bayes’ rule • A simple consequence of using probability to represent degrees of belief • For any two random variables: P(x, y) = P(x) P(y|x) = P(y) P(x|y), hence P(y|x) = P(x|y) P(y) / P(x)

  12. Why represent degrees of belief with probabilities? • Good statistics • consistency, and worst-case error bounds. • Cox Axioms • necessary to cohere with common sense • “Dutch Book” + Survival of the Fittest • if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord. • Provides a theory of incremental learning • a common currency for combining prior knowledge and the lessons of experience.

  13. Hypotheses in Bayesian inference • Hypotheses H refer to processes that could have generated the data D • Bayesian inference provides a distribution over these hypotheses, given D • P(D|H) is the probability of D being generated by the process identified by H • Hypotheses H are mutually exclusive: only one process could have generated D

  14. Coin flipping • Comparing two simple hypotheses • P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses • P(H) = 0.5 vs. P(H) = p

  15. Coin flipping • Comparing two simple hypotheses • P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses • P(H) = 0.5 vs. P(H) = p

  16. Comparing two simple hypotheses • Contrast simple hypotheses: • H1: “fair coin”, P(H) = 0.5 • H2: “always heads”, P(H) = 1.0 • Bayes’ rule: P(Hi|D) ∝ P(D|Hi) P(Hi) • With two hypotheses, use odds form

  17. Bayes’ rule in odds form • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • Posterior odds = Bayes factor (likelihood ratio) × prior odds

  18. Data = HHTHT • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • D: HHTHT • H1, H2: “fair coin”, “always heads” • P(D|H1) = 1/2^5, P(H1) = 999/1000 • P(D|H2) = 0, P(H2) = 1/1000 • P(H1|D) / P(H2|D) = infinity

  19. Data = HHHHH • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • D: HHHHH • H1, H2: “fair coin”, “always heads” • P(D|H1) = 1/2^5, P(H1) = 999/1000 • P(D|H2) = 1, P(H2) = 1/1000 • P(H1|D) / P(H2|D) ≈ 30

  20. Data = HHHHHHHHHH • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • D: HHHHHHHHHH • H1, H2: “fair coin”, “always heads” • P(D|H1) = 1/2^10, P(H1) = 999/1000 • P(D|H2) = 1, P(H2) = 1/1000 • P(H1|D) / P(H2|D) ≈ 1
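  A short sketch reproducing the posterior odds on slides 18–20 (the priors P(H1) = 999/1000 and P(H2) = 1/1000 are taken from the slides; the helper name posterior_odds is an assumption):

```python
# Posterior odds P(H1|D) / P(H2|D) for "fair coin" (H1) vs "always heads" (H2).
def posterior_odds(data, prior_h1=999/1000, prior_h2=1/1000):
    n = len(data)
    lik_h1 = 0.5 ** n                           # fair coin: P(D|H1) = 1/2^N
    lik_h2 = 1.0 if data == "H" * n else 0.0    # always heads: P(D|H2)
    if lik_h2 == 0.0:
        return float("inf")                     # any tail rules out H2
    return (lik_h1 / lik_h2) * (prior_h1 / prior_h2)

for d in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(d, posterior_odds(d))   # inf, ~31.2, ~0.98
```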

  21. Coin flipping • Comparing two simple hypotheses • P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses • P(H) = 0.5 vs. P(H) = p

  22. Comparing simple and complex hypotheses • Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p? • [Figure: fair coin, P(H) = 0.5, vs. model with latent bias p, each generating d1, ..., d4]

  23. Comparing simple and complex hypotheses • P(H) = p is more complex than P(H) = 0.5 in two ways: • P(H) = 0.5 is a special case of P(H) = p • for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

  24. Comparing simple and complex hypotheses • [Plot: probability of the observed sequence under P(H) = p, as a function of p]

  25. Comparing simple and complex hypotheses • [Plot: probability of HHHHH under P(H) = p, as a function of p; maximized at p = 1.0]

  26. Comparing simple and complex hypotheses • [Plot: probability of HHTHT under P(H) = p, as a function of p; maximized at p = 0.6]

  27. Comparing simple and complex hypotheses • P(H) = p is more complex than P(H) = 0.5 in two ways: • P(H) = 0.5 is a special case of P(H) = p • for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5 • How can we deal with this? • frequentist: hypothesis testing • information theorist: minimum description length • Bayesian: just use probability theory!

  28. Comparing simple and complex hypotheses • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • Computing P(D|H1) is easy: P(D|H1) = 1/2^N • Compute P(D|H2) by averaging over p: P(D|H2) = ∫ P(D|p) P(p|H2) dp

  29. Comparing simple and complex hypotheses • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • Computing P(D|H1) is easy: P(D|H1) = 1/2^N • Compute P(D|H2), the marginal likelihood, by averaging the likelihood over the prior on p: P(D|H2) = ∫ P(D|p) P(p|H2) dp

  30. Likelihood and prior • Likelihood: P(D|p) = p^NH (1-p)^NT • NH: number of heads • NT: number of tails • Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1) ?

  31. A simple method of specifying priors • Imagine some fictitious trials, reflecting a set of previous experiences • strategy often used with neural networks • e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair • In fact, this is a sensible statistical idea...

  32. Likelihood and prior • Likelihood: P(D|p) = p^NH (1-p)^NT • NH: number of heads • NT: number of tails • Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT) • FH: fictitious observations of heads • FT: fictitious observations of tails (pseudo-counts)

  33. Posterior ∝ prior × likelihood • Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1) • Likelihood: P(D|p) = p^NH (1-p)^NT • Posterior: P(p|D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1), same form as the prior!
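  A minimal sketch of this conjugate update, assuming scipy is available and using a hypothetical coin_posterior helper:

```python
# Beta-Bernoulli conjugate update for the coin's bias p, using the
# pseudo-counts F_H, F_T as the prior and the head/tail counts from the data.
from scipy.stats import beta

def coin_posterior(data, F_H=1, F_T=1):
    N_H = data.count("H")
    N_T = data.count("T")
    # Posterior is Beta(N_H + F_H, N_T + F_T): same form as the prior.
    return beta(N_H + F_H, N_T + F_T)

post = coin_posterior("HHTHT", F_H=1, F_T=1)   # Beta(4, 3)
print(post.mean())                             # posterior mean of p, ~0.571
```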

  34. Conjugate priors • Exist for many standard distributions • formula for exponential family conjugacy • Define prior in terms of fictitious observations • Beta is conjugate to Bernoulli (coin-flipping) • [Figure: Beta(FH, FT) prior densities for FH = FT = 1, FH = FT = 3, and FH = FT = 1000]

  35. Normalizing constants • Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT) • Normalizing constant for Beta distribution: B(FH, FT) = ∫ p^(FH-1) (1-p)^(FT-1) dp = Γ(FH) Γ(FT) / Γ(FH+FT) • Posterior: P(p|D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH+FH, NT+FT) • Hence marginal likelihood is P(D) = B(NH+FH, NT+FT) / B(FH, FT)
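  A minimal sketch of this evidence computation in log space (the helper name log_evidence is an assumption; the comparison with P(D|H1) = 1/2^N anticipates the next slide):

```python
# Marginal likelihood P(D|H2) = B(N_H + F_H, N_T + F_T) / B(F_H, F_T),
# compared with the simple hypothesis P(D|H1) = 1/2^N.
import numpy as np
from scipy.special import betaln

def log_evidence(data, F_H=1, F_T=1):
    N_H = data.count("H")
    N_T = data.count("T")
    return betaln(N_H + F_H, N_T + F_T) - betaln(F_H, F_T)

for d in ["HHTHT", "HHHHH"]:
    p_d_h1 = 0.5 ** len(d)              # fair coin
    p_d_h2 = np.exp(log_evidence(d))    # averaged over p with a uniform prior
    print(d, p_d_h1, p_d_h2)            # HHTHT favours H1; HHHHH favours H2
```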

  36. Comparing simple and complex hypotheses • P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] • Computing P(D|H1), the likelihood for H1, is easy: P(D|H1) = 1/2^N • Compute P(D|H2), the marginal likelihood (“evidence”) for H2, by averaging over p: P(D|H2) = ∫ P(D|p) P(p|H2) dp

  37. Marginal likelihood for H1 and H2 • [Plot: probability of the data as a function of p] • Marginal likelihood is an average over all values of p

  38. Sensitivity to hyper-parameters

  39. Bayesian model selection • Simple and complex hypotheses can be compared directly using Bayes’ rule • requires summing over latent variables • Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor” • Maximum likelihood cannot be used for model selection (always prefers hypothesis with largest number of parameters)

  40. Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach?

  41. Example: Belgian euro-coins • A Belgian euro spun N=250 times came up heads X=140. • “It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in Guardian, 2002) Source: Mackay exercise 3.15

  42. Classical hypothesis testing • Null hypothesis H0, e.g. θ = 0.5 (unbiased coin) • For classical analysis, we don’t need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5 • Need a decision rule that maps data D to accept/reject of H0 • Define a scalar measure of deviance d(D) from the null hypothesis, e.g., NH or χ²

  43. P-values • Define the p-value of a threshold t as p(t) = P(d(D) ≥ t | H0) • Intuitively, the p-value of the data is the probability of getting data at least as extreme, given H0

  44. P-values • Define the p-value of a threshold t as p(t) = P(d(D) ≥ t | H0) • Intuitively, the p-value of the data is the probability of getting data at least as extreme, given H0 • Usually choose the rejection region R (i.e., the threshold t) so that the false rejection rate of H0 is below the significance level α = 0.05

  45. P-values • Define the p-value of a threshold t as p(t) = P(d(D) ≥ t | H0) • Intuitively, the p-value of the data is the probability of getting data at least as extreme, given H0 • Usually choose the rejection region R (i.e., the threshold t) so that the false rejection rate of H0 is below the significance level α = 0.05 • Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞
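  A minimal Monte Carlo sketch of this p-value definition; the deviance d(D) = |NH - N/2| used here is an illustrative choice, not taken from the slides:

```python
# Monte Carlo estimate of the p-value P(d(D') >= d(D) | H0) for coin data.
import numpy as np

rng = np.random.default_rng(0)

def p_value_mc(n_heads, n_flips, n_sims=200_000):
    d_obs = abs(n_heads - n_flips / 2)               # distance from expected head count
    sims = rng.binomial(n_flips, 0.5, size=n_sims)   # datasets generated under H0
    return np.mean(np.abs(sims - n_flips / 2) >= d_obs)

print(p_value_mc(140, 250))   # roughly 0.07, as in the euro-coin example
```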

  46. P-value for euro coins • N = 250 trials, X = 140 heads • P-value is “less than 7%” • If N = 250 and X = 141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5% • This does not mean P(H0|D) = 0.07! • MATLAB: pval = (1-binocdf(139,n,0.5)) + binocdf(110,n,0.5)
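  An illustrative Python/scipy translation of that binocdf expression (the variable names are assumptions), reproducing both quoted p-values:

```python
# Exact two-sided p-values under the fair-coin null.
from scipy.stats import binom

n = 250
pval_140 = binom.sf(139, n, 0.5) + binom.cdf(110, n, 0.5)   # X = 140: P(X>=140) + P(X<=110)
pval_141 = binom.sf(140, n, 0.5) + binom.cdf(109, n, 0.5)   # X = 141: P(X>=141) + P(X<=109)
print(pval_140, pval_141)   # ~0.066 ("less than 7%") and ~0.0497
```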

  47. Bayesian analysis of euro-coin • Assume P(H0) = P(H1) = 0.5 • Assume P(p) ~ Beta(α, α) • Setting α = 1 yields a uniform (non-informative) prior.

  48. Bayesian analysis of euro-coin • If α = 1, the Bayes factor B = P(D|H1)/P(D|H0) ≈ 0.48, so H0 (unbiased) is (slightly) more probable than H1 (biased). • By varying α over a large range, the best we can do is make B = 1.9, which does not strongly support the biased-coin hypothesis. • Other priors yield similar results. • Bayesian analysis contradicts classical analysis.
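  A minimal sketch of this Bayes-factor computation, assuming the Beta(α, α) prior family from the previous slide (the helper name log_bayes_factor and the α grid are assumptions):

```python
# Bayes factor B = P(D|H1) / P(D|H0) for the euro-coin data.
import numpy as np
from scipy.special import betaln

N, X = 250, 140   # 250 spins, 140 heads

def log_bayes_factor(a):
    log_p_d_h1 = betaln(X + a, N - X + a) - betaln(a, a)   # marginal likelihood of H1
    log_p_d_h0 = N * np.log(0.5)                           # likelihood under H0
    return log_p_d_h1 - log_p_d_h0

print(np.exp(log_bayes_factor(1.0)))                       # ~0.48: H0 slightly favoured
alphas = np.logspace(-2, 4, 500)
print(np.exp(max(log_bayes_factor(a) for a in alphas)))    # maximum over alpha is about 1.9
```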

  49. Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach?

  50. Outline • Hypothesis testing – Bayesian approach • Hypothesis testing – classical approach • What’s wrong with the classical approach? • Violates likelihood principle • Violates stopping rule principle • Violates common sense
