Explore Occam's razor principle, Copernican vs. Ptolemaic views, Bayesian model comparison, and Occam factor in statistical models. Understand how simplicity influences theory selection and data modeling. Learn to distinguish between competing models using Bayesian techniques.
Occam’s razor • "All things being equal, the simplest solution tends to be the best one," or alternately, "the simplest explanation tends to be the right one." In other words, when multiple competing theories are equal in other respects, the principle recommends selecting the theory that introduces the fewest assumptions and postulates the fewest hypothetical entities. It is in this sense that Occam's razor is usually understood. • Wikipedia
Copernican versus Ptolemaic View of the Universe • Copernicus proposed a model of the solar system in which the earth revolved around the sun. Ptolemy (around 1400 years earlier) had proposed a theory of the universe in which planetary bodies revolved around the earth; he used ‘epicycles’ to account for the observed motions. • Copernicus’s theory ‘won’ because it was a simpler framework from which to explain astronomical motion. Epicycles also ‘explained’ astronomical motion, but they employed an unnecessarily complex framework that could not properly predict that motion.
Boxes behind a tree • In the figure, are there one or two boxes behind the tree? • A one-box theory does not need to assume a coincidence, such as two boxes that happen to be of identical height, and it explains the data as we see it. • A two-box theory does assume that unlikely coincidence, but it also explains the data as we see it.
Statistical Models • Statistical models are designed to describe data by postulating that the data X=(x1,...,xn) follow a density f(X|Θ) belonging to a class {f(X|Θ): Θ} (possibly nonparametric). • For a given parameter Θ0, we can compare the likelihood of two data values X01 and X02 via the ratio f(X01|Θ0)/f(X02|Θ0). If this is >1, then the first datum is more likely; if <1, then the second is more likely.
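The ratio f(X01|Θ0)/f(X02|Θ0) can be sketched in a few lines. This is only an illustration: the Normal(Θ0, 1) family, the function names and the numeric values are assumptions, not part of the slides.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of a Normal(mean, sd) distribution at x (assumed model family)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_ratio(x1, x2, theta0):
    """f(x1 | theta0) / f(x2 | theta0) under the assumed Normal(theta0, 1) model."""
    return normal_pdf(x1, theta0) / normal_pdf(x2, theta0)

theta0 = 0.0                                  # hypothetical parameter value
ratio = likelihood_ratio(0.5, 2.0, theta0)    # hypothetical data values
print("ratio > 1:", ratio > 1)                # the datum nearer the mean is more likely
```

A ratio above 1 says the first datum is the more likely one under Θ0; the parameter is held fixed and only the data values change.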
Bayesian Model Comparison • We evaluate statistical models via Bayes’ theorem: P(M|X) = P(X|M) P(M) / P(X). • The term ‘P(X|M)’ is the likelihood; the term ‘P(M)’ is the prior; the term ‘P(X)’ is the marginal density of the data. • When comparing two models M1 and M2, we need only look at the ratio P(M1|X)/P(M2|X) = [P(X|M1) P(M1)] / [P(X|M2) P(M2)], in which P(X) cancels.
Bayesian Model Comparison (continued) • In comparing the two models, the term ‘P(X|M)’ measures how well model M explains the data. In our tree example, both the one-box and two-box theories explain the data well, so the likelihoods do not help us decide between them. • But the prior probability ‘P(M1)’ of the one-box theory is much larger than the prior probability ‘P(M2)’ of the two-box theory, so we prefer the one-box theory to the two-box theory. • Note that methods based on the likelihood alone, such as the MLE, have no preference between the one-box and two-box theories.
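The box argument amounts to posterior odds = likelihood ratio × prior ratio. A minimal sketch follows; the likelihood and prior values are hypothetical numbers chosen only to mimic "equal fit, unequal priors" and do not come from the slides.

```python
def posterior_odds(lik1, lik2, prior1, prior2):
    """P(M1|X) / P(M2|X); the marginal P(X) cancels in the ratio."""
    return (lik1 / lik2) * (prior1 / prior2)

# Hypothetical values: both theories explain the picture equally well,
# but the two-box theory needs a coincidence, so its prior is smaller.
odds = posterior_odds(lik1=1.0, lik2=1.0, prior1=0.9, prior2=0.1)
print("posterior odds for one box:", odds)
```

With equal likelihoods the posterior odds equal the prior odds, which is why a likelihood-only method cannot separate the two theories.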
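Model Comparison when parameters are present • If parameters Θ are present, we want to use the marginal likelihood P(X|M) = ∫ P(X|Θ,M) P(Θ|M) dΘ. • This is the average score of the data, averaged over the prior on Θ. • Calculus shows (see the appendix) that, up to a constant factor of (2π)^(d/2) for d parameters, P(X|M) ≈ P(X|Θ̂,M) · P(Θ̂|M) |Σ|^(1/2), where Θ̂ is the best-fitting parameter value and Σ is the posterior covariance (the inverse of the information).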
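The average score and its Laplace approximation can be compared numerically in a one-parameter case. This is a sketch under assumptions that are not in the slides: data y_i ~ Normal(θ, 1), prior θ ~ Normal(0, 1), and made-up data values. For this Gaussian setup the Laplace approximation, including the (2π)^(1/2) factor, is exact, so the two numbers should agree closely.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood(theta, data):
    """P(X | theta, M) for independent Normal(theta, 1) observations."""
    p = 1.0
    for y in data:
        p *= normal_pdf(y, theta)
    return p

def marginal_numeric(data, lo=-10.0, hi=10.0, steps=20000):
    """Trapezoid rule for the integral of likelihood * prior over theta."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * likelihood(theta, data) * normal_pdf(theta)
    return total * h

def marginal_laplace(data):
    """likelihood(mode) * prior(mode) * sqrt(2*pi) * posterior sd."""
    n = len(data)
    theta_hat = sum(data) / (n + 1)       # posterior mode for this conjugate setup
    post_sd = 1.0 / math.sqrt(n + 1)      # inverse square root of the information
    occam = normal_pdf(theta_hat) * math.sqrt(2.0 * math.pi) * post_sd
    return likelihood(theta_hat, data) * occam

data = [0.3, -0.5, 1.2, 0.8]              # hypothetical observations
print(marginal_numeric(data), marginal_laplace(data))
```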
The Occam factor • Now, if two models M1, M2 explain the data equally well, their ‘likelihood’ scores P(X|Θ̂,M) are similar, and the comparison is decided by the ‘Occam factor’ P(Θ̂|M) |Σ|^(1/2): the prior density at the fitted parameters times the posterior uncertainty. • Each additional parameter multiplies the Occam factor by a further term of this kind, which is typically less than 1. The model with more parameters therefore receives the smaller Occam factor, and we prefer the simpler model.
Example of Model Comparison when Parameters are present • Say we want to choose between two regression models for a set of bivariate data. The first is a linear model and the second is a polynomial model involving terms up to the fourth power. • The second always fits the data at least as well as the first. But the Occam factor penalizes the second more heavily than the first, because each additional parameter adds uncertainty that must be averaged over. • Note that a classical comparison based on the maximized likelihood alone always rates the second at least as highly as the first.
An example • The data: (-8,8), (-2,10), (6,11) (see next) • The models under comparison: H0: y=β0+ε; H1: y=β0+β1x+ε • Parameters have simple Gaussian priors and σε=1. • Score[0] = φ{√3 σY} φ{Ȳ} (1/√3) = 1.5×10^-23 • Score[1] = φ{√3 σY √(1-ρ²)} φ{b0} φ{b1} (1/[3σX]) = 0.71×10^-24 • Score[1]/Score[0] = 0.71/15 = 0.05
Example Explained H0 • Score[0] = φ{√3 σY} φ{Ȳ} (1/√3) = 1.5×10^-23 • Ȳ is the average of the Y’s; φ is the Gaussian density. • φ{√3 σY} is the likelihood under the null model (evaluated at the MLE) • φ{Ȳ} is the prior under the null model (evaluated at the MLE) • (1/√3) is the inverse of the square root of the information.
Example Explained H1 • Score[1] = φ{√3 σY √(1-ρ²)} φ{b0} φ{b1} (1/[3σX]) • φ{√3 σY √(1-ρ²)} is the likelihood under the alternative model (evaluated at the MLE) • φ{b0} φ{b1} is the prior under the alternative model (evaluated at the MLE) • b0, b1 are the usual least-squares estimates of β0, β1. • (1/[3σX]) is the inverse of the square root of the information.
Classical Statistics falls short • Comparing the likelihoods (at the MLEs) without regard to the Occam factor gives: • Classical Null Score = φ{√3 σY} = 0.012 • Classical Alt Score = φ{√3 σY √(1-ρ²)} = 0.3146 • On this comparison the alternative model is to be preferred. But we can see from the picture that it isn’t a very good model, and it adds complexity that serves no good purpose.
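The score formulas and the likelihood-only comparison can be recomputed from the three data points. A minimal sketch: it assumes φ is the standard Gaussian density and that σX, σY are sample standard deviations with divisor n-1 (this reproduces the slide's σx = 7.02 and σy = 1.53). Variable names are illustrative, and the results may differ slightly from the rounded figures on the slides.

```python
import math

def phi(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

xs = [-8.0, -2.0, 6.0]
ys = [8.0, 10.0, 11.0]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

sd_x = math.sqrt(sxx / (n - 1))     # assumed divisor n-1
sd_y = math.sqrt(syy / (n - 1))
rho2 = sxy ** 2 / (sxx * syy)       # squared correlation
b1 = sxy / sxx                      # least-squares slope
b0 = mean_y - b1 * mean_x           # least-squares intercept

# Likelihood-only ("classical") scores
lik0 = phi(math.sqrt(n) * sd_y)
lik1 = phi(math.sqrt(n) * sd_y * math.sqrt(1.0 - rho2))

# Full scores: likelihood * prior at the estimate * inverse root information
score0 = lik0 * phi(mean_y) / math.sqrt(n)
score1 = lik1 * phi(b0) * phi(b1) / (n * sd_x)

print("likelihood only:", lik0, lik1)
print("scores:", score0, score1, "ratio:", score1 / score0)
```

The likelihood-only comparison favors the line, while the full score ratio comes out near 0.05 in favor of the constant model.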
Stats for the linear model • σx = 7.02; σy = 1.53; mean(y) = 9.67; mean(x) = -1.33 • b0 = 9.9459, confidence interval (5.5683, 14.3236) • b1 = 0.2095, confidence interval (-0.5340, 0.9530) • Residuals: -0.2703, 0.4730, -0.2027 • Residual confidence intervals: (-3.7044, 3.1638), (-5.5367, 6.4827), (-2.7783, 2.3729) • R² = 0.9276; F = 12.8133; p-value of the F test = 0.1734; residual variance = 0.3378
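The point estimates in this table follow from the ordinary least-squares formulas. The sketch below recomputes them from the three data points; the variable names are illustrative, and the confidence intervals are not reproduced because they need t-distribution quantiles.

```python
xs = [-8.0, -2.0, 6.0]
ys = [8.0, 10.0, 11.0]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

b1 = sxy / sxx                       # slope
b0 = mean_y - b1 * mean_x            # intercept
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

sse = sum(r * r for r in residuals)  # residual sum of squares
r2 = 1.0 - sse / syy                 # coefficient of determination
resid_var = sse / (n - 2)            # residual variance on n-2 degrees of freedom
f_stat = (syy - sse) / resid_var     # F statistic for the slope (1 and n-2 df)

print(b0, b1, residuals, r2, f_stat, resid_var)
```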
Dice Example • We roll a die 30 times and observe the face counts [4,4,3,3,7,9]. Is it a fair die? Would you be willing to gamble using it? • H0: p1=…=p6=1/6; H1: (p1,…,p6) ∼ Dir(1,…,1) • What does the chi-squared goodness-of-fit test say? The chi-squared p-value is 31%, so we would not reject the null in this case. • What does Bayes theory say? The score under H0 is the multinomial probability of the observed counts with every pi = 1/6, namely [30!/(4!4!3!3!7!9!)] (1/6)^30.
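Dice Example (continued) • Under the alternative, the score is the multinomial probability of the counts averaged over the Dir(1,…,1) prior. • In this case, the Laplace approximation is slightly off. The exact answer is 30!·5!/35! ≈ 3×10^-6. • The exact score under the null is about 3×10^-5, so, roughly, the null is about 10 times as likely as the alternative. This agrees with the chi-squared test, which does not reject the null.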
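The dice numbers can be checked exactly with integer arithmetic. A minimal sketch: the closed-form survival function below is specific to 5 degrees of freedom and is used only to avoid a statistics library; the function and variable names are illustrative.

```python
import math

counts = [4, 4, 3, 3, 7, 9]
n = sum(counts)          # 30 rolls
k = len(counts)          # 6 faces

# Chi-squared goodness of fit against equal probabilities
expected = n / k
chi2 = sum((c - expected) ** 2 / expected for c in counts)

def chi2_sf_5df(x):
    """P(chi-squared with 5 df > x), closed form valid for 5 df only."""
    s = math.sqrt(x)
    dens = math.exp(-0.5 * x) / math.sqrt(2.0 * math.pi)
    return math.erfc(s / math.sqrt(2.0)) + 2.0 * dens * (s + s ** 3 / 3.0)

p_value = chi2_sf_5df(chi2)

# Number of roll sequences that give these counts
multinomial_coef = math.factorial(n)
for c in counts:
    multinomial_coef //= math.factorial(c)

# H0: fair die. H1: probabilities drawn from Dir(1,...,1), under which every
# possible count vector is equally likely, giving 1 / C(n+k-1, k-1).
score_null = multinomial_coef / k ** n
score_alt = 1 / math.comb(n + k - 1, k - 1)
bayes_factor_null = score_null / score_alt

print(chi2, p_value, score_null, score_alt, bayes_factor_null)
```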
Possible Project • Construct or otherwise obtain bivariate data which are essentially linearly related with noise. Assume linear and higher power models have equal prior probability. Calculate the average score for linear and higher order models. Show the average score for the linear model is best.
Another Possible Project • Generate multinomial data from a distribution with equal p’s. For the generated data, determine the chi-squared p-value and compare it to the Bayes factor favoring the null (true) hypothesis. Determine how the chi-squared p-values differ from their Bayes factor counterparts over many simulations.
Appendix: Laplace approximation • In the usual setting,
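A standard statement of the approximation is sketched below. The notation (h, Θ̂, d, Σ) is assumed here for consistency with the Occam factor slides and is not taken from the original appendix.

```latex
% h is the log of (likelihood x prior); \hat\Theta maximizes h; d = dim(\Theta)
h(\Theta) = \log P(X \mid \Theta, M) + \log P(\Theta \mid M), \qquad
\Sigma = \left[ -\nabla^2 h(\hat\Theta) \right]^{-1}

% Second-order Taylor expansion of h around \hat\Theta, then a Gaussian integral
P(X \mid M) = \int e^{h(\Theta)} \, d\Theta
\;\approx\; e^{h(\hat\Theta)} \int
  e^{-\frac{1}{2} (\Theta - \hat\Theta)^{\top} \Sigma^{-1} (\Theta - \hat\Theta)} \, d\Theta
= P(X \mid \hat\Theta, M) \, P(\hat\Theta \mid M) \, (2\pi)^{d/2} \, |\Sigma|^{1/2}
```

The last two factors, taken with P(Θ̂|M), make up the Occam factor; the worked examples in these slides omit the constant (2π)^(d/2).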
Possible Project • Fill in the mathematical steps involved in calculating the marginal distribution of the data, and compare the result to the Laplace approximation.