
Prior distribution



  1. Prior distribution “All models are wrong…but some models are useful.” G.E.P. Box

  2. Principles of Prior Selection
  • The most important principle of prior selection is that your prior should represent the best knowledge you have about the parameters of the problem before you look at the data.
  • Usually there is some information at your disposal:
  • You know that the distance to the Moon is more than 10,000 km and less than 1,000,000 km, so you are justified in setting your prior to zero outside that range.
  • The population of fish in a lake cannot exceed the number of fish that would fit if the entire volume of the lake were filled with them.
  • It is unjustified to use default, ignorance, or other automatic priors if you have substantial information that can affect the answer.

  3. Principles of Prior Selection
  • But sometimes you do not have substantial prior information! A number of principles have been used to construct priors in such cases:
  • Group invariance arguments
  • Maximum entropy arguments
  • Arguments from the Fisher information matrix
  • These “ignorance” priors generally reproduce the results of a corresponding frequentist analysis (with, of course, a Bayesian interpretation), so results using them cannot be worse than a frequentist analysis.
  • Of course, if you have real information, you can use this information in a way that frequentists can’t.

  4. Group Invariance
  • Here the idea is that if we know a priori that our prior knowledge should be invariant to the action of some underlying group, then we can choose our prior to respect that invariance.
  • Example: In rolling a die or flipping a coin, if we have no reason to think that any side is favored, then we are saying that our state of knowledge about the roll of the die/flip of the coin is invariant to the action of the permutation group on 2 (or 6) objects. Thus, if we were to mix up the sides of the die or coin by applying an arbitrary element of the permutation group, we would not change our prior.
  • The prior is constant on each side of the die/coin.

  5. Group Invariance
  • Other examples:
  • If we have an angular parameter, we may find that our prior is invariant to the action of the rotation group. Then our prior would be constant in angle.
  • If we have a pair of angular parameters (e.g., latitude, longitude) and believe that our prior is invariant to arbitrary space rotations (the group O(3)), then our prior will be constant per unit solid angle, i.e., uniform in d(cos θ) dφ.
  • If our prior represents a location and is invariant to translations of the origin of the axes, then our prior is the (improper) flat prior. Such invariance may be appropriate if physical considerations indicate it.

  6. Group Invariance
  • Other examples:
  • A more interesting example is if we are measuring a positive quantity such as a length, and our prior says that it should be invariant with respect to changes in the scale of the graduations of the ruler (e.g., it shouldn’t matter whether our ruler is graduated in inches or centimeters, our results should be physically the same).
  • Then the prior is the (improper) Jeffreys prior 1/θ.
  • This is the same as saying that the prior on λ = log θ is flat (translations in log θ are equivalent to multiplications of θ by an arbitrary scale factor).
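
A quick numerical check of this invariance (a sketch with made-up numbers, not from the slides): the mass the 1/θ prior assigns to a physical interval is the same whether the interval is expressed in inches or centimeters, because ∫ dθ/θ = log(b/a) depends only on the ratio b/a.

```python
import numpy as np
from scipy.integrate import quad

def jeffreys_mass(a, b):
    """Unnormalized mass of the 1/theta prior on [a, b]: integral of dtheta/theta."""
    return quad(lambda t: 1.0 / t, a, b)[0]

cm_per_inch = 2.54
a, b = 10.0, 25.0                                       # an interval measured in inches
print(jeffreys_mass(a, b))                              # log(25/10) = 0.9163
print(jeffreys_mass(a * cm_per_inch, b * cm_per_inch))  # same interval in cm: 0.9163
```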

  7. Group Invariance
  • An example of the usefulness of this Jeffreys prior is given by Benford’s Law (misnamed: it was actually discovered by the astronomer Simon Newcomb).
  • If you consider a collection of measurements of some quantity that is measured on an arbitrary multiplicative scale, such as the areas of political divisions, then the distribution of the first digits of these numbers is empirically found to be roughly logarithmic.

  8. Group Invariance
  • Example: Areas of 40 European countries (some units)

      Digit   Actual       Predicted
      1       10 (25%)     30%
      2       7  (17.5%)   18%
      3       6  (15%)     12%
      4       6  (15%)     10%
      5       3  (7.5%)    8%
      6       2  (5%)      7%
      7       2  (5%)      6%
      8       1  (2.5%)    5%
      9       3  (7.5%)    5%

  9. Group Invariance
  • This does not indicate that Mother Nature has ten fingers!
  • We would expect that the distribution of first digits should not depend on what units we use, e.g., square km, hectares, square feet, whatever. Our knowledge is scale invariant.
  • The distribution should be invariant to transformations x → cx of the scale (renormalization) group. So p(x) dx = p(cx) d(cx) for every c > 0, which forces p(x) ∝ 1/x.

  10. Group Invariance
  • Let a = 10^k be the lower endpoint of a decade of numbers. The relative proportion of numbers with first digit d in each decade (independent of the decade) is given by
      ∫ from da to (d+1)a of dx/x = log((d+1)/d) = log(1 + 1/d)
  • Evaluate this for all 9 first digits and add to get the normalization constant:
      Σ (d = 1 to 9) of log(1 + 1/d) = log 10
  • The expected frequency of first digit d is therefore
      P(d) = log(1 + 1/d) / log 10 = log10(1 + 1/d)
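
A short script comparing the formula just derived with the data (a sketch; the observed counts are the country-area data from the table above):

```python
import numpy as np

digits = np.arange(1, 10)
predicted = np.log10(1 + 1 / digits)                     # P(d) = log10(1 + 1/d)
observed = np.array([10, 7, 6, 6, 3, 2, 2, 1, 3]) / 40   # areas of 40 countries

for d, p, o in zip(digits, predicted, observed):
    print(f"digit {d}: predicted {p:.3f}, observed {o:.3f}")
```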

  11. Group Invariance
  • These examples illustrate that the existence of a natural symmetry group generates a natural prior, namely the distribution that is invariant under the action of the group.
  • When the group is compact this is the same as the (unique) Haar measure on the group.
  • When the group is not compact, the prior will be improper and can only be used if the posterior is proper. The prior will only be known up to a factor in this case.
  • Here we are ignoring the “ends” of the scale, e.g., the size of the universe, the size of the Earth, etc. The symmetry group is only approximate!
  • Some difficulties arise because there can be two different invariant measures on noncompact groups.

  12. Group Invariance
  • Things do get more complex when the group is not compact. As an example, consider the affine group (the group of scale changes and offsets) with the multiplication law
      (l1, s1)(l2, s2) = (l1 + s1 l2, s1 s2),  with s > 0
  • This is easily verified to be a (non-Abelian) group:
  • The product of any two elements of the group is a new element of the group
  • There is an identity, (0, 1)
  • Each element (l, s) has an inverse, (−l/s, 1/s)
  • The associative law is satisfied

  13. Group Invariance
  • Consider in particular a location parameter x and a scale parameter σ. Consider transforming them by a fixed element (l, s) of the affine group by multiplication on the right. We get
      (x′, σ′) = (x, σ)(l, s) = (x + σl, σs)
  • The infinitesimal volume element transforms as
      dx′ dσ′ = s dx dσ
  so that the invariant volume element (measure) is
      dx dσ / σ

  14. Group Invariance
  • This is the right-invariant Haar measure on the affine group. It can be taken as an appropriate prior on x and σ.
  • Show that the left-invariant Haar measure on the affine group is
      dx dσ / σ²
  • Hint: Work from the multiplication law
      (x′, σ′) = (l, s)(x, σ) = (l + sx, sσ)
  where we have multiplied by the constant element on the left.
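
A symbolic check of both claims (a sketch using sympy; the group law is the one written above): a measure f(x, σ) dx dσ is invariant under a map precisely when the density at the image point times the Jacobian determinant equals the original density.

```python
import sympy as sp

x, sigma, l, s = sp.symbols('x sigma l s', positive=True)

def pushforward(f, new_x, new_sigma):
    """f at the image point times the Jacobian determinant of the map.
    The measure f dx dsigma is invariant iff this equals f(x, sigma)."""
    J = sp.Matrix([new_x, new_sigma]).jacobian(sp.Matrix([x, sigma]))
    moved = f.subs([(x, new_x), (sigma, new_sigma)], simultaneous=True)
    return sp.simplify(moved * J.det())

right = (x + sigma * l, sigma * s)   # (x, sigma)(l, s): right multiplication
left  = (l + s * x,     s * sigma)   # (l, s)(x, sigma): left multiplication

print(pushforward(1 / sigma, *right))     # -> 1/sigma: dx dsigma/sigma is right-invariant
print(pushforward(1 / sigma**2, *left))   # -> sigma**(-2): dx dsigma/sigma**2 is left-invariant
```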

  15. Group Invariance
  • There is some controversy as to whether the left- or the right-invariant Haar measure is the correct one to use as a prior. Berger favors the right-invariant measure and Villegas the left-invariant measure.
  • Berger says that use of the right-invariant Haar measure avoids certain “marginalization paradoxes”, as well as giving improved results in other situations.
  • We note as well that Berger’s choice yields a prior that is the same as what we get if we multiply the flat prior on x by the Jeffreys prior on σ. That would be appropriate if we believed that x and σ were independent.

  16. Maximum Entropy
  • Another approach to prior selection, suggested by the late E.T. Jaynes, is to maximize the information entropy of the distribution, subject to whatever constraints you happen to know.
  • The entropy is supposed to be a measure of how much information we lack about the distribution: the larger the entropy, the less we know and can specify about the distribution.

  17. Maximum Entropy
  • Example: Suppose we wish to specify a binary number of n digits, and each number is equiprobable.
  • It takes exactly n binary bits to specify the number.
  • There are N numbers, running from 0 to 2^n − 1. Note that n = log2(N).
  • If we double the number of bits, we square the number of numbers that we can specify.
  • So the amount of information we gain when we learn the number is proportional to the number of bits it takes to specify it, i.e., to log2(N).

  18. Maximum Entropy
  • Similarly, if we have two sets of numbers of size M and N, and it takes m and n bits respectively to specify a given number in each set,
  • Then specifying one number from each set takes m + n = log2(M) + log2(N) = log2(MN) bits, so for a set of N equiprobable cases the information is H = k log N, where k is an arbitrary constant.
  • This suggests that information should be logarithmic in the number of equiprobable cases.

  19. Maximum Entropy
  • It is evident that we obtain more information if we learn that an improbable result is true than if we learn that a probable result is true. This suggests that information should be a function of the probabilities: H(p1, p2, …, pn).
  • Note that if we have M equiprobable cases, pj = 1/M and
      H = k log M = −k log pj = −Σj pj k log pj
  Think of each term as the expected entropy for case j: the entropy −k log pj times its probability pj.

  20. Maximum Entropy
  • Shannon proposed that a reasonable definition of the information potentially available in observing events of unequal probability pj would be
      H = −k Σj pj log pj
  • This is consistent with the equiprobable case, and it also gives greater potential information gain to events of lower probability, since the entropy of a low-probability case is −k log pj, which goes to infinity as pj goes to 0. But this case occurs only with probability pj, so its expected contribution to the entropy is −k pj log pj.

  21. Maximum Entropy
  • Example: Suppose we have 8 equiprobable cases. Then pj = 1/8. Under the definition (taking k = 1 and base-2 logarithms),
      H = −Σ (1/8) log2(1/8) = 3 bits
  • Suppose we divide the 8 cases into c1 = {1, 2} and c2 = {3, 4, 5, 6, 7, 8}, and suppose we learn first which of these sets is true and then learn which of the subcases is true. I.e., we may learn that c1 = {1, 2} is true, and then learn that of these possibilities 2 is true. The information gained is exactly equivalent to learning outright which of the 8 cases is true, so if our proposal makes sense we should be able to make this all consistent. This suggests
      H(1/8, …, 1/8) = H(1/4, 3/4) + (1/4) H(1/2, 1/2) + (3/4) H(1/6, …, 1/6)

  22. Maximum Entropy
  • Here’s why: the information gained upon learning the answer outright must equal the information gained upon learning which of c1 or c2 is true, plus the information gained upon learning which of the 2 cases in c1 is true given that c1 is true (but this is only observed with probability p(c1) = 1/4, so the expected gain is 1/4 of that term), plus the information gained upon learning which of the 6 cases in c2 is true given that c2 is true (but this is only observed with probability 3/4).

  23. Maximum Entropy
  • Substituting into the proposed equation we find
      H(1/4, 3/4) + (1/4) H(1/2, 1/2) + (3/4) H(1/6, …, 1/6) = 0.811 + 0.250 + 1.939 = 3 bits
  • Direct evaluation (H = log2 8 = 3 bits) shows that this agrees.
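
The same check in a few lines of code (a sketch; k = 1 and base-2 logs, as above):

```python
from math import log2

def H(probs):
    """Shannon entropy -sum p log2(p), in bits."""
    return -sum(p * log2(p) for p in probs)

outright = H([1/8] * 8)                                               # 3.0 bits
staged = H([1/4, 3/4]) + (1/4) * H([1/2, 1/2]) + (3/4) * H([1/6] * 6)
print(outright, staged)                                               # both 3.0
```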

  24. Maximum Entropy
  • Thus we generalize:
      H(p1, p2) = −p1 log p1 − p2 log p2
  to
      H(p1, p2, …, pN) = −Σk pk log pk
  and, for the continuous case, to
      H = −∫ p(x) log p(x) dx

  25. Maximum Entropy
  • Example: Suppose we have a finite state space with n possibilities, and have no additional knowledge. The principle of Maximum Entropy (MAXENT) suggests maximizing the information entropy, subject to the side constraint that the total probability be equal to 1. We can do this by introducing a Lagrange multiplier λ and extremizing
      −Σj pj log pj + λ(Σj pj − 1)
  Setting the derivative with respect to each pj to zero gives −log pj − 1 + λ = 0, so every pj is the same constant, and normalization gives pj = 1/n.
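
A numerical version of the same calculation (a sketch, not from the slides), handing the constrained maximization to scipy instead of doing the Lagrange algebra by hand:

```python
import numpy as np
from scipy.optimize import minimize

n = 6
neg_entropy = lambda p: np.sum(p * np.log(p))             # minimize -H(p)
normalization = {'type': 'eq', 'fun': lambda p: np.sum(p) - 1.0}

res = minimize(neg_entropy,
               x0=np.random.dirichlet(np.ones(n)),        # random starting point
               bounds=[(1e-9, 1.0)] * n,
               constraints=[normalization])
print(res.x)                                              # ~[1/6, 1/6, ..., 1/6]
```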

  26. Maximum Entropy
  • Example: What is the continuous distribution that maximizes the information entropy, given that the mean and variance are known?

  27. Maximum Entropy
  • Introduce Lagrange multipliers a, b, c and extremize
      −∫ p log p dx + a(∫ p dx − 1) + b(∫ x p dx − μ) + c(∫ (x − μ)² p dx − σ²)
  • Taking the variation δp(x),
      −log p − 1 + a + bx + c(x − μ)² = 0,  so  p(x) = exp(a − 1 + bx + c(x − μ)²)
  • Thus the distribution is Normal (the exponential of a quadratic in x, with c < 0), and applying the constraints it is N(μ, σ²).

  28. Maximum Entropy
  • Thus the normal distribution is the one that maximizes our uncertainty, given a fixed mean and variance. This is one more reason why the normal distribution is so important: it tells us less about the data, given only that we know the mean and variance, than any other continuous distribution.
  • The information entropy of the normal distribution can be calculated to be
      H = (1/2) log(2πeσ²)
  • Show this!
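
As a sanity check on that closed form (a sketch, not a solution to the exercise), one can compare it against a direct numerical integration of −∫ p log p dx for a Gaussian:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 2.5
pdf = norm(loc=1.0, scale=sigma).pdf
numeric = quad(lambda x: -pdf(x) * np.log(pdf(x)), -np.inf, np.inf)[0]
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(numeric, closed_form)   # agree to quadrature accuracy
```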

  29. Maximum Entropy
  • An interesting example of using maximum entropy was given by Jaynes: the astronomer Wolf (the same one who invented sunspot numbers) had a pair of dice that he tossed repeatedly over the years, recording the outcomes. He obtained (20,000 tosses of each die):

      Face    White die N (p)     Red die N (p)
      1       3246 (0.16230)      3407 (0.17035)
      2       3449 (0.17245)      3631 (0.18155)
      3       2897 (0.14485)      3176 (0.15880)
      4       2841 (0.14205)      2916 (0.14580)
      5       3635 (0.18175)      3448 (0.17240)
      6       3932 (0.19660)      3422 (0.17110)

  30. Maximum Entropy
  • The dice do not appear to be fair, and the white and red dice appear to be unfair in different ways.
  • Jaynes proposed the following physical causes:
  • The excavations for the spots made faces with larger numbers lighter than those with smaller ones, causing them to come up more frequently
  • One axis was longer than the other two, causing the faces on its ends to be seen less frequently
  • (On the white die) there might have been a chip on the 2-3-6 corner, making these three faces come up more frequently

  31. Maximum Entropy
  • Jaynes maximized the information entropy, subject to constraints that correspond to these physical situations. For example, he proposed that the deviation in the frequency with which a face comes up, relative to a fair die, is proportional to (T − B), where T is the number of spots on the top and B the number of spots on the bottom when a particular face is up. The analysis also predicted that both dice were manufactured so that they were longer along the 3-4 axis.
  • Much later, the actual dice were found in the archives of Wolf’s observatory. Measurement of the dice confirms Jaynes’ analysis of their physical characteristics, including the suspected chip.

  32. Maximum Entropy
  • Example of how Jaynes did this: Maximize the entropy
      H = −Σj pj log pj
  subject to the constraints
      Σj pj = 1  and  Σj j pj = 3.5983
  where 3.5983 is the observed mean number of spots for the white die. (For a fair die the sum would be 3.5.)

  33. Maximum Entropy
  • Solution: Introduce Lagrange multipliers λ and μ. Find an extremum of
      −Σj pj log pj + λ(Σj pj − 1) + μ(Σj j pj − 3.5983)
  with solution
      pj = exp(μj) / Σk exp(μk) ≈ (1 + μj) / Σk (1 + μk)
  for small μ. Note that for a fair die all the p’s are equal, so μ would be zero. Thus we get a linear equation for μ, and I get μ = 0.0382. The exact solution is μ = 0.03373.
  • Calculate the probabilities implied by this analysis of the white die. Compare with the actual probabilities.
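
The quoted numbers can be reproduced in a few lines (a sketch; the target mean 3.5983 is the white-die value from the constraint above):

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target = 3.5983                               # observed mean, white die

def tilted_mean(mu):
    """Mean of p_j = exp(mu*j)/Z, the exact MAXENT solution."""
    w = np.exp(mu * faces)
    return np.dot(faces, w) / w.sum()

exact = brentq(lambda mu: tilted_mean(mu) - target, -1.0, 1.0)
# Linearized: p_j ∝ (1 + mu*j) turns the constraint into a linear equation in mu
linear = (6 * target - 21) / (91 - 21 * target)
print(exact, linear)                          # ~0.03373 and ~0.0382
```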

  34. Maximum Entropy
  • The constraint implied by the “long axis” proposal would increase the frequencies of 1, 2, 5 and 6 by an amount ε, while decreasing the frequencies of 3 and 4 by an amount 2ε. (The factor 2 comes in to maintain the normalization.)
  • This corresponds to a constraint
      (p1 + p2 + p5 + p6) − 2(p3 + p4) = observed value  (= 0.1393 for the white die)
  • What probabilities are obtained with both constraints on the white die? How do they agree with the observed frequencies?

  35. Jeffreys Priors
  • Harold Jeffreys proposed another general procedure for picking priors. He suggested using
      p(θ) ∝ √det I(θ)
  where
      I(θ)ij = −E[∂² log p(x|θ) / ∂θi ∂θj]
  is the Fisher information matrix of p(x|θ). Here θ may be a vector of parameters, and the expectation is taken over x.

  36. Jeffreys Priors
  • The Jeffreys prior has the very nice property that it is invariant to parameterization changes: if φ = h(θ), then the Fisher information transforms as I(φ) = Jᵀ I(θ) J, where J = ∂θ/∂φ is the Jacobian, so
      √det I(φ) = √det I(θ) |det J|
  which is exactly the rule for transforming a density on θ into a density on φ.
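
A symbolic illustration of this invariance (a sketch using sympy with an exponential model of my choosing, not an example from the slides): computing the Jeffreys prior directly in a new parameterization agrees with transforming the Jeffreys prior from the old one.

```python
import sympy as sp

x = sp.symbols('x', positive=True)
lam, theta = sp.symbols('lambda theta', positive=True)

def jeffreys(pdf, param):
    """sqrt(I(param)) with I = -E[d^2 log pdf / d param^2], E taken over x."""
    d2 = sp.diff(sp.log(pdf), param, 2)
    info = -sp.integrate(d2 * pdf, (x, 0, sp.oo))
    return sp.sqrt(sp.simplify(info))

prior_rate  = jeffreys(lam * sp.exp(-lam * x), lam)        # -> 1/lambda
prior_scale = jeffreys(sp.exp(-x / theta) / theta, theta)  # -> 1/theta

# Transform prior_rate to the scale parameterization theta = 1/lambda:
transformed = prior_rate.subs(lam, 1 / theta) * sp.Abs(sp.diff(1 / theta, theta))
print(sp.simplify(transformed - prior_scale))              # 0: they agree
```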

  37. Jeffreys Priors
  • Example: x1, x2, …, xn ~ N(μ, σ²), where μ is unknown but the variance σ² is known. Then
      I(μ) = n/σ²
  which is constant in μ, so the Jeffreys prior is the flat prior p(μ) ∝ 1.

  38. Jeffreys Priors
  • Example: x1, x2, …, xn ~ N(μ, σ²), where μ is known but the variance σ² is unknown. Then
      I(σ) = 2n/σ²
  so the Jeffreys prior is p(σ) ∝ 1/σ.

  39. Jeffreys Priors
  • Example: x1, x2, …, xn ~ N(μ, σ²), where both μ and σ² are unknown. Then the Fisher information matrix is
      I(μ, σ) = [ n/σ²    0
                  0       2n/σ² ]
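
A symbolic verification of that matrix (a sketch using sympy, for a single observation, so n = 1; the n-sample matrix is just n times this):

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
sigma = sp.symbols('sigma', positive=True)

logp = -sp.log(sigma) - sp.log(2 * sp.pi) / 2 - (x - mu)**2 / (2 * sigma**2)
pdf = sp.exp(logp)

params = [mu, sigma]
I = sp.zeros(2, 2)
for i in range(2):
    for j in range(2):
        d2 = sp.diff(logp, params[i], params[j])
        # Fisher information: minus the expectation of the second derivative
        I[i, j] = sp.simplify(-sp.integrate(d2 * pdf, (x, -sp.oo, sp.oo)))

print(I)                    # Matrix([[1/sigma**2, 0], [0, 2/sigma**2]])
print(sp.sqrt(I.det()))     # sqrt(2)/sigma**2, i.e. proportional to 1/sigma**2
```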

  40. Jeffreys Priors
  • Example: x1, x2, …, xn ~ N(μ, σ²), where both μ and σ² are unknown. This leads to the prior on μ and σ
      p(μ, σ) ∝ √det I(μ, σ) ∝ 1/σ²
  • This is not what we would have expected if we thought that the priors on μ and σ should be independent! (It is what we would get if we used the left-invariant Haar measure, but that is rejected for other reasons.)

  41. Jeffreys Priors
  • Jeffreys himself thought this result inconsistent with the previous two, and preferred the prior obtained by assuming independence of μ and σ (the right-invariant Haar prior):
      p(μ, σ) = p(μ) p(σ) ∝ 1/σ
  • This is the independence Jeffreys prior for this problem, and it is the one that we shall use (following Berger).

  42. Jeffreys Priors
  • Discussion: Choosing “ignorance” priors is by no means easy or straightforward.
  • Arguments based on group symmetry seem the most secure from a logical point of view, but they depend on there being a natural symmetry.
  • Maximum entropy is appealing; however, the priors that result are not invariant to a change of variable, so they depend implicitly on a (subjective) judgement about a natural parameterization of the problem.

  43. Jeffreys Priors
  • Jeffreys priors are also suspect because the posterior distribution depends on the form of the sampling, and thus their use may violate the Likelihood Principle (LP).
  • In particular, the Jeffreys priors for binomial and negative binomial data are different!
  • Finally, use of such priors will not take into account real prior information that we may have, so again we emphasize: if you have information, you should use it, and not one of these “uninformative” or “automatic” priors.

  44. Informed Vague Priors
  • Sometimes one may have real information that can lead to a prior, but the prior will still be “vague”, or spread out.
  • Example: If the Sun were surrounded by a spherically symmetric distribution of stars, then the number of stars in a shell of width dr would be proportional to r² dr. If we were estimating the distance to a star by some means, it would be appropriate to use this as a prior on the distance.
  • For many years astronomers failed to recognize this, and thus a bias was built into the stellar distance scale, the so-called Lutz-Kelker bias, though Trumpler and Weaver were apparently aware of it in the 1930s. The stars are on average farther away than their measured distances would indicate.
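
A toy illustration of the direction of the effect (a sketch with invented numbers: a direct distance measurement with Gaussian error, combined with the r² dr volume prior):

```python
import numpy as np
from scipy.integrate import quad

r_obs, err = 100.0, 20.0               # measured distance and its standard error

def unnorm_posterior(r):
    """r^2 volume prior times a Gaussian likelihood for the measurement."""
    return r**2 * np.exp(-0.5 * ((r_obs - r) / err)**2)

Z = quad(unnorm_posterior, 0.0, np.inf)[0]
post_mean = quad(lambda r: r * unnorm_posterior(r), 0.0, np.inf)[0] / Z
print(post_mean)   # ~107.7: on average the star is farther than the raw measurement
```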

  45. Informed Vague Priors
  • In the galaxy, the density of some groups of stars falls off roughly exponentially with distance from the galactic plane. Combined with the volume factor, this suggests a prior on the distance r of the form
      p(r) ∝ r² exp(−r |sin β| / z0)
  where β is the star’s galactic latitude (so r sin β is its height above the plane) and z0 is the scale height for the exponential falloff of density for the group of stars in question.

  46. Conjugate priors
  • Sometimes constructing noninformative priors can be difficult.
  • We might not have any physical information either, to help us choose an informed prior.
  • In such cases, we may opt for analytical convenience and choose a conjugate prior.
  • Let F be a class of sampling distributions, and P a class of prior distributions.
  • P is a natural conjugate family for F if, for every likelihood in F and every prior in P, the resulting posterior distribution is also in P.

  47. Conjugate priors
  • Why conjugate priors?
  • Computational convenience: the posterior has the same functional form as the prior, so updating reduces to updating parameters
  • They can be interpreted as additional data: the prior acts like pseudo-observations seen before the experiment
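
A standard sketch of both points for the Beta-binomial pair (my example; the slides name no specific family): updating is just parameter addition, and the prior parameters act like pseudo-counts of heads and tails.

```python
from scipy.stats import beta

a, b = 2.0, 2.0                  # Beta(2, 2) prior: like a few pseudo-tosses
heads, tails = 7, 3              # observed binomial data

posterior = beta(a + heads, b + tails)   # Beta(9, 5): still in the Beta family
print(posterior.mean())                  # (a + heads)/(a + b + heads + tails) = 0.643
```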
