• 1.16k likes • 1.39k Views
Statistical inference for astrophysics. A short course for astronomers Cardiff University 16-17 Feb 2005 Graham Woan, University of Glasgow. Lecture Plan. Why statistical astronomy? What is probability? Estimating parameters values the Bayesian way the Frequentist way
E N D
Statistical inference for astrophysics A short course for astronomers Cardiff University 16-17 Feb 2005Graham Woan, University of Glasgow Cardiff Feb 16-17 2005
Lecture Plan • Why statistical astronomy? • What is probability? • Estimating parameters values • the Bayesian way • the Frequentist way • Testing hypotheses • the Bayesian way • the Frequentist way • Assigning Bayesian probabilities • Monte-Carlo methods Lectures 1 & 2 Lectures 3 & 4 Cardiff Feb 16-17 2005
Why statistical astronomy? Generally, statistics has got a bad reputation “There are three types of lies: lies, damned lies and statistics” Mark Twain Benjamin Disraeli Often for good reason: Jun 3rd 2004 … two researchers at the University of Girona in Spain, have found that 38% of a sample of papers in Nature contained one or more statistical errors… The Economist Cardiff Feb 16-17 2005
Data Analysis Meat Dishes Soups Why statistical astronomy? Data analysis methods are often regarded as simple recipes… http://www.numerical-recipes.com/ Cardiff Feb 16-17 2005
Why statistical astronomy? Data analysis methods are often regarded as simple recipes… …but in analysis as in life, sometimes the recipes don’t work as you expect. • Low number counts • Distant sources • Correlated ‘residuals’ • Incorrect assumptions • Systematic errors and/or • misleading results Cardiff Feb 16-17 2005
Why statistical astronomy? …and the tools can be clunky: " The trouble is that what we [statisticians] call modern statistics was developed under strong pressure on the part of biologists. As a result, there is practically nothing done by us which is directly applicable to problems of astronomy." Jerzy Neyman, founder of frequentist hypothesis testing. Cardiff Feb 16-17 2005
Why statistical astronomy? For example, we can observe only the one Universe: (From Bennett et al 2003) Cardiff Feb 16-17 2005
The Astrophysicist’s Shopping List We want tools capable of: • dealing with very faint sources • handling very large data sets • correcting for selection effects • diagnosing systematic errors • avoiding unnecessary assumptions • estimating parameters and testing models Cardiff Feb 16-17 2005
Why statistical astronomy? Key question: How do we infer properties of the Universe from incomplete and imprecise astronomical data? Our goal: To make the best inference, based on our observed data and any prior knowledge, reserving the right to revise our position if new information comes to light. Let’s come to this problem afresh with an astrophyicist’s eye, and bypass some of the jargon of orthodox statistics, going right back to plain probability: Cardiff Feb 16-17 2005
Right-thinking gentlemen #1 “A decision was wise, even though it led to disastrous consequences, if with the evidence at hand indicated it was the best one to make; and a decision was foolish, even though it led to the happiest possible consequences, if it was unreasonable to expect those consequences”We should do the best with what we have, not what we wished we had. Herodotus, c.500 BC Cardiff Feb 16-17 2005
Right-thinking gentlemen #2 “Probability theory is nothing but common sense reduced to calculation” Pierre-Simon Laplace (1749 – 1827) Cardiff Feb 16-17 2005
Right-thinking gentlemen #3 Occam’s Razor “Frustra fit per plura, quod fieri potest per pauciora.” “It is vain to do with more what can be done with less.” Everything else being equal, we favour models which are simple. William of Occam (1288 – 1348 AD) Cardiff Feb 16-17 2005
The meaning of probability Definitions, algebra and useful distributions Cardiff Feb 16-17 2005
But what is “probability”? • There are three popular interpretations of the word, each with an interesting history: • Probability as a measure of our degree of belief in a statement • Probability as a measure of limiting relative frequency of outcome of a set of identical experiments • Probability as the fraction of favourable (equally likely) possibilities • We will call these the Bayesian, Frequentist and Combinatorial interpretations. • Note there are signs of trouble here: • How do you quantify “degree of belief”? • How do you define “relative frequency” without using ideas of probability? • What does “equally likely” mean? • Thankfully, at some level, all three interpretations agree on the algebra of probability, which we will present in Bayesian terms: Cardiff Feb 16-17 2005
Algebra of (Bayesian) probability • Probability [of a statement, such as “y = 3”, “the mass of a neutron star is 1.4 solar masses” or “it will rain tomorrow”] is a number between 0 and 1, such that • For some statement X, where the bar denotes the negation of the statement -- The Sum Rule • If there are two statements X and Y, then joint probabilitywhere the vertical line denotes the conditional statement “X given Y is true” – The Product Rule Cardiff Feb 16-17 2005
Algebra of (Bayesian) probability Thomas Bayes (1702 – 1761 AD) • From these we can deduce thatwhere “+” denotes “or” and I represents common background information -- The Extended Sum Rule • …and, because p(X,Y)=p(Y,X), we getwhich is calledBayes’ Theorem • Note that these results are also applicablein Frequentist probability theory, with a suitable change in meaning of “p”. Cardiff Feb 16-17 2005
Algebra of (Bayesian) probability We can usually calculate all these terms Bayes’ theorem is the appropriate rule for updating our degree of belief when we have new data: Prior Likelihood Posterior Evidence [note that the word evidence is sometimes used for something else (the ‘log odds’). We will stick to the p(d|I) definition here.] Cardiff Feb 16-17 2005
Algebra of (Bayesian) probability • We can also deduce the marginal probabilities. If X and Y are propositions that can take on values drawn from and then this gives use the probability of X when we don’t care about Y. In these circumstances, Y is known as a nuisance parameter. • All these relationships can be smoothly extended from discrete probabilities to probability densities, e.g.where “p(y)dy” is the probability that y lies in the range y to y+dy. =1 Cardiff Feb 16-17 2005
Algebra of (Bayesian) probability These equations are the key to Bayesian Inference – the methodology upon which (much) astronomical data analysis is now founded. Clear introduction by Devinder Sivia (OUP). (fuller bibliography tomorrow) See also the free book by Praesenjit Saha (QMW, London). http://ankh-morpork.maths.qmw.ac.uk/%7Esaha/book Cardiff Feb 16-17 2005
Example… • A gravitational wave detector may have seen a type II supernova as a burst of gravitational radiation. Burst-like signals can also come from instrumental glitches, and only 1 in 10,000 bursts is really a supernova, so the data are checked for glitches by examining veto channels. The test is expected to confirm the burst is astronomical in 95% of cases in which it truly is, and in 1% when it truly is not.The burst passes the veto test!!What is the probability we have seen a supernova?Answer: DenoteLet I represent the information that the burst seen is typical of those used to deduce the information in the question.Then we are told that: } Prior probabilities for S and G } Likelihoods for S and G Cardiff Feb 16-17 2005
Example cont… • But we want to know the probability that it’s a supernova, given it passed the veto test:By Bayes, theoremand we are directly told everything on the rhs except , the probability that any burst candidate would pass the veto test. If the burst is either a supernova or a hardware glitch then we can marginalise over these alternatives:so 0.9999 0.01 0.0001 0.95 Cardiff Feb 16-17 2005
Example cont… This is the likelihood: how consistent is the data with a particular model? This is the posterior: how probable is the model, given the data? • So, despite passing the test there is only a 1% probability that the burst is a supernova!Veto tests have to be blisteringly good if supernovae are rare. • Why? Because most bursts that pass the test are just instrumental glitches – it really is just common sense reduced to calculation. • Note however that by passing the test, the probability that this burst is from a supernova has increased by a factor of 100 (from 0.0001 to 0.01). • Moral: Probability that a supernova burst gets through the veto (0.95) Probability that it’s a supernova burst if it gets through the veto(0.01) Cardiff Feb 16-17 2005
the basis of frequentist probability Cardiff Feb 16-17 2005
Basis of frequentist probability • Bayesian probability theory is simultaneously a very old and a very young field: • Old : original interpretation of Bernoulli, Bayes, Laplace… • Young: ‘state of the art’ in (astronomical) data analysis • But BPT was rejected for several centuries because probability as degree of belief was seen as too subjective Frequentist approach Cardiff Feb 16-17 2005
Basis of frequentist probability Probability = ‘long run relative frequency’ of an event it appears at first that this can be measured objectively e.g. rolling a die. What is ? If die is ‘fair’ we expect These probabilities are fixed (but unknown) numbers. Can imagine rolling die M times. Number rolled is a random variable – different outcome each time. Cardiff Feb 16-17 2005
Basis of frequentist probability • We define If die is ‘fair’ • But objectivity is an illusion: • assumes each outcome equally likely • (i.e. equally probable) • Also assumes infinite series of identical trials • What can we say about the fairness of the die after • (say) 5 rolls, or 10, or 100 ? Cardiff Feb 16-17 2005
Basis of frequentist probability • In the frequentist approach, a lot of mathematical machinery is specifically defined to let us address frequency questions: • We take the data as a random sample of size M , drawn from an assumed underlying pdf • Sampling distribution, derived from the underlying pdf and M • Define an estimator – function of the sample data that is used to estimate properties of the pdf • But how do we decide what makes an acceptable estimator? Cardiff Feb 16-17 2005
Basis of frequentist probability • Example: measuring a galaxy redshift • Let the true redshift = z0 -- a fixed but unknown parameter. We use two telescopes to estimate z0 and compute sampling distributions for and modelling errors • Small telescope • low dispersion spectrometer • Unbiased: • Repeat observation a • large number of times • average estimate is • equal to z0 • BUT is large due to the low dispersion. Cardiff Feb 16-17 2005
Basis of frequentist probability • Example: measuring a galaxy redshift • Let the true redshift = z0 -- a fixed but unknown parameter. We use two telescopes to estimate z0 and compute sampling distributions for and modelling errors • Large telescope • high dispersion spectrometer • but faulty astronomer! • (e.g. wrong calibration) • Biased: • BUT is small. Which is the better estimator? Cardiff Feb 16-17 2005
Basis of frequentist probability What about the sample mean? • Let be a random sample from pdf with mean • and variance . Then • = sample mean • Can show that -- an unbiased estimator • But bias is defined formally in terms of an infinite set of randomly chosen samples, each of size M. • What can we say with a finite number of samples, each of finite size?Before that… Cardiff Feb 16-17 2005
Some important probability distributions quick definitions Cardiff Feb 16-17 2005
Some important pdfs: discrete case • Poisson pdf • e.g. number of photons / second counted by a CCD, • number of galaxies / degree2 counted by a survey n = number of detections Poisson pdf assumes detections are independent, and there is a constant rate Can show that Cardiff Feb 16-17 2005
Some important pdfs: discrete case • Poisson pdf • e.g. number of photons / second counted by a CCD, • number of galaxies / degree2 counted by a survey Cardiff Feb 16-17 2005
Some important pdfs: discrete case • Binomial pdf • number of ‘successes’ from N observations, for two mutually • exclusive outcomes (‘Heads’ and ‘Tails’) • e.g. number of binary stars, Seyfert galaxies, supernovae… r = number of ‘successes’ = probability of ‘success’ for single observation Can show that Cardiff Feb 16-17 2005
Some important pdfs: continuous case p(x) 1 b-a x a 0 b • Uniform pdf Cardiff Feb 16-17 2005
Some important pdfs: continuous case • Central, or Normal pdf • (also known as Gaussian ) p(x) x Cardiff Feb 16-17 2005
Cumulative distribution function (CDF) = Prob( x < a ) Cardiff Feb 16-17 2005
Measures and moments of a pdf The nth moment of a pdf is defined as Discrete case Continuous case Cardiff Feb 16-17 2005
Measures and moments of a pdf The 1stmoment is called the meanorexpectation value Discrete case Continuous case Cardiff Feb 16-17 2005
Measures and moments of a pdf The 2ndmoment is called the mean square Discrete case Continuous case Cardiff Feb 16-17 2005
Measures and moments of a pdf The variance is defined as and is often written . is called the standard deviation Discrete case Continuous case In general Cardiff Feb 16-17 2005
Measures and moments of a pdf pdf mean variance Poisson Binomial Uniform Normal discrete continuous Cardiff Feb 16-17 2005
Measures and moments of a pdf The Median divides the CDF into two equal halves Prob( x < xmed) = Prob( x > xmed) = 0.5 Cardiff Feb 16-17 2005
Measures and moments of a pdf The Mode is the value of x for which the pdf is a maximum p(x) x For a Normal pdf, mean = median = mode = Cardiff Feb 16-17 2005
Parameter estimation The Bayesian way Cardiff Feb 16-17 2005
Bayesian parameter estimation In the Bayesian approach, we can test our model, in the light of our data (i.e. rolling the die) and see how our degree of belief in its ‘fairness’ evolves, for any sample size, considering only the data that we did actually observe. Prior Likelihood Posterior Influence of our observations What we knew before What we know now Cardiff Feb 16-17 2005
Bayesian parameter estimation Astronomical example #1: Probability that a galaxy is a Seyfert 1 We want to know the fraction of Seyfert galaxies that are type 1. How large a sample do we need to reliably measure this? Model as a binomial pdf: = global fraction of Seyfert 1s Suppose we sample N Seyferts, and observe r Seyfert 1s probability of obtaining observed data, given model – the likelihood of Cardiff Feb 16-17 2005
Bayesian parameter estimation • But what do we choose as our prior? This has been the source of much argument between Bayesians and frequentists, though it is often not that important. • We can sidestep this for a moment, realising that if our data are good enough, the choice of prior shouldn’t matter! Prior Likelihood Posterior Dominates Cardiff Feb 16-17 2005
Bayesian parameter estimation Consider a simulation of this problem using two different priors Flat prior; all values of equally probable p( |I ) Normal prior; peaked at = 0.5 Cardiff Feb 16-17 2005
After observing 0 galaxies p( | data, I ) Cardiff Feb 16-17 2005