260 likes | 270 Views
This lecture discusses objective functions for model fitting, specifically the sum of squared residuals. Topics include least squares method, likelihood, hypothesis testing, and different ways to handle homogeneous and non-homogeneous noise. Examples and applications are provided.
E N D
Lecture 6 • Objective functions for model fitting: • Sum of squared residuals (=> the ‘method of least squares’). • Likelihood • Hypothesis testing
Model fitting – reminder of the terminology: • We have datayi at samples of some independent variable xi. • The model is our estimate of the parent or truth function. • Let’s express the model m(xi) as a function of a few parametersθ1, θ2 .. θM. • Finding the ‘best fit’ model then just means best estimates of the θ. (Bold – shorthand for a list) • Knowledge of physics informs choice of m, θ. The parent function – what we’d like to find out (but never can, exactly).
Naive best fit calculation: • The residuals for a particular model = yi-mi. • To ‘thread the model through the middle of the noise’, we want the magnitudes of all residuals to be small. • A reasonable way (not the only way) to achieve this is to define a sum of squared residuals as our objective function: • Fitting by minimizing this objective function is called the method of least squares. It is extremely common. • NOTE! This approach IGNORES possible uncerts in x.
But what if the noise is not homogeneous? • Some bits clearly have higher σ than others. • Answer: weight by 1/σ2i • This form of U is sometimes called χ2(pronounced kai squared). • To use it, we need to know the σi.
Simple example: mi = θ1 + θ2si Model – red is si, green the flat background. Contour map of Uls. The data yi: Truth values!
An even simpler example: • Last lecture, I noted that there do exist cases in which we can directly invert • For least squares, this happens if the model is a polynomial function of the parameters θi. • Expansion of grad U in this case gives a set of M linear equations in the M parameters called the normal equations. • It is easy to solve these to get the θi.
Simplest example of all: fitting a straight line. • Called linear regression by the statisticians. • There is a huge amount of literature on it. • Normal equations for a line model turn out to be: • Polynomial is an easy extension to this.
χ2 for Poisson data – possible, but problematic. • Choose data yi as estimator for σi2? • No - can have zero values in denominator. • Choose (evolving) model as estimator for σi2? • No - gives a biased result. • Better: Mighell formula • Unbiased, but no good for goodness-of-fit. • Use Mighell to fit θ then standard U for “goodness of fit” (GOF). Mighell K J, Ap J 518, 380 (1999)
Another choice of U: likelihood. • Likelihood is best illustrated by Poisson data. • Consider a single Poisson random variable y: its PDF is where m here plays the role of the expectation value of y. • We’re used to thinking of this as a function just of one variable, ie y; • but it is really a function of both y and m.
PDF for y vs likelihood for θ. Probability p(y|θ) = θye–θ / y! Likelihood p(y|θ) = θye–θ / y!
The likelihood function. • Before, we thought “given m, let us apply the PDF to obtain the probability of getting between y and y+dy.” • Now we are saying “well we know y, we just measured it. We don’t know m. But surely the PDF taken as a function of mindicates the probability density for m.” • Problems with this: • Likelihood function is not necessarily normalized, like a ‘proper’ PDF; • What assurance do we have that the true PDF for m has this shape??
Likelihood continued. • Usually we have many (N) samples yi. Can we arrive at a single likelihood for all samples taken together? • (Note that we’ve stopped talking just about Poisson data now – this expression is valid for any form of p.) • Sometimes easier to deal with the log-likelihoodL:
Likelihood continued • To get the best-fit model m, we need to maximize the likelihood (or equivalently, the log likelihood). • If we want an objective function to minimize, it is convenient to choose –L. • Can show that for Gaussian data, minimizing –L is equivalent to minimizing the variance-weighted sum of squared residuals (=chi squared) given before. • Proof left as an exercise!
Poissonian/likelihood version of slide 3 Model – red is si, green the flat background. Map of the joint likelihood L. The data yi:
What if also errors in xi? • Tricky… Bayes better in this case.
What next? • In fitting a model, we want (amplifying a bit on lecture 4): • The best fit values of the parameters; • Then we want to know if these values are good enough! • If not: need to go back to the drawing board and choose a new model. • If the model passes, want uncertainties in the best-fit parameters. • (I’ll put this off to a later lecture…) • Number 1 is accomplished. √
How to tell if our model is correct. • Supposing our model is absolutely accurate. • The U value we calculate is, nevertheless, a random variable: each fresh set of data will give rise to a slightly different value of U. • In other words, U, even in the case of a perfectly accurate model, will have some spread – in fact, like any other random variable, it will have a PDF. • This PDF is sometimes calculable from first principles (if not, one can do a Monte Carlo to estimate it).
How to tell if our model is correct. • The procedure is: • First calculate the PDF for U in the ‘perfect fit’ case; • From this curve, obtain the value of the PDF at our best-fit value of U; • If p(Ubest fit) is very small, it is unlikely that our model is correct. • Note that both χ2 and –L have the property that they cannot be negative. • A model which is a less than ideal match to the truth function will always generate U values with a PDF displaced to higher values of U.
Perfect vs. imperfect p(U): A perfect model gives this shape PDF PDF for imperfect model is ALWAYS displaced to higher U.
Goodness of model continued • Because plausible Us are >=0; and because an imperfect model always gives higher U: we prefer to • generate the survival function for the perfect model; • that tells us the probability of a perfect model giving us the measured value of Uor higher. • This procedure is called hypothesis testing. • Because we make the hypothesis: • “Suppose our model is correct. What sort of U value should we expect to find?” • We’ll encounter the technique again next lecture when we turn to enquire if there is any signal at all buried in the noise.
Perfect-model p(U)s: • If we use the least-squares U (also known as χ2), this is easy, because p(U) is known for this: where • Г is the gamma function • and υ is called the degrees of freedom. • Note: the PDF has a peak at U~υ.
What are degrees of freedom? • The easiest way to illustrate what degrees of freedom is, is to try fitting a polynomial of higher and higher order to a set of noisy data. • The more orders we include, the nearer the model will fit the data, and the smaller the sum of squared residuals (χ2) will be, until… • when M=N (ie the number of parameters, polynomial orders in this case, equals the number of data points), the model will go through every point exactly. χ2 will equal 0.
Degrees of freedom • Defined as N-M: number of data points minus number of parameters fitted. • It is sometimes convenient to define a reduced chi squared • PDF for χ2reduced should of course peak at about 1. • There is no advantage in using this for minimization rather than the ‘raw’ χ2.
‘Survival function’ for U. • Remember the survival function of a PDF is defined as • For χ2 this is • where Г written with 2 arguments like this is called the incomplete gamma function: