Maximum Likelihood Estimation. Methods of Economic Investigation, Lecture 17
Last Time • IV estimation issues • Heterogeneous Treatment Effects • The assumptions • LATE interpretation • Weak Instruments • Bias in finite samples • F-statistic test
Today’s Class • Maximum Likelihood Estimators • You’ve seen this in the context of OLS • Can make other assumptions on the form of the likelihood function • This is how we estimate discrete choice models like probit and logit • This is a very useful form of estimation • Has nice properties • Can be very robust to mis-specification
Our Standard OLS • The standard OLS model: Yi = Xi’β + εi • Focus is on minimizing the mean squared error, under the assumption that εi | Xi ~ N(0, σ2)
Another way to motivate linear models • “Extremum estimators”: maximize or minimize some objective function • OLS minimizes the mean squared error • Could also imagine minimizing or maximizing some other type of function • We often use a “likelihood function” • This approach is more general, allowing us to deal with more complex nonlinear models • Useful properties in terms of consistency and asymptotic convergence. A small sketch of the extremum-estimator idea follows below
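As an illustration of the extremum-estimator idea, the sketch below minimizes the mean squared error numerically and checks the result against the closed-form OLS coefficients. The simulated data, seed, and variable names are illustrative assumptions, not material from the lecture.

```python
# A minimal sketch of OLS as an extremum estimator: minimize the mean
# squared error numerically and compare with the closed-form solution.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

def mse(beta):
    resid = y - X @ beta
    return np.mean(resid ** 2)

numeric = minimize(mse, x0=np.zeros(2)).x          # extremum estimator
closed_form = np.linalg.solve(X.T @ X, X.T @ y)    # familiar OLS formula
print(numeric, closed_form)                        # the two should agree
```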
What is a likelihood function • Suppose we have independent and identically distributed random variables {Z1, . . . , ZN} drawn from a density function f(z; θ). Then the likelihood function given a sample is L(θ; z) = Πi f(Zi; θ) • Because it is often more convenient to work with sums, we usually use this in logarithmic form: log L(θ; z) = Σi log f(Zi; θ)
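A minimal sketch of a sample log likelihood, assuming i.i.d. draws from a normal density purely for illustration; the simulated data and parameter values are not from the lecture.

```python
# log L(theta; z) = sum_i log f(z_i; theta), here with f = N(mu, sigma^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=1.5, size=1000)  # illustrative sample

def log_likelihood(theta, z):
    mu, sigma = theta
    return np.sum(norm.logpdf(z, loc=mu, scale=sigma))

print(log_likelihood((2.0, 1.5), z))   # high near the true parameters
print(log_likelihood((0.0, 1.5), z))   # much lower at a wrong mean
```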
Consistency - 1 • Consider the population log likelihood function evaluated under the “true” parameter θ0: L0(θ) = E[log f(Z; θ)], where the expectation is taken over the distribution f(z; θ0) • Think of L0 as the population average and (1/N) log L(θ; z) as its sample analogue, so that in the usual way (by the law of large numbers) (1/N) log L(θ; z) → L0(θ) for each θ as N grows
Consistency - 2 • The population likelihood function L0(θ) is maximized at the true value, θ0. Why? • Think of the sample likelihood function as telling us how likely it is that one would observe this sample if the parameter value θ were really the true parameter value • Similarly, the population likelihood function L0(θ) will be largest at the value of θ that makes it most likely to “observe the population” • That value is the true parameter value, i.e. θ0 = argmax L0(θ)
Consistency - 3 • We now know that the population likelihood L0(θ) is maximized at θ0 • Jensen’s inequality applied to the log function makes this precise: E[log(f(Z; θ)/f(Z; θ0))] ≤ log E[f(Z; θ)/f(Z; θ0)] = 0, so L0(θ) ≤ L0(θ0) • The sample (average) log likelihood (1/N) log L(θ; z) gets closer to L0(θ) as N increases • i.e. log L will start having the same shape as L0 • So for large N the sample likelihood is maximized close to θ0, which gives consistency of the MLE. A small simulation of this idea follows below
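A small simulation of the consistency argument, assuming an N(mu, 1) model so that the maximizer of the sample log likelihood is simply the sample mean; the numbers below are illustrative.

```python
# As N grows, the argmax of the sample log likelihood approaches the true mu.
import numpy as np

rng = np.random.default_rng(2)
mu_true = 0.7
for n in (10, 100, 10_000, 1_000_000):
    z = rng.normal(loc=mu_true, scale=1.0, size=n)
    mu_hat = z.mean()              # argmax of the Gaussian log likelihood in mu
    print(n, round(mu_hat, 4))     # drifts toward 0.7
```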
Information Matrix Equality • Define the score function as the vector of first derivatives of the log likelihood function: s(θ; z) = ∂ log L(θ; z)/∂θ • Define the Hessian as the matrix of second derivatives of the log likelihood function: H(θ; z) = ∂2 log L(θ; z)/∂θ∂θ’ • An additional useful property of the MLE, the information matrix equality, is that at θ0 the variance of the score equals minus the expected Hessian: E[s(θ0)s(θ0)’] = −E[H(θ0)]
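A rough Monte Carlo check of the information matrix equality, assuming a single N(mu, 1) observation so the score and Hessian have simple closed forms; everything here is an illustrative assumption rather than part of the lecture.

```python
# For one N(mu, 1) observation: score = z - mu, Hessian = -1, so at the true
# mu the variance of the score should equal minus the expected Hessian (= 1).
import numpy as np

rng = np.random.default_rng(3)
mu_true = 0.0
z = rng.normal(loc=mu_true, scale=1.0, size=1_000_000)

score = z - mu_true          # d/d mu of log f(z; mu)
hessian = -np.ones_like(z)   # second derivative, constant here

print(np.var(score))         # approx. 1 (outer product of the score)
print(-np.mean(hessian))     # exactly 1 (minus expected Hessian)
```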
Asymptotic Distribution • Define the information matrix I(θ0) = −E[H(θ0)] = E[s(θ0)s(θ0)’] • Then the MLE converges in distribution to a normal: √N(θ̂MLE − θ0) →d N(0, I(θ0)−1) • The information matrix I(θ0) attains the Cramér-Rao lower bound, i.e. there does not exist a consistent estimate of θ with a smaller asymptotic variance
Computation • Can be quite complex because we need to maximize the likelihood numerically • General procedure: • Re-scale variables so they have roughly similar variances • Choose some starting value and estimate the maximum in that area • Do this over and over across different grids of starting values • This gives an approximation of the underlying objective function • If this converges to a single maximum, you’re done. A sketch using a numerical optimizer appears below
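A sketch of the numerical maximization step, assuming a Gaussian likelihood and SciPy's general-purpose optimizer; the multiple starting values stand in for the grid search described above, and all names and numbers are illustrative.

```python
# Minimize the negative log likelihood from several starting points and keep
# the best solution; re-parameterizing with log(sigma) keeps sigma positive.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
z = rng.normal(loc=3.0, scale=2.0, size=2000)

def neg_log_lik(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (z - mu)**2 / sigma**2)

best = None
for start in ([0.0, 0.0], [5.0, 1.0], [-5.0, 2.0]):   # different starting values
    res = minimize(neg_log_lik, x0=np.array(start), method="BFGS")
    if best is None or res.fun < best.fun:
        best = res

mu_hat, sigma_hat = best.x[0], np.exp(best.x[1])
print(mu_hat, sigma_hat)   # approx. mu = 3, sigma = 2
```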
Test Statistics • Define our likelihood function L(z; θ0, θ1) • Suppose we want to test H0: θ0 = 0 against the alternative HA: θ0 ≠ 0 • We could estimate a restricted and an unrestricted likelihood function and compare them with the likelihood ratio statistic LR = 2[log L(θ̂unrestricted) − log L(θ̂restricted)], which is asymptotically χ2 under H0
Test Statistics - 1 • We can also test how “close” our restricted and unrestricted models might be • We could test whether the log likelihood function is maximized at θ0 = 0: if it is, the derivative of the log likelihood with respect to θ0, evaluated at that point, should be close to zero (the Lagrange multiplier, or score, test)
Test Statistics - 2 • The restricted and unrestricted estimates of θ should be close together if the null hypothesis is correct • Partition the information matrix conformably with (θ0, θ1): I(θ) = [I00 I01; I10 I11] • Define the Wald test statistic as W = θ̂0’ [V̂ar(θ̂0)]−1 θ̂0, where V̂ar(θ̂0) is the block of I(θ̂)−1 corresponding to θ0
Comparing test statistics • In large samples, the three test statistics (LR, Wald, LM) are asymptotically equivalent • In finite samples, the three will tend to generate somewhat different values of the test statistic • They will generally come to the same conclusion • The difference between the tests is how they go about assessing whether the restriction θ0 = 0 holds • The LR test requires estimates of both models • The W and LM tests approximate the LR test but require that only one model be estimated (unrestricted for W, restricted for LM) • When the model is linear, the three test statistics have the relationship W ≥ LR ≥ LM. A sketch of the LR test appears below
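A sketch of the likelihood ratio test in practice, assuming a logit model and the statsmodels package (a tooling choice, not something specified in the lecture); the data are simulated and the restriction drops one regressor.

```python
# LR = 2 * (log L unrestricted - log L restricted), compared to a chi-squared
# distribution with degrees of freedom equal to the number of restrictions.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 0.8 * x1 - 0.3 * x2)))
y = rng.binomial(1, p)

X_u = sm.add_constant(np.column_stack([x1, x2]))   # unrestricted model
X_r = sm.add_constant(x2)                          # restricted: coefficient on x1 set to 0

fit_u = sm.Logit(y, X_u).fit(disp=0)
fit_r = sm.Logit(y, X_r).fit(disp=0)

lr = 2 * (fit_u.llf - fit_r.llf)
print(lr, chi2.sf(lr, df=1))    # a large LR (small p-value) rejects H0
```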
OLS in the MLE context • Linear model log likelihood function: log L(β, σ2) = −(N/2) log(2πσ2) − (1/(2σ2)) Σi (Yi − Xi’β)2 • Choose the parameter values (β, σ2) which maximize this: maximizing over β is the same as minimizing the sum of squared residuals, so the MLE of β is the OLS estimator
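A sketch of OLS recovered as a maximum likelihood estimate: the Gaussian log likelihood above is maximized numerically and the β estimates are compared with the closed-form OLS coefficients. The simulated data are illustrative.

```python
# Maximize the Gaussian log likelihood over (beta, sigma); the beta part
# should coincide with the OLS solution.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

def neg_log_lik(params):
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    resid = y - X @ beta
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + resid**2 / sigma**2)

mle = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS").x
ols = np.linalg.solve(X.T @ X, X.T @ y)
print(mle[:2], ols)    # the MLE of beta matches OLS
```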
Example 1: Discrete choice • Latent variable model: the true variable of interest is Y* = X’β + ε • We don’t observe Y* but we can observe Y = 1[Y* > 0] • Pr[Y=1] = Pr[Y* > 0] = Pr[ε < X’β] (using the symmetry of the distribution of ε) • What to assume about ε? • Linear Probability Model: Pr[Y=1] = X’β • Probit Model: Pr[Y=1] = Ф(X’β), where Ф is the standard normal CDF • Logit Model: Pr[Y=1] = exp(X’β)/[1 + exp(X’β)]
Likelihood Functions • Probit: log L(β) = Σi { Yi log Ф(Xi’β) + (1 − Yi) log[1 − Ф(Xi’β)] } • Logit: log L(β) = Σi { Yi log Λ(Xi’β) + (1 − Yi) log[1 − Λ(Xi’β)] }, where Λ(u) = exp(u)/[1 + exp(u)]
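A sketch of the probit log likelihood written out and maximized directly, with norm.cdf standing in for Ф; the data-generating process and parameter values are illustrative assumptions. In practice one would use a packaged probit routine, but hand-coding the likelihood makes the structure above explicit.

```python
# Probit: log L(beta) = sum_i [ y_i log Phi(x_i'beta) + (1-y_i) log(1 - Phi(x_i'beta)) ]
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.2, 1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)   # Y = 1[Y* > 0]

def neg_log_lik(beta):
    p = norm.cdf(X @ beta)                    # Pr[Y = 1] = Phi(X'beta)
    p = np.clip(p, 1e-10, 1 - 1e-10)          # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_hat = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS").x
print(beta_hat)    # approx. (0.2, 1.0)
```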
Marginal Effects • In the linear probability model we can interpret our coefficients as the change in Pr[Y=1] with respect to the relevant variable, i.e. ∂Pr[Y=1]/∂Xk = βk • In non-linear models, things are a bit trickier: estimation gives us the parameter estimate of β, but what we want is ∂Pr[Y=1]/∂Xk, which for the probit is φ(X’β)βk and for the logit is Λ(X’β)[1 − Λ(X’β)]βk • These are the “marginal effects” and are typically evaluated at the mean values of X
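A sketch of probit marginal effects evaluated at the mean of X, using hypothetical coefficient estimates and regressor means rather than numbers from the lecture.

```python
# For regressor k in a probit, dPr[Y=1]/dX_k = phi(xbar'beta) * beta_k
# when evaluated at the sample mean of X.
import numpy as np
from scipy.stats import norm

beta_hat = np.array([0.2, 1.0, -0.5])   # hypothetical probit estimates (const, x1, x2)
x_bar = np.array([1.0, 0.1, 2.3])       # hypothetical regressor means (const = 1)

index = x_bar @ beta_hat                             # xbar'beta
marginal_effects = norm.pdf(index) * beta_hat[1:]    # skip the constant
print(marginal_effects)   # change in Pr[Y=1] per unit change in x1, x2
```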
Next Time • Time Series Processes • AR • MA • ARMA • Model Selection • Return to MLE • Various Criteria for Model Choice