Overview of Maximum Likelihood Estimation: Part II
Use of Maximum Likelihood Techniques
• Simple Example: y1, y2, …, yT is an iid random sample where yt ~ N(μ, σ²)
• Θ = (μ, σ²), σ² > 0
• We want to determine the values of μ and σ² that maximize the log-likelihood (L)
• What do you think the maximum likelihood values of μ and σ² should be?
• Normal PDF: f(yt | μ, σ²) = (2πσ²)^(−1/2) exp[−(yt − μ)²/(2σ²)]
• Total sample likelihood function (due to the iid assumption): l(μ, σ² | y) = Πt f(yt | μ, σ²)
Use of Maximum Likelihood Techniques
• From the above, the sample log-likelihood function can be represented as:
  L(μ, σ² | y) = −(T/2) ln(2π) − (T/2) ln(σ²) − [Σt (yt − μ)²]/(2σ²)
• One can show that the above is a concave function of Θ = (μ, σ²)
Use of Maximum Likelihood Techniques
• FOC for maximizing L(•) wrt μ, yielding the ML estimate μl (with σ² > 0):
  ∂L/∂μ = [(Σt yt) − Tμl]/σl² = 0
  → (Σt yt) − Tμl = 0
  → μl = (Σt yt)/T
Use of Maximum Likelihood Techniques
• FOC for maximizing L(•) wrt σ², yielding the ML estimate σl²:
  ∂L/∂σl² = −T/(2σl²) + [Σt (yt − μl)²]/(2σl⁴) = 0
• → −T + [Σt (yt − μl)²]/σl² = 0
• → T − [Σt (yt − μl)²]/σl² = 0
• → σl²T − [Σt (yt − μl)²] = 0
• → σl² = [Σt (yt − μl)²]/T
• → μs = μl and σs² = σl², where μs and σs² are the least squares estimates of the mean and variance
• This has important implications for the use of ML techniques and the CRM
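The closed-form result above is easy to check numerically. The sketch below is illustrative and not part of the original slides (the data and variable names are made up); it simulates a normal sample, computes the closed-form ML estimates, and confirms them by maximizing the log-likelihood directly.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative sketch: ML estimates of a normal mean and variance are the sample
# mean and the T-divisor variance.
rng = np.random.default_rng(0)
T, mu_true, sigma2_true = 200, 3.0, 4.0
y = rng.normal(mu_true, np.sqrt(sigma2_true), size=T)   # simulated data

# Closed-form ML estimates derived above
mu_ml = y.sum() / T
sigma2_ml = ((y - mu_ml) ** 2).sum() / T                 # note divisor T, not T-1

def neg_log_like(theta):
    """Negative sample log-likelihood L(mu, sigma^2 | y)."""
    mu, sigma2 = theta
    if sigma2 <= 0:
        return np.inf
    return 0.5 * T * np.log(2 * np.pi * sigma2) + ((y - mu) ** 2).sum() / (2 * sigma2)

# Numerical maximization should reproduce the closed-form values
res = minimize(neg_log_like, x0=[0.0, 1.0], method="Nelder-Mead")
print(mu_ml, sigma2_ml)   # closed form
print(res.x)              # numerical ML estimates (approximately equal)
```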
Maximum Likelihood Estimates of CRM β and σ²
• Under the CRM:
• y = Xβ + e, where et ~ (0, σ²)
• y ~ (Xβ, σ²IT)
• With normality:
• y = Xβ + e, where et ~ N(0, σ²)
• → y ~ N(Xβ, σ²IT)
• → yt is selected from a population with the distribution N(Xtβ, σ²)
• Remember, β and σ² are the true, unknown values
Maximum Likelihood Estimates of CRM β and σ²
• What is the PDF of yt under the above normality assumption?
  f(yt | Xt, β, σ²) = (2πσ²)^(−1/2) exp[−(yt − Xtβ)²/(2σ²)]
  where Xtβ is E(yt), (β, σ²) are the parameters, and yt − Xtβ is the error term
• → this is the likelihood function lt(β, σ² | Xt, yt) for the tth observation
• Assume a homoscedastic/non-autocorrelated error structure
Maximum Likelihood Estimates of CRM β and σ²
• With the T observations being iid, what is the sample likelihood function?
  l(β, σ² | X, y) = Πt lt(β, σ² | Xt, yt)
• This implies that for the entire sample, under the assumed error structure, we have:
  l(β, σ² | X, y) = (2πσ²)^(−T/2) exp[−(y − Xβ)′(y − Xβ)/(2σ²)]
  where (y − Xβ)′(y − Xβ) is (1 × T)(T × 1) = (1 × 1), the sample SSE
  (remember e^δ e^ψ = e^(δ+ψ), which is how the T exponential terms combine)
Maximum Likelihood Estimates of CRM β and σ²
• Reason for calling the above a likelihood function (LF):
• Given y, choose the β and σ² for which the LF is large:
• The chosen β and σ² are more likely to be the true values than parameter estimates for which the LF is small
• → The observed y's are more plausible with that particular choice of β and σ²
Maximum Likelihood Estimates of CRM β and σ²
• Using this sample LF, how do we obtain estimates of β and σ²?
• With the normal distribution → work with the log form (a positive monotonic transformation):
  L(β, σ² | X, y) = −(T/2) ln(2π) − (T/2) ln(σ²) − (y − Xβ)′(y − Xβ)/(2σ²)
  where −(T/2) ln(2π) is a constant
• The values of β and σ² that maximize L will also maximize the original likelihood function, l(•)
• Note: my notation differs from Greene's
Maximum Likelihood Estimates of CRM β and σ²
• Note that the last term of the log-likelihood function L is the only one that contains β
• Maximizing L wrt β is equivalent to
• Maximizing −e′e, where e′e = (y − Xβ)′(y − Xβ), which is equivalent to
• Minimizing e′e, due to the negative sign and the constant 2σ²
• Minimizing e′e is the least squares criterion
• → βML = βS = (X′X)⁻¹X′y
• → σ²ML = es′es/T, where es = y − XβS
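A minimal sketch of this result (illustrative only, not from the slides): under normality the ML estimator of β in the CRM is the least squares estimator, while the ML variance estimator divides the SSE by T rather than T − K.

```python
import numpy as np

# Illustrative CRM example with simulated data
rng = np.random.default_rng(1)
T, K = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=T)

beta_ml = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y, identical to OLS
e = y - X @ beta_ml                           # residuals
sigma2_ml = e @ e / T                         # ML estimate: divides by T (biased)
sigma2_unbiased = e @ e / (T - K)             # usual unbiased estimate, for comparison

print(beta_ml, sigma2_ml, sigma2_unbiased)
```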
Use of Maximum Likelihood Techniques
• Under some fairly general conditions, the Maximum Likelihood Estimate (MLE) of Θ, ΘML, is
• consistent
• asymptotically normal (regardless of the distribution assumption on Y)
• asymptotically unbiased
• asymptotically efficient
• The finite sample properties of the MLE are sometimes not nice
• For example, the MLE may be biased
• For many applications, the appeal of the MLE is its asymptotic properties
• The finite sample behavior varies from problem to problem
Use of Maximum Likelihood Techniques
• In summary, we have the following asymptotic properties of an MLE, where Θ0 is the true, unknown value of Θ, given certain regularity conditions:
• M1: Consistency: plim ΘML = Θ0
• M2: Asymptotic Normality: ΘML ~a N[Θ0, {I(Θ0)}⁻¹], where I(Θ0) = −E[∂²L/∂Θ∂Θ′] evaluated at Θ0
• The expected value operator above implies that we replace y with E(y) wherever y appears in the Hessian
Use of Maximum Likelihood Techniques
• M3: Asymptotic Efficiency: ΘML is asymptotically efficient and achieves the Cramer-Rao lower bound for consistent estimators, given M2 and Theorem C.2 in Greene, p. 1031
• M4: Invariance: The maximum likelihood estimator of a function of the true unknown parameters, γ0 = c(Θ0), is c(ΘML) if c(Θ0) is a continuous and continuously differentiable function
Use of Maximum Likelihood Techniques
• Theorem C.2: Cramer-Rao Lower Bound
• Assume the PDF of y satisfies the regularity conditions discussed below
• The asymptotic variance of a consistent and asymptotically normally distributed estimator of the parameter vector Θ0 will always be at least as large as [I(Θ0)]⁻¹
• I(Θ) is the sample information matrix
• → the diagonal elements of I(Θ)⁻¹ will be less than or equal to the asymptotic parameter variance estimates
Use of Maximum Likelihood Techniques
• Theorem C.2: Cramer-Rao Lower Bound
• Sample Information Matrix: the negative of the expected value of the Hessian of the log-likelihood function:
  I(Θ) = −E[∂²L/∂Θ∂Θ′]
• The expected value of y is used wherever y appears in the Hessian
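As a concrete illustration (not on the original slide), here is the information matrix for the simple iid N(μ, σ²) example from the start of this section; its inverse gives the familiar asymptotic variances of the ML estimates of μ and σ².

```latex
% Worked example: information matrix for an iid N(mu, sigma^2) sample, using
% L = -(T/2)ln(2*pi) - (T/2)ln(sigma^2) - sum_t (y_t - mu)^2 / (2 sigma^2).
\[
I(\mu,\sigma^2) = -\,\mathrm{E}\!\begin{bmatrix}
\partial^2 L/\partial \mu^2 & \partial^2 L/\partial \mu\,\partial \sigma^2\\
\partial^2 L/\partial \sigma^2\,\partial \mu & \partial^2 L/\partial (\sigma^2)^2
\end{bmatrix}
=\begin{bmatrix} T/\sigma^2 & 0\\ 0 & T/(2\sigma^4)\end{bmatrix},
\qquad
I(\mu,\sigma^2)^{-1}=\begin{bmatrix} \sigma^2/T & 0\\ 0 & 2\sigma^4/T\end{bmatrix}.
\]
```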
Use of Maximum Likelihood Techniques
• Theorem C.2: Cramer-Rao Lower Bound
• → The difference between an estimated covariance matrix and the CRLB is positive semi-definite
• → Parameter estimates are asymptotically efficient if the estimated covariance matrix equals the CRLB
• In summary:
• Let Θ* be a consistent estimator of Θ such that √T(Θ* − Θ0) converges in distribution to N(0, V)
• Then Θ* is asymptotically efficient if V equals the Cramer-Rao lower bound
• Refer to JHGLL, p. 87 for more detail
Use of Maximum Likelihood Techniques
• Theorem C.2: Cramer-Rao Lower Bound
• Greene (Ch. 16) proves that the expected value of the negative of the Hessian of the LLF equals the expected outer product of the 1st derivative vector:
  I(Θ) = −E[∂²L/∂Θ∂Θ′] = E[(∂L/∂Θ)(∂L/∂Θ)′]
• L(•) is the total sample LLF (1 × 1), ∂L/∂Θ is (K × 1), and I(•) is (K × K)
Use of Maximum Likelihood Techniques
• Assume that (y1, …, yT) is a random sample from the population with density function f(yi|Θ0) and that the following regularity conditions hold:
• R1: The first three derivatives of ln[f(yi|Θ)] wrt Θ are continuous and finite for almost all yi and for all Θ
• This ensures the existence of a certain Taylor series approximation and a finite variance of the likelihood function derivatives
Use of Maximum Likelihood Techniques
• R2: Expectations of the 1st and 2nd derivatives of ln[f(y|Θ)] can be evaluated
• R3: For all values of Θ, |∂³ln[f(yi|Θ)]/∂θj∂θk∂θl| is less than a function of yi that has a finite expectation. This condition ensures the Taylor series noted in R1 can be truncated
• Given these regularity conditions, the following properties of PDFs represented by f(yi|Θ) can be obtained:
Use of Maximum Likelihood Techniques
• P1: ln[f(yi|Θ)], gi ≡ ∂ln[f(yi|Θ)]/∂Θ, and Hi ≡ ∂²ln[f(yi|Θ)]/∂Θ∂Θ′ (i = 1, …, T) are all random samples of random variables
• This follows from the assumption of random sampling
• P2: E[gi(θML)] = 0, where θML denotes the optimal value of Θ
• P3: Var[gi(θML)] = −E[Hi(θML)]
• Given the above:
• P1 is a consequence of the definition of a likelihood function
• P2 identifies the condition that defines the maximum likelihood estimator
• P3 produces the Information Matrix Equality, which shows how to obtain the MLE asymptotic covariance matrix
Use of Maximum Likelihood Techniques
• The asymptotic covariance matrix of the MLEs is a matrix whose elements are functions of the θ's being estimated:
  Asy. Var[ΘML] = [I(Θ)]⁻¹ = {−E[∂²L/∂Θ∂Θ′]}⁻¹
  (remember that in my notes L(∙) is the log of the sample LF)
• The above is evaluated at ΘML to estimate the covariance matrix
• This is referred to as the method of scoring for calculating the parameter covariance matrix
Use of Maximum Likelihood Techniques
• The log-likelihood 2nd derivatives may be complicated nonlinear functions of the data whose exact expected values are unknown
• There are two alternative estimators; the first:
• Evaluate the actual [versus expected] 2nd derivatives of the log of the sample LF at the MLE values (no expectation taken):
  Est. Asy. Var[ΘML] = {−∂²L/∂Θ∂Θ′ |Θ=ΘML}⁻¹
• This is referred to as the Newton-Raphson (NR) method for calculating the parameter covariance matrix
• The 2nd derivatives could still be complicated
Use of Maximum Likelihood Techniques
• Recall gi ≡ ∂ln[f(yi|Θ)]/∂Θ and P3: Var[gi(θML)] = −E[Hi(θML)]
• From P3 above, the negative of the expected 2nd derivatives matrix is the covariance matrix of the first derivatives vector, evaluated at ΘML
• Using this result, a 3rd estimator of the parameter covariance matrix is:
  Est. Asy. Var[ΘML] = [Σt gt gt′]⁻¹ = (G′G)⁻¹   (K × K)
• G is a T × K matrix whose tth row is the transpose of the tth vector of derivatives of the contribution of the tth observation to the sample LLF (Greene, 14-18, p. 522)
Use of Maximum Likelihood Techniques (K x 1) • For a single parameter, this estimator is just the reciprocal of the sum of squares of the first derivatives of the log-likelihood function • Extremely convenient because it does not require any computations beyond those required to solve the likelihood function (e.g. first derivatives). • It is always non-negative definite • Using the 1st derivatives referred to as the BHHH (Berndt, Hall, Hall and Hausman) estimator of the parameter covariance matrix
Use of Maximum Likelihood Techniques
• None of the 3 covariance matrix estimators (Scoring, NR, BHHH) is preferred on statistical grounds
• They may differ significantly when dealing with small samples
Use of Maximum Likelihood Techniques
• Two general methods for obtaining parameter estimates:
• Grid Search (e.g., 1 parameter)
[Figure: the sample likelihood l(Θ|Y) plotted against Θ, showing an initial guess Θ0 and the optimal value Θ*]
Use of Maximum Likelihood Techniques
• Two general methods for obtaining parameter estimates:
• Grid Search
• Standard Calculus
• With l(Θ|Y), the slope of the sample likelihood function is 0 at the optimum value for a particular parameter
• For a single parameter, would we like the second derivative of the likelihood function evaluated at the optimal value to be > 0, < 0, or = 0? (< 0: the slope is declining at a maximum)
Use of Maximum Likelihood Techniques
• Two general methods for obtaining parameter estimates:
• Grid Search
• Standard Calculus
• For multiple parameters, what characteristic must the matrix of second partial derivatives (i.e., the Hessian) of L(•) have, given that we are attempting to maximize the sample log-likelihood?
• e.g., |H1| < 0; |H2| > 0; . . . (the principal minors of the Hessian alternate in sign starting with |H1| < 0, i.e., the Hessian is negative definite)
Use of Maximum Likelihood Techniques
• Suppose we have the following data on income and educational attainment
• Example found in Greene, p. 504
Use of Maximum Likelihood Techniques
• We assume that there is a fixed relationship between income and education where E(INCt) = β + EDUt
• Suppose the conditional PDF of income is given by the following restricted form of the gamma distribution:
  f(INCt | EDUt, β) = [1/(β + EDUt)] exp[−INCt/(β + EDUt)]
• To find the MLE of β, we maximize the following sample log-likelihood function:
  L(β) = −Σt ln(β + EDUt) − Σt INCt/(β + EDUt)
Use of Maximum Likelihood Techniques
• To maximize L(·) we need:
  ∂L/∂β = −Σt 1/(β + EDUt) + Σt INCt/(β + EDUt)² = 0
• The above does not have an analytical solution, given that INC appears in the numerator of the last ratio term
• To find the maximum likelihood estimate of β, one can use any of a number of numerical methods to find the value of β that maximizes the log-likelihood
• We will review several of these
• They are similar to the NLS numerical methods
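Under these assumptions, a minimal numerical sketch could look like the following. It is illustrative only: the data values are made up (the original slides link to Greene's dataset), and the density form is the restricted gamma stated above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data only (the actual dataset is the one referenced in Greene, p. 504)
inc = np.array([20.5, 31.0, 15.7, 26.2, 55.1, 40.3, 12.8, 33.9])   # income
edu = np.array([12.0, 16.0, 11.0, 13.0, 18.0, 16.0, 10.0, 14.0])   # years of education

def neg_log_like(beta):
    """Negative sample log-likelihood for the restricted gamma model with
    E(INC_t) = beta + EDU_t; requires beta + EDU_t > 0 for every observation."""
    lam = beta + edu
    if np.any(lam <= 0):
        return np.inf
    return np.sum(np.log(lam) + inc / lam)

# No analytical solution for beta, so maximize L numerically over a bounded interval
res = minimize_scalar(neg_log_like, bounds=(-5.0, 50.0), method="bounded")
print("beta_ML =", res.x)
```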
Use of Maximum Likelihood Techniques
[Figure: an example log-likelihood function L(Θ) plotted against Θ]
• With the above likelihood function, we can find the value of Θ that maximizes L(Θ) by trial and error
• Evaluate L(Θ) for a range of Θ's that covers the plausible range of values
• Repeat the process for a finer grid of values of Θ
• This can work quite well with only 1 or 2 parameters but is cumbersome for more
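A rough sketch of such a coarse-to-fine grid search (illustrative only; it reuses the neg_log_like function defined in the income/education sketch above):

```python
import numpy as np

# Coarse-to-fine grid search over a single parameter.
# neg_log_like(beta) is the function defined in the income/education example above.
lo, hi = -5.0, 50.0
for _ in range(4):                       # repeat with successively finer grids
    grid = np.linspace(lo, hi, 101)
    values = np.array([-neg_log_like(b) for b in grid])   # log-likelihood values
    best = grid[values.argmax()]
    step = grid[1] - grid[0]
    lo, hi = best - step, best + step    # zoom in around the current best value
print("grid-search estimate of beta:", best)
```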
Use of Maximum Likelihood Techniques
• At a given starting point, for a single parameter one can consider points on either side of the starting value to determine which direction increases L(Θ)
• As soon as one registers a decline, the direction is reversed and the step length reduced
[Figure: L(Θ) plotted against Θ, with equal steps Θ1 − Θ0 = Θ2 − Θ1 = Θ3 − Θ2 that overshoot the optimum Θ*, so that Θ3 − Θ2 > Θ3 − Θ*]
Use of Maximum Likelihood Techniques
• When we have more than a few parameters, the above direct search method is impractical
• As an alternative, we can use the fact that at the optimum values of the unknown parameters, the gradients (score) of the log-likelihood function wrt these parameters are 0
• As a simple example, suppose we have the following log-likelihood function:
  L = 0.75x1 − 0.045x1² − 0.025x1³, with x1 evaluated from 0.1 → 4.6
• Gradient: 0.75 − 0.09x1 − 0.075x1²
• 2nd derivative: −0.09 − 0.15x1
• Setting the gradient to 0 gives an interior maximum at x1 ≈ 2.62, where the 2nd derivative is negative
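A short sketch (illustrative, not from the slides) that evaluates this log-likelihood and its gradient over the stated range and locates the point where the gradient crosses zero:

```python
import numpy as np

# Evaluate the example log-likelihood and its gradient over x1 = 0.1, ..., 4.6
x = np.arange(0.1, 4.6 + 1e-9, 0.1)
L = 0.75 * x - 0.045 * x**2 - 0.025 * x**3
g = 0.75 - 0.09 * x - 0.075 * x**2           # gradient (score)

# The grid maximum of L coincides with the gradient's sign change (+ to -)
i_max = L.argmax()
i_cross = np.where(np.diff(np.sign(g)) < 0)[0][0]
print("grid max of L at x1 =", x[i_max])
print("gradient changes sign between", x[i_cross], "and", x[i_cross + 1])
```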
Use of Maximum Likelihood Techniques
[Figure: plots of the example log-likelihood function and its gradient function over the range of x1]
Use of Maximum Likelihood Techniques
• Similar to what we did with respect to NLS estimation
• One can approximate the gradient of the score of the log-likelihood [i.e., the Hessian of L(•)] between two points Θ0 and Θ1 via a Taylor series approximation
• To see this, let's look at the following gradient (score) function, assuming we have a single parameter
Use of Maximum Likelihood Techniques
[Figure: the score function ∂L(Θ)/∂Θ plotted against Θ, with point A at (Θ0, ∂L(Θ)/∂Θ|Θ0) and point B at (Θ1, ∂L(Θ)/∂Θ|Θ1); the slope of the chord between A and B approximates the second derivative of L(Θ) between Θ0 and Θ1]
Use of Maximum Likelihood Techniques
• From the above we have the first-order Taylor series approximation to the gradient of the log-likelihood function around the original parameter value Θ0:
  ∂L(Θ)/∂Θ|Θ1 ≈ ∂L(Θ)/∂Θ|Θ0 + [∂²L(Θ)/∂Θ²|Θ0](Θ1 − Θ0)
• This implies: G(Θ1) ≈ G(Θ0) + H(Θ0)(Θ1 − Θ0)
Use of Maximum Likelihood Techniques
• We know that at the optimum ∂L(Θ)/∂Θ = 0, so if Θ1 is the optimum and Θ0 is the original parameter value:
  0 ≈ G(Θ0) + H(Θ0)(Θ1 − Θ0) → Θ1 ≈ Θ0 − H(Θ0)⁻¹G(Θ0)
  where H(·) ≡ Hessian of the log-likelihood and G(·) ≡ gradient of the log-likelihood
Use of Maximum Likelihood Techniques
• Therefore, in the neighborhood of the optimum we have:
  Θj+1 − Θj = −H(Θj)⁻¹G(Θj) → Θj+1 = Θj − H(Θj)⁻¹G(Θj)
  where j indexes the current iteration, Θ and G(·) are (K × 1), and H(·) is (K × K)
• The above is a general method for moving from an initial parameter vector guess to a numerical solution that maximizes the log-likelihood
• It is an iterative process
• One continues until there is no change in parameter values across iterations
• What represents "no change" is defined by you, given that for large models one never achieves absolutely zero change
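A minimal sketch of this Newton-Raphson iteration (illustrative, not from the slides), applied to the single-parameter cubic log-likelihood example above:

```python
# Newton-Raphson iteration theta_{j+1} = theta_j - H(theta_j)^{-1} G(theta_j),
# applied to the single-parameter example L = 0.75x - 0.045x^2 - 0.025x^3.

def G(x):            # gradient (score) of L
    return 0.75 - 0.09 * x - 0.075 * x**2

def H(x):            # Hessian (2nd derivative) of L
    return -0.09 - 0.15 * x

theta = 1.0                          # initial guess
for j in range(100):
    step = -G(theta) / H(theta)      # -H^{-1} G reduces to a scalar division here
    theta += step
    if abs(step) < 1e-8:             # "no change" tolerance chosen by the user
        break
print("theta_ML ≈", theta)           # converges to roughly 2.62
```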
Use of Maximum Likelihood Techniques
Previous result: Θj+1 = Θj − H(Θj)⁻¹G(Θj)
• The general structure of Maximum Likelihood (ML) estimation algorithms can be represented via the following flowchart
• It should be noted that in this discussion I will be generating an overview of these algorithms using the following general model specification: y = f(X, β) + e, where e ~ N(0, σ²IT)
• In a later section I will review the use of ML techniques in estimating the CRM with general error variances
Use of Maximum Likelihood Techniques Θ=(β,σ2) General Alogorithm yt=f(Xt,)+et et~N(0,2) Generate Initial Guess for Θ, Θn L(Θ|y,X) Approx. dL(Θ)/dΘ by First-Order Taylor Series Around Θn Flexible Functional Form Update Estimate of Θ to Θn+1 No P may or may not be the Hessian depending on algorithm Θn = Θn+1 Check Optimality of L(Θ) Yes ML, σ2ML, 43
Use of Maximum Likelihood Techniques
• Consider the model: yt = f(Xt, β) + et, et ~ N(0, σ²)
• Sample likelihood function (normally distributed homoscedastic errors):
  l(β, σ² | y, X) = (2πσ²)^(−T/2) exp[−S(β)/(2σ²)]
  where S(β) = Σt [yt − f(Xt, β)]² is the sum of squared errors, e′e
Use of Maximum Likelihood Techniques
• Sample log-likelihood function:
  L(β, σ² | y, X) = −(T/2) ln(2π) − (T/2) ln(σ²) − S(β)/(2σ²)
• There are 2 sets of parameters to estimate, β and σ², regardless of f(•)
• For nonlinear functions f(·), it is not in general possible to find an analytical expression for the ML estimator of β where ∂L/∂β = 0
• It is, however, possible to find an expression for the ML estimator of σ²
• Differentiating the above wrt σ² and setting it to 0 → σ²ML = S(β)/T
Use of Maximum Likelihood Techniques
• Substituting σ²ML = S(β)/T back in, it is now possible to write the LLF just in terms of β, which results in the concentrated log-likelihood function (concentrated in terms of β):
  Lc(β | y, X) = −(T/2)[ln(2π) + 1] − (T/2) ln[S(β)/T]
  where −(T/2)[ln(2π) + 1] is a constant
Use of Maximum Likelihood Techniques
• → the ML estimator of β is identical to the estimator that minimizes S(β)
• What does this mean wrt the CRM and the use of NLS if we have normally distributed, homoscedastic, non-autocorrelated errors? Remember that f(X, β) may be nonlinear
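A small sketch of this equivalence (illustrative only; the nonlinear model form below is hypothetical and not from the slides): minimizing S(β), i.e. NLS, gives the ML estimate of β, and the ML variance estimate is then S(β̂)/T.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical nonlinear model y = b1*(1 - exp(-b2*x)) + e with normal iid errors
rng = np.random.default_rng(3)
T = 150
x = rng.uniform(0.5, 5.0, size=T)
beta_true = np.array([2.0, 0.8])
y = beta_true[0] * (1 - np.exp(-beta_true[1] * x)) + rng.normal(scale=0.2, size=T)

def resid(beta):
    """Residuals y_t - f(X_t, beta); S(beta) is their sum of squares."""
    return y - beta[0] * (1 - np.exp(-beta[1] * x))

fit = least_squares(resid, x0=[1.0, 1.0])    # minimizes S(beta): NLS = ML for beta
beta_ml = fit.x
sigma2_ml = (resid(beta_ml) ** 2).sum() / T  # ML variance estimate S(beta_ml)/T
print(beta_ml, sigma2_ml)
```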
Use of Maximum Likelihood Techniques
• Consider the model: yt = f(Xt, β) + et, et ~ N(0, σ²) (note the general functional form f)
• Let's review how we can use the above information to find the values of β that maximize the associated likelihood function
Use of Maximum Likelihood Techniques
• Three (of many) maximum likelihood (ML) estimation algorithms:
• Newton-Raphson (NR)
• Method of Scoring (Gauss-Newton, GN)
• BHHH (Berndt, Hall, Hall and Hausman)
• The general NR, GN and BHHH algorithms can be represented via the following iteration step, derived earlier as the NR result Θn+1 = Θn − H(Θn)⁻¹G(Θn); the algorithms differ only in how Pn, used in place of H(•)⁻¹, is defined:
  Θn+1 = Θn − Pn G(Θn)   (JHGLL, pp. 524-527)
Use of Maximum Likelihood Techniques
• The algorithms differ in the calculation of Pn:
• NR: Pn is based on the actual Hessian of the log-likelihood, H(Θn)
• Method of Scoring: Pn is based on the expected Hessian (the information matrix), as reviewed earlier in terms of this problem
• BHHH: Pn is based on the sum of outer products of the observation-level gradients, G′G
• In terms of the above general model, the update may also include a variable step length tn (a scalar)
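To tie things together, here is a rough sketch (illustrative assumptions, not the slides' code) of the generic update with a pluggable Pn, using the BHHH choice Pn = (G′G)⁻¹ for the simple iid N(μ, σ²) example used earlier; a production implementation would add step-length control.

```python
import numpy as np

# Generic iteration with the BHHH weighting matrix for the iid N(mu, sigma^2) model.
rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.5, size=300)

def score_matrix(theta):
    """T x 2 matrix of observation-level scores g_t for (mu, sigma^2)."""
    mu, s2 = theta
    return np.column_stack([
        (y - mu) / s2,
        -0.5 / s2 + (y - mu) ** 2 / (2 * s2 ** 2),
    ])

theta = np.array([y.mean(), 1.0])            # initial guess (sample mean, unit variance)
for n in range(200):
    G = score_matrix(theta)
    g = G.sum(axis=0)                        # total sample gradient
    P = np.linalg.inv(G.T @ G)               # BHHH weighting matrix (G'G)^{-1}
    step = P @ g                             # ascent direction (P is positive definite)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-8:          # stop when parameters no longer change
        break
print(theta)   # close to the sample mean and the T-divisor sample variance
```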