Maximum Likelihood Estimation Psych 818 - DeShon
MLE vs. OLS • Ordinary Least Squares Estimation • Typically yields a closed form solution that can be directly computed • Closed form solutions often require very strong assumptions • Maximum Likelihood Estimation • Default Method for most estimation problems • Generally equal to OLS when OLS assumptions are met • Method yields desirable “asymptotic” estimation properties • Foundation for Bayesian inference • Requires numerical methods :(
MLE logic • MLE reverses the probability inference • Recall: p(X|θ) • θ represents the parameters of a model (i.e., pdf) • What's the probability of observing a score of 73 from a N(70,10) distribution? • In MLE, you know the data (Xi) • Primary question: Which of a potentially infinite number of distributions is most likely responsible for generating the data? • p(θ|X)?
Likelihood • Likelihood may be thought of as an unbounded or unnormalized probability measure • PDF is a function of the data given the parameters on the data scale • Likelihood is a function of the parameters given the data on the parameter scale
Likelihood • Likelihood function • Likelihood is the joint (product) probability of the observed data given the parameters of the pdf • Assume you have X1,…,Xn independent samples from a given pdf, f
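Written out, with θ denoting the pdf parameters, this is the standard definition:

L(\theta \mid X_1, \ldots, X_n) = \prod_{i=1}^{n} f(X_i \mid \theta)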
Likelihood • Log-Likelihood function • Working with products is a pain • maxima are unaffected by monotone transformations, so can take the logarithm of the likelihood and turn it into a sum
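Taking logs turns the product into a sum:

\log L(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)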
Maximum Likelihood • Find the value(s) of θ that maximize the likelihood function • Can sometimes be found analytically • Maximization (or minimization) is a standard calculus problem: take the derivative of the function, set it to zero, and solve • Often requires iterative numerical methods
Likelihood • Normal Distribution example • pdf: • Likelihood • Log-Likelihood • Note: C is a constant that vanishes once derivatives are taken
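Written out for the normal case:

pdf: f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Likelihood: L(\mu, \sigma \mid X) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)

Log-likelihood: \log L(\mu, \sigma \mid X) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i-\mu)^2 + C, \quad C = -\frac{n}{2}\log(2\pi)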
Likelihood • Can compute the maximum of this log-likelihood function directly • More relevant and fun to estimate it numerically!
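For reference, setting the derivatives of the log-likelihood to zero gives the familiar closed-form estimates:

\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2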
Normal Distribution example • Assume you obtain 100 samples from a normal distribution • rv.norm <- rnorm(100, mean=5, sd=2) • This is the true data generating model! • Now, assume you don’t know the mean of this distribution and we have to estimate it… • Let’s compute the log-likelihood of the observations for N(4,2)
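A minimal, reproducible version of this data-generating step. The seed value is arbitrary and was not part of the original example, so the numbers it produces will differ somewhat from those shown on the following slides:

set.seed(818)                            # arbitrary seed so results can be reproduced
rv.norm <- rnorm(100, mean = 5, sd = 2)  # 100 draws from the true model, N(5, 2)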
Normal Distribution example • sum(dnorm(rv.norm, mean=4, sd=2, log=T)) • With log=T, dnorm gives the log of the density of each observation under the specified distribution • Summing across observations gives the log-likelihood • = -221.0698 • This is the log-likelihood of the data for the given pdf parameters • Okay, this is the log-likelihood for one possible distribution… we need to examine it for all possible distributions and select the one that yields the largest value
Normal Distribution example • Make a sequence of possible means • m<-seq(from = 1, to = 10, by = 0.1) • Now, compute the log-likelihood for each of the possible means • This is a simple “grid search” algorithm • log.l<-sapply(m, function (x) sum(dnorm(rv.norm, mean=x, sd=2, log=T)) )
Normal Distribution example

      mean      log.l
  1    1.0  -417.3891
  2    1.1  -407.2201
  3    1.2  -397.3012
  ...
 39    4.8  -206.7180
 40    4.9  -206.0490
 41    5.0  -205.6301
 42    5.1  -205.4611
 43    5.2  -205.5421
 44    5.3  -205.8731
 45    5.4  -206.4542
 ...
 90    9.9  -491.3502
 91   10.0  -503.4312

Why are these numbers negative?
Normal Distribution example • dnorm gives us the density (the height of the pdf) at each observation under the given distribution • The log of a value between 0 and 1 is negative • log(.05) = -2.99 • What's the MLE? • m[which(log.l==max(log.l))] • = 5.1
Normal Distribution example • What about estimating both the mean and the SD simultaneously? • Use the grid search approach again… • Compute the log-likelihood at each combination of mean and SD

        SD  mean       log.l
   1   1.0   1.0  -1061.6201
   2   1.0   1.1  -1022.2843
   3   1.0   1.2   -983.9486
  ...
  19   1.0   2.8   -506.5767
  20   1.0   2.9   -485.2410
  ...
 858   1.9   4.8   -205.3779
 859   1.9   4.9   -205.0078
 860   1.9   5.0   -204.9148
 861   1.9   5.1   -205.0988
 862   1.9   5.2   -205.5599
  ...
6137   7.7   4.9   -298.9950
6138   7.7   5.0   -298.9893
6139   7.7   5.1   -299.0006
  ...
6151   7.7   6.3   -300.4506
Normal Distribution example • Get max(log.l) • m[which(log.l==max(log.l), arr.ind=T)] • = 5.0, 1.9 • Note: this could be done the same way for a simple linear regression (2 parameters)
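The code that builds and searches the 2-D grid was not shown; a minimal sketch of how it might be done (the object names here are illustrative, not from the original slides):

# Build a grid of candidate (mean, sd) pairs and evaluate the log-likelihood at each
grid <- expand.grid(m = seq(1, 10, by = 0.1), s = seq(1, 10, by = 0.1))
grid$log.l <- mapply(function(m, s) sum(dnorm(rv.norm, mean = m, sd = s, log = TRUE)),
                     grid$m, grid$s)
grid[which.max(grid$log.l), ]   # for the slide's data this picks mean = 5.0, sd = 1.9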
Algorithms • Grid search works for these simple problems with few estimated parameters • Much more advanced search algorithms are needed for more complex problems • More advanced algorithms take advantage of the slope or gradient of the likelihood surface to make good guesses about the direction of search in parameter space • We'll use the mle() routine in R (from the stats4 package)
Algorithms • Grid Search: • Vary each parameter in turn, compute the log-likelihood, and find the parameter combination yielding the highest log-likelihood (equivalently, the lowest -log-likelihood) • Gradient Search: • Vary all parameters simultaneously, adjusting the relative magnitudes of the variations so that the direction of propagation in parameter space is along the direction of steepest ascent of the log-likelihood • Expansion Methods: • Find an approximate analytical function that describes the log-likelihood hypersurface and use this function to locate the maximum. Fewer points need to be computed, but the computations are considerably more complicated. • Marquardt Method: a gradient-expansion combination
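As a concrete illustration of a gradient-style search (not code from the original slides), R's general-purpose optimizer optim() can minimize the negative log-likelihood directly; the function name, parameterization, and starting values below are assumptions made for the sketch:

# Gradient-based (quasi-Newton) minimization of the -log-likelihood with optim().
# The sd is parameterized on the log scale so it stays positive during the search.
neg.ll <- function(par) -sum(dnorm(rv.norm, mean = par[1], sd = exp(par[2]), log = TRUE))
fit <- optim(par = c(4, log(3)), fn = neg.ll, method = "BFGS")
c(mean = fit$par[1], sd = exp(fit$par[2]))   # back-transform to get the sd estimate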
R – mle routine • First we need to define a function to maximize • Wait! Most general-purpose routines focus on minimization • e.g., root finding for solving equations • So, we usually minimize the -log-likelihood

norm.func <- function(x, y) {
  # -log-likelihood of N(x, y) for the rv.norm data (x = mean, y = sd)
  sum(sapply(rv.norm, function(z) -1 * dnorm(z, mean = x, sd = y, log = TRUE)))
}
R – mle routine

library(stats4)   # the mle() routine lives in the stats4 package
norm.mle <- mle(norm.func, start = list(x = 4, y = 2), method = "L-BFGS-B", lower = c(0, 0))

• Many interesting points • Starting values • Global vs. local maxima or minima • Bounds • SD can't be negative
R – mle routine • Output - summary(norm.mle) • Standard errors come from the inverse of the Hessian matrix • Convergence!! • -2(log-likelihood) = deviance • Functions like the R2 in regression

Coefficients:
   Estimate Std. Error
x  4.844249  0.1817031
y  1.817031  0.1284834

-2 log L: 403.2285

> norm.mle@details$convergence
[1] 0
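A few standard follow-up calls on the fitted stats4 object (a brief sketch; these accessors were not shown on the original slides):

sqrt(diag(vcov(norm.mle)))          # standard errors: square roots of the inverse-Hessian diagonal
logLik(norm.mle)                    # the maximized log-likelihood
-2 * as.numeric(logLik(norm.mle))   # the deviance, -2 log L, as reported above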
Maximum Likelihood Regression • A standard regression: • May be broken down into two components
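Written out, the standard decomposition is a deterministic (structural) component plus a normally distributed error component:

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)

equivalently, \quad y_i \sim N(\beta_0 + \beta_1 x_i, \; \sigma^2)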
Maximum Likelihood Regression • First define our x's and y's

x <- 1:100
y <- 4 + 3*x + rnorm(100, mean = 5, sd = 20)

• Define the -log-likelihood function (note: in this parameterization b0 is the slope and b1 the intercept)

reg.func <- function(b0, b1, sigma) {
  if (sigma <= 0) return(NA)                       # no sd of 0 or less!
  yhat <- b0*x + b1                                # the estimated regression function
  -sum(dnorm(y, mean = yhat, sd = sigma, log = T)) # the -log-likelihood
}
Maximum Likelihood Regression • Call mle() to minimize the -log-likelihood

lm.mle <- mle(reg.func, start = list(b0 = 2, b1 = 2, sigma = 35))

• Get results - summary(lm.mle)

Coefficients:
        Estimate  Std. Error
b0      3.071449   0.0716271
b1      8.959386   4.1663956
sigma  20.675930   1.4621709

-2 log L: 889.567
Maximum Likelihood Regression • Compare to OLS results • lm(y ~ x)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.95635    4.20838   2.128   0.0358 *
x             3.07149    0.07235  42.454   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.88 on 98 degrees of freedom
Multiple R-squared: 0.9484
Standard Errors of Estimates • The behavior of the likelihood function near the maximum is important • If it is flat, the observations have little to say about the parameters • Changes in the parameters will not cause large changes in the probability • If the likelihood has a pronounced peak near the maximum, then small changes in the parameters cause large changes in the probability • In that case, we say the observations carry more information about the parameters • This is expressed as the second derivative (or curvature) of the log-likelihood function • If there is more than 1 parameter, the matrix of 2nd partial derivatives
Standard Errors of Estimates • Curvature is the rate of change of the slope, i.e., the second derivative of a function (just as velocity is the first derivative of position and acceleration the second) • The Hessian matrix is the matrix of 2nd partial derivatives of the -log-likelihood function • The entries of the Hessian, evaluated at the estimates, are called the observed information for an estimate
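A sketch of how the Hessian can be obtained numerically in R and turned into standard errors (an illustration using optim, not the slides' code; the function name, guard value, and starting values are assumptions):

# Numerically obtain the Hessian of the -log-likelihood at its minimum
neg.ll2 <- function(par) {
  if (par[2] <= 0) return(1e10)   # crude guard: the sd must be positive
  -sum(dnorm(rv.norm, mean = par[1], sd = par[2], log = TRUE))
}
fit <- optim(par = c(4, 3), fn = neg.ll2, hessian = TRUE)
sqrt(diag(solve(fit$hessian)))    # standard errors = sqrt of the diagonal of the inverse Hessian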
Standard Errors • The information is used to obtain the expected variance (and hence the standard error) of the estimated parameters • When the sample size becomes large, the maximum likelihood estimator becomes approximately normally distributed, with variance close to the inverse of the information • More precisely…
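The usual statement of this result:

\hat{\theta} \;\overset{a}{\sim}\; N\!\left(\theta, \; I(\hat{\theta})^{-1}\right), \qquad I(\hat{\theta}) = -\frac{\partial^2 \log L}{\partial \theta \, \partial \theta'}\bigg|_{\hat{\theta}}, \qquad \widehat{\mathrm{Var}}(\hat{\theta}) \approx I(\hat{\theta})^{-1}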
Likelihood Ratio Test • Let LF be the maximum of the likelihood function for an unrestricted model • Let LR be the maximum of the likelihood function of a restricted model nested in the full model • LF must be greater than or equal to LR • Removing a variable or adding a constraint can only hurt model fit. Same logic as R2 • Question: Does adding the constraint or removing the variable (constraint of zero) significantly impact model fit? • Model fit will decrease but does it decrease more than would be expected by chance?
Likelihood Ratio Test • Likelihood Ratio • R = -2ln(LR / LF) • R = 2(log(LF) - log(LR)) • Under the null hypothesis, R is distributed as a chi-square with m degrees of freedom • m is the difference in the number of estimated parameters between the two models • The expected value of a chi-square with m df is m, so an R much bigger than m suggests the constraint hurts model fit more than chance alone would • More formally… reference the chi-square distribution with m degrees of freedom to find the probability of getting an R this large by chance alone, assuming that the null hypothesis (the constrained model) is true
Likelihood Ratio Example • Go back to our simple regression example • Does the variable (X) significantly improve our predictive ability or model fit? • Alternatively, does removing X, or constraining its parameter estimate to zero, significantly decrease prediction or model fit? • Full Model: -2log-L = 889.567 • Reduced Model: -2log-L = 1186.05 • R = 1186.05 - 889.567 = 296.48 with 1 degree of freedom • Chi-square critical value (1 df, α = .05) = 3.84, so X significantly improves model fit
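The same comparison in R, using the two deviances reported above (a sketch of the calculation, not code from the slides):

R.stat <- 1186.05 - 889.567                   # difference of the two -2 log-likelihoods
R.stat                                        # 296.48, far beyond the 3.84 critical value
pchisq(R.stat, df = 1, lower.tail = FALSE)    # p-value for the likelihood ratio test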
Fit Indices • Akaike’s information criterion (AIC) • Pronounced “Ah-kah-ee-key” • K is the number of estimated parameters in our model. • Penalizes the log-likelihood for using many parameters to increase fit • Choose the model with the smallest AIC value
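In symbols, with K the number of estimated parameters, the standard definition is:

\mathrm{AIC} = -2 \log L + 2K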
Fit Indices • Bayesian Information Criterion (BIC) • AKA- SIC for Schwarz Information Criterion • Choose the model with the smallest BIC • the likelihood is the probability of obtaining the data you did under the given model. It makes sense to choose a model that makes this probability as large as possible. But putting the minus sign in front switches the maximization to minimization
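And the corresponding standard definition, with n the sample size:

\mathrm{BIC} = -2 \log L + K \log(n)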
Multiple Regression • -Log-likelihood function for multiple regression

# Note: theta is a vector of parameters, with the error variance being the first element
# (so sd = sqrt(theta[1])); theta[-1] is all values of theta except the first (the regression
# coefficients), and X %*% theta[-1] uses matrix multiplication to form the predicted values
ols.lf3 <- function(theta, y, X) {
  if (theta[1] <= 0) return(NA)   # the variance must be positive
  -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
}
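A hedged sketch of how this function might be used with optim (the data, design matrix, and starting values below are illustrative assumptions, not from the original slides):

# Illustrative data: two predictors plus an intercept column in the design matrix X
set.seed(1)                                   # arbitrary seed, for reproducibility only
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 1 + 2*x1 - 1.5*x2 + rnorm(100, sd = 3)
X  <- cbind(1, x1, x2)

# theta = (error variance, intercept, slope for x1, slope for x2)
fit <- optim(par = c(var(y), 0, 0, 0), fn = ols.lf3, y = y, X = X,
             control = list(maxit = 2000), hessian = TRUE)
fit$par                          # MLEs: error variance followed by regression coefficients
sqrt(diag(solve(fit$hessian)))   # approximate standard errors from the inverse Hessian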