Lecture: Generalized Linear Regression Models. Antonia Rom
Chapter 4 - Modeling of Binary Data
Introduction
- What is important in modeling?
- Problems, obstacles
4.1 Maximum Likelihood Estimation
- What is ML estimation?
- Single binary response
- Grouped data
- Asymptotic properties
- Existence of ML estimates
- Estimation conditioned on predictor values
Introduction
A generalized linear model consists of:
- a probability distribution from the exponential family
- a linear predictor η = Xβ
- a response function h with E(Y) = h(η), or equivalently a link function g = h^{-1} with g(E(Y)) = η
Binary regression model:
$$\pi(x) = P(y = 1 \mid x) = h(x^T \beta)$$
Here π(x) is the response probability, h is the response function and x^T β is the linear predictor. h is a fully specified function. In this chapter the logit model is used, in which case h is the logistic distribution function:
$$h(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)}$$
Introduction
- The link function is the inverse of the response function, g = h^{-1}. It determines the functional form of the response probabilities.
- The linear predictor determines which variables are included and in what form they determine the response.
- The unknown parameters β can be estimated by maximum likelihood. The estimates are computed with an iterative algorithm.
- Linear predictors can contain polynomial terms of continuous variables, dummy variables and interaction effects.
Care should be taken when specifying the constituents of the model, such as the linear predictor!
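A minimal Python sketch of how the linear predictor, the response function and the link function fit together in the logit model. The coefficient values and the helper names (`response`, `link`, `linear_predictor`) are illustrative assumptions and do not come from the lecture.

```python
import math

def response(eta: float) -> float:
    """Logistic response function h: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def link(pi: float) -> float:
    """Logit link g = h^{-1}: maps a probability back to the linear predictor."""
    return math.log(pi / (1.0 - pi))

def linear_predictor(x: list[float], beta: list[float]) -> float:
    """eta = x^T beta for one observation."""
    return sum(xj * bj for xj, bj in zip(x, beta))

# Hypothetical coefficients: intercept, linear term, quadratic term
beta = [-1.0, 0.8, -0.1]

# One continuous covariate entering the predictor in polynomial form
x_raw = 2.0
x = [1.0, x_raw, x_raw ** 2]

eta = linear_predictor(x, beta)
pi = response(eta)
print(f"eta = {eta:.3f}, pi = {pi:.3f}, g(pi) = {link(pi):.3f}")
```

Applying `link` to `response(eta)` returns `eta` again, which is all that "the link is the inverse of the response function" means in practice.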
Introduction
What is important in modeling?
- Discrepancy between data and model: does the fit of the model support the inferences drawn from it?
- Relevance of variables and form of the linear predictor: which variables should be included, and how?
- Explanatory power of the covariates
- Prognostic power of the model
- Choice of link function: which link function fits the data well and has a simple interpretation?
These aspects are not independent:
- The model should give an appropriate approximation with a simple predictor; the specification determines the goodness of fit.
- Work on the linear predictor aims at finding an adequate form of the covariates and at reducing the variable set.
- Explanatory value aims at quantifying the effect of the covariates within the model.
The first topic is estimation: maximum likelihood estimation.
Maximum-Likelihood Estimation
The basic principle is to construct the likelihood of the unknown parameters for the sample data: which parameter values (for example mean and variance) make the observed sample most likely? The distribution has to be known!
The likelihood represents the joint probability or probability density of the observed data, considered as a function of the unknown parameters.
What does this mean in practice?
Example: two MP3 players, exactly the same, with shuffle mode only and no display. One holds 5 songs, the other 20 songs. Each player contains your favorite song. Unfortunately you have mixed them up. You take one, turn it on, and your favorite song is played. If you had to bet, which player would you choose? The one with 5 songs! With it, the chance of hearing your favorite song first is 1/5 instead of 1/20.
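The MP3 example can be written out as a tiny likelihood comparison. This is only a sketch, and it assumes that shuffle mode picks every song on a player with equal probability; the dictionary keys are made-up labels.

```python
from fractions import Fraction

# Number of songs on each candidate player (the unknown "parameter")
songs_on_player = {"5-song player": 5, "20-song player": 20}

# Likelihood of each player = probability that the favorite song is played first,
# assuming shuffle mode picks each song with equal probability
likelihood = {name: Fraction(1, n) for name, n in songs_on_player.items()}

# The maximum likelihood choice is the player with the largest likelihood
ml_choice = max(likelihood, key=likelihood.get)

for name, value in likelihood.items():
    print(f"L({name}) = {value}")
print("ML choice:", ml_choice)
```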
Maximum-Likelihood Estimation
An event A has happened. One wants to draw inference about an underlying quantity B (e.g. a particular parameter). To do so, one looks at the conditional probability of A for every candidate value b̂_i of B, assuming b̂_i is true. The value b̂_i for which P(A|b̂_i) is largest is the best estimate of B.
The conditional probability P(A|b̂_i) is evaluated for the given event A. P(A|b̂_i) is also called L(b̂_i), the likelihood of b̂_i. The ML estimator is the value for which the likelihood is a maximum, hence the name maximum likelihood.
Maximum-Likelihood Estimation
If there are too many candidate values b̂_i, it is not possible to calculate every L(b̂_i). In this case a likelihood function L(b) is set up and its maximum is found by solving L'(b̂) = 0.
Fact (1): The likelihood L(x) is not the probability that x occurs. It is the conditional probability of the event z that has already been observed, given that x holds: L(x) = P(z|x).
Fact (2): The likelihoods of all candidate values do not have to sum to 1.
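As an illustration of both routes, the sketch below compares a search over a grid of candidate values with the solution of L'(b) = 0. It assumes a simple Bernoulli setting that is not part of the slides: 7 successes in 10 independent trials, where L(b) = b^7 (1 - b)^3 and L'(b) = 0 gives b = 7/10.

```python
def likelihood(b: float, successes: int, trials: int) -> float:
    """Likelihood of success probability b for independent Bernoulli trials."""
    return b ** successes * (1.0 - b) ** (trials - successes)

# Hypothetical data: 7 successes in 10 trials
successes, trials = 7, 10

# Route 1: evaluate the likelihood for a finite grid of candidate values
grid = [i / 100 for i in range(1, 100)]
values = [likelihood(b, successes, trials) for b in grid]
b_grid = grid[values.index(max(values))]

# Route 2: solve L'(b) = 0 analytically, which gives b = successes / trials
b_closed = successes / trials

print("grid maximum:", b_grid)
print("closed form: ", b_closed)
print("sum of likelihoods over the grid:", sum(values))
```

The last line also illustrates Fact (2): the likelihood values of the candidates are not probabilities of the candidates, so their sum is not 1.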
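MLE - Definition: Single binary response
For a single binary response y_i with π(x_i) = P(y_i = 1 | x_i), the likelihood contribution is
$$L_i(\beta) = \pi(x_i)^{y_i} (1 - \pi(x_i))^{1 - y_i}$$
Observations are considered independent, so the likelihood of the sample is the product
$$L(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i} (1 - \pi(x_i))^{1 - y_i}$$
The maximum likelihood estimate β̂ is the value of β that maximizes the likelihood. Values of L can get very small, so the log-likelihood is used instead:
$$l(\beta) = \sum_{i=1}^{n} \left[ y_i \log \pi(x_i) + (1 - y_i) \log(1 - \pi(x_i)) \right]$$
The value β̂ can be obtained by solving the system of equations
$$s(\hat\beta) = \left. \frac{\partial l(\beta)}{\partial \beta} \right|_{\beta = \hat\beta} = 0$$
The vector of derivatives s(β) is the so-called score function. The equations are solved iteratively.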
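The iterative solution of the score equations can be sketched in plain Python for a logit model with an intercept and one covariate. This is an illustration under assumptions, not the lecture's algorithm: the data are made up, the update is a Newton (Fisher scoring) step, and the function names are hypothetical. For the logit model the score is the sum of x_i (y_i - π(x_i)) and the information matrix is the sum of x_i x_i^T π(x_i)(1 - π(x_i)).

```python
import math

def prob(x: float, beta: list[float]) -> float:
    """pi(x) = h(beta0 + beta1 * x) with the logistic response function h."""
    return 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))

def log_likelihood(beta: list[float], xs: list[float], ys: list[int]) -> float:
    return sum(
        y * math.log(prob(x, beta)) + (1 - y) * math.log(1.0 - prob(x, beta))
        for x, y in zip(xs, ys)
    )

def score(beta: list[float], xs: list[float], ys: list[int]) -> list[float]:
    """Score of the logit model: sum of x_i * (y_i - pi(x_i)), with x_i = (1, x)."""
    resid = [y - prob(x, beta) for x, y in zip(xs, ys)]
    return [sum(resid), sum(x * r for x, r in zip(xs, resid))]

def fit_logit(xs, ys, tol: float = 1e-10, max_iter: int = 50) -> list[float]:
    """Solve s(beta) = 0 by repeated steps beta <- beta + F(beta)^{-1} s(beta)."""
    beta = [0.0, 0.0]
    for _ in range(max_iter):
        s0, s1 = score(beta, xs, ys)
        w = [prob(x, beta) * (1.0 - prob(x, beta)) for x in xs]
        # Entries of the symmetric 2x2 information matrix F(beta)
        f00 = sum(w)
        f01 = sum(wi * x for wi, x in zip(w, xs))
        f11 = sum(wi * x * x for wi, x in zip(w, xs))
        det = f00 * f11 - f01 * f01
        # Step = F^{-1} s, using the explicit inverse of a 2x2 matrix
        d0 = (f11 * s0 - f01 * s1) / det
        d1 = (f00 * s1 - f01 * s0) / det
        beta = [beta[0] + d0, beta[1] + d1]
        if max(abs(d0), abs(d1)) < tol:
            break
    return beta

# Hypothetical data with overlapping 0 and 1 responses
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

beta_hat = fit_logit(xs, ys)
print("beta_hat:", beta_hat)
print("score at beta_hat:", score(beta_hat, xs, ys))
```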
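MLE - Definition: Grouped data - binomially distributed responses
At each predictor value x_i there are several independent binary responses. π is assumed to depend on x only, so the mean is the same for all binary observations at this value. With n_i observations and y_i successes at x_i, for N distinct predictor values, the model has the form:
$$y_i \sim B(n_i, \pi(x_i)), \quad i = 1, \dots, N$$
For the collection of binary variables the likelihood has the form
$$L(\beta) = \prod_{i=1}^{N} \pi(x_i)^{y_i} (1 - \pi(x_i))^{n_i - y_i}$$
The likelihood for the number of successes, L_bin(β), and the likelihood of the binary observations, L(β), differ only in the binomial factor $\binom{n_i}{y_i}$, which is irrelevant during maximization because it does not depend on β. Therefore the log-likelihood is:
$$l(\beta) = \sum_{i=1}^{N} \left[ y_i \log \pi(x_i) + (n_i - y_i) \log(1 - \pi(x_i)) \right]$$
With the relative frequencies p_i = y_i / n_i, the score function of the logit model is:
$$s(\beta) = \sum_{i=1}^{N} n_i x_i (p_i - \pi(x_i))$$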
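The claim that the binomial factor does not matter for maximization can be checked numerically. The sketch below uses made-up grouped data and two arbitrary parameter vectors: the gap between the binomial log-likelihood and the binary log-likelihood is the same constant at both. It also evaluates the grouped logit score. All names and numbers are illustrative assumptions.

```python
import math

# Hypothetical grouped data: (x_i, n_i, y_i) = predictor value, trials, successes
groups = [(0.0, 10, 2), (1.0, 12, 5), (2.0, 8, 6)]

def prob(x: float, beta: list[float]) -> float:
    """pi(x) for a logit model with intercept beta[0] and slope beta[1]."""
    return 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))

def loglik_binary(beta: list[float]) -> float:
    """Log-likelihood of the underlying binary observations (no binomial factor)."""
    return sum(
        y * math.log(prob(x, beta)) + (n - y) * math.log(1.0 - prob(x, beta))
        for x, n, y in groups
    )

def loglik_binomial(beta: list[float]) -> float:
    """Log of the binomial likelihood L_bin, including the factor C(n_i, y_i)."""
    return sum(
        math.log(math.comb(n, y) * prob(x, beta) ** y * (1.0 - prob(x, beta)) ** (n - y))
        for x, n, y in groups
    )

def score(beta: list[float]) -> list[float]:
    """Grouped logit score: sum of n_i * x_i * (p_i - pi(x_i)), with x_i = (1, x)."""
    terms = [n * (y / n - prob(x, beta)) for x, n, y in groups]
    return [sum(terms), sum(g[0] * t for g, t in zip(groups, terms))]

beta_a, beta_b = [-1.0, 0.5], [0.3, 1.2]
gap_a = loglik_binomial(beta_a) - loglik_binary(beta_a)
gap_b = loglik_binomial(beta_b) - loglik_binary(beta_b)
log_factor = sum(math.log(math.comb(n, y)) for _, n, y in groups)

print("gap at beta_a:", gap_a)
print("gap at beta_b:", gap_b)
print("log of binomial factors:", log_factor)
print("score at beta_a:", score(beta_a))
```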
MLE - Definition: Asymptotic Properties
The MLE has some favorable properties:
- Asymptotically, the ML estimator exists and is unique.
- It is consistent and asymptotically normally distributed.
- It is asymptotically efficient.
Consistency: the likelihood is a smooth, well-behaved function whose maximum is attained at a unique point. If two functions L_n (the sample version for n observations) and L (its limit) get closer, the points at which they attain their maxima should also get closer, which means that
$$\hat\beta_n \to \beta \quad \text{in probability as } n \to \infty$$
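MLE - Definition: Asymptotic Properties
Asymptotic normality: the estimator not only converges to the unknown parameter, it also converges fast enough (at rate 1/√n). In ML theory the asymptotic variance of the estimator is determined by the information matrix, or Fisher matrix,
$$F(\beta) = E\left( -\frac{\partial^2 l(\beta)}{\partial \beta \, \partial \beta^T} \right)$$
so that approximately $\hat\beta \sim N(\beta, F(\hat\beta)^{-1})$.
For binary data, with η_i = x_i^T β:
$$F(\beta) = \sum_{i=1}^{n} x_i x_i^T \frac{(\partial h(\eta_i) / \partial \eta)^2}{\pi(x_i)(1 - \pi(x_i))}$$
For grouped data:
$$F(\beta) = \sum_{i=1}^{N} n_i \, x_i x_i^T \frac{(\partial h(\eta_i) / \partial \eta)^2}{\pi(x_i)(1 - \pi(x_i))}$$
In the logit model ∂h(η_i)/∂η = π(x_i)(1 - π(x_i)), so the weight of each term reduces to π(x_i)(1 - π(x_i)).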
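A short sketch of how the Fisher matrix of the logit model turns into approximate standard errors, and of how the grouped formula agrees with the binary formula when each group is expanded into single observations. The parameter value and the data are hypothetical, and the matrix is evaluated at that fixed value rather than at a fitted estimate.

```python
import math

def prob(x: float, beta: list[float]) -> float:
    """pi(x) for a logit model with intercept beta[0] and slope beta[1]."""
    return 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))

def fisher_binary(xs: list[float], beta: list[float]) -> tuple[float, float, float]:
    """Entries (f00, f01, f11) of F = sum_i x_i x_i^T pi_i (1 - pi_i), x_i = (1, x)."""
    f00 = f01 = f11 = 0.0
    for x in xs:
        w = prob(x, beta) * (1.0 - prob(x, beta))
        f00 += w
        f01 += w * x
        f11 += w * x * x
    return f00, f01, f11

def fisher_grouped(groups: list[tuple[float, int]], beta: list[float]) -> tuple[float, float, float]:
    """Same matrix for grouped data (x_i, n_i): each term is weighted by n_i."""
    f00 = f01 = f11 = 0.0
    for x, n in groups:
        w = n * prob(x, beta) * (1.0 - prob(x, beta))
        f00 += w
        f01 += w * x
        f11 += w * x * x
    return f00, f01, f11

def standard_errors(f00: float, f01: float, f11: float) -> tuple[float, float]:
    """Square roots of the diagonal of F^{-1} for a symmetric 2x2 matrix."""
    det = f00 * f11 - f01 * f01
    return math.sqrt(f11 / det), math.sqrt(f00 / det)

beta = [-1.0, 0.5]                           # hypothetical parameter value
groups = [(0.0, 3), (1.0, 2), (2.0, 4)]      # hypothetical (x_i, n_i)
xs = [x for x, n in groups for _ in range(n)]  # the same data as single observations

F_binary = fisher_binary(xs, beta)
F_grouped = fisher_grouped(groups, beta)
se_intercept, se_slope = standard_errors(*F_grouped)
print("F (binary): ", F_binary)
print("F (grouped):", F_grouped)
print("approximate standard errors:", se_intercept, se_slope)
```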
MLE - Definition: Existence of Maximum Likelihood Estimates
For a finite sample size it may happen that ML estimates do not exist.
- ML estimates do not exist when the data set shows complete separation.
- ML estimates may not exist when the data set shows quasi-complete separation.
- ML estimates do exist when the data set shows overlap.
- ML estimates do exist when the data set shows linear dependency.
- ML estimates exist if there is no hyperplane that separates the 0 and the 1 responses.
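Why complete separation rules out a finite estimate can be seen by following the log-likelihood along one direction in parameter space. The sketch below uses two made-up data sets with one covariate. For the separated data the log-likelihood keeps increasing towards 0 as the coefficients are scaled up, so no finite maximum exists. For the overlapping data it falls again along the same direction.

```python
import math

def log_likelihood(beta0: float, beta1: float, xs: list[float], ys: list[int]) -> float:
    """Logit log-likelihood, written with log1p for numerical stability."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta0 + beta1 * x
        if y == 1:
            total += -math.log1p(math.exp(-eta))   # log pi(x)
        else:
            total += -math.log1p(math.exp(eta))    # log(1 - pi(x))
    return total

xs = [1.0, 2.0, 3.0, 4.0]
separated = [0, 0, 1, 1]     # the point x = 2.5 separates the 0 and 1 responses
overlapping = [0, 1, 0, 1]   # no separating point exists

# Scale the coefficient vector along the ray beta = c * (-2.5, 1)
scales = [1.0, 5.0, 10.0, 20.0]
ll_separated = [log_likelihood(-2.5 * c, c, xs, separated) for c in scales]
ll_overlap = [log_likelihood(-2.5 * c, c, xs, overlapping) for c in scales]

for c, a, b in zip(scales, ll_separated, ll_overlap):
    print(f"c = {c:5.1f}  separated: {a:12.6f}  overlap: {b:12.6f}")
```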
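MLE - Definition: Estimation conditioned on Predictor Values
Sometimes samples are drawn conditional on the response y. In such stratified samples one observes x values given y = 1 and x values given y = 0. A common case is the case-control study in biomedicine, where y = 1 denotes cases and y = 0 denotes controls (in econometrics this is called choice-based sampling).
Consider the simplest case of a binary predictor, with y ∈ {0, 1} and x ∈ {0, 1}, and the logit model
$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \beta_0 + x \beta$$
with β = log γ, where
$$\gamma = \frac{P(y = 1 \mid x = 1) / P(y = 0 \mid x = 1)}{P(y = 1 \mid x = 0) / P(y = 0 \mid x = 0)}$$
is the odds ratio, which contains the association between y and x. The odds ratio can equally be written in terms of the distribution of x given y:
$$\gamma = \frac{P(x = 1 \mid y = 1) / P(x = 0 \mid y = 1)}{P(x = 1 \mid y = 0) / P(x = 0 \mid y = 0)}$$
The parameter of association is therefore the same in both sampling schemes, so a sample drawn conditional on y can be used to estimate the coefficient β of the original logit model.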
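The invariance of the odds ratio can be checked on a 2x2 table. The counts below are invented for illustration: a population table, and a case-control sample that keeps all cases but only one in ten controls. Both ways of writing the odds ratio give the same value on both tables.

```python
import math
from fractions import Fraction

def odds_ratio_prospective(t: dict) -> Fraction:
    """Odds of y = 1 at x = 1 divided by odds of y = 1 at x = 0; t[y][x] are counts."""
    return Fraction(t[1][1], t[0][1]) / Fraction(t[1][0], t[0][0])

def odds_ratio_retrospective(t: dict) -> Fraction:
    """Odds of x = 1 among cases (y = 1) divided by odds of x = 1 among controls."""
    return Fraction(t[1][1], t[1][0]) / Fraction(t[0][1], t[0][0])

# Hypothetical population counts, indexed as table[y][x]
population = {0: {0: 400, 1: 200}, 1: {0: 10, 1: 30}}

# Hypothetical case-control sample: all cases, one in ten controls
case_control = {0: {0: 40, 1: 20}, 1: {0: 10, 1: 30}}

gamma_population = odds_ratio_prospective(population)
gamma_sample = odds_ratio_retrospective(case_control)
beta = math.log(gamma_sample)   # coefficient of x in the logit model

print("odds ratio in the population:", gamma_population)
print("odds ratio in the case-control sample:", gamma_sample)
print("beta = log(gamma):", beta)
```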
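MLE - Definition: Estimation conditioned on Predictor Values
This approach is motivated by the specific structure of the logit model. Recall the derivation of the binary logit model in chapter 2.2.2, which assumed that the predictors are normally distributed. Let
- f(x|r) denote the density of x given y = r
- p(r) = P(y = r) denote the marginal probability
From Bayes' theorem it follows that
$$P(y = 1 \mid x) = \frac{f(x \mid 1)\, p(1)}{f(x \mid 1)\, p(1) + f(x \mid 0)\, p(0)}$$
Therefore
$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \log \frac{p(1)}{p(0)} + \log \frac{f(x \mid 1)}{f(x \mid 0)}$$
holds. This shows that a logit model holds if log(f(x|1)/f(x|0)) has a linear form in x: the coefficients of x come from the densities, and only the intercept depends on the marginal probabilities.
The important point is that the marginals determine only the intercept!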
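A numerical check of this argument for one covariate, as a sketch under explicit assumptions: x is normal with mean mu0 given y = 0 and mean mu1 given y = 1, with a common standard deviation sigma. Under these assumptions log(f(x|1)/f(x|0)) is linear in x, with slope (mu1 - mu0)/sigma^2. All numbers and function names are hypothetical.

```python
import math

def normal_density(x: float, mu: float, sigma: float) -> float:
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_bayes(x: float, p1: float, mu0: float, mu1: float, sigma: float) -> float:
    """P(y = 1 | x) computed directly from Bayes' theorem."""
    num = normal_density(x, mu1, sigma) * p1
    return num / (num + normal_density(x, mu0, sigma) * (1.0 - p1))

def logit_coefficients(p1: float, mu0: float, mu1: float, sigma: float) -> tuple[float, float]:
    """Intercept and slope of the implied logit model (equal-variance normal case)."""
    slope = (mu1 - mu0) / sigma ** 2
    intercept = math.log(p1 / (1.0 - p1)) - (mu1 ** 2 - mu0 ** 2) / (2.0 * sigma ** 2)
    return intercept, slope

mu0, mu1, sigma = 0.0, 2.0, 1.5   # hypothetical class-conditional distributions

# Marginal probability of y = 1 in the population and in a balanced case-control sample
b0_pop, b1_pop = logit_coefficients(0.05, mu0, mu1, sigma)
b0_cc, b1_cc = logit_coefficients(0.50, mu0, mu1, sigma)

print("population:   intercept", b0_pop, "slope", b1_pop)
print("case-control: intercept", b0_cc, "slope", b1_cc)
```

Changing the marginal probability from 0.05 to 0.5 moves the intercept by the difference of the two log odds and leaves the slope untouched.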
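MLE - Definition: Estimation conditioned on Predictor Values
The likelihood of x given y differs from the likelihood of y given the predictors. By using
$$f(x \mid y) = \frac{P(y \mid x)\, f(x)}{p(y)}$$
one obtains for the log-likelihood conditional on y
$$l_{x \mid y} = \sum_{i=1}^{n} \log f(x_i \mid y_i) = \sum_{i=1}^{n} \log P(y_i \mid x_i) + \sum_{i=1}^{n} \log f(x_i) - \sum_{i=1}^{n} \log p(y_i)$$
- First term: equivalent to the conditional log-likelihood of y given x
- Second term: marginal distribution of x (can be maximized by the empirical distribution)
- Third term: marginal distribution of y (fixed by the sampling)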
Summary
General binary model:
- Link function and linear predictor
- Care should be taken when specifying these constituents!
Maximum likelihood estimation:
- The basic principle is to construct the likelihood of the unknown parameters for the sample data.
- MLE can cope with complicated linear predictors (interactions, dummy variables etc.).
- It is computed with an iterative algorithm.
Properties of the MLE:
- It is consistent and asymptotically normally distributed.
- It is asymptotically efficient (Fisher matrix).
- Maximum likelihood estimates might not exist. They do exist when the data set has overlap or linear dependency.
- Depending on the data set, ML estimation can also be conditional on the response y.
Note the subtle difference: we were interested in the probability function because it gives the probabilities of realizations for given parameters θ. With the likelihood function we take the sample as given and are interested in the unknown parameter θ that makes the realization of the given sample "most probable"!
MLE - Example
• You can now calculate the probability of Bryant scoring the amounts he actually scored.
• Basic principle of MLE: construct the likelihood of the unknown parameters for the sample data.
• Let f(ε) denote the density function for ε. (Recall that the density function is like a probability function, and that the density for a normal variable is a bell curve with its maximum at ε = 0.)
• Given the prediction M and the density function, you can compute the probability of Bryant scoring any particular point total Y. This is given by the formula f(Y-M) = f(ε).
  - For example, if you believe that M = 32, then the probability that Bryant scores 35 is given by f(35-32) = f(3).
  - If σ = 6, for example, then the normal density gives f(3) ≈ .06.
• Assume that Bryant's scoring in one game is independent of what he scored in the prior game.
  - Recall that the probability of two independent events occurring is just the product of the probabilities that each occurs.
  - It follows that the probability, or likelihood, of Bryant scoring exactly 33, 22, 25, 40, and 30 points is just the product of the probabilities of his getting each of these scores.
• Given any prediction M, you can write the likelihood score as:
  Likelihood score = L = f(33-M) · f(22-M) · f(25-M) · f(40-M) · f(30-M).
MLE - Example
• You want to find the "maximum likelihood estimator" (MLE) of M. This is the value of M that maximizes L.
  - Intuitively, you know that the MLE of M would not be 15 or 50 or some number far from his typical scoring output. It is almost impossible that a player who is predicted to score 15 points per game would actually score 33, 22, 25, 40, and 30.
  - In fact, if M = 15 and σ = 6, then L = f(33-15) · f(22-15) · f(25-15) · f(40-15) · f(30-15) = f(18) · f(7) · f(10) · f(25) · f(15) < .0000001
• But 32 might be a good candidate to be the MLE. Someone predicted to score 32 points per game has a reasonable chance of scoring 33, 22, 25, 40, and 30.
  - In this case, L = f(1) · f(-10) · f(-7) · f(8) · f(-2) ≈ .00005
  - It turns out that the MLE of M is given by the mean of the realized values of Y. That is, M = 30 and L = .00014
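The comparison of the candidate values can be sketched in a few lines of Python. The sketch assumes that f is the normal density with mean 0 and standard deviation σ = 6. The absolute likelihood values depend on how f is scaled, and the code does not try to reproduce the figures quoted above; what it checks is the ordering of the candidates and that the sample mean maximizes L.

```python
import math

scores = [33, 22, 25, 40, 30]   # Bryant's realized point totals
sigma = 6.0                     # assumed standard deviation of the error term

def normal_density(eps: float, sigma: float) -> float:
    """Density of a normal variable with mean 0 and standard deviation sigma."""
    return math.exp(-(eps ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(m: float) -> float:
    """L = product of f(Y - M) over all observed scores Y."""
    return math.prod(normal_density(y - m, sigma) for y in scores)

# Candidate predictions from the example
for m in (15, 32, 30):
    print(f"M = {m}: L = {likelihood(m):.3e}")

# Search over integer candidates and compare with the sample mean
best_on_grid = max(range(15, 51), key=likelihood)
sample_mean = sum(scores) / len(scores)
print("best integer candidate:", best_on_grid)
print("sample mean:", sample_mean)
```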