ECLT 5810: A Brief Introduction to Logistic Regression
Review on Ordinary Least Squares (OLS) Regression
• A “curve fitting on data points” procedure, achieved by minimizing the total squared distance between the curve and the data points.
• The model usually looks like y = β0 + β1x1 + β2x2 + · · · + βnxn.
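As a concrete illustration (not part of the original slides), here is a minimal sketch of the OLS idea in Python with NumPy; the two-variable data set is made up, and the point is only that the betas are chosen to minimize the total squared distance between the fitted values and the observed y.

```python
import numpy as np

# Made-up data: y depends linearly on two explanatory variables plus noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=50)

# Design matrix with a column of ones for the intercept beta_0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# OLS: choose the betas minimizing the total squared distance
# between the fitted values X @ beta and the observed y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # roughly [1.0, 2.0, -0.5]
```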
Our analysis of such models usually covers: • Whether the beta coefficients are significantly positive/negative/different from a certain value, with estimation errors considered (done by t-statistics on the beta estimates). • Whether the model has good explanatory power for the dependent variable, with estimation errors considered (done by the F-statistic on the R^2 measure). • The implications of the model, e.g., does y depend on x? To what extent? Are there any interaction effects? (done by differentiation/differencing on the estimated model). • Prediction (in interval form) of the dependent variable given the independent variables.
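The course uses SAS for this; purely as a hedged sketch of how these quantities might be read off a fitted model, the snippet below uses Python's statsmodels package on simulated data (all variable names and numbers are invented for illustration).

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: x2 is included in the model but has no real effect on y.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 2.0 * x1 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.tvalues)    # t-statistics on the beta estimates
print(fit.fvalue)     # F-statistic on the overall fit
print(fit.rsquared)   # R^2 measure
print(fit.get_prediction(X[:1]).summary_frame())  # includes interval predictions for one case
```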
Classical Assumptions for OLS However, all of this analysis is done under the following assumptions. • A1 (Linear in parameters): y = β0 + β1x1 + β2x2 + · · · + error. • A2 (No perfect collinearity): no independent variable is constant or a perfect linear combination of the others. A1 and A2 can be fulfilled by choosing a suitable form of equation.
A3 (Zero conditional mean of errors): E(error_t | X) = 0, t = 1, 2, · · ·, # of data points, • where X is the collection of all independent variables, • X = (x1, x2, · · ·, xn). • Under A1–A3 the OLS estimators are unbiased, i.e. • E(estimated βj) = βj for all j. • A4 (Homoskedasticity of errors): Var(error_t | X) = σ^2 (i.e. independent of X), t = 1, 2, · · ·. • A5 (No serial correlation in errors): Corr(error_t, error_s | X) = 0 for t ≠ s. • Under A1–A5, the OLS estimators are the minimum-variance linear unbiased estimators conditional on X.
A6 (Normality of errors): the error terms u_t are independently and identically distributed as N(0, σ^2). • Under A1–A6, the OLS estimators are normally distributed conditional on X, and the t-statistics on the parameters and the F-statistic on R^2 can be used for different kinds of statistical reasoning. • A3–A6 are usually assumed to be true unless there is significant evidence/reason against them.
Early models for classification • As our main target in data mining is to make predictions, the dependent variable is usually nominal/ordinal/binary in nature. We usually use a binary y to represent this, i.e. y = 1 for yes and y = 0 for no. • An early model is the linear probability model, which regresses the binary y on the other explanatory variables X. As y is binary, the predicted value is usually around the range 0 to 1, so people used this model to predict the probability of an event. • However, such a model violates A3, A4 and A6, and the predicted value can fall outside the range 0 to 1 (illustrated in the sketch after the next slide). The model is therefore not so useful.
The problem can be rectified by introducing a threshold: when the predicted y is greater than the threshold, we classify y as 1. This is the simplest neural network model, which will be introduced later. • However, what we obtain is then a decision rather than a probability, and the probability might be useful in some cases. Also, the relation between the probability and the explanatory variables becomes less clear. • Statisticians invented logistic regression to solve this problem.
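A small sketch (Python/NumPy, simulated data) of the linear probability model and the threshold fix described above; the 0.5 cut-off is an assumed choice, not something prescribed by the slides.

```python
import numpy as np

# Simulated binary outcome and one explanatory variable.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(float)

# Linear probability model: regress the 0/1 outcome directly on x by OLS.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
p_hat = X @ beta

# The fitted "probabilities" can fall outside [0, 1].
print(p_hat.min(), p_hat.max())

# Threshold fix: read the fitted value as a decision, not a probability.
threshold = 0.5                      # assumed cut-off
y_pred = (p_hat > threshold).astype(float)
print((y_pred != y).mean())          # misclassification rate
```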
Logistic Regression • The idea is to use a one-to-one mapping that maps the probability from the range [0, 1] to the whole real line. Then there is no problem no matter what value the right-hand side takes. • 3 common transformations/link functions (provided by SAS): • Logit: ln(p/(1−p)) (we call this the log of the odds) • Probit: normal inverse of p (recall: the normal table's mapping scheme) • Complementary log-log: ln(−ln(1−p)) • The choice of link function depends on your purpose rather than performance: they all perform about equally well, but the implications are a bit different.
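The sketch below (Python, using the standard-library NormalDist for the normal inverse) simply evaluates the three link functions on a few arbitrary probabilities, to show that each one maps (0, 1) onto the whole real line.

```python
import numpy as np
from statistics import NormalDist

p = np.array([0.05, 0.25, 0.50, 0.75, 0.95])   # arbitrary probabilities

logit   = np.log(p / (1 - p))                                     # ln(p/(1-p)), the log odds
probit  = np.array([NormalDist().inv_cdf(float(v)) for v in p])   # normal inverse of p
cloglog = np.log(-np.log(1 - p))                                  # complementary log-log

# Each link maps probabilities in (0, 1) onto the whole real line,
# so the right-hand side a + b*x may take any value.
print(logit)
print(probit)
print(cloglog)
```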
However, as the model is no longer in linear form, ordinary least squares cannot be used. Furthermore, if we put the binary y directly into the transformation, we get positive/negative infinity. • We use the Maximum Likelihood Estimation (MLE) method instead, in which we choose the beta coefficients that maximize the probability of observing the data as we actually see it (see the sketch below). • MLE needs fewer assumptions than OLS, but much less inference can be made, especially for logistic regression. • Also, as both MLE and OLS use only one beta coefficient to describe the effect an explanatory variable brings about, data scaling/normalization is particularly important.
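A minimal sketch of the MLE idea for the logit link, assuming simulated data and using scipy.optimize to maximize the log-likelihood numerically (SAS does this internally; the code is only illustrative).

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data: binary y generated from a logit model with a = 0.5, b = 1.5.
rng = np.random.default_rng(3)
x = rng.normal(size=200)
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.random(200) < true_p).astype(float)

def neg_log_likelihood(beta):
    a, b = beta
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))     # P(y = 1 | x) under the logit model
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# MLE: pick the (a, b) that make the observed data most probable.
result = minimize(neg_log_likelihood, x0=np.zeros(2))
print(result.x)   # should land near the true (0.5, 1.5)
```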
Example on Logit • Assume we believe the relation between the probability p that an event is "yes" and an independent variable x can be described by the equation ln(p(x)/(1−p(x))) = a + bx. Then p(x) = exp(a + bx) / [1 + exp(a + bx)]. If we have 4 data points (Yes, x1), (No, x2), (Yes, x3), (No, x4) and assume they are mutually independent, then the probability that we see these 4 data points is the product p(x1)·[1−p(x2)]·p(x3)·[1−p(x4)], and MLE tries to maximize this by choosing suitable a and b.
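To make the four-point example concrete, the sketch below assumes hypothetical values for x1–x4 and maximizes the likelihood product by a crude grid search over (a, b); the numbers are invented purely for illustration.

```python
import numpy as np

# Hypothetical x-values for the four points (Yes, x1), (No, x2), (Yes, x3), (No, x4).
xs = np.array([2.0, 1.0, -0.5, -1.0])
ys = np.array([1, 0, 1, 0])          # Yes = 1, No = 0

def p(x, a, b):
    """P(event = Yes | x) implied by ln(p/(1-p)) = a + b*x."""
    return np.exp(a + b * x) / (1.0 + np.exp(a + b * x))

def likelihood(a, b):
    """Probability of seeing exactly these outcomes: p(x1)[1-p(x2)]p(x3)[1-p(x4)]."""
    probs = p(xs, a, b)
    return np.prod(np.where(ys == 1, probs, 1 - probs))

# Crude grid search over (a, b); the MLE is the pair with the largest likelihood.
grid = np.linspace(-5, 5, 101)
best = max((likelihood(a, b), a, b) for a in grid for b in grid)
print(best)   # (maximized likelihood, a_hat, b_hat)
```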
Reading the Report • Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (SBC) (compare to: F-test on adjusted R^2 for OLS): - both take a smaller value for a higher maximized likelihood, and a higher value when more explanatory variables are used (to penalize over-fitting); - so a smaller value is preferred (though it is not the only consideration when choosing a model). • t-score (compare to: t-test on the estimated betas for OLS): it is the estimate divided by its standard error. We may treat it like the t-test in OLS and construct a confidence interval for the betas, but in practice it works only asymptotically. We just take a large t-score as an indicator of a possibly significant effect; no exact hypothesis test can be done.
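As a worked illustration of the two criteria (with made-up numbers), the usual formulas are AIC = −2·lnL + 2k and SBC = −2·lnL + k·ln(n).

```python
import math

# Made-up numbers: the maximized log-likelihood reported by a fitted
# logistic model, its number of parameters, and the sample size.
log_likelihood = -112.4
k = 3     # intercept + 2 slope coefficients
n = 250   # number of observations

aic = -2 * log_likelihood + 2 * k            # Akaike's Information Criterion
sbc = -2 * log_likelihood + k * math.log(n)  # Schwarz's Bayesian Criterion
print(aic, sbc)   # smaller is preferred, other things being equal
```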
Wald's chi-square (compare to: t-test for OLS): we may treat an effect as significant if the tail probability is small enough (< 5%). • If we are using the model to predict the outcome rather than the probability of that outcome (the case when the criterion is set to minimize loss), the interpretation of the misclassification rate/profit and loss/ROC curve/lift chart is similar to that for decision trees. • Some scholars suggest a prediction interval for the probability P of the event, given the independent variables, of the form P_estimated ± z_{1−α/2} · [P_estimated(1 − P_estimated)/#data]^(1/2), z being the z-score from the normal table and α the significance level. But we do not have this in SAS.
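A short sketch (Python, with hypothetical numbers) of the Wald chi-square and of the suggested interval for the event probability.

```python
import math
from statistics import NormalDist

norm = NormalDist()

# Hypothetical beta estimate and its standard error.
beta_hat, se = 0.8, 0.3
z = beta_hat / se
wald_chi_square = z ** 2                    # Wald chi-square statistic
tail_prob = 2 * (1 - norm.cdf(abs(z)))      # its chi-square(1) tail probability
print(wald_chi_square, tail_prob)           # treat the effect as significant if < 0.05

# Suggested interval for the event probability:
# P_estimated +/- z_{1-a/2} * sqrt(P_estimated * (1 - P_estimated) / n)
p_hat, n, alpha = 0.32, 200, 0.05           # hypothetical values
z_crit = norm.inv_cdf(1 - alpha / 2)
half_width = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half_width, p_hat + half_width)
```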
The interpretation of the model form is similar to OLS, using techniques like differentiation and differencing. • One common use: for a logit model of the form f(x) = ln(P(x)/(1−P(x))) = a + bx, with x binary, f(1) = a + b and f(0) = a, so f(1) − f(0) = b, i.e. b is the log of the ratio of the odds at x = 1 to the odds at x = 0. For small P(0) and P(1) the odds are approximately the probabilities, so ln(P(1)/P(0)) ≈ b and hence P(1) ≈ exp(b)·P(0), i.e. P(1) is about exp(b) times as big as P(0). We can draw conclusions like "having something (x) done increases the probability to about exp(b) times that of not having it done".
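A tiny numeric check of this reading, with an assumed (a, b): it compares the exact ratio P(1)/P(0) with exp(b) when both probabilities are small.

```python
import math

# Assumed fitted logit model with a binary x: ln(P(x)/(1-P(x))) = a + b*x.
a, b = -4.0, 1.2

def p(x):
    return math.exp(a + b * x) / (1 + math.exp(a + b * x))

print(p(0), p(1))                  # P(event | x = 0) and P(event | x = 1)
print(p(1) / p(0), math.exp(b))    # the ratio is close to exp(b) because both P's are small
```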