Lecture4,5 – Linear Regression

Rice ELEC 697 Farinaz Koushanfar Fall 2006

Presentation Transcript

  1. Lecture4,5 – Linear Regression Rice ELEC 697 Farinaz Koushanfar Fall 2006

  2. Summary • The simple linear regression model • Confidence intervals • Multiple linear regression • Model selection and shrinkage methods • Homework 0

  3. Preliminaries • Data streams X and Y, forming the measurement tuples $(x_1,y_1), \ldots, (x_N,y_N)$ • $x_i$ is the predictor (regressor, covariate, feature, independent variable) • $y_i$ is the response (dependent variable, outcome) • Denote the regression function by: $f(x) = E(Y \mid x)$ • The linear regression model assumes a specific linear form: $f(x) = \beta_0 + \beta_1 x$

  4. Linear Regression - More • The $\beta_j$'s are unknown parameters (coefficients) • The $X_j$'s can come from different sources: • Quantitative inputs • Transformations of quantitative inputs • Basis expansions, e.g. $X_2 = X_1^2$, $X_3 = X_1^3$ • Numeric or “dummy” coding of the levels of qualitative inputs • Interactions between variables, e.g. $X_3 = X_1 \cdot X_2$ • No matter the source of the $X_j$'s, the model is linear in the parameters

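  5. Fitting by Least Squares • For $(x_1,y_1), \ldots, (x_N,y_N)$, estimate the $\beta$'s! • $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is a vector of feature measurements for the i-th case • Least squares: choose $\beta$ to minimize $\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$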

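  6. How to Find Least Square Fits? • X is $N \times (p+1)$, y is the N-vector of outputs • $\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$ • If X has full column rank, then $X^TX$ is positive definite (PD), and setting the gradient of RSS to zero gives $\hat\beta = (X^TX)^{-1}X^Ty$ • The fitted values are $\hat y = X\hat\beta = Hy$, with $H = X(X^TX)^{-1}X^T$ (the hat matrix)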
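
The normal equations on this slide can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than the lecture's own code: the helper names `solve_linear_system` and `least_squares` and the toy data are assumptions made for the example, and a numerical library would normally replace the hand-written solver.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix, so that row operations update b as well.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Use the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def least_squares(rows, y):
    """Return beta_hat solving (X^T X) beta = X^T y; an intercept column is prepended."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    XtX = [[sum(x[a] * x[b] for x in X) for b in range(k)] for a in range(k)]
    Xty = [sum(x[a] * yi for x, yi in zip(X, y)) for a in range(k)]
    return solve_linear_system(XtX, Xty)


# Hypothetical toy data: y is an exact linear function of two inputs.
rows = [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (3.0, 1.0), (4.0, 2.0)]
y = [1 + 2 * x1 - x2 for x1, x2 in rows]
beta_hat = least_squares(rows, y)  # intercept, coefficient of x1, coefficient of x2
```

Because the toy responses are noise-free, the recovered coefficients match the generating values up to round-off. The solver raises an error when a pivot vanishes, which is the non-full-rank situation.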

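  7. Geometrical Representation • Least squares estimates in $\mathbb{R}^N$: the columns of X span a subspace of $\mathbb{R}^N$ • Minimize $\mathrm{RSS}(\beta) = \|y - X\beta\|^2$, so that the residual vector $y - \hat y$ is orthogonal to this subspace • H (hat matrix, a.k.a. projection matrix) computes the orthogonal projection $\hat y = Hy$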

  8. Non-full Rank Case! • What if the columns of X are linearly dependent? • $X^TX$ is singular, and the $\hat\beta$'s are not uniquely defined • $\hat y$ is still the projection of y onto the column space of X • There is just more than one way to express that projection in terms of the columns of X • Resolve by dropping redundant columns of X • Most software packages detect this and automatically implement some strategy to remove them

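  9. Sampling Properties of $\hat\beta$ • The $y_i$'s are uncorrelated with constant variance $\mathrm{Var}(y_i) = \sigma^2$, and the $x_i$'s are fixed, so $\mathrm{Var}(\hat\beta) = (X^TX)^{-1}\sigma^2$ • We use an unbiased estimator for $\sigma^2$: $\hat\sigma^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat y_i)^2$ • Assume Y is linear in $X_1, \ldots, X_p$, and the deviations are additive and distributed as $N(0, \sigma^2)$ • We can show that: $\hat\beta \sim N(\beta, (X^TX)^{-1}\sigma^2)$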

  10. Confidence Intervals • Assume that our population parameter of interest is the population mean • What is the meaning of a 95% confidence interval in this situation? • Wrong: there is a 95% chance that the confidence interval contains the population mean • Any particular confidence interval either contains the population mean or it doesn’t, so a computed interval shouldn’t be interpreted as a probability statement about the mean • Correct: if samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then about 95% of these intervals should contain the population mean
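
The repeated-sampling interpretation can be checked with a small simulation. This is only a sketch under stated assumptions: normal data, a known standard deviation (so the interval is mean ± 1.96·σ/√n), and a hypothetical function name `coverage_of_95_ci` with made-up parameter values.

```python
import random


def coverage_of_95_ci(true_mean=10.0, sigma=2.0, n=25, trials=2000, seed=0):
    """Fraction of repeated-sample 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5  # known-sigma normal interval
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        # Each individual interval either covers the true mean or it does not.
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials


coverage = coverage_of_95_ci()
```

The returned fraction should be close to 0.95: the 95% describes the procedure over many samples, not any single interval.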

  11. But Before We Proceed… • Let’s talk some real probability and statistics, refresh your memories about a number of useful distributions!

  12. Chi-square Distribution • For $Z_1, Z_2, \ldots, Z_\nu$ independent, $Z_i \sim N(0,1)$ • Set $\chi^2 = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$ with $\nu > 0$ • The distribution of $\chi^2$ is chi-square with $\nu$ “degrees of freedom” (DF) • $\chi^2_\nu(x)$ denotes the value of the distribution function at x • $\chi^2_{p,\nu}$ denotes the p-th quantile of the distribution • Usually use look-up tables to find the values • In regression, $(N-p-1)\hat\sigma^2/\sigma^2$ has a $\chi^2$ distribution with $(N-p-1)$ DF

  13. Example – Chi-square Distribution

  14. t-distribution • Let Z and $\chi^2$ be independent random variables, and assume further that • Z has the standard normal distribution • $\chi^2$ has the chi-square distribution with $\nu$ degrees of freedom • Define t as: $t = Z / \sqrt{\chi^2/\nu}$ • The distribution of t is referred to as the t-distribution (Student’s t-distribution) with $\nu$ “degrees of freedom” • The value of the corresponding distribution function is denoted by $t_\nu(x)$ and its p-th quantile is denoted by $t_{p,\nu}$ $(0 \le p \le 1)$

  15. Example – t-distribution

  16. Tail of t-distribution vs. Normal

  17. F-distribution • Let $\chi_1^2, \chi_2^2$ be independent random variables, and assume further that • $\chi_1^2$ has the chi-square distribution with $\nu_1$ DF • $\chi_2^2$ has the chi-square distribution with $\nu_2$ DF • Define F as: $F = \dfrac{\chi_1^2/\nu_1}{\chi_2^2/\nu_2}$ • This is the F distribution with $\nu_1$ DF in the numerator and $\nu_2$ DF in the denominator • $F_{\nu_1,\nu_2}(x)$: distribution function at x; the p-th quantile is $F_{p,\nu_1,\nu_2}$ • F is always positive, so $F_{\nu_1,\nu_2}(-\infty) = F_{\nu_1,\nu_2}(0) = 0$ and the quantiles are > 0

  18. Example – F-distribution

  19. Confidence Interval (CI) • A CI is an estimated range of values which is likely to include an unknown population parameter • The estimated range is calculated from a given set of sample data • If independent samples are taken repeatedly from the same population, and a CI is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter

  20. Hypothesis Testing • $\chi^2$ distribution, $(N-p-1)$ DF: $(N-p-1)\hat\sigma^2 \sim \sigma^2\chi^2_{N-p-1}$ • $\hat\beta$ and $\hat\sigma^2$ are statistically independent • Hypothesis: Is $\beta_j = 0$? • Standardized coefficient or Z-score: $z_j = \hat\beta_j / (\hat\sigma\sqrt{v_j})$ • $v_j$: j-th diagonal element of $(X^TX)^{-1}$ • If $\beta_j = 0$, $z_j$ is distributed as $t_{N-p-1}$ (t distribution, $N-p-1$ DF) • Enter the t-table for $N-p-1$ DF, choose the required significance level ($\alpha$), and find the tabulated value • If $|z_j|$ exceeds the tabulated value, then $\beta_j$ is significantly different from zero and the hypothesis $\beta_j = 0$ is rejected!
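
As a worked illustration of the Z-score, the sketch below handles the special case of one predictor plus an intercept (p = 1), where the slope's diagonal entry of $(X^TX)^{-1}$ reduces to $1/S_{xx}$. The function name `slope_z_score` and the four data points are made up for the example.

```python
import math


def slope_z_score(x, y):
    """Z-score of the slope in simple linear regression (p = 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)  # unbiased estimate: N - p - 1 with p = 1
    v_slope = 1.0 / s_xx        # slope's diagonal element of (X^T X)^{-1}
    return slope / math.sqrt(sigma2_hat * v_slope)


z = slope_z_score([0, 1, 2, 3], [1, 3, 2, 4])  # slope 0.8, RSS 1.8, z about 1.89
```

With $N-p-1 = 2$ DF the two-sided 5% tabulated t value is 4.303, so a Z-score of about 1.89 would not reject $\beta_1 = 0$ for this tiny sample.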

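  21. Test the Significance of a Group • Simultaneously test the significance of a group of $\beta$'s • F-statistic: $F = \dfrac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$ • $\mathrm{RSS}_1$ is for the bigger model with $p_1+1$ parameters • $\mathrm{RSS}_0$ is for the nested smaller model with $p_0+1$ parameters, having $p_1-p_0$ parameters constrained to be zero • The F-statistic measures the change in RSS per additional parameter, normalized by an estimate of $\sigma^2$ • If the smaller model is correct, $F \sim F_{p_1-p_0,\,N-p_1-1}$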
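
The F-statistic for nested models is a one-line computation once both residual sums of squares are available. The helper below is a sketch with a hypothetical name (`f_statistic`); the numbers come from a made-up four-point data set, x = 0, 1, 2, 3 and y = 1, 3, 2, 4, comparing an intercept-only model with a straight-line fit.

```python
def f_statistic(rss0, rss1, p0, p1, n):
    """F-statistic for a small model (p0+1 parameters) nested in a big one (p1+1)."""
    if not p1 > p0:
        raise ValueError("the bigger model must have more parameters")
    numerator = (rss0 - rss1) / (p1 - p0)  # drop in RSS per added parameter
    denominator = rss1 / (n - p1 - 1)      # estimate of sigma^2 from the big model
    return numerator / denominator


# RSS0 = 5.0 is the sum of squares around the mean of y;
# RSS1 = 1.8 is the sum of squares around the fitted line.
f = f_statistic(rss0=5.0, rss1=1.8, p0=0, p1=1, n=4)  # (3.2 / 1) / (1.8 / 2)
```

The value would be compared with a quantile of $F_{1,2}$. When only one coefficient is added ($p_1 - p_0 = 1$), the F-statistic equals the square of that coefficient's Z-score.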

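  22. Confidence Intervals (CI) • Recall that $\hat\beta \sim N(\beta, (X^TX)^{-1}\sigma^2)$ • Isolate $\beta_j$ and get the $(1-2\alpha)$ CI for $\beta_j$: $(\hat\beta_j - z^{(1-\alpha)}\sqrt{v_j}\,\hat\sigma,\; \hat\beta_j + z^{(1-\alpha)}\sqrt{v_j}\,\hat\sigma)$ • Where $z^{(1-\alpha)}$ is the $1-\alpha$ percentile of the normal distribution • Similarly, obtain a CI for the entire vector $\beta$: $C_\beta = \{\beta : (\hat\beta - \beta)^T X^TX (\hat\beta - \beta) \le \hat\sigma^2\,\chi^{2\,(1-\alpha)}_{p+1}\}$ • Where $\chi^{2\,(1-\alpha)}_{\ell}$ is the $(1-\alpha)$ percentile of the $\chi^2$ distribution with $\ell$ DF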

  23. The Gauss-Markov Theorem • The least squares estimates of the parameters $\beta$ have the smallest variance among all linear unbiased estimates • Restriction to unbiased estimation is not always the best • Focus on a linear combination $\theta = a^T\beta$ of $\beta$ • For example, the prediction $f(x_0) = x_0^T\beta$ • For a fixed $X = X_0$, a linear estimator is a linear function of the responses, $\tilde\beta = CY$

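  24. The Gauss-Markov Theorem - Proof • Consider any linear estimator $\tilde\beta = CY$. Since the new estimator is unbiased: • $E(CY \mid X) = E(CX\beta + C\varepsilon \mid X) = CX\beta = \beta \Rightarrow CX = I$ • $\mathrm{Var}(\tilde\beta \mid X) = C\,\mathrm{Var}(\varepsilon \mid X)\,C^T = \sigma^2 CC^T$ • Need to show $\mathrm{Var}(\tilde\beta \mid X) \ge \mathrm{Var}(\hat\beta^{ols} \mid X)$ • Define $D = C - (X^TX)^{-1}X^T$, or equivalently $DY = \tilde\beta - \hat\beta^{ols}$ • $I = CX = (D + (X^TX)^{-1}X^T)X = DX + (X^TX)^{-1}X^TX = DX + I \Rightarrow DX = 0$ • $\mathrm{Var}(\tilde\beta \mid X) = \sigma^2 CC^T = \sigma^2 [D + (X^TX)^{-1}X^T][D^T + X(X^TX)^{-1}] = \sigma^2 [DD^T + (X^TX)^{-1}]$, where the cross terms vanish because $DX = 0$, and $DD^T$ is positive semidefinite • Hence $\mathrm{Var}(\tilde\beta \mid X) \ge \mathrm{Var}(\hat\beta^{ols} \mid X) = \sigma^2 (X^TX)^{-1}$ in the positive semidefinite ordering, with equality only for $D = 0$, where $\tilde\beta = \hat\beta^{ols}$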

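  25. Multiple Outputs • Multiple outputs $Y_1, Y_2, \ldots, Y_K$ from multiple inputs $X_0, X_1, X_2, \ldots, X_p$ • For each output: $Y_k = \beta_{0k} + \sum_{j=1}^{p} X_j\beta_{jk} + \varepsilon_k$ • In matrix form: $Y = XB + E$, with Y of size $N \times K$, X of size $N \times (p+1)$, B of size $(p+1) \times K$, and E of size $N \times K$ • The least squares estimates: $\hat B = (X^TX)^{-1}X^TY$ • Note: If the errors were correlated as $\mathrm{Cov}(\varepsilon) = \Sigma$,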

  26. The Bias Variance Trade-off • A good measure of the quality of an estimator $f^*(x)$ is the mean squared error (MSE) • Let $f_0(x)$ be the true value of $f(x)$ at the point x. Then: • $\mathrm{MSE}[f^*(x)] = E[f^*(x) - f_0(x)]^2$. This can be written as: • $\mathrm{MSE}[f^*(x)] = \mathrm{Var}[f^*(x)] + [E f^*(x) - f_0(x)]^2$ • This is variance plus squared bias • Typically, when bias is low, variance is high and vice versa. Choosing estimators often involves a trade-off between bias and variance
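
The decomposition can be verified numerically. The sketch below is an illustration under assumptions not taken from the slides: the quantity being estimated is a normal mean, and the estimator is the sample mean multiplied by a hypothetical shrinkage factor `shrink` (1.0 gives the unbiased sample mean).

```python
import random


def mse_decomposition(shrink, true_mean=0.5, sigma=1.0, n=10, trials=5000, seed=1):
    """Simulate the estimator shrink * sample_mean; return (mse, variance, bias)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        estimates.append(shrink * sum(sample) / n)
    mean_est = sum(estimates) / trials
    variance = sum((e - mean_est) ** 2 for e in estimates) / trials
    bias = mean_est - true_mean
    mse = sum((e - true_mean) ** 2 for e in estimates) / trials
    return mse, variance, bias


mse_plain, var_plain, bias_plain = mse_decomposition(shrink=1.0)
mse_shrunk, var_shrunk, bias_shrunk = mse_decomposition(shrink=0.9)
```

In both runs the simulated MSE equals variance plus squared bias up to round-off. With these particular settings the shrunken estimator trades a small bias for a lower variance and ends up with the smaller MSE; for a true mean far from zero the comparison can go the other way.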

  27. Linear Methods for Regression • If the linear model is correct for a given problem, then the OLS prediction is unbiased, and has the lowest variance among all linear unbiased estimators • But there can be (and often are) biased estimators with smaller MSE • Generally, by regularizing (shrinking, dampening, controlling) the estimator in some way, its variance will be reduced; if the corresponding increase in bias is small, this will be worthwhile • Examples of regularization: subset selection (forward, backward, all subsets), ridge regression, the lasso • In reality, models are almost never correct, so there is an additional model bias between the closest member of the linear model class and the truth

  28. Model Selection • Often we prefer a restricted estimate because of its reduced estimation variance

  29. Subset Selection and Coefficient Shrinkage • There are two reasons why we are often not satisfied with the least squares estimates: • Prediction accuracy: it can sometimes be improved by trading a little bias to reduce the variance of the predicted values • Interpretation: determine a smaller subset of predictors that exhibit the strongest effects. To get the “big picture”, sacrifice some of the small details • Let us describe a number of approaches to variable selection and coefficient shrinkage…

  30. Subset Selection - Linear Model • Best subset selection: finds the subset of size k that gives the smallest RSS • Leaps and bounds procedure (Furnival and Wilson, 1974) • Sort of an exhaustive search, all possible combinations! • Let’s take a look again at the prostate cancer example: • Correlation between the level of prostate specific antigen (PSA) and clinical predictors • We use log(PSA), lpsa, as the response variable • Denote: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45

  31. Prostate Cancer Data

  32. All the Subset Models for PC Example

  33. Subset Selection - Linear Model • Exhaustive search (best subset selection) becomes infeasible for large p • Forward stepwise selection: start from the intercept and sequentially add variables • With estimates $\hat\beta$ for the current k inputs and $\tilde\beta$ for k+1 inputs, $F = \dfrac{\mathrm{RSS}(\hat\beta) - \mathrm{RSS}(\tilde\beta)}{\mathrm{RSS}(\tilde\beta)/(N-k-2)}$ • Typically, add the predictor producing the largest value of F, stopping when no predictor produces an F-ratio greater than the 90th-95th percentile of the $F_{1,N-k-2}$ distribution

  34. Subset Selection - Linear Model • Backward stepwise selection: start with the full model and sequentially delete predictors • Again, typically uses the F-ratio as the stopping criterion • Drops the predictor producing the smallest F at each stage, stopping when each of the predictors in the model produces a value of F greater than the 90th-95th percentile when dropped • There are also hybrid strategies that simultaneously consider both forward and backward selection • Note: this is only a local search, because we are not considering all the combinations, just the sequential combination of variables

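  35. K-fold Cross Validation • Primary method for estimating a tuning parameter $\lambda$ (such as subset size) • Divide data into K roughly equal parts (typically K=5 or 10) • For each k = 1, 2, ..., K, fit the model with parameter $\lambda$ to the other K−1 parts, giving $\hat\beta^{-k}(\lambda)$, and compute its error in predicting the k-th part: $E_k(\lambda) = \sum_{i \in k\text{-th part}} (y_i - x_i^T\hat\beta^{-k}(\lambda))^2$ • This gives the cross-validation error $CV(\lambda) = \frac{1}{K}\sum_{k=1}^{K} E_k(\lambda)$ • Do this for many values of $\lambda$, and choose the $\lambda$ that makes $CV(\lambda)$ smallest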

  36. K-fold Cross Validation Notations • In our variable subsets example, $\lambda$ is the subset size • $\hat\beta^{-k}(\lambda)$ are the coefficients for the best subset of size $\lambda$, found from the training set that leaves out the k-th part of the data • $E_k(\lambda)$ is the estimated test error for this best subset • From the K cross-validation training sets, the K test error estimates are averaged to give $CV(\lambda)$ • Note that different subsets of size $\lambda$ will (probably) be found from each of the K cross-validation training sets. This doesn’t matter: the focus is on the subset size, not the actual subset
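
The K-fold recipe can be sketched as follows. To keep the example short, the tuning parameter here is not a subset size but the penalty of a hypothetical one-predictor, no-intercept shrinkage fit, $\hat\beta(\lambda) = \sum x_i y_i / (\sum x_i^2 + \lambda)$. The function names, the simulated data, and the grid of values are all assumptions made for the illustration.

```python
import random


def shrunken_slope(xs, ys, lam):
    """One-predictor, no-intercept fit: sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def k_fold_cv(xs, ys, lambdas, k=5, seed=0):
    """Return {lam: CV(lam)}, the average over the K folds of held-out squared error."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # K roughly equal parts
    cv = {}
    for lam in lambdas:
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            beta = shrunken_slope([xs[i] for i in train], [ys[i] for i in train], lam)
            # E_k(lam): squared prediction error on the k-th part.
            fold_errors.append(sum((ys[i] - beta * xs[i]) ** 2 for i in held_out))
        cv[lam] = sum(fold_errors) / k
    return cv


rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [1.5 * x + rng.gauss(0, 0.5) for x in xs]
cv_errors = k_fold_cv(xs, ys, lambdas=[0.0, 0.1, 1.0, 10.0, 100.0])
best_lambda = min(cv_errors, key=cv_errors.get)
```

The same folds are reused for every value of the tuning parameter, so the CV curve compares like with like. `best_lambda` is the grid value with the smallest cross-validation error.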

  37. The Bootstrap Approach • The bootstrap works by sampling N times with replacement from the training set to form a “bootstrap” data set. Then the model is estimated on the bootstrap data set, and predictions are made for the original training set • This process is repeated many times and the results are averaged • The bootstrap is most useful for estimating standard errors of predictions • Can also use modified versions of the bootstrap to estimate prediction error • Sometimes produces better estimates than cross-validation (topic for current research)
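
A minimal sketch of the bootstrap for the standard error of a prediction, assuming a straight-line model refitted on each resample of (x, y) pairs. The helper names `fit_line` and `bootstrap_prediction_se`, the simulated data, and the number of resamples are illustrative choices, not taken from the lecture.

```python
import random


def fit_line(pairs):
    """Least squares intercept and slope for a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    if s_xx == 0:
        return y_bar, 0.0  # degenerate resample: fall back to the mean
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / s_xx
    return y_bar - slope * x_bar, slope


def bootstrap_prediction_se(pairs, x0, n_boot=500, seed=0):
    """Standard deviation of the prediction at x0 over bootstrap refits."""
    rng = random.Random(seed)
    n = len(pairs)
    preds = []
    for _ in range(n_boot):
        # Sample N times with replacement from the training set.
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        intercept, slope = fit_line(resample)
        preds.append(intercept + slope * x0)
    mean_pred = sum(preds) / n_boot
    return (sum((p - mean_pred) ** 2 for p in preds) / (n_boot - 1)) ** 0.5


data_rng = random.Random(7)
pairs = [(i / 4, 2.0 + 0.5 * (i / 4) + data_rng.gauss(0, 0.3)) for i in range(20)]
se_center = bootstrap_prediction_se(pairs, x0=2.375)  # at the mean of the x values
se_edge = bootstrap_prediction_se(pairs, x0=10.0)     # far outside the data range
```

The standard error at the far extrapolation point comes out larger than at the center of the data, which is the behavior expected of a fitted line.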

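  38. Shrinkage Methods – Ridge Reg. • Because subset selection is a discrete process, it often produces a high variance model • Shrinkage methods are more continuous • Ridge regression: the ridge coefficients minimize a penalized RSS: $\hat\beta^{ridge} = \arg\min_\beta \big\{ \sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \big\}$ • Equivalently, minimize the RSS subject to $\sum_{j=1}^{p}\beta_j^2 \le s$ • The parameter $\lambda > 0$ penalizes each $\beta_j$ in proportion to its squared size $\beta_j^2$. The solution is $\hat\beta_\lambda = (X^TX + \lambda I)^{-1}X^Ty$ • where I is the identity matrix. This is a biased estimator that for some value of $\lambda > 0$ may have smaller mean squared error than the least squares estimator. Note $\lambda = 0$ gives the least squares estimator; if $\lambda \to \infty$ then $\hat\beta_\lambda \to 0$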

  39. Prostate Cancer Example (Cont’d)

  40. Shrinkage Methods – The Lasso • The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the outcome y • The lasso is defined by: $\hat\beta^{lasso} = \arg\min_\beta \sum_{i=1}^{N}\big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p}|\beta_j| \le t$ • Notice that the ridge penalty $\sum_j \beta_j^2$ is replaced by $\sum_j |\beta_j|$ • This makes the solution nonlinear in y, and a quadratic programming algorithm is used to compute it • Because of the nature of the constraint, if t is chosen small enough then the lasso will set some coefficients exactly to zero. Thus the lasso does a kind of continuous model selection

  41. Shrinkage Methods • The parameter t should be adaptively chosen to minimize an estimate of expected prediction error, using say cross-validation • Ridge vs Lasso: if the inputs are orthogonal, ridge multiplies the least squares coefficients by a constant < 1, while the lasso translates them towards zero by a constant, truncating at zero
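
The ridge-versus-lasso contrast for orthogonal inputs can be made concrete. The sketch assumes orthonormal input columns, a ridge penalty `lam` giving the scaling factor 1/(1 + lam), and a hypothetical lasso threshold `gamma` that stands in for the constant determined by the bound t. The function names and coefficient values are made up.

```python
def ridge_orthonormal(beta_ls, lam):
    """Ridge with orthonormal inputs: multiply every coefficient by 1 / (1 + lam)."""
    return [b / (1.0 + lam) for b in beta_ls]


def lasso_orthonormal(beta_ls, gamma):
    """Lasso with orthonormal inputs: move each coefficient towards zero by gamma,
    truncating at zero (soft thresholding)."""
    out = []
    for b in beta_ls:
        shrunk = max(abs(b) - gamma, 0.0)
        out.append(shrunk if b >= 0 else -shrunk)
    return out


beta_ls = [3.0, -0.5, 1.2]  # hypothetical least squares coefficients
ridge = ridge_orthonormal(beta_ls, lam=1.0)    # every coefficient is halved
lasso = lasso_orthonormal(beta_ls, gamma=1.0)  # the small coefficient becomes zero
```

Ridge keeps every coefficient non-zero, while the lasso removes the coefficient whose magnitude is below the threshold. That is the sense in which the lasso performs a kind of variable selection.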

  42. Prostate Cancer Example (Cont’d)

  43. Lagrange Multiplier -- Concept • Solving an optimization problem • Finding a min or max, e.g., min f(P) • Not closed form, subject to constraints (e.g. g(P)=0) • If there are no constraints, you would set grad(f(P))=0 • With constraints, define $F(P,\lambda) = f(P) - \lambda g(P)$ • Now, set the gradient of F to 0: grad($F(P,\lambda)$)=0 • One more dimension: setting the partial derivative w.r.t. $\lambda$ to 0 gives g(P)=0 • Thus, grad(F)=0 automatically satisfies our constraint • Add a new Lagrange multiplier ($\lambda$) for each constraint
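
A small worked example of the recipe, with an objective and constraint chosen purely for illustration (they do not come from the lecture): minimize the squared distance to the origin over the line x + y = 1.

```latex
\begin{aligned}
&\min_{x,y}\; f(x,y) = x^2 + y^2 \quad \text{subject to} \quad g(x,y) = x + y - 1 = 0 \\
&F(x,y,\lambda) = x^2 + y^2 - \lambda\,(x + y - 1) \\
&\frac{\partial F}{\partial x} = 2x - \lambda = 0, \qquad
 \frac{\partial F}{\partial y} = 2y - \lambda = 0, \qquad
 \frac{\partial F}{\partial \lambda} = -(x + y - 1) = 0 \\
&\Rightarrow\; x = y = \frac{\lambda}{2}, \quad x + y = 1
 \;\Rightarrow\; x = y = \tfrac{1}{2}, \quad \lambda = 1, \quad f = \tfrac{1}{2}
\end{aligned}
```

The third equation is just the constraint, recovered from the derivative with respect to the multiplier.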

  44. Lagrange Multiplier – Geometrical Example • In 2D, assume the objective function (OF) is: min f(x,y) • Constraint: g(x,y) - c = 0 • Traverse along the curve g(x,y) - c = 0 • It may intersect the contours $f(x,y) = d_n$ at many points • Only where a contour of f touches the curve g = c tangentially, without crossing it, do we find the constrained min • For more info, see Lagrange Multipliers without Permanent Scarring (by Dan Klein) • Figure (from Wikipedia): drawn in green is the locus of points satisfying the constraint g(x,y) = c. Drawn in blue are contours of f. Arrows represent the gradient, which points in a direction normal to the contour.

  45. Principal Component Analysis (PCA) Intro – eigenvalues, eigenvectors • A square matrix can be interpreted as a linear transformation – rotation, stretching, shearing, etc. • Eigenvectors of a transformation are vectors that are either left intact or simply multiplied by a scalar factor after the transformation • An eigenvector's eigenvalue is the scale factor by which it has been multiplied • E.g., matrix A, eigenvector u, eigenvalue $\lambda$ • By definition, $Au = \lambda u$ • To find them, set $Au - \lambda u = 0$: solve $\det(A - \lambda I) = 0$ for $\lambda$, then solve the system of linear equations $(A - \lambda I)u = 0$ for u
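
For a symmetric 2×2 matrix the characteristic equation is a quadratic, so the whole procedure fits in a few lines. This sketch is limited to that special case, and the function name `symmetric_2x2_eigen` and the example matrix are illustrative assumptions.

```python
import math


def symmetric_2x2_eigen(a, b, d):
    """Eigenvalues and unit eigenvectors of [[a, b], [b, d]], largest eigenvalue first."""
    if abs(b) < 1e-12:
        # Already diagonal: the coordinate axes are the eigenvectors.
        return sorted([(a, (1.0, 0.0)), (d, (0.0, 1.0))], reverse=True)
    trace, det = a + d, a * d - b * b
    # det(A - lam*I) = lam^2 - trace*lam + det = 0
    disc = math.sqrt(trace * trace - 4.0 * det)  # real because A is symmetric
    pairs = []
    for lam in ((trace + disc) / 2.0, (trace - disc) / 2.0):
        # (A - lam*I) u = 0 is solved by u = (b, lam - a) when b != 0.
        u = (b, lam - a)
        norm = math.hypot(u[0], u[1])
        pairs.append((lam, (u[0] / norm, u[1] / norm)))
    return pairs


# The matrix [[2, 1], [1, 2]] has eigenvalues 3 and 1.
eigen_pairs = symmetric_2x2_eigen(2.0, 1.0, 2.0)
```

Each returned pair satisfies $Au = \lambda u$. For the example matrix the eigenvectors point along (1, 1) and (1, −1), normalized to unit length.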

  46. PCA Intro – Singular Value Decomposition (SVD) • The covariance matrix of a matrix X of size $N \times p$, $p \le N$, is a $p \times p$ (square) matrix whose elements are the covariances between column pairs of X • Singular value decomposition: matrix X can be written as $X = UDV^T$ • U and V are $N \times p$ and $p \times p$ matrices with orthonormal columns: • Columns of U span the column space of X, $U^TU = I_{p \times p}$ • Columns of V span the row space, $V^TV = I_{p \times p}$ • D is a $p \times p$ diagonal matrix, with diagonal entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$ called the singular values of X • The SVD always exists, and is unique up to signs (when the singular values are distinct)

  47. PCA Intro – SVD, correlation matrix • SVD finds the eigenvalues and eigenvectors of $XX^T$ and $X^TX$ • The eigenvectors of $X^TX$ make up the columns of V, the eigenvectors of $XX^T$ make up the columns of U • $X^TX = VD^TU^TUDV^T = VD^2V^T$, so the columns of V are the eigenvectors of the covariance matrix (for centered X, $X^TX/N$) • The singular values in D (diagonal entries of D) are the square roots of the eigenvalues of $X^TX$ and are arranged in descending order • The singular values are always real numbers. If the matrix X is a real matrix, then U and V are also real

  48. PCA and dimension reduction • The eigenvectors $v_j$ are the principal component directions of X • The first PC, $z_1 = Xv_1 = u_1d_1$, has the largest variance among all normalized directions, $\mathrm{var}(z_1) = d_1^2/N$ • Subsequent $z_j$'s have maximum variance $d_j^2/N$ subject to the constraint of being orthogonal to earlier ones • If we have a large number of correlated inputs, we can produce a small number of linear combinations $Z_m$, $m = 1, \ldots, M$ ($M < p$) of the original inputs $X_j$, then use the $Z_m$'s as regression inputs
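
The first principal component direction can be approximated without a full SVD by power iteration on $X^TX$. This is a swapped-in technique for illustration, not the method named on the slides. It assumes the starting vector is not orthogonal to $v_1$ and that $d_1 > d_2$, and the function name and toy data are made up.

```python
import math


def first_principal_direction(rows, iterations=200):
    """Power iteration on X^T X for column-centered X; returns (v1, d1^2 / N)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]  # center the columns
    XtX = [[sum(x[a] * x[b] for x in X) for b in range(p)] for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(XtX[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # the current vector is mapped to zero (e.g. all inputs constant)
        v = [c / norm for c in w]
    z1 = [sum(x[j] * v[j] for j in range(p)) for x in X]  # z1 = X v1
    return v, sum(z * z for z in z1) / n


# Two strongly correlated toy inputs.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
v1, var_z1 = first_principal_direction(rows)
```

For these inputs the direction is close to the diagonal (1, 1)/√2. The variance of $z_1$ is larger than the variance of either original column, as the maximum-variance property requires.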

  49. Methods Using Derived Input Directions • Principal Component Regression:

  50. PCA Regression • Let $D_q$ be D, with all but the first q diagonal elements set to zero. Then $UD_qV^T$ is the best rank-q approximation to X • Write $v_{(j)}$ for the ordered principal components, ordered from largest to smallest value of $d_j^2$ • Then principal component regression computes the derived input columns $z_j = Xv_{(j)}$ and then regresses y on $z_1, z_2, \ldots, z_J$ for some $J \le p$ • Since the $z_j$'s are orthogonal, this regression is just a sum of univariate regressions: $\hat y^{pcr} = \bar y + \sum_{j=1}^{J}\hat\theta_j z_j$, where $\hat\theta_j = \langle z_j, y\rangle / \langle z_j, z_j\rangle$ is the univariate regression coefficient of y on $z_j$

