Lecture 4, 5 – Linear Regression


  1. Lecture 4, 5 – Linear Regression Rice ELEC 697 Farinaz Koushanfar Fall 2006

  2. Summary • The simple linear regression model • Confidence intervals • Multiple linear regression • Model selection and shrinkage methods • Homework 0

  3. Preliminaries • Data streams X and Y, forming the measurement tuples (x1,y1), …, (xN,yN) • xi is the predictor (regressor, covariate, feature, independent variable) • yi is the response (dependent variable, outcome) • Denote the regression function by μ(x) = E(Y | x) • The linear regression model assumes a specific linear form: μ(x) = β0 + β1x

  4. Linear Regression - More • The β's are unknown parameters (coefficients) • X's can come from different sources: • Quantitative inputs • Transformations of quantitative inputs • Basis expansions, e.g. X2 = X1², X3 = X1³ • Numeric or “dummy” coding of the levels of qualitative inputs • Interaction between variables, e.g. X3 = X1·X2 • No matter the source of X, the model is linear in the parameters

  5. Fitting by Least Squares • For (x1,y1), …, (xN,yN), estimate the β's! • xi = (xi1, xi2, …, xip)ᵀ is a vector of feature measures for the i-th case • Least squares: pick the β's that minimize the residual sum of squares RSS(β) = Σi (yi − β0 − Σj xij βj)²

  6. How to Find Least Square Fits? • X is the N×(p+1) input matrix (with a leading column of 1's for the intercept), y is the N-vector of outputs • RSS(β) = (y − Xβ)ᵀ(y − Xβ) • If X has full column rank, then XᵀX is positive definite; setting the derivative to zero gives β̂ = (XᵀX)⁻¹Xᵀy, and the fitted values are ŷ = Xβ̂ = Hy, where H = X(XᵀX)⁻¹Xᵀ is the hat matrix
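
A minimal numpy sketch of this closed-form solution on synthetic data (the sizes, coefficients, and noise level below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1), first column = intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                  # made-up "true" coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y                       # same as X @ beta_hat
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares RSS(beta_hat)
print(beta_hat, rss)
```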

  7. Geometrical Representation • Least squares estimates viewed in ℝᴺ: • Minimize RSS(β) = ||y − Xβ||²; at the minimum the residual vector y − ŷ is orthogonal to the subspace spanned by the columns of X • H = X(XᵀX)⁻¹Xᵀ (hat matrix, a.k.a. projection matrix) projects y onto that subspace

  8. Non-full Rank Case! • What if the columns of X are linearly dependent? • XᵀX is singular, so the β's are not uniquely defined • ŷ is still the projection of y onto the column space of X • There is just more than one way to express that projection in terms of the columns of X! • Resolve by dropping redundant columns of X • Most software packages detect redundant columns and automatically implement some strategy to remove them
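
A small sketch of the rank-deficient case on synthetic data: the duplicated column below makes XᵀX singular, yet the fitted values ŷ are the same whether we use the minimum-norm least squares solution or drop the redundant column (all names and numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x1 = rng.normal(size=N)
x2 = 2.0 * x1                      # linearly dependent on x1: redundant information
X = np.column_stack([np.ones(N), x1, x2])
y = 3.0 + 1.5 * x1 + rng.normal(scale=0.3, size=N)

print(np.linalg.matrix_rank(X))    # 2, not 3: X^T X is singular

# The coefficients are not unique, but the projection (fitted values) is:
beta_minnorm, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimum-norm solution
y_hat = X @ beta_minnorm
beta_drop, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)   # drop the redundant column
print(np.allclose(y_hat, X[:, :2] @ beta_drop))            # True: same y_hat
```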

  9. Sampling Properties of β̂ • Assume the yi's are uncorrelated with constant variance Var(yi) = σ², and the xi's are fixed (non-random) • We use an unbiased estimator for σ²: σ̂² = (1/(N−p−1)) Σi (yi − ŷi)² • Assume Y is linear in X1, …, Xp and the deviations are additive and Gaussian, ε ~ N(0, σ²) • We can show that Var(β̂) = (XᵀX)⁻¹σ², and hence β̂ ~ N(β, (XᵀX)⁻¹σ²)

  10. Confidence Intervals • Assume that our population parameter of interest is the population mean • What is the meaning of a 95% confidence interval in this situation? • ✗ (wrong) there is a 95% chance that the confidence interval contains the population mean • But any particular confidence interval either contains the population mean, or it doesn’t. The confidence interval shouldn’t be interpreted as a probability. • ✓ (correct) If samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then 95% of these intervals should contain the population mean.
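
A quick simulation sketch of the repeated-sampling interpretation, assuming a hypothetical normal population with known σ to keep the interval simple; roughly 95% of the intervals cover the true mean:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 10.0, 2.0, 30        # made-up population mean/sd and sample size
z = 1.96                            # 97.5% normal quantile for a 95% CI

covered, trials = 0, 10_000
for _ in range(trials):
    sample = rng.normal(mu, sigma, size=n)
    half_width = z * sigma / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += (lo <= mu <= hi)     # does this particular interval contain mu?

print(covered / trials)             # close to 0.95: coverage is over repeated samples
```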

  11. But Before We Proceed… • Let’s talk some real probability and statistics and refresh your memory about a number of useful distributions!

  12. Chi-square Distribution • For Z1, Z2, …, Zν independent, each Zi ~ N(0,1) • Set χ² = Z1² + Z2² + … + Zν², with ν > 0 • The distribution of χ² is chi-square with ν “degrees of freedom” (DF) • χ²ν(x) denotes the value of the distribution function at x • χ²p,ν denotes the p-th quantile of the distribution • Usually use look-up tables to find the values • In our regression setting, (N−p−1)σ̂²/σ² has a χ² distribution with (N−p−1) DF
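
In practice the look-up table can be replaced by scipy; a tiny sketch using scipy.stats.chi2 for the distribution function and quantiles (ν = 5 is an arbitrary choice):

```python
from scipy.stats import chi2

nu = 5                              # degrees of freedom
print(chi2.cdf(11.07, df=nu))       # distribution function chi2_nu(x) at x = 11.07, ~0.95
print(chi2.ppf(0.95, df=nu))        # 0.95 quantile chi2_{0.95, nu}, ~11.07
```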

  13. Example – Chi-square Distribution

  14. t-distribution • Let Z and χ² be independent random variables; assume further that • Z has the standard normal distribution • χ² has the chi-square distribution with ν degrees of freedom • Define t as: t = Z / √(χ²/ν) • The distribution of t is referred to as the t-distribution (Student’s t-distribution) with ν “degrees of freedom” • The value of the corresponding distribution function is denoted by tν(x) and its p-th quantile is denoted by tp,ν (0 ≤ p ≤ 1)
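
A sketch of this definition by simulation: draw Z and an independent χ² with ν DF, form t = Z/√(χ²/ν), and compare the empirical 0.95 quantile with scipy.stats.t (ν = 8 is arbitrary):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(3)
nu = 8                                            # degrees of freedom
Z = rng.standard_normal(200_000)
chi2_nu = rng.chisquare(nu, size=200_000)         # independent chi-square with nu DF
t_samples = Z / np.sqrt(chi2_nu / nu)             # definition of the t statistic

print(np.quantile(t_samples, 0.95))               # empirical t_{0.95, nu}
print(t_dist.ppf(0.95, df=nu))                    # exact quantile, ~1.86
```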

  15. Example – t-distribution

  16. Tail of t-distribution vs. Normal

  17. F-distribution • Let χ1², χ2² be independent random variables; assume further that • χ1² has the chi-square distribution with ν1 DF • χ2² has the chi-square distribution with ν2 DF • Define F as: F = (χ1²/ν1) / (χ2²/ν2) • This is the F distribution with ν1 DF in the numerator and ν2 DF in the denominator • Fν1,ν2(x): distribution function at x; the p-th quantile is Fp,ν1,ν2 • F is always positive and Fν1,ν2(−∞) = Fν1,ν2(0) = 0, so all quantiles are > 0
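
The same kind of simulation sketch for F: a ratio of two independent scaled chi-squares, checked against scipy.stats.f (ν1 = 3, ν2 = 20 are arbitrary):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
nu1, nu2 = 3, 20
chi2_1 = rng.chisquare(nu1, size=200_000)
chi2_2 = rng.chisquare(nu2, size=200_000)
F = (chi2_1 / nu1) / (chi2_2 / nu2)               # definition of the F statistic

print(np.quantile(F, 0.95))                       # empirical F_{0.95, nu1, nu2}
print(f_dist.ppf(0.95, dfn=nu1, dfd=nu2))         # exact quantile, ~3.10
```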

  18. Example – F-distribution

  19. Confidence Interval (CI) • A CI is an estimated range of values which is likely to include an unknown population parameter • The estimated range is calculated from a given set of sample data • If independent samples are taken repeatedly from the same population, and a CI is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter.

  20. Hypothesis Testing • (N−p−1)σ̂²/σ² has a χ² distribution with (N−p−1) DF • β̂ and σ̂² are statistically independent • Hypothesis: is βj = 0? • Standardized coefficient or Z-score: zj = β̂j / (σ̂ √vj) • vj: j-th diagonal element of (XᵀX)⁻¹ • If βj = 0, zj is distributed as t_{N−p−1} (t distribution, N−p−1 DF) • Enter the t-table for N−p−1 DF, choose the significance level (α) required, and find the tabulated value • If |zj| exceeds the tabulated value, then βj is significantly different from zero and the hypothesis is rejected!
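
A minimal sketch of this test on synthetic data, computing σ̂², the vj, and the Z-scores by hand with numpy/scipy (the design and coefficients below are made up; the third true coefficient is exactly zero):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(5)
N, p = 80, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p predictors
beta_true = np.array([0.5, 1.0, 0.0])                         # last coefficient really is zero
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)          # unbiased estimate of sigma^2

v = np.diag(XtX_inv)                              # v_j: j-th diagonal element of (X^T X)^{-1}
z = beta_hat / np.sqrt(sigma2_hat * v)            # Z-scores
crit = t_dist.ppf(0.975, df=N - p - 1)            # two-sided test at alpha = 0.05
print(z, np.abs(z) > crit)                        # reject H0: beta_j = 0 where |z_j| > crit
```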

  21. Test the Significance of a Group • Simultaneously test the significance of a group of β's • F-statistic: F = [(RSS0 − RSS1)/(p1 − p0)] / [RSS1/(N − p1 − 1)] • RSS1 is for the bigger model with p1 + 1 parameters • RSS0 is for the nested smaller model with p0 + 1 parameters, i.e. with p1 − p0 parameters constrained to be zero • The F-statistic measures the change in RSS per additional parameter, normalized by an estimate of σ² • If the smaller model is correct, F ~ F_{p1−p0, N−p1−1}
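
A sketch of the nested-model F-test on synthetic data: the bigger model adds two predictors whose true coefficients are zero, so the F-statistic should look like an F_{p1−p0, N−p1−1} draw (the helper rss below is just an illustrative least-squares fit):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(6)
N = 120
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, 4))])   # intercept + 4 predictors
y = X_full @ np.array([1.0, 2.0, -1.0, 0.0, 0.0]) + rng.normal(size=N)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

p1, p0 = 4, 2                       # bigger model: 4 predictors; nested model: first 2
rss1 = rss(X_full, y)               # RSS of bigger model (p1 + 1 parameters)
rss0 = rss(X_full[:, :p0 + 1], y)   # RSS of nested model (p0 + 1 parameters)

F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
p_value = f_dist.sf(F, dfn=p1 - p0, dfd=N - p1 - 1)
print(F, p_value)                   # large p-value: the two extra coefficients look like zero
```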

  22. Confidence Intervals (CI) • Recall that β̂j ~ N(βj, vj σ²) • Isolate βj and get the (1 − 2α) CI for βj: (β̂j − z^(1−α) √vj σ̂, β̂j + z^(1−α) √vj σ̂) • where z^(1−α) is the (1 − α) percentile of the standard normal distribution • Similarly, obtain a confidence set for the entire vector β: Cβ = {β : (β̂ − β)ᵀ XᵀX (β̂ − β) ≤ σ̂² χ²ℓ^(1−α)} • where χ²ℓ^(1−α) is the (1 − α) percentile of the χ² distribution with ℓ = p + 1 DF

  23. The Gauss-Markov Theorem • The least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates • Restriction to unbiased estimation is not always the best • Focus on a linear combination θ = aᵀβ of β • For example, the prediction f(x0) = x0ᵀβ • For a fixed X = X0, the estimate is a linear function θ̂ = Cy of the response y

  24. The Gauss-Markov Theorem - Proof • Let β̃ = Cy be any other linear estimator; since the new estimator is unbiased: • E(Cy|X) = E(CXβ + Cε|X) = CXβ = β for all β ⟹ CX = I • Var(β̃|X) = Var(Cy|X) = C Var(y|X) Cᵀ = σ²CCᵀ • Need to show Var(β̃|X) ≥ Var(β̂ols|X) • Define D = C − (XᵀX)⁻¹Xᵀ, or Dy = β̃ − β̂ols • I = CX = (D + (XᵀX)⁻¹Xᵀ)X = DX + (XᵀX)⁻¹XᵀX ⟹ DX = 0 • Var(β̃|X) = σ²CCᵀ = σ²[D + (XᵀX)⁻¹Xᵀ][Dᵀ + X(XᵀX)⁻¹] = σ²[DDᵀ + (XᵀX)⁻¹], and DDᵀ is positive semidefinite ⟹ • Var(β̃|X) ≥ Var(β̂ols|X), with equality only for D = 0, where β̃ = β̂ols

  25. Multiple Outputs • Multiple outputs Y1, Y2, …, YK from multiple inputs X0, X1, X2, …, Xp • For each output: Yk = Xβk + εk • In matrix form Y = XB + E (Y is N×K, X is N×(p+1), B is (p+1)×K, E is N×K) • The least squares estimates: B̂ = (XᵀX)⁻¹XᵀY • Note: if the errors are correlated across outputs, Cov(ε) = Σ, the criterion becomes a weighted (multivariate) one, but with a common Σ for every observation the solution is unchanged

  26. The Bias Variance Trade-off • A good measure of the quality of an estimator f*(x) is the MSE • Let f0(x) be the true value of f(x) at the point x. Then: • MSE[f*(x)] = E[f*(x) − f0(x)]². This can be written as: • MSE[f*(x)] = Var[f*(x)] + [E(f*(x) − f0(x))]² • This is variance plus squared bias. Typically, when bias is low, variance is high and vice versa; choosing estimators often involves a trade-off between bias and variance

  27. Linear Methods for Regression • If the linear model is correct for a given problem, then the OLS prediction is unbiased and has the lowest variance among all linear unbiased estimators • But there can be (and often are) biased estimators with smaller MSE • Generally, by regularizing (shrinking, dampening, controlling) the estimator in some way, its variance will be reduced; if the corresponding increase in bias is small, this will be worthwhile • Examples of regularization: subset selection (forward, backward, all subsets), ridge regression, the lasso • In reality, models are almost never correct, so there is an additional model bias between the closest member of the linear model class and the truth

  28. Model Selection • Often we prefer a restricted estimate because of its reduced estimation variance

  29. Subset Selection and Coefficient Shrinkage • There are two reasons why we are often not satisfied with the least squares estimates: • Prediction accuracy: it can sometimes be improved by trading a little bias for a reduction in the variance of the predicted values • Interpretation: determine a smaller subset of predictors that exhibit the strongest effects. To get the “big picture”, sacrifice some of the small details • Let us describe a number of approaches to variable selection and coefficient shrinkage…

  30. Subset Selection - Linear Model • Best subset selection: finds the subset of size k that gives the smallest RSS • Leaps and bounds procedure (Furnival and Wilson, 1974) • essentially an exhaustive search over all possible combinations! • Let’s take a look again at the prostate cancer example: • Correlation between the level of prostate specific antigen (PSA) and clinical predictors • We use log(PSA), denoted lpsa, as the response variable • Model: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45

  31. Prostate Cancer Data

  32. All the Subset Models for PC Example

  33. Subset Selection - Linear Model • Exhaustive search (best subset selection) becomes infeasible for large p • Forward stepwise selection: start from the intercept and sequentially add variables • If the current model has k inputs with residual sum of squares RSS1, and adding one more input gives RSS2, form the F-ratio F = (RSS1 − RSS2) / (RSS2 / (N − k − 2)) • Typically, add the predictor producing the largest value of F, stopping when no predictor can produce an F-ratio greater than the 90–95th percentile of the F_{1,N−k−2} distribution (see the sketch below)
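
A minimal forward-stepwise sketch on synthetic data using the F-ratio stopping rule above; the helper rss and the 95th-percentile threshold are illustrative choices, and the search is greedy, examining only sequential additions:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(7)
N, p = 100, 6
X = rng.normal(size=(N, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=N)   # only predictors 0 and 3 matter

def rss(cols):
    A = np.column_stack([np.ones(N)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

selected = []                                    # start from the intercept-only model
while len(selected) < p:
    k = len(selected)
    rss_cur = rss(selected)
    # F-ratio for adding each remaining predictor
    candidates = [(j, (rss_cur - rss(selected + [j])) / (rss(selected + [j]) / (N - k - 2)))
                  for j in range(p) if j not in selected]
    best_j, best_F = max(candidates, key=lambda c: c[1])
    if best_F <= f_dist.ppf(0.95, dfn=1, dfd=N - k - 2):   # stop: no significant improvement
        break
    selected.append(best_j)

print(selected)                                  # typically [0, 3]
```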

  34. Subset Selection - Linear Model • Backward stepwise selection: start with the full model and sequentially delete predictors • Again, typically uses the F-ratio as the stopping criterion • Drops the predictor producing the smallest F at each stage, stopping when each of the predictors remaining in the model produces a value of F greater than the 90–95th percentile when dropped • There are also hybrid strategies that consider both forward and backward moves at each step • Note: this is only a local search, because we are not considering all the combinations, just sequential additions or deletions of variables

  35. K-fold Cross Validation • Primary method for estimating a tuning parameter λ (such as subset size) • Divide the data into K roughly equal parts (typically K = 5 or 10) • For each k = 1, 2, …, K, fit the model with parameter λ to the other K − 1 parts, giving β̂^(−k)(λ), and compute its error in predicting the k-th part: Ek(λ) = Σ_{i in k-th part} (yi − xiᵀβ̂^(−k)(λ))² • This gives the cross-validation error CV(λ) = (1/K) Σk Ek(λ) • Do this for many values of λ and choose the λ that makes CV(λ) smallest (see the sketch below)
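
A sketch of K-fold cross-validation over a grid of λ values; for concreteness the tuned model here is ridge regression rather than best-subsets, and the helpers ridge_fit and cv_error, the fold split, and the λ grid are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
N, p, K = 100, 8, 5
X = rng.normal(size=(N, p))
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0], dtype=float) + rng.normal(size=N)

def ridge_fit(X, y, lam):
    # beta = (X^T X + lam I)^{-1} X^T y (no intercept, purely for illustration)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

folds = np.array_split(rng.permutation(N), K)    # the same K-way split reused for every lambda

def cv_error(lam):
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = ridge_fit(X[train], y[train], lam)              # fit on the other K-1 parts
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))  # error on the k-th part
    return np.mean(errs)                                       # CV(lambda)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = {lam: cv_error(lam) for lam in lambdas}
print(min(cv, key=cv.get), cv)                   # lambda with the smallest CV error
```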

  36. K-fold Cross Validation Notations • In our variable-subsets example, λ is the subset size • β̂^(−k)(λ) are the coefficients for the best subset of size λ, found from the training set that leaves out the k-th part of the data • Ek(λ) is the estimated test error for this best subset • From the K cross-validation training sets, the K test error estimates are averaged to give CV(λ) • Note that different subsets of size λ will (probably) be found from each of the K cross-validation training sets. Doesn’t matter: the focus is on subset size, not the actual subset.

  37. The Bootstrap Approach • The bootstrap works by sampling N times with replacement from the training set to form a “bootstrap” data set. The model is then estimated on the bootstrap data set, and predictions are made for the original training set • This process is repeated many times and the results are averaged • The bootstrap is most useful for estimating standard errors of predictions • Can also use modified versions of the bootstrap to estimate prediction error • Sometimes produces better estimates than cross-validation (a topic of current research)
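
A sketch of the basic bootstrap for the standard error of a regression slope on synthetic data (B = 1000 resamples; the helper slope is an illustrative name):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 80
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

def slope(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

B = 1000
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)        # sample N cases with replacement
    boot_slopes[b] = slope(X[idx], y[idx])  # refit the model on the bootstrap data set

print(boot_slopes.std(ddof=1))              # bootstrap estimate of the slope's standard error
```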

  38. Shrinkage Methods – Ridge Reg. • Because subset selection is a discrete process, it often produces a high-variance model • Shrinkage methods are more continuous • Ridge regression: the ridge coefficients minimize a penalized RSS: β̂ridge = argmin_β { Σi (yi − β0 − Σj xij βj)² + λ Σj βj² } • Equivalently: minimize the RSS subject to Σj βj² ≤ t • The parameter λ > 0 penalizes each βj proportionally to its size βj². The solution is β̂ridge = (XᵀX + λI)⁻¹Xᵀy • where I is the identity matrix. This is a biased estimator that for some value of λ > 0 may have smaller mean squared error than the least squares estimator. Note that λ = 0 gives the least squares estimator; if λ → ∞ then β̂ridge → 0
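
A numpy sketch of the closed-form ridge solution on synthetic, standardized inputs (centering y and standardizing X so the intercept stays out of the penalty is the usual convention; the λ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(10)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=N)

# Center y and standardize X so the intercept can be left out of the penalty
Xs = (X - X.mean(0)) / X.std(0)
yc = y - y.mean()

def ridge(lam):
    # beta_ridge = (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

print(ridge(0.0))        # lambda = 0: ordinary least squares
print(ridge(10.0))       # coefficients shrunk toward zero
print(ridge(1e6))        # very large lambda: coefficients essentially 0
```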

  39. Prostate Cancer Example (Cont’d)

  40. Shrinkage Methods – The Lasso • The lasso is a shrinkage method like ridge, but it acts in a nonlinear manner on the outcome y • The lasso is defined by: β̂lasso = argmin_β Σi (yi − β0 − Σj xij βj)² subject to Σj |βj| ≤ t • Notice that the ridge penalty Σj βj² is replaced by Σj |βj| • This makes the solutions nonlinear in y, and a quadratic programming algorithm is used to compute them • Because of the nature of the constraint, if t is chosen small enough then the lasso will set some coefficients exactly to zero. Thus the lasso does a kind of continuous model selection (see the sketch below)
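
A sketch using scikit-learn's Lasso (which solves the equivalent penalized form with parameter alpha) to show the continuous-selection effect on synthetic data: as the penalty grows, more coefficients become exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(11)
N, p = 100, 6
X = rng.normal(size=(N, p))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0, 0.0]) + rng.normal(size=N)

for alpha in [0.01, 0.1, 0.5, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coef, 2))   # larger penalty -> more coefficients set exactly to zero
```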

  41. Shrinkage Methods • The parameter t should be adaptively chosen to minimize an estimate of expected prediction error, using, say, cross-validation • Ridge vs. lasso: if the inputs are orthogonal, ridge multiplies the least squares coefficients by a constant < 1, while the lasso translates them towards zero by a constant, truncating at zero

  42. Prostate Cancer Example (Cont’d)

  43. Lagrange Multiplier -- Concept • Solving an optimization problem • finding a min or max, e.g., min f(P) • No closed form, subject to constraints (e.g. g(P) = 0) • If there were no constraints, you would set grad(f(P)) = 0 • With constraints, define F(P, λ) = f(P) − λ g(P) • Now, set the gradient of F to 0: grad(F(P, λ)) = 0 • One more dimension: the partial derivative w.r.t. λ set to 0 gives back g(P) = 0 • Thus, grad(F) = 0 automatically satisfies our constraint • Add a new Lagrange multiplier (λ) for each constraint (a small worked example follows)
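
A small worked example, not from the slides: minimize f(x, y) = x² + y² subject to g(x, y) = x + y − 1 = 0:

```latex
\begin{aligned}
F(x, y, \lambda) &= x^2 + y^2 - \lambda\,(x + y - 1) \\
\partial F/\partial x &= 2x - \lambda = 0 \\
\partial F/\partial y &= 2y - \lambda = 0 \\
\partial F/\partial \lambda &= -(x + y - 1) = 0 \\
&\Rightarrow\; x = y = \tfrac{1}{2},\quad \lambda = 1,\quad f_{\min} = \tfrac{1}{2}.
\end{aligned}
```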

  44. Lagrange Multiplier – Geometrical Example • In 2D, assume the objective function is min f(x,y) • Constraint: g(x,y) − c = 0 • Traverse along g(x,y) − c = 0 • The path may intersect the contours f(x,y) = dn at many points • Only where a contour of f touches g = c tangentially, without crossing it, do we find a constrained min • For more info, see “Lagrange Multipliers without Permanent Scarring” (by Dan Klein). From Wikipedia: drawn in green is the locus of points satisfying the constraint g(x,y) = c. Drawn in blue are contours of f. Arrows represent the gradient, which points in a direction normal to the contour.

  45. Principal Component Analysis (PCA) Intro – eigenvalues, eigenvectors • A square matrix can be interpreted as a transformation – rotation, stretching, reflection, etc. • Eigenvectors of a transformation are vectors that are either left intact or simply multiplied by a scalar factor after the transformation • An eigenvector's eigenvalue is the scale factor by which it has been multiplied • E.g., matrix A, eigenvector u, eigenvalue λ • By definition, Au = λu • To find them, set Au − λu = 0: solve det(A − λI) = 0 for λ, then a system of linear equations for u

  46. PCA Intro – Singular Value Decomposition (SVD) • The covariance matrix of a matrix X (N×p, p ≤ N) is a p×p (square) matrix whose elements are the covariances between pairs of columns of X • Singular value decomposition: the matrix X can be written as X = UDVᵀ • U and V are N×p and p×p orthogonal matrices: • the columns of U span the column space of X, UᵀU = I (p×p) • the columns of V span the row space, VᵀV = I (p×p) • D is a p×p diagonal matrix, with diagonal entries d1 ≥ d2 ≥ … ≥ dp ≥ 0 called the singular values of X • The SVD always exists, and is unique up to signs
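
A numpy sketch verifying these properties on a random matrix (thin SVD, so U is N×p; the matrix itself is arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(12)
N, p = 50, 4
X = rng.normal(size=(N, p))

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: U is N x p, d holds singular values
D = np.diag(d)

print(np.allclose(X, U @ D @ Vt))                  # X = U D V^T
print(np.allclose(U.T @ U, np.eye(p)))             # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(p)))           # columns of V are orthonormal
print(np.all(np.diff(d) <= 0), np.all(d >= 0))     # d1 >= d2 >= ... >= dp >= 0
```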

  47. PCA Intro – SVD, correlation matrix • The SVD yields the eigenvalues and eigenvectors of XXᵀ and XᵀX • The eigenvectors of XᵀX make up the columns of V; the eigenvectors of XXᵀ make up the columns of U • XᵀX = VDᵀUᵀUDVᵀ = VD²Vᵀ, so V holds the eigenvectors of the covariance matrix • The singular values in D (its diagonal entries) are the square roots of the eigenvalues of XᵀX and are arranged in descending order • The singular values are always real numbers. If the matrix X is a real matrix, then U and V are also real

  48. PCA and dimension reduction • The eigenvectors vj are the principal component directions of X • The first principal component z1 = Xv1 = u1d1 has the largest sample variance among all normalized linear combinations of the columns of X: Var(z1) = d1²/N • Subsequent zj's have maximal variance dj²/N subject to the constraint of being orthogonal to the earlier ones • If we have a large number of correlated inputs, we can produce a small number of linear combinations Zm, m = 1, …, M (M < p) of the original inputs Xj, then use the Zm's as the regression inputs

  49. Methods Using Derived Input Directions • Principal Component Regression:

  50. PCA Regression • Let Dq be D with all but the first q diagonal elements set to zero; this keeps only the leading q principal components • Write v(j) for the ordered principal component directions, ordered from largest to smallest value of dj² • Then principal component regression computes the derived input columns zj = Xv(j) and regresses y on z1, z2, …, zJ for some J ≤ p • Since the zj's are orthogonal, this regression is just a sum of univariate regressions: ŷ = ȳ1 + Σj θ̂j zj, where θ̂j = ⟨zj, y⟩/⟨zj, zj⟩ is the univariate regression coefficient of y on zj
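
A sketch of principal components regression on synthetic data: take the SVD of the centered inputs, form the first J derived columns zj, and fit by univariate regressions since the zj are orthogonal (the sizes, coefficients, and J below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(13)
N, p, J = 100, 6, 3
X = rng.normal(size=(N, p))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=N)      # correlated inputs, where PCR helps
y = X @ np.array([2.0, 1.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(size=N)

Xc = X - X.mean(0)                                 # center the inputs
yc = y - y.mean()
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

Z = Xc @ Vt.T[:, :J]                               # z_j = X v_j = u_j d_j, first J components
# Orthogonal z_j: the fit is a sum of univariate regressions theta_j = <z_j, y> / <z_j, z_j>
theta = (Z.T @ yc) / np.sum(Z ** 2, axis=0)
y_hat = y.mean() + Z @ theta

print(np.round(theta, 3))
print(np.mean((y - y_hat) ** 2))                   # training MSE of the J-component PCR fit
```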
