
Linear Methods for Regression (2)


Presentation Transcript


  1. Linear Methods for Regression (2) Yi Zhang, Kevyn Collins-Thompson Advanced Statistical Learning Seminar 11741 fall 2002

  2. What We Have Discussed • Lecture 1 (Kevyn): Unrestricted models • Linear Regression: Least-squares estimate • Confidence of Parameter Estimates • Gauss-Markov Theorem • Multiple Regression in terms of Univariates

  3. Outline • Subset selection (feature selection) • Coefficient Shrinkage (smoothing) • Ridge Regression • Lasso • Using derived input direction • Principal component regression • Partial Least Squares • Compare subset selection with shrinkage • Multiple outcome shrinkage and selection

  4. Subset Selection and Shrinkage: Motivation • Bias Variance Trade Off • Goal: choose model to minimize error • Method: sacrifice a little bit of bias to reduce the variance • Better interpretation: find the strongest factors from the input space

  5. Subset Selection • Produces a model that is interpretable and possibly has lower prediction error • Forces some dimensions of x to zero, thus probably decreasing variance at the cost of a small increase in bias

  6. Subset Selection Methods • Find the globally optimal model: best subset regression (too computationally expensive) • Greedy search for a good model (practical): • Forward stepwise selection • Begin with the empty set and sequentially add predictors • Backward stepwise selection • Begin with the full model and sequentially delete predictors • Stepwise selection: combination of forward and backward moves
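
As an illustration of the greedy forward search, here is a minimal sketch in Python/numpy; it is not from the slides, and the function name and the pure RSS-based selection criterion are my assumptions (they anticipate the criterion on the next slide):

```python
# Minimal sketch of forward stepwise selection with plain numpy (illustrative only).
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily add the predictor that reduces RSS the most, up to k features."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(k):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])   # intercept + candidate subset
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares fit
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```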

  7. Adding/Dropping Feature Criteria • Goal: minimize RSS(β) • F-test • Tests whether the change in RSS between the two nested models (with and without the candidate feature) is significant relative to the residual variance • Forward selection: use the F-test to find the feature that decreases RSS(β) most, and add it to the feature set • Backward selection: use the F-test to find the feature that increases RSS(β) least, and delete it from the feature set
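
For reference, the standard statistic for comparing a smaller nested model (RSS_0, p_0 predictors plus intercept) with a larger one (RSS_1, p_1 predictors) on N observations is F = [(RSS_0 - RSS_1)/(p_1 - p_0)] / [RSS_1/(N - p_1 - 1)]. A small helper (names are mine, not the slides'):

```python
import numpy as np

def f_statistic(rss_small, rss_big, p_small, p_big, n):
    """F statistic for nested least-squares models with p_small < p_big predictors
    (plus an intercept) fit to n observations."""
    numerator = (rss_small - rss_big) / (p_big - p_small)
    denominator = rss_big / (n - p_big - 1)
    return numerator / denominator
```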

  8. Shrinkage • Intuition: a continuous version of subset selection • Goal: impose a penalty on model complexity to get lower variance • Two examples: • Ridge regression • Lasso

  9. Ridge Regression • Penalize by the sum of squares of the parameters: \hat{\beta}^{ridge} = \arg\min_\beta \left\{ \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\} • Or, equivalently, minimize \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 subject to \sum_{j=1}^p \beta_j^2 \le s
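
Because the ridge criterion is quadratic, the estimate has the closed form \hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y. A minimal numpy sketch, assuming centered inputs so the intercept is left unpenalized (the function name is mine):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate beta = (X^T X + lam*I)^{-1} X^T y on centered inputs;
    the intercept is simply mean(y) and is not penalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y.mean(), beta
```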

  10. Understanding of Ridge Regression • Find the orthogonal principal components (basis vectors), and then apply a greater amount of shrinkage to the basis vectors with small variance • Assumption: y varies most in the directions of high input variance • Intuitive example: stop words in text classification, if we assume no covariance between words • Relates to MAP estimation: if \beta \sim N(0, \tau^2 I) and y \mid X, \beta \sim N(X\beta, \sigma^2 I), then the negative log-posterior of \beta is the ridge criterion with \lambda = \sigma^2/\tau^2, so the ridge estimate is the posterior mode (and mean)
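
Concretely, writing the SVD of the centered input matrix as X = U D V^T, ridge shrinks the component of y along each principal direction u_j by the factor d_j^2 / (d_j^2 + \lambda), so low-variance directions (small d_j) are shrunk most. A small sketch to compute those factors (not from the slides):

```python
import numpy as np

def ridge_shrinkage_factors(X, lam):
    """Shrinkage ridge applies along each principal direction: d_j^2 / (d_j^2 + lam)."""
    Xc = X - X.mean(axis=0)
    d = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    return d ** 2 / (d ** 2 + lam)
```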

  11. Lasso • Penalize by the absolute values of the parameters: \hat{\beta}^{lasso} = \arg\min_\beta \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 subject to \sum_{j=1}^p |\beta_j| \le t (equivalently, add the penalty \lambda \sum_{j=1}^p |\beta_j|)
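
The lasso has no closed form; one common solver (my choice here, not something the slides specify) is coordinate descent with soft-thresholding. A minimal sketch for the objective (1/2)||y - X\beta||^2 + \lambda||\beta||_1 on centered data:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)*||y - X*beta||^2 + lam*||beta||_1 (inputs centered)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = np.zeros(Xc.shape[1])
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(len(beta)):
            r = yc - Xc @ beta + Xc[:, j] * beta[j]      # partial residual excluding feature j
            beta[j] = soft_threshold(Xc[:, j] @ r, lam) / col_sq[j]
    return beta
```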

  12. Using Derived Input Directions • Goal: use linear combinations of the inputs as the inputs to the regression • Usually the derived input directions are orthogonal to each other • Principal component regression • Get the directions v_m from the SVD of X • Use z_m = X v_m as inputs in the regression
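
A minimal principal component regression sketch in numpy, assuming centered inputs and keeping the first m principal directions (the function name is mine):

```python
import numpy as np

def pcr_fit(X, y, m):
    """Principal component regression: regress y on the derived inputs z_k = X v_k, k = 1..m."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt are principal directions
    Z = Xc @ Vt[:m].T                                    # derived inputs, mutually orthogonal
    theta = (Z.T @ yc) / (d[:m] ** 2)                    # separate univariate regressions
    beta = Vt[:m].T @ theta                              # coefficients in the original coordinates
    return y.mean(), beta
```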

  13. More Questions • (Jian's question) Q: Chapter 3 talks about how to balance the trade-off between bias and variance, but all of that work assumes a known model form (linear in most cases). What can we do to balance them if we do NOT know the distribution, which is more common in reality? How about breaking the problem into local or kernel regressions, then computing and combining them? (Then, how many bins? How do we choose the roughness penalty?) A: Try a linear model first, because it is simple and common; it can represent many relationships and has lower variance (fewer parameters than a complicated model). You can also try kernel methods without model assumptions, i.e. non-parametric models such as KNN. We will cover more of this in later classes. NN: a semi-parametric model…

  14. Partial Least Squares • Idea: find directions that have high variance and high correlation with y • In the construction of each z_m, the inputs are weighted by the strength of their univariate effect on y • Step 1: z_1 = \sum_j \hat{\varphi}_{1j} x_j, with weights \hat{\varphi}_{1j} = \langle x_j, y \rangle • Step 2: regress y on z_1 • Step 3: orthogonalize x_1, x_2, ..., x_p with respect to z_1 • Repeat from Step 1 to obtain z_1, z_2, ..., z_M
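
A minimal numpy sketch of these steps for a single outcome, assuming standardized inputs (the function name is mine; the slides give only the steps):

```python
import numpy as np

def pls_fit(X, y, M):
    """Partial least squares (single outcome): build M derived directions, each weighted
    by the univariate effect of the (orthogonalized) inputs on y."""
    Xm = (X - X.mean(axis=0)) / X.std(axis=0)    # standardized inputs
    yc = y - y.mean()
    y_hat = np.full(len(y), y.mean())
    for _ in range(M):
        phi = Xm.T @ yc                          # Step 1: univariate weights <x_j, y>
        z = Xm @ phi                             #         derived direction z_m
        theta = (z @ yc) / (z @ z)               # Step 2: regress y on z_m
        y_hat = y_hat + theta * z
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))  # Step 3: orthogonalize inputs w.r.t. z_m
    return y_hat
```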

  15. PCR vs. PLS vs. Ridge Regression • PCR discards the smallest-eigenvalue components (low-variance directions); the mth component direction v_m solves: \max_\alpha Var(X\alpha) subject to ||\alpha|| = 1 and \alpha^T S v_\ell = 0 for \ell = 1, ..., m-1 (S is the sample covariance of the inputs) • PLS shrinks the low-variance directions but can inflate high-variance directions; the mth direction \hat{\varphi}_m solves: \max_\alpha Corr^2(y, X\alpha)\, Var(X\alpha) subject to ||\alpha|| = 1 and \alpha^T S \hat{\varphi}_\ell = 0 for \ell = 1, ..., m-1 • Ridge regression shrinks the coefficients of the principal components; low-variance directions are shrunk more

  16. Compare Selection and Shrinkage [Figure: coefficient profiles comparing Best Subset, Lasso, Ridge, PCR, PLS, and Least Squares]

  17. Multiple Outcome Shrinkage and Selection • Option 1: ignore the correlations among the different outcomes and apply single-outcome shrinkage and selection to each outcome separately • Option 2: exploit the correlations among the different outcomes

  18. Canonical Correlation Analysis • Derive input and outcome directions using canonical correlation analysis (CCA), which finds the sequence of pairs (v_m, w_m) that maximize Corr^2(Y w_m, X v_m)
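
A minimal CCA sketch via the SVD of the whitened cross-covariance matrix, one standard way to compute the canonical pairs (the helper names and the full-rank covariance assumption are mine):

```python
import numpy as np

def cca_directions(X, Y):
    """Canonical directions (v_m, w_m) maximizing Corr(X v, Y w),
    from the SVD of Sxx^{-1/2} Sxy Syy^{-1/2} (assumes full-rank covariances)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    U, corrs, Vt = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy), full_matrices=False)
    V = inv_sqrt(Sxx) @ U        # columns: input-side directions v_m
    W = inv_sqrt(Syy) @ Vt.T     # columns: outcome-side directions w_m
    return V, W, corrs           # corrs are the canonical correlations
```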

  19. Reduced Rank Regression • Regression in derived directions • Step 1: Map y into the derived directions • Step 2: Do the regression in the derived space • Step 3: Map back to y's original space

  20. Summary • Bias-variance trade-off: • Subset selection (feature selection, discrete) • Coefficient shrinkage (smoothing) • Using derived input directions • Multiple outcome shrinkage and selection • Most of the algorithms are sensitive to the scaling of the inputs • Standardize the inputs, e.g. normalize the input directions to the same variance
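
For completeness, the standardization mentioned above is just centering each input and scaling it to unit variance, e.g.:

```python
import numpy as np

def standardize(X):
    """Center each input column to mean zero and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```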
