Prediction with Regression: An Introduction to Linear Regression and Shrinkage Methods
Ehsan Khoddam Mohammadi
Outline • Prediction • Estimation • Bias-Variance Trade-Off • Regression • Ordinary Least Squares • Ridge Regression • Lasso
Prediction: definition • Set of inputs: $X_1, X_2, \ldots, X_p$ • The output: $Y$ • We want to analyze the relationship between these variables (interpretation) • We want to estimate the output based on the inputs (prediction)
Prediction: the same concept in different fields • Machine learning: supervised learning • Finance: forecasting • Politics: prediction • Estimation theory: function approximation
Regression: why? • Performs well and is accurate for both interpretation and prediction • Strong foundations in mathematics, statistics, and computation • Many modern and advanced methods are based on regression, or are variants of it • New regression methods are still being invented; Nobel prizes are still awarded to work built on regression; it remains a hot topic • It can be formulated as an optimization problem: that is why I chose it for this class, since it is more closely related to the subject of the class than any other prediction method I know
Regression: classification • Linear Regression • Least squares • Best-subset selection (regression with feature selection) • Stepwise regression • Shrinkage (regularization) for regression: • Ridge Regression • Lasso Regression • Non-linear Regression • Numerical data fitting • ANN • Discrete regression • Logistic Regression
Before proceeding with regression, let's investigate some statistical properties of ESTIMATION
Estimating the parameter • Assume that we have i.i.d. (independent and identically distributed) samples $X_1, \ldots, X_n$ from an unknown distribution. • Estimating their p.d.f. is too hard in many situations; instead, we want to estimate a parameter $\theta$. • $\hat{\theta}$ is an estimate of $\theta$; it is a function of $X_1, \ldots, X_n$.
Bias-Variance dilemma • Definition 1: The bias of an estimator is $\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$. If it is 0, the estimator is said to be unbiased. • Definition 2: The mean squared error (MSE) of an estimator is $\mathrm{MSE}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big]$. • An interesting equation: $\mathrm{MSE}(\hat{\theta}) = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta})$. What does it really mean? (see the derivation below)
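To unpack what this "interesting equation" says, here is a short worked derivation of the bias-variance decomposition (standard textbook material, not taken from the slides):

```latex
\begin{align*}
\mathrm{MSE}(\hat{\theta})
  &= E\big[(\hat{\theta} - \theta)^2\big] \\
  &= E\big[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2\big] \\
  &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]
     + 2\,(E[\hat{\theta}] - \theta)\,E\big[\hat{\theta} - E[\hat{\theta}]\big]
     + (E[\hat{\theta}] - \theta)^2 \\
  &= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2
\end{align*}
```

The cross term vanishes because $E\big[\hat{\theta} - E[\hat{\theta}]\big] = 0$. So a biased estimator can have a lower MSE than an unbiased one if its variance is sufficiently smaller; this is exactly the trade-off that ridge and lasso exploit.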
[Image from “More on Regularization and (Generalized) Ridge Operators”, Takane (2007)]
Test and training error as a function of model complexity. [Image from “The Elements of Statistical Learning”, Second Edition, Hastie et al. (2008)]
Linear Regression: model • Set of training data: $(x_1, y_1), \ldots, (x_N, y_N)$, with $x_i = (x_{i1}, \ldots, x_{ip})$ • Linear regression model: $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$ • The real-valued coefficients $\beta$ need to be estimated
Linear Regression: least squares • Most popular estimation method • Minimize the Residual Sum of Squares: $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ • How do we minimize it?
Linear Regression: least squares • Let's rewrite the last formula in matrix form: $\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$ • A quadratic function in $\beta$ (not the point here, but we shall use this property later) • Differentiating with respect to $\beta$ and setting it to zero: $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$ • Unique solution: $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$; under which assumptions can we obtain a unique solution?
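A minimal numpy sketch of this closed-form fit (my own illustration; the synthetic data and variable names are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N samples, p features, known coefficients plus noise
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.1, size=N)

# Prepend a column of ones so the intercept beta_0 is estimated too
X1 = np.hstack([np.ones((N, 1)), X])

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta_hat)  # close to [1.0, 2.0, -1.0, 0.5]
```

In practice `np.linalg.lstsq` (or a QR decomposition) is usually preferred over forming $\mathbf{X}^T\mathbf{X}$ explicitly, for numerical stability.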
Linear Regression: least squares, assumptions • $\mathbf{X}$ should be of full column rank; hence $\mathbf{X}^T\mathbf{X}$ is positive definite and invertible, and a unique solution can be obtained • In other words, the feature vectors should be linearly independent (not perfectly correlated) • What happens to $\beta$ if $\mathbf{X}$ is not of full rank, or if some features are highly correlated? (see the sketch below)
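A small experiment (again my own, on assumed synthetic data) hinting at the answer: with two nearly collinear features, the individual OLS coefficients become wildly unstable from sample to sample, even though their sum stays well determined.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50

coefs = []
for _ in range(200):
    x1 = rng.normal(size=N)
    x2 = x1 + rng.normal(scale=1e-3, size=N)  # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = x1 + rng.normal(scale=0.1, size=N)    # true signal only depends on x1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs.append(beta_hat)

coefs = np.array(coefs)
# Individual coefficients vary wildly across resamples...
print(coefs.std(axis=0))        # large standard deviations
# ...even though their sum stays close to 1
print(coefs.sum(axis=1).std())  # much smaller
```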
Linear Regression: least squares, flaws • Low bias but high variance: $\mathrm{Var}(\hat{\beta}) = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2$, and one can estimate the noise variance by $\hat{\sigma}^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$ • It's hard to find a meaningful relation if we have too many features • What would you recommend to solve these problems?
Linear Regression: improvements • Model selection (feature selection): • Best-subset selection (leaps and bounds, Furnival and Wilson (1974)) • Stepwise selection (greedy approach, sub-optimal but preferred) • mRMR (using a mutual information criterion for selection) • Shrinkage methods: impose a constraint on $\beta$ • Ridge Regression • Lasso Regression
Ridge Regression • When you have a problem to be solved in statistics, there is always a Russian statistician waiting to solve it for you. (Be careful! I guarantee this only in statistics; they will betray you in any other situation.) • Andrey Nikolayevich Tychonoff provided Tikhonov (!!!) regularization for ill-posed problems, also known as Ridge Regression in statistics.
Ridge Regression: first attempt • Remember this? $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ • Tychonoff added a term to avoid singularity and changed the formula above to: $\hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ • Now the inverse can be computed even if $\mathbf{X}^T\mathbf{X}$ is not of full rank; also, $\beta$ is still a linear function of $\mathbf{y}$ • Everything starts from the formula above, but now we have a better point of view than Tychonoff; let's take a look!
Ridge Regression: better motivation • To avoid the high variance of $\beta$ we simply impose a constraint on it; our problem is now a constrained optimization problem: $\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t$
• An even better representation, using the Lagrangian form: $\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}$ • Or, again even better, in matrix form: $\mathrm{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\,\beta^T\beta$; we can differentiate this formula and set it to zero • Could you guess the solution? Could you find a relation between $\hat{\beta}$ and $\hat{\beta}^{\mathrm{ridge}}$ when the inputs are orthonormal?
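A minimal sketch of the ridge solution $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, reusing the nearly collinear data from the earlier sketch (centering is assumed so the intercept is not penalized; the function name and data are my own, not from the slides):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution (X^T X + lam*I)^{-1} X^T y; assumes X and y are centered."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
N = 50
x1 = rng.normal(size=N)
x2 = x1 + rng.normal(scale=1e-3, size=N)   # nearly collinear features
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=N)

X_c = X - X.mean(axis=0)                   # center so no intercept term is needed
y_c = y - y.mean()

print(ridge_fit(X_c, y_c, lam=0.0))        # (near-)OLS: large, offsetting coefficients
print(ridge_fit(X_c, y_c, lam=1.0))        # ridge: both shrunk toward roughly 0.5
```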
LASSO Least Absolute Shrinkage and Selection Operator
LASSO • We impose an L1-norm constraint on our regression: $\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t$ • No closed form exists; the solution is a non-linear function of $\mathbf{y}$ • How could you solve the above problem? (hint: ask Mr. Iranmehr!) One possible approach is sketched below.
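One standard way to solve it, sketched below under my own assumptions (centered data, a hand-rolled coordinate descent with soft thresholding; in practice a library such as scikit-learn's `Lasso` would be used):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1.
    Assumes X and y are centered (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every feature's contribution except feature j's
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(3)
N, p = 100, 5
X = rng.normal(size=(N, p))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # sparse true model
y = X @ beta_true + rng.normal(scale=0.1, size=N)

X_c, y_c = X - X.mean(axis=0), y - y.mean()
print(lasso_cd(X_c, y_c, lam=5.0))  # irrelevant features get exact zeros; the rest are slightly shrunk
```

Note how the L1 penalty drives some coefficients exactly to zero, which is what makes the lasso useful for sparse model selection.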
LASSO: why? • One of the first uses of the L1-norm; showed significant results in signal processing and denoising [Chen et al. (1998)] • The basis for LAR (a newer method for regression, not covered here) [Efron et al. (2004)] • Good for sparse model selection where p > N [Donoho (2006b)]
REFERENCES • “The Elements of Statistical Learning”, Second Edition, Hastie et al., 2008 • “More on Regularization and (Generalized) Ridge Operators”, Takane, 2007 • “Bias, Variance and MSE of Estimators”, Guy Lebanon, 2004 • “Least Squares Optimization with L1-Norm Regularization”, Mark Schmidt, 2005 • “Regularization: Ridge Regression and the LASSO”, Tibshirani, 2006