Machine Learning (BE Computer 2015 PAT) A.Y. 2018-19 SEM-II Prepared by Mr. Dhomse G.P.
Unit-3 Regression Syllabus • Linear regression: Linear models, A bi-dimensional example, Linear regression and higher dimensionality, Ridge, Lasso and Elastic Net, Robust regression with random sample consensus, Polynomial regression, Isotonic regression. Logistic regression: Linear classification, Logistic regression, Implementation and optimizations, Stochastic gradient descent algorithms, Finding the optimal hyper-parameters through grid search, Classification metrics, ROC curve.
Linear Regression • Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). • More specifically, that y can be calculated from a linear combination of the input variables (x). • When there is a single input variable (x), the method is referred to as simple linear regression. • When there are multiple input variables, literature from statistics often refers to the method as multiple linear regression
Simple Linear Regression • For a simple regression problem (a single x and a single y), the form of the model is: y = b0 + b1 * x1, where b0 is the constant (intercept), b1 is the coefficient, x1 is the independent variable (IV) and y is the dependent variable (DV).
Simple Linear Regression • EQUATION PLOTTING: SALARY = b0 + b1 * EXPERIENCE (i.e. y = b0 + b1 * x1). Plotting Salary (₹) against Experience (years), the slope b1 answers "how much will salary increase for +1 year of experience?" (e.g. +10K per +1 yr).
Simple Linear Regression • ANALYZING THE DATASET: Salary is the dependent variable (DV), Experience is the independent variable (IV).
Simple Linear Regression • LET'S CODE! • Prepare your data preprocessing template • Import the dataset • No need to handle missing data • Split into training and test sets • Feature scaling can be kept, but it is least preferred here • Correlate salaries with experience • Later, carry out prediction • Verify the predicted values • Prediction on the TEST SET (see the sketch below)
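A minimal sketch of the steps above, assuming a hypothetical Salary_Data.csv file with YearsExperience and Salary columns (the file name and column names are illustrative, not part of the original slides):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import the dataset (file and column names are assumed for illustration)
dataset = pd.read_csv('Salary_Data.csv')
X = dataset[['YearsExperience']].values   # independent variable (IV)
y = dataset['Salary'].values              # dependent variable (DV)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Fit simple linear regression on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict on the test set and inspect the learned line
y_pred = regressor.predict(X_test)
print(regressor.intercept_, regressor.coef_)  # b0 and b1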
Example-2 • Let's make this concrete with an example. Imagine we are predicting weight (y) from height (x). Our linear regression model representation for this problem would be: y = B0 + B1 * x1, or weight = B0 + B1 * height
Where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a learning technique to find a good set of coefficient values. Once found, we can plug in different height values to predict the weight. • For example, let's use B0 = 0.1 and B1 = 0.5. Let's plug them in and calculate the weight (in kilograms) for a person with a height of 182 centimeters: weight = 0.1 + 0.5 * 182 = 91.1 • You can see that the above equation could be plotted as a line in two dimensions. The B0 is our starting point regardless of what height we have. • We can run through a bunch of heights from 100 to 250 centimeters, plug them into the equation and get weight values, creating our line (see the sketch below).
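A short sketch of that calculation, using the example coefficients B0 = 0.1 and B1 = 0.5 from above:

b0, b1 = 0.1, 0.5
print(b0 + b1 * 182)               # 91.1 kg for a height of 182 cm

# Sweep heights from 100 to 250 cm to trace the regression line
for height in range(100, 251, 50):
    print(height, b0 + b1 * height)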
Multiple Linear Regression • y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn, where b0 is the constant, b1 ... bn are the coefficients, x1 ... xn are the independent variables (IVs) and y is the dependent variable (DV).
Multiple linear regression analysis makes several key assumptions: • Multivariate Normality – Multiple regression assumes that the residuals are normally distributed. • No Multicollinearity – Multiple regression assumes that the independent variables are not highly correlated with each other. This assumption is tested using Variance Inflation Factor (VIF) values. • Homoscedasticity – This assumption states that the variance of the error terms is similar across the values of the independent variables. A plot of standardized residuals versus predicted values can show whether points are equally distributed across all values of the independent variables.
Multiple linear regression requires at least two independent variables, which can be nominal, ordinal, or interval/ratio level variables. • A rule of thumb for the sample size is that regression analysis requires at least 20 cases per independent variable in the analysis. • First, multiple linear regression requires the relationship between the independent and dependent variables to be linear. • The linearity assumption can best be tested with scatterplots. The following two examples depict a curvilinear relationship (left) and a linear relationship (right).
Second, the multiple linear regression analysis requires that the errors between observed and predicted values (i.e., the residuals of the regression) should be normally distributed. • This assumption may be checked by looking at a histogram or a Q-Q plot. Normality can also be checked with a goodness-of-fit test (e.g., the Kolmogorov-Smirnov test), though this test must be conducted on the residuals themselves. • Third, multiple linear regression assumes that there is no multicollinearity in the data. • Multicollinearity occurs when the independent variables are too highly correlated with each other.
Multicollinearity may be checked in multiple ways: • 1) Correlation matrix – When computing a matrix of Pearson's bivariate correlations among all independent variables, the magnitude of the correlation coefficients should be less than .80. • 2) Variance Inflation Factor (VIF) – The VIFs of the linear regression indicate the degree to which the variances in the regression estimates are increased due to multicollinearity. VIF values higher than 10 indicate that multicollinearity is a problem. • If multicollinearity is found in the data, one possible solution is to center the data: subtract the mean score from each observation for each independent variable. However, the simplest solution is to identify the variables causing multicollinearity issues (i.e., through correlations or VIF values) and remove those variables from the regression (a VIF check is sketched below).
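A minimal sketch of a VIF check, assuming the predictors are already collected in a pandas DataFrame X (the DataFrame name is an assumption); it uses statsmodels' variance_inflation_factor:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    # VIF is computed on the regression design matrix, so add a constant column
    design = sm.add_constant(X)
    return pd.Series(
        [variance_inflation_factor(design.values, i) for i in range(design.shape[1])],
        index=design.columns)

# Usage: print(vif_table(X)); ignore the 'const' row, and treat VIF > 10 as a warning sign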
A scatterplot of residuals versus predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution; if there is a cone-shaped pattern (as shown below), the data is heteroscedastic. A residual plot is sketched after this point.
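A hedged sketch of such a residual plot, assuming a fitted regressor lr and arrays X, y are already available (these variable names are assumptions used only for illustration):

import matplotlib.pyplot as plt

y_pred = lr.predict(X)       # predicted values
residuals = y - y_pred       # residuals (observed - predicted)

plt.scatter(y_pred, residuals, s=10)
plt.axhline(0.0, color='red', linewidth=1)
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Homoscedasticity check: a cone-shaped spread indicates heteroscedasticity')
plt.show()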
Multiple Linear Regression • DUMMY VARIABLES: a categorical variable is encoded as a set of 0/1 dummy variables.
Multiple Linear Regression • DUMMY VARIABLES: with one dummy variable D1 included, the model becomes y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1
Multiple Linear Regression • DUMMY VARIABLE TRAP: if both dummies are included, D2 = 1 - D1, so the dummies are perfectly correlated (multicollinearity): y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1 + b5 * D2. Always OMIT one dummy variable (a dummy-encoding sketch follows below).
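A minimal sketch of dummy encoding that avoids the trap, using a hypothetical 'city' column (the column and category names are illustrative); drop_first=True omits one dummy variable:

import pandas as pd

df = pd.DataFrame({'city': ['Pune', 'Mumbai', 'Pune', 'Nashik'],
                   'experience': [2, 5, 3, 7]})

# One-hot encode the categorical column, dropping one dummy to avoid the trap
encoded = pd.get_dummies(df, columns=['city'], drop_first=True)
print(encoded)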
Building A Model • STEP BY STEP
Building A Model • METHODS OF BUILDING A MODEL • All-in • Backward Elimination • Forward Selection • Bidirectional Elimination • Score Comparison • Backward Elimination, Forward Selection and Bidirectional Elimination are together referred to as stepwise regression
Building A Model • METHODS OF BUILDING A MODEL • ALL-IN • Throw in every variable • Prior knowledge • Known values • Preparing for Backward Elimination
Building A Model • BACKWARD ELIMINATION (the best of these methods) • Step 1: select a significance level to stay in the model (SL = 0.05) • Step 2: fit the full model with all possible predictors • Step 3: consider the predictor with the highest P-value; if P > SL, go to Step 4, otherwise go to FIN (MODEL BUILT) • Step 4: remove that predictor • Step 5: fit the model without this variable, then return to Step 3 (a sketch follows below)
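A hedged sketch of backward elimination driven by OLS P-values in statsmodels, assuming a feature DataFrame X and a target vector y are already defined (the names and the use of statsmodels here are assumptions, not part of the original slides):

import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    # Step 2: start from the full model (All-in), with an intercept column
    X_opt = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X_opt).fit()
        worst_p = model.pvalues.max()
        if worst_p <= sl:                        # FIN: every remaining predictor stays
            return model
        worst_feature = model.pvalues.idxmax()   # Step 3: predictor with highest P-value
        X_opt = X_opt.drop(columns=[worst_feature])   # Step 4: remove it, then refit

# Usage: final_model = backward_elimination(X, y); print(final_model.summary())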
Bi-Dimensional Example • Let's consider a small dataset built by adding some uniform noise to the points belonging to a segment bounded between -6 and 6.
The original equation is: y = x + 2 + n, where n is a noise term. • The figure shows the dataset together with a candidate regression function. • As we're working on a plane, the regressor we're looking for is a function of only two parameters, the intercept and the slope: ỹ(x) = v0 + v1·x (matching v[0] and v[1] in the code below). • In order to fit our model, we must find the best parameters, and to do that we choose an ordinary least squares approach.
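A minimal sketch of how such a dataset could be generated, consistent with y = x + 2 + n (nb_samples, the noise amplitude and the names X, Y are assumptions chosen to match the later code snippets):

import numpy as np

nb_samples = 200
X = np.linspace(-6.0, 6.0, nb_samples)
# Uniform noise term n added to the segment y = x + 2
Y = X + 2.0 + np.random.uniform(-3.5, 3.5, size=nb_samples)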
This task can be easily accomplished by the least squares method. • It is the most common method used for fitting a regression line. • It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. • Because the deviations are first squared, when added there is no cancelling out between positive and negative values.
The loss function to minimize is: L(v) = 0.5 · Σᵢ (v0 + v1·xᵢ - yᵢ)². • In code (for simplicity, the function accepts a vector v containing both parameters):
import numpy as np

def loss(v):
    e = 0.0
    for i in range(nb_samples):
        e += np.square(v[0] + v[1]*X[i] - Y[i])
    return 0.5 * e
In order to find the global minimum, we must impose ∇L = 0. The gradient has two components, ∂L/∂v0 = Σᵢ (v0 + v1·xᵢ - yᵢ) and ∂L/∂v1 = Σᵢ (v0 + v1·xᵢ - yᵢ)·xᵢ, and can be defined as:
def gradient(v):
    g = np.zeros(shape=2)
    for i in range(nb_samples):
        g[0] += (v[0] + v[1]*X[i] - Y[i])
        g[1] += ((v[0] + v[1]*X[i] - Y[i]) * X[i])
    return g
The optimization can now be solved using SciPy's scipy.optimize.minimize. Its main parameters are: • fun : callable – the objective function to be minimized, with signature fun(x, *args) -> float, where x is a 1-D array with shape (n,) and args is a tuple of the fixed parameters needed to completely specify the function. • x0 : ndarray, shape (n,) – initial guess; an array of real elements of size (n,), where n is the number of independent variables. • args : tuple, optional – extra arguments passed to the objective function and its derivatives (the fun, jac and hess functions). • method : str or callable, optional – type of solver; should be one of 'Nelder-Mead', 'Powell', 'CG', 'BFGS', 'Newton-CG', 'L-BFGS-B', 'TNC', 'COBYLA', 'SLSQP', 'trust-constr', 'dogleg', 'trust-ncg', 'trust-exact', 'trust-krylov', or a custom callable object. If not given, it is chosen to be one of BFGS, L-BFGS-B or SLSQP, depending on whether the problem has constraints or bounds. • jac : {callable, '2-point', '3-point', 'cs', bool}, optional – method for computing the gradient vector. • hess : {callable, '2-point', '3-point', 'cs', HessianUpdateStrategy}, optional – method for computing the Hessian matrix.
>>> from scipy.optimize import minimize
>>> minimize(fun=loss, x0=[0.0, 0.0], jac=gradient, method='L-BFGS-B')
fun: 9.7283268345966025
hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
jac: array([ 7.28577538e-06, -2.35647522e-05])
message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 8
nit: 7
status: 0
success: True
x: array([ 2.00497209, 1.00822552])
• As expected, the regression denoised our dataset, rebuilding the original equation: y = x + 2.
Scipy Optimization Example using Python • Optimization deals with selecting the best option among a number of possible choices that are feasible or don't violate constraints. • Mathematical optimization problems may include equality constraints (e.g. =), inequality constraints (e.g. <, <=, >, >=), objective functions, algebraic equations, differential equations, continuous variables, discrete or integer variables, etc.
This problem has a nonlinear objective that the optimizer attempts to minimize. The variable values at the optimal solution are subject to (s.t.) both equality (=40) and inequality (>25) constraints. The product of the four variables must be greater than 25 while the sum of squares of the variables must also equal 40. In addition, all variables must be between 1 and 5 and the initial guess is x1 = 1, x2 = 5, x3 = 5, and x4 = 1. https://www.youtube.com/watch?v=cXHvC_FGx24
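A hedged sketch of this constrained problem with scipy.optimize.minimize (SLSQP). The constraints, bounds and initial guess come from the text above; the nonlinear objective itself is not stated on the slide, so the one used here (x1*x4*(x1 + x2 + x3) + x3, the standard benchmark used in the linked video) is an assumption:

import numpy as np
from scipy.optimize import minimize

def objective(x):
    # Assumed nonlinear objective: x1*x4*(x1 + x2 + x3) + x3
    return x[0]*x[3]*(x[0] + x[1] + x[2]) + x[2]

def ineq_constraint(x):
    return x[0]*x[1]*x[2]*x[3] - 25.0      # product of the variables must exceed 25

def eq_constraint(x):
    return np.sum(np.square(x)) - 40.0     # sum of squares must equal 40

x0 = [1.0, 5.0, 5.0, 1.0]                  # initial guess from the slide
bounds = [(1.0, 5.0)] * 4                  # all variables between 1 and 5
constraints = [{'type': 'ineq', 'fun': ineq_constraint},
               {'type': 'eq',   'fun': eq_constraint}]

solution = minimize(objective, x0, method='SLSQP', bounds=bounds, constraints=constraints)
print(solution.x)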
Linear regression with scikit-learn and higher dimensionality • scikit-learn offers the class LinearRegression, which works with n-dimensional spaces. For this purpose, we're going to use the Boston dataset:
from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> boston.data.shape
(506L, 13L)
>>> boston.target.shape
(506L,)
It has 506 samples with 13 input features and one output. In the following figure, there's a collection of the plots of the first 12 features:
How to Find the Accuracy of a Model • We ask the model to normalize the data before processing it. Moreover, for testing purposes, we split the original dataset into training (90%) and test (10%) sets:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.1)
>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
To check the accuracy of a regression, scikit-learn provides the internal method score(X, y), which evaluates the model on test data:
>>> lr.score(X_test, Y_test)
0.77371996006718879
• So the overall accuracy is about 77%, which is an acceptable result considering the non-linearity of the original dataset, but it can also be influenced by the subdivision made by train_test_split (as in our case).
we can use the function cross_val_score(), which works with all the classifiers. • The scoring parameter is very important because it determines which metric will be adopted for tests. • As LinearRegression works with ordinary least squares, we preferred the negative mean squared error, which is a cumulative measure that must be evaluated according to the actual values (it's not relative).
from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(lr, boston.data, boston.target, cv=7, scoring='neg_mean_squared_error')
>>> scores
array([ -11.32601065, -10.96365388, -32.12770594, -33.62294354, -10.55957139, -146.42926647, -12.98538412])
>>> scores.mean()
-36.859219426420601
>>> scores.std()
45.704973900600457
Another very important metric used in regression is called the coefficient of determination, or R². It measures the amount of variance in the prediction which is explained by the dataset. With 10-fold cross-validation (cv=10):
>>> cross_val_score(lr, boston.data, boston.target, cv=10, scoring='r2').mean()
0.75
• Here CV stands for cross-validation (10 folds); an R² close to 1 indicates an almost ideal regression, while lower values indicate a poorer model.
Big Mart Sales – In this data set, we have product-wise sales for multiple outlets of a retail chain.
Ridge & Lasso • Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in presence of a ‘large’ number of features.Here ‘large’ can typically mean either of two things: • Large enough to enhance the tendency of a model to overfit (as low as 10 variables might cause overfitting) • Large enough to cause computational challenges. With modern systems, this situation might arise in case of millions or billions of features • Though Ridge and Lasso might appear to work towards a common goal, the inherent properties and practical use cases differ substantially. If you’ve heard of them before, you must know that they work by penalizing the magnitude of coefficients of features along with minimizing the error between predicted and actual observations. These are called ‘regularization’ techniques.
Why Penalize the Magnitude of Coefficients? • Let's try to understand the impact of model complexity on the magnitude of coefficients. As an example, I have simulated a noisy sine curve (between 60° and 300°).
This resembles a sine curve but not exactly, because of the noise. • We'll use this as an example to test different scenarios in this article. • Let's try to estimate the sine function using polynomial regression with powers of x from 1 to 15. • Let's add a column for each power up to 15 in our dataframe (see the sketch below).
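A minimal sketch of this setup, simulating the noisy sine curve between 60° and 300° and adding the power columns (the random seed, noise scale and column names are illustrative assumptions):

import numpy as np
import pandas as pd

# Simulate a noisy sine curve between 60 and 300 degrees (converted to radians)
x = np.array([i * np.pi / 180 for i in range(60, 300, 4)])
np.random.seed(10)
y = np.sin(x) + np.random.normal(0, 0.15, len(x))
data = pd.DataFrame({'x': x, 'y': y})

# Add a column for each power of x up to 15
for i in range(2, 16):
    data['x_%d' % i] = data['x'] ** i
print(data.head())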
Now that we have all 15 powers, let's make 15 different linear regression models, with each model containing variables with powers of x from 1 up to the particular model number. For example, the feature set of model 8 will be {x, x_2, x_3, ..., x_8}. • RSS refers to the 'Residual Sum of Squares', which is nothing but the sum of squared errors between the predicted and actual values in the training data set. • We would expect the models with increasing complexity to fit the data better and result in lower RSS values. • This can be verified by looking at the plots generated for 6 of the models:
As the model complexity increases, the models tend to fit even smaller deviations in the training data set. • Though this leads to overfitting, let's keep this issue aside for some time and come to our main objective, i.e. the impact on the magnitude of coefficients. • See the output in coef_matrix_simple. • It is clearly evident that the size of the coefficients increases exponentially with the increase in model complexity (a Ridge/Lasso sketch follows below).
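A hedged sketch of how Ridge and Lasso shrink those coefficients, reusing the data DataFrame built in the earlier sketch (the predictor list and the alpha values are illustrative, not prescribed settings):

from sklearn.linear_model import LinearRegression, Ridge, Lasso

predictors = ['x'] + ['x_%d' % i for i in range(2, 16)]
X_poly, y_target = data[predictors], data['y']

for name, model in [('OLS', LinearRegression()),
                    ('Ridge', Ridge(alpha=0.05)),
                    ('Lasso', Lasso(alpha=0.01, max_iter=100000))]:
    model.fit(X_poly, y_target)
    # Large coefficients shrink under the L2 (Ridge) and L1 (Lasso) penalties;
    # Lasso can drive some coefficients exactly to zero.
    print(name, [round(c, 3) for c in model.coef_])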