Linear Regression
Daniel Baur
ETH Zurich, Institut für Chemie- und Bioingenieurwissenschaften
ETH Hönggerberg / HCI F128 – Zürich
E-Mail: daniel.baur@chem.ethz.ch
http://www.morbidelli-group.ethz.ch/education/index
Daniel Baur / Numerical Methods for Chemical Engineers / Linear Regression
Regression Analysis
• Aim: To determine to what extent a certain response (dependent) variable is related to a set of explanatory (independent) variables
• Example: James David Forbes (Edinburgh, 1809 – 1869), professor in glaciology. He measured the boiling point of water and the atmospheric pressure at 17 different locations in the Swiss Alps (Jungfrau) and in Scotland, with the aim of using the boiling temperature of water to estimate the altitude.
• [Data table with columns "Observations" and "Response"]
Linear Regression Model
• As inputs for our model we use two vectors x and Y, where
  • $x_i$ is the i-th observation
  • $Y_i$ is the i-th response
• The model reads: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
• At this point, we make a fundamental assumption: the errors $\varepsilon_i$ are mutually independent and normally distributed with mean zero and variance $\sigma^2$
• As outputs from our regression we get estimated values $b_0$ and $b_1$ for the regression parameters $\beta_0$ and $\beta_1$
• A regression is called linear if it is linear in the parameters!
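The remark that a regression is linear if it is linear in the parameters can be illustrated with a minimal sketch. The data below are made up for illustration (they are not from the slides): the model is quadratic in x, yet it can still be fitted by linear least squares.

```matlab
% Made-up data generated from Y = 3 + 0.5*x^2 (illustration only)
x = (0:5)';
Y = 3 + 0.5 * x.^2;

% The model Y = beta0 + beta1*x^2 is nonlinear in x, but linear in
% beta0 and beta1, so it is still a linear regression problem.
X = [ones(size(x)), x.^2];   % design matrix: intercept column and x^2
b = X \ Y;                   % least-squares estimate of [beta0; beta1]
```

A model such as Y = beta0 * exp(beta1 * x), in contrast, is nonlinear in beta1 and cannot be written as a design matrix times a parameter vector.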
The Errors ε
• Since the errors are assumed to be normally distributed with mean zero and variance $\sigma^2$, the following is true for the expectation values and variance of the model responses $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:
  • $E[Y_i] = \beta_0 + \beta_1 x_i$
  • $\mathrm{Var}[Y_i] = \sigma^2$
Estimating the Parameters
• The parameters β are chosen to be optimal in a least squares sense
• The objective function S is a measure of how close the regression line lies to the n observations: $S(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2$
• Minimize S: the estimates $b_0$ and $b_1$ are the values of $\beta_0$ and $\beta_1$ for which S is smallest
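A minimal sketch of the objective function, assuming the simple model with one intercept and one slope and using made-up data: S evaluated at the least-squares estimate should be smaller than S evaluated at any perturbed parameter pair. The names S, bHat, S_opt and S_off are chosen here for illustration.

```matlab
% Made-up data (illustration only)
x = [1; 2; 3; 4; 5];
Y = [2.1; 3.9; 6.2; 7.8; 10.1];

% Sum of squared residuals for a parameter vector b = [b0; b1]
S = @(b) sum((Y - b(1) - b(2) * x).^2);

% Least-squares estimate via the backslash operator
X = [ones(size(x)), x];
bHat = X \ Y;

S_opt = S(bHat);                 % value of S at the least-squares estimate
S_off = S(bHat + [0.1; -0.05]);  % value of S for perturbed parameters
```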
Example: Boiling Temperature and Pressure
Parameter Estimation (Manual)
• Averages
• Estimation of b0 and b1
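The code shown on this slide did not survive; the two steps can be sketched as follows, assuming the standard closed-form least-squares formulas for a straight line. The data are made up so that the result is easy to check by hand.

```matlab
% Made-up data lying exactly on Y = 1 + 2*x (illustration only)
x = [1; 2; 3; 4; 5];
Y = [3; 5; 7; 9; 11];

% Averages
xBar = mean(x);
YBar = mean(Y);

% Estimation of b0 and b1
b1 = sum((x - xBar) .* (Y - YBar)) / sum((x - xBar).^2);  % slope
b0 = YBar - b1 * xBar;                                    % intercept
```

For these numbers the slope is 20 / 10 = 2 and the intercept is 7 - 2 * 3 = 1, which recovers the line used to generate the data.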
Parameter Estimation (Built-In Function)
• α = significance level of the confidence intervals (e.g. α = 0.05 gives 95% confidence intervals)
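A minimal sketch of the built-in route, assuming the function meant here is regress from the Statistics Toolbox (it is named in the assignment later on). The data are made up for illustration.

```matlab
% Made-up data (illustration only)
x = [1; 2; 3; 4; 5; 6];
Y = [3.1; 4.9; 7.2; 8.8; 11.1; 12.9];

X = [ones(size(x)), x];   % regress needs an explicit column of ones
alpha = 0.05;             % gives 95% confidence intervals

% b: parameter estimates, bint: confidence intervals (one row per parameter)
[b, bint] = regress(Y, X, alpha);
```

The estimates in b should agree with the manual calculation; bint(k,:) holds the lower and upper bound for the k-th parameter.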
Residuals
• [Figure: residual plot; one point is marked "Outlier"]
Removing the Outlier
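The slide's own code is not preserved. One possible way to do this by hand is sketched below, under the assumption that the outlier is simply the observation with the largest absolute residual. The data are made up: eight points on a straight line, one of which is shifted.

```matlab
% Made-up data: points on Y = 1 + 2*x, with the 4th value shifted by +5
x = (1:8)';
Y = 1 + 2 * x;
Y(4) = Y(4) + 5;

X = [ones(size(x)), x];
b = X \ Y;                  % fit with all points
e = Y - X * b;              % residuals

% Assumption: treat the point with the largest |residual| as the outlier
[~, iOut] = max(abs(e));

% Refit without that point
keep = true(size(x));
keep(iOut) = false;
bClean = X(keep, :) \ Y(keep);
```

Whether a point may be discarded is a judgment about the measurement, not only about the size of its residual; the criterion used here is just the simplest one.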
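Analysis of Variance (ANOVA)
• Total Sum of Squares: $SST = \sum_i (Y_i - \bar{Y})^2$
• Sum of Squares due to Regression: $SSR = \sum_i (\hat{Y}_i - \bar{Y})^2$
• Sum of Squares due to Error: $SSE = \sum_i (Y_i - \hat{Y}_i)^2 = \sum_i e_i^2$
• Coefficient of Determination: $R^2 = SSR / SST = 1 - SSE / SST$
  • $R^2 = 1$: $e_i = 0$ for all i
  • $R^2 = 0$: regression does not explain variation of Y
• Here $\bar{Y}$ is the mean of the responses, $\hat{Y}_i$ are the fitted values and $e_i$ the residuals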
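The three sums of squares can be computed directly from a fit. The sketch below uses made-up data and the conventional abbreviations SST, SSR and SSE, which are assumed names for the quantities listed on the slide.

```matlab
% Made-up data (illustration only)
x = [1; 2; 3; 4; 5];
Y = [2.1; 3.9; 6.2; 7.8; 10.1];

X = [ones(size(x)), x];
b = X \ Y;            % least-squares estimates
YHat = X * b;         % fitted values
e = Y - YHat;         % residuals

SST = sum((Y - mean(Y)).^2);     % total sum of squares
SSR = sum((YHat - mean(Y)).^2);  % sum of squares due to regression
SSE = sum(e.^2);                 % sum of squares due to error

R2 = SSR / SST;       % coefficient of determination (equals 1 - SSE/SST)
```

For a least-squares fit with an intercept, SST = SSR + SSE up to rounding, which is why both expressions for R2 agree.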
The LinearModel and dataset Classes
• Matlab 2012 features two classes that are designed specifically for statistical analysis and linear regression
• dataset
  • creates an object that holds data and meta-data like variable names, options for inclusion / exclusion of data points, etc.
• LinearModel
  • is constructed from datasets or X, Y pairs (as with the regress function) and a model description
  • automatically does linear regression and holds all important regression outputs like parameter estimates, residuals, confidence intervals, etc.
  • includes several useful functions like plots, residual analysis, exclusion of parameters, etc.
Classes in Matlab
• Classes define a set of properties (variables) and methods (functions) which operate on those properties
• This is useful for bundling information together with ways of treating and modifying this information
• When a class is instantiated, an object of this class is created which can be used with the methods of the class, e.g. mdl = LinearModel.fit(X,Y);
• Properties can be accessed with the dot operator, like with structs (e.g. mdl.Coefficients)
• Methods can be called either with the dot operator, or by having an object of the class as first input argument (e.g. plot(mdl) or mdl.plot())
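A short sketch of these points, assuming a Matlab release with the Statistics Toolbox in which LinearModel.fit is available. The data are made up, and coefCI is used as the example method because it returns a value that can be compared.

```matlab
% Made-up data (illustration only)
X = [1; 2; 3; 4; 5; 6];
Y = [3.1; 4.9; 7.2; 8.8; 11.1; 12.9];

mdl = LinearModel.fit(X, Y);      % instantiate: creates a LinearModel object

est = mdl.Coefficients.Estimate;  % property access with the dot operator

% Two equivalent ways of calling the method coefCI on the object
ci1 = coefCI(mdl);                % object as first input argument
ci2 = mdl.coefCI();               % dot operator
```

Note that LinearModel.fit adds the intercept itself, so X holds only the explanatory variable and no column of ones.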
Working with LinearModel and dataset
• First, we define our observed and measured variables, giving them appropriate names, since these names will be used by the dataset and the LinearModel as meta-data
Working with LinearModel and dataset
• Next, we construct the dataset from our variables
Working with LinearModel and dataset
• After defining the relationship between our data (a model), we can use the dataset and the model to construct a LinearModel object
• This will automatically fit the data, perform residual analysis and much more
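The three steps (naming the variables, building the dataset, fitting the model) might look as follows. This is a sketch, not the code from the slides: the variable names Temperature and Pressure, the choice of Pressure as the response, and all numbers are assumptions made for illustration and are not Forbes' measurements.

```matlab
% Made-up numbers; the workspace names become the dataset's variable names
Temperature = [194; 198; 201; 204; 209; 212];
Pressure    = [20.8; 22.6; 24.0; 25.5; 28.3; 30.0];

% Construct the dataset from the workspace variables
ds = dataset(Temperature, Pressure);

% Model description: response on the left of '~', predictors on the right
modelspec = 'Pressure ~ Temperature';

% Fit the model; an intercept is included by default
mdl = LinearModel.fit(ds, modelspec);
disp(mdl)
```

Because the names travel with the dataset, the displayed coefficient table and the plot labels refer to Temperature and Pressure rather than to anonymous columns.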
LinearModel: Plot
• Now that we have the model, we can do the same things we did manually before, but much more easily
LinearModel: Tukey-Anscombe Plot
• Plot residuals vs. fitted values; these should be randomly distributed around 0
• [Figure annotation: "Outlier?"]
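A Tukey-Anscombe plot can also be drawn by hand, as in the sketch below with made-up data. For a LinearModel object mdl, plotResiduals(mdl, 'fitted') should give the corresponding plot without the manual steps.

```matlab
% Made-up data (illustration only)
x = (1:8)';
Y = [3.2; 4.7; 7.1; 9.2; 10.8; 13.1; 14.9; 17.0];

X = [ones(size(x)), x];
b = X \ Y;
YHat = X * b;      % fitted values
e = Y - YHat;      % residuals

% Tukey-Anscombe plot: residuals against fitted values
plot(YHat, e, 'o');
hold on
plot([min(YHat), max(YHat)], [0, 0], 'k--');   % reference line at zero
hold off
xlabel('Fitted values');
ylabel('Residuals');
```

A curved pattern in this plot hints at nonlinearity, and a funnel shape at non-constant variance.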
LinearModel: Cook’s Distance
• The Cook’s distance measures how strongly the fitted values change when a single measurement is removed from the data; large values point to influential measurements
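To make the definition concrete, the sketch below computes the Cook's distance by hand from the residuals and the leverages, using the usual textbook formula and made-up data in which the 4th point is shifted. With a LinearModel object mdl, the same quantity should be available as mdl.Diagnostics.CooksDistance and can be plotted with plotDiagnostics(mdl, 'cookd').

```matlab
% Made-up data: points on Y = 1 + 2*x, with the 4th value shifted by +5
x = (1:8)';
Y = 1 + 2 * x;
Y(4) = Y(4) + 5;

X = [ones(size(x)), x];
[n, p] = size(X);                 % n observations, p parameters
b = X \ Y;
e = Y - X * b;                    % residuals

h  = diag(X * ((X' * X) \ X'));   % leverages (diagonal of the hat matrix)
s2 = sum(e.^2) / (n - p);         % mean squared error

% Cook's distance of each observation
D = (e.^2 .* h) ./ (p * s2 * (1 - h).^2);

[~, iMax] = max(D);               % most influential observation
```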
LinearModel: Removing the Outlier
• Once an outlier has been identified, it can easily be removed
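One way to do this, sketched under the assumption that the 'Exclude' option of LinearModel.fit is used (the slide's own code is not preserved). The data are made up, with the 4th observation shifted on purpose.

```matlab
% Made-up data with a deliberately shifted 4th observation
x = (1:8)';
Y = [3.2; 4.7; 7.1; 14.2; 10.8; 13.1; 14.9; 17.0];

ds = dataset(x, Y);

mdlAll   = LinearModel.fit(ds, 'Y ~ x');                % all observations
mdlClean = LinearModel.fit(ds, 'Y ~ x', 'Exclude', 4);  % without observation 4
```

The excluded row stays in the dataset and is only left out of the fit, so the original data remain intact.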
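Multiple Linear Regression
• Approximate model: $\hat{Y} = X b$, where each row of X holds a 1 (for the intercept) followed by the explanatory variables of one observation, and b is the vector of parameter estimates
• Residuals: $e = Y - \hat{Y} = Y - X b$
• Least squares: minimize $S = e^T e$, which leads to the normal equations $X^T X b = X^T Y$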
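A minimal sketch of the matrix formulation with two explanatory variables. The data are made up and noise-free so that the known parameters are recovered; the names bNormal and bQR are chosen here for illustration.

```matlab
% Made-up data with two explanatory variables (illustration only)
x1 = [1; 2; 3; 4; 5; 6];
x2 = [2; 1; 4; 3; 6; 5];
Y  = 1 + 2 * x1 - 0.5 * x2;      % exact, noise-free responses

X = [ones(size(x1)), x1, x2];    % design matrix with intercept column

bNormal = (X' * X) \ (X' * Y);   % solve the normal equations
bQR     = X \ Y;                 % backslash solves the same least-squares problem

e = Y - X * bQR;                 % residuals
```

Both routes give the same estimates here; for badly conditioned X the backslash form is usually the numerically safer choice.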
Exercise
• The data file asphalt.dat (online) contains data from a degradation experiment for different concrete mixtures [1]
• The rutting (erosion) in inches per million cars (RUT) is measured as a function of
  • viscosity (VISC)
  • percentage of asphalt in the surface course (ASPH)
  • percentage of asphalt in the base course (BASE)
  • an operating mode 0 or 1 (RUN)
  • percentage (*10) of fines in the surface course (FINES)
  • percentage of voids in the surface course (VOIDS)
[1] R.V. Hogg and J. Ledolter, Applied Statistics for Engineers and Physical Scientists, Maxwell Macmillan International Editions, 1992, p. 393.
Assignment
• The LinearModel class only exists in Matlab 2012 or newer
• There are two versions of the assignment, one for Matlab 2012 and newer and one for older versions; do one of the two
Assignment (Matlab 2012 and newer only)
• Find online the file readVars.m that will read the data file and assign the variables RUT, VISC, ASPH, BASE, RUN, FINES and VOIDS; you can copy and paste this script into your own file.
• Create a dataset using the variables from the first step.
• Set the RUN variable to be a discrete variable
  • Assuming your dataset is called ds, use ds.RUN = nominal(ds.RUN);
• Create a modelspec string
  • To include multiple variables in the modelspec, use the plus sign
• Fit your model using LinearModel.fit, display the model output and plot the model.
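The modelspec syntax with several variables and a discrete variable can be sketched on a small made-up example. The variables y, a and grp are hypothetical placeholders and have nothing to do with the asphalt data.

```matlab
% Hypothetical variables with made-up values (not the asphalt data)
y   = [1.0; 1.9; 3.2; 3.9; 5.1; 6.2; 6.8; 8.1];
a   = [1; 2; 3; 4; 5; 6; 7; 8];
grp = [0; 1; 0; 1; 0; 1; 0; 1];    % a two-level operating mode

ds = dataset(y, a, grp);
ds.grp = nominal(ds.grp);          % mark grp as a discrete variable

% Several explanatory variables are joined with the plus sign
modelspec = 'y ~ a + grp';

mdl = LinearModel.fit(ds, modelspec);
disp(mdl)
```

A nominal variable with two levels enters the model as a single indicator term, so this model has three coefficients: intercept, a, and the indicator for the second level of grp.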
Assignment (Continued)
• Which variables most likely have the largest influence?
• Generate the Tukey-Anscombe plot. Is there any indication of nonlinearity, non-constant variance or of a skewed distribution of residuals?
• Plot the adjusted responses for each variable, using the plotAllResponses function you can find online
• The variables seem to show a rather random response, except for VISC, which seems to mostly lie on one of the axes. Try to transform the system by defining
  • logRUT = log10(RUT); logVISC = log10(VISC);
• Define a new dataset and modelspec using the transformed variables.
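Why a log transform of this kind can help is sketched below on made-up data: a power law y = c * v^k is not linear in v, but log10(y) = log10(c) + k * log10(v) is a straight line in log10(v). Whether the asphalt data behave this way is what the exercise asks you to find out; the names v, y, c and k are hypothetical.

```matlab
% Made-up power-law data y = 5 * v^(-0.5) (illustration only)
v = [1; 10; 100; 1000];
y = 5 * v.^(-0.5);

% After the transformation the relationship is a straight line
logV = log10(v);
logY = log10(y);

X = [ones(size(logV)), logV];
b = X \ logY;        % b(1) estimates log10(c), b(2) estimates k

c = 10^b(1);         % back-transformed prefactor
k = b(2);            % exponent
```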
Assignment (Continued)
• Fit a new model with the transformed variables and repeat the analysis from before.
• With the new model, try to remove variables that have a small influence. To do this systematically, use the function step, which will remove and/or add variables one at a time:
  • reduced_model = step(mdl2, 'nsteps', 20);
• Which variables have been removed and which of the remaining ones most likely have the largest influence?
Assignment (Matlab versions older than 2012)
• Find online the file readVars.m that will read the data file and assign the variables RUT, VISC, ASPH, BASE, RUN, FINES and VOIDS; you can copy and paste this script into your own file.
• Create the matrix X using the variables from the first step except RUT, plus a column of ones.
• Create the vector Y using RUT
• Fit your model using regress and alpha = 0.05
• Display the estimated values of beta and the confidence intervals
• Are any of the values not significantly different from 0, i.e. does 0 lie inside the confidence interval?
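The significance check in the last step can be sketched as follows, assuming regress from the Statistics Toolbox. The data are made up (not the asphalt data): x2 does not enter Y by construction, and the small deterministic disturbance stands in for measurement noise.

```matlab
% Made-up data (illustration only); x2 has no influence on Y by construction
x1 = (1:10)';
x2 = [1; -1; 1; -1; 1; -1; 1; -1; 1; -1];
disturbance = 0.1 * [1; -2; 0; 2; -1; 1; 0; -1; 2; -2];
Y = 1 + 2 * x1 + disturbance;

X = [ones(size(x1)), x1, x2];           % column of ones for the intercept
[bHat, bInt] = regress(Y, X, 0.05);     % estimates and 95% intervals

% A parameter is not significantly different from 0 if 0 lies
% inside its confidence interval
containsZero = bInt(:,1) <= 0 & bInt(:,2) >= 0;

disp([bHat, bInt]);
```

Each row of the displayed matrix shows an estimate followed by the lower and upper bound of its interval; containsZero flags the rows whose interval includes 0.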
Assignment (Continued)
• Generate the Tukey-Anscombe plot. Is there any indication of nonlinearity, non-constant variance or of a skewed distribution of residuals?
• Plot the response of all the variables using plotmatrix(aspData(:,1:6), RUT). The variables seem to show a rather random response, except for VISC. Try to transform the system by defining
  • logRUT = log10(RUT); logVISC = log10(VISC);
• Define a new X matrix and a new Y vector and regress again
• Comment again on the estimates and their significance
• Reproduce the Tukey-Anscombe plot. Did anything change?