
Multiple Regression Analysis



  1. Multiple Regression Analysis Ram Akella University of California Berkeley Silicon Valley Center/SC Lecture 2 January 26, 2011.

  2. Introduction We extend the concept of simple linear regression to investigate a response y that is affected by several independent variables x1, x2, x3, …, xk. Our objective is to use the information provided by the xi to predict the value of y.

  3. Example Rating Prediction We have a database of movie ratings (on a scale of 1 to 10) from various users. We can predict a missing rating for a movie m from user Y based on: • other movie ratings from the same user • ratings given to movie m by other users

  4. People ratings

  5. Illustrative Example Body Fat Prediction • Let y be the body fat index of an individual. This might be a function of several variables: • x1 = height • x2 = weight • x3 = abdomen measurement • x4 = hip measurement • We want to predict y using knowledge of x1, x2, x3 and x4.

  6. Formalization of the Regression Model For each observation i, the expected value of the dependent variable y conditional on the information x1, x2, …, xp is given by: E(yi | xi1, …, xip) = b0 + b1xi1 + b2xi2 + … + bpxip. We add some noise εi to this expectation model, and the value of y becomes: yi = E(yi | xi1, …, xip) + εi. Combining both equations we have yi = b0 + b1xi1 + … + bpxip + εi.

  7. Formalization of the Regression Model We can express the regression model in matrix terms: y = Xb + ε, where y is a vector of order (n x 1), b is a vector of order ((p+1) x 1), and X is the n x (p+1) matrix whose first column is all 1s and whose remaining columns hold the values of x1, …, xp. The column of 1s corresponds to the dummy variable that is multiplied by the intercept parameter b0.
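
To make the matrix form concrete, here is a minimal numpy sketch (the data values are invented for illustration) of building the design matrix X with its leading column of 1s:

```python
import numpy as np

# Hypothetical data: n = 5 observations, p = 2 predictors
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Design matrix X: a leading column of ones (the dummy variable for b0),
# followed by one column per predictor.
X = np.column_stack([np.ones(len(x1)), x1, x2])
print(X.shape)  # (5, 3), i.e. n x (p + 1)
```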

  8. Assumptions of the Regression Model Assumptions about the data matrix X • It is fixed for fitting purposes • It is of full rank Assumptions about the random errors ε • The εi are independent • They have mean 0 and common variance σ² for any set x1, x2, …, xk • They have a normal distribution

  9. Method of Least Squares The method of least squares is used to estimate the values of b that minimize the sum of squared differences between the observations yi and the fitted values formed by these parameters and the variables x1, x2, …, xk.

  10. Mechanics of the Least Squares Estimator in multiple variables • The objective is to choose b̂ to minimize S(b) = (y - Xb)'(y - Xb) • We differentiate the expression above with respect to b and set it to zero, obtaining the normal equations: X'X b̂ = X'y • Solving for b̂ we obtain b̂ = (X'X)-1X'y
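
A minimal numpy sketch of this estimator (with synthetic data, since none is given here); np.linalg.lstsq solves the least squares problem in a numerically stable way rather than forming (X'X)-1 explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p + 1)
b_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ b_true + rng.normal(scale=0.1, size=n)

# Least squares solution of X b = y, i.e. b_hat = (X'X)^-1 X'y
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)  # close to b_true
```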

  11. Mechanics of the Least Squares Estimator in multiple variables The variance of the estimator is Var(b̂) = E[(b̂ - b)(b̂ - b)']. We know that: b̂ = (X'X)-1X'y = (X'X)-1X'(Xb + ε) = b + (X'X)-1X'ε. And the expected value of εε' is E[εε'] = σ²I.

  12. Mechanics of the Least Squares Estimator in multiple variables Substituting in the equation above: Var(b̂) = (X'X)-1X' E[εε'] X(X'X)-1 = σ²(X'X)-1.
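
Continuing the synthetic-data sketch above, the residuals give an unbiased estimate of σ², and from it the estimated covariance matrix σ̂²(X'X)-1 of the coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b_hat
sigma2_hat = resid @ resid / (n - p - 1)     # unbiased estimate of sigma^2
cov_b = sigma2_hat * np.linalg.inv(X.T @ X)  # Var(b_hat) = sigma^2 (X'X)^-1
print(np.sqrt(np.diag(cov_b)))               # standard errors of b0, b1, b2
```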

  13. Properties of the least squares estimator • The estimator is unbiased: E[b̂] = b • Among all linear unbiased estimators of b, it has the lowest variance (the Gauss-Markov theorem)

  14. Example A computer database in a small community contains: the listed selling price y (in thousands of dollars), the amount of living area x1 (in hundreds of square feet), and the number of floors x2, bedrooms x3, and bathrooms x4, for n = 15 randomly selected residences currently on the market. Fit a first-order model to the data using the method of least squares.

  15. Example The first-order model is y = b0 + b1x1 + b2x2 + b3x3 + b4x4, and the estimate is b = (X'X)-1X'y. [The slide showed the data matrices y and X and the resulting coefficient estimates.]
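
As an illustration of this step, a minimal statsmodels sketch; the numbers below are synthetic stand-ins for the 15 residences, not the actual data from the slide:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 15
X = np.column_stack([
    rng.uniform(8, 20, n),   # x1: living area (hundreds of sq ft)
    rng.integers(1, 3, n),   # x2: number of floors
    rng.integers(2, 5, n),   # x3: bedrooms
    rng.integers(1, 4, n),   # x4: baths
])
y = (18 + 6 * X[:, 0] - 16 * X[:, 1] - 3 * X[:, 2] + 30 * X[:, 3]
     + rng.normal(scale=7, size=n))

model = sm.OLS(y, sm.add_constant(X)).fit()  # adds the column of 1s for b0
print(model.params)                          # b0, b1, b2, b3, b4
```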

  16. Some Questions • How well does the model fit? • How strong is the relationship between y and the predictor variables? • Have any assumptions been violated? • How good are the estimates and predictions? To answer these questions we need the n observations on the response y and the independent variables x1, x2, x3, …, xk.

  17. Residuals • The difference between the observed value yi and the corresponding fitted value ŷi is a residual, defined as: ei = yi - ŷi • The sum of the squared residuals, scaled by σ², follows a chi-square distribution (with n - p - 1 degrees of freedom)

  18. Residuals versus Fits • If the normal assumption is valid, the plot of the residuals should appear as a random scatter around the zero center line. • If not, you will see a pattern in the residuals.

  19. Residuals versus Fits If we see a pattern in the residuals, then this method may not be appropriate for fitting the data: • The descriptors and the predicted variable may not follow a linear relationship • We can transform the descriptors in order to achieve a better fit
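
A minimal matplotlib sketch of a residuals-versus-fits plot (synthetic data assumed); a healthy fit shows random scatter around zero, while curvature or a funnel shape signals a violated assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b_hat
resid = y - fitted

plt.scatter(fitted, resid)
plt.axhline(0, color="gray")  # zero center line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```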

  20. How good is the fit? • Our objective in regression is to choose the parameters b0, b1,…, bk that provide the most accurate fitted values of y by minimizing the uncertainty that remains after using the information in X. • This remaining uncertainty is summarized by the measure R², which is equal to: R² = 1 - SSE/SST, where SSE = Σ(yi - ŷi)² is the residual sum of squares and SST = Σ(yi - ȳ)² is the total sum of squares.

  21. How good is the fit? • We are interested in having a high value of R² when we are evaluating our data • One drawback of R² is that whenever an independent variable is added to the model this measure always increases (the model gains degrees of freedom) • We therefore calculate the adjusted R² to weigh the improvement in fit against the cost in degrees of freedom: R²adj = 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors and n is the number of samples used to fit the model
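
A small helper implementing both formulas; here p is assumed to count the predictors, excluding the intercept:

```python
import numpy as np

def r_squared(y, fitted, p):
    """Return (R^2, adjusted R^2) for a model with p predictors."""
    n = len(y)
    sse = np.sum((y - fitted) ** 2)      # residual sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    r2 = 1 - sse / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj
```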

  22. Is it significant? • The first question to ask is whether the regression model is of any use in predicting y. • If it is not, then the value of y does not change, regardless of the values of the independent variables x1, x2, …, xk. This implies that the partial regression coefficients b1, b2, …, bk are all zero, so we test H0: b1 = b2 = … = bk = 0 versus Ha: at least one bi is not zero.

  23. F statistics • An F statistic is the ratio of two independent χ² random variables, each divided by its respective degrees of freedom. The key point is to show that the two chi-square quantities are independent.

  24. Is it significant? Testing the model using the F test: F = [Σi(ŷi - ȳ)² / p] / [Σi(yi - ŷi)² / (n - p - 1)], an F statistic with (p, n - p - 1) degrees of freedom. It is the ratio of the mean square due to regression to the mean square of the residual terms (e). If the model is useful, the value of F will be large.
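
A sketch of the overall F test using scipy for the tail probability (the helper name and arguments are this note's own, not from the slides):

```python
import numpy as np
from scipy import stats

def overall_f_test(y, fitted, p):
    """F statistic and p-value for H0: b1 = b2 = ... = bp = 0."""
    n = len(y)
    ssr = np.sum((fitted - np.mean(y)) ** 2)  # regression sum of squares
    sse = np.sum((y - fitted) ** 2)           # error sum of squares
    f = (ssr / p) / (sse / (n - p - 1))
    p_value = stats.f.sf(f, p, n - p - 1)     # upper-tail probability
    return f, p_value
```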

  25. Analysis of variance table • SS: sum of squares • df: degrees of freedom • MS: mean squares • F: F statistic

  26. Example • ANOVA table for the real estate example: S = 6.849, R-Sq = 97.1%, R-Sq(adj) = 96.0%

  27. Testing Individual Parameters • Is a particular independent variable useful in the model, in the presence of all the other independent variables? We test H0: bi = 0 versus Ha: bi ≠ 0. The test statistic is a function of b̂i, our best estimate of bi.

  28. t statistics • A t distribution arises as a standard normal random variable divided by the square root of an independent chi-square random variable over its degrees of freedom. • To test the significance of a particular coefficient, we calculate its t value: t0 = b̂i / (s√vii), where vii is the element (i, i) of the matrix (X'X)-1 and s is the residual standard error. This statistic has a t distribution with error df = n - p - 1. We reject H0: bi = 0 if |t0| > tα/2, n-p-1.
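
A sketch of these t tests in numpy/scipy (a hand-rolled helper, assuming X already carries its column of 1s):

```python
import numpy as np
from scipy import stats

def coefficient_t_tests(X, y):
    """t statistic and two-sided p-value for each coefficient."""
    n, k = X.shape                       # k = p + 1 columns, incl. intercept
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b_hat
    s2 = resid @ resid / (n - k)         # estimate of sigma^2
    v = np.diag(np.linalg.inv(X.T @ X))  # the v_ii terms
    t = b_hat / np.sqrt(s2 * v)
    p_values = 2 * stats.t.sf(np.abs(t), n - k)
    return t, p_values
```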

  29. The Real Estate Problem In the presence of the other three independent variables, is the number of bedrooms significant in predicting the list price of homes? Test using α = .05.

Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor  Coef     SE Coef  T      P
Constant   18.763   9.207    2.04   0.069
SqFeet     6.2698   0.7252   8.65   0.000
NumFlrs    -16.203  6.212    -2.61  0.026
Bdrms      -2.673   4.494    -0.59  0.565
Baths      30.271   6.849    4.42   0.001

  30. Detecting problems in the regression Multicollinearity • This problem arises from high correlation between the independent variables • Multicollinearity leads to instability in the calculation of the inverse (X'X)-1 • We can measure it by calculating the condition index (CI), given by CI = dmax/dmin, where dmax and dmin are the maximum and minimum singular values in the matrix D obtained from the SVD decomposition of X • The higher the value, the more unstable the inverse of the matrix is. The matrix exhibits multicollinearity if CI > 30
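
The condition index is easy to compute from the singular values of X; a minimal sketch:

```python
import numpy as np

def condition_index(X):
    """CI = d_max / d_min from the SVD of X; CI > 30 flags multicollinearity."""
    d = np.linalg.svd(X, compute_uv=False)  # singular values, descending order
    return d[0] / d[-1]
```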

  31. Detecting problems in the regression Heteroscedasticity When the assumption of a common variance σ² in the error terms εi is violated, the estimation of b loses efficiency: the estimator no longer has the lowest variance, and the usual standard-error estimates are biased. We can address this problem by transforming the dependent variable (for example, taking its logarithm), or by using weighted least squares to estimate b.

  32. Detecting problems in the regression Autocorrelation • This is the problem of consecutive error terms in time-series data being correlated • The consequences of this problem are similar to those of heteroscedasticity • We can detect this problem by plotting the residuals and looking for patterns, or by using the Durbin-Watson test

  33. Durbin-Watson Test • This test is given by the ratio: DW = Σt=2..n (et - et-1)² / Σt=1..n et², where the et are the residuals • This is approximately equal to 2(1 - r), where r is the first-order autocorrelation of the residuals, so if DW is close to 2 there is no evidence of autocorrelation.
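
A minimal sketch of the statistic (statsmodels also ships a ready-made version, statsmodels.stats.stattools.durbin_watson):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic; values near 2 suggest no autocorrelation."""
    diff = np.diff(resid)  # e_t - e_{t-1}
    return np.sum(diff ** 2) / np.sum(resid ** 2)
```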

  34. Comparing Regression Models • To fairly compare two models, we can use: • The adjusted R² • The F test • Both take into account the difference in degrees of freedom between the models

  35. Estimation and Prediction • Once you have: • determined that the regression line is useful • used the diagnostic plots to check for violations of the regression assumptions, you are ready to use the regression line to predict a particular value of y for a given set of values x1, x2, …, xk.

  36. Estimation and Prediction • Enter the appropriate values of x1, x2, …, xk into the fitted equation: ŷ = b̂0 + b̂1x1 + b̂2x2 + … + b̂kxk • Particular values of y are more difficult to predict than the mean of y, so a prediction interval is wider than the corresponding confidence interval.

  37. The Real Estate Problem Estimate the average list price for a home with 1000 square feet of living space, one floor, 3 bedrooms and two baths, with a 95% confidence interval.

Predicted Values for New Observations
New Obs  Fit     SE Fit  95.0% CI          95.0% PI
1        117.78  3.11    (110.86, 124.70)  (101.02, 134.54)

Values of Predictors for New Observations
New Obs  SqFeet  NumFlrs  Bdrms  Baths
1        10.0    1.00     3.00   2.00

We estimate that the average list price will be between $110,860 and $124,700 for a home like this.
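
A hedged statsmodels sketch of producing this kind of output; the fitted data below are synthetic stand-ins, but get_prediction/summary_frame is the standard route to both intervals:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for the fitted real-estate model
rng = np.random.default_rng(4)
X = sm.add_constant(rng.uniform(1, 20, size=(15, 4)))
y = X @ np.array([18.8, 6.27, -16.2, -2.67, 30.3]) + rng.normal(scale=7, size=15)
model = sm.OLS(y, X).fit()

# New home: 1000 sq ft (10 hundreds), 1 floor, 3 bedrooms, 2 baths
x_new = np.array([[1.0, 10.0, 1.0, 3.0, 2.0]])
pred = model.get_prediction(x_new)
print(pred.summary_frame(alpha=0.05))  # 95% CI for the mean and 95% PI
```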

  38. Example: Body Fat Calculation Predict the body fat index of an individual based on the following measures: • x1 = age in years • x2 = weight in lbs • x3 = height in inches • x4 = neck measurement in cm • x5 = chest measurement in cm • x6 = abdomen measurement in cm • x7 = hip measurement in cm • x8 = thigh measurement in cm

  39. Example • Data set example [the slide showed the response vector y and the data matrix X]

  40. Regression Analysis • The result of the regression analysis is the following: [the slide showed the fitted coefficients and summary statistics]

  41. Example • Residuals plot

  42. Example • The prediction for a new sample: given the estimated coefficient vector b and the new sample's predictor row P, the prediction is y = Pb = 29.2801. [The slide showed the vectors b and P.]
