Regression Analysis

Regression Analysis
• Once a linear relationship is defined, the independent variable can be used to forecast the dependent variable: $\hat{Y} = b_0 + bX$
• $b_0$ is called the Y intercept. It represents the value of Y when X = 0. Be cautious, though: this interpretation may be incorrect, and the intercept is difficult to estimate because the data often do not include X = 0. Think of this value as representing the influence of the many other independent variables that are not included in the equation.
• $b$ is called the slope. It represents the amount of change in Y when X increases by one unit.

Regression Analysis
• Regression line: the line that best fits a collection of X-Y data points. The regression line minimizes the sum of the squared vertical distances from the points to the line.
• Regression equation: found by the Method of Least Squares, which solves for $b_0$ and $b$. Other models: stepwise, forward stepwise, and backward stepwise.
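
The least-squares calculation described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `fit_line` and the advertising/sales numbers are made up for the example and do not come from the slides.

```python
def fit_line(xs, ys):
    """Return (b0, b) for the least-squares line Y-hat = b0 + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: forces the line through the point (x_bar, y_bar)
    b0 = y_bar - b * x_bar
    return b0, b


# Hypothetical data, invented for illustration only
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

b0, b = fit_line(advertising, sales)
print(f"Y-hat = {b0:.2f} + {b:.2f} X")  # prints: Y-hat = 2.20 + 0.60 X
```

With this toy data, each extra unit of advertising is associated with 0.6 more units of sales.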

Regression Assumptions
• Y values are normally distributed about the regression line.
• Variance remains constant as X values increase and decrease. Violation of this assumption is called heteroscedasticity.
• Error terms (residuals) are independent of one another, that is, random (no autocorrelation).
• A linear relationship exists between X and Y. Nonlinear techniques are discussed later.

Excel's Regression Tool
• Menu path: Tools, Data Analysis, Regression.
• Hint: include labels in the input ranges to help with the interpretation!
• The tool can also include plots (not shown here).

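Total Deviation = Explained Variance + Unexplained Variance
• Comparison of a forecasted value ($\hat{Y}$) to the actual value (Y) and the average ($\bar{Y}$): $(Y - \bar{Y}) = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$
[Figure: Sales (Y) plotted against Advertising (X), showing the total, explained, and unexplained portions of one point's deviation.]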

Data Analysis
• $R^2$, or the Coefficient of Determination, equals the proportion of the variance in the dependent variable Y that is explained through the relationship with the independent variable X.
• Explained Variance = Total Variance - Unexplained Variance. We state this as a proportion, where n is the sample size and k is the number of estimated parameters:
  Unadjusted: $R^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$
  Adjusted: $\bar{R}^2 = 1 - \frac{\sum (Y - \hat{Y})^2 / (n - k)}{\sum (Y - \bar{Y})^2 / (n - 1)}$
• Adjusted $R^2$ is adjusted for model complexity through the degrees of freedom. Unadjusted $R^2$ becomes larger as more variables are added to the equation, because each added variable decreases the sum of squared errors. Relying on an unadjusted $R^2$ may lead you to believe that additional variables are useful when they are not.

More on $R^2$
• If $R^2 = 1$, there is a perfect linear relationship. All the variance in Y is explained by X, and all of the data points are on the regression line.
• If $R^2 = 0$, there is no linear relationship between X and Y. (If this is the case, we should not have run a linear model, and we should have realized this from a correlation coefficient and a graph BEFORE running the model!)
• There are several ways to calculate it. From the ANOVA table: SSR/SST (this is an UNADJUSTED $R^2$).
• Adjusted $R^2$ from ANOVA = 1 - MSE/(SST/(n-1)).
• The square root of $R^2$ is the magnitude of r, the correlation coefficient. The sign of r matches the sign of the slope, which identifies positive and negative relationships.
• $R^2$ is useful for making model comparisons.
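
As a rough check on the two formulas, the sketch below computes unadjusted and adjusted $R^2$ from the sums of squares. The helper name `r_squared`, the data, and the fitted line $\hat{Y} = 2.2 + 0.6X$ are assumptions made up for the example.

```python
def r_squared(actual, fitted, k=2):
    """Return (unadjusted, adjusted) R^2; k = number of estimated parameters."""
    n = len(actual)
    y_bar = sum(actual) / n
    sse = sum((y - f) ** 2 for y, f in zip(actual, fitted))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in actual)              # total
    unadjusted = 1 - sse / sst
    # Adjusted version divides each sum of squares by its degrees of freedom
    adjusted = 1 - (sse / (n - k)) / (sst / (n - 1))
    return unadjusted, adjusted


# Hypothetical data; fitted values use the least-squares line for this data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * xi for xi in x]

r2, adj_r2 = r_squared(y, fitted)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

For this small sample the adjustment is substantial: the unadjusted value is 0.6, while the adjusted value drops to about 0.47.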

Data Analysis
• $S_{yx}$, or the Standard Error of the Estimate, is a measure of goodness of fit. It measures the actual values (Y) against the regression line ($\hat{Y}$). A lower $S_{yx}$ indicates a better fit.
  $S_{yx} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - k}}$
• k refers to the number of population parameters being estimated. In this case we have 2: $b_0$ and $b$.
• The standard error can also be calculated by taking the square root of the MSE in the ANOVA table!

Residuals
• Excel provides the residuals in its output.
• The table also includes a column that I added: the squared residuals, which are used to determine the standard error of the estimate ($S_{yx}$).
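
Outside Excel, the residual and squared-residual columns can be reproduced with a short sketch like the one below. The function name and the data are hypothetical, and the fitted line $\hat{Y} = 2.2 + 0.6X$ is the least-squares line for this made-up data.

```python
from math import sqrt


def standard_error_of_estimate(actual, fitted, k=2):
    """Return (residuals, squared residuals, S_yx) with n - k degrees of freedom."""
    residuals = [y - f for y, f in zip(actual, fitted)]
    squared = [e ** 2 for e in residuals]
    s_yx = sqrt(sum(squared) / (len(actual) - k))
    return residuals, squared, s_yx


# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * xi for xi in x]

residuals, squared, s_yx = standard_error_of_estimate(y, fitted)
for xi, e, e2 in zip(x, residuals, squared):
    print(f"X={xi}  residual={e:+.2f}  squared={e2:.2f}")
print(f"S_yx = {s_yx:.4f}")
```

The same value results from taking the square root of the MSE, since MSE is the sum of squared residuals divided by n - k.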

Confidence Intervals
• Prior to relating Y to X, confidence intervals about future values are based on the standard error of Y ($S_y$). Once the regression equation is used, the standard error of the forecast ($S_f$) gives tighter confidence intervals and greater accuracy.
• Confidence Interval for Y: $\bar{Y} \pm z_{\alpha/2} S_y$
• Confidence Interval for $\hat{Y}$: $\hat{Y} \pm z_{\alpha/2} S_f$
• Use $t_{\alpha/2}$ in place of $z_{\alpha/2}$ for small sample sizes!

Making Predictions
• A forecasted point from the regression equation does not, by itself, give us an idea of the accuracy of the prediction. We use the prediction interval to determine accuracy. For example, a prediction of 8.44 appears to be precise, but not if the 95% interval allows the forecast to fall anywhere between 1.75 and 15.15!
• Be careful about making a prediction outside the range of the data. For example, if the X values range between 5 and 15, you should be cautious about using an X value of 20: it is outside the range of the data and possibly outside of the linear relationship.
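
A prediction interval can be sketched as follows. The slides do not show the formula for the standard error of the forecast, so this example assumes the usual simple-regression form $S_f = S_{yx}\sqrt{1 + 1/n + (X - \bar{X})^2 / \sum (X - \bar{X})^2}$. The function name, the data, and the choice of X = 4 are hypothetical, and the critical value is passed in by the caller from a t table.

```python
from math import sqrt


def prediction_interval(xs, ys, x_new, critical_value, k=2):
    """Return (y_hat, s_f, lower, upper) for a new X value."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b * x_bar
    sse = sum((y - (b0 + b * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = sqrt(sse / (n - k))
    # Assumed textbook form: the interval widens as x_new moves away from x_bar
    s_f = s_yx * sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    y_hat = b0 + b * x_new
    return y_hat, s_f, y_hat - critical_value * s_f, y_hat + critical_value * s_f


# Hypothetical data, invented for illustration only
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

T_CRIT = 3.182  # t-table value for alpha/2 = 0.025 with n - 2 = 3 df
y_hat, s_f, low, high = prediction_interval(advertising, sales, 4, T_CRIT)
print(f"forecast = {y_hat:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

With only five observations the interval is very wide relative to the forecast, which is the point of the 8.44 example above.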

Is the Independent Variable Significant?
• $H_0$: $\beta = 0$. The regression coefficient is not significantly different from zero.
• $H_A$: $\beta \neq 0$. The regression coefficient is significantly different from zero.
• Here $\beta$ is the true slope of the regression line.

Is the Independent Variable Significant?
• The Standard Error of the Estimate is $S_{yx}$.
• The Standard Error of the Regression Coefficient is $S_b$.
• We will use Excel's p-value for the independent variable to determine significance. If the p-value is less than .05, we reject the null hypothesis and conclude that the independent variable is related to the dependent variable.
• It is still important to understand how the formulas are developed, which is why the formulas and definitions are provided.
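
To make the link between $S_{yx}$, $S_b$, and the significance test concrete, here is a minimal sketch. It assumes the standard simple-regression formulas $S_b = S_{yx} / \sqrt{\sum (X - \bar{X})^2}$ and $t = b / S_b$, which are not spelled out in the slides. The function name and data are hypothetical.

```python
from math import sqrt


def slope_t_statistic(xs, ys, k=2):
    """Return (b, s_b, t) for testing H0: true slope = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b * x_bar
    sse = sum((y - (b0 + b * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = sqrt(sse / (n - k))   # standard error of the estimate
    s_b = s_yx / sqrt(sxx)       # standard error of the regression coefficient
    return b, s_b, b / s_b


# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, s_b, t = slope_t_statistic(x, y)
T_CRIT = 3.182  # t-table value for alpha/2 = 0.025 with n - 2 = 3 df
print(f"b = {b:.2f}, S_b = {s_b:.4f}, t = {t:.3f}")
print("significant" if abs(t) > T_CRIT else "not significant at the 5% level")
```

Comparing |t| with the table value leads to the same decision as comparing Excel's p-value with .05.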

Analyzing it all at once
• What happens if you have a large sample size, a small $R^2$ (such as .10), and you have determined that the independent variable is significant?
• What happens with a small sample, a large $R^2$, and an independent variable that is NOT significant?
• To test the model, we use the F statistic from the ANOVA table.

ANOVA Analysis
ANOVA        df     SS     MS           F
Regression   k-1    SSR    SSR/(k-1)    MSR/MSE
Error        n-k    SSE    SSE/(n-k)
Total        n-1    SST

F-Test
• $H_0$: The model is NOT valid. There is NOT a statistical relationship between the dependent and independent variables.
• $H_A$: The model is valid. There is a statistical relationship between the dependent and independent variables.
• If F from the ANOVA table is greater than the critical F from the F-table, reject $H_0$ and conclude that the model is valid.
• We can also look at the p-value. If the p-value is less than our chosen $\alpha$ level, we can REJECT $H_0$.
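
The ANOVA quantities and the F statistic can be assembled from the fitted values as in the sketch below. The function name `anova_table`, the data, and the fitted line $\hat{Y} = 2.2 + 0.6X$ are assumptions for illustration. The critical value is read from an F table rather than computed.

```python
def anova_table(actual, fitted, k=2):
    """Return the sums of squares, mean squares, and F for a fitted model."""
    n = len(actual)
    y_bar = sum(actual) / n
    ssr = sum((f - y_bar) ** 2 for f in fitted)              # explained
    sse = sum((y - f) ** 2 for y, f in zip(actual, fitted))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in actual)              # total
    msr = ssr / (k - 1)
    mse = sse / (n - k)
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "MSR": msr, "MSE": mse, "F": msr / mse}


# Hypothetical data; fitted values use the least-squares line for this data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fitted = [2.2 + 0.6 * xi for xi in x]

table = anova_table(y, fitted)
F_CRIT = 10.13  # F-table value at alpha = .05 with (1, 3) degrees of freedom
print(f"F = {table['F']:.2f}")
print("reject H0" if table["F"] > F_CRIT else "fail to reject H0")
```

With one independent variable, F equals the square of the t statistic for the slope, so the two tests agree.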

Durbin-Watson Statistic
• Minitab will provide a DW statistic. It detects first-order autocorrelation, that is, correlation between the error terms at time t and time t-1. The value of DW varies between 0 and 4.
• A value near 2 indicates no autocorrelation.
• A value near 0 indicates positive autocorrelation.
• A value near 4 indicates negative autocorrelation.
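
For readers without Minitab, the statistic is simple to compute by hand. The sketch below uses the standard definition, the sum of squared successive differences of the residuals divided by the sum of squared residuals. The residual series are made up to show the three cases.

```python
def durbin_watson(residuals):
    """DW statistic for residuals listed in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series, invented for illustration only
mixed = [-0.8, 0.6, 1.0, -0.6, -0.2]   # no clear pattern
runs = [1, 1, 1, -1, -1, -1]           # long runs of the same sign
alternating = [1, -1, 1, -1]           # sign flips every period

print(f"mixed:       {durbin_watson(mixed):.2f}")        # close to 2
print(f"runs:        {durbin_watson(runs):.2f}")         # well below 2
print(f"alternating: {durbin_watson(alternating):.2f}")  # well above 2
```

The residuals must be in time order for the statistic to mean anything.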

Data Transformations
• Curvilinear relationships: fit the data with a curved line.
• Transform the X variable (independent) so that the resulting relationship with Y is linear.
• Log of X, square root of X, X squared, and the reciprocal of X (1/X) are common. The hope is that one of these transformations will result in a linear relationship.
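
One informal way to screen the candidate transformations is to compare the correlation of Y with each transformed X, as sketched below. The data and names are hypothetical, and picking the largest absolute correlation is just one possible heuristic. A scatter plot of each transformed variable is still worth checking.

```python
from math import log, sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# The common transformations listed above, plus untransformed X for reference
TRANSFORMS = {
    "X": lambda v: v,
    "log X": log,
    "sqrt X": sqrt,
    "X squared": lambda v: v ** 2,
    "1/X": lambda v: 1 / v,
}

# Hypothetical curvilinear data (all X values positive), for illustration only
x = [1, 4, 9, 16, 25]
y = [1.1, 1.9, 3.2, 3.9, 5.0]

results = {name: correlation([f(v) for v in x], y)
           for name, f in TRANSFORMS.items()}
best = max(results, key=lambda name: abs(results[name]))

for name, r in results.items():
    print(f"{name:10s} r = {r:+.4f}")
print("most linear:", best)
```

Log and reciprocal transformations require positive X values, which this toy data satisfies.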

Ok, 18 pages of notes, so where do we start?
• Determine the dependent and independent variables.
• Develop scatter plots and determine if linear or nonlinear relationships exist. Calculate a correlation coefficient. Transform nonlinear data.
• Run an autocorrelation and interpret the results. It will be helpful to see if any patterns exist.
• Compute the regression equation. Interpret.
• Understand the difference between the standard error of the estimate, the standard error of the forecast (regression), and the standard error of the regression coefficient.
• Evaluate and interpret the adjusted $R^2$.
• Test the independent variables for significance.
• Evaluate the ANOVA and test the model for significance (F and DW).
• Plot the error terms.
• Calculate a prediction and prediction interval.
• State final conclusions about the model (if running different models, compare using MSE, MAD, MAPE, MPE).
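
The four comparison measures in the last step are not defined in these notes. The sketch below assumes the common forecasting definitions, each averaged over n observations with error = actual - forecast. Note that this MSE divides by n, unlike the ANOVA MSE, which divides by n - k. The function name and data are hypothetical.

```python
def error_measures(actual, forecast):
    """Return MAD, MSE, MAPE, and MPE (assumed definitions, averaged over n)."""
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    return {
        "MAD": sum(abs(e) for e in errors) / n,           # mean absolute deviation
        "MSE": sum(e ** 2 for e in errors) / n,           # mean squared error
        "MAPE": sum(abs(e) / abs(a)                       # mean absolute % error
                    for e, a in zip(errors, actual)) / n,
        "MPE": sum(e / a for e, a in zip(errors, actual)) / n,  # bias, as a %
    }


# Hypothetical actual values and forecasts, invented for illustration only
actual = [2, 4, 5, 4, 5]
forecast = [2.8, 3.4, 4.0, 4.6, 5.2]

measures = error_measures(actual, forecast)
for name, value in measures.items():
    print(f"{name:5s} {value:+.4f}")
```

MAPE and MPE are undefined when any actual value is zero, so they suit series such as sales that stay away from zero.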
