1. Spreadsheet Modeling & Decision Analysis A Practical Introduction to Management Science
2. Regression Analysis
3. Introduction to Regression Analysis (RA) Regression Analysis is used to estimate a function f( ) that describes the relationship between a continuous dependent variable and one or more independent variables.
Y = f(X1, X2, X3,…, Xn) + e
Note:
f( ) describes systematic variation in the relationship.
e represents the unsystematic variation (or random error) in the relationship.
4. In other words, the observations we are interested in can be separated into two parts:
Y = f(X1, X2, X3,…, Xn) + e
Observations = Model + Error
Observations = Signal + Noise
Ideally, the noise should be very small compared to the model.
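The signal-plus-noise decomposition can be sketched directly in Python (a toy simulation with a made-up f; none of these numbers are from the slides):

```python
import random

random.seed(1)

def f(x):
    # Systematic part of the relationship: the model, or "signal"
    return 10.0 + 2.0 * x

xs = list(range(1, 11))
# Each observation is the model value plus a small random error term e
ys = [f(x) + random.gauss(0, 0.5) for x in xs]

# The unsystematic part (noise) is whatever the model does not explain
noise = [y - f(x) for x, y in zip(xs, ys)]
```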
10. An Example Consider the relationship between advertising (X1) and sales (Y) for a company.
There probably is a relationship...
...as advertising increases, sales should increase.
But how would we measure and quantify this relationship?
11. A Scatter Plot of the Data
12. The Nature of a Statistical Relationship
13. A Simple Linear Regression Model The scatter plot shows a linear relation between advertising and sales.
14. Determining the Best Fit Numerical values must be assigned to b0 and b1
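The least squares estimates have a well-known closed form, sketched here with made-up points (not the advertising data):

```python
# Closed-form least squares estimates for Y = b0 + b1*X
def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar  # the fitted line passes through the means
    return b0, b1

# Points lying exactly on y = 1 + 2x are recovered exactly:
b0, b1 = least_squares([1, 2, 3], [3, 5, 7])
print(b0, b1)  # 1.0 2.0
```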
16. Using Solver...
17. The Estimated Regression Function The estimated regression function is:
18. Using the Regression Tool Excel also has a built-in tool for performing regression that:
is easier to use
provides a lot more information about the problem
19. The TREND() Function TREND(Y-range, X-range, X-value for prediction)
where:
Y-range is the spreadsheet range containing the dependent Y variable,
X-range is the spreadsheet range containing the independent X variable(s),
X-value for prediction is a cell (or cells) containing the values for the independent X variable(s) for which we want an estimated value of Y.
Note: The TREND( ) function is dynamically updated whenever any inputs to the function change. However, it does not provide the statistical information provided by the regression tool. It is best to use these two approaches in conjunction with one another.
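A rough Python analogue of TREND for a single X variable, as a sketch (Excel's TREND also supports multiple X columns):

```python
# Fit a straight line to the known data, then evaluate it at the new X
def trend(y_range, x_range, new_x):
    n = len(x_range)
    x_bar = sum(x_range) / n
    y_bar = sum(y_range) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(x_range, y_range))
          / sum((x - x_bar) ** 2 for x in x_range))
    b0 = y_bar - b1 * x_bar
    return b0 + b1 * new_x

print(trend([3, 5, 7], [1, 2, 3], 4))  # 9.0, since the points lie on y = 1 + 2x
```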
20. Evaluating the “Fit”
21. The R2 Statistic The R2 statistic indicates how well an estimated regression function fits the data.
0 ≤ R2 ≤ 1
It measures the proportion of the total variation in Y around its mean that is accounted for by the estimated regression equation.
To understand this better, consider the following graph...
22. Error Decomposition
23. Partition of the Total Sum of Squares
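The partition SST = SSR + SSE, and R2 computed from it, can be sketched on made-up data (not the advertising example):

```python
def r_squared(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained (model)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained (error)
    assert abs(sst - (ssr + sse)) < 1e-9                  # the partition holds
    return 1 - sse / sst                                  # equivalently SSR/SST

print(round(r_squared([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]), 3))  # 0.996
```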
29. An Example of Inappropriate Interpretation A study shows that, in elementary schools, spelling ability is stronger among students with larger feet.
Could we conclude that foot size influences spelling ability?
Or is there another factor (such as the students' age) that influences both foot size and spelling ability?
30. Making Predictions
31. The Standard Error
33. An Approximate Prediction Interval An approximate 95% prediction interval for a new value of Y when X1=X1h is given by Yh ± 2·Se, where Se is the standard error of the estimate.
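This approximate interval can be sketched as follows (made-up data, not the advertising example):

```python
import math

# Approximate 95% prediction interval: y_hat ± 2*Se,
# where Se = sqrt(SSE / (n - 2)) is the standard error of the estimate
def approx_prediction_interval(xs, ys, xh):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * xh
    return y_hat - 2 * se, y_hat + 2 * se

lo, hi = approx_prediction_interval([1, 2, 3, 4, 5], [2.2, 3.9, 6.1, 8.0, 9.8], 6)
```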
34. An Exact Prediction Interval A 100(1 - α)% prediction interval for a new value of Y when X1=X1h is given by Yh ± t(α/2, n-2) · Sp, where Sp = Se · sqrt(1 + 1/n + (X1h - X̄1)² / Σ(X1i - X̄1)²) is the standard error of the prediction.
35. Example If $65,000 is spent on advertising:
95% lower prediction interval = 397.092 - 2.306*21.489 = 347.556
95% upper prediction interval = 397.092 + 2.306*21.489 = 446.666
If we spend $65,000 on advertising we are 95% confident actual sales will be between $347,556 and $446,666.
This interval is only about $20,000 wider than the approximate one calculated earlier but was much more difficult to create.
The greater accuracy is not always worth the trouble.
36. Comparison of Prediction Interval Techniques
37. Confidence Intervals for the Mean A 100(1 - α)% confidence interval for the true mean value of Y when X1=X1h is given by Yh ± t(α/2, n-2) · Se · sqrt(1/n + (X1h - X̄1)² / Σ(X1i - X̄1)²).
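Both the prediction interval for a new Y value and the narrower confidence interval for the mean of Y can be sketched together (made-up data; the t critical value is supplied by hand, since Python's standard library has no t-quantile function):

```python
import math

def intervals(xs, ys, xh, t):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    se = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    y_hat = b0 + b1 * xh
    lever = 1 / n + (xh - x_bar) ** 2 / sxx
    h_pred = t * se * math.sqrt(1 + lever)  # half-width: a NEW value of Y
    h_mean = t * se * math.sqrt(lever)      # half-width: the MEAN value of Y
    return (y_hat - h_pred, y_hat + h_pred), (y_hat - h_mean, y_hat + h_mean)

xs = list(range(10))
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.2, 13.0, 14.9, 17.2, 18.9]
pi, ci = intervals(xs, ys, 5, 2.306)  # t = 2.306 for 8 d.f. at 95%
# The CI for the mean is always nested inside the prediction interval.
```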
38. A Note About Extrapolation Predictions made using an estimated regression function may have little or no validity for values of the independent variables that are substantially different from those represented in the sample.
39. What Does “Regression” Mean?
40. What Does “Regression” Mean? Draw “best-fit” line free hand
Find mother’s height = 60”, find average daughter’s height
Repeat for mother’s height = 62”, 64”… 70”; draw “best-fit” line for these points
Draw line daughter’s height = mother’s height
For a given mother’s height, daughter’s height tends to be between mother’s height and mean height: “regression toward the mean”
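The steps above can be sketched numerically: with a fitted slope below 1, the prediction for a tall mother falls between her height and the overall mean (made-up heights, not Galton's actual data):

```python
# Made-up mother/daughter heights (inches); daughters' heights are less
# extreme than their mothers', which is what produces a slope below 1.
mothers   = [58, 60, 62, 64, 66, 68, 70]
daughters = [60, 60, 62, 63, 65, 66, 67]

n = len(mothers)
m_bar = sum(mothers) / n
d_bar = sum(daughters) / n
b1 = (sum((m - m_bar) * (d - d_bar) for m, d in zip(mothers, daughters))
      / sum((m - m_bar) ** 2 for m in mothers))
b0 = d_bar - b1 * m_bar

pred = b0 + b1 * 70  # predicted daughter's height for a 70-inch mother
# "Regression toward the mean": the prediction lies between the overall
# mean daughter height and the mother's height.
assert d_bar < pred < 70 and 0 < b1 < 1
```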
41. What Does “Regression” Mean?
45. Multiple Regression Analysis Most regression problems involve more than one independent variable.
46. Example Regression Surface for Two Independent Variables
47. Multiple Regression Example:Real Estate Appraisal A real estate appraiser wants to develop a model to help predict the fair market values of residential properties.
Three independent variables will be used to estimate the selling price of a house:
total square footage
number of bedrooms
size of the garage
48. Selecting the Model We want to identify the simplest model that adequately accounts for the systematic variation in the Y variable.
Arbitrarily using all the independent variables may result in overfitting.
A sample reflects characteristics:
representative of the population
specific to the sample
We want to avoid fitting sample specific characteristics -- or overfitting the data.
49. Models with One Independent Variable With simplicity in mind, suppose we fit three simple linear regression functions:
50. Important Software Note When using more than one independent variable, all variables for the X-range must be in one contiguous block of cells (that is, in adjacent columns).
51. Models with Two Independent Variables Now suppose we fit the following models with two independent variables:
52. The Adjusted R2 Statistic As additional independent variables are added to a model:
The R2 statistic can only increase.
The Adjusted-R2 statistic can increase or decrease.
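The adjustment is a simple penalty for the number of predictors, which can be sketched directly:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where n is the sample size and k the number of independent variables.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Using the sums of squares from the programmer salary example later in
# the deck (SSR = 500.33, SST = 599.79, n = 20, k = 2):
r2 = 500.33 / 599.79
print(round(adjusted_r2(r2, 20, 2), 3))  # 0.815
```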
53. A Comment On Multicollinearity It should not be surprising that adding X3 (# of bedrooms) to the model with X1 (total square footage) did not significantly improve the model.
Both variables represent the same (or very similar) things -- the size of the house.
These variables are highly correlated (or collinear).
Multicollinearity should be avoided.
55. Model with Three Independent Variables Now suppose we fit the following model with three independent variables:
56. Making Predictions Let’s estimate the avg selling price of a house with 2,100 square feet and a 2-car garage:
57. Binary Independent Variables Other non-quantitative factors can be included as independent variables in the analysis using binary variables.
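Coding such a factor is a matter of mapping each category to 0 or 1; here is a minimal sketch ("pool" is a hypothetical attribute, not one of the slides' variables):

```python
# Encode a yes/no factor as a 0/1 dummy column alongside a numeric one
houses = [
    {"sqft": 2100, "pool": "yes"},
    {"sqft": 1800, "pool": "no"},
    {"sqft": 2400, "pool": "yes"},
]
rows = [[h["sqft"], 1 if h["pool"] == "yes" else 0] for h in houses]
print(rows)  # [[2100, 1], [1800, 0], [2400, 1]]
```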
58. Polynomial Regression Sometimes the relationship between a dependent and independent variable is not linear.
59. The Regression Model An appropriate regression function in this case might be,
60. Implementing the Model
61. Graph of Estimated Quadratic Regression Function
62. Fitting a Third Order Polynomial Model We could also fit a third order polynomial model,
63. Graph of Estimated Third Order Polynomial Regression Function
64. Overfitting When fitting polynomial models, care must be taken to avoid overfitting.
The adjusted-R2 statistic can be used for this purpose here also.
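The danger can be demonstrated directly: adding polynomial terms can only raise plain R2, even when the extra terms add no real explanatory power (a sketch assuming numpy is available; the data are made up):

```python
import numpy as np

# Made-up, gently curved data
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1.2, 1.9, 3.1, 4.5, 6.2, 8.1, 10.3, 12.8])

def r2_for_degree(d):
    coeffs = np.polyfit(x, y, d)
    resid = y - np.polyval(coeffs, x)
    return 1 - float(np.sum(resid ** 2)) / float(np.sum((y - y.mean()) ** 2))

r2_1, r2_2, r2_3 = (r2_for_degree(d) for d in (1, 2, 3))
assert r2_1 < r2_2          # the quadratic term genuinely helps here
assert r2_2 <= r2_3 + 1e-9  # the cubic term cannot lower R^2
```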
65. Example: Programmer Salary Survey A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm’s programmer aptitude test. The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers is shown on the next slide.
66. Example: Programmer Salary Survey
Exper.  Score  Salary      Exper.  Score  Salary
   4      78    24.0          9      88    38.0
   7     100    43.0          2      73    26.6
   1      86    23.7         10      75    36.2
   5      82    34.3          5      81    31.6
   8      86    35.8          6      74    29.0
  10      84    38.0          8      87    34.0
   0      75    22.2          4      79    30.1
   1      80    23.1          6      94    33.9
   6      83    30.0          3      70    28.2
   6      91    33.0          3      89    30.0
67. Example: Programmer Salary Survey Multiple Regression Model
Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:
y = β0 + β1x1 + β2x2 + ε
where
y = annual salary ($000)
x1 = years of experience
x2 = score on programmer aptitude test
68. Example: Programmer Salary Survey Multiple Regression Equation
Using the assumption E(ε) = 0, we obtain
E(y) = β0 + β1x1 + β2x2
Estimated Regression Equation
b0, b1, b2 are the least squares estimates of β0, β1, β2.
Thus
ŷ = b0 + b1x1 + b2x2
69. Example: Programmer Salary Survey Solving for the Estimates of β0, β1, β2
70. Example: Programmer Salary Survey Data Analysis Output
The regression is
Salary = 3.17 + 1.40 Exper + 0.251 Score
Predictor Coef Stdev t-ratio p
Constant 3.174 6.156 .52 .613
Exper 1.4039 .1986 7.07 .000
Score .25089 .07735 3.24 .005
s = 2.419 R-sq = 83.4% R-sq(adj) = 81.5%
71. Example: Programmer Salary Survey Computer Output (continued)
Analysis of Variance
SOURCE DF SS MS F P
Regression 2 500.33 250.16 42.76 0.000
Error 17 99.46 5.85
Total 19 599.79
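As a cross-check, the reported coefficients can be reproduced by re-running least squares on the 20 observations from the table (a sketch assuming numpy is available):

```python
import numpy as np

# The 20 observations from slide 66 (left column first, then right)
exper  = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score  = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
          88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
          38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30]

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones(len(exper)), exper, score])
b, *_ = np.linalg.lstsq(X, np.array(salary), rcond=None)
print([round(v, 3) for v in b])  # roughly [3.174, 1.404, 0.251]
```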