Determining Factors of Market Success. DMD #4. David Kopcso and Richard Cleary, Babson College, F. W. Olin Graduate School of Business.
Learning Objectives • Determine the strength of (linear) relationships • Describe a regression model with one or more explanatory variables • Interpret regression coefficients • Evaluate the model in a business context
Modeling Relationships • If we believe that two (or more) random variables are related, then we would like to model and exploit the relationship. • Hopefully, this helps to: • Make more accurate predictions • Show the direction and strength of relationship • Reduce the amount of uncertainty.
Approach • Investigate variables individually and jointly. • Numerically: standard statistics (individually), correlation (jointly) • Graphically: histogram and box plot (individually), scatter plot (jointly)
Scatter Plots • Positive Linear Relationship • Nonlinear Relationship • Negative Linear Relationship • No Relationship
Correlation Coefficient • Unit free: ranges between -1 and 1 • The closer to –1, the stronger the negative linear relationship • The closer to 1, the stronger the positive linear relationship • The closer to 0, the weaker any linear relationship • Note: Correlation does not deal with cause and effect; it only measures strength of linear dependence
Correlation Coefficient r • [Figure: four scatter plots of Y against X, illustrating r = +1, r = -1, r = +0.9, and r = 0]
Correlation Measures Only Linear Dependence!
X    Y = exp(X)
1    3
2    7
3    20
4    55
5    148
6    403
7    1097
8    2981
9    8103
10   22026
11   59874
12   162755
13   442413
14   1202604
15   3269017
16   8886111
17   24154953
18   65659969
19   178482301
20   485165195
• X and Y are perfectly related. However, the correlation of X and Y (where Y = exp(X)) is only 0.539.
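The figure quoted on this slide can be checked with a few lines of code. The sketch below computes the Pearson correlation by hand for X = 1, ..., 20 and Y = exp(X), and should reproduce roughly 0.539. The helper name `pearson_r` and the comparison against log(Y) are illustrative additions, not part of the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(1, 21))
ys = [math.exp(x) for x in xs]

# Correlation of X with Y = exp(X): a perfect but nonlinear relationship.
r_raw = pearson_r(xs, ys)

# Illustrative addition: after taking logs the relationship is exactly linear.
r_log = pearson_r(xs, [math.log(y) for y in ys])

print(f"r(X, exp(X))      = {r_raw:.3f}")
print(f"r(X, log(exp(X))) = {r_log:.3f}")
```

The low value for the raw data reflects the curvature of exp(X), not a weak relationship, which is why a scatter plot should accompany any correlation.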
Linear Regression Model • Assume that the relationship between the variables is linear: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ • $Y_i$: Dependent (Response) Variable • $X_i$: Independent (Explanatory) Variable • $\beta_0$: Y-Intercept • $\beta_1$: Slope • $\varepsilon_i$: Error
Model • Do you think knowing the size of a house helps “explain” the variation in house prices? • Population Model: Price = $\beta_0$ + $\beta_1$ Sq. Footage + $\varepsilon$ • Estimated Equation: Est. Price = b0 + b1 Sq. Footage, or equivalently $\widehat{\text{Price}}$ = b0 + b1 Sq. Footage
Linear Regression Model • Observed value: $Y_i = b_0 + b_1 X_i + e_i$, where $e_i$ = Residual • Fitted value: $\hat{Y}_i = b_0 + b_1 X_i$ • [Figure: scatter plot of Y against X showing the fitted line, a residual $e_i$, and an unsampled observation]
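The slides do not say how b0 and b1 are obtained. A minimal sketch, assuming ordinary least squares (the usual default in spreadsheet and statistics packages), is shown below. The function name `fit_line` and the six houses are hypothetical and are not the data behind the slides.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data for illustration only (not from the slides).
sq_footage = [1200, 1500, 1800, 2100, 2400, 3000]
price = [330_000, 370_000, 440_000, 470_000, 540_000, 630_000]

b0, b1 = fit_line(sq_footage, price)

# Residual e_i = observed Y_i minus fitted value b0 + b1 * X_i.
fitted = [b0 + b1 * x for x in sq_footage]
residuals = [y - y_hat for y, y_hat in zip(price, fitted)]

print(f"Est. Price = {b0:,.0f} + {b1:,.1f} * Sq. Footage")
print("Residuals:", [round(e) for e in residuals])
```

With an intercept in the model, least-squares residuals sum to zero, so positive and negative misses balance out across the sample.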
Estimated Model Est. Price = b0 + b1 Sq. Footage Est. Price = 117,663 + 173 Sq. Footage
Model Interpretation • b1: The average marginal increase/decrease in Price for a unit increase in Sq. Footage. • Price will increase by $173 on average for each additional square foot. • b0: The average Price when Square Footage equals zero. • Average value of Price is $117,663 when there is no Square Footage. • Does this statement make sense? Does this result have managerial significance?
Hypothesis Test: No Linear Relationship • Tests whether there is a (linear) relationship between X & Y • Hypotheses • $H_0: \beta_1 = 0$ (No Linear Relationship) • $H_1: \beta_1 \neq 0$ (Linear Relationship) • Compare p-value to $\alpha$ • Interpretation • If p-value is less than $\alpha$, we have enough evidence to conclude that Square Footage is linearly related to Price and we can interpret the slope.
Quality of Model • We would like to know how well our model fits the facts (data). The better the fit, the more we believe in the model’s accuracy. • We have two measures of fit: R-squared and S (aka SEE).
The Famed $R^2$ • Coefficient of determination: $R^2 = \frac{\text{Explained Variance}}{\text{Total Variance}}$ • The closer $R^2$ is to 1, the better the “fit”. • $R^2$ is the percentage of variation of the Y variable that is explained by (accounted for by or reduced by) knowing the X variable (i.e., by using the regression to predict the response rather than the average response value). • Square Footage explains 92% of the variation in Price.
Accuracy: Standard Error of Estimate • Standard error of the estimate: S (or SEE) • The smaller the S, the better the “fit” • The units of S are the same as the units of the Y variable. • When using our regression model for predicting home prices, we would be off on average plus/minus $46,631.
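Both measures of fit can be computed directly from the residuals. The sketch below assumes the common textbook definitions, R-squared = 1 - SSE/SST and S = sqrt(SSE / (n - 2)) for a one-variable model; the slides do not state the formulas, and the function name `fit_quality` and the data are hypothetical.

```python
import math

def fit_quality(xs, ys):
    """Fit y = b0 + b1 * x by least squares; return (r_squared, s)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # SSE: variation left unexplained by the regression line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # SST: total variation of Y around its own average.
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - sse / sst
    # n - 2 degrees of freedom: two estimated coefficients (b0 and b1).
    s = math.sqrt(sse / (n - 2))
    return r_squared, s

# Hypothetical data for illustration only (not from the slides).
sq_footage = [1200, 1500, 1800, 2100, 2400, 3000]
price = [330_000, 370_000, 440_000, 470_000, 540_000, 630_000]

r_squared, s = fit_quality(sq_footage, price)
print(f"R-squared = {r_squared:.3f}, S = ${s:,.0f}")
```

Note that S comes out in dollars, the units of Price, while R-squared is unit free.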
In-Class Activity: • Investigate models of Salary from the file Salary_handout.xls using only one variable as the explanatory (independent) variable. • Interpret the following in the context of the model: • Slope and intercept • Strength of linear relationship (R) • The usefulness of the slope (p-value) • Graph of relationship • Evaluation of the model, i.e., $R^2$ and S. • Use the model to predict Salary for a fictitious employee.
Multiple Linear Regression (MLR) • We assume that the relationship between variables is linear: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$
Model Building • Before running any regressions or even any data analysis, determine which of your variables you believe are good predictors. • Generally, you want at least 10 observations per variable selected if possible.
Variable Investigation • Next, investigate the relationship between the response or dependent (Y) variable and each of the explanatory or independent (X) variables. Use the correlation matrix and scatter plots. • To avoid ‘problems’, also make sure that the correlation among the explanatory (independent) variables is not too high. As a rule of thumb, anything above 0.90 in absolute value can cause trouble.
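The 0.90 rule of thumb can be applied mechanically to every pair of explanatory variables. The following sketch uses hypothetical data and hypothetical helper names (`pearson_r`, `high_correlations`); it flags pairs whose correlation exceeds the threshold in absolute value.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def high_correlations(columns, threshold=0.90):
    """Return (name_a, name_b, r) for each pair with |r| above threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson_r(xs, ys)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical explanatory variables for six houses (not from the slides).
explanatory = {
    "SqFt": [1200, 1500, 1800, 2100, 2400, 3000],
    "Bedrooms": [2, 3, 3, 4, 4, 5],
    "Bathrooms": [1, 2, 1, 3, 2, 2],
}

for name_a, name_b, r in high_correlations(explanatory):
    print(f"{name_a} vs {name_b}: r = {r:.2f} (above the 0.90 rule of thumb)")
```

A flagged pair is a prompt to reconsider including both variables, not an automatic reason to drop one.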
Multiple Regression Output • Est Price = 44,392 + 111*SqFt + 85,345*Bedrooms + 572*Bathrooms
Slopes in MLR • Be careful when interpreting the slopes in a multiple linear regression as it is necessary to hold all other variables constant. • Price will increase on average by $111 for each additional square foot when holding all other explanatory variables constant. If the p-value is not less than alpha, then you cannot interpret the slope.
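To make "holding all other variables constant" concrete, the sketch below evaluates the estimated equation from the output slide twice, changing one input at a time. The coefficients come from the slides; the function name `est_price` and the baseline house are illustrative choices.

```python
# Coefficients copied from the multiple regression output in the slides.
INTERCEPT = 44_392
SLOPE_SQFT = 111
SLOPE_BEDROOMS = 85_345
SLOPE_BATHROOMS = 572

def est_price(sqft, bedrooms, bathrooms):
    """Estimated price from the fitted multiple regression equation."""
    return (INTERCEPT
            + SLOPE_SQFT * sqft
            + SLOPE_BEDROOMS * bedrooms
            + SLOPE_BATHROOMS * bathrooms)

# Illustrative baseline house.
base = est_price(2000, 4, 2)

# Change exactly one input and keep the other two fixed.
extra_sqft = est_price(2001, 4, 2) - base      # one more square foot
extra_bedroom = est_price(2000, 5, 2) - base   # one more bedroom

print(f"One more square foot: +${extra_sqft:,}")
print(f"One more bedroom:     +${extra_bedroom:,}")
```

Each difference equals the corresponding slope only because the other two inputs were left unchanged.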
p-values in MLR • Each explanatory (independent) variable has its own p-value. • When looked at individually, is the variable’s slope statistically different from zero? • If yes (p-value < $\alpha$), then that variable is a good predictor within the context of the model and the slope can be interpreted. • If no (p-value > $\alpha$), then that variable is not a good predictor within the context of the model and the slope cannot be interpreted. Some variables are confirmatory and may remain in the model even though their p-value > $\alpha$.
Coefficient of Determination $R^2$ • $R^2$ is still the percentage of variation of the Y variable explained by knowing all the X variables. The focus is on explaining the variation in Price, not on explaining the data.
Output • Knowing the square footage, the number of bedrooms, and the number of bathrooms of a house explains 97% of the variation in house prices.
Standard Error of Estimate in MLR • The interpretation of S (aka SEE) is the same in multiple regression as it is in simple. • Thus, we expect to be off on average plus or minus $28,765 when predicting house prices using the square footage, the number of bedrooms and the number of bathrooms in the house.
Variation Reduction • How do we know if the S (SEE) is low or high? Is it small enough to make the predictions from the regressions useful? Compare it to the standard deviation of the response (dependent) variable. • S (SEE): $28,765 vs. St Dev(Price): $161,666
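One way to read the comparison, offered here as a worked ratio rather than a formula from the slides, is to divide S by the standard deviation of Price:

```latex
\frac{S}{\text{St Dev(Price)}} = \frac{28{,}765}{161{,}666} \approx 0.178
```

On this reading, the typical miss when predicting with the regression is roughly 18% of the typical miss when predicting every house at the average price.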
Predicting Using Regression • Recall we have: Est Price = 44,392 + 111*SqFt + 85,345*Bedrooms + 572*Bathrooms • Assume this is a good equation. • Use it to predict the expected selling price of a home with 2000 sq. ft. of living space, 4 bedrooms, and 2 baths.
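The prediction asked for on this slide is plain arithmetic on the estimated equation. A short sketch follows; the variable names are illustrative, while the coefficients and house characteristics are the ones given on the slide.

```python
# Coefficients and intercept from the estimated equation on the slide.
intercept = 44_392
coefficients = {"SqFt": 111, "Bedrooms": 85_345, "Bathrooms": 572}

# The house described on the slide.
house = {"SqFt": 2000, "Bedrooms": 4, "Bathrooms": 2}

# Contribution of each variable: slope times the house's value.
contributions = {name: coefficients[name] * house[name] for name in coefficients}
est_price = intercept + sum(contributions.values())

for name, amount in contributions.items():
    print(f"{name:<10} {amount:>10,}")
print(f"{'Intercept':<10} {intercept:>10,}")
print(f"Est Price = ${est_price:,}")  # prints "Est Price = $608,916"
```

Listing the contributions separately shows how much of the estimate each explanatory variable accounts for.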
How Confident Should You Be about Your Estimate? • About two-thirds (68%) of the data should fall within +/- SEE of the value determined by the regression equation. Similarly, about 95% should fall within +/- 2*SEE. • Therefore, a 95% interval for the prediction of a specific house at 533 Main St., which has 2000 sq. ft., 4 bedrooms, and 2 baths, can be approximated as Est Price +/- 2*SEE. • That is, we are 95% confident that this specific house’s price is between these two values. • Since this is about a specific house, the interval is called a prediction interval, not a confidence interval.
How Confident Should You Be about the Average Price of Such a House? • A 95% confidence interval for the average price of a 2000 sq. ft., 4 bed, 2 bath house can be approximated as: • Est Price +/- 2*SEE/SQRT(n), where n is the number of observations used in the regression. • In words, based on our regression, we are 95% confident that the average 2000 sq. ft., 4 bed, 2 bath house price is between these two values. • Since this is about an average of all such houses, the interval is called a confidence interval, not a prediction interval.
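The two rule-of-thumb intervals can be compared side by side. In the sketch below the estimated equation and SEE are taken from the slides, but the sample size is an assumption: the slides never state n, so `N_OBSERVATIONS = 100` is a placeholder chosen only to make the arithmetic concrete.

```python
import math

SEE = 28_765            # standard error of estimate reported in the slides
N_OBSERVATIONS = 100    # ASSUMPTION: n is not given in the slides

# Point estimate for a 2000 sq. ft., 4 bedroom, 2 bath house.
estimate = 44_392 + 111 * 2000 + 85_345 * 4 + 572 * 2

# Prediction interval for one specific house: Est Price +/- 2*SEE.
pi_half_width = 2 * SEE
prediction_interval = (estimate - pi_half_width, estimate + pi_half_width)

# Confidence interval for the average such house: Est Price +/- 2*SEE/sqrt(n).
ci_half_width = 2 * SEE / math.sqrt(N_OBSERVATIONS)
confidence_interval = (estimate - ci_half_width, estimate + ci_half_width)

print(f"Estimate:            ${estimate:,}")
print(f"Prediction interval: ${prediction_interval[0]:,.0f} to ${prediction_interval[1]:,.0f}")
print(f"Confidence interval: ${confidence_interval[0]:,.0f} to ${confidence_interval[1]:,.0f}")
```

Whatever the true n, the confidence interval is narrower by a factor of sqrt(n), because averaging over many such houses cancels much of the house-to-house variation.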
In-Class Activity: • Investigate models of Salary from the file Salary_handout.xls using any set of variables you wish as the explanatory (independent) variables. • Interpret the following in the context of the model: • Slopes and intercept. • Strength of linear relationship (R) • The usefulness of the slope (p-value). • Graphs of relationship. • Evaluation of the model, i.e., $R^2$ and SEE. • Use the model to predict Salary for a fictitious employee and build Prediction and Confidence intervals for this prediction.