Welcome to BUAD 310 Instructor: Kam Hamidieh Lecture 23, Monday April 21, 2014
Agenda & Announcement • Today: • Continue with Multiple Regression • Talk about the case study due on Wednesday, April 30th • Pass back the exams & talk about the exam (time permitting) • Homework 7 will be posted soon; it is due Friday, May 2, at 5 PM. • Reading: • Read all of Chapter 23 carefully, but you can skip the path diagram material. • Read all of Chapter 24, but you can lightly read the topic of VIF (Variance Inflation Factor).
Some Important Dates • Case study due on Wednesday, April 30 • Homework 7 due on Friday, May 2, 2014 • Final exam on Thursday, May 8th, 11 AM – 1 PM, in room THH 101. See http://web-app.usc.edu/maps/ (I recommend you scope out the location before the exam.)
Some Fun Stuff http://blogs.wsj.com/atwork/2014/04/15/best-jobs-of-2014-congratulations-mathematicians/?mod=e2fb (Jake S. and William C.) http://fivethirtyeight.com/features/the-toolsiest-player-of-them-all/ (Joshua C.)
Multiple Regression Model • The observed response Y is linearly related to k explanatory variables X1, X2, …, Xk by the equation: Y = B0 + B1X1 + B2X2 + … + BkXk + ε • The values of B0, B1, …, Bk are estimated via the least squares method: pick b0, b1, …, bk so that the quantity below is as small as possible: Σi [ yi − (b0 + b1xi1 + b2xi2 + … + bkxik) ]²
Model Residuals • Residuals are defined just like in the simple linear regression case: residual = observed − fitted. • The official formula: ei = yi − ŷi = yi − (b0 + b1xi1 + b2xi2 + … + bkxik)
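Below is a minimal numpy sketch (not from the lecture; all data are simulated) of picking b0, …, bk by least squares and then computing the residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3                                   # 100 observations, 3 predictors
X = rng.normal(size=(n, k))                     # simulated predictor values
beta = np.array([2.0, 1.5, -0.5, 0.3])          # "true" B0, B1, B2, B3 for the simulation
y = beta[0] + X @ beta[1:] + rng.normal(scale=0.5, size=n)

X1 = np.column_stack([np.ones(n), X])           # prepend a column of 1's for the intercept
b, *_ = np.linalg.lstsq(X1, y, rcond=None)      # b0, ..., bk minimizing the sum of squares
fitted = X1 @ b                                 # y-hat
residuals = y - fitted                          # residual = observed - fitted
print(b.round(2), residuals[:3].round(2))
```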
Previous Example b0 ≈ 23.73, b1 ≈ -1.59, b2 ≈ -0.018, b3 ≈ 0.0004, b4 ≈ -0.00075 n − k − 1 = 372 − 4 − 1 = 367 (k = # of predictors) MSE = 1.55, Se = 1.24 (estimate of σɛ), SSE = 567.80 APR = 23.73 − 1.59(LTV) − 0.018(CreditScore) + 0.0004(StatedIncome) − 0.00075(HomeValue) (StatedIncome and HomeValue entered in thousands of dollars)
Solution to In Class Exercise 1 from Lecture 21 Part (1) (1) Response: Y (APR) = 10.07; Predictors: LTV = 0.942, CreditScore = 640, StatedIncome = 100000, HomeValue = 305000 (2) Y10 = 12.87, X2,4 = 450000, X11,3 = 70000 Part (2) When stated income goes up by $1000, while holding all other predictors fixed, on average APR goes up by 0.0004%. APR = 23.73 − 1.59(1/2) − 0.018(600) + 0.0004(10) − 0.00075(200) ≈ 12%
Solution to In Class Exercise 2 from Lecture 21 (1) Observed Y = 10.07 Fitted Y (APR) = 23.73 − 1.59(0.942) − 0.018(640) + 0.0004(100) − 0.00075(305) = 10.52 Residual = 10.07 − 10.52 = −0.45 (2) Same as APR's units, so in % (3) −0.45/1.24 ≈ −0.36, i.e., about 0.36 standard deviation units below the estimated equation
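As a numeric check of the arithmetic above, here is a short sketch using the coefficients from the slides (with StatedIncome and HomeValue in thousands of dollars):

```python
# Coefficients copied from the slides; StatedIncome and HomeValue in $1000's.
b0, b1, b2, b3, b4 = 23.73, -1.59, -0.018, 0.0004, -0.00075

fitted_apr = b0 + b1*0.942 + b2*640 + b3*100 + b4*305
residual = 10.07 - fitted_apr                     # observed - fitted
print(round(fitted_apr, 2), round(residual, 2))   # 10.52  -0.45
```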
Partition of the Total Variability • Y values have variability. • One way to measure this variability is to see how your Y values vary from the overall mean of the Y's. • It can be shown – not at all obvious! – that: Total variation in the Y's = variation accounted for by the regression (AKA the model) + leftovers (the residuals or "errors")
Partition of the Total Variability SST = SSR + SSE • SST = Sum of Squares Total: total variation in the Y values • SSR = Sum of Squares Regression: variation accounted for by the regression (SSM is used too!) • SSE = Sum of Squares Error: leftover variation
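A quick simulated check (not from the lecture) that SST = SSR + SSE holds for any least squares fit that includes an intercept:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
SST = np.sum((y - y.mean()) ** 2)        # total variation
SSR = np.sum((fitted - y.mean()) ** 2)   # variation picked up by the regression
SSE = np.sum((y - fitted) ** 2)          # leftover variation
print(np.isclose(SST, SSR + SSE))        # True
```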
Summarizing Results in a Table (the ANOVA table):
Source       df          SS    MS
Regression   k           SSR   MSR = SSR/k (Mean squared due to Regression)
Error        n − k − 1   SSE   MSE = SSE/(n − k − 1)
Total        n − 1       SST
(Multiple) Coefficient of Determination • The coefficient of determination R2 is defined as: R2 = SSR/SST = 1 − SSE/SST • Its value tells us the percentage of variation in the response accounted for (or explained) by the regression onto your predictor values. • What is the difference between r2 from simple linear regression and R2 from multiple regression?
Summarizing Results in a Table About 46% of the variation in the APR values is accounted for (or explained) by the regression onto the predictor variables LTV, …, HomeValue.
Issues! • It can be shown that adding more variables to the model will always inflate R2. (See page 621 of your book for an intuitive discussion.) • Remedy: use the adjusted R2: Adjusted R2 = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)] • The adjusted R2 compensates for this issue. HOW/WHY? • The adjusted R2 also makes it easier to compare models. (More on this later.) • However, the "% variation accounted for" interpretation does not apply to the adjusted R2.
Adjusted R Squared Here it is in the regression output! Verify it against the formula above using SSE, SST, n, and k.
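As a sketch, the helper below computes R2 and the adjusted R2 from the ANOVA quantities. For the subprime example, SST is not printed on the slide, so it is backed out here from R2 ≈ 0.46 (an approximation, not an output value):

```python
def r2_and_adjusted_r2(sse, sst, n, k):
    """R^2 and adjusted R^2 from the sums of squares and degrees of freedom."""
    r2 = 1 - sse / sst
    adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    return r2, adj

# Subprime example: SSE = 567.80, n = 372, k = 4; SST approximated from R^2.
sst_approx = 567.80 / (1 - 0.46)
print(r2_and_adjusted_r2(567.80, sst_approx, 372, 4))
```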
The F-Test • If the multiple regression seems reasonable, one of the first "tests" you usually carry out is the "F-Test": H0: B1 = B2 = … = Bk = 0 Ha: At least one of the Bi's ≠ 0 • Informally, the null says "the predictors are useless" vs. the alternative "at least one of the predictors is useful." • The test statistic is F = MSR/MSE, compared against an F distribution with k and n − k − 1 degrees of freedom.
Regression ANOVA Table Here are the F statistic and its p-value. Since the p-value < 0.05, we see that at least one of the predictors is significant.
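A sketch of computing the F statistic and its p-value by hand with scipy; again, SST is approximated from R2 ≈ 0.46 rather than read off the output:

```python
from scipy import stats

n, k = 372, 4
sse = 567.80
sst = 567.80 / (1 - 0.46)            # approximate SST, backed out from R^2
ssr = sst - sse
F = (ssr / k) / (sse / (n - k - 1))  # F = MSR / MSE
p = stats.f.sf(F, k, n - k - 1)      # upper-tail p-value
print(round(F, 1), p)                # large F, tiny p-value
```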
Many Thanks to… Ronald Fisher, one of the "giants" of statistics. Many things are named after him. From Wikipedia: Anders Hald called him "a genius who almost single-handedly created the foundations for modern statistical science", while Richard Dawkins named him "the greatest biologist since Darwin".
In Class Exercise 1 • This will be handed out in class.
Looking at Individual Coefficients • We want to determine the statistical significance of a single predictor in the model. Why? • We want to test, for the jth predictor: H0: Bj = 0 Ha: Bj ≠ 0 • We have two options: • Get a p-value • Get a confidence interval for Bj
Looking at Individual Coefficients For testing H0: Bj = 0 versus Ha: Bj ≠ 0 • Use the output to get the test statistic t = bj / se(bj), compute the p-value from a t-distribution with df = n − k − 1, and compare with your α • Create a 100(1 − α)% CI: bj ± tα/2 · se(bj), where tα/2 comes from a t-distribution with df = n − k − 1
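A sketch of both options for one coefficient. The standard error below is a made-up placeholder, not a value from the course output; read the real se(bj) off your regression table:

```python
from scipy import stats

n, k = 372, 4
df = n - k - 1
bj, se_bj = -1.59, 0.20              # se(b1) here is a PLACEHOLDER, not from the output

t = bj / se_bj                       # test statistic for H0: Bj = 0
p = 2 * stats.t.sf(abs(t), df)       # two-sided p-value
tcrit = stats.t.ppf(0.975, df)       # t_{alpha/2} for a 95% CI
ci = (bj - tcrit * se_bj, bj + tcrit * se_bj)
print(t, p, ci)
```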
Our Example, P-Values se(b1), t-statistic, and p-value for the LTV variable se(b2), t-statistic, and p-value for the CreditScore variable se(b3), t-statistic, and p-value for the StatedIncome variable se(b4), t-statistic, and p-value for the HomeValue variable How about 95% confidence intervals?
Looking at Individual Coefficients • Looking at the previous slide, we see that LTV and CreditScore are statistically significant predictors. • Should we throw away the non-significant predictors? • Important: The tests for the individual regression coefficients (or predictors) assess the statistical significance of each predictor variable assuming that all other predictors are included in the regression. • It’s possible that you throw away a non-significant predictor, and your results for other predictors change!
Variable Selection • Variable selection is intended to select the “best” subset of predictors. • Motivation: • We want to select the simplest model that gets the job done. • We can avoid “multicollinearity”. More on this later. • Practical matters! Like what? • Can we simplify our subprime model?
Variable Selection Methods • Entire books are written on variable selection! • Here’s the simplest method, called backward elimination (see the sketch below): • Start with the largest model (has all the predictors) • Remove the predictor with the largest p-value greater than αcrit. This is usually around 0.10 to 0.20. (Why not 0.05?) • Refit and repeat. • Stop when all non-significant predictors have been removed. • What happens in our example?
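Here is a minimal backward-elimination sketch using statsmodels. It assumes a pandas DataFrame df with a response column and predictor columns passed in by name (hypothetical setup, not the course's data files):

```python
import statsmodels.api as sm

def backward_eliminate(df, response, cols, alpha_crit=0.15):
    """Drop the least significant predictor until all p-values <= alpha_crit."""
    cols = list(cols)
    while cols:
        X = sm.add_constant(df[cols])
        fit = sm.OLS(df[response], X).fit()
        pvals = fit.pvalues.drop("const")      # predictor p-values only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha_crit:         # everything left is significant
            return fit
        cols.remove(worst)                     # remove the worst predictor and refit
    return None
```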
Backward Elimination StatedIncome & HomeValue are removed.
Full Model (Left) vs. New Model (Right) APR = 23.73 − 1.59(LTV) − 0.018(CreditScore) + 0.0004(StatedIncome) − 0.00075(HomeValue) APR = 23.69 − 1.58(LTV) − 0.019(CreditScore) In Summary: The remaining coefficients in the new model do not change much. Se and R2 go down only slightly.
Other Variable Selection • Forward selection: add in the variable with the lowest p-value first (opposite of backward) • Criterion based: pick the model with the best “criterion”, such as adjusted R squared. • All subsets!!! Try out every single combination and pick the model with the best “criterion”; you can use adjusted R squared, for example. (See the sketch below.) • The cutting edge seems to be the LASSO = Least Absolute Shrinkage and Selection Operator (take more stats!)
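And a sketch of the all-subsets approach with adjusted R squared as the criterion, reusing the same hypothetical statsmodels setup (feasible only for a small number of predictors, since the number of subsets grows exponentially):

```python
from itertools import combinations
import statsmodels.api as sm

def best_subset(df, response, cols):
    """Fit every subset of predictors; keep the fit with the best adjusted R^2."""
    best = None
    for r in range(1, len(cols) + 1):
        for subset in combinations(cols, r):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[response], X).fit()
            if best is None or fit.rsquared_adj > best.rsquared_adj:
                best = fit
    return best
```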
In Class Exercise 2 This is just the continuation of in class exercise 1.