Basics of Regression Analysis
Determination of three performance measures • Estimation of the effect of each factor • Explanation of the variability • Forecasting Error
Two Predictor Variables • Population Regression Model: $Y = b_0 + b_1 X_1 + b_2 X_2 + e$, with $e \sim N(0, s)$ • Unknown parameters: $b_0, b_1, b_2$; $s$
From Data to Estimates of Coefficients • Principle: Least Squares • Normal Equation Systems • Estimates of Coefficients • Mathematics • Computing Algorithm
Least Squares Method • Simple Regression • Multiple Regression
Matrix Computation for b • Normal Equation System: $(X^T X)\, b = X^T Y$ • See Text Appendix D.3 • Solution for b: $b = (X^T X)^{-1} X^T Y$
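The matrix solution can be sketched in a few lines of NumPy. This is a minimal illustration rather than the method of the text's Appendix D.3: the data values are made up for the example, and the names `x1`, `x2`, `y`, and `X` are hypothetical.

```python
import numpy as np

# Hypothetical data (not from the slides): n = 8 observations, two predictors.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = np.array([3.2, 2.9, 7.4, 6.8, 11.1, 10.7, 15.3, 14.6])

# Design matrix X with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve the normal equation system (X^T X) b = X^T Y for b.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least squares routine, which avoids forming X^T X.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```

Solving the linear system is generally preferable to computing the inverse explicitly, and `np.linalg.lstsq` is the safer choice when the predictors are close to collinear.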
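Standardized Regression Coefficients • Definition: $b_k^{*} = b_k \, s_{X_k} / s_Y$ for $k = 1, 2$, where $s_{X_k}$ and $s_Y$ are the sample standard deviations of $X_k$ and $Y$ • $b_0^{*} = 0$ • Called the beta coefficient • Used to show relative weights of predictors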
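Estimation of $s$ by $s_e$, the Standard Deviation of the Disturbance $e$ • Forecasting Equation: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$, using the least squares estimates of the coefficients • SS of Residuals: $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ • Mean SS: $MSE = s_e^2 = SSE / (n-3)$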
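As a rough sketch of this slide's quantities, the residual sum of squares, the mean square, and the estimate of the disturbance standard deviation can be computed as follows. The data are hypothetical, and the divisor n - 3 reflects the three estimated coefficients of the two-predictor model.

```python
import numpy as np

# Hypothetical data (not from the slides): n = 8, two predictors.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = np.array([3.2, 2.9, 7.4, 6.8, 11.1, 10.7, 15.3, 14.6])
X = np.column_stack([np.ones_like(x1), x1, x2])
n = len(y)

# Least squares estimates and the fitted (forecast) values.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# SS of residuals, mean SS with n - 3 degrees of freedom, and s_e.
resid = y - y_hat
sse = float(resid @ resid)
mse = sse / (n - 3)
s_e = mse ** 0.5
print(sse, mse, s_e)
```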
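Standard Error of Coefficients • The variance matrix of $b$ (a $(K+1) \times 1$ vector) is estimated by $s_e^2 (X^T X)^{-1}$ • The standard error of each coefficient is the square root of the corresponding diagonal element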
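The sketch below assumes the standard least squares result that the variance matrix of the coefficient estimates is estimated by the mean square error times the inverse of X^T X. The data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical data (not from the slides): n = 8, K = 2 predictors.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = np.array([3.2, 2.9, 7.4, 6.8, 11.1, 10.7, 15.3, 14.6])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape  # p = K + 1 coefficients

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

# Mean square error with n - K - 1 degrees of freedom.
resid = y - X @ b
mse = float(resid @ resid) / (n - p)

# Estimated variance matrix of b, its standard errors, and the t statistics.
var_b = mse * XtX_inv
se_b = np.sqrt(np.diag(var_b))
t_stats = b / se_b
print(se_b, t_stats)
```

Each t statistic is the estimate divided by its standard error, which is the quantity used in the t-tests of significance.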
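The Variability Explained • First, determine the base variability for explanation by the regression • Unconditional mean model: $Y = m_y + e$, with $e \sim N(0, s_y)$ • LS fit of the model: $\text{Pred\_Y} = \bar{Y}$ • SS of Residuals: $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ • MSS (DF = n-1): $s_y^2 = SST / (n-1)$
The Variability Explained (cont.) • Second, subtract the variability still left after the regression • In SS: $R^2 = (SST - SSE) / SST = 1 - SSE / SST$ • In Variance: $\text{Adj. } R^2 = (s_y^2 - s_e^2) / s_y^2 = 1 - s_e^2 / s_y^2$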
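A minimal sketch of the two-step comparison, assuming the usual definitions of R2 (in sums of squares) and adjusted R2 (in variances). The data are hypothetical.

```python
import numpy as np

# Hypothetical data (not from the slides): n = 8, two predictors.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = np.array([3.2, 2.9, 7.4, 6.8, 11.1, 10.7, 15.3, 14.6])
X = np.column_stack([np.ones_like(x1), x1, x2])
n = len(y)

# Step 1: base variability from the unconditional mean model.
sst = float(np.sum((y - y.mean()) ** 2))
var_y = sst / (n - 1)

# Step 2: variability still left after the regression.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
sse = float(resid @ resid)
var_e = sse / (n - 3)

# Share explained, in sums of squares and in variances.
r2 = 1.0 - sse / sst
adj_r2 = 1.0 - var_e / var_y
print(r2, adj_r2)
```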
Test of Significance • F-test of significance • t-test of significance • Two-sided alternative • One-sided alternative
F-Test of Significance of the variability explained by the regression • H0: $b_1 = b_2 = 0$ • Ha: At least one coefficient is not 0 • P-value of F-stat = $P\{F(2, n-3) > \text{F-stat}\}$
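The slide does not show how the F statistic is formed; the sketch below assumes the standard definition, the explained mean square over the residual mean square. For the two-predictor case the p-value has a closed form, because the upper tail of F(2, m) equals (1 + 2f/m)^(-m/2), so no statistics library is needed. With more than two predictors a library routine such as `scipy.stats.f.sf` would be the usual choice. The data are hypothetical.

```python
import numpy as np

# Hypothetical data (not from the slides): n = 8, two predictors.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = np.array([3.2, 2.9, 7.4, 6.8, 11.1, 10.7, 15.3, 14.6])
X = np.column_stack([np.ones_like(x1), x1, x2])
n = len(y)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
sse = float(resid @ resid)
sst = float(np.sum((y - y.mean()) ** 2))

# F statistic: explained mean square (2 DF) over residual mean square (n - 3 DF).
df_num, df_den = 2, n - 3
f_stat = ((sst - sse) / df_num) / (sse / df_den)

# Closed-form upper tail of F(2, m); valid only when the numerator DF is 2.
p_value = (1.0 + 2.0 * f_stat / df_den) ** (-df_den / 2.0)
print(f_stat, p_value)
```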
t-Test of Significance of a variable, X1 (two-sided) • H0: $b_1 = 0$ • Ha: $b_1 \neq 0$ • P-value of t-stat = $P\{|t(n-3)| > |\text{t-stat}|\}$
One-Sided Test of Significance of a variable, X1 • H0: $b_1 = 0$ • Ha: $b_1 > 0$ (using prior knowledge) • P-value of t-stat = $P\{t(n-3) > \text{t-stat}\}$
Forecasting • Point forecasting • Sources of forecasting error • Interval forecasting
Forecasting at $x_m$ • Data of X for regression • Value of X for prediction
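Sources of Forecasting Error • Data: $Y|x_m = b_0 + b_1 x_{1m} + b_2 x_{2m} + e_m$ • Forecast: $\hat{Y}|x_m = \hat{b}_0 + \hat{b}_1 x_{1m} + \hat{b}_2 x_{2m}$, using the estimated coefficients • Forecast Error: $Y|x_m - \hat{Y}|x_m = (b_0 - \hat{b}_0) + (b_1 - \hat{b}_1) x_{1m} + (b_2 - \hat{b}_2) x_{2m} + e_m$, that is, estimation error in the coefficients plus the disturbance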
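Forecasting Performance Analysis • $R^2_{pred} = 1 - PRESS / SST$, where $PRESS = \sum_i (y_i - \hat{y}_{i(i)})^2$ is the SS of deleted residuals and $\hat{y}_{i(i)}$ is the forecast of $y_i$ from the regression fitted without observation $i$ • Sample splitting • Analysis sample ($n_1$) • Validation sample ($n_2$)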
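A deleted residual can be computed literally, by refitting the regression with one observation left out. The sketch below does this for hypothetical data and then checks the result against the well-known shortcut that divides each ordinary residual by one minus its leverage. The helper name `ols` is an assumption for the example.

```python
import numpy as np

# Hypothetical data (not from the slides): n = 8, two predictors.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = np.array([3.2, 2.9, 7.4, 6.8, 11.1, 10.7, 15.3, 14.6])
X = np.column_stack([np.ones_like(x1), x1, x2])
n = len(y)


def ols(design, response):
    """Least squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


# PRESS: forecast each y_i from a fit that excludes observation i.
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b_without_i = ols(X[keep], y[keep])
    press += float((y[i] - X[i] @ b_without_i) ** 2)

sst = float(np.sum((y - y.mean()) ** 2))
r2_pred = 1.0 - press / sst

# Shortcut: deleted residual = ordinary residual / (1 - leverage).
H = X @ np.linalg.inv(X.T @ X) @ X.T
resid = y - H @ y
press_shortcut = float(np.sum((resid / (1.0 - np.diag(H))) ** 2))
print(press, press_shortcut, r2_pred)
```

Because every deleted residual is at least as large in magnitude as the ordinary residual, PRESS is never smaller than SSE, so R2_pred is never larger than R2.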
Generalization to K Independent Variables • Use n – K – 1 in place of n – 3 as the DF for t. • Use K as the numerator DF and n – K – 1 as the denominator DF for F.
Diagnostics • Assumptions for Disturbance • Multi-collinearity • Outliers and Influential Observations
Problematic Data Conditions • Regression Coefficients Are Sensitive to: • Highly Collinear Independent Variables • Contamination By Outliers and Influential Observations
Detecting Outliers and Influential Data • Outliers • Leverage (X-space): distance from the mean • Tresid (Y-space): forecasting error • Influential Data • Idea: with / without comparison • Cook's D • DFBETAS • DFFITS
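These diagnostics can be sketched from the hat matrix using their standard textbook formulas. Two assumptions are made here: that "Tresid" refers to the deleted (externally studentized) residual, and that the data, which are hypothetical, stand in for a real sample.

```python
import numpy as np

# Hypothetical data (not from the slides): n = 8, two predictors.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = np.array([3.2, 2.9, 7.4, 6.8, 11.1, 10.7, 15.3, 14.6])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape  # p = number of coefficients

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
sse = float(e @ e)
mse = sse / (n - p)

# Leverage: diagonal of the hat matrix, a distance from the mean in X-space.
h = np.diag(X @ XtX_inv @ X.T)

# Deleted studentized residual: scaled by the error variance of the fit
# that leaves observation i out (assumed meaning of "Tresid").
s2_deleted = (sse - e**2 / (1.0 - h)) / (n - p - 1)
t_resid = e / np.sqrt(s2_deleted * (1.0 - h))

# Influence measures built on the with / without comparison.
cooks_d = e**2 / (p * mse) * h / (1.0 - h) ** 2
dffits = t_resid * np.sqrt(h / (1.0 - h))
print(h, t_resid, cooks_d, dffits)
```

The leverages always sum to the number of coefficients, which gives a quick sanity check and a natural yardstick (the average leverage p/n) for spotting unusual points.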
Modeling Techniques • Transformation of Variables • Log • Others • Using Dummy Variables • Symbolic representation • Dummy variables for qualitative variables • Using Scores for Ordinal Variables • Selection of Independent Variables • Forecasting • Computer intensive • Analysis of correlation structure of independent variables
Dummy Variables • DK = “If (X=k,1,0)”, that is, DK is 1 when X = k and 0 otherwise • Can be used for nominal and also ordinal variables • # of DK = c-1, where c is the number of categories.
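As an illustration of the c - 1 rule, the sketch below codes a hypothetical three-category variable `region` into two dummy columns. Treating the first category in sorted order as the baseline is an arbitrary choice made for the example.

```python
import numpy as np

# Hypothetical qualitative variable with c = 3 categories.
region = ["north", "south", "west", "south", "north", "west"]

categories = sorted(set(region))
baseline = categories[0]          # the omitted category, coded as all zeros
dummy_levels = categories[1:]     # c - 1 categories that get a dummy each

# D_k = 1 if X == k, else 0, for each non-baseline category k.
dummies = np.array(
    [[1.0 if value == k else 0.0 for k in dummy_levels] for value in region]
)
print(dummy_levels)
print(dummies)
```

Leaving one category out avoids perfect collinearity with the intercept column; each dummy coefficient then measures the difference from the baseline category.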
Using Scores for Ordinal Variable • Scoring Systems • 1, 2, 3, …, c • -2, -1, 0, 1, 2 (centered at 0; c odd)
Selection of Variables - 1 • Backward elimination: All X's → drop variables by t-test → Final Regression • Stepwise (forward) inclusion: Best simple → Best two variables → Best … variables, adding at each step the variable with the max increase in R2
Selection of Variables - 2 • All Possible Regression • With K independent variables: K simple regressions, K(K-1)/2 two-variable regressions, …, 1 K-variable regression • Final Regression chosen among them
Selection Criteria • R2 • Adj. R2 • R2PRED • Se • Cp
Cp (p = # of coefficients) • Select a combination with Cp close to p
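The slide does not give the formula for Cp; the sketch below assumes the usual Mallows definition, Cp = SSE_p / MSE_full - (n - 2p), where p counts the coefficients of the candidate model including the intercept. It runs all possible regressions on hypothetical data with K = 3 predictors; the names `predictors` and `sse_of` are made up for the example.

```python
from itertools import combinations

import numpy as np

# Hypothetical data (not from the slides): n = 10, K = 3 candidate predictors.
predictors = {
    "x1": np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float),
    "x2": np.array([2, 1, 4, 3, 6, 5, 8, 7, 10, 9], dtype=float),
    "x3": np.array([5, 3, 8, 1, 9, 2, 7, 4, 6, 10], dtype=float),
}
y = np.array([3.2, 2.9, 7.4, 6.8, 11.1, 10.7, 15.3, 14.6, 19.2, 18.9])
n, K = len(y), len(predictors)


def sse_of(names):
    """Residual sum of squares of the regression on the named predictors."""
    design = np.column_stack([np.ones(n)] + [predictors[k] for k in names])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


# The error variance is estimated from the full K-variable model.
all_names = tuple(predictors)
mse_full = sse_of(all_names) / (n - K - 1)

# All possible regressions: every non-empty subset of the predictors.
cp = {}
for size in range(1, K + 1):
    for names in combinations(all_names, size):
        p = size + 1  # coefficients, including the intercept
        cp[names] = sse_of(names) / mse_full - (n - 2 * p)

for names, value in sorted(cp.items(), key=lambda item: item[1]):
    print(names, round(value, 2))
```

By construction the full model always has Cp equal to its own p, so the criterion is informative only for the smaller subsets.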
What to Look for in a Good Regression? • Remember the three functions of regression • Estimation of the effect of each X • Explaining the variability of Y • Forecasting • Population regression models are assumptions • They need testing • Data might be contaminated
Types of Variable • Quantitative: Continuous, Discrete (counting) • Qualitative: Ordinal, Nominal
Generalized Linear Models (GLM) • Regression model: $Y = b_0 + b_1 X_1 + b_2 X_2 + e$, with $e \sim N(0, s)$ • GLM Formulation: • Model for Y: Y is $N(m, s)$ • Model for predictors (Link Function): $m = b_0 + b_1 X_1 + b_2 X_2$
Forecasting Counting Data • Model for Y: Poisson Distribution (m) • Link Function:
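The link function for this slide did not survive in the text. The sketch below assumes the log link, log(m) = b0 + b1 X, which is the common choice for Poisson counts, and fits it by iteratively reweighted least squares. The data, the single predictor `x`, and the iteration settings are all hypothetical.

```python
import numpy as np

# Hypothetical count data (not from the slides) with one predictor.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([1, 1, 2, 4, 5, 9, 13, 22], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Start from the constant-mean model: log(m) = log(mean of y).
b = np.array([np.log(y.mean()), 0.0])

# Iteratively reweighted least squares for the assumed log link.
for _ in range(50):
    eta = X @ b                   # linear predictor
    mu = np.exp(eta)              # Poisson mean under the log link
    z = eta + (y - mu) / mu       # working response
    WX = X * mu[:, None]          # rows of X scaled by the weights W = mu
    b_new = np.linalg.solve(X.T @ WX, WX.T @ z)
    converged = np.max(np.abs(b_new - b)) < 1e-10
    b = b_new
    if converged:
        break

mu = np.exp(X @ b)
print(b)
print(mu)
```

At convergence the fitted means satisfy the likelihood equations X^T (y - mu) = 0, so with an intercept in the model the fitted counts sum to the observed counts.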