Understand the linear relationship between variables Y and X to create accurate regression models. Learn the principles of least squares method and scatter diagrams for predictive analysis.
Warsaw Summer School 2019, OSU Study Abroad Program
Regression
Linear Relationship
A line is a mathematical function that can be expressed through the formula Y = a + bX, where Y and X are our variables. Y, the dependent variable, is expressed as a linear function of the independent (explanatory) variable X.
Linear Relationship
The constant a is the value of Y at the point where the line Y = a + bX intersects the Y-axis (also called the intercept). The slope b equals the change in Y for a one-unit increase in X: a one-unit increase in X corresponds to a change of b units in Y. The slope describes the rate of change in the Y-values as X increases. Verbal interpretation of the slope of the line: “rise over run”, that is, the rise divided by the run (the change in the vertical distance divided by the change in the horizontal distance).
Cartesian Coordinate System
Variables X, Y and their linear function: the formula Y = a + bX expresses the dependent (response) variable Y as a linear function of the independent (explanatory) variable X. The formula maps out a straight-line graph with slope b and Y-intercept a.
Basics
Linear Relationship: Y = a + bX
The constant a is the value of Y when X = 0. For X = 0 we have Y = a + b*0 = a, so a is the value of Y where the line Y = a + bX intersects the Y-axis. The slope b equals the change in Y for a one-unit increase in X. This means that a one-unit increase in X corresponds to a change of b units in Y. Thus, the slope describes the rate of change in the Y-values as X increases. Generally, for any point (X, Y) on the line with X ≠ 0, b = (Y - a) / X.
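As a small illustration of “rise over run”, the sketch below recovers a and b from two points on a line. The function name slope_and_intercept and the two points are made up for this example.

```python
def slope_and_intercept(p1, p2):
    """Return (a, b) for the line Y = a + bX through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    b = (y2 - y1) / (x2 - x1)  # rise over run
    a = y1 - b * x1            # value of Y where X = 0
    return a, b


# Hypothetical points (1, 5) and (3, 9): the rise is 4, the run is 2.
a, b = slope_and_intercept((1, 5), (3, 9))
print(f"Y = {a} + {b}X")
```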
Model vs Reality
The function Y = a + bX is a model. In reality, we do not have one line.
The Scattergram and the Least Squares Method
The graphical plot of observed values (X, Y) is called a scattergram, scatter diagram, or scatterplot. A regression function describes how the expected value of the dependent (response) variable Y changes according to the values of an independent (explanatory) variable X.
Regression
This expected value is estimated by a linear function:
• Ý = a + bX
Ý = the predicted value of the dependent variable Y
a = the intercept (the predicted value of Y when X = 0)
b = the regression coefficient (the slope), indicating the amount of change in the predicted Y given a unit change in X
X = the independent variable
Regression
Ý = a + bX
b = [Σ(X - X̃)(Y - Ÿ)] / Σ(X - X̃)²
a = Ÿ - b*X̃
where X̃ and Ÿ are the means of X and Y.
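The two formulas can be sketched in a few lines of Python. The function name least_squares and the five data points are assumptions made for illustration; they do not come from the course material.

```python
def least_squares(xs, ys):
    """Return the intercept a and slope b of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the two means.
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of X from its mean.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    a = mean_y - b * mean_x
    return a, b


# Made-up data, used only to exercise the formulas.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}")
```

For these points the means are 3 and 4, the sum of products is 6 and the sum of squares for X is 10, so b = 0.6 and a = 4 - 0.6*3 = 2.2.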
Method of Least Squares
The prediction errors, called residuals, are defined as the differences between the observed and predicted values of Y:
E = Y - (a + bX) = Y - Ý
The regression line minimizes the sum of squared errors:
SSE = Σ(Y - Ý)²
Method of Least Squares
The method of least squares provides the prediction equation Ý = a + bX that has the minimal value of SSE. The least squares estimates a and b are the values determining the prediction equation for which the sum of squared errors, SSE, is a minimum.
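A quick numerical check of the minimum property, as a sketch on made-up data. For the five points below the least squares estimates work out to a = 2.2 and b = 0.6; the helper name sse and the alternative lines tried are assumptions of this example.

```python
def sse(a, b, xs, ys):
    """Sum of squared errors for the line Y-hat = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data; its least squares line is Y-hat = 2.2 + 0.6X.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

best = sse(2.2, 0.6, xs, ys)
print(f"SSE at the least squares line: {best:.2f}")

# Any other intercept or slope should give a larger SSE.
for a, b in [(2.2, 0.7), (2.0, 0.6), (2.5, 0.5)]:
    print(f"a = {a}, b = {b}: SSE = {sse(a, b, xs, ys):.2f}")
```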
Covariance
In regression analysis we ask: to what extent can we predict Y knowing our variable X? Prediction means that the values of X and Y go together, or co-vary. Covariance is based on the sum of products, SP (the sample covariance is SP / (n - 1), where n is the number of cases):
• SP = Σ (X - X̃)(Y - Ÿ)
Sum of squares for X:
• SSx = Σ (X - X̃)²
Note that in the regression equation of Y on X:
• Ý = bX + a
• b = SP / SSx
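The link between SP, the covariance, and the slope can be sketched as follows. The helper names and the data are hypothetical, and the sample covariance and variance here use the n - 1 divisor.

```python
def sum_of_products(xs, ys):
    """SP: sum of products of deviations of X and Y from their means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def sum_of_squares(values):
    """SS: sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

sp = sum_of_products(xs, ys)
ss_x = sum_of_squares(xs)
covariance = sp / (n - 1)    # sample covariance of X and Y
variance_x = ss_x / (n - 1)  # sample variance of X

# The (n - 1) divisors cancel, so both ratios give the same slope.
print(sp / ss_x, covariance / variance_x)
```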
Interpretation of b
The slope of the line, b, has the verbal interpretation “rise over run”, that is, the rise divided by the run. This means that the change in the vertical distance is divided by the change in the horizontal distance. The steeper the hill, the higher the slope: you go “up” more rapidly than you go over. The line can also have a negative slope. With a negative slope, you are going “downhill” rather than “uphill.”
• b > 0, positive relationship
• b < 0, negative relationship
• b = 0, no linear relationship
Unstandardized and standardized coefficients
If both variables, the IV and the DV, are expressed in z-scores, the constant a is equal to zero. We obtain Beta coefficients, which tell us how much change in the DV, in standard deviation units, is attributable to a change of one standard deviation in the IV.
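A minimal sketch of this, assuming made-up data and sample standard deviations (n - 1 divisor). The helper names z_scores and least_squares are hypothetical. In the bivariate case the Beta coefficient also equals b multiplied by the ratio of the standard deviations of X and Y.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert raw values to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def least_squares(xs, ys):
    """Return the intercept a and slope b of the least squares line."""
    mean_x, mean_y = mean(xs), mean(ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    return mean_y - b * mean_x, b


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)                            # raw units
a_z, beta = least_squares(z_scores(xs), z_scores(ys))   # z-score units

print(f"unstandardized: a = {a:.3f}, b = {b:.3f}")
print(f"standardized: constant = {abs(a_z):.3f}, Beta = {beta:.3f}")
```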
Two and more IVs
With two IVs:
Ý = a + b1X1 + b2X2 (unstandardized)
Ý = β1X1 + β2X2 (standardized)
With k IVs:
Ý = a + b1X1 + b2X2 + … + b(k-1)X(k-1) + bkXk (unstandardized)
Ý = β1X1 + β2X2 + … + β(k-1)X(k-1) + βkXk (standardized)
Coefficients and variables
The estimated parameters b1, b2, ..., bk are partial regression coefficients. They differ from the regression coefficients for the bivariate relationships between Y and each explanatory variable. Three criteria for choosing the number of independent (explanatory) variables:
• (1) Theory
• (2) Parsimony
• (3) Sample size
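To see how a partial regression coefficient can differ from a bivariate one, here is a sketch that solves the least squares normal equations for two IVs in plain Python. Everything in it is an assumption of the example: the function names, the solving method (Gauss-Jordan elimination), and the data, which were generated without noise from Y = 1 + 2*X1 + 3*X2 so the fit should recover those values.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple(rows, ys):
    """rows holds (X1, ..., Xk) per case; returns [a, b1, ..., bk]."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the constant a
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


def bivariate_slope(xs, ys):
    """Slope of Y on a single X, ignoring all other variables."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sp / sum((x - mx) ** 2 for x in xs)


# Made-up, noise-free data: Y = 1 + 2*X1 + 3*X2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

a, b1, b2 = fit_multiple(rows, ys)
print(f"partial: a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
print(f"bivariate slope of Y on X1: "
      f"{bivariate_slope([r[0] for r in rows], ys):.2f}")
```

Because X1 and X2 are correlated in this data, the bivariate slope of Y on X1 absorbs part of the effect of X2 and comes out larger than the partial coefficient b1.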
R²
Coefficient of determination (explained variance) for two variables:
• r² = [SS(total) - SS(error)] / SS(total)
Stata provides a value of the coefficient of determination:
• R² = [SS(total) - SS(error)] / SS(total)
Sum of squares
R² is the proportion of variance explained by X1, X2, ..., Xk. Therefore, 1 - R² is the proportion of unexplained variance.
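A short sketch of the computation, assuming made-up data and a prediction line Y-hat = 2.2 + 0.6X (the least squares line for these five points). The function name r_squared is hypothetical.

```python
def r_squared(ys, predicted):
    """Proportion of the variance in Y explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_error = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return (ss_total - ss_error) / ss_total


# Made-up data and its least squares predictions.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * x for x in xs]

r2 = r_squared(ys, predicted)
print(f"explained: {r2:.2f}, unexplained: {1 - r2:.2f}")
```

Here SS(total) = 6 and SS(error) = 2.4, so R² = 3.6 / 6 = 0.6.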
Adjusted R-square
• Adjusted R-square is a modification of R-square that adjusts for the number of terms in a model. R-square never decreases when a new term is added to a model, but adjusted R-square increases only if the new term improves the model more than would be expected by chance.
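The penalty can be illustrated with the usual formula, adjusted R-square = 1 - (1 - R-square)(n - 1) / (n - k - 1), where n is the number of cases and k the number of predictors. The R-square values below are hypothetical numbers chosen to show the effect, not results from any dataset.

```python
def adjusted_r_squared(r2, n, k):
    """Adjust R-square for n cases and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than estimated coefficients")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical case: a second predictor raises R-square only slightly
# (0.60 to 0.61) in a sample of 5 cases.
one_predictor = adjusted_r_squared(0.60, n=5, k=1)
two_predictors = adjusted_r_squared(0.61, n=5, k=2)
print(f"{one_predictor:.3f} vs {two_predictors:.3f}")
```

R-square went up, but adjusted R-square went down, which signals that the extra term did not earn its place in the model.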
Sum of Squares
The regression sum of squares is defined as:
SS(regression) = SS(total) – SS(error)
Mean square
The regression mean square:
MSS(regression) = SS(regression) / df(regression)
df(regression) = k, where k is the number of independent variables
The mean square error:
MSS(error) = SS(error) / df(error)
df(error) = n - (k + 1), where n is the number of cases and k is the number of independent variables.
F
The null hypothesis H0: b1 = b2 = … = bk = 0
• F = MSS(regression) / MSS(error)
The sampling distribution of this statistic is the F-distribution with k and n - (k + 1) degrees of freedom.
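The chain from sums of squares to F can be sketched as below. The function name f_statistic and the input values (SS(total) = 6, SS(error) = 2.4, n = 5, k = 1) are assumptions for illustration. Looking up the p-value needs the F-distribution, which this sketch leaves to a statistics package.

```python
def f_statistic(ss_total, ss_error, n, k):
    """Return (MS regression, MS error, F) for n cases and k predictors."""
    ss_regression = ss_total - ss_error
    ms_regression = ss_regression / k        # df(regression) = k
    ms_error = ss_error / (n - (k + 1))      # df(error) = n - (k + 1)
    return ms_regression, ms_error, ms_regression / ms_error


# Hypothetical sums of squares from a bivariate regression on 5 cases.
ms_reg, ms_err, f = f_statistic(ss_total=6.0, ss_error=2.4, n=5, k=1)
print(f"MS(regression) = {ms_reg:.2f}, MS(error) = {ms_err:.2f}, F = {f:.2f}")
```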
t
The test of H0: b = 0 evaluates whether Y and X are statistically dependent, ignoring other variables. We use the t statistic:
• t = b / σB
where σB is the standard error of b. In the bivariate case:
• σB = √[SS(error) / (n - 2)] / √[Σ(X - X̃)²]
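A sketch of the bivariate t statistic on made-up data. It assumes the standard error of the slope is the square root of the mean square error, SS(error) / (n - 2), divided by the sum of squares for X. The function name slope_t_statistic is hypothetical.

```python
from math import sqrt


def slope_t_statistic(xs, ys):
    """Return (b, standard error of b, t) for a bivariate regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp / ss_x
    a = mean_y - b * mean_x
    ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = sqrt((ss_error / (n - 2)) / ss_x)
    return b, se_b, b / se_b


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, se_b, t = slope_t_statistic(xs, ys)
print(f"b = {b:.3f}, se = {se_b:.3f}, t = {t:.3f}, t squared = {t * t:.3f}")
```

With a single predictor, the square of t equals the F statistic of the regression, which gives a convenient consistency check.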
ANOVA
ANALYSIS OF VARIANCE
• How much of the variance is explained by the values of the nominal variable?
• Total sum of squared deviations from the mean:
• SS(total) = Σ [X – X̃(total)]²
ANOVA
The between-group variation represents the squared deviations of every group mean from the total mean, summed over all cases (so each group mean counts once for every case in that group):
• SS(between) = Σ [X̃(group) – X̃(total)]²
The within-group sum of squares is the sum of squared deviations of every raw score from its group mean:
• SS(within) = Σ [X – X̃(group)]²
ANOVA
Mean Squares:
• MSS(between) = SS(between) / df(between), where df(between) = k – 1
• MSS(within) = SS(within) / df(within), where df(within) = N - k
Here k is the number of groups and N is the number of cases.
F
F-statistic:
• F = MSS(between) / MSS(within)
• The larger the F-value, the greater the impact of a group on the dependent variable.
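The ANOVA quantities above can be sketched for three small made-up groups. The function name one_way_anova is an assumption of this example, and each group mean is weighted by its group size when computing SS(between).

```python
def one_way_anova(groups):
    """Return (SS between, SS within, F) for a list of groups of scores."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Each case contributes its group's squared distance to the total mean.
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # Each case contributes its own squared distance to its group mean.
        ss_within += sum((v - group_mean) ** 2 for v in g)

    ms_between = ss_between / (k - 1)        # df(between) = k - 1
    ms_within = ss_within / (n_total - k)    # df(within) = N - k
    return ss_between, ss_within, ms_between / ms_within


# Three made-up groups with means 2, 5 and 8.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ss_b, ss_w, f = one_way_anova(groups)
print(f"SS(between) = {ss_b}, SS(within) = {ss_w}, F = {f}")
```

For these groups SS(between) = 54 and SS(within) = 6, which add up to SS(total) = 60.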
F
Compare:
• ANOVA: F = MSS(between) / MSS(within)
• Regression ANOVA: F = MSS(regression) / MSS(error)
ANOVA
• Source - Model, Residual, and Total. The Total variance is partitioned into the variance that can be explained by the independent variables (Model) and the variance that is not explained by the independent variables (Residual, sometimes called Error).
• SS - Sum of Squares associated with the three sources of variance: Total, Model, and Residual.
• df - Degrees of freedom associated with the sources of variance. The Total variance has N - 1 degrees of freedom. The Model degrees of freedom equal the number of estimated coefficients, including the intercept, minus 1. The Residual degrees of freedom are the Total df minus the Model df.
• MS - Mean Squares, the Sums of Squares divided by their respective df.
Regression
• Number of observations used in the regression analysis.
• F - The F-statistic is the Mean Square Model divided by the Mean Square Residual. The numbers in parentheses are the Model and Residual degrees of freedom.
• Prob > F - This is the p-value associated with the above F-statistic. It is used in testing the null hypothesis that all of the model coefficients are 0.
• R-squared - The proportion of variance in the dependent variable that can be explained by the independent variables.
• Adj R-squared - An adjustment of the R-squared that penalizes the addition of extraneous predictors to the model. Adjusted R-squared is computed using the formula 1 - (1 - Rsq)(N - 1) / (N - k - 1), where k is the number of predictors.
• Root MSE - The standard deviation of the error term, equal to the square root of the Mean Square Residual (or Mean Squared Error).