4. Regression Analysis Finding an equation that best explains a functional relationship between two or more related random variables.
Remember Graphing Functions In High School… Y is said to be a function of X; that is, if you know the value of X, you know exactly what Y is: Y = mX + b, where m = slope = rise/run = ∆Y/∆X and b is the Y-intercept. You usually start with a function, then graph it.
Imperfect Functional Relationships What if you had data points that suggested a statistical, functional relationship (e.g., more staff → more sales)? You would then like to draw a line that inexactly “fits” the data points, similar to what you did with a function.
The function would look something like this… We still have the same math relationship as before, but instead of Y = mX + b, an exact functional relationship, we have Ŷ = b0 + b1X, where the “hat” on Y indicates it is a predicted value only!
The Estimated Function is… The function that relates the actual data points is Y = β0 + β1X + ε, where the “ε” term is the vertical error in the estimate: ε = Y − Ŷ. The sample estimate is Ŷ = b0 + b1X, where the “hat” on Y again indicates a predicted value only.
Statistical Estimates from Previous Example The least-squares estimates are b1 = Σ(X − X̄)(Y − Ȳ)/Σ(X − X̄)² and b0 = Ȳ − b1X̄. So, b1 = 12.5/10 = 1.25, and b0 = 7 − 1.25(4) = 2. Question: In terms of statistics (rather than calculations), what is the estimate of b1?
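A minimal sketch of these estimates in Python. The individual data points are not shown on the slide, so the values below are an assumption chosen to reproduce the example's summary statistics (n = 6, X̄ = 4, Ȳ = 7, Σ(X − X̄)(Y − Ȳ) = 12.5, Σ(X − X̄)² = 10):

```python
# Hypothetical data consistent with the example's summary statistics.
X = [3, 4, 6, 4, 2, 5]       # independent variable (assumed values)
Y = [6, 8, 9, 5, 4.5, 9.5]   # dependent variable (assumed values)

n = len(X)
x_bar = sum(X) / n                                           # 4.0
y_bar = sum(Y) / n                                           # 7.0
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))   # 12.5
sxx = sum((x - x_bar) ** 2 for x in X)                       # 10.0

b1 = sxy / sxx             # slope: 12.5 / 10 = 1.25
b0 = y_bar - b1 * x_bar    # intercept: 7 - 1.25(4) = 2.0
```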
Sum of Squares: Example SST = Σ(Y − Ȳ)² = 22.5 is the total variability in Y; SSE = Σ(Y − Ŷ)² = 6.875 is the error (residual) variability; SSR = Σ(Ŷ − Ȳ)² = 15.625 is the variability explained by the regression. Note: SST = SSE + SSR.
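Continuing the sketch above, the three sums of squares follow directly from the fitted values:

```python
# Fitted values from Y-hat = b0 + b1*X.
Y_hat = [b0 + b1 * x for x in X]

SST = sum((y - y_bar) ** 2 for y in Y)               # total: 22.5
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # error: 6.875
SSR = sum((yh - y_bar) ** 2 for yh in Y_hat)         # regression: 15.625

assert abs(SST - (SSE + SSR)) < 1e-9                 # SST = SSE + SSR
```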
Coefficient of Determination “r2” The SSR is sometimes called the explained variability in Y while the SSE is the unexplained variability in Y. The proportion of the variability in Y that is explained by the regression is called the coefficient of determination, or “r2”. In the example, r2 = 15.625/22.5 = 0.6944, meaning that about 69% of the variability in Y was explained by the regression.
Correlation Coefficient The correlation coefficient “r” is simply the (positive or negative) square root of the coefficient of determination. • −1 ≤ r ≤ +1 • r > 0 when the slope is positive, and r < 0 when the slope is negative. In the example, the slope is positive, so r = +√0.6944 = 0.8333.
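Both quantities follow in one step from the sums of squares computed above:

```python
r2 = SSR / SST                            # 15.625 / 22.5 = 0.6944
r = (1 if b1 > 0 else -1) * r2 ** 0.5     # +0.8333, since the slope is positive
```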
Estimating the Variance The variance σ² is typically not known. Its estimate is known as the Mean Squared Error (MSE) and is denoted by s²: s² = MSE = SSE/(n − k − 1), where n is the number of observations and k is the number of independent variables. In the example, MSE = 6.8750/(6 − 1 − 1) = 1.7188
Standard Error The square root of the MSE is called the standard error of the estimate, or the standard deviation of the regression. The standard error is used in many tests of the model. In the example, s = √(1.7188) = 1.31
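In code, continuing the running example (k = 1 independent variable):

```python
k = 1                        # number of independent variables
MSE = SSE / (n - k - 1)      # 6.875 / 4 = 1.71875
s = MSE ** 0.5               # standard error: about 1.31
```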
Hypothesis Testing In statistics, rather than trying to prove that a relationship is important, we try to disprove, or reject, the idea that the relationship is not important. Null Hypothesis (H0): no relationship between X and Y. Alternate Hypothesis (H1): there is a relationship between X and Y. H0: β1 = 0; H1: β1 ≠ 0 We want to test to see if we can reject H0; if we reject H0, we would accept H1.
The F-Test for Significance (reject H0) Define the Mean Squared Regression MSR: MSR = SSR/k, where “k” is the number of independent variables in the model. The “F-statistic” is then computed as: F = MSR/MSE = explained variability/unexplained variability, where F ≥ 0. In the example, MSR = SSR/k = 15.625/1 = 15.625, and F = 15.625/1.7188 = 9.0909
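A sketch of the F-test, using scipy to turn the F-statistic into a p-value (degrees of freedom k and n − k − 1, continuing the variables above):

```python
from scipy import stats

MSR = SSR / k                          # 15.625
F = MSR / MSE                          # about 9.09
p_value = stats.f.sf(F, k, n - k - 1)  # upper-tail probability of F
# Reject H0 at the 5% level if p_value < 0.05.
```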
Analysis of Variance (ANOVA) Table Regression analysis will generate several statistics that can be used to test significance and other important aspects of the regression. Using statistical software (like SAS), these statistics are summarized in an ANOVA table.
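One way to generate such a table is with the statsmodels package; a sketch, using the same hypothetical data as earlier:

```python
import statsmodels.api as sm

X_design = sm.add_constant(X)       # prepend a column of ones for b0
model = sm.OLS(Y, X_design).fit()   # ordinary least squares
print(model.summary())              # coefficients, r-squared, F-test, etc.
```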
Multiple Regression Analysis In multiple regression analysis, there is more than one independent variable. The technique expands naturally for k > 1 independent variables as: Y = β0 + β1X1 + β2X2 + … + βkXk + ε with sample estimate: Ŷ = b0 + b1X1 + b2X2 + … + bkXk
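In matrix form the least-squares estimates are b = (XᵀX)⁻¹Xᵀy; a minimal numpy sketch with hypothetical data for k = 2 predictors:

```python
import numpy as np

# Design matrix: a column of ones (for b0) plus two predictor columns.
X_mat = np.column_stack([np.ones(4), [1, 2, 3, 4], [2, 1, 4, 3]])
y = np.array([3.0, 4.5, 8.0, 8.5])

b, *_ = np.linalg.lstsq(X_mat, y, rcond=None)  # b = [b0, b1, b2]
```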
Textbook Example of Multiple Regression: House selling price is a function of size (square footage) and age. (Condition will be introduced later)
Example (cont.): Multiple Regression Estimate Ŷ = b0 + b1X1 + b2X2, where X1 = square footage and X2 = age. Regression: Ŷ = 60,815 + 22X1 – 1,449X2. Thus, for example, a 1,900 sq. ft. house that is 10 years old has an estimated price of 60,815 + 22(1,900) – 1,449(10) = $88,125
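The prediction step is just a linear formula; a small helper using the slide's fitted coefficients:

```python
def predict_price(sqft, age):
    """Estimated selling price from the fitted multiple regression."""
    return 60_815 + 22 * sqft - 1_449 * age

print(predict_price(1_900, 10))   # 88125
```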
House-Price Example (cont.) With Dummy Variables for Condition Same example and data as before, but add the house condition into the regression with dummy variables: X3 = 1 for “excellent”; X3 = 0 otherwise. X4 = 1 for “mint” (i.e., perfect); X4 = 0 otherwise. *Note: By implication, X3 = X4 = 0 for “good”. Regression: Ŷ = 48,329 + 28.2X1 – 1,981X2 + 16,581X3 + 23,684X4
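The same prediction with the condition dummies; mapping condition strings to X3 and X4 as below is an illustrative choice, not from the slide:

```python
def predict_price_cond(sqft, age, condition):
    # "good" is the baseline condition: X3 = X4 = 0.
    x3 = 1 if condition == "excellent" else 0
    x4 = 1 if condition == "mint" else 0
    return 48_329 + 28.2 * sqft - 1_981 * age + 16_581 * x3 + 23_684 * x4

print(predict_price_cond(1_900, 10, "mint"))   # 105783.0
```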
Textbook Example of Non-Linearity: How far a car can go on its petrol is a non-linear function of its weight. (Heavier cars don’t go as far!) MPG = b0 + b1X1 + b2X2, where X1 = weight and X2 = (weight)². Regression: Ŷ = 79.8 – 30.2X1 + 3.4X2. “MPG” = miles per gallon, i.e., how far the car can go on its petrol.
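A quadratic fit like this can be sketched with numpy's polyfit; the weights and MPG values below are hypothetical, since the slide reports only the fitted equation:

```python
import numpy as np

weight = np.array([2.0, 2.5, 3.0, 3.5, 4.0])    # assumed, in 1,000s of lbs
mpg = np.array([35.0, 28.0, 24.0, 21.0, 20.0])  # hypothetical MPG data

b2, b1, b0 = np.polyfit(weight, mpg, deg=2)     # returns highest power first
print(b0 + b1 * 3.2 + b2 * 3.2 ** 2)            # predicted MPG, 3,200-lb car
```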