SOC 206 Lecture 2: Logic of Multivariate Analysis; Multiple Regression
Multivariate Analysis
• Why multivariate analysis? Nothing happens by a single cause.
• If it did:
• it would imply perfect determinism,
• it would imply perfect/divine measurement,
• it would be impossible to separate cause from effect (where does the effect start and where does the cause end?).
• Social reality is notoriously multi-causal, even more so than certain physical/chemical/biological processes.
• People are not just objects but also subjects of causal processes: reflexivity, agency, framing, etc. (Some of these are hard to capture in statistical models.)
John Stuart Mill’s 3 Main Criteria of Causation (recall)
• #1. Empirical Association
• #2. Appropriate Time Order
• #3. Non-Spuriousness (Excluding other Forms of Causation)
• Mill tells us that even individual causal relationships cannot be established without multivariate analysis (#3).
• Suppose we suspect X causes Y: Y = f(X, e)
• Suppose we establish that X is related to Y (#1) and X precedes Y (#2).
• But what if both X and Y are the result of Z, a third variable?
• E.g., Academic Performance = f(Poverty, e)
• If that were true, redistributing income should help academic achievement.
• But maybe both are the result of parents’ education (a confounding factor).
• [Path diagrams: Poverty → Academic Performance (error e); and an alternative model in which Parents’ Education causes both Poverty (-) and Academic Performance (+), with errors e1 and e2, making the original association spurious.]
Excluding other Forms of Causation, or Eliminating Confounding Factors
• Eliminating or “controlling for” other, confounding factors (Z)
• Experiments: the treatment (X) is introduced by the researcher
• 1. Physical control: excluding factors by physical design (physical control of the Zs)
• 2. Randomization: random assignment to treatment and control groups (randomized control of the Zs)
• Observational research: no manipulation by the researcher
• 3. Quasi-experiments: found experiments; the choice of cases that are “minimum pairs”: they are the same on most confounding factors (Zs) but different in the treatment (X)
• 4. Statistical manipulation: removing the effect of Z from the relationship between Y and X
• Grouping: organizing the data into groups homogeneous in the control variable Z and looking at the relationship between treatment X and response Y within groups. If Y still moves together with X, it cannot be because they are moved by Z: Z is constant within each group, and if Z is the cause of Y and Z is constant, Y must be constant too.
• Residualizing: residualizing X on Z, then residualizing Y on Z. That leaves us with the parts of X and Y that are unrelated to Z. If the two residualized variables still move together, that cannot be because they are moved by Z. (Stata sketches of both strategies appear below and after the next slide.)
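A minimal sketch of control by grouping, using the API13 (response), MEALS (treatment) and AVG_ED (control) variables analyzed later in this lecture; the ED3 grouping variable is created here only for illustration:

. egen ED3 = cut(AVG_ED), group(3)
. bysort ED3: regress API13 MEALS

If the MEALS slope survives within groups that are (roughly) homogeneous in AVG_ED, the association cannot be driven by AVG_ED alone.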
Residualizing
• Remember: in a regression, the error (residual) is by construction unrelated to the independent variable(s).
• Residualizing X1 on X2 means regressing X1 on X2 and keeping the residual: X1i = c + dX2i + ui. The residual ui is the part of X1 that is unrelated to X2.
• Doing the same with Y gives the part of Y unrelated to X2; regressing the Y-residual on the X1-residual then yields the effect of X1 on Y with X2 controlled. (A sketch follows below.)
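A minimal Stata sketch of residualizing, again with API13 (response), MEALS (treatment) and AVG_ED (control) from the regressions shown later:

. regress MEALS AVG_ED
. predict MEALS_res, residuals
. regress API13 AVG_ED
. predict API13_res, residuals
. regress API13_res MEALS_res

The last slope equals the MEALS coefficient (.8187537) in the two-predictor regression a few slides below; this equivalence is the Frisch-Waugh-Lovell result.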
Multiple Regression with Two Independent Variables
• Yi = a + b1Xi + b2Zi + ei
• or
• Yi = a + b1X1i + b2X2i + ei
• To obtain a, b1, and b2 we first calculate β*1 and β*2 from the standardized regression.
• Then we transform them into their metric equivalents.
• Finally we obtain a with the help of the means of Y, X1 and X2.
Finding the Standardized (Path) Coefficients
• Start from the standardized equation: Zyi = β*1Zx1i + β*2Zx2i + ei
• 1. We multiply each side by Zx1i, sum across all cases, and divide by n. Since ΣZaiZbi/n = rab and the error is unrelated to the predictors, we get our first normal equation (for the correlation between Y and X1):
ryx1 = β*1 + β*2 rx1x2
which also gives an expression for β*1: β*1 = ryx1 - β*2 rx1x2.
• 2. We multiply each side by Zx2i and repeat the same steps. We get our second normal equation (for the correlation between Y and X2):
ryx2 = β*1 rx1x2 + β*2
• Plugging in the expression for β*1 (and vice versa), both standardized coefficients can be expressed in terms of the three correlations among Y, X1 and X2:
β*1 = (ryx1 - ryx2 rx1x2) / (1 - r²x1x2)
β*2 = (ryx2 - ryx1 rx1x2) / (1 - r²x1x2)
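As a check, these formulas reproduce the Stata output shown two slides below. With Y = API13, X1 = MEALS, and X2 = AVG_ED, the correlations are ryx1 = -.4743, ryx2 = .6706, and rx1x2 = -.8178, so
β*1 = (-.4743 - (.6706)(-.8178)) / (1 - .8178²) = .0741/.3312 ≈ .224
β*2 = (.6706 - (-.4743)(-.8178)) / (1 - .8178²) = .2827/.3312 ≈ .854
matching the Beta column (.2235 and .8534) up to the rounding of the correlations.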
Finding the Unstandardized (Metric) Coefficients
• We multiply each standardized coefficient by the ratio of the standard deviation of the dependent variable to that of the independent variable it belongs to:
b1 = β*1 (sy/sx1),  b2 = β*2 (sy/sx2),  and a = Ȳ - b1X̄1 - b2X̄2
• Take the two normal equations:
ryx1 = β*1 + β*2 rx1x2
ryx2 = β*1 rx1x2 + β*2
• What do we learn from the normal equations?
• If either β*2 = 0 or rx1x2 = 0, the unconditional effect (ryx1) does not change once we control for X2.
• We get suppression only if β*2 ≠ 0 and rx1x2 ≠ 0, and they are of opposite signs if the unconditional effect is positive, or of the same sign if the unconditional effect is negative.
• The correlation (unconditional effect) of X1 or X2 with Y can be decomposed into two parts. Take X1: ryx1 = β*1 + β*2 rx1x2, i.e.
• the direct (or net) effect of X1 on Y (β*1), controlling for X2,
• plus something else: the product of the direct (or net) effect of X2 on Y (β*2) and the correlation between X1 and X2 (rx1x2), the measure of multicollinearity between the two independent variables.
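A worked check using the means and standard deviations reported on the next slide (sY = 102.2096, sAVG_ED = .758739, sMEALS = 27.9053):
bAVG_ED = .853387 × (102.2096/.758739) ≈ 114.96
bMEALS = .223536 × (102.2096/27.9053) ≈ .8188
a = 784.182 - 114.9596 × 2.781778 - .8187537 × 58.57338 ≈ 416.43
reproducing the Coef. column of the regression output.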
Path Analysis
• Bivariate model: AP = f(P, e1), standardized: ZAP = β*’1 ZP + e1
• Full model: AP = f(P, PE, e), standardized: ZAP = β*1 ZP + β*2 ZPE + e
• [Path diagrams: Poverty → Academic Performance labeled β*’1 (error e1); and Poverty (β*1) and Parents’ Education (β*2) both pointing to Academic Performance, with Poverty and Parents’ Education correlated (error e).]
The Multiple Regression Model

. regress API13 AVG_ED MEALS, beta

      Source |       SS       df       MS              Number of obs =   10173
-------------+------------------------------           F(  2, 10170) = 4441.76
       Model |    49544993     2  24772496.5           Prob > F      =  0.0000
    Residual |  56719871.2 10170  5577.17514           R-squared     =  0.4662
-------------+------------------------------           Adj R-squared =  0.4661
       Total |   106264864 10172  10446.8014           Root MSE      =   74.68

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      AVG_ED |   114.9596   1.695597    67.80   0.000                 .853387
       MEALS |   .8187537   .0461029    17.76   0.000                .2235364
       _cons |   416.4326   7.135849    58.36   0.000                       .
------------------------------------------------------------------------------

. correlate AVG_ED API13 MEALS, means
(obs=10173)

    Variable |       Mean   Std. Dev.       Min        Max
-------------+----------------------------------------------
      AVG_ED |   2.781778    .758739          1          5
       API13 |    784.182   102.2096        311        999
       MEALS |   58.57338    27.9053          0        100

             |   AVG_ED    API13    MEALS
-------------+----------------------------
      AVG_ED |   1.0000
       API13 |   0.6706   1.0000
       MEALS |  -0.8178  -0.4743   1.0000
Basic Path Analysis
• Bivariate model: the total effect of Poverty is ryx1 = β*’1 = -.4743
• Full model: direct effect of Poverty β*1 = .2235, direct effect of Parents’ Education β*2 = .8534, and rx1x2 = -.8178
• Spurious indirect effect of Poverty: β*2 × rx1x2 = .853387 × (-.8178) ≈ -.6979
• Check: direct + spurious indirect = .2235 + (-.6979) ≈ -.4743 = ryx1. Note the suppression: the direct effect (+.2235) has the opposite sign of the total effect (-.4743).
• [Path diagrams: Poverty → Academic Performance labeled β*’1 = -.4743 (error e1); and the full model with Poverty (β*1 = .2235) and Parents’ Education (β*2 = .8534) both pointing to Academic Performance, linked by rx1x2 = -.8178 (error e).]
Basic Path Analysis
• Bivariate model: the total effect of Parents’ Education is ryx2 = β*’2 = .6706
• Full model: direct effect of Parents’ Education β*2 = .8534, direct effect of Poverty β*1 = .2235, and rx1x2 = -.8178
• Indirect effect of Parents’ Education through its correlation with Poverty: β*1 × rx1x2 = .2235364 × (-.8178) ≈ -.1828
• Check: direct + indirect = .8534 + (-.1828) ≈ .6706 = ryx2.
• [Path diagrams: Parents’ Education → Academic Performance labeled β*’2 = .6706 (error e1); and the same full model as on the previous slide (error e).]
Fit (R-square)
• [Venn diagrams: Y, X1 and X2 as overlapping circles; one panel with little overlap between X1 and X2, one with heavy overlap.]
• R-square = unique contribution of X1 + unique contribution of X2 + common contribution of X1 and X2
• Multicollinearity: the unique contributions are small and statistically non-significant, yet R-square is large because the common contribution is large. (A numeric diagnostic follows below.)
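A standard numeric diagnostic for multicollinearity is the variance inflation factor, VIFj = 1/(1 - R²j), where R²j is the fit from regressing independent variable j on the other independent variables. A minimal Stata sketch, using the unstandardized version of the regression above (estat vif is the usual post-estimation command after regress):

. regress API13 AVG_ED MEALS
. estat vif

With only two predictors, R²j = r²x1x2 = .8178² ≈ .669, so VIF ≈ 1/.331 ≈ 3.0 for both.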
Nested Regression Equations
• Comparing theories: how much a theory adds to an already existing one
• Calculating the contribution of a set of variables: the increment in R²
• F = [(R²2 - R²1)/(K2 - K1)] / [(1 - R²2)/(N - K2 - 1)], with (K2 - K1, N - K2 - 1) degrees of freedom,
• where R²1 is the fit of the smaller model and R²2 is the fit of the full model,
• K1 is the number of independent variables in the smaller model and K2 is the number of independent variables in the full model,
• and N is the sample size.
• Warning: you have to make sure you use the exact same cases for each model!
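A worked example using the two nested regressions shown later in this lecture: the smaller model (K1 = 6 predictors) has R²1 = .6370, the full model (K2 = 13) has R²2 = .6577, and N = 10082:
F = [(.6577 - .6370)/(13 - 6)] / [(1 - .6577)/(10082 - 13 - 1)] ≈ .00296/.000034 ≈ 87
with (7, 10068) degrees of freedom, far beyond any conventional critical value: the seven added variables jointly improve the fit. (After estimating the full model, Stata’s test command, e.g. test PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR, gives the same F test.)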
Adjusted R-square
• Adding a new independent variable will always improve the fit, even if it is unrelated to the dependent variable.
• We have to consider the parsimony (number of independent variables) of the model relative to the sample size.
• For N = 2, a simple regression will always have a perfect fit.
• General rule: N - 1 independent variables will always result in an R-squared of 1, no matter what those variables are.
• Adjusted R² = 1 - (1 - R²)(N - 1)/(N - K - 1), where K is the number of independent variables.
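Checking the formula against the full 13-predictor model shown later: R² = .6577, N = 10082, K = 13, so
Adjusted R² = 1 - (1 - .6577)(10081/10068) ≈ .6572
the Adj R-squared reported in that output (tiny differences come from rounding R² to four digits).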
Multiple Regression with K Independent Variables
• Yi = a + b1X1i + b2X2i + .... + bkXki + ei
• If we standardize Y, X1 … Xk, turning them into Z-scores, we can re-write the equation as
• Zyi = β*1Zx1i + β*2Zx2i + … + β*kZxki + ei
• To find the coefficients we have to write out k normal equations, one for the correlation between each independent variable and the dependent variable:
ryx1 = β*1 + β*2 rx1x2 + ….. + β*k rx1xk
ryx2 = β*1 rx1x2 + β*2 + ….. + β*k rx2xk
……………….
ryxk = β*1 rx1xk + β*2 rx2xk + ….. + β*k
• and solve the k equations for the k unknowns (β*1, β*2, …, β*k).
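In matrix notation the k normal equations are Rxx β* = ryx, where Rxx is the k×k correlation matrix among the independent variables and ryx is the k×1 vector of their correlations with Y; hence β* = Rxx⁻¹ ryx. For k = 2 this inverse yields exactly the two formulas derived earlier.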
The Correlations . correlate API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR (obs=10082) | API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS ----------------+------------------------------------------------------------------------------------------ API13 | 1.0000 MEALS | -0.4876 1.0000 AVG_ED | 0.6736 -0.8232 1.0000 P_EL | -0.3039 0.6149 -0.6526 1.0000 P_GATE | 0.2827 -0.1631 0.2126 -0.1564 1.0000 EMER | -0.0987 0.0197 -0.0407 -0.0211 -0.0541 1.0000 DMOB | 0.5413 -0.0693 0.2123 0.0231 0.2198 -0.0487 1.0000 PCT_AA | -0.2215 0.1625 -0.1057 -0.0718 0.0334 0.1380 -0.1306 1.0000 PCT_AI | -0.1388 0.0461 -0.0246 -0.1510 -0.0812 0.0180 -0.1138 -0.0684 1.0000 PCT_AS | 0.3813 -0.3031 0.3946 -0.0954 0.2321 -0.0247 0.1620 -0.0475 -0.0902 1.0000 PCT_FI | 0.1646 -0.1221 0.1687 -0.0526 0.1281 0.0007 0.1203 0.0578 -0.0788 0.2485 PCT_HI | -0.4301 0.6923 -0.8007 0.7143 -0.1296 -0.0192 -0.0193 -0.0911 -0.1834 -0.3733 PCT_PI | -0.0598 0.0533 -0.0228 0.0286 0.0091 0.0315 -0.0202 0.2195 -0.0311 0.0748 PCT_MR | 0.1468 -0.3714 0.3933 -0.3322 0.0052 0.0102 -0.0928 -0.0053 0.0667 0.0904 | PCT_FI PCT_HI PCT_PI PCT_MR -----------------+------------------------------------ PCT_FI | 1.0000 PCT_HI | -0.1488 1.0000 PCT_PI | 0.2769 -0.0763 1.0000 PCT_MR | 0.0928 -0.4700 0.0611 1.0000
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F(  6, 10075) = 2947.08
       Model |  65503313.6     6  10917218.9           Prob > F      =  0.0000
    Residual |  37321960.3 10075  3704.41293           R-squared     =  0.6370
-------------+------------------------------           Adj R-squared =  0.6368
       Total |   102825274 10081  10199.9081           Root MSE      =  60.864

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |   .1843877   .0394747     4.67   0.000                .0508435
      AVG_ED |   92.81476   1.575453    58.91   0.000                .6976283
        P_EL |   .6984374   .0469403    14.88   0.000                .1225343
      P_GATE |   .8179836   .0666113    12.28   0.000                .0769699
        EMER |  -1.095043   .1424199    -7.69   0.000                -.046344
        DMOB |   4.715438   .0817277    57.70   0.000                .3746754
       _cons |   52.79082   8.491632     6.22   0.000                       .
------------------------------------------------------------------------------

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6, beta

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F( 13, 10068) = 1488.01
       Model |    67627352    13     5202104           Prob > F      =  0.0000
    Residual |  35197921.9 10068  3496.01926           R-squared     =  0.6577
-------------+------------------------------           Adj R-squared =  0.6572
       Total |   102825274 10081  10199.9081           Root MSE      =  59.127

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |    .370891   .0395857     9.37   0.000                .1022703
      AVG_ED |   89.51041   1.851184    48.35   0.000                .6727917
        P_EL |   .2773577   .0526058     5.27   0.000                .0486598
      P_GATE |   .7084009   .0664352    10.66   0.000                .0666584
        EMER |  -.7563048   .1396315    -5.42   0.000                -.032008
        DMOB |   4.398746   .0817144    53.83   0.000                 .349512
      PCT_AA |  -1.096513   .0651923   -16.82   0.000               -.1112841
      PCT_AI |  -1.731408   .1560803   -11.09   0.000               -.0718944
      PCT_AS |   .5951273   .0585275    10.17   0.000                .0715228
      PCT_FI |   .2598189   .1650952     1.57   0.116                .0099543
      PCT_HI |   .0231088   .0445723     0.52   0.604                .0066676
      PCT_PI |  -2.745531   .6295791    -4.36   0.000               -.0274142
      PCT_MR |  -.8061266   .1838885    -4.38   0.000               -.0295927
       _cons |   96.52733   9.305661    10.37   0.000                       .
------------------------------------------------------------------------------
Special Schools (Outliers)

GOOD ONES
  Residual   Name                                       Tested/Enrolled
  506.0523   Muir Charter                               78/78
  488.5563   SIATech                                    65/66
  342.7693   Escuela Popular/Center for Training and    88/91
  280.2587   YouthBuild Charter School of California    78/78
  246.7804   Oakland Charter Academy                    238/238
  232.4897   Oakland Charter High                       146/146
  230.0739   Opportunities For Learning - Baldwin Par   1434/1442

BAD ONES
 -399.4998   Sierra Vista High (SD)                     14/15
 -342.2773   Baden High (Continuation)                  73/73
 -336.5667   Dover Bridge to Success                    84/88
 -322.1879   Millennium High Alternative                43/49
 -318.0444   Aurora High (Continuation)                 128/131
 -315.5069   Sunrise (Special Education)                34/34
 -311.1326   Nueva Vista High                           20/28
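A minimal sketch of how such a list can be produced in Stata after the regression above; the school-name variable (here called SNAME) is an assumption, and any identifier in the dataset would do:

. predict res if e(sample), residuals
. gsort -res
. list SNAME res in 1/7
. gsort res
. list SNAME res in 1/7

The first list shows the schools that do much better than the model predicts (the “good” outliers), the second those that do much worse.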
Multiple Regression Weighted by the Number of Test Takers (TESTED)

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6 [aweight = TESTED], beta
(sum of wgt is 9.0302e+06)

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F( 13, 10068) = 2324.54
       Model |  41089704.2    13  3160746.48           Prob > F      =  0.0000
    Residual |  13689769.3 10068  1359.73076           R-squared     =  0.7501
-------------+------------------------------           Adj R-squared =  0.7498
       Total |  54779473.6 10081   5433.9325           Root MSE      =  36.875

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |   .2401007    .032364     7.42   0.000                .0828479
      AVG_ED |   83.84621   1.444873    58.03   0.000                .8044588
        P_EL |   .1605591   .0405248     3.96   0.000                .0306712
      P_GATE |   .2649964   .0443791     5.97   0.000                .0317522
        EMER |  -1.527603   .1503635   -10.16   0.000               -.0513386
        DMOB |   3.414537   .0834016    40.94   0.000                .2212861
      PCT_AA |  -1.275241   .0583403   -21.86   0.000               -.1301146
      PCT_AI |   -1.96138   .2143326    -9.15   0.000               -.0499468
      PCT_AS |   .4787539   .0368303    13.00   0.000                 .082836
      PCT_FI |  -.0272983   .1113346    -0.25   0.806               -.0013581
      PCT_HI |   .0440935   .0351466     1.25   0.210                .0158328
      PCT_PI |  -2.464109   .5116525    -4.82   0.000               -.0271533
      PCT_MR |  -.5071886   .1678521    -3.02   0.003               -.0187953
       _cons |   220.2237   9.318893    23.63   0.000                       .
------------------------------------------------------------------------------
Best Linear Unbiased Estimate (BLUE)
• Characteristics of OLS if the sample is a probability sample:
• Unbiased: E(b) = β, the mean sample value is the population value
• Efficient: minimum variance, the sample values are as close to each other as possible
• Consistent: as the sample size (n) approaches infinity, the sample value converges on the population value
• These hold if the following assumptions are met:
• The model is: complete, linear, additive
• Variables are: measured at an interval or ratio scale, measured without error
• The regression error term: is normally distributed, has an expected value of 0, errors are independent of each other, has constant variance (homoscedasticity), and the predictors are unrelated to the error
• In a system of interrelated equations the errors are unrelated to each other
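Several of the error assumptions can be inspected in Stata right after fitting the model; a minimal sketch using standard post-estimation commands:

. estat hettest
. predict r, residuals
. qnorm r
. rvfplot

estat hettest is the Breusch-Pagan test for heteroscedasticity; the quantile-normal plot (qnorm) checks the normality of the residuals; and the residual-versus-fitted plot (rvfplot) helps spot nonlinearity.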