Sociology 601 Class 19: November 3, 2008

• Review of correlation and standardized coefficients • Statistical inference for the slope (9.5) • Violations of model assumptions, and their effects (9.6)

9.5 Inference for a slope. • Problem: we have measures of the strength of the linear association between two variables, but no measure of the statistical significance of that association. • We know the slope & intercept for our sample; what can we say about the slope & intercept for the population? • Solution: hypothesis tests and confidence intervals for a slope. • Both need a standard error for the coefficients. • Difficulties: additional assumptions, and complications in estimating a standard error for a slope.

Assumptions Needed to Make Population Inferences for Slopes. • The sample is selected randomly. • X and Y are interval scale variables. • The mean of Y is related to X by the linear equation E{Y} = α + βX. • The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity) • The conditional distribution of Y at each value of X is normal. • There is no error in the measurement of X.

Common Ways to Violate These Assumptions • Assumption: The sample is selected randomly. • Cluster sampling (e.g., census tracts / neighborhoods) causes observations within a cluster to be more similar to each other than to observations outside the cluster. • Two or more siblings in the same family. • Sample = population (e.g., states in the U.S.) • Assumption: X and Y are interval scale variables. • Ordinal scale attitude measures • Nominal scale categories (e.g., race/ethnicity, religion)

Common Ways to Violate These Assumptions (2) • Assumption: The mean of Y is related to X by the linear equation E{Y} = α + βX. • U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita) • Thresholds: • Logarithmic (e.g., earnings <- education) • Assumption: The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity) • earnings <- education • hours worked <- time • adult child occupational status <- parental occupational status

Common Ways to Violate These Assumptions (3) • Assumption: The conditional distribution of Y at each value of X is normal. • earnings (skewed) <- education • Y is binary, or a % • Assumption: There is no error in the measurement of X. • almost everything is measured with some error • what is the effect of measurement error in X on b?

The Null Hypothesis for Slopes • Null hypothesis: the variables are statistically independent. • H0: β = 0. The null hypothesis is that there is no linear relationship between X and Y. • Implication for α: E{Y} = α + 0*X = α; • α = μ_Y. • (Draw figure of distribution of Y, X when H0 is true)

Test Statistic for Slopes • What is the range of b’s we would get if we took repeated samples from a population and calculated b for each of those samples? • That is, what is the standard error of the sample slope b? • Test statistic: t = b / σ̂_b • where σ̂_b is the estimated standard error of the sample slope b. • df for the t statistic (with one x-variable) is n-2 • when n is large, the t statistic is asymptotically equivalent to a z-statistic • What would make σ̂_b smaller?
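
The repeated-sampling question above can be explored by simulation. The following is a minimal sketch, not part of the original lecture: the population values (alpha = 2, beta = 0.5, sigma = 3), the uniform distribution of X, and the function names are all assumptions chosen for illustration. It draws many samples, computes b for each, and reports the spread of those b's, which is what the standard error of b estimates.

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(n=50, reps=2000, alpha=2.0, beta=0.5, sigma=3.0, seed=601):
    """Draw `reps` samples of size n from Y = alpha + beta*X + error; return each b.

    All population values here are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0, 10) for _ in range(n)]
        y = [alpha + beta * xi + rng.gauss(0, sigma) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes


slopes = simulate_slopes()
print("mean of the sample slopes:", round(statistics.fmean(slopes), 3))
print("sd of the sample slopes (empirical s.e. of b):", round(statistics.stdev(slopes), 3))
```

Rerunning with a larger n, a wider range of X, or a smaller sigma should shrink the spread of the simulated slopes.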

Calculating the s.e. of b • σ̂_b = σ̂ / (s_X * sqrt(n-1)) • where σ̂ = sqrt(SSE/(n-2)) (= root MSE) • the standard error of b is smaller when… • the sample size is large • the standard deviation of X is large (there is a wide range of X values) • the conditional standard deviation of Y is small.
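
As a rough illustration of the formula on this slide, the sketch below computes b and its standard error from raw data in plain Python. The function name `slope_and_se` and the six data points are hypothetical, invented only to show the arithmetic: root MSE from SSE/(n-2), then division by s_X * sqrt(n-1).

```python
import math


def slope_and_se(x, y):
    """Return (b, se_b) for a one-predictor least-squares regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # sample slope
    a = y_bar - b * x_bar              # sample intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))          # root MSE
    s_x = math.sqrt(sxx / (n - 1))                # sample sd of X
    se_b = sigma_hat / (s_x * math.sqrt(n - 1))   # formula from the slide
    return b, se_b


# Hypothetical toy data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

b, se_b = slope_and_se(x, y)
print("b =", round(b, 4), " se_b =", round(se_b, 4), " t =", round(b / se_b, 2))
```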

Conclusions about Population • P-value: • calculated as in any t-test, but remember df = n-2 • a z-test is an acceptable approximation when n > 30 or so • Conclusions: • evaluate the p-value based on a previously selected alpha level • Rule of thumb: for significance at alpha = .05, b should be at least about 2x its standard error in absolute value.

Example of Inference about a Slope • In an analysis of poverty and crime in the 50 states plus DC, a computer output provides the following: • E{Murder rate} = -10.14 + 1.323*{Poverty rate} • (Poverty rate in %, murder rate per 100,000) • SSE = 3904.3 SST = 5743.3 • N = 51 Sx = 4.584 • Do a hypothesis test to determine whether there is a linear relationship between crime rates and poverty rates.

Stata Example of Inference about a Slope • In an analysis of poverty and crime in the 50 states plus DC, Stata output provides the following:

. regress murder poverty

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   23.08
       Model |  1839.06931     1  1839.06931           Prob > F      =  0.0000
    Residual |  3904.25223    49  79.6786169           R-squared     =  0.3202
-------------+------------------------------           Adj R-squared =  0.3063
       Total |  5743.32154    50  114.866431           Root MSE      =  8.9263

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |    1.32296   .2753711     4.80   0.000     .7695805    1.876339
       _cons |   -10.1364   4.120616    -2.46   0.017    -18.41708   -1.855707
------------------------------------------------------------------------------

• Interpret whether there is a linear relationship between crime rates and poverty rates.

Example of Inference about a Slope • SSE = 3904.3 SST = 5743.3 • N = 51 Sx = 4.584 • b = 1.323 • se_b = sqrt(SSE / (n-2)) / (s_x * sqrt(n-1)) • = sqrt(3904.3/49) / (4.584 * sqrt(50)) • = sqrt(79.68) / (4.584 * 7.071) • = 8.926 / 32.413 • = 0.275 • t = b / se_b = 1.323 / 0.275 = 4.81 • p < .001 • 95% confidence interval for b = 0.771 to 1.875
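
The hand calculation can be checked with a few lines of Python. This is only a sketch using the summary values given in the example (SSE, n, b, and Sx = 4.584 from the problem statement); the variable names are my own. The p-value is omitted because it needs a t distribution table or a statistics library.

```python
import math

# Summary values taken from the murder/poverty example.
sse = 3904.3   # sum of squared errors
n = 51         # 50 states plus DC
s_x = 4.584    # sample standard deviation of the poverty rate
b = 1.323      # sample slope

sigma_hat = math.sqrt(sse / (n - 2))          # root MSE
se_b = sigma_hat / (s_x * math.sqrt(n - 1))   # standard error of the slope
t_stat = b / se_b                             # test statistic, df = n - 2

print("root MSE =", round(sigma_hat, 3))
print("se_b     =", round(se_b, 4))
print("t        =", round(t_stat, 2), "with df =", n - 2)
```

Carrying full precision gives t of about 4.80, matching the Stata output; rounding se_b to 0.275 before dividing gives 4.81.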

Confidence interval for a slope. • Confidence interval for a slope: • c.i. = b ± t*σ̂_b • the standard t-score for a 95% confidence interval is t.025, with df = n-2 • An alternative to a confidence interval is to report both b and σ̂_b.

Example of Confidence Interval of a Slope • SSE = 3904.3 SST = 5743.3 • N = 51 Sx = 4.584 • b = 1.323 • se_b = 0.275 • 95% confidence interval for • b = 1.323 ± 2.009*0.275 • = 1.323 ± 0.552 • = 0.771 to 1.875
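
A short sketch of the interval arithmetic, assuming b = 1.323, se_b = 0.275, and the critical value t.025 = 2.009 for df = 49 quoted on this slide. The helper name `slope_confidence_interval` is hypothetical. For contrast, the sketch also shows the slightly narrower interval that the large-sample z value 1.96 would give.

```python
def slope_confidence_interval(b, se_b, crit):
    """Return (lower, upper) for b +/- crit * se_b."""
    margin = crit * se_b
    return b - margin, b + margin


b = 1.323
se_b = 0.275

# t critical value for df = 49, as quoted on the slide.
t_low, t_high = slope_confidence_interval(b, se_b, crit=2.009)
# Large-sample z critical value, shown only for comparison.
z_low, z_high = slope_confidence_interval(b, se_b, crit=1.96)

print("95% CI using t =", round(t_low, 3), "to", round(t_high, 3))
print("95% CI using z =", round(z_low, 3), "to", round(z_high, 3))
```

The t-based interval agrees closely with the interval in the Stata output (.7696 to 1.8763); the small gap comes from rounding b and se_b.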

Inference for a slope using STATA

. regress attend regul

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  1,    16) =    9.65
       Model |  2240.05128     1  2240.05128           Prob > F      =  0.0068
    Residual |  3715.94872    16  232.246795           R-squared     =  0.3761
-------------+------------------------------           Adj R-squared =  0.3371
       Total |        5956    17  350.352941           Root MSE      =   15.24

------------------------------------------------------------------------------
      attend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       regul |  -5.358974    1.72555    -3.11   0.007    -9.016977   -1.700972
       _cons |   36.83761   5.395698     6.83   0.000     25.39924    48.27598
------------------------------------------------------------------------------

• The significance test and confidence interval for b appear on the line with the name of the x-variable. • Can you find SSE and SST? df for the model? r?

Things to watch out for: extrapolation. • Extrapolation beyond observed values of X is dangerous. • The pattern may be nonlinear. • Even if the pattern is linear, the standard errors of predictions grow as X moves farther from the observed data. • Be especially careful interpreting the Y-intercept: it may lie outside the observed data. • e.g., year zero • e.g., zero education in the U.S. • e.g., zero parity

Things to watch out for: outliers • Influential observations and outliers may unduly influence the fit of the model. • The slope and standard error of the slope may be affected by influential observations. • This is an inherent weakness of least squares regression. • You may wish to estimate two models, one with and one without the influential observations, and compare them.
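
The with-and-without comparison suggested above might look like the following sketch. The data are hypothetical: eight points lying close to a line with slope near 1, plus one invented influential observation far out on the X axis. The function name `fit_line` is also an assumption.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 8.0]

# One influential observation: extreme on X and far from the pattern on Y.
x_with = x + [20]
y_with = y + [2.0]

a_without, b_without = fit_line(x, y)
a_with, b_with = fit_line(x_with, y_with)

print("slope without the influential point:", round(b_without, 3))
print("slope with the influential point:   ", round(b_with, 3))
```

In this toy case a single observation is enough to flatten the fitted slope almost to zero, which is why reporting both models can be informative.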

Things to watch out for: truncated samples • Truncated samples cause problems opposite to those of influential observations and outliers. • Truncation on the X axis reduces the correlation coefficient for the remaining data. • Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors. • Examples: top-coded income data; health as measured by number of days spent in a hospital in a year.

Things to watch out for: measurement error • Random error in the measurement of the X variable biases the slope b toward zero and makes the correlation appear weaker. • This problem can be a measurement issue or an interpretation issue.
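
The weakening effect of measurement error in X can be seen in a small simulation. This is a sketch under stated assumptions, not a result from the lecture: the error is taken to be purely random (uncorrelated with the true X and with the errors in Y), the true slope is set to 2, and the measurement-error variance is set equal to the variance of X. Under those assumptions the expected slope on the noisy X is about half the true slope.

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(19)   # fixed seed so the run is repeatable
n = 5000

# Hypothetical population: Y = 1 + 2*X + error, with X ~ Normal(0, 1).
true_x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in true_x]

# Observed X = true X plus random measurement error with sd 1.
noisy_x = [xi + rng.gauss(0, 1) for xi in true_x]

b_true = ols_slope(true_x, y)
b_noisy = ols_slope(noisy_x, y)

print("slope using the true X: ", round(b_true, 2))
print("slope using the noisy X:", round(b_noisy, 2))
```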
