Statistics Micro Mini Multiple Regression. January 5-9, 2008 Beth Ayers. Tuesday 1pm-4pm Session. Dummy Variables Multiple regression Using quantitative and categorical explanatory variables Interactions among explanatory variables Linear regression vs. ANCOVA Two article critiques.
Dummy Variables • Categorical explanatory variables can be used in a linear regression if they are coded as dummy variables • For binary variables, the most frequently used codes are 0/1 and -1/+1 • For a nominal variable with k levels, create k-1 explanatory variables that are each 0/1 • each subject can have a value of one for at most one of the explanatory variables
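The k-1 coding rule can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the function name `dummy_code` and the tutor data are made up for the example.

```python
def dummy_code(values, baseline):
    """Code a nominal variable with k levels as k-1 columns of 0/1.

    The baseline level gets 0 in every column, so each subject has a 1
    in at most one column.
    """
    levels = sorted(set(values))
    if baseline not in levels:
        raise ValueError(f"baseline {baseline!r} does not occur in the data")
    kept = [level for level in levels if level != baseline]
    # One 0/1 column per non-baseline level.
    return {level: [1 if v == level else 0 for v in values] for level in kept}


tutors = ["A", "B", "C", "B", "A"]  # made-up data, three levels
coded = dummy_code(tutors, baseline="A")
print(coded)
```

With three levels the result has two columns, and a subject at the baseline level has 0 in both.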
Dummy Variables • Suppose we’d like to use a categorical variable that indicates which tutor (A or B) the student used. Define:
Significance testing • To test if X2 has an effect • H0: β2 = 0 • H1: β2 ≠ 0 • This is the usual t-test for a regression coefficient; we don’t need to do anything different for dummy variables • If β2 = 0, then there is no difference between the mean response of Tutor A and Tutor B • If β2 ≠ 0, then β2 is the difference between the mean response for Tutor A and Tutor B, holding X1 fixed
Interpretation • Y = β0 + β1*X1 + β2*X2 • Can think of this as two equations • When X2 = 0 • Y = β0 + β1*X1 • When X2 = 1 • Y = β0 + β1*X1 + β2*1 = (β0 + β2) + β1*X1 • Then β0 + β2 is the new intercept for the case where X2 = 1
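Dummy Variables • Suppose we have three tutors (A, B, C). Define: • X2 = 1 if the student used Tutor B, 0 otherwise • X3 = 1 if the student used Tutor C, 0 otherwise • Tutor A is considered the baseline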
Interpretation • Y = β0 + β1*X1 + β2*X2 + β3*X3 • Can think of this as three equations • When X2 = 0 and X3 = 0 • Y = β0 + β1*X1 • When X2 = 1 and X3 = 0 • Y = β0 + β1*X1 + β2*1 = (β0 + β2) + β1*X1 • When X2 = 0 and X3 = 1 • Y = β0 + β1*X1 + β3*1 = (β0 + β3) + β1*X1
Interpretation • β2 is then the difference between the mean response for Tutor A and Tutor B • β3 is then the difference between the mean response for Tutor A and Tutor C • To formally compare Tutor B to Tutor C, one must rerun the regression using either Tutor B or C as the baseline • To informally compare them, one can look at the difference between β2 and β3
Significance testing • Again, we can use the usual t-test for a regression coefficient; we don’t need to do anything different
Example • Want to see if there is a gender effect in predicting efficiency • Efficiency = β0 + β1*WPM + β2*Gender • where Gender = 0 for males and 1 for females
Example • Step 1 • F-statistic: 791 • P-value = 0.0000 • So at least one of the two variables is important in predicting Efficiency
Example • Step 2 • Test words per minute • T-statistic: -38.34 • P-value = 0.000 • Test Gender • T-statistic: -11.32 • P-value = 0.000 • Both words per minute and gender are important in predicting efficiency
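The two-step check (overall F-statistic, then a t-statistic for each coefficient) can be reproduced by hand for a small data set. The sketch below is only an illustration under stated assumptions: the data are made up (they are not the efficiency data from the slides), the helper names `invert` and `ols` are hypothetical, and p-values are omitted because they require the t and F distributions, which a statistics package would supply.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        if abs(scale) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(rows, y):
    """Least-squares fit with an intercept.

    Returns (coefficients, t-statistics, overall F-statistic).
    """
    X = [[1.0] + list(r) for r in rows]
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, X[i])) for i in range(n)]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sigma2 = sse / (n - p)  # residual variance estimate
    t_stats = [beta[a] / (sigma2 * inv[a][a]) ** 0.5 for a in range(p)]
    f_stat = ((sst - sse) / (p - 1)) / sigma2
    return beta, t_stats, f_stat


# Made-up data: x1 is quantitative, x2 is a 0/1 dummy variable.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
y = [12.1, 10.9, 15.9, 15.1, 20.1, 18.9, 23.9, 23.1]

beta, t_stats, f_stat = ols(list(zip(x1, x2)), y)
print("Step 1, F-statistic:", round(f_stat, 1))
print("Step 2, t-statistics (intercept, x1, x2):", [round(t, 2) for t in t_stats])
```

The dummy variable's t-statistic is computed exactly like the one for the quantitative variable, which is the point made above: no special treatment is needed.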
Example • Regression Equations • Males • Efficiency = 84.77 – 0.49*WPM • Females • Efficiency = 84.77 – 0.49*WPM – 3.14*1 • Efficiency = 81.63 – 0.49*WPM
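Interpretation of the Parameters • For words per minute: holding gender constant, for each additional word per minute that a student can type, predicted Efficiency decreases by 0.49 minutes • For Gender: holding words per minute constant, predicted Efficiency for females is, on average, 3.14 minutes lower than for males; if a lower Efficiency value (in minutes) means a more efficient student, females are more efficient by 3.14 minutes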
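The fitted equations can be checked numerically. A small sketch using the coefficients quoted above; the function name is made up, and the coding Gender = 0 for males and 1 for females is inferred from the two regression equations rather than stated on the slide.

```python
B0, B_WPM, B_GENDER = 84.77, -0.49, -3.14  # fitted values quoted on the slide


def predicted_efficiency(wpm, gender):
    """gender: 0 = male, 1 = female (coding inferred from the two equations)."""
    return B0 + B_WPM * wpm + B_GENDER * gender


# Intercept of the female equation.
female_intercept = predicted_efficiency(0, 1)
# Female minus male at the same typing speed (50 WPM is an arbitrary choice).
gender_gap = predicted_efficiency(50, 1) - predicted_efficiency(50, 0)
# Effect of one extra word per minute with gender held fixed.
per_wpm = predicted_efficiency(51, 0) - predicted_efficiency(50, 0)

print(round(female_intercept, 2), round(gender_gap, 2), round(per_wpm, 2))
```

Because the model has no interaction term, the gender gap is the same at every typing speed and the per-WPM change is the same for both genders.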
Interaction • An interaction occurs between two or more explanatory variables (not between an explanatory variable and the response variable) • An interaction occurs when the effect of a change in the level or value of one explanatory variable depends on the level or value of another explanatory variable • In regression we account for an interaction by adding a variable that is the product of two existing explanatory variables
Interpretation • Y = β0 + β1*X1 + β2*X2 + β3*X1*X2 • Assume that X2 is a dummy variable and X1*X2 is the interaction • Again, can think of this as two equations • When X2 = 0 • Y = β0 + β1*X1 • When X2 = 1 • Y = β0 + β1*X1 + β2*1 + β3*X1*1 = (β0 + β2) + (β1 + β3)*X1 • We can think of this as a new intercept and new slope for the case where X2 = 1
Interpretation • Y = β0 + β1*X1 + β2*X2 + β3*X1*X2 • β3 is called the interaction effect • β1 and β2 are called main effects
Interpretation • If β3 is not significant, drop the interaction and rerun the regression • Including the interaction when it is not significant can alter the interpretations of the other variables • If β3 is significant, we do not need to check whether β1 and β2 are significant; we always keep X1 and X2 in the regression
Interaction Example • Suppose we have two versions of a tutor and we want to know which helps students study for a math test • In addition, we want to know if a student’s SAT math score affects their exam score • We know which tutor each student used and we also have their SAT math score and exam score
Interaction Example • Sample output
Interaction Example • Step 1: are any of the variables significant in predicting exam score? • F-statistic: 6025 • P-value = 0.000 • Step 2: check interaction first • T-statistic: 15.980 • P-value = 0.000 • Do not need to check main effects since the interaction is significant
Interaction Example • Regression equation • Tutor A (tutor = 1) • Exam score = (2.62 + 6.39) + (0.06 + 0.05)*MathSAT • Exam score = 9.01 + 0.11*MathSAT • Tutor B (tutor = 0) • Exam score = 2.62 + 0.06*MathSAT
Interpretation of Coefficients • The intercept for students using Tutor A is 6.39 points higher than for students using Tutor B; because the slopes differ, the gap between the tutors grows by 0.05 points for each additional Math SAT point • For students using Tutor A, for each point that their Math SAT score increases, their exam score increases by 0.11 on average • For students using Tutor B, for each point that their Math SAT score increases, their exam score increases by 0.06 on average
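With an interaction in the model, the difference between the two tutors is not a single number; it depends on the Math SAT score. The sketch below works this out from the coefficients in the regression equations above. The function names and the three SAT scores are chosen only for illustration.

```python
# Coefficients quoted in the regression equations above.
B0, B_SAT, B_TUTOR, B_INTERACTION = 2.62, 0.06, 6.39, 0.05


def line_for(tutor):
    """Return (intercept, slope) for tutor = 1 (Tutor A) or 0 (Tutor B)."""
    return B0 + B_TUTOR * tutor, B_SAT + B_INTERACTION * tutor


def tutor_gap(math_sat):
    """Predicted Tutor A score minus Tutor B score at a given Math SAT score."""
    return B_TUTOR + B_INTERACTION * math_sat


print("Tutor A:", tuple(round(v, 2) for v in line_for(1)))
print("Tutor B:", tuple(round(v, 2) for v in line_for(0)))
for sat in (0, 400, 600):  # illustrative scores, not from the slides
    print(sat, round(tutor_gap(sat), 2))
```

The 6.39 coefficient is the gap only at a Math SAT score of 0; at realistic scores the predicted gap is considerably larger.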
Example • Explanatory variables • GPA (0-5 scale) • Math SAT score • Time on tutor (in hours) • Tutor used (A, B, C) • Response variable • Exam score
The Regression • We think that time on tutor and the type of tutor may have an interaction
Analysis • Step 1 • F-stat = 769.5 p-value = 0.000 • Step 2 • Test the interactions first • Test Time * Tutor B • T-statistic: -0.727 P-value = 0.471 • Test Time * Tutor C • T-statistic: -0.195 P-value = 0.847
Next Steps • Since neither interaction is significant, I would drop those two variables and rerun the regression • Including the interaction, when it is not significant, can alter the interpretations of the other variables
Analysis • Step 1 • F-stat = 1111 p-value = 0.000 • Step 2 • Test gpa • T-statistic: 10.28 P-value = 0.000 • Test Math SAT score • T-statistic: 70.03 P-value = 0.000 • Test time on tutor • T-statistic: -0.43 P-value = 0.672 • Test Tutor B • T-statistic: -10.52 P-value = 0.000 • Test Tutor C • T-statistic: 2.60 P-value = 0.0128
Next step • Time on tutor is not significant • Drop time and rerun
Analysis • Step 1 • F-stat = 1414 p-value = 0.000 • Step 2 • Test gpa • T-statistic: 10.51 P-value = 0.000 • Test Math SAT score • T-statistic: 70.69 P-value = 0.000 • Test Tutor B • T-statistic: -10.80 P-value = 0.000 • Test Tutor C • T-statistic: 2.67 P-value = 0.011
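The sequence of refits in this example follows a simple rule: test interactions first, drop them if none is significant, then drop non-significant main effects, and never drop a main effect that is part of a retained interaction. A rough sketch of that rule follows. The function name, the term labels, and the 0.05 cutoff are assumptions (the slides do not state a significance level); the p-values are the ones reported above.

```python
ALPHA = 0.05  # assumed significance level; the slides do not state one


def terms_to_drop(p_values, interactions, alpha=ALPHA):
    """Pick the terms to remove before the next refit.

    p_values: dict mapping term name -> p-value from the current fit.
    interactions: dict mapping interaction term -> tuple of its components.
    """
    present = [t for t in interactions if t in p_values]
    weak = [t for t in present if p_values[t] > alpha]
    if weak:
        return weak  # handle interactions before looking at main effects
    # Main effects that belong to a retained interaction are always kept.
    protected = {part for t in present for part in interactions[t]}
    return [t for t, p in p_values.items()
            if t not in interactions and t not in protected and p > alpha]


interactions = {
    "time:tutor_b": ("time", "tutor_b"),
    "time:tutor_c": ("time", "tutor_c"),
}
fit1 = {"time:tutor_b": 0.471, "time:tutor_c": 0.847}  # interaction tests only
fit2 = {"gpa": 0.000, "math_sat": 0.000, "time": 0.672,
        "tutor_b": 0.000, "tutor_c": 0.0128}
fit3 = {"gpa": 0.000, "math_sat": 0.000, "tutor_b": 0.000, "tutor_c": 0.011}

for fit in (fit1, fit2, fit3):
    print(terms_to_drop(fit, interactions))
```

The sketch treats each dummy variable separately. The slides do not say what to do when only one of the two tutor dummies (or one of the two interaction terms) is non-significant, so that case would need a judgment call.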
Interpretation • For each additional GPA point, a student scores on average 2.1 points higher on the final exam • For each additional Math SAT point, a student scores on average 0.11 points higher on the final exam
Interpretation of Dummy Variables • Students who used Tutor B scored on average 4.6 points lower on the final exam, compared to students using tutor A • Students who used Tutor C scored on average 1.1 points higher on the final exam, compared to students using tutor A
Interpretation of Dummy Variables • We can say that students who used Tutor C scored on average 1.10-(-4.63) = 5.73 points higher than students who used Tutor B • However, to say if it is a significant difference one would need to rerun the regression equation with either Tutor B or C as the baseline • Although 5.73 is large, since we do NOT have a test statistic and p-value we cannot make any claims about significance
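The informal comparison amounts to re-expressing the dummy coefficients relative to a different baseline. A minimal sketch, with a made-up function name, using the two coefficients quoted above:

```python
def rebaseline(dummy_coefs, old_baseline, new_baseline):
    """Re-express dummy coefficients relative to a different baseline level.

    dummy_coefs maps each non-baseline level to its coefficient, i.e. its
    mean difference from old_baseline with the other variables held fixed.
    """
    shift = dummy_coefs[new_baseline]
    result = {old_baseline: -shift}
    for level, coef in dummy_coefs.items():
        if level != new_baseline:
            result[level] = coef - shift
    return result


vs_b = rebaseline({"B": -4.63, "C": 1.10}, old_baseline="A", new_baseline="B")
print({level: round(value, 2) for level, value in vs_b.items()})
```

This gives only the point estimates a refit with Tutor B as baseline would produce. The standard error of the B versus C difference depends on the covariance between the two coefficient estimates, which is not shown in the output above, and that is why the refit is needed for a test.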
Example • Suppose we have the following regression • Exam Score = 2.7 + 3.21*gpa + 0.18*MathSAT + 1.3*time + 1.01*TutorB - 1.44*TutorC + 1.8*time*TutorB - 1.7*time*TutorC • Assume that in Step 1 we reject the null and that in Step 2 gpa, Math SAT, and the interaction are significant. • Remember, since the interaction is significant, we are not concerned with the significance of time or tutor alone
Interpretation • Tutor A • Exam Score = 2.7 + 3.21*gpa + 0.18*MathSAT + 1.3*time • Tutor B • Exam Score = 2.7 + 3.21*gpa + 0.18*MathSAT + 1.3*time + 1.01*TutorB + 1.8*time*TutorB • Exam Score = (2.7 +1.01) + 3.21*gpa + 0.18*MathSAT + (1.3 + 1.8)*time • Exam Score = 3.71 + 3.21*gpa + 0.18*MathSAT + 3.1*time • Tutor C • Exam Score = 2.7 + 3.21*gpa + 0.18*MathSAT + 1.3*time - 1.44*TutorC - 1.7*time*TutorC • Exam Score = (2.7 -1.44) + 3.21*gpa + 0.18*MathSAT + (1.3 - 1.7)*time • Exam Score = 1.26 + 3.21*gpa + 0.18*MathSAT - 0.40*time
Interpretation • For each additional point in GPA, a student’s exam score increases by 3.21 on average • For each additional point in Math SAT, a student’s exam score increases by 0.18 on average • Because the model includes the time*tutor interactions, the tutor coefficients compare students at time = 0 • At time = 0, students who use Tutor B score on average 1.01 points higher on the final exam than students using Tutor A • At time = 0, students who use Tutor C score on average 1.44 points lower on the final exam than students using Tutor A
Interpretation • Students using Tutor A • For each additional minute on the tutor, students’ exam scores increase by 1.3 • Students using Tutor B • For each additional minute on the tutor, students’ exam scores increase by 3.1 • Students using Tutor C • For each additional minute on the tutor, students’ exam scores decrease by 0.40
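The three per-tutor equations can be recovered from the single fitted equation by setting the dummy variables. The sketch below does this with the coefficients from the example regression; the dictionary keys and function names are labels invented for the illustration.

```python
COEF = {
    "intercept": 2.7, "gpa": 3.21, "math_sat": 0.18, "time": 1.3,
    "tutor_b": 1.01, "tutor_c": -1.44,
    "time:tutor_b": 1.8, "time:tutor_c": -1.7,
}


def predict(gpa, math_sat, time, tutor):
    """Predicted exam score; tutor is 'A' (baseline), 'B' or 'C'."""
    if tutor not in ("A", "B", "C"):
        raise ValueError(f"unknown tutor {tutor!r}")
    b = 1 if tutor == "B" else 0
    c = 1 if tutor == "C" else 0
    return (COEF["intercept"] + COEF["gpa"] * gpa + COEF["math_sat"] * math_sat
            + COEF["time"] * time + COEF["tutor_b"] * b + COEF["tutor_c"] * c
            + COEF["time:tutor_b"] * time * b + COEF["time:tutor_c"] * time * c)


def time_slope(tutor):
    """Change in predicted score per extra unit of time, other inputs fixed."""
    return predict(0, 0, 1, tutor) - predict(0, 0, 0, tutor)


for tutor in "ABC":
    # Intercept (all predictors at 0) and time slope for each tutor.
    print(tutor, round(predict(0, 0, 0, tutor), 2), round(time_slope(tutor), 2))
```

The GPA and Math SAT slopes are the same for every tutor because neither variable is involved in an interaction.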
ANCOVA • Analysis of Covariance • At least one quantitative and one categorical explanatory variable • In general, the main interest is the effects of the categorical variable and the quantitative variable is considered to be a control variable • It is a blending of regression and ANOVA
ANCOVA • Can run the analysis either as a linear regression with a dummy variable or as an ANCOVA model, in which case output is similar to ANOVA models • Will get the same results in either case! • Different statistical packages make one or the other easier to run • It is a matter of preference and interpretation
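One way to see why the two approaches agree, for two groups and no interaction, is to write both models side by side. The ANCOVA parameterization below (grand mean μ, group effects τ, covariate centered at its mean) is one common textbook form and is assumed here; it does not appear on the slides.

```latex
% ANCOVA form: group i, subject j
Y_{ij} = \mu + \tau_i + \beta\,(X_{ij} - \bar{X}) + \varepsilon_{ij}, \qquad i = 1, 2

% Regression form: D = 0 for group 1, D = 1 for group 2
Y = \beta_0 + \beta_1 X + \beta_2 D + \varepsilon

% Matching terms group by group
\beta_1 = \beta, \qquad
\beta_2 = \tau_2 - \tau_1, \qquad
\beta_0 = \mu + \tau_1 - \beta\,\bar{X}
```

The dummy-variable coefficient is the difference between the two group effects, so testing it in the regression is the same question as testing the categorical variable in the ANCOVA.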