2014 Joint Statistical Meetings (JSM) Boston, MA Sharon Lane-Getaz

What Students Learn (and Don’t Learn) about Inferential Reasoning in Introductory Statistics Courses
2014 Joint Statistical Meetings (JSM), Boston, MA
Sharon Lane-Getaz, St. Olaf College, Northfield, MN 55057, lanegeta@stolaf.edu

Objective
What does statistics education research report about the correct conceptions, difficulties, and misconceptions people have with inferential reasoning? How might this help a statistical consultant dealing with clients?
• Background: To assess the impact of methods for teaching inference, an instrument was developed to assess 14 known misconceptions and difficulties; items were added to assess correct conceptions.
• Measurement: The Reasoning about P-values and Statistical Significance (RPASS) scale; reliability in this study is Cronbach’s alpha = .76 (37 items).
• Study: Compare Pretest and Posttest proportions of students answering each item correctly on a scatterplot (canoe plot).
• Discussion: Emphasize what students generally learn and what problems tend to persist.
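The reliability figure above is Cronbach's alpha. As a minimal sketch of how that coefficient is computed from a students-by-items score matrix, assuming items scored 0/1 (the function name `cronbach_alpha` and the toy scores are illustrative assumptions, not RPASS data):

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student lists of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    Population variances are used throughout; the n vs. n - 1 choice
    cancels in the ratio as long as it is applied consistently.
    """
    k = len(scores[0])  # number of items
    item_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical toy data: 4 students, 3 items scored correct (1) or incorrect (0).
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 2))
```

With real data, each row would hold one student's 37 item scores.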

Subjects and Setting
• Subjects (N = 138) from two introductory-level statistics courses aimed at the social sciences (n1 = 78) and natural sciences (n2 = 60).
• 138 of 167 enrolled students completed the Pretest and Posttest and consented to participate (83% response).
• 94 females, 43 males, 1 no response.
• 34 first years, 56 sophomores, 30 juniors, 18 seniors.
• Setting: Small liberal arts college (3000 students) in the upper Midwest US, in a small town of “cows, colleges and contentment.”
• Time: Spring semester 2011.

Broad range of results with two courses combined: RPASS-9 Pretest and Posttest Totals

Pre- and Posttest Totals: Gains by Course

Aggregate Results for Both Courses (N = 138)
• 70% of 37 RPASS-9 Posttest items correct, on average.
• Five more Posttest items correct, on average:
  • RPASS-9 Posttest (Mean = 26.1, SD = 5.1)
  • RPASS-9 Pretest (Mean = 21.0, SD = 4.2)
• What did students learn, by item, … and what did they not learn?

Item-Level Analysis (Canoe Plot)
• Canoe plot of item-level changes in proportion correct.
• Scatterplot of Pretest versus Posttest proportions by item.
• A 95% confidence band along p_posttest = p_pretest separates items with a significant difference in the proportion answering correctly from items with nonsignificant differences (Posttest - Pretest).
• Wilson-adjusted margins of error maintain a nominal 95% coverage rate (Agresti & Caffo, 2000).
• No family-wise correction; intended for descriptive purposes.
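The slides do not give the exact computation behind the band, so the following is only a sketch of one way to classify an item as above, within, or below it. It assumes the adjusted interval for a difference of two proportions described by Agresti & Caffo (2000), which adds one success and one failure to each sample, and it treats the Pretest and Posttest proportions as two samples of the same size. The function names and the counts in the usage lines are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% standard normal quantile


def adjusted_difference_interval(x_post, x_pre, n, z=Z_95):
    """Adjusted Wald interval for p_post - p_pre.

    Adds one success and one failure to each sample (two successes and
    two failures in total) before applying the usual Wald formula.
    """
    p_post = (x_post + 1) / (n + 2)
    p_pre = (x_pre + 1) / (n + 2)
    std_error = math.sqrt(
        p_post * (1 - p_post) / (n + 2) + p_pre * (1 - p_pre) / (n + 2)
    )
    diff = p_post - p_pre
    return diff - z * std_error, diff + z * std_error


def classify_item(x_post, x_pre, n):
    """Return 'above', 'within', or 'below' relative to the 95% band."""
    lower, upper = adjusted_difference_interval(x_post, x_pre, n)
    if lower > 0:
        return "above"   # Posttest proportion significantly higher
    if upper < 0:
        return "below"   # Posttest proportion significantly lower
    return "within"      # no significant difference


# Hypothetical counts of correct answers out of N = 138 students.
print(classify_item(x_post=110, x_pre=70, n=138))
print(classify_item(x_post=70, x_pre=70, n=138))
print(classify_item(x_post=60, x_pre=90, n=138))
```

Because no family-wise correction is applied, each item's classification should be read descriptively, as the slide notes.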

Proportion of Correct Responses by RPASS-9 Item: Pretest on x, Posttest on y (37 items, N = 138). 23 items above the 95% confidence band, 13 within, and 1 below.

Improved 14 Correct Conceptions of the 23 Items “Above the Band”
• Improved Statistical Literacy:
  • Recognize textbook definitions of p-value (1-1, 6-1)
  • Link p-value to sampling variation (2-1)
  • Understand p-value as a rareness measure (3a-2)
• Improved Inferential Reasoning:
  • Assess significance graphically (3b-1)
  • Reason about variation (3c-2)
  • Assess impact of alternative hypothesis on p-value (1-3, 4b-1)
  • Differentiate small p-values, Type I and II errors (6-2, 6-7)
  • Reason about sample size impact on p-value (6-4)
  • Reason about strength of evidence vs. p-value (2-2, 4a-1, 6-3)
Green items (5) indicate pc < .50 on Pretest

Improved (Suppressed) 9 Misconceptions of the 23 Items “Above the Band”
• State conclusions within the confines of the scope of inference:
  • Need a random sample to generalize from sample to population (5-4)
  • Need random assignment to draw a causal conclusion (4a-3)
• Interpret what a p-value is NOT:
  • Always small or always desired to be a low value (3a-3, 3b-3)
  • Probability that the null hypothesis is false or true (5-1, 5-2)
  • Alpha or significance level (4a-1)
• Interpret that a small p-value does NOT mean:
  • Chance caused the results observed (2-4)
  • It provides definitive, contrapositive proof (3a-1)
Red items (3) indicate pc < .50 on Pretest

No Improvement: “Within the Band” Correct Conceptions (C)
• Reason about variation in boxplot depiction (3c-1) C
• Make correct rejection decision (4b-3) C
• Recognize an informal definition of p-value (1-2) C
• Recognize p-value as a conditional probability (2-3) C
• Use confidence intervals for statistical significance (2-5) C
• Differentiate p-values from effects (4a-2) C
• Interpret large p-value (4b-2) C
• Consider impact of sample size on p-values (4b-4, 6-4) C
Green indicates pc < .50 on Pretest

No Improvement: “Within the Band” Misconceptions (M) or Multiple Choice Items
• Belief that increased replications = increased sample size (4b-6) M
• Belief that p-values are always low or desired to be low (3b-2) M
• Differentiate statistical vs. practical significance (4b-5, 6-5) C/M
• Check conditions before making an inference (6-6) C/M
Red indicates pc < .50 on Pretest

The One Item “Below the Band”: Unlearning, Guessing, Confusion?
Responses for one item suggest better reasoning on the Pretest than on the Posttest (just below the 95% confidence band). When asked to choose the correct direction to shade the p-value in the sampling distribution of means (3b-4), students tend to select shade “to the right,” even though the alternative hypothesis suggests that one should shade the larger left tail.
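To make the shading direction concrete, here is a minimal sketch of a one-sided p-value under a normal sampling distribution of means. The slides do not give the numbers used in item 3b-4, so the standardized statistic below is a hypothetical value chosen so that the left tail is the larger one; the function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_p_value(z, alternative):
    """Tail area in the direction named by the alternative hypothesis.

    alternative="less"    -> shade the left tail, P(Z <= z)
    alternative="greater" -> shade the right tail, P(Z >= z)
    """
    if alternative == "less":
        return normal_cdf(z)
    if alternative == "greater":
        return 1.0 - normal_cdf(z)
    raise ValueError("alternative must be 'less' or 'greater'")


# Hypothetical standardized sample mean that falls above the null value.
z_observed = 0.5
p_left = one_sided_p_value(z_observed, "less")      # direction of the alternative
p_right = one_sided_p_value(z_observed, "greater")  # the commonly chosen shading
print(round(p_left, 3), round(p_right, 3))
```

The direction comes from the alternative hypothesis, not from which tail is smaller, so the correct p-value here is the larger left-tail area.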

Remind clients of caveats and limitations of the statistical inference process.
• A p-value is an integrated part of the larger statistical process:
  • Logic of inference (how we interpret results) depends on sample size and on whether conditions were met, and relates to effect size and importance.
  • Scope of inference (what we can conclude) depends on randomness in the study design, that is, how the data were gathered.
• A confidence interval (CI) estimates population parameters or true effects, given the sample we observed, and:
  • Provides information complementary to what p-values give alone (bounds for the effect).
  • Can assess statistical significance. For example, point out whether the null hypothesis value is in the interval or not. Is zero in the interval? Is the interval all positive or all negative?
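The interval check described above can be sketched as follows. This is an illustration under stated assumptions: a normal-approximation 95% interval (z = 1.96) for an effect, with hypothetical estimates and standard errors; the helper names are not from the presentation.

```python
def normal_ci(estimate, std_error, z=1.96):
    """Normal-approximation confidence interval for an effect."""
    return estimate - z * std_error, estimate + z * std_error


def interpret_interval(lower, upper, null_value=0.0):
    """Say whether the interval excludes the null value, and on which side."""
    if lower > null_value:
        return "significant: interval lies entirely above the null value"
    if upper < null_value:
        return "significant: interval lies entirely below the null value"
    return "not significant: the null value is inside the interval"


# Hypothetical effects with the same standard error.
for estimate in (2.0, 1.0):
    lower, upper = normal_ci(estimate, std_error=0.8)
    print(f"estimate={estimate}: ({lower:.2f}, {upper:.2f}) ->",
          interpret_interval(lower, upper))
```

Beyond the yes/no decision, the interval's endpoints give the client bounds for the plausible size of the effect.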

A Surprise Aside
Students in a randomization-based curriculum learn more on average but, ironically, show no improvement on 5 items associated with the randomization distribution:
• How a one- or two-tailed test relates to the p-value (4b-2) M
• Correct rejection decision (4b-3) C
• Impact of sample size on significance (4b-4) M
• Significance vs. practical importance (4b-5)
• Impact of increasing sample size vs. replications (4b-6) M

References
Agresti, A., & Caffo, B. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280-288.
Chance, B. L., & Rossman, A. J. (2006). Investigating Statistical Concepts, Applications, and Methods. Belmont, CA: Brooks/Cole - Thomson Learning.
Cobb, G. (2007). The Introductory Statistics Course: A Ptolemaic Curriculum? Technology Innovations in Statistics Education, 1(1). http://repositories.cdlib.org/uclastat/cts/tise/
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. American Psychologist, 60(2), 170-180.
delMas, R. C., Garfield, J. B., Ooms, A., & Chance, B. (2007). Assessing Students’ Conceptual Understanding after a First Course in Statistics. Statistics Education Research Journal, 6(2), 28-58. http://www.stat.auckland.ac.nz/serj
Lane-Getaz, S. J. (2013). Development of a Reliable Measure of Students’ Inferential Reasoning Ability. Statistics Education Research Journal (SERJ), 12(1), 20-47. http://iase-web.org/documents/SERJ/SERJ12(1)_LaneGetaz.pdf
Lane-Getaz, S. J. (2007). Toward the Development and Validation of the Reasoning about P-values and Statistical Significance Scale. In B. Phillips & L. Weldon (Eds.), Proceedings of the ISI / IASE Satellite Conference on Assessing Student Learning in Statistics. Voorburg, The Netherlands: ISI. http://www.stat.auckland.ac.nz/~iase/publications/sat07/Lane-Getaz.pdf
Utts, J. (2003). What Educated Citizens Should Know about Statistics and Probability. The American Statistician, 57(2), 74-79.

Contact Information & Slides • Sharon Lane-Getaz, lanegeta@stolaf.edu On sabbatical this coming year and would love to collaborate with YOU to administer the RPASS at your institution! Let’s talk! • These JSM-2014 presentation slides will be available from: http://sharonlanegetaz.efoliomn.com/JSM2014 • The differences in proportions by item appear in the Appendix of this presentation. Please see the proceedings for more!

Table 1. Proportion Correct on RPASS-9 Posttest Item Exceeds Pretest Proportion Correct (12 of 23 items)
Note. a = Items associated with sampling or randomization distribution. b = Requests explanation of reasoning.

Table 1 contd. Proportion Correct on RPASS-9 Posttest Item Exceeds Pretest Proportion Correct (11 of 23 items)
Note. a = Items associated with sampling or randomization distribution. b = Requests explanation of reasoning.

Table 2. Equal Proportion of Students Answer RPASS-9 Item Correctly on Posttest and Pretest (13 items)
Note. a = Items associated with sampling or randomization distribution. b = Requests explanation of reasoning.
