
Current Developments in Quantitative Research Methods

Current Developments in Quantitative Research Methods. LOT Winter School, January 2014. Luke Plonsky. Welcome & Introductions. Course Introduction. Methodological reform (revolution?) taking place. Goal: more accurately inform theory, practice, and future research.


Presentation Transcript


  1. Current Developments in Quantitative Research Methods • LOT Winter School, January 2014 • Luke Plonsky

  2. Welcome & Introductions

  3. Course Introduction • Methodological reform (revolution?) taking place • Goal: more accurately inform theory, practice, and future research • Content objectives: conceptual and practical (but mostly conceptual) • Inform participants’ current and future research efforts • Motivate future inquiry with a methodological focus • Not stats-heavy or technical; assumed: basic knowledge of descriptive and inferential statistics (e.g., M, SD, t test, ANOVA) • Examples mostly from second language (L2) research • Lecture, all-group discussion, and small-group discussion → ask Qs at any time!

  4. Course Overview • Monday/today: Statistical power, effect sizes, and fallacies of statistical significance • Tuesday: Meta-analysis and the synthetic approach • Wednesday: Assessing methodological quality • Thursday: Replication research • Friday: Data transparency, reporting practices, and visualization techniques

  5. Statistical power, effect sizes, and fallacies of statistical significance • Luke Plonsky • Current Developments in Quantitative Research Methods, Day 1

  6. Review of Common Stats: Comparing Means (t test, ANOVA) [Figure: mean scores (DV) by groups (IV), from Bialystok & Miller (1999)]

  7. Review of Common Stats: Correlations • Question: What is the relationship between two (continuous) variables? • Positive, negative, curvilinear • Strong, weak, moderate, none [Figure from DeKeyser (2000)]

  8. A Model of Research • What’s wrong with this picture? [Flow diagram: Conduct a study (e.g., the effects of A on B) → if p < 0.05: Important finding / Get published! → Modify relevant theory, research, practice; if p > 0.05: Trash]

  9. p Values

  10. (Another quick review) Q: Wait, real quick: What’s a p value? A1: The probability of obtaining results at least as extreme as those observed (e.g., differences between groups; relationship between variables) given NO difference between groups / no relationship between variables in the population A2: NOT an indication of the magnitude, importance, direction, or replicability of an effect/relationship (i.e., WHAT WE REALLY WANT TO KNOW!) • Also: Observed p values vary as a function of sample size (N), effect size (e.g., Cohen’s d), and variance.
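
The last point on this slide can be made concrete with a rough sketch. It holds the effect size fixed at d = 0.30 and changes only the sample size. The function name and the chosen sample sizes are illustrative assumptions, and the p value comes from a large-sample normal (z) approximation rather than an exact t test, so small-sample values are only approximate.

```python
from math import sqrt
from statistics import NormalDist


def approx_p_two_groups(d, n_per_group):
    """Approximate two-tailed p for a standardized mean difference d.

    Assumes two independent groups of equal size and uses a normal (z)
    approximation, not an exact t test.
    """
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))


# Same effect size every time; only the (hypothetical) sample size changes.
for n in (10, 50, 100, 1000):
    print(f"d = 0.30, n per group = {n:4d}, approx. p = {approx_p_two_groups(0.30, n):.4f}")
```

The effect is identical in every row, yet the p value moves from clearly above .05 to far below it as n grows.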

  11. OK, on to the Controversy (Anderson et al., 2000) • 60+ years and 400+ articles (e.g., Schmidt, 1996; Thompson, 2001) “The almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories … is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology” (Meehl, 1967, p. 72). • APA Task Force on Statistical Inference (Wilkinson & TFSI, 1999) • Applied linguistics (AL): strict (dogmatic?) adherence to NHST; very little discussion until recently (Crookes, 1991; Ellis, 2006; Larson-Hall, 2010; Lazaraton, 1991; Nassaji, 2012; Norris, 2013; Norris & Ortega, 2000, 2006; Oswald & Plonsky, 2010; Plonsky, 2011, 2012, 2013; Plonsky & Gass, 2011) http://oak.ucc.nau.edu/ldp3/bib_nhst.html

  12. Wilkinson & TFSI (1999) • Purpose: “to initiate discussion in the field about changes in current practices of data analysis and reporting” • General recommendations: be transparent; calculate power a priori; inspect data descriptively and visually; simpler analyses are best • Specifics: report exact p values; report (contextualized) ESs for all tests; CIs

  13. Main arguments against NHST?

  14. NHST is Unreliable • “The effects of A and B are always different, in some decimal place, for any A and B. Thus asking ‘are the effects different?’ is foolish” (Tukey, 1991, p. 100). • The (nil) hypothesis that d = 0 is (almost) always false! (Cohen, 1994)

  15. NHST is Unreliable (Cont’d) • Same goes for p values based on correlations • Remember: the same r = .30 can fall on either side of p = .05, depending on N
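
As a hedged illustration of the correlation case, the sketch below keeps r fixed at .30 and varies only N. The function name and the sample sizes are assumptions made for this example, and the p value is based on the Fisher z approximation rather than the exact t test for a correlation.

```python
from math import atanh, sqrt
from statistics import NormalDist


def approx_p_for_r(r, n):
    """Approximate two-tailed p for a Pearson r with sample size n.

    Uses the Fisher z transformation: atanh(r) * sqrt(n - 3) is treated
    as a standard normal deviate. This is an approximation.
    """
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Identical correlation, hypothetical sample sizes.
for n in (20, 40, 60, 100):
    print(f"r = .30, N = {n:3d}, approx. p = {approx_p_for_r(0.30, n):.3f}")
```

The strength of the relationship never changes; only the verdict of the significance test does.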

  16. NHST is Unreliable (Cont’d) “[with NHST] … tired researchers, having collected data on hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired.” (Thompson, 1992, p. 436)

  17. NHST is Crude and Uninformative • Continuous data → yes/no dichotomy • p values say nothing about: • Replicability • Theoretical or practical importance • Magnitude of effects • p > .05 ≠ zero effect size: “The absence of evidence for differences is not evidence for equivalence” (Kline, 2004, p. 67) • Large p values can correspond to large effects and vice versa • Other explanations for p > .05? • small sample/low power/high sampling error; small (i.e., hard-to-detect) effect size; unreliable instruments; weak treatment; other hidden variables; … • Appropriate for a limited period of exploratory research • (Should be an) inverse relationship between theoretical maturity and reliance on p

  18. NHST is Crude and Uninformative [Table from Papi & Abdollahzadeh (2012)] What could these t tests and the resulting p values possibly contribute here?

  19. Do you see any similar patterns here? (Hint: look at the p values and ESs) [Table from Taylor et al. (2006). Key: p > .05 but sizeable d; p < .05 but not large d; p > .05 with negative d]

  20. NHST is Arbitrary • “…surely, God loves the .06 nearly as much as the .05” (Rosnow & Rosenthal, 1989, p. 1277) • How much more (or less) would we know if the conventional alpha level were .03 (or .15)? • What if tests of statistical significance never existed? (Harlow et al., 1997)

  21. NHST is Counter-productive [Flow diagram repeated from slide 8] • Adherence to NHST (and p values) constrains progress of theory → inefficient research efforts • NHST & publication bias (Rothstein et al., 2005) • Scenario: 100 intervention studies; H0 is true (i.e., no difference between treatments A and B), with alpha = .05 • About 5 studies (on average) will find p < .05 • The other ~95 studies will sit unpublished, or be re-run until p < .05 (jelly beans cause acne) • Type I error (false positive) rate in published studies = 100% • Treatment effects (which are nil) become grossly overestimated
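
The publication-bias scenario can be simulated in a few lines. This is a sketch under stated assumptions, not a reanalysis of any real data: both groups are drawn from the same normal population (so H0 is true by construction), each simulated study has 50 participants per group, p values use a normal (z) approximation, and 2,000 studies are run instead of 100 so the proportions are stable. All names and settings are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def simulate_null_studies(n_studies=100, n_per_group=50, seed=1):
    """Simulate two-group studies in which the true effect is exactly zero.

    Returns a list of (d, p) pairs; p is a normal-approximation p value.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_studies):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        sd_pooled = sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)  # equal group sizes
        d = (mean(a) - mean(b)) / sd_pooled
        p = 2 * (1 - NormalDist().cdf(abs(d) * sqrt(n_per_group / 2)))
        results.append((d, p))
    return results


results = simulate_null_studies(n_studies=2000)
all_abs_d = [abs(d) for d, _ in results]
published_abs_d = [abs(d) for d, p in results if p < 0.05]  # only p < .05 "gets published"

print("share of studies with p < .05:", len(published_abs_d) / len(results))
print("mean |d|, all studies:        ", round(mean(all_abs_d), 2))
print("mean |d|, 'published' studies:", round(mean(published_abs_d), 2))
```

Every "published" effect here is a false positive, and the average published |d| is far from the true value of zero, which is the overestimation the slide describes.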

  22. Summary • (Quantitative) linguistics research relies heavily on NHST, which is… • highly controversial at best and possibly dangerous and to-be-avoided; • unreliable; • crude and uninformative; • arbitrary; and • counter-productive OK, but what can we do to improve?

  23. Power (Or: a possible solution to our obsession with p values?)

  24. Statistical Power • What is it? • Why does it matter? • How many participants do I need? (A very practical and common question)

  25. What kind of power is needed vs. typical? • [Table 2 in Cohen (1992), highlighting d = 0.2, d = 0.5, and d = 0.8] Are these Ns typical in linguistics research?
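
For a feel for the Ns involved, here is a minimal a priori sketch. It assumes a two-group comparison with equal group sizes, a two-tailed alpha of .05, a target power of .80, and a normal approximation; the function name is an assumption of this example. Because Cohen's tables are based on the t distribution, the numbers printed here can differ from his by a participant or so.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate participants needed per group to detect effect size d.

    Normal-approximation formula: n = 2 * ((z_alpha/2 + z_power) / d) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: roughly {n_per_group(d)} participants per group")
```

Halving the expected effect size roughly quadruples the required sample, which is why small effects are so hard to detect with typical group sizes.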

  26. What kind of power is needed vs. typical? • Plonsky & Gass (2011) • 2% conducted a power analysis • Median d = 0.65 + median n = 22 • Overall post hoc power = .56 • Plonsky (2013) • 1% (6/606 studies) conducted a power analysis • median d = .71 (inflated?) + median n = 19 • Overall post hoc power = .57 • What does this mean for • Internal validity (and, hence, external validity/generalizability)? • Past research? • Theory-building? • Practical implications? • Availability bias in meta-analyses?
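
The post hoc figures on this slide can be approximated with a short sketch. It assumes a two-tailed, two-group comparison at alpha = .05, treats the reported median n as the size of each group, and uses a normal approximation that ignores the negligible opposite tail. Those are assumptions of this example, not necessarily the procedure used in the cited studies, so the results land near the reported values without matching them exactly.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(d) * sqrt(n_per_group / 2) - z_crit)


# Median d and median n as reported on the slide.
print("Plonsky & Gass (2011):", round(approx_power(0.65, 22), 2))
print("Plonsky (2013):       ", round(approx_power(0.71, 19), 2))
```

Power in this range means that a real effect of the typical size would be missed in a large share of studies.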

  27. The “Power Problem” in L2 Research (Plonsky, 2013, in press) • Rarely analyze power • Small samples • Heavy reliance on NHST (median = 18) • Effects not generally very large • Omission of non-statistical results • Rarely check assumptions • Rarely use multivariate statistics

  28. Tools for Power Analysis • Cohen’s (1988, 1992) power tables • A priori • Conceptually? • Practically: http://danielsoper.com/statcalc3/calc.aspx?id=47 • Post hoc • Conceptually? • Practically: http://danielsoper.com/statcalc3/calc.aspx?id=49

  29. Quick Review

  30. What if you can’t get enough power? • This may be the case when, for example… • You’re studying a very small or hard-to-find population (L3 learners of Swahili with L1 Korean) • You have limited funding for running participants • Your phenomenon/relationship/effect of interest is small (i.e., hard to detect) • Your advisor says you can’t use the PSY participant pool • Avoid or limit inferential stats • Form fewer (sub)groups → fewer contrasts • Focus on descriptives (including effect sizes and CIs) • ‘Bootstrap’ the data?

  31. Bootstrapping • Random re-sampling from observed data to produce a simulated but more stable outcome (see Larson-Hall & Herrington, 2010) • (More) robust to: outliers, non-normal data → common • Larson-Hall & Herrington (2010) • ANOVA: p > .05 between NSs (n = 15) and 3 learner groups (n = 14, 15, 15) • Tukey post hocs: p < .05 ONLY between NSs and Group A (p = .002); Group B p = .407; Group C p = .834 • Bootstrapped post hoc tests → p < .05 for all three groups • p values non-statistical due to a lack of power; Type II error • Plonsky et al. (in press) • Re-analyzed raw data from 26 primary L2 studies • 4 (of 16) Type I ‘misfits’ (i.e., 25% Type I ‘misfit’ rate) • 0 Type II ‘misfits’ • Too much power (via large N) → inflated findings?
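
To make the re-sampling idea concrete, here is a minimal percentile-bootstrap sketch for a difference between two group means. The scores are invented for illustration, the function name and the number of resamples are assumptions, and this simple percentile interval is only one of several bootstrap variants; it is not the specific procedure used in the studies cited above.

```python
import random
from statistics import mean


def bootstrap_ci_mean_diff(a, b, n_boot=5000, level=0.95, seed=1):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each group is resampled with replacement, the mean difference is
    recomputed, and the CI is read off the sorted resampled differences.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical test scores for two small groups.
group_a = [12, 15, 14, 10, 18, 16, 13, 17, 11, 15]
group_b = [9, 11, 13, 8, 12, 10, 14, 9, 11, 10]

lower, upper = bootstrap_ci_mean_diff(group_a, group_b)
print("observed difference:", mean(group_a) - mean(group_b))
print("95% bootstrap CI:   ", (round(lower, 2), round(upper, 2)))
```

Reporting the interval rather than a bootstrapped p value keeps the focus on the size of the difference and its precision.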

  32. BUT EVEN WITH GREATER POWER VIA BOOTSTRAPPING, OUR RESULTS ARE STILL BASED ON THE FLAWED NOTION OF STATISTICAL SIGNIFICANCE

  33. EFFECT SIZES! (Or: a MUCH BETTER solution to our obsession with p values)

  34. Effect Sizes Questions we’ll address • What are they? How do we calculate them? • What advantages do ESs provide over p values? • How can we interpret ESs?

  35. What is an effect size? • A quantitative indication of the strength of a relationship or an effect • Common effect sizes • Standardized mean differences (Cohen’s d) • d = (M1 − M2) / SDpooled (see Excel macro for calculating d) • Correlation coefficients (e.g., r) • Shared variance (R², eta²) • Odds ratios (likelihood of A given B) • Percentages
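
The standardized mean difference on this slide can be computed directly from raw scores. The sketch below is a Python stand-in for the Excel macro mentioned above, not a copy of it; the function name and the scores are made up for illustration, and the pooled SD is weighted by each group's degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Cohen's d: (M1 - M2) / SD_pooled, using sample variances."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


# Hypothetical posttest scores for a treatment and a comparison group.
treatment = [14, 17, 12, 19, 15, 16, 13, 18]
comparison = [12, 14, 10, 15, 13, 11, 12, 16]
print("Cohen's d:", round(cohens_d(treatment, comparison), 2))
```

Because d is expressed in SD units, it can be compared across studies that used different instruments or scales.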

  36. Why Effect Sizes? - An alternative to NHST (p) - Null Hypothesis Significance Testing (p) vs. Effect Sizes (d) • NHST is unreliable: result dependent on sample size (e.g., Kline, 2009) | ESs: not dependent on N • NHST is crude and uninformative: a) forces continuous data into a yes/no dichotomy; b) tells us nothing about practical significance or magnitude (e.g., Cohen, 1994) | ESs: express magnitude/size of relationship (i.e., WHAT WE REALLY WANT TO KNOW) • NHST is arbitrary: “…surely, God loves the .06 nearly as much as the .05” (Rosnow & Rosenthal, 1989, p. 1277) | ESs: continuous and can be compared/combined across studies

  37. Research Questions and Their Answers Using NHST vs. ESs • Think of a study you read recently or one that you’re working on. • What were the RQs? • Were they phrased dichotomously (Do …? Is there a difference …?)? • If so, what kind of answer can come from such an RQ? • How might the findings differ with an emphasis on magnitude rather than presence/absence of a relationship or effect?

  38. Why Effect Sizes? - Journal Requirements - • APA Publication Manual, 6th Edition • Three major L2 journals: Language Learning, TESOL Quarterly, Modern Language Journal • Plonsky & Gass (2011): 0% (1980s) → 0% (1990s) → 27% (2000s) • Plonsky (2013): 3% (1990s) → 42% (2000s) So now effect sizes get reported more often…

  39. …but very rarely do we interpret them What do they mean anyway? How big is ‘big’? And how small is ‘small’? What does d = 0.50 (or 0.10, or 1.00…) mean? What implications do these effects have for future research, theory, and practice? [Graphic labels: SMALL, BIG]

  40. ESs: Summary • Empirically based, field-specific scale for d values in L2 research • ESs are best understood in relation to other, field-specific effects • d ≈ 0.40 (small) • d ≈ 0.70 (medium) • d ≈ 1.00 (large) • “…if people interpreted effect sizes [using fixed benchmarks] with the same rigidity that .05 has been used in statistical testing, we would merely be being stupid in another metric” (Thompson, 2001, pp. 82–83). • Additional considerations: • Theoretical and methodological maturity (over time) • SD units • Research setting (lab vs. classroom; SL vs. FL) • Length/intensity of treatment • Manipulation of IVs • Publication bias • Sample size / sampling error • Instrument reliability

  41. A Revised Model of Research [Flow diagram labels: Conduct a study (e.g., the effects of A on B); p < 0.05, d = ?; p > 0.05, d = ?; Trash; Accumulation of results (via meta-analysis); More precise and reliable estimate of effects; Modify relevant theory, research, practice]

  42. Based on our discussion today, what changes would you suggest to the field?

  43. 10 Suggestions for Reform • A diminished reliance on NHST / p values • Drop the “significant” from “statistically significant” • Focus on the practical and theoretical importance of results • Better educate ourselves and future generations of researchers → Emphasize: ESs, alternatives to NHST, synthetic-mindedness in primary research → De-emphasize NHST • ESs (for all findings, not only when p < .05) • CIs (for all findings, not only when p < .05) → “a quiet but insistent reminder that no knowledge is complete or perfect” (Sagan, 1996) • Replication (to mitigate effects of low power) • Examine data visually • Meta-analysis / a synthetic approach • Initiative from the top down
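
As one way to act on the ES and CI suggestions together, the sketch below puts an approximate confidence interval around a d value. It relies on a common large-sample formula for the standard error of d and a normal critical value, so it is a rough guide rather than an exact (noncentral t) interval. The function name is an assumption, and the inputs are simply the median d and n reported earlier for Plonsky & Gass (2011), with n treated as the size of each group.

```python
from math import sqrt
from statistics import NormalDist


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate CI for Cohen's d from the two group sizes.

    SE(d) is approximated as sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))).
    """
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf((1 + level) / 2)
    return d - z * se, d + z * se


lower, upper = d_confidence_interval(0.65, 22, 22)
print(f"d = 0.65, 95% CI approx. [{lower:.2f}, {upper:.2f}]")
```

With groups this small the interval is wide, which is exactly the "quiet but insistent reminder" a CI is meant to provide.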

  44. Further Reading • Beyond significance testing (Kline, 2013) • The cult of statistical significance (Ziliak & McCloskey, 2008) • Understanding the new statistics (Cumming, 2012) • Effect sizes for research (Grissom & Kim, 2012, 2nd ed.) • Statistical power analysis for the behavioral sciences (Cohen, 1988, 2nd ed.)

  45. Connections to Other Topics to be Discussed this Week • Meta-analysis (relies on ESs rather than p values) (TUESDAY) • Replication (THURSDAY) • Reporting practices (full descriptives including ESs, always; data transparency, etc.) (FRIDAY)

  46. Tomorrow: Meta-analysis • Motivation for and benefits of (conceptual understanding) • Procedures/techniques (practical understanding)
