Current Developments in Quantitative Research Methods LOT Winter School January 2014 Luke Plonsky
Course Introduction • Methodological reform (revolution?) taking place • Goal: more accurately inform theory, practice, and future research • Content objectives: conceptual and practical (but mostly conceptual) • Inform participants' current and future research efforts • Motivate future inquiry with a methodological focus • Not stats-heavy or overly technical; assumed: basic knowledge of descriptive and inferential statistics (e.g., M, SD, t test, ANOVA) • Examples mostly from second language (L2) research • Lecture, all-group discussion, and small-group discussion; ask questions at any time!
Course Overview • Monday/today: Statistical power, effect sizes, and fallacies of statistical significance • Tuesday: Meta-analysis and the synthetic approach • Wednesday: Assessing methodological quality • Thursday: Replication research • Friday: Data transparency, reporting practices, and visualization techniques
Statistical power, effect sizes, and fallacies of statistical significance Luke Plonsky Current Developments in Quantitative Research Methods Day 1
Review of Common Stats: Comparing Means • t test and ANOVA: compare mean scores (DV) across groups (IV) • [Figure: group means from Bialystok & Miller (1999)]
Review of Common Stats: Correlations • Question: What is the relationship between two (continuous) variables? • Positive, negative, curvilinear • Strong, weak, moderate, none • [Figure: scatterplot from DeKeyser (2000)]
A Model of Research • Conduct a study (e.g., the effects of A on B) • If p < 0.05: important finding / get published! → modify relevant theory, research, practice • If p > 0.05: trash • What's wrong with this picture?
(Another quick review) Q: Wait, real quick: What's a p value? A1: The probability of obtaining results at least as extreme as those observed (e.g., differences between groups; a relationship between variables) given NO true difference between groups / no relationship between variables A2: NOT an indication of the magnitude, importance, direction, or replicability of an effect/relationship: WHAT WE REALLY WANT TO KNOW! • Also: Observed p values vary as a function of sample size (N), effect size (e.g., Cohen's d), and variance (see the sketch below).
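A minimal simulation sketch of that last point (the numbers are illustrative, not from the slides): hold the true effect fixed at d = 0.4 and watch the p value shrink as n per group grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d = 0.4  # fixed "true" effect in SD units

for n in (10, 40, 160):
    a = rng.normal(0.0, 1.0, n)   # control group
    b = rng.normal(d, 1.0, n)     # treatment group, shifted by d
    t, p = stats.ttest_ind(a, b)  # independent-samples t test
    print(f"n per group = {n:4d}  t = {t:6.2f}  p = {p:.4f}")
```

The effect never changes; only the sample size does, yet the verdict flips from "nonsignificant" to "highly significant."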
OK, on to the Controversy • 60+ years and 400+ articles (e.g., Anderson et al., 2000; Schmidt, 1996; Thompson, 2001) • "The almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories … is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology" (Meehl, 1967, p. 72). • APA Task Force on Statistical Inference (Wilkinson & TFSI, 1999) • Applied linguistics (AL): strict (dogmatic?) adherence to NHST; very little discussion until recently (Crookes, 1991; Ellis, 2006; Larson-Hall, 2010; Lazaraton, 1991; Nassaji, 2012; Norris, 2013; Norris & Ortega, 2000, 2006; Oswald & Plonsky, 2010; Plonsky, 2011, 2012, 2013; Plonsky & Gass, 2011) • http://oak.ucc.nau.edu/ldp3/bib_nhst.html
Wilkinson & TFSI (1999) • Purpose: "to initiate discussion in the field about changes in current practices of data analysis and reporting" • General recommendations: be transparent; calculate power a priori; inspect data descriptively and visually; simpler analyses are best • Specifics: report exact p values; report (contextualized) ESs for all tests; report CIs (a minimal sketch of such reporting follows)
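A hedged sketch of what TFSI-style reporting might look like for one comparison: an exact p value, an effect size, and a 95% CI rather than just "p < .05". The data here are hypothetical.

```python
import numpy as np
from scipy import stats

a = np.array([12, 15, 11, 14, 13, 16, 12, 15])  # hypothetical group A scores
b = np.array([14, 17, 16, 18, 15, 19, 16, 17])  # hypothetical group B scores

t, p = stats.ttest_ind(a, b)
sd_pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # equal-n pooled SD
d = (b.mean() - a.mean()) / sd_pooled                     # Cohen's d
ci = stats.t.interval(0.95, df=len(a) + len(b) - 2,
                      loc=b.mean() - a.mean(),
                      scale=sd_pooled * np.sqrt(1 / len(a) + 1 / len(b)))
print(f"exact p = {p:.4f}, d = {d:.2f}, "
      f"95% CI for the difference = [{ci[0]:.2f}, {ci[1]:.2f}]")
```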
NHST is Unreliable • "The effects of A and B are always different—in some decimal place—for any A and B. Thus asking 'are the effects different?' is foolish" (Tukey, 1991, p. 100). • The (nil) hypothesis that d = 0 is (almost) always false! (Cohen, 1994)
NHST is Unreliable (Cont'd) • Same goes for p values based on correlations • Remember: the very same r = .30 can come out at p < .05 or p > .05 depending entirely on N (see the sketch below)
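A quick illustration with my own numbers (not from the slides): the p value attached to r = .30 is purely a function of N, via the test statistic t = r√(N − 2)/√(1 − r²) with N − 2 degrees of freedom.

```python
from math import sqrt
from scipy import stats

r = 0.30
for N in (20, 44, 100):
    t = r * sqrt(N - 2) / sqrt(1 - r**2)
    p = 2 * stats.t.sf(t, df=N - 2)  # two-tailed p value
    print(f"N = {N:3d}  t = {t:5.2f}  p = {p:.3f}")
# r = .30 is "nonsignificant" at N = 20 (p ≈ .20) but "significant" at
# N = 100 (p ≈ .002), even though the relationship is identical in strength.
```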
NHST is Unreliable (Cont’d) “[with NHST] … tired researchers, having collected data on hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired.” Thompson, 1992, p. 436
NHST is Crude and Uninformative • Continuous data → yes/no dichotomy • p values say nothing about: • Replicability • Theoretical or practical importance • Magnitude of effects • p > .05 ≠ zero effect size: "The absence of evidence for differences is not evidence for equivalence" (Kline, 2004, p. 67) • Large p values can correspond to large effects and vice versa • Other explanations for p > .05? small sample/low power/high sampling error; small (i.e., hard-to-detect) effect size; unreliable instruments; weak treatment; other hidden variables; … • Appropriate only for a limited period of exploratory research • (There should be an) inverse relationship between theoretical maturity and reliance on p
NHST is Crude and Uninformative (Cont'd) • From Papi & Abdollahzadeh (2012): What could these t tests and resulting p values possibly contribute here?
Taylor et al. (2006): Do you see any similar patterns here? (Hint: look at the p values and ESs) • Key: p > .05 but sizeable d; p < .05 but not large d; p > .05 with negative d
NHST is Arbitrary • "…surely, God loves the .06 nearly as much as the .05" (Rosnow & Rosenthal, 1989, p. 1277) • How much more (or less) would we know if the conventional alpha level were .03 (or .15)? • What if tests of statistical significance never existed? (Harlow et al., 1997)
NHST is Counter-productive • Adherence to NHST (and p values) constrains progress of theory → inefficient research efforts • NHST & publication bias (Rothstein et al., 2005) • Scenario: 100 intervention studies; H0 is true (i.e., no difference between treatments A and B) with alpha = .05 • (At least) 5 studies will find p < .05 • 95 studies will sit unpublished, or be re-run until p < .05 (jelly beans cause acne) • Type 1 error (false positive) rate in published studies = 100% • Treatment effects (which are nil) become grossly overestimated (a simulation sketch follows)
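A sketch of that scenario under the stated assumptions (H0 true, alpha = .05; all numbers illustrative): simulate many two-group studies with a true effect of zero, "publish" only those with p < .05, and compare effect sizes in the published record against the full set.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, studies = 20, 10_000
all_d, published_d = [], []

for _ in range(studies):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)  # no true difference between treatments A and B
    t, p = stats.ttest_ind(a, b)
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    all_d.append(abs(d))
    if p < 0.05:             # the only studies that get published
        published_d.append(abs(d))

print(f"proportion 'significant':  {len(published_d) / studies:.3f}")  # ~.05
print(f"mean |d|, all studies:     {np.mean(all_d):.2f}")              # ~0.25
print(f"mean |d|, published only:  {np.mean(published_d):.2f}")        # ~0.7
```

Every "significant" result is a false positive, and the published literature reports a sizeable average effect where none exists.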
Summary • (Quantitative) linguistics research relies heavily on NHST, which is… • highly controversial at best and possibly dangerous and to-be-avoided; • unreliable; • crude and uninformative; • arbitrary; and • counter-productive • OK, but what can we do to improve?
Power (Or: a possible solution to our obsession with p values?)
Statistical Power • What is it? • Why does it matter? • How many participants do I need? (A very practical and common question)
What kind of power is needed vs. typical? • Table 2 in Cohen (1992): per-group Ns needed for power = .80 at α = .05 are roughly n = 393 (d = 0.2), n = 64 (d = 0.5), and n = 26 (d = 0.8) • Are these Ns typical in linguistics research?
What kind of power is needed vs. typical? • Plonsky & Gass (2011) • 2% conducted a power analysis • Median d = 0.65 with median n = 22 • Overall post hoc power = .56 (verified in the sketch below) • Plonsky (2013) • 1% (6/606 studies) conducted a power analysis • Median d = .71 (inflated?) with median n = 19 • Overall post hoc power = .57 • What does this mean for • Internal validity (and, hence, external validity/generalizability)? • Past research? • Theory-building? • Practical implications? • Availability bias in meta-analyses?
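A quick check of those post hoc power figures, using statsmodels rather than the slides' own tools: plugging in Plonsky & Gass's (2011) median values (d = 0.65, n = 22 per group) reproduces the reported power of about .56.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Post hoc power for a two-group t test at the field's median d and n:
power = analysis.power(effect_size=0.65, nobs1=22, alpha=0.05, ratio=1.0)
print(f"post hoc power: {power:.2f}")  # ~0.56
```

In other words, the typical published L2 study had roughly a coin-flip's chance of detecting its own median effect.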
The "Power Problem" in L2 Research (Plonsky, 2013, in press) • Rarely analyze power • Small samples (median = 18) • Heavy reliance on NHST • Effects not generally very large • Omission of non-statistical results • Rarely check assumptions • Rarely use multivariate statistics
Tools for Power Analysis • Cohen’s (1988, 1992) power tables • A priori • Conceptually? • Practically: http://danielsoper.com/statcalc3/calc.aspx?id=47 • Post hoc • Conceptually? • Practically: http://danielsoper.com/statcalc3/calc.aspx?id=49
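As a code-based alternative to the online calculators above, here is a minimal a priori power analysis sketch in statsmodels: how many participants per group are needed to detect a given d with power = .80 at α = .05?

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):
    n = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05)
    print(f"d = {d}: n per group ≈ {n:.0f}")
# Matches Cohen's (1992) Table 2: roughly 393, 64, and 26 per group.
```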
What if you can't get enough power? • This may be the case when, for example… • You're studying a very small or hard-to-find population (L3 learners of Swahili with L1 Korean) • You have limited funding for running participants • Your phenomenon/relationship/effect of interest is small (i.e., hard to detect) • Your advisor says you can't use the PSY participant pool • Avoid or limit inferential stats • Form fewer (sub)groups → fewer contrasts • Focus on descriptives (including effect sizes and CIs) • 'Bootstrap' the data?
Bootstrapping • Random re-sampling from observed data to produce a simulated but more stable outcome (see Larson-Hall & Herrington, 2010, and the sketch below) • (More) robust to outliers and non-normal data (both common) • Larson-Hall & Herrington (2010) • ANOVA: p > .05 between NSs (n = 15) and 3 learner groups (n = 14, 15, 15) • Tukey post hocs: p < .05 ONLY between NSs and Group A (p = .002); p = .407 (Group B); p = .834 (Group C) • Bootstrapped post hoc tests: p < .05 for all three groups • Original p values nonsignificant due to a lack of power: Type II error • Plonsky et al. (in press) • Re-analyzed raw data from 26 primary L2 studies • 4 (of 16) Type I 'misfits' (i.e., 25% Type I 'misfit' rate) • 0 Type II 'misfits' • Too much power (via large N) → inflated findings?
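A minimal percentile-bootstrap sketch (with hypothetical scores, not Larson-Hall & Herrington's data): resample each group with replacement many times to get a more stable interval around the group mean difference.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores for two small groups:
group_a = np.array([3.1, 2.8, 4.0, 3.5, 2.9, 3.7, 3.3, 2.6])
group_b = np.array([3.9, 4.2, 3.6, 4.5, 4.1, 3.8, 4.4, 4.0])

n_boot = 10_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group with replacement; record the mean difference
    a = rng.choice(group_a, size=group_a.size, replace=True)
    b = rng.choice(group_b, size=group_b.size, replace=True)
    boot_diffs[i] = b.mean() - a.mean()

lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"observed mean difference: {group_b.mean() - group_a.mean():.2f}")
print(f"95% bootstrap CI: [{lo:.2f}, {hi:.2f}]")
```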
BUT EVEN WITH GREATER POWER VIA BOOTSTRAPPING, OUR RESULTS ARE STILL BASED ON THE FLAWED NOTION OF STATISTICAL SIGNIFICANCE
EFFECT SIZES! (Or: a MUCH BETTER solution to our obsession with p values)
Effect Sizes Questions we’ll address • What are they? How do we calculate them? • What advantages do ESs provide over p values? • How can we interpret ESs?
What is an effect size? • A quantitative indication of the strength of a relationship or an effect • Common effect sizes • Standardized mean differences (Cohen's d) • d = (M1 − M2) / SDpooled (see the Excel macro for calculating d, or the sketch below) • Correlation coefficients (e.g., r) • Shared variance (R2, eta2) • Odds ratios (likelihood of A given B) • Percentages
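A small code sketch of the d formula above (offered alongside the Excel macro the slide mentions): Cohen's d as the mean difference divided by the pooled standard deviation. The scores are hypothetical.

```python
import numpy as np

def cohens_d(x1, x2):
    """Standardized mean difference: (M1 - M2) / SD_pooled."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    sd_pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / sd_pooled

# Hypothetical test scores:
treatment = [78, 84, 81, 90, 75, 88]
control   = [70, 73, 79, 68, 74, 77]
print(f"d = {cohens_d(treatment, control):.2f}")
```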
Why Effect Sizes? (An alternative to NHST) • Null Hypothesis Significance Testing (p) vs. Effect Sizes (d) • Unreliable: result dependent on sample size (e.g., Kline, 2009) → ESs: not dependent on N • Crude and uninformative: a) forces continuous data into a yes/no dichotomy; b) tells us nothing about practical significance or magnitude (e.g., Cohen, 1994) → ESs: express the magnitude/size of a relationship (i.e., WHAT WE REALLY WANT TO KNOW) • Arbitrary: "…surely, God loves the .06 nearly as much as the .05" (Rosnow & Rosenthal, 1989, p. 1277) → ESs: continuous and can be compared/combined across studies
Research Questions and Their Answers Using NHST vs. ESs • Think of a study you read recently or one that you're working on. • What were the RQs? • Were they phrased dichotomously (Do …? Is there a difference …?)? • If so, what kind of answer can come from such a RQ? • How might the findings differ with an emphasis on magnitude rather than presence/absence of a relationship or effect?
Why Effect Sizes? (Journal Requirements) • APA Publication Manual, 6th Edition • Three major L2 journals: Language Learning, TESOL Quarterly, Modern Language Journal • Plonsky & Gass (2011): 0% (1980s) → 0% (1990s) → 27% (2000s) • Plonsky (2013): 3% (1990s) → 42% (2000s) • So now effect sizes get reported more often…
…but very rarely do we interpret them • What do they mean anyway? How big is 'big'? And how small is 'small'? • What does d = 0.50 (or 0.10, or 1.00…) mean? • What implications do these effects have for future research, theory, and practice?
ESs: Summary • An empirically based, field-specific scale for d values in L2 research (ESs are best understood in relation to other, field-specific effects): • d ≈ 0.40 (small) • d ≈ 0.70 (medium) • d ≈ 1.00 (large) • "…if people interpreted effect sizes [using fixed benchmarks] with the same rigidity that .05 has been used in statistical testing, we would merely be being stupid in another metric" (Thompson, 2001, pp. 82–83). • Additional considerations: • Theoretical and methodological maturity (over time) • SD units • Research setting (lab vs. classroom; SL vs. FL) • Length/intensity of treatment • Manipulation of IVs • Publication bias • Sample size / sampling error • Instrument reliability
A Revised Model of Research • Conduct a study (e.g., the effects of A on B) • Whether p < 0.05 or p > 0.05, ask: d = ? (nothing goes in the trash) • Accumulation of results (via meta-analysis) → more precise and reliable estimate of effects → modify relevant theory, research, practice
Based on our discussion today, what changes would you suggest to the field?
10 Suggestions for Reform • 1. A diminished reliance on NHST / p values • 2. Drop the "significant" from "statistically significant" • 3. Focus on the practical and theoretical importance of results • 4. Better educate ourselves and future generations of researchers: emphasize ESs, alternatives to NHST, and synthetic-mindedness in primary research; de-emphasize NHST • 5. Report ESs (for all findings, not only when p < .05) • 6. Report CIs (for all findings, not only when p < .05): "a quiet but insistent reminder that no knowledge is complete or perfect" (Sagan, 1996) • 7. Replication (to mitigate effects of low power) • 8. Examine data visually • 9. Meta-analysis / a synthetic approach • 10. Initiative from the top down
Further Reading • Beyond significance testing (Kline, 2013, 2nd ed.) • The cult of statistical significance (Ziliak & McCloskey, 2008) • Understanding the new statistics (Cumming, 2012) • Effect sizes for research (Grissom & Kim, 2012, 2nd ed.) • Statistical power analysis for the behavioral sciences (Cohen, 1988, 2nd ed.)
Connections to Other Topics to be Discussed this Week • Meta-analysis (relies on ESs rather than p values) (TUESDAY) • Replication (THURSDAY) • Reporting practices (full descriptives including ESs, always; data transparency, etc.) (FRIDAY)
Tomorrow: Meta-analysis • Motivation for and benefits of (conceptual understanding) • Procedures/techniques (practical understanding)