«A chi-square test showed that ...» – or did it really?

Bård Uri Jensen
http://privat.hihm.no/buj/
bard.jensen@hihm.no

Allowing [statistical software] to do our thinking is a sure recipe for disaster. (Good & Hardin, 2012, p. xi)

«Simple» statistical tests
• chi-square (χ²) test
• t-test

Statistical hypothesis testing
• Formulate a hypothesis
  • E.g. in L2 Norwegian, Vietnamese speakers make more TENSE errors than Somali speakers.
• Formulate a null hypothesis
  • Vietnamese and Somali speakers have the same rate of TENSE errors.
• «Disprove» the null hypothesis = demonstrate its unlikelihood
  • E.g. less than 5% chance of observing such a difference if the null hypothesis were true
  • = «Significance»
• We choose α according to what we consider an acceptable risk of false conclusions
  • Often 5% in linguistic research
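The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a recommended workflow. The counts are hypothetical (learners with and without at least one TENSE error, one observation per learner), and the helper name `chi_square_2x2` is made up for this sketch. The p-value uses the fact that the upper tail of a chi-square distribution with 1 degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail probability of a chi-square distribution with 1 df
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical counts, one observation per learner:
# group 1: 30 learners with a TENSE error, 20 without
# group 2: 18 learners with a TENSE error, 32 without
stat, p_value = chi_square_2x2(30, 20, 18, 32)

alpha = 0.05                     # chosen before looking at the data
reject_null = p_value < alpha
print(f"chi-square(1) = {stat:.2f}, p = {p_value:.4f}, reject null: {reject_null}")
```

The calculation is only meaningful if each count comes from a different individual, which is the condition the rest of the talk is about.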

Conditions of use
• Independent observations
  • chi-square test
  • t-test
• Parametric assumptions
  • t-test
• The dangers of repeated testing
  • any test
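The danger of repeated testing can be made concrete with a little arithmetic. The sketch below assumes that the tests are independent of each other and that every null hypothesis is true. Both are simplifying assumptions that are not in the slides. Under them, the chance of at least one false «significant» result across k tests is 1 − (1 − α)^k.

```python
def familywise_error_rate(alpha, tests):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** tests

# Risk of at least one false positive at alpha = 0.05
rates = {k: familywise_error_rate(0.05, k) for k in (1, 5, 20)}
for k, rate in rates.items():
    print(f"{k:2d} tests: {rate:.3f}")
```

With a single test the risk equals α. It grows quickly as more tests are run on the same data.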

A simple example from ornithology

A simple example from corpus linguistics

A simple example from corpus linguistics
• The observations should be independent.
  • An important condition of use for
    • chi-squared test
    • t-test
• The observations should be of different individuals.
«Chi-square is a much-abused test in second language research studies, and often one of its assumptions (that of independence of data) is violated as a matter of course.» Larson-Hall (2010, p. 206)

Example 1: Chi-squared test, non-independent observations
• Blom & Paradis 2013
  • Journal of Speech, Language, and Hearing Research
  • On past tense production in L2 children with language impairment
• 48 children with English as L2
• Overregularization of past tense
• Hypothesis: Less common in verb stems ending in /d/ or /t/
• χ²(1) = 3.45, p (one-sided) = 0.032
• Problem: n = 85 + 140 observations, but only N = 48 children
• Observations are not independent, so the result is invalid.
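Why non-independence matters can be illustrated with a small simulation. This is a generic sketch under stated assumptions, not a reconstruction of the Blom & Paradis data or design. Two groups are drawn from the same population, so the null hypothesis is true by construction. Each speaker has an individual error rate, drawn here from a Beta(0.5, 0.5) distribution as an arbitrary choice. All function names are hypothetical. The last line only checks that the reported one-sided p-value follows from the reported statistic.

```python
import math
import random

def chi_square_p(a, b, c, d):
    """Two-sided p-value of a Pearson chi-square test on the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    if 0 in rows or 0 in cols:
        return 1.0  # degenerate table: no evidence either way
    stat = n * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
    return math.erfc(math.sqrt(stat / 2))

def pooled_counts(rng, speakers, tokens_per_speaker):
    """Pool the tokens of all speakers in one group into (errors, non-errors)."""
    errors = 0
    for _ in range(speakers):
        rate = rng.betavariate(0.5, 0.5)  # speaker-specific error rate (assumption)
        errors += sum(rng.random() < rate for _ in range(tokens_per_speaker))
    total = speakers * tokens_per_speaker
    return errors, total - errors

def false_positive_rate(speakers, tokens_per_speaker, runs=2000, alpha=0.05, seed=1):
    """Share of runs in which the test is «significant» although the null is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        a, b = pooled_counts(rng, speakers, tokens_per_speaker)
        c, d = pooled_counts(rng, speakers, tokens_per_speaker)
        if chi_square_p(a, b, c, d) < alpha:
            hits += 1
    return hits / runs

# Same number of tokens per group (120), different number of speakers
independent = false_positive_rate(speakers=120, tokens_per_speaker=1)
clustered = false_positive_rate(speakers=24, tokens_per_speaker=5)
print(f"one token per speaker:   {independent:.3f}")
print(f"five tokens per speaker: {clustered:.3f}")

# Consistency check: one-sided p for the reported chi-square(1) = 3.45
reported_p_one_sided = math.erfc(math.sqrt(3.45 / 2)) / 2
print(f"one-sided p for 3.45: {reported_p_one_sided:.3f}")
```

With one token per speaker the false positive rate should stay close to α. With several tokens per speaker it rises well above α, although nothing differs between the groups. The arithmetic of the test is fine in both cases. What fails is the condition of use.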

Example 1: Chi-squared test, non-independent observations
• Solution A:
  • Pick just one observation from each author/speaker
  • “To exclude the author as one more relevant factor, the database was cleaned so that there is only one example for each verb from any single author.” Sokolova 2012, p. 94

Example 1: Chi-squared test, non-independent observations
• Solution A:
  • Pick just one observation from each author/speaker
  • Sokolova 2012
• Solution B:
  • Calculate average values for each informant
  • Use the average values as independent observations
  • Test significance with an appropriate test, e.g. t-test or U-test
  • Gujord 2013
• Both these solutions might require a larger corpus!
• «Solution» C:
  • Alter the research question
  • Danckaert 2011
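Solution B can be sketched as follows. The tokens, informant labels and group names are hypothetical. To keep the example free of external packages, an exact permutation test on the per-informant averages is swapped in for the t-test or U-test named on the slide. The aggregation step is the part that matters: after it, each informant contributes exactly one value.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical tokens: (informant, group, 1 = error / 0 = correct)
tokens = [
    ("v1", "A", 1), ("v1", "A", 1), ("v1", "A", 0),
    ("v2", "A", 1), ("v2", "A", 0),
    ("v3", "A", 1), ("v3", "A", 1), ("v3", "A", 1), ("v3", "A", 0),
    ("v4", "A", 0), ("v4", "A", 1),
    ("s1", "B", 0), ("s1", "B", 0), ("s1", "B", 1),
    ("s2", "B", 0), ("s2", "B", 0),
    ("s3", "B", 1), ("s3", "B", 0), ("s3", "B", 0), ("s3", "B", 0),
    ("s4", "B", 0), ("s4", "B", 1),
]

# Step 1: one average per informant
by_informant = defaultdict(list)
for informant, group, error in tokens:
    by_informant[(informant, group)].append(error)
rates = {key: mean(values) for key, values in by_informant.items()}

group_a = [rate for (_, group), rate in rates.items() if group == "A"]
group_b = [rate for (_, group), rate in rates.items() if group == "B"]

# Step 2: exact two-sided permutation test on the difference in group means
observed = abs(mean(group_a) - mean(group_b))
pooled = group_a + group_b
indices = range(len(pooled))
extreme = splits = 0
for chosen in combinations(indices, len(group_a)):
    first = [pooled[i] for i in chosen]
    second = [pooled[i] for i in indices if i not in chosen]
    splits += 1
    if abs(mean(first) - mean(second)) >= observed - 1e-12:
        extreme += 1
p_value = extreme / splits
print(f"informants: {len(pooled)}, p = {p_value:.3f}")
```

The 22 tokens shrink to 8 independent values, so the test has much less to work with. That is the practical meaning of the remark that this solution might require a larger corpus.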

Example 1: Chi-squared test, non-independent observations
• Solution B:

Example 2: T-test, non-independent observations
• Klavan 2012
  • PhD thesis from Tartu University
• Investigation of adposition ‘peal’ and adessive case
• 450 observations of each, from 2 corpora
• t = 8.02, p < 0.001
• Conclusion: adessive phrases are longer than ‘peal’-phrases
• Problem: Observations are not independent.
• The conclusion is invalid.


Example 3: T-test, non-normal populations
• Hunter (2011, p. 48)
  • PhD thesis from Birmingham University
  • On grammaticality judgements by L2 students
• Conclusion:
  • the accuracy (max. = 1) for the teacher group (M = .98, SD = .14) was significantly higher than the student group (M = .64, SD = .49), t(1) = 4.9, p < .001.
• Problem:
  • Mean = 0.98, Maximum value = 1
  • Standard deviation = 0.14
  • The distribution cannot possibly be normal.
• The result is invalid.
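The argument on this slide can be checked numerically. The sketch below asks what share of a normal distribution with the reported mean and standard deviation would lie above the maximum possible score. As a side observation, it also computes the standard deviation that pure 0/1 scores would have at the reported means. That reading of the data is an assumption of this sketch and is not stated in the slides.

```python
import math
from statistics import NormalDist

mean_acc, sd_acc, max_acc = 0.98, 0.14, 1.0

# Share of a normal distribution N(0.98, 0.14) lying above the maximum score
share_above_max = 1 - NormalDist(mean_acc, sd_acc).cdf(max_acc)
print(f"share above the maximum under normality: {share_above_max:.2f}")

# Assumption: if every score were 0 or 1, the SD would be sqrt(p * (1 - p))
binary_sd_teachers = math.sqrt(0.98 * (1 - 0.98))
binary_sd_students = math.sqrt(0.64 * (1 - 0.64))
print(f"SD of 0/1 scores at M = .98: {binary_sd_teachers:.2f}")
print(f"SD of 0/1 scores at M = .64: {binary_sd_students:.2f}")
```

A normal distribution with these parameters would put a large share of the teachers above a score that cannot be exceeded, so normality is ruled out by the summary statistics alone.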


Example 4: Repeated testing
• Leedham 2011
  • PhD thesis, The Open University
  • Features in the writing of Chinese students in UK universities
• Conclusion:
  • There are differences in frequencies of certain phrases between 3rd-year students and younger students
• Problem:
  • Repeated testing without adjusting the probability values
• Some of the results are not valid.
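Adjusting the probability values can be sketched as below. The slides do not say which adjustment should be used. Bonferroni and Holm are shown here as two common choices, and the raw p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment: less conservative than Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values in the original order
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

raw = [0.003, 0.012, 0.021, 0.040, 0.300]  # hypothetical results of five tests
alpha = 0.05

significant_raw = sum(p < alpha for p in raw)
significant_bonferroni = sum(p < alpha for p in bonferroni(raw))
significant_holm = sum(p < alpha for p in holm(raw))
print(significant_raw, significant_bonferroni, significant_holm)
```

Several results that look significant on their own no longer pass once the number of tests is taken into account.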


Moral
There are no simple tests.
• You should understand the conditions of the test.
• You should take the conditions into account.
• You should document properly
  • how you perform the test,
  • what numbers you put into it,
  • how the conditions are met.
«A chi-square test showed that the difference is significant.»

Is it really that important?
• «[C]ompared to other social sciences (e.g., psychology, communication, sociology, anthropology, …) or branches of linguistics (e.g., psycholinguistics, phonetics, sociolinguistics…), most of corpus linguistics has paradoxically only begun to develop this methodological awareness.» Gries (forthcoming, p. 1)

Is it really that important?
• «It has become increasingly apparent over a period of several years that psychologists, taken in the aggregate, employ the chi-square test incorrectly.» Lewis and Burke (1949)

Whose responsibility is it?

«Corpus linguistics needs ‘catch up’ [...]» Gries (forthcoming, p. 1)

References (http://privat.hihm.no/buj)
Boneau, A. C. (1960). The effects of violations of assumptions underlying the t test. Psychological Bulletin, 57(1), 49-64.
Good, P. I. & Hardin, J. W. (2012). Common errors in statistics (and how to avoid them). Hoboken: John Wiley.
Gries, S. (forthcoming). Quantitative designs and statistical techniques. http://www.linguistics.ucsb.edu/faculty/stgries/research/InProgr_STG_QuantDesAndMethCorpLing_CUPHb.pdf
Larson-Hall, J. (2010). A Guide to Doing Statistics in Second Language Research Using SPSS. New York: Routledge.
Lewis, D. & Burke, C. J. (1949). The use and misuse of the chi-square test. Psychological Bulletin, 46(6), 433-489.
Blom & Paradis (2013). Past Tense Production by English Second Language Learners With and Without Language Impairment. Journal of Speech, Language, and Hearing Research, 56, 281-294.
Danckaert, L. (2011). On the left periphery of Latin embedded clauses. Ph.D. thesis. University of Gent.
Gujord, A. H. (2013). Grammatical encoding of past time in L2 Norwegian: The roles of L1 influence and verb semantics. Ph.D. thesis. University of Bergen.
Hunter, J. D. (2011). A multi-method investigation of the effectiveness and utility of delayed corrective feedback in second-language oral production. Ph.D. thesis. University of Birmingham.
Klavan, J. (2012). Evidence in linguistics: corpus-linguistic and experimental methods for studying grammatical synonymy. Ph.D. thesis. University of Tartu.
Leedham, M. (2011). A corpus-driven study of features of Chinese students’ undergraduate writing in UK universities. Ph.D. thesis. The Open University.
Sokolova, S. (2012). Asymmetries in Linguistic Construal: Russian Prefixes and the Locative Alternation. Ph.D. thesis. University of Tromsø.
