«A chi-square test showed that ...» – or did it really?


  Bård Uri Jensen, http://privat.hihm.no/buj/, bard.jensen@hihm.no

  Allowing [statistical software] to do our thinking is a sure recipe for disaster. (Good & Hardin, 2012, p. xi)

  «Simple» statistical tests • chi-square (χ²) test • t-test

  Statistical hypothesis testing • Formulate a hypothesis • E.g. in Norwegian L2, Vietnamese speakers make more TENSE errors than Somali speakers. • Formulate a null hypothesis • Vietnamese and Somali speakers have the same rate of TENSE errors. • «Disprove» the null hypothesis = demonstrate its unlikelihood • E.g. less than 5% chance of data at least this extreme if the null hypothesis is true • = «Significance» • We choose α according to what we consider an acceptable risk of false conclusions • Often 5% in linguistic research
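  The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended workflow: the counts are hypothetical, and the sketch assumes that each learner contributes exactly one observation, so that the observations are independent. The function name `chi_square_2x2` and the constant `ALPHA` are made up for this example.

```python
import math

ALPHA = 0.05  # chosen before looking at the data


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    statistic = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts, one observation per learner.
# Rows: Vietnamese, Somali. Columns: TENSE error, no TENSE error.
stat, p = chi_square_2x2(30, 20, 18, 32)
print(f"chi-square(1) = {stat:.2f}, p = {p:.4f}")
print("reject the null hypothesis" if p < ALPHA else "do not reject the null hypothesis")
```

  The arithmetic is the easy part. Whether the p-value means anything depends on the conditions of use, which the code cannot check.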

  Conditions of use • Independent observations • chi-square test • t-test • Parametric assumptions • t-test • The dangers of repeated testing • any test

  A simple example from corpus linguistics • The observations should be independent. • An important condition of use for • chi-square test • t-test • The observations should be of different individuals. «Chi-square is a much-abused test in second language research studies, and often one of its assumptions (that of independence of data) is violated as a matter of course.» Larson-Hall (2010, p. 206)
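  Why independence matters can be shown with a small simulation. The sketch below is an illustration under stated assumptions, not a reanalysis of any study cited here: two groups are drawn from the same population (so the null hypothesis is true), each speaker has an individual error rate drawn from a Beta(2, 2) distribution, and all tokens are pooled into a 2×2 table as if they were independent. All names and parameter values are hypothetical.

```python
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def pooled_counts(rng, speakers, tokens_per_speaker):
    """Pool error / non-error tokens over speakers with individual error rates."""
    errors = 0
    for _ in range(speakers):
        rate = rng.betavariate(2, 2)  # this speaker's own error rate
        errors += sum(rng.random() < rate for _ in range(tokens_per_speaker))
    total = speakers * tokens_per_speaker
    return errors, total - errors


def false_positive_rate(speakers, tokens_per_speaker, runs=500, seed=1):
    """Share of runs where a true null hypothesis is rejected at alpha = 0.05."""
    rng = random.Random(seed)
    critical = 3.841  # chi-square critical value for df = 1, alpha = 0.05
    hits = 0
    for _ in range(runs):
        a, b = pooled_counts(rng, speakers, tokens_per_speaker)
        c, d = pooled_counts(rng, speakers, tokens_per_speaker)
        if chi_square_2x2(a, b, c, d) > critical:
            hits += 1
    return hits / runs


# Same number of tokens per group (400), different numbers of speakers.
independent = false_positive_rate(speakers=400, tokens_per_speaker=1)
clustered = false_positive_rate(speakers=20, tokens_per_speaker=20)
print(f"one token per speaker:  {independent:.3f}")
print(f"20 tokens per speaker:  {clustered:.3f}")
```

  With one token per speaker the rejection rate should stay near the nominal 5%. With 20 tokens from each of 20 speakers it should be several times higher, even though both groups come from the same population.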

  Example 1: Chi-square test, non-independent observations • Blom & Paradis 2013 • Journal of Speech, Language, and Hearing Research • On past tense production in L2 children with language impairment • 48 children with English as L2 • Overregularization of past tense • Hypothesis: Less common in verb stems ending in /d/ or /t/ • χ²(1) = 3.45, p (one-sided) = 0.032 • Problem: n = 85 + 140 observations, but only N = 48 children • Observations are not independent, so the result is invalid.

  Example 1: Chi-square test, non-independent observations • Solution A: • Pick just one observation from each author/speaker • "To exclude the author as one more relevant factor, the database was cleaned so that there is only one example for each verb from any single author." Sokolova 2012, p. 94

  Example 1: Chi-square test, non-independent observations • Solution A: • Pick just one observation from each author/speaker • Sokolova 2012 • Solution B: • Calculate average values for each informant • Use the average values as independent observations • Test significance with an appropriate test, e.g. t-test or U-test • Gujord 2013 • Both these solutions might require a larger corpus! • «Solution» C: • Alter the research question • Danckaert 2011
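  Solution B can be sketched as follows. The data are invented for illustration, and the sketch only computes Welch's t statistic and its degrees of freedom from the per-informant averages; it is not a description of how Gujord (2013) carried out the analysis. The names `observations`, `rates` and `welch_t` are hypothetical.

```python
import math
from statistics import mean, variance

# Hypothetical token-level data: informant -> (L1 group, outcomes), 1 = error.
observations = {
    "v1": ("Vietnamese", [1, 1, 0, 1]),
    "v2": ("Vietnamese", [0, 1]),
    "v3": ("Vietnamese", [1, 0, 0, 0, 1]),
    "s1": ("Somali", [0, 0, 1]),
    "s2": ("Somali", [0, 0, 0, 0]),
    "s3": ("Somali", [1, 0]),
}

# Step 1: one average per informant, so each informant counts exactly once.
rates = {}
for informant, (group, outcomes) in observations.items():
    rates.setdefault(group, []).append(mean(outcomes))


def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Step 2: compare the groups using the informant averages as observations.
t, df = welch_t(rates["Vietnamese"], rates["Somali"])
print(f"n = {len(rates['Vietnamese'])} + {len(rates['Somali'])} informants")
print(f"t = {t:.2f}, df = {df:.1f}")
```

  Note how the sample size shrinks from 20 tokens to 6 informants, which is why this solution might require a larger corpus. To turn the statistic into a p-value one would consult a t table or a library; in SciPy, for instance, `scipy.stats.ttest_ind(x, y, equal_var=False)` performs Welch's test and `scipy.stats.mannwhitneyu(x, y)` the U-test. The t-test brings its own parametric assumptions, which still have to be checked.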

  Example 1: Chi-square test, non-independent observations • Solution B:

  Example 2: T-test, non-independent observations • Klavan 2012 • PhD thesis from Tartu University • Investigation of the adposition 'peal' and the adessive case • 450 observations of each, from 2 corpora • t = 8.02, p < 0.001 • Conclusion: adessive phrases are longer than 'peal'-phrases • Problem: Observations are not independent. • The conclusion is invalid.

  Example 3: T-test, non-normal populations • Hunter (2011, p. 48) • PhD thesis from Birmingham University • On grammaticality judgements by L2 students • Conclusion: • the accuracy (max. = 1) for the teacher group (M = .98, SD = .14) was significantly higher than the student group (M = .64, SD = .49), t(1) = 4.9, p < .001. • Problem: • Mean = 0.98, Maximum value = 1 • Standard deviation = 0.14 • The distribution cannot possibly be normal. • The result is invalid.
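  The argument can be checked with simple arithmetic. The sketch below asks how much of a normal distribution with the reported means and standard deviations would lie above the maximum possible score of 1. It also compares the reported standard deviations with those of a pure 0/1 variable; that the judgements were scored 0 or 1 is an assumption made here, not something stated in the source. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def share_above(mean, sd, maximum=1.0):
    """Share of a normal distribution N(mean, sd) lying above `maximum`."""
    return 1 - NormalDist(mean, sd).cdf(maximum)


def binary_sd(p):
    """Population standard deviation of a 0/1 variable with mean p."""
    return math.sqrt(p * (1 - p))


# Reported values: teachers M = .98, SD = .14; students M = .64, SD = .49.
teachers_above = share_above(0.98, 0.14)
students_above = share_above(0.64, 0.49)
print(f"teachers: {teachers_above:.0%} of a normal curve would exceed the maximum")
print(f"students: {students_above:.0%} of a normal curve would exceed the maximum")

# Assumption: accuracy was scored 0 or 1 per judgement.
print(f"SD of a 0/1 variable with mean .98: {binary_sd(0.98):.2f}")
print(f"SD of a 0/1 variable with mean .64: {binary_sd(0.64):.2f}")
```

  A normal distribution fitted to the teacher group would place a large share of its values above a score that cannot be exceeded, so the population cannot be normal. The reported standard deviations are also close to what a two-valued 0/1 variable would give, which, if that reading is right, is about as far from a normal distribution as data can get.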

  Example 4: Repeated testing • Leedham 2011 • PhD thesis, The Open University • Features in the writing of Chinese students in UK universities • Conclusion: • There are differences in frequencies of certain phrases between 3rd-year students and younger students • Problem: • Repeated testing without adjusting the probability values • Some of the results are not valid.
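  The danger of repeated testing, and two common ways of adjusting for it, can be sketched as follows. The slides do not name a particular correction; Bonferroni and Holm are used here only as well-known examples. The p-values are invented, and the family-wise error formula assumes that the tests are independent.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: test p-values in ascending order, stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


# Hypothetical p-values from five tests on the same data set.
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]

print(f"risk of a false positive in 20 tests: {familywise_error(0.05, 20):.0%}")
print("uncorrected:", [p <= 0.05 for p in p_values])
print("Bonferroni: ", bonferroni(p_values))
print("Holm:       ", holm(p_values))
```

  Four of the five hypothetical results look significant without correction, but only one survives Bonferroni and two survive Holm.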

  Moral: There are no simple tests. • You should understand the conditions of the test. • You should take the conditions into account. • You should document properly • how you perform the test, • what numbers you put into it, • how the conditions are met. «A chi-square test showed that the difference is significant.»

  Is it really that important? • «[C]ompared to other social sciences (e.g., psychology, communication, sociology, anthropology, …) or branches of linguistics (e.g., psycholinguistics, phonetics, sociolinguistics…), most of corpus linguistics has paradoxically only begun to develop this methodological awareness.» Gries (forthcoming, p. 1)

  Is it really that important? • «It has become increasingly apparent over a period of several years that psychologists, taken in the aggregate, employ the chi-square test incorrectly.» Lewis and Burke (1949)

  «Corpus linguistics needs to 'catch up' [...]» Gries (forthcoming, p.1)

  References (http://privat.hihm.no/buj)
  Boneau, A. C. (1960). The effects of violations of assumptions underlying the t test. Psychological Bulletin, 57(1), 49-64.
  Good, P. I. & Hardin, J. W. (2012). Common errors in statistics (and how to avoid them). Hoboken: John Wiley.
  Gries, S. (forthcoming). Quantitative designs and statistical techniques. http://www.linguistics.ucsb.edu/faculty/stgries/research/InProgr_STG_QuantDesAndMethCorpLing_CUPHb.pdf
  Larson-Hall, J. (2010). A Guide to Doing Statistics in Second Language Research Using SPSS. New York: Routledge.
  Lewis, D. & Burke, C. J. (1949). The use and misuse of the chi-square test. Psychological Bulletin, 46(6), 433-489.
  Blom & Paradis (2013). Past Tense Production by English Second Language Learners With and Without Language Impairment. Journal of Speech, Language, and Hearing Research, 56, 281-294.
  Danckaert, L. (2011). On the left periphery of Latin embedded clauses. Ph.D. thesis. University of Gent.
  Gujord, A. H. (2013). Grammatical encoding of past time in L2 Norwegian: The roles of L1 influence and verb semantics. Ph.D. thesis. University of Bergen.
  Hunter, J. D. (2011). A multi-method investigation of the effectiveness and utility of delayed corrective feedback in second-language oral production. Ph.D. thesis. University of Birmingham.
  Klavan, J. (2012). Evidence in linguistics: corpus-linguistic and experimental methods for studying grammatical synonymy. Ph.D. thesis. University of Tartu.
  Leedham, M. (2011). A corpus-driven study of features of Chinese students' undergraduate writing in UK universities. Ph.D. thesis. The Open University.
  Sokolova, S. (2012). Asymmetries in Linguistic Construal: Russian Prefixes and the Locative Alternation. Ph.D. thesis. University of Tromsø.

