Test tasks for speaking – balancing between authenticity and reliability

Test tasks for speaking – balancing between authenticity and reliability. Raili Hildén, University of Helsinki, Finland, raili.hilden@helsinki.fi. TBLT 2009, Lancaster: ‘Tasks: context, purpose and use’, 3rd Biennial International Conference on Task-Based Language Teaching, 13–16 September 2009.


Presentation Transcript


  1. Test tasks for speaking – balancing between authenticity and reliability. Raili Hildén, University of Helsinki, Finland, raili.hilden@helsinki.fi. TBLT 2009, Lancaster: ‘Tasks: context, purpose and use’, 3rd Biennial International Conference on Task-Based Language Teaching, 13–16 September 2009

  2. Background: HY-Talk project of speaking assessment • The project is funded by the University of Helsinki. • Its aim is to validate the illustrative scales of speaking included in the national core curricula for general education and upper secondary level by trialing a prototype test of speaking. • The subscales (overall task completion, fluency, pronunciation, range and accuracy) are empirically aligned to relevant scales of the CEFR. • http://blogs.helsinki.fi/hy-talk/ Raili Hildén 15.9.2009

  3. The conceptual framework • Validity argumentation scheme for interpretation of the HY-Talk project data (adapted from Kane, 2001; Fulcher & Davidson, 2007, 164–174; Bachman, 2005) • The claim to be probed: “The illustrative scales of descriptors of oral proficiency included in the national core curricula for language education enable sufficiently valid conclusions on students' oral proficiency in general school education in Finland.”

  4. The purpose of the HY-Talk study • The validity claim is supported and challenged by warrants and rebuttals regarding: • relevance • utility • (intended consequences) • sufficiency

  5. Warrants • The tasks used to elicit student performance correspond to pedagogic tasks and target language use (TLU) tasks of students at the age of general education. (utility) • Reliability of assessments based on the scale and the tasks to elicit performances is found to be high enough. (sufficiency)

  6. Backing to support the utility claim • Rater and test-taker feedback confirms the perceived authenticity of the tasks and the appropriateness of administration. • The level ratings correspond to the target levels in the curricula.

  7. Backing data to support the sufficiency claim • Statistical reliability evidence confirms a sufficient level of consistency across raters, tasks, languages and interlocutors.

  8. Counterclaims • The tasks used to elicit student performance correspond inadequately to pedagogic tasks or TLU tasks of students. (utility) • The link to the scale descriptors may be weak. (utility) • The level assignments do not match the target levels set in the curricula. • Reliability of assessments is not stable, but varies too much across tasks, raters or languages, or is affected by intervening variables or an inadequate evidence base. (sufficiency)

  9. Rebuttal data regarding the utility claim • Statistical evidence challenges the intended utility of the tasks. • Verbal data from students and teachers question the utility and/or sufficiency of the tasks for the purpose.

  10. Research questions 1. How high is the inter-rater reliability of the judgements? 2. How are the tasks and corresponding salient task features related to target level judgements, assessment criteria and their combination? (numeric data, analysed with Facets) 3. How are the tasks perceived by students and raters? (verbal data based on feedback sheets and audio-recorded rating sessions)

  11. Speaking tasks • Tasks were designed to reflect the average target level specified for good mastery of the syllabus • English (grade 7: A1.3, grade 1: A2.2) • German etc. (grade 7: A1.2, grade 1: A2.1) • They also draw on the thematic content of the curricula • Discussed, revised and piloted by the project group

  12. Prototype tasks (with examples) 1. Presentation (A2.2): partly controlled monologue 2. Everyday life (A2.1–A2.2): rigidly controlled dialogues • At the airport, grade 7 • At home, grade 7 • Accommodation, grade 1 • On the way home, grade 1 3. Negotiation: partly controlled dialogue, Planning an outing (A2.1–B1.1)

  13. Speaking tasks • Prompts in L1 • Time on task 10–15 min • Conducted in pairs • Rated by 5–10 language experts

  14. Data of this study • Speech samples in English (56) • Speech samples in German (66)

  15. Facets examined in this study • Raters (5 English, 7 German) • Tasks 1–4 • Task dimensions: overall task performance, fluency, pronunciation, range, accuracy
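
The slides do not spell out the measurement model, but the Facets program mentioned in the research questions fits a many-facet Rasch model. As a hedged sketch only, one common rating-scale form with the facets listed above could be written as follows; the symbols and the exact parameterisation are assumptions, not taken from the presentation.

```latex
% Assumed notation (not from the slides):
%   B_n : oral proficiency of student n
%   C_j : severity of rater j
%   D_i : difficulty of task i
%   E_m : difficulty of criterion (task dimension) m
%   F_k : difficulty of being rated in level k rather than level k-1
\[
  \ln\!\left(\frac{P_{nijmk}}{P_{nijm(k-1)}}\right)
  = B_n - C_j - D_i - E_m - F_k
\]
```

Here P_{nijmk} is the probability that rater j assigns level k to student n on task i and criterion m. All parameters are expressed in logits, which is the unit in which rater distances are reported in the results.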

  16. Results: RQ1, English samples: overall inter-rater agreement • The majority of total ratings were placed between levels 5–6 (CEFR A2–B1) • Across all facets and raters, the distance between the most severe and the most lenient rater was 1 logit (levels 5/6) • Average of ratings given by R4: 6.66 • Average of ratings given by R1: 5.87 • For a more detailed record, please contact the author.
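
As a rough illustration of the rater comparison above, the following minimal sketch computes the gap between the two reported rater averages. It assumes that a higher average rating indicates a more lenient rater and that R4 and R1 are the two extremes; the gap is in raw scale levels, not in the logits reported by Facets. The function name `severity_gap` is hypothetical.

```python
# Mean ratings reported on the slide for the English samples.
rater_means = {"R4": 6.66, "R1": 5.87}


def severity_gap(means):
    """Return (most lenient rater, most severe rater, gap in scale levels).

    Assumption: a higher mean rating means a more lenient rater.
    """
    lenient = max(means, key=means.get)
    severe = min(means, key=means.get)
    return lenient, severe, round(means[lenient] - means[severe], 2)


lenient, severe, gap = severity_gap(rater_means)
print(f"{lenient} (most lenient) vs {severe} (most severe): {gap} scale levels")
```

The raw gap stays below one scale level, which is in line with the ratings clustering between levels 5 and 6.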

  17. Results: RQ1, English samples: overall task difficulty “The easiest” task: • Presentation was assigned the highest fair average of 6.29 “The trickiest” task: • Everyday life task “Accommodation” was assigned the lowest fair average of 6.21 • For a more detailed record, please contact the author.

  18. Results: RQ1, English samples: criteria • “The easiest” criterion: Pronunciation (fair average 6.39) • “The trickiest” criterion: Range (fair average 6.02) • For a more detailed record, please contact the author.

  19. Results: RQ1, English samples: combined difficulty (task + criterion) “The easiest” combinations: • Presentation + Accuracy • Presentation + Fluency “The trickiest” combination: • Everyday situation: Accommodation + Range • For a more detailed record, please contact the author.

  20. Results: RQ1, German samples: overall inter-rater agreement • The majority of total ratings were placed between levels 5–6/10 (CEFR A2–B1) • Across all facets and raters, the distance between the most severe and the most lenient rater was 1 logit (levels 5–6) • Average of ratings given by R6: 3.96/10 • Average of ratings given by R2: 3.57/10 • For a more detailed record, please contact the author.

  21. Results: RQ1, German samples: overall task difficulty “The easiest” task: • Presentation task was assigned the highest fair average of 4.21/10 “The trickiest” task: • Everyday life task “On the way home” was assigned the lowest fair average of 3.57/10 • For a more detailed record, please contact the author.

  22. Results: RQ1, German samples: criteria • “The easiest” criterion: Pronunciation (fair average 4.24/10) • “The trickiest” criterion: Range (fair average 3.49/10) • For a more detailed record, please contact the author.

  23. Results: RQ1, German samples: combined difficulty (task + criterion) “The easiest” combination: • Presentation + Pronunciation (level 6 = B1.1) “The trickiest” combination: • Negotiation (Planning an outing) + Range (level 5 = A2.2, lower band) • For a more detailed record, please contact the author.
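
To put the reported fair averages side by side, the sketch below computes the spread between the highest and lowest fair average for tasks and for criteria in each language, using only the figures given in the results slides. The dictionary layout and the name `spread` are illustrative assumptions; the slides report only the extremes, so nothing is inferred about the items in between.

```python
# (highest, lowest) fair averages as reported in the results slides.
fair_averages = {
    ("English", "task"): (6.29, 6.21),       # Presentation vs Accommodation
    ("English", "criterion"): (6.39, 6.02),  # Pronunciation vs Range
    ("German", "task"): (4.21, 3.57),        # Presentation vs On the way home
    ("German", "criterion"): (4.24, 3.49),   # Pronunciation vs Range
}


def spread(highest, lowest):
    """Difference between the easiest and the trickiest element, in scale levels."""
    return round(highest - lowest, 2)


spreads = {key: spread(*values) for key, values in fair_averages.items()}

for (language, facet), value in spreads.items():
    print(f"{language:8s} {facet:10s} spread: {value:.2f}")
```

On these figures every spread is below one scale level, and in both languages the criteria differ more than the tasks do.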

  24. RQ2: English & German • The tasks were conceived as authentic in regard to themes and situations. • Authenticity (Bachman & Palmer, 1996) was questioned by raters during the sessions due to the high degree of control imposed by the L1 prompts (to increase reliability). • Students regarded the tasks as relevant and highly probable in real life. • The raters of German discussed the interlocutor impact of the pair setting as a biasing factor. • The results suggest that the target level requirements set in the Finnish curricula are attained reasonably well.

  25. Discussion • The utility claim was confirmed as to the high level of agreement of raters across facets (reliability). • Sufficiency and relevance were partly questioned due to the claimed unauthenticity of the tasks (rigor of instructions). • How should this dilemma be addressed in future versions of the test?

  26. References • Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34. • Fulcher, G. & Davidson, F. (2007). Language Testing and Assessment: An advanced resource book. Abingdon & New York: Routledge. • Hildén, R. & Takala, S. (2007). Relating Descriptors of the Finnish School Scale to the CEF Overall Scales for Communicative Activities. In Koskensalo, A., Smeds, J., Kaikkonen, P. & Kohonen, V. (Eds.), Foreign languages and multicultural perspectives in the European context; Fremdsprachen und multikulturelle Perspektiven im europäischen Kontext. Dichtung, Wahrheit und Sprache (pp. 73–88). LIT-Verlag.

  27. Bibliography • National Core Curriculum for the Comprehensive School 2004. Helsinki: Finnish National Board of Education. In Finnish: http://www.oph.fi/info/ops/ • National Core Curriculum for the Upper Secondary Level 2003. Helsinki: Finnish National Board of Education. In Finnish: http://www.oph.fi/pageLast.asp?path=1,17627,1830,23059 • Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.

  28. Thank you! raili.hilden@helsinki.fi
