Half a century of observing and doing language assessment and testing: Some lessons learned? Sauli Takala, University of Gothenburg, September 21, 2010
Lee J. Cronbach, one of the giants in measurement, testing and evaluation, gave a valedictory at the first anniversary conference of the Board of Testing and Assessment in 1994. • It was intended to consist of casual reminiscences of the olden days. He was planning to say farewell and to convey how glad he was to see others fight the testing battles. • He found that the plan for lightsome retrospection did not hold up: ”Measurement specialists live in interesting times, and there are important matters to talk about.” He was going to draw on past experiences extending back to his first involvement with a national educational assessment in 1939, • but his topic was really ”clear and present dangers”. • Is there some lesson for me/us here? What might be clear and present dangers in language testing and assessment?
Some lightsome reminiscences of the past close to 50 years • How it all started – 1965; from philology to testing/assessment • Lado (1961) and Educational Measurement (1951) to the rescue • Sweden – Torsten Lindblad and other Swedish colleagues • IEA – Six Subject Survey (English, French): late 1960s/early 1970s • IEA 6-week international seminar on curriculum development in Gränna, August 1971; director of program: Benjamin S. Bloom (HFS) • My main mentors: Robert Lado, John B. Carroll, Rebecca Valette, Alan Davies, Bernard Spolsky, Albert Pilliner, Sandra Savignon, Lyle Bachman, Charles A.; John Trim (CoE projects) • Spolsky's AILA talk on ”art or science” in 1976; AILA/Lund/Krashen • Elana Shohamy's preliminary edition of a practical handbook (1985) – oral testing topical; her book was very helpful • IEA International Study of Writing 1981-1984 (UIUC); L1; tested by Lyle; Ed Psych, Center for the Study of Reading • ACTFL, ILR; FSI (”mother of all scales”) • DIALANG; Felly, Norman, Charles, Neus...; diagnosis, CEFR scales • Ingrian returnees: different culture; computerized testing • EALTA: Felly and Norman; Neus and Gudrun • CoE: Manual, Reference Supplement; 2002-
Some general reflections based on this experience: • Language testing/assessment has developed new approaches reflecting developments in e.g. (applied) linguistics, psychology and test theory, and also due to social/educational changes (norm-ref., crit.-ref., self-assessment; perf./authentic...). • Changes in the view of the purposes/uses of testing/assessment in education -> a broader/more responsive view (testing/assessment of, in, for education/learning) • The stakeholders have become more varied and their role is better acknowledged -> increased transparency, co-operation • Growing awareness of the impacts of testing/assessment -> standards, codes of ethics/good practice, international associations (ILTA, ALTE, EALTA...), washback, power, big business... • Relationship between education and economic development: national assessments, international comparative assessments (new methodological challenges...) • Technology and its impact on testing/assessment
Some FAQs • How many? – texts, tasks, items, raters... • How difficult can/should tasks/items be? • How well should tasks/items discriminate? • How reliable should (sub)tests be? • How high should agreement be among raters/judges? • How many different test/assessment formats should be used? • What I have learned: No simple answers!! Some sensible guidelines...
Language testing and assessment is carried out in many contexts and for many purposes → There is a variety of forms/formats of testing and assessment. • There is no one best test format and there is hardly any inherently bad/wrong format, either. • Language testing cultures differ, and what is acceptable/normal in one testing culture may be a taboo in another. Fashions also influence practice. • The challenge is to make the format fit the purpose (cf. Gilbert & Sullivan, The Mikado). • Thus, the purpose is where everything starts.
[Diagram] There are many factors in test development: examination framework, CEFR, legal framework, test theory, applied linguistics, testing know-how → test specifications, test construction, training of item writers, raters and interviewers, monitoring, feedback, research, test administration and rating of performances.
More questions... • So, why test/assess (at all)? Why not just teach and study? Functions of evaluation (of, in, for). • Should good testing/assessment simply be the same as good teaching? Can they be basically identical? Do they have to fulfil the same criteria? • How far should teaching determine testing/assessment, and how strongly should the assumed washback effect do so? • What are reasonable criteria of good testing/assessment? • Good questions? Do we have any good answers beyond ”It depends”?
A brief and non-technical definition of criteria of good practice in testing and assessment: all testing/assessment • should give the test taker a good opportunity to show his/her language competence (this opportunity should be fair and perceived to be fair) • should produce reliable information about real (de facto) language competence so that the interpretations and decisions arrived at are appropriate and warranted • should be practical, effective, economical, useful and beneficial
A simple four-phase model of testing/assessment (1) • Pre-response phase and its main tasks/activities: What? • Response phase: How? • Post-response phase: How good (are the results)? • Reflective phase: How did it work? How can it be improved?
A simple four-phase model (2) • Are the phases of equal importance? • I argue they are not. • The first phase, the quality of pre-response preparatory work, is the most important. It is the input to the subsequent phases. GIGO effect. Know-how counts! • No statistical wizardry can turn a bad item/task into a good one. It's too late! • Pilot testing/pretesting is not always possible. → All the more urgent: quality assurance from the start. Thorough ”stimulus analysis”.
First Phase: pre-response tasks • planning • specifications • item/task review • Are we reasonably happy with what we will ask learners/users to respond to?
A simple four-phase model (3): Item/Task Review • Cross-check drafts across languages: have all test developers who know a language sufficiently well read and comment on the drafts (English, French, German, Spanish...) -> comparability • Ask item writers and reviewers to discuss items using an agreed checklist covering, e.g.: • The difficulty of the passage on which the item is based • The clarity of the item (lack of clarity may lead to an irrelevant impact on difficulty); a rough scale could be: very clear, not fully clear • The number of plausible options: 4, 3, 2, 1 (MC) • The amount of text that the correct option is based on: 1-2 sentences/lines, 3-5 sentences/lines, more than 5 sentences/lines • The amount of inference needed: very little, some, considerable • Such analysis could lead to useful discussions (a minimal sketch of such a review record follows below).
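Purely as an illustration of how such a checklist might be captured in practice, the sketch below turns the dimensions listed above into a small review record. The field names, scales and example values are my own assumptions, not part of the original checklist.

```python
# Hypothetical sketch of a structured item-review record based on the
# checklist dimensions named above. Names and scales are illustrative
# assumptions, not part of the original checklist.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ItemReview:
    item_id: str
    passage_difficulty: str      # reviewer's judgement, e.g. "easy" / "intermediate" / "hard"
    clarity: str                 # rough scale from the slide: "very clear" / "not fully clear"
    plausible_options: int       # MC: how many of the options are plausible (1-4)
    evidence_span: str           # "1-2 lines" / "3-5 lines" / "more than 5 lines"
    inference_needed: str        # "very little" / "some" / "considerable"
    comments: List[str] = field(default_factory=list)

# Example use in a cross-language review meeting (made-up item and comments):
review = ItemReview(
    item_id="READ-EN-014",
    passage_difficulty="intermediate",
    clarity="not fully clear",
    plausible_options=3,
    evidence_span="3-5 lines",
    inference_needed="some",
    comments=["Option B overlaps with C; reword before piloting."],
)
print(review)
```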
Phase 1 - Item Review Aims to: Provide information (based on experts' judgments) about test items/tasks in regard to their: • Content validity • Anticipated level of difficulty • Fairness • Technical quality F. Kaftandjieva
Item Review: Fairness • 1. Does the item contain any information that could be seen as offensive to any specific group? • 2. Does the item include or imply any stereotypic depiction of any group? • 3. Does the item portray any group as degraded in any way? • 4. Does the item contain clues or information that could be seen to work to the benefit or detriment of any group? • 5. Does the item contain any group-specific language or vocabulary (e.g., culture-related expressions, slang, or expressions that may be unfamiliar to examinees of either sex or of a particular age)? • GROUPS: gender, socio-economic, racial, regional, cultural, religious, ethnic, handicapped, age, other F. Kaftandjieva
Item Review: Technical Quality • Does the item/task conform to the specifications in content and format? • Are the directions clear, concise and complete? • Is the item/task clear? • Does the item/task conform to standard grammar and usage? • Is the item/task independent of other items? • Is the item/task free of unintended clues to the correct answer? • Is the item/task free of tricky expressions or options? • Is the item/task free of extraneous or confusing material? • Is the item/task free of other obvious flaws in construction? • Is the format unusual or complicated such that it interferes with students' ability to answer the item correctly? • Is a student's prior knowledge other than of the subject area being tested necessary to answer the item/task? • Is the item/task content inaccurate or factually incorrect? F. Kaftandjieva
A simple four-phase model (4) • What are the functions of the subsequent phases? • Response phase: elicit/provide good, sufficient and fair samples of language performance • Post-response phase: score/rate performances; sufficient agreement/reliability; sufficient statistical analyses • Reflective phase: should not be neglected; vital for the development of know-how
A simple four-phase model (5) • How to develop all-round know-how: • One needs to be well read in the relevant literature (basic references, journals). • One needs to have adequate basic statistical knowledge. • One needs to be involved and interested in all phases. • One needs continuous and concrete feedback on one's contributions. • In sum: one needs (1) an adequate theoretical foundation, (2) solid practical evidence-based experience (feedback), and (3) reflection. Experience counts!
A simple four-phase model (6): Avoiding pitfalls – some lessons learned • Avoid asking questions which require a response of personal preferences, tastes, values (vulgar vs elegant; Felly: traffic item!) • If a task/item proves difficult to construct and still feels weak after repeated revisions, it is usually best to drop it, as it will usually not work. It is faster to write a new one! vs. Don't touch/abuse my items! • Beware of developing a narrow routine, using the same kind of approach (”personal fingerprint”).
Useful to know / be aware of... Balancing between opportunities and constraints
Traditional view of what the reliability of a test depends on: • 1) The length of the test: more evidence, more reliable. Increasing length is no certain guarantee, though; quality counts. • 2) How well the items/tasks discriminate: discrimination varies depending on item difficulty. Good discrimination vs. ”appropriate” difficulty? • 3) How homogeneous/heterogeneous the test takers are as a group: more variance -> higher reliability. (A minimal computational illustration follows below.)
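As a minimal illustration (my addition, not from the slides) of how these three factors enter a classical reliability estimate, the sketch below computes Cronbach's alpha, which for 0/1-scored items is equivalent to KR-20, from a toy response matrix.

```python
# A minimal sketch of a classical reliability estimate. For dichotomous items,
# Cronbach's alpha reduces to KR-20:
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: list of test takers, each a list of 0/1 item scores."""
    n_items = len(responses[0])
    item_vars = [pvariance([person[i] for person in responses]) for i in range(n_items)]
    total_var = pvariance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Toy data: 6 test takers, 5 items. More items, better-discriminating items,
# and a more heterogeneous group all push the estimate up.
data = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 2))  # 0.83 for this toy matrix
```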
Ebel (1965, 89): appropriate difficulty; the traditional approach • items of intermediate difficulty (p = 50%) discriminate well and enhance reliability • due to the possibility of guessing, the ideal p-value is: • 75% for true-false items • 67% for 3-alternative MC items • 62% for 4-alternative MC items • Ebel, Robert L. (1965/1979) Essentials of Educational Measurement. Englewood Cliffs, N.J.: Prentice-Hall. (excellent basic reference)
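The arithmetic behind these ideal values (my reconstruction; it is not spelled out on the slide) places the target mean difficulty halfway between the chance score for a k-option item and a perfect score:

```latex
% Ideal p-value for a k-option selected-response item, guessing taken into account:
\[
  p_{\text{ideal}} = \frac{1}{2}\left(\frac{1}{k} + 1\right)
\]
% k = 2 (true-false):   p_ideal = 0.75
% k = 3 (3-option MC):  p_ideal \approx 0.67
% k = 4 (4-option MC):  p_ideal \approx 0.62
```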
[Figure] Relationship between item difficulty and maximum discrimination power. One lesson learned: aim at well-discriminating items/tasks.
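The figure itself is not reproduced here. As a rough sketch of the relationship (assuming the classical upper-lower discrimination index D with equal-sized upper and lower groups, which is my assumption rather than something stated on the slide), the largest attainable D shrinks as item facility moves away from .50:

```python
# Hypothetical illustration: the maximum attainable upper-lower discrimination
# index D for an item of facility p, assuming equal-sized upper and lower groups.
# D is largest (1.0) at p = 0.5 and falls off toward very easy or very hard items.
def max_discrimination(p):
    """Upper-lower index D = p_upper - p_lower; at best, the whole upper half
    answers correctly before anyone in the lower half does."""
    return 2 * min(p, 1 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"facility p = {p:.1f}  ->  max D = {max_discrimination(p):.2f}")
```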
Test length is related to reliability. Measurement error due to items (+/-) is balanced better with more items, so results derived from longer tests can usually be relied on more than results from shorter tests. Spearman-Brown prediction formula, example: a 25-item test with a reliability of .70 -> 35 items .766, 43 items .80; reaching .90 would take roughly 96 items. Such additional items need to be similar to the original items, i.e. homogeneous complementation. No free or even cheap lunch! By the same token, a 43-item test with a reliability of .80 can be homogeneously reduced to a roughly 25-item test with a reliability of .70. (The risk of trying to economise.) A quick computational check of these figures follows below.
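The check below (my addition) simply applies the standard Spearman-Brown prediction formula to the numbers quoted above.

```python
# Spearman-Brown prediction: r_new = k*r / (1 + (k-1)*r), where k is the factor
# by which the test is lengthened (or shortened) with homogeneous items.
def spearman_brown(r, old_len, new_len):
    k = new_len / old_len
    return k * r / (1 + (k - 1) * r)

base_r, base_len = 0.70, 25
for n in (35, 43, 54, 96):
    print(f"{base_len} items (r={base_r}) -> {n} items: r = {spearman_brown(base_r, base_len, n):.3f}")

# Shortening works the same way: a 43-item test with r = .80 shrunk to 25 items.
print(f"43 items (r=0.80) -> 25 items: r = {spearman_brown(0.80, 43, 25):.3f}")
```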
[Figure] Relationship between reliability and the number of cut scores/performance groups. Source: Felianka Kaftandjieva, 2008 (2010)
The importance of good discrimination: example. If several cut scores are to be set, some well-discriminating (rather/very) easy and difficult items are needed.
The importance of being good • having good specifications to work on (a blueprint) • having an appropriate selection of tasks: reading, listening, speaking, writing, interaction: choice of topics, text types/genres • having an appropriate selection of cognitive operations (levels of processing)
The importance of being good: Some ways of doing this • having clear instructions for tasks • having an appropriate range of test/assessment formats • having relevant scoring criteria (prepared simultaneously with the tasks, not afterwards; revise if necessary); inform test takers of them in an appropriate manner • having good competence in scoring/rating (training, feedback – experience)
Scoring Criteria should be: • Easily understood • Relevant to the learning outcome • Compatible with other criteria used in the rubric • Precise • Representative of the vocabulary of the discipline • Observable, requiring minimal interpretation • Unique, not overlapping with another criterion or trait F. Kaftandjieva
[Figure] The 3R Formula of Item Quality. F. Kaftandjieva
Text coverage – readability/comprehensibility of texts. How do vocabulary size, text length and sample size influence the stability of estimating coverage? 23 samples from the British National Corpus and 26 texts of different length from the Time Almanac were estimated using 10 different sample sizes with 1000 iterations (means and standard deviations). Text coverage is more stable when vocabulary size is larger, text length is greater and several samples are used. Stability is also greater when several shorter texts are used rather than fewer longer texts. At intermediate-level proficiency (3000 words), one long text would require about 1750 words to reach sufficient stability, while the same result can be achieved with 4 texts of 250 words (1000 words) and with 9 texts of 50 words (450 words). -> It seems to pay to have several texts of varying length. (A rough sketch of the coverage calculation follows below.) Chujo, K. & Utiyama, M. (2005) Understanding the role of text length, sample size and vocabulary size in determining text coverage. Reading in a Foreign Language, Vol. 17, No. 1, April 2005 (online journal at http://nflrc.hawaii.edu/rfl)
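Purely to make the notion of "coverage" concrete (this is my own toy illustration, not the Chujo & Utiyama procedure or materials): coverage is the share of running words in a text that fall inside an assumed known vocabulary, e.g. the most frequent 3000 word families; the stability question is then how much this share varies across samples.

```python
# Illustrative only (my sketch, not the Chujo & Utiyama procedure): "coverage"
# here is the share of running words in a text that fall inside an assumed
# known vocabulary, e.g. the most frequent 3000 word families.
def coverage(tokens, known_vocabulary):
    """Proportion of tokens the reader is assumed to know."""
    cleaned = [t.lower().strip(".,;:!?") for t in tokens]
    return sum(t in known_vocabulary for t in cleaned) / len(cleaned)

# Toy example with a placeholder vocabulary; a real estimate would use a
# frequency list and lemmatised word families.
vocab = {"the", "a", "of", "on", "and", "is", "its", "text", "words", "to"}
sample = "The stability of a coverage estimate depends on the text and its words".split()
print(f"coverage = {coverage(sample, vocab):.2f}")   # 9 of 13 tokens known -> 0.69
```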
Writing the choices: Use as Many Distractors as Possible but Three Seems to be a Natural Limit A growing body of research supports the use of three options for conventional MC items (Andres & del Castillo, 1990; Bruno & Dirkzwager, 1995; Downing & Haladyna, 1997; Landrum, Cashin & Theis, 1993; Lord, 1977; Rodriguez, 1997; Sax & Michael, 1991; Trevisan, Sax & Michael, 1991, 1994). One issue is the way distractors perform with test-takers. A good distractor should be selected by low achievers and ignored by high achievers. As chapters 8 and 9 show, a science of option response validation exists and is expanding to include more graphical methods. To summarize this research on the correct number of options, evidence exists to suggest a slight advantage to having more options per test item, but only if each distractor is doing its job. Haladyna & Downing (1996) found that the number of useful distractors per item, on average, for a well-developed standardized test was between one and two. Another implication of this research is that three options may be a natural limit for most MC items. Thus, item writers are often frustrated in finding a useful fourth or fifth option because they do not exist. “The option of despair”
The advice given here is that one should write as many good distractors as one can, but should expect that only one or two will really work as intended. It does not matter how many distractors one produces for any given MC item but it does matter that each distractor performs as intended. This advice runs counter to most standardized testing programs. Customarily, answer sheets are used with a predetermined number of options, such as four or five. However, both theory and research support the use of one or two distractors, so the existence of nonperforming distractors is nothing more than window dressing. Thus, test developers have the dilemma of producing unnecessary distractors, which do not operate as they should, for the appearance of the test, versus producing tests with varying degrees of options. One criticism of using fewer instead of more options for an item is that guessing plays a greater role in determining a student’s score. The use of fewer distractors will increase the chances of a student guessing the right answer. However, the probability that a test-taker will increase his or her score significantly over a 20-, 50-, or 100-item test by pure guessing is infinitesimal. The floor of a test containing three options per item for a student who lacks knowledge and guesses randomly throughout the test is 33% correct. Therefore, administering more test items will reduce the influence of guessing on the total test score. This logic is sound for two-option items as well, because the floor of the scale is 50% and the probability of a student making 20, 50, or 100 successful randomly correct guesses is very close to zero. (pp. 89-90)
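To make the guessing argument concrete, the small check below (my addition, under the stated assumption of pure random guessing on independent items) shows how the probability of landing well above the 33% chance floor shrinks as a three-option test gets longer.

```python
# Probability of getting at least `threshold` items right by pure random guessing
# on an n-item test with k options per item (binomial upper tail).
from math import comb

def p_at_least(n_items, k_options, threshold):
    p = 1 / k_options
    return sum(comb(n_items, r) * p**r * (1 - p)**(n_items - r)
               for r in range(threshold, n_items + 1))

# Chance floor is ~33% on three-option items; how likely is reaching 50% correct?
for n in (20, 50, 100):
    needed = n // 2
    print(f"{n} items, 3 options: P(>= {needed} correct by guessing) = {p_at_least(n, 3, needed):.6f}")
```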
To summarize: There is NO item format that is appropriate for all purposes and on all occasions. F. Kaftandjieva
Instead of conclusions: Reflections
Reflect on: • what is the key requirement (fair/equal?) • what test developers do – ought to do • theory & practice (hen – egg) • comprehension – indicators of demonstrated competence (cf. Wittgenstein – inner processes) • writing – usually testing drafting competence? • optimization – satisficing (good practice; good enough; improvement) (Herbert Simon) • optimization – avoiding avoidable pitfalls • checklists, templates • keep everything as simple as possible • first-hand experience – evidence of having made progress