Psychometric Aspects of Linking Tests to the CEF Norman Verhelst National Institute for Educational Measurement (Cito) Arnhem – The Netherlands
Overview • What is the problem • CEF • Psychometric problems • Internal validation • Reliability • Dimensionality • Linking • Test equating • Standard setting
Problem 1: What is the CEF?(descriptive) • Classification system of linguistic acts (in a foreign language (?)) • Basic building blocks: can do statements • Multifaceted • skills (listening, reading,... table 2, p.26-27) • qualitative aspects (range, coherence, table 3, p. 28) • contexts (personal, educational, table 5, p. 48)
Solution 1: What could we do if the CEF were nothing else than a descriptive system? • Determination problem: is a concrete performance (linguistic act) an exemplar of a can do statement? • We place a check mark next to each can do statement that is exemplified by the performance, thus forming a complicated high-dimensional profile • Probably we would encounter problems of consistency (e.g. being successful in only one of two exemplars of the same can do statement)
Problem 2: The CEF is also a hierarchic system • Three main classifications: Basic, Independent and Proficient (A, B and C) • Further subdivisions: A1, A2, B1, B2, C1 and C2 (cumulative and therefore ordered) • Implications • Language proficiency is measurable using this system • It implies a (rudimentary) theory of language acquisition
The Linking Problem • Devise a structured observation of linguistic acts (a test) • Assign a person to one of the levels A1,...,C2 • using his/her test performance • using ‘objective’ rules • in such a way that the assigned level ‘B1’ corresponds to the ‘real’ level B1 as defined in the CEF • The ‘Manual’ tells you how to proceed.
Internal validation • Restriction to itemized tests • ‘Universal simplification’: test performance is summarized by a single number, the test score • Typical result: a score (on my test) in the range 21 to 32 corresponds to level B1 • Why should one have confidence in my test?
Reliability • Every measurement contains an error • True score is the average score over a (huge) number of similar test administrations • True scores and measurement errors are not ‘knowable’ in particular cases • One can know something ‘in general’ • Basic notion: Reliability coefficient
Reliability coefficient (Rel) • Rel is the correlation between the scores on two parallel tests. It lies in [0,1] • Reliability is a characteristic of the test in some population • We do not compute Rel (in the population) but estimate it (in a sample): estimation error • Establishing the reliability with a single test administration is very hard (Cronbach’s α)
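As an illustration only: a minimal sketch of Cronbach’s α computed from a persons-by-items score matrix, the usual single-administration estimate mentioned above. The toy data are invented for the example.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the test score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Invented toy data: 5 persons, 4 dichotomous items
X = [[1, 1, 0, 1],
     [0, 1, 0, 0],
     [1, 1, 1, 1],
     [0, 0, 0, 1],
     [1, 0, 1, 1]]
print(cronbach_alpha(X))  # about 0.52 for this toy matrix
```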
Dimensionality • Can do at level B1 (the ‘black’ can do): Can recognise significant points in straightforward newspaper articles on familiar subjects • Can do at level B1 (the ‘blue’ can do): Can understand clearly written, straightforward instructions for a piece of equipment
Associated Problems • Is a test consisting of 20 exemplars (items) of the ‘black’ can do equivalent to a test consisting of 20 exemplars of the ‘blue’ can do? • If ‘Yes’: How do we know this? • If ‘No’: What is the justification of placing the two can do’s at the same level (B1)? • Maybe the score on the ‘blue test’ is partly determined by the trait ‘technical insight’. • The previous hypothesis can be tested empirically
Multidimensionality: techniques • If two tests measure the same concept, they will generally not correlate perfectly, because of measurement error. • Correction for attenuation: the estimated correlation between the true scores is r(T_X, T_Y) = r_XY / √(Rel_X · Rel_Y)
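In code, the correction is a one-liner; the observed correlation and reliability values below are invented for illustration.

```python
import math

def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Estimated correlation between the true scores of tests X and Y."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Invented example: two tests correlate .68; reliabilities are .80 and .75.
# The corrected value (about .88) suggests strongly overlapping true scores,
# but not a perfect correlation, i.e. not necessarily the same dimension.
print(disattenuated_correlation(0.68, 0.80, 0.75))
```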
Factor Analysis • Is the basic technique to reveal the dimensionality structure of a test battery • Has a lot of variants, some technically very complicated • The basic notions should be mastered by every scholar in language testing • Reference: Section F of the Reference Supplement to the Manual.
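For readers who want to experiment, here is a minimal exploratory sketch using scikit-learn’s FactorAnalysis on simulated data; the two-trait battery is an assumption made purely for illustration, not an analysis from the Manual.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulated battery (assumption for illustration): 200 persons, 6 subtests;
# subtests 0-2 load on one latent trait, subtests 3-5 on another.
trait1 = rng.normal(size=(200, 1))
trait2 = rng.normal(size=(200, 1))
noise = rng.normal(scale=0.5, size=(200, 6))
battery = np.hstack([trait1.repeat(3, axis=1),
                     trait2.repeat(3, axis=1)]) + noise

fa = FactorAnalysis(n_components=2).fit(battery)
# Loadings: rows are subtests, columns are factors; a clean two-block
# pattern would support a two-dimensional interpretation.
print(np.round(fa.components_.T, 2))
```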
Transition (1) • The concepts discussed so far refer to internal validation, but also to external validation • The blue test of the example should not turn out to be a test of technical insight • An informative factor analysis will therefore include tests (or subtests) other than pure language tests • Provisional conclusion: my test is professionally constructed and measures proficiency in the sense described by the CEF • The items are well devised exemplars of can do statements • There is a well defined balance across qualitative aspects deemed important in the CEF • In this sense, the test is linked to the CEF.
Transition (2) • But we want more • Assignment to a CEF-level, using only test scores, e.g., • less than 55: ‘level A2 or lower’ • 55 or more: ‘level B1 or higher’ • 73 or more: ‘level B2 or higher’ • By implication, a score in [55,72] assigns to B1 • 55 and 73 are called cutting points (or cut-off scores) • Two classes of techniques • Test Equating • Standard Setting
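A minimal sketch of how such cutting points turn a score into a level assignment, using the illustrative values 55 and 73 from this slide.

```python
def cef_level(score, cut_b1=55, cut_b2=73):
    """Map a test score to a CEF level using two cut-off scores
    (55 and 73 are the illustrative values from the slide)."""
    if score < cut_b1:
        return "A2 or lower"
    if score < cut_b2:
        return "B1"
    return "B2 or higher"

for s in (40, 55, 72, 73):
    print(s, "->", cef_level(s))
```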
Test Equating • To replace an ‘old’ test X by a ‘new’ test Y • Problem: find for each possible score on X an equivalent score on Y • Especially useful if X and Y are not equally difficult. • Many techniques, some very easy to apply • But...
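One of the easy-to-apply techniques is linear equating under an equivalent-groups design; the sketch below is only one option among the many methods alluded to, and the score samples are simulated.

```python
import numpy as np

def linear_equating(scores_x, scores_y):
    """Linear equating: map a score on test X to the score on test Y
    occupying the same standardized position."""
    mx, sx = np.mean(scores_x), np.std(scores_x, ddof=1)
    my, sy = np.mean(scores_y), np.std(scores_y, ddof=1)
    return lambda x: my + (sy / sx) * (x - mx)

# Simulated equivalent-groups data: the 'new' test Y is harder than X.
rng = np.random.default_rng(1)
x = rng.normal(60, 10, size=500)
y = rng.normal(55, 12, size=500)
to_y = linear_equating(x, y)
print(round(to_y(55), 1))  # a cut score of 55 on X, expressed on Y's scale
```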
Standard Setting • If there is no test to equate with, somebody has to decide whether the minimum score for a ‘B1’ assignment is 55 or 54 or 61 or whatever. • Who is this ‘somebody’ (one person or several)? • How do they decide? • Is their decision the ‘truth’? • Why would we trust such a decision?
Who is involved? • Not a single individual, but a group of persons (‘the larger the better’) • A group of experts, i.e. trained persons • They know the CEF • They recognize exemplars • A whole chapter of the Manual is devoted to this problem.
How to decide? • Many standard setting methods are documented in the literature (starting in 1954!) • See Section B of the Reference Supplement to the Manual. • Test centered • Examinee centered
Standard Setting in Dialang • Suppose the target level is B1 • For each item, ask the judges the following question: “Do you think a person at level B1 should be able to answer this item correctly?” • Count the number of items answered with ‘yes’. • Average these counts across judges and you get the standard
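A minimal sketch of this counting-and-averaging procedure; the judgement matrix is invented for illustration.

```python
import numpy as np

# Invented judgements, judges x items:
# True = "a person at level B1 should answer this item correctly".
judgements = np.array([
    [True, True,  False, True, False, True],
    [True, True,  True,  True, True,  False],
    [True, False, False, True, True,  True],
])

yes_counts = judgements.sum(axis=1)  # one expected B1 score per judge
standard = yes_counts.mean()         # average across judges = the standard
print(yes_counts, standard)          # [4 5 4] 4.33...
```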
Why would we trust such decisions? • Intra-judge consistency • ‘Easy’ items (yes), ‘hard’ items (no) • Inter-judge consistency • Do you agree ‘in general’ with your colleagues? • External validation • Maybe there is an external test • Overall judgment of the teacher
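Both consistency checks can be quantified in a few lines; the judgement matrix and the empirical item p-values below are invented for illustration.

```python
import numpy as np

# Invented data: judges x items (1 = 'yes') and empirical proportions correct.
judgements = np.array([[1, 1, 0, 1, 0, 1],
                       [1, 1, 1, 1, 1, 0],
                       [1, 0, 0, 1, 1, 1]])
p_values = np.array([0.85, 0.70, 0.40, 0.75, 0.35, 0.55])

# Intra-judge consistency: 'yes' should go mostly to empirically easy items,
# so each judge's row should correlate positively with the p-values.
for j, row in enumerate(judgements):
    print("judge", j, round(float(np.corrcoef(row, p_values)[0, 1]), 2))

# Inter-judge consistency: agreement between the judges' rows.
print(np.round(np.corrcoef(judgements), 2))
```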