The Good, the Bad and the Ugly
F. Kaftandjieva
CEFR
Bad Practice, Good Practice
Terminology: Alignment, Anchoring, Calibration, Projection, Scaling, Comparability, Concordance, Linking, Benchmarking, Prediction, Equating, Moderation
Milestones in Comparability
• 1904 (Spearman): “The proof and measurement of association between two things”
• 1951 (Flanagan): “Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population.”
• 1971 (Angoff): ‘Scales, norms, and equivalent scores’: Equating, Calibration, Comparability
• 1992, 1993 (Mislevy, Linn): Linking
• 1997, 2001 (Webb, Porter): Alignment
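Flanagan's 1951 definition treats scores as comparable when they show identical distributions in a given population. One common way to act on that idea is equipercentile linking, where a score on one test is mapped to the score on the other test with the same percentile rank. The sketch below is a minimal illustration under that assumption and is not a procedure taken from the slides; the function names and the score lists are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(scores, x):
    """Share of the population (0..1) scoring at or below x."""
    ordered = sorted(scores)
    return bisect_right(ordered, x) / len(ordered)


def equipercentile_equivalent(scores_x, scores_y, x):
    """Lowest observed score on test Y whose percentile rank is at least
    the percentile rank of x on test X (same population assumed)."""
    target = percentile_rank(scores_x, x)
    ordered_y = sorted(scores_y)
    for y in ordered_y:
        if percentile_rank(ordered_y, y) >= target:
            return y
    return ordered_y[-1]


# Hypothetical scores of the same five examinees on two tests.
scores_x = [1, 2, 3, 4, 5]
scores_y = [10, 20, 30, 40, 50]
print(equipercentile_equivalent(scores_x, scores_y, 3))  # 30
```

With real data the two distributions are usually smoothed first; this sketch skips that step to keep the idea visible.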
Alignment
• Alignment refers to the degree of match between test content and the standards
• Dimensions of alignment: Content, Depth, Emphasis, Performance, Accessibility
Alignment
• Alignment is related to content validity
• Specification (Manual – Ch. 4)
  • “Specification … can be seen as a qualitative method. … There are also quantitative methods for content validation but this manual does not require their use.” (p. 2)
  • 24 pages of forms
  • Outcome: “A chart profiling coverage graphically in terms of levels and categories of CEF.” (p. 7)
• Crocker, L. et al. (1989). Quantitative Methods for Assessing the Fit Between Test and Curriculum. In: Applied Measurement in Education, 2 (2), 179-194.
• Why? How?
Alignment (Porter, 2004): 0.235 (www.ncrel.org)
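Assuming the 0.235 on this slide is a value of Porter's alignment index, the index compares how a test and a set of standards spread their emphasis over the same grid of content cells: it equals 1 minus half the sum of absolute differences between the two sets of cell proportions, so 1 means identical emphasis and 0 means no overlap. The sketch below is a minimal illustration; the function name and the cell counts are hypothetical and do not come from the slides.

```python
def alignment_index(test_cells, standards_cells):
    """Porter-style alignment index: 1 - sum(|x_i - y_i|) / 2,
    where x_i and y_i are the cell proportions of the two documents."""
    if len(test_cells) != len(standards_cells):
        raise ValueError("both documents must be coded on the same cells")
    test_total = sum(test_cells)
    standards_total = sum(standards_cells)
    if test_total <= 0 or standards_total <= 0:
        raise ValueError("each document needs at least one coded entry")
    total_difference = sum(
        abs(t / test_total - s / standards_total)
        for t, s in zip(test_cells, standards_cells)
    )
    return 1 - total_difference / 2


# Hypothetical counts of items (test) and objectives (standards) per cell.
test_cells = [2, 2, 0, 0]
standards_cells = [0, 2, 2, 0]
print(alignment_index(test_cells, standards_cells))  # 0.5
```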
Milestones in Comparability: Linking (Mislevy, Linn)
Mislevy & Linn: Linking Assessments (Equating, Linking)
The Good & The Bad in Calibration
Model – Data Fit: Reality, Models
Sample-Free Estimation
The ruler (θ scale)
• absolute zero, boiling water
• °F = 1.8 × °C + 32
• °C = (°F – 32) / 1.8
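The temperature formulas are a linear transformation between two scales that measure the same thing with a different origin and unit. The same kind of transformation can be recovered from any two reference points measured on both scales. The sketch below derives the Celsius-to-Fahrenheit slope and intercept from the freezing and boiling points of water and then checks absolute zero; the function name `linear_link` is hypothetical, and treating the θ scale the same way is an analogy, not a calibration procedure from the slides.

```python
def linear_link(anchor_a, anchor_b):
    """Return (slope, intercept, convert) for the line through two
    anchor points, each given as (value_on_scale_1, value_on_scale_2)."""
    (x1, y1), (x2, y2) = anchor_a, anchor_b
    if x1 == x2:
        raise ValueError("anchor points must differ on the first scale")
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1

    def convert(x):
        return slope * x + intercept

    return slope, intercept, convert


# Anchors: water freezes at 0 °C = 32 °F and boils at 100 °C = 212 °F.
slope, intercept, c_to_f = linear_link((0.0, 32.0), (100.0, 212.0))
print(slope, intercept)            # 1.8 32.0
print(round(c_to_f(-273.15), 2))   # -459.67 (absolute zero in °F)
```

Two anchors are enough only because the relation is assumed to be exactly linear; with noisy anchors, more points and a fitting step would be needed.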
Mislevy & Linn: Linking Assessments
Standard Setting
The Ugly
Fact 1:
• Human judgment is the epicenter of every standard-setting method (Berk, 1995)
When Ugliness turns to Beauty
Fact 2:
• The cut-off points on the latent continuum do not possess any objective reality outside and independently of our minds. They are mental constructs, which can differ from person to person.
Consequently:
• Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them (Messick, 1994)
Defensibility: Claims vs. Evidence
Defensibility: Claims vs. Evidence (A2)
• National Standards: Understands manuals for devices used in their everyday life
• CEF – A2: Can understand simple instructions on equipment encountered in everyday life – such as a public telephone (p. 70)
Defensibility: Claims vs. Evidence
• Cambridge ESOL
• DIALANG
• Finnish Matriculation
• CIEP (TCF)
• CELI, Università per Stranieri di Perugia
• Goethe-Institut
• TestDaF Institut
• WBT (Zertifikat Deutsch)
75% of the institutions provide only claims about the CEF level of their items
Defensibility: Claims vs. Evidence
• Common Practice (Buckendahl et al., 2000)
  • External evaluation of the alignment of 12 tests by 2 publishers
  • Publisher reports:
    • No description of the exact procedure followed
    • Reports include only the match between items and standards
  • Evaluation study
    • At least 10 judges per test
  • Comparison results
    • % of agreement: 26% - 55%
    • Overestimation of the match by test publishers
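The "% of agreement" figure above can be read as the share of items for which the publisher's item-to-standard match coincides with the match made by independent judges. The sketch below shows that calculation under this reading; the function name, the standard labels and the data are hypothetical and are not taken from Buckendahl et al.

```python
def percent_agreement(publisher_matches, judge_matches):
    """Percentage of items assigned to the same standard by both sources."""
    if len(publisher_matches) != len(judge_matches):
        raise ValueError("both sources must rate the same items")
    if not publisher_matches:
        raise ValueError("at least one item is required")
    agreements = sum(
        p == j for p, j in zip(publisher_matches, judge_matches)
    )
    return 100 * agreements / len(publisher_matches)


# Hypothetical standard assigned to each of eight items.
publisher = ["S1", "S1", "S2", "S2", "S3", "S3", "S4", "S2"]
judges = ["S1", "S2", "S2", "S1", "S3", "S2", "S3", "S2"]
print(percent_agreement(publisher, judges))  # 50.0
```

Raw percent agreement does not correct for agreement expected by chance, so it is best reported together with a description of how the judges worked.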
Standards for Educational and Psychological Testing, 1999
Standard 1.7:
• When a validation rests in part on the opinion or decisions of expert judges, observers or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.
Evaluation Criteria
• Hambleton, R. (2001). Setting Performance Standards on Educational Assessments and Criteria for Evaluating the Process. In: Setting Performance Standards: Concepts, Methods and Perspectives, ed. by Cizek, G., Lawrence Erlbaum Ass., 89-116.
• A list of 20 questions as evaluation criteria
  • Planning & Documentation: 4 (20%)
  • Judgments: 11 (55%)
  • Standard Setting Method: 5 (25%)
Judges
• Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards. (Messick, 1994)
Selection of Judges
• The judges should have the right qualifications, but other criteria, such as occupation, working experience, age and sex, may also be taken into account, because ‘… although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable’ (Maurer & Alexander, 1992).
Number of Judges
• Livingston & Zieky (1982) suggest that the number of judges should be not less than 5.
• Based on court cases in the USA, Biddle (1993) recommends that 7 to 10 Subject Matter Experts be used in the Judgement Session.
• As a general rule, Hurtz & Hertz (1999) recommend that 10 to 15 raters be sampled.
• 10 judges is the minimum number, according to the Manual (p. 94).
Training Session
• The weakest point
• How much? Until it hurts (Berk, 1995)
• Main focus: intra-judge consistency
• Evaluation forms (Hambleton, 2001)
• Feedback?
Training Session: Feedback Form
Standard Setting Method
• Good Practice
  • The most appropriate
  • Due diligence
  • Field tested
  • Reality check
  • Validity evidence
  • More than one
Standard Setting Method
• Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions. (Berk, 1995)
He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)
• B1/B2: Examinee-centered methods, Test-centered methods
Instead of Conclusion
• In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences. (Messick, 1994)
• Butterfly Effect: Change one thing, change everything!