
THE GOOD, THE BAD AND THE UGLY
F. Kaftandjieva


CEFR

Bad Practice, Good Practice

Terminology: Alignment, Anchoring, Calibration, Projection, Scaling, Comparability, Concordance, Linking, Benchmarking, Prediction, Equating, Moderation

Milestones in Comparability
• 1904 (Spearman): “The proof and measurement of association between two things”

Milestones in Comparability
• 1904 (Spearman)
• 1951 (Flanagan): “Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population.”

Milestones in Comparability
• 1904 (Spearman)
• 1951 (Flanagan)
• 1971 (Angoff): ‘Scales, norms, and equivalent scores’
  • Equating
  • Calibration
  • Comparability

Milestones in Comparability
• 1904 (Spearman)
• 1951 (Flanagan)
• 1971 (Angoff)
• 1992, 1993 (Mislevy, Linn): Linking

Milestones in Comparability
• 1904 (Spearman)
• 1951 (Flanagan)
• 1971 (Angoff)
• 1992, 1993 (Mislevy, Linn)
• 1997, 2001 (Webb, Porter): Alignment

Alignment
• Alignment refers to the degree of match between test content and the standards
• Dimensions of alignment
  • Content
  • Depth
  • Emphasis
  • Performance
  • Accessibility

Alignment
• Alignment is related to content validity
• Specification (Manual – Ch. 4)
  • “Specification … can be seen as a qualitative method. … There are also quantitative methods for content validation but this manual does not require their use.” (p. 2)
  • 24 pages of forms
  • Outcome: “A chart profiling coverage graphically in terms of levels and categories of CEF.” (p. 7)
• Crocker, L. et al. (1989). Quantitative Methods for Assessing the Fit Between Test and Curriculum. In: Applied Measurement in Education, 2 (2), 179-194.
Why? How?

Alignment (Porter, 2004), www.ncrel.org
Value shown: 0.235
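The slide gives only a single alignment value and a citation. As a rough sketch of how such a value can be computed, the block below uses the formula commonly cited for Porter's alignment index, 1 − Σ|x − y| / 2, where x and y are the proportions of content in each cell of the test matrix and the standards matrix. That the slide's 0.235 was obtained this way is an assumption, and the function name and the example proportions are invented for illustration.

```python
import math


def alignment_index(test_props, standards_props):
    """Porter-style alignment index: 1 - sum(|x - y|) / 2.

    Both arguments are flat lists of cell proportions (topic by cognitive
    demand), each summing to 1. The result ranges from 0 (no overlap)
    to 1 (identical content distributions).
    """
    if len(test_props) != len(standards_props):
        raise ValueError("both matrices must have the same cells")
    for props in (test_props, standards_props):
        if not math.isclose(sum(props), 1.0, abs_tol=1e-9):
            raise ValueError("cell proportions must sum to 1")
    total_diff = sum(abs(x - y) for x, y in zip(test_props, standards_props))
    return 1.0 - total_diff / 2.0


# Invented example: a test that over-samples the first content cell
# compared with standards that weight all four cells equally.
test_cells = [0.5, 0.3, 0.2, 0.0]
standards_cells = [0.25, 0.25, 0.25, 0.25]
index = alignment_index(test_cells, standards_cells)  # approximately 0.7
```

An index this far below 1 would mean that a sizeable share of the test content sits in cells the standards weight differently.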

Milestones in Comparability
• 1904 (Spearman)
• 1951 (Flanagan)
• 1971 (Angoff)
• 1992, 1993 (Mislevy, Linn): Linking
• 1997, 2001 (Webb, Porter)

Mislevy & Linn: Linking Assessments
Equating  Linking

The Good & The Bad in Calibration

Model – Data Fit

Model – Data Fit
Reality, Models

Sample-Free Estimation

The ruler (θ scale)

The ruler (θ scale)
absolute zero, boiling water

The ruler (θ scale)
F° = 1.8 * C° + 32
C° = (F° – 32) / 1.8
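The two temperature formulas describe one linear change of scale and its inverse. The sketch below writes them as functions and checks that converting there and back returns the starting value. The helper `linear_rescale` is hypothetical and only illustrates the analogy the slide appears to draw: a θ value, like a temperature, can be reported on another scale with a different origin and unit without changing what was measured.

```python
def celsius_to_fahrenheit(c):
    """Linear change of scale from the slide: F = 1.8 * C + 32."""
    return 1.8 * c + 32


def fahrenheit_to_celsius(f):
    """Inverse transformation from the slide: C = (F - 32) / 1.8."""
    return (f - 32) / 1.8


def linear_rescale(value, slope, intercept):
    """Hypothetical helper: report a value on another linear scale."""
    return slope * value + intercept


# Boiling water is the same physical point on both rulers.
boiling_f = celsius_to_fahrenheit(100.0)

# Converting there and back recovers the original reading
# (up to floating-point rounding).
round_trip = fahrenheit_to_celsius(celsius_to_fahrenheit(37.0))

# The temperature conversion is one instance of the generic helper.
same_as_helper = linear_rescale(100.0, slope=1.8, intercept=32)
```

The order of any two readings is the same on both scales, which is why the choice of origin and unit is a matter of convention.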

Mislevy & Linn: Linking Assessments

Standard Setting

The Ugly

Fact 1:
• Human judgment is the epicenter of every standard-setting method (Berk, 1995)

When Ugliness turns to Beauty

Fact 2:
• The cut-off points on the latent continuum do not possess any objective reality outside and independently of our minds. They are mental constructs, which can differ from person to person.

Consequently:
• Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them (Messick, 1994)

Defensibility: Evidence, Claims

Defensibility: Claims vs. Evidence (A2)
• National Standards: Understands manuals for devices used in their everyday life
• CEF – A2: Can understand simple instructions on equipment encountered in everyday life – such as a public telephone (p. 70)

Defensibility: Claims vs. Evidence
• Cambridge ESOL
• DIALANG
• Finnish Matriculation
• CIEP (TCF)
• CELI Università per Stranieri di Perugia
• Goethe-Institut
• TestDaF Institut
• WBT (Zertifikat Deutsch)
75% of the institutions provide only claims about their items' CEF level

Defensibility: Claims vs. Evidence
• Common Practice (Buckendahl et al., 2000)
• External evaluation of the alignment of
  • 12 tests by 2 publishers
• Publisher reports:
  • No description of the exact procedure followed
  • Reports include only the match between items and standards
• Evaluation study
  • At least 10 judges per test
• Comparison results
  • % of agreement: 26% - 55%
  • Overestimation of the match by test publishers
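The "% of agreement" figure on this slide can be read as the share of items for which the publisher and the external judges name the same standard. The sketch below computes that share under this reading. The function name, the standard codes and the data are invented for illustration, and the slide does not say how Buckendahl et al. combined the ratings of several judges, so a single consensus code per item is assumed.

```python
def percent_agreement(publisher_codes, judge_codes):
    """Share of items (in %) where both sources name the same standard."""
    if len(publisher_codes) != len(judge_codes):
        raise ValueError("both lists must cover the same items")
    if not publisher_codes:
        raise ValueError("at least one item is required")
    matches = sum(p == j for p, j in zip(publisher_codes, judge_codes))
    return 100.0 * matches / len(publisher_codes)


# Invented data: the standard each item is claimed to measure (publisher)
# and the standard assigned by the judges (assumed consensus rating).
publisher = ["S1", "S2", "S2", "S3", "S1"]
judges = ["S1", "S3", "S2", "S1", "S2"]

agreement = percent_agreement(publisher, judges)  # 2 of 5 items match: 40.0
```

Simple percent agreement does not correct for matches that would occur by chance, so it is best reported next to a description of the rating procedure.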

Standards for educational and psychological testing, 1999
Standard 1.7:
• When a validation rests in part on the opinion or decisions of expert judges, observers or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.

Evaluation Criteria
Hambleton, R. (2001). Setting Performance Standards on Educational Assessments and Criteria for Evaluating the Process. In: Setting Performance Standards: Concepts, Methods and Perspectives, Ed. by Cizek, G., Lawrence Erlbaum Ass., 89-116.
• A list of 20 questions as evaluation criteria
  • Planning & Documentation: 4 (20%)
  • Judgments: 11 (55%)
  • Standard Setting Method: 5 (25%)

Judges
• Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards. (Messick, 1994)

Selection of Judges
The judges should have
• the right qualifications, but
• some other criteria such as
  • occupation,
  • working experience,
  • age,
  • sex
may be taken into account, because ‘… although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable’ (Maurer & Alexander, 1992).

Number of Judges
• Livingston & Zieky (1982) suggest that the number of judges should be no fewer than 5.
• Based on court cases in the USA, Biddle (1993) recommends that 7 to 10 Subject Matter Experts be used in the Judgement Session.
• As a general rule, Hurtz & Hertz (1999) recommend sampling 10 to 15 raters.
• 10 judges is the minimum number, according to the Manual (p. 94).

Training Session
• The weakest point
• How much?
  • Until it hurts (Berk, 1995)
• Main focus
  • Intra-judge consistency
• Evaluation forms
  • Hambleton, 2001
• Feedback

Training Session: Feedback Form

Standard Setting Method
• Good Practice
  • The most appropriate
  • Due diligence
  • Field tested
  • Reality check
  • Validity evidence
  • More than one

Standard Setting Method
• Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions. (Berk, 1995)

He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)
B1/B2: Examinee-centered methods, Test-centered methods

He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)
B1/B2: Test-centered methods, Examinee-centered methods

Instead of Conclusion
• In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences. (Messick, 1994)
Butterfly Effect: Change one thing, change everything!
