2. Six degrees of integration: an agenda for joined-up assessment
Dylan Wiliam – www.dylanwiliam.net
Annual Conference of the Chartered Institute of Educational Assessors, London: 23 April 2008
Much of the debate about the improvement of systems of educational assessment focuses on binaries. Is reliability more important than validity? Are constructed-response items better than multiple-choice items? Is teacher assessment better than externally-set tests and examinations? Is continuous assessment through coursework better than terminal examinations? In this talk, I will argue that as long as the debate is conducted in terms of such either/or issues, progress will be slow, if not entirely absent. Rather, progress is to be made by mapping out the shades of grey between these extremes: each end of a spectrum is useful in helping us understand that spectrum and the tensions we have to reconcile, but lethal as a goal in itself.
3. Overview Six degrees of integration
Function
Formative versus summative
Quality
Validity versus reliability
Format
Multiple-choice versus constructed response
Scope
Continuous versus one-off
Authority
Teacher-produced versus expert-produced
Locus
School-based versus externally marked
4. Function · Quality · Format · Scope · Authority · Locus
5. A statement of the blindingly obvious You can’t work out how good something is until you know what it’s intended to do…
Function, then quality
6. Formative and summative Descriptions of
Instruments
Purposes
Functions
7. Gresham’s law and assessment Usually (incorrectly) stated as “Bad money drives out good”
“The essential condition for Gresham's Law to operate is that there must be two (or more) kinds of money which are of equivalent value for some purposes and of different value for others” (Mundell, 1998)
The parallel for assessment: Summative drives out formative
The most that summative assessment (more properly, assessment designed to serve a summative function) can do is keep out of the way
8. Function · Quality · Format · Scope · Authority · Locus
9. Reliability Reliability is a measure of the stability of assessment outcomes under changes in things that (we think) shouldn’t make a difference, such as
marker/rater
occasion
item selection
10. Test length and reliability
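Not on the slide itself, but the standard relationship this slide's graph illustrates: classical reliability is the proportion of observed-score variance attributable to true scores, and the Spearman–Brown formula predicts how reliability grows as a test is lengthened with comparable items.

```latex
% Classical definition of reliability:
\rho = \frac{\sigma^2_{\mathrm{true}}}{\sigma^2_{\mathrm{observed}}}

% Spearman–Brown: lengthening a test by a factor k predicts
\rho_k = \frac{k\,\rho}{1 + (k-1)\,\rho}
```

For example, doubling a one-hour test with reliability 0.75 predicts a reliability of 1.5/1.75 ≈ 0.86: reliability rises with test length, but with sharply diminishing returns.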
11. Reliability is not what we really want Take a test which is known to have a reliability of around 0.90 for a particular group of students.
Administer the test to the group of students and score it
Give each student a random script rather than their own
Record the scores assigned to each student
What is the reliability of the scores assigned in this way?
0.10
0.30
0.50
0.70
0.90
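On one standard reading of "reliability" (internal consistency), the quiz above can be settled by a mathematical observation: coefficients such as Cronbach's alpha are computed from the collection of scripts, and so are unchanged when the scripts are reassigned among students. A minimal simulation sketch (mine, not from the slides; the numbers are illustrative assumptions, and it gives the game away, so answer the question first):

```python
import numpy as np

rng = np.random.default_rng(42)

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_students, n_items) score matrix."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1 - item_vars / total_var)

# Simulate item scores: true ability plus item-level noise,
# tuned so alpha comes out around 0.90.
n_students, n_items = 500, 40
ability = rng.normal(size=(n_students, 1))
scores = ability + rng.normal(scale=2.1, size=(n_students, n_items))

print(f"alpha, own scripts:      {cronbach_alpha(scores):.3f}")

# Hand every student a random script: permute whole rows.
# Alpha is identical, because it depends only on the set of scripts,
# not on which student each script is attributed to.
shuffled = scores[rng.permutation(n_students)]
print(f"alpha, shuffled scripts: {cronbach_alpha(shuffled):.3f}")
```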
12. Reliability v consistency Classical measures of reliability
are meaningful only for groups
are designed for continuous measures
Marks versus grades
Scores suffer from spurious accuracy
Grades suffer from spurious precision
Classification consistency
A more technically appropriate measure of the reliability of assessment
Closer to the intuitive meaning of reliability
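A minimal simulation sketch (mine, not from the talk) of why classification consistency is closer to what we care about: a test whose scores correlate 0.90 across parallel forms can still assign a substantial fraction of students different grades on the two forms. The cut points and reliability here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
reliability = 0.90
true = rng.normal(size=n)
# Choose noise so that Var(true) / Var(observed) = reliability.
noise_sd = np.sqrt(1 / reliability - 1)

# Two parallel administrations of the same test.
form_a = true + rng.normal(scale=noise_sd, size=n)
form_b = true + rng.normal(scale=noise_sd, size=n)

print(f"score correlation:          {np.corrcoef(form_a, form_b)[0, 1]:.3f}")

# Cut the score scale into five grade bands at fixed points.
cuts = [-1.5, -0.5, 0.5, 1.5]
grades_a = np.digitize(form_a, cuts)
grades_b = np.digitize(form_b, cuts)
print(f"classification consistency: {(grades_a == grades_b).mean():.3f}")
```

With these assumptions the score correlation comes out near 0.90 while the proportion of students given the same grade on both forms is far lower, in the region of two-thirds.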
13. Reliability & classification consistency Classification consistency of National Curriculum Assessment in England
14. Validity Traditional definition: a property of assessments
A test is valid to the extent that it assesses what it purports to assess
Key properties (content validity)
Relevance
Representativeness
Fallacies
Two tests with the same name assess the same thing
Two tests with different names assess different things
A test valid for one group is valid for all groups
15. Trinitarian doctrines of validity Content validity
Criterion-related validity
– Concurrent validity
– Predictive validity
Construct validity
16. Validity Validity is a property of inferences, not of assessments
“One validates, not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971; emphasis in original)
The phrase “A valid test” is therefore a category error (like “A happy rock”)
No such thing as a valid (or indeed invalid) assessment
No such thing as a biased assessment
Reliability is a pre-requisite for validity
Talking about “reliability and validity” is like talking about “swallows and birds”
Validity includes reliability
17. Modern conceptions of validity Validity subsumes all aspects of assessment quality
Reliability
Representativeness (content coverage)
Relevance
Predictiveness
But not impact (Popham: right concern, wrong concept)
18. Consequential validity? No such thing! As has been stressed several times already, it is not that adverse social consequences of test use render the use invalid, but, rather, that adverse social consequences should not be attributable to any source of test invalidity such as construct-irrelevant variance. If the adverse social consequences are empirically traceable to sources of test invalidity, then the validity of the test use is jeopardized. If the social consequences cannot be so traced—or if the validation process can discount sources of test invalidity as the likely determinants, or at least render them less plausible—then the validity of the test use is not overturned. Adverse social consequences associated with valid test interpretation and use may implicate the attributes validly assessed, to be sure, as they function under the existing social conditions of the applied setting, but they are not in themselves indicative of invalidity. (Messick, 1989, pp. 88–89)
19. Threats to validity Inadequate reliability
Construct-irrelevant variance
Differences in scores are caused, in part, by differences not relevant to the construct of interest
The assessment assesses things it shouldn’t
The assessment is “too big”
Construct under-representation
Differences in the construct are not reflected in scores
The assessment doesn’t assess things it should
The assessment is “too small”
With clear construct definition all of these are technical—not value—issues
But they interact strongly…
20. School effectiveness Do differences in exam results support inferences about school quality?
Key issues:
Value-added
Sensitivity to instruction
Learning is slower than generally assumed
Tests’ low sensitivity to instruction is exacerbated by test-construction procedures (items that most taught students answer correctly tend to be discarded as insufficiently discriminating)
Result: invalid attributions about the effects of schooling
21. Learning is hard and slow…
22. Why does this matter? In England, school-level effects account for only 7% of the variability in GCSE scores.
In terms of value-added, there is no statistically significant difference between the middle 80 percent of English secondary schools
Correlation between teacher quality and student progress is low:
Average cohort progress: 0.3 sd per year
Good teachers (+1 sd) produce 0.4 sd per year
Poor teachers (-1 sd) produce 0.2 sd per year
23. So… Although teacher quality is the single most important determinant of student progress…
…the effect is small compared to the accumulated achievement over the course of a learner’s education…
…so that inferences that school outcomes are indications of the contributions made by the school are unlikely to be valid.
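To make the scale concrete, some illustrative arithmetic using the figures from the previous slide (the 11-year span of compulsory schooling is an assumption for illustration, not from the talk):

```latex
% Extra progress from one year with a +1 sd teacher:
0.4 - 0.3 = 0.1\ \text{sd}

% Accumulated progress over roughly 11 years at the average rate:
0.3\ \text{sd/year} \times 11\ \text{years} \approx 3.3\ \text{sd}

% A single year's teacher effect as a share of the accumulated total:
0.1 / 3.3 \approx 3\%
```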
24. Function · Quality · Format · Scope · Authority · Locus
25. Item formats “No assessment technique has been rubbished quite like multiple choice, unless it be graphology” (Wood, 1991, p. 32)
Myths about multiple-choice items
They are biased against females
They assess only candidates’ ability to spot or guess
They test only lower-order skills
26. Comparing like with like… Constructed-response items
Can be improved through guidance to markers
Can be developed cheaply, but are expensive to score
For a one-hour year-cohort assessment in England
Development: £5 000
Scoring: £1 000 000
Multiple-choice items
Cannot be improved through guidance to markers
Are expensive to develop well, but cheap to score
For a one-hour year-cohort assessment in England
Development: £1 000 000?
Scoring: £5 000
27. Mathematics 1 What is the median for the following data set?
38 74 22 44 96 22 19 53
(a) 22
(b) 38 and 44
(c) 41
(d) 46
(e) 77
(f) This data set has no median
Q4-61-05 [sic: slide shows Q8-61-05]
Key: C (sorted, the data read 19 22 22 38 44 53 74 96, so the median is (38 + 44)/2 = 41)
(Mis)conceptions targeted: that the median has to be a number in the data set; that the median cannot be calculated for an even number of values
Distractor note: options (b) and (f) stick out as longer than the others and would not be included under standard item-writing rules, yet neither is correct
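A quick check of the key, in a short Python snippet (mine, not part of the slides):

```python
from statistics import median

# Median of an even-sized data set: sort, then average the two middle values.
data = [38, 74, 22, 44, 96, 22, 19, 53]
print(sorted(data))  # [19, 22, 22, 38, 44, 53, 74, 96]
print(median(data))  # (38 + 44) / 2 = 41.0
```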
28. Mathematics 2 What can you say about the means of the following two data sets?
Set 1: 10 12 13 15
Set 2: 10 12 13 15 0
(a) The two sets have the same mean.
(b) The two sets have different means.
(c) It depends on whether you choose to count the zero.
Q4-67-02
Key: B (Set 1 has mean 12.5; Set 2 has mean 10)
(Mis)conception targeted: that adding a zero to a data set does not change the mean – options (a) and (c) are variations of the same misconception
29. Mathematics 3
[The item itself was shown as an image and is not reproduced here; from the notes, students had to identify which of several labelled lines drawn on shapes are diagonals.]
Q4-52-03
Key: B, C, D, E
(Mis)conception targeted: that any line at 45 degrees to the horizontal is a diagonal. What is missed is also important – (c) and (e) could not be diagonals because of what is NOT selected
MORE THAN ONE CORRECT ANSWER – not the standard number of answer choices
The combination of responses is also important – (b) is different from (c), (d) and (e)
30. Science The ball sitting on the table is not moving. It is not moving because:
(a) no forces are pushing or pulling on the ball.
(b) gravity is pulling down, but the table is in the way.
(c) the table pushes up with the same force that gravity pulls down.
(d) gravity is holding it onto the table.
(e) there is a force inside the ball keeping it from rolling off the table.
31. Science 2 You look outside and notice a very gentle rain. Suddenly, it starts raining harder. What happened?
(a) A cloud bumped into the cloud that was only making a little rain.
(b) A bigger hole opened in the cloud, releasing more rain.
(c) A different cloud, with more rain, moved into the area.
(d) The wind started to push more water out of the clouds.
Q4-42-03
Key: C
(Mis)conceptions targeted: that rain occurs when clouds bang into each other; that rain comes from holes in the clouds
Answer (d) is not a misconception but plausible logic a student might derive from high winds being associated with low-pressure areas, and hence with rain – but it is not a causal relationship
The answers are stated as a student might say them
32. Science 3 Jenna put a glass of cold water outside on a warm day. After a while, she could see small droplets on the outside of the glass. Why was this?
(a) The air molecules around the glass condensed to form droplets of liquid
(b) The water vapor in the air near the cold glass condensed to form droplets of liquid water
(c) Water soaked through invisible holes in the glass to form droplets of water on the outside of the glass
(d) The cold glass causes oxygen in the air to become water
Key: B
33. Science 4 How could you increase the temperature of boiling water?
(a) Add more heat.
(b) Stir it constantly.
(c) Add more water.
(d) You can’t increase the temperature of boiling water.
34. Science 5 What can we do to preserve the ozone layer?
(a) Reduce the amount of carbon dioxide produced by cars and factories
(b) Reduce the greenhouse effect
(c) Stop cutting down the rainforests
(d) Limit the number of cars that can be used when the level of ozone is high
(e) Properly dispose of air-conditioners and fridges
35. English Where would be the best place to begin a new paragraph?
[The passage students were to mark up was shown as an image and is not reproduced here.]
36. English 2 In a piece of persuasive writing, which of these would be the best thesis statement?
(a) The typical TV show has 9 violent incidents
(b) There is a lot of violence on TV
(c) The amount of violence on TV should be reduced
(d) Some programs are more violent than others
(e) Violence is included in programs to boost ratings
(f) Violence on TV is interesting
(g) I don’t like the violence on TV
(h) The essay I am going to write is about violence on TV
37. History Why are historians concerned with bias when analyzing sources?
(a) People can never be trusted to tell the truth
(b) People deliberately leave out important details
(c) People are only able to provide meaningful information if they experienced an event firsthand
(d) People interpret the same event in different ways, according to their experience
(e) People are unaware of the motivations for their actions
(f) People get confused about sequences of events
38. Function · Quality · Format · Scope · Authority · Locus
39. The Lake Wobegon effect revisited
40. Effects of narrow assessment Incentives to teach to the test
Focus on some subjects at the expense of others
Focus on some aspects of a subject at the expense of others
Focus on some students at the expense of others (“bubble” students)
Consequences
Learning that is
Narrow
Shallow
Transient
41. Function · Quality · Format · Scope · Authority · Locus
42. Authority Reliability requires random sampling from the domain of interest
Increasing reliability requires increasing the size of the sample
Using teacher assessment in certification is attractive:
Increases reliability (increased test time)
Increases validity (addresses aspects of construct under-representation)
But problematic
Lack of trust (“Fox guarding the hen house”)
Problems of biased inferences (construct-irrelevant variance)
Can introduce new kinds of construct under-representation
43. Function · Quality · Format · Scope · Authority · Locus
44. Locus Using external markers to mark student assessments involves spending more money in order to deny teachers professional learning opportunities
Getting teachers involved in “common assessment”
Is not assessment for learning, nor formative assessment
But it is valuable, perhaps even essential, professional development
45. Final reflections
46. The challenge To design an assessment system that is:
Distributed
So that evidence collection is not undertaken entirely at the end
Synoptic
So that learning has to accumulate
Extensive
So that all important aspects are covered (breadth and depth)
Manageable
So that costs are proportionate to benefits
Trusted
So that stakeholders have faith in the outcomes
47. Constraints and affordances Beliefs about what constitutes learning;
Beliefs in the reliability and validity of the results of various tools;
A preference for and trust in numerical data, with bias towards a single number;
Trust in the judgments and integrity of the teaching profession;
Belief in the value of competition between students;
Belief in the value of competition between schools;
Belief that test results measure school effectiveness;
Fear of national economic decline and education’s role in this;
Belief that the key to schools’ effectiveness is strong top-down management.
48. The minimal take-aways… No such thing as a summative assessment
No such thing as a reliable test
No such thing as a valid test
No such thing as a biased test
“Validity including reliability”