2. Six degrees of integration: an agenda for joined-up assessment
Dylan Wiliam – www.dylanwiliam.net
Annual Conference of the Chartered Institute of Educational Assessors, London: 23 April 2008
Much of the debate about the improvement of systems of educational assessment focuses on binaries. Is reliability more important than validity? Are constructed-response items better than multiple-choice items? Is teacher assessment better than externally-set tests and examinations? Is continuous assessment through coursework better than terminal examinations? In this talk, I will argue that as long as the debate is conducted in terms of such either/or issues, progress will be slow, if not entirely absent. Rather, progress is to be made by mapping out the shades of grey between these extremes: each end of a spectrum is useful in helping us understand that spectrum and the tensions we have to reconcile, but lethal as a goal in itself.
3. Overview Six degrees of integration
Function
Formative versus summative
Quality
Validity versus reliability
Format
Multiple-choice versus constructed response
Scope
Continuous versus one-off
Authority
Teacher-produced versus expert-produced
Locus
School-based versus externally marked
4. Function · Quality · Format · Scope · Authority · Locus
5. A statement of the blindingly obvious You can’t work out how good something is until you know what it’s intended to do…
Function, then quality
6. Formative and summative Descriptions of
Instruments
Purposes
Functions
7. Gresham’s law and assessment Usually (incorrectly) stated as “Bad money drives out good”
“The essential condition for Gresham's Law to operate is that there must be two (or more) kinds of money which are of equivalent value for some purposes and of different value for others” (Mundell, 1998)
The parallel for assessment: Summative drives out formative
The most that summative assessment (more properly, assessment designed to serve a summative function) can do is keep out of the way
8. Function · Quality · Format · Scope · Authority · Locus
9. Reliability Reliability is a measure of the stability of assessment outcomes under changes in things that (we think) shouldn’t make a difference, such as
marker/rater
occasion
item selection
10. Test length and reliability
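Not on the slide itself, but the standard relationship this slide's graph illustrates: classical reliability is the proportion of observed-score variance attributable to true scores, and the Spearman–Brown formula predicts how reliability grows as a test is lengthened with comparable items.

```latex
% Classical definition of reliability:
\rho = \frac{\sigma^2_{\mathrm{true}}}{\sigma^2_{\mathrm{observed}}}

% Spearman–Brown: lengthening a test by a factor k predicts
\rho_k = \frac{k\,\rho}{1 + (k-1)\,\rho}
```

For example, doubling a one-hour test with reliability 0.75 predicts a reliability of 1.5/1.75 ≈ 0.86: reliability rises with test length, but with sharply diminishing returns.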
11. Reliability is not what we really want Take a test which is known to have a reliability of around 0.90 for a particular group of students.
Administer the test to the group of students and score it
Give each student a random script rather than their own
Record the scores assigned to each student
What is the reliability of the scores assigned in this way?
0.10
0.30
0.50
0.70
0.90
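On one standard reading of "reliability" (internal consistency), the quiz above can be settled by a mathematical observation: coefficients such as Cronbach's alpha are computed from the collection of scripts, and so are unchanged when the scripts are reassigned among students. A minimal simulation sketch (mine, not from the slides; the numbers are illustrative assumptions, and it gives the game away, so answer the question first):

```python
import numpy as np

rng = np.random.default_rng(42)

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_students, n_items) score matrix."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1 - item_vars / total_var)

# Simulate item scores: true ability plus item-level noise,
# tuned so alpha comes out around 0.90.
n_students, n_items = 500, 40
ability = rng.normal(size=(n_students, 1))
scores = ability + rng.normal(scale=2.1, size=(n_students, n_items))

print(f"alpha, own scripts:      {cronbach_alpha(scores):.3f}")

# Hand every student a random script: permute whole rows.
# Alpha is identical, because it depends only on the set of scripts,
# not on which student each script is attributed to.
shuffled = scores[rng.permutation(n_students)]
print(f"alpha, shuffled scripts: {cronbach_alpha(shuffled):.3f}")
```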
12. Reliability v consistency Classical measures of reliability
are meaningful only for groups
are designed for continuous measures
Marks versus grades
Scores suffer from spurious accuracy
Grades suffer from spurious precision
Classification consistency
A more technically appropriate measure of the reliability of assessment
Closer to the intuitive meaning of reliability
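A minimal simulation sketch (mine, not from the talk) of why classification consistency is closer to what we care about: a test whose scores correlate 0.90 across parallel forms can still assign a substantial fraction of students different grades on the two forms. The cut points and reliability here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
reliability = 0.90
true = rng.normal(size=n)
# Choose noise so that Var(true) / Var(observed) = reliability.
noise_sd = np.sqrt(1 / reliability - 1)

# Two parallel administrations of the same test.
form_a = true + rng.normal(scale=noise_sd, size=n)
form_b = true + rng.normal(scale=noise_sd, size=n)

print(f"score correlation:          {np.corrcoef(form_a, form_b)[0, 1]:.3f}")

# Cut the score scale into five grade bands at fixed points.
cuts = [-1.5, -0.5, 0.5, 1.5]
grades_a = np.digitize(form_a, cuts)
grades_b = np.digitize(form_b, cuts)
print(f"classification consistency: {(grades_a == grades_b).mean():.3f}")
```

With these assumptions the score correlation comes out near 0.90 while the proportion of students given the same grade on both forms is far lower, in the region of two-thirds.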
13. Reliability & classification consistency Classification consistency of National Curriculum Assessment in England
14. Validity Traditional definition: a property of assessments
A test is valid to the extent that it assesses what it purports to assess
Key properties (content validity)
Relevance
Representativeness
Fallacies
Two tests with the same name assess the same thing
Two tests with different names assess different things
A test valid for one group is valid for all groups
15. Trinitarian doctrines of validity Content validity
Criterion-related validity
– Concurrent validity
– Predictive validity
Construct validity
16. Validity Validity is a property of inferences, not of assessments
“One validates, not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971; emphasis in original)
The phrase “A valid test” is therefore a category error (like “A happy rock”)
No such thing as a valid (or indeed invalid) assessment
No such thing as a biased assessment
Reliability is a pre-requisite for validity
Talking about “reliability and validity” is like talking about “swallows and birds”
Validity includes reliability
17. Modern conceptions of validity Validity subsumes all aspects of assessment quality
Reliability
Representativeness (content coverage)
Relevance
Predictiveness
But not impact (Popham: right concern, wrong concept)
18. Consequential validity? No such thing! As has been stressed several times already, it is not that adverse social consequences of test use render the use invalid, but, rather, that adverse social consequences should not be attributable to any source of test invalidity such as construct-irrelevant variance. If the adverse social consequences are empirically traceable to sources of test invalidity, then the validity of the test use is jeopardized. If the social consequences cannot be so traced—or if the validation process can discount sources of test invalidity as the likely determinants, or at least render them less plausible—then the validity of the test use is not overturned. Adverse social consequences associated with valid test interpretation and use may implicate the attributes validly assessed, to be sure, as they function under the existing social conditions of the applied setting, but they are not in themselves indicative of invalidity. (Messick, 1989, pp. 88–89)
19. Threats to validity Inadequate reliability
Construct-irrelevant variance
Differences in scores are caused, in part, by differences not relevant to the construct of interest
The assessment assesses things it shouldn’t
The assessment is “too big”
Construct under-representation
Differences in the construct are not reflected in scores
The assessment doesn’t assess things it should
The assessment is “too small”
With clear construct definition all of these are technical—not value—issues
But they interact strongly…
20. School effectiveness Do differences in exam results support inferences about school quality?
Key issues:
Value-added
Sensitivity to instruction
Learning is slower than generally assumed
Tests’ low sensitivity to instruction is exacerbated by test-construction procedures (items that most taught students answer correctly tend to be discarded as insufficiently discriminating)
Result: invalid attributions about the effects of schooling
21. Learning is hard and slow…
22. Why does this matter? In England, school-level effects account for only 7% of the variability in GCSE scores.
In terms of value-added, there is no statistically significant difference between the middle 80 percent of English secondary schools
Correlation between teacher quality and student progress is low:
Average cohort progress: 0.3 sd per year
Good teachers (+1 sd) produce 0.4 sd per year
Poor teachers (-1 sd) produce 0.2 sd per year
23. So… Although teacher quality is the single most important determinant of student progress…
…the effect is small compared to the accumulated achievement over the course of a learner’s education…
…so that inferences that school outcomes are indications of the contributions made by the school are unlikely to be valid.
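To make the scale concrete, some illustrative arithmetic using the figures from the previous slide (the 11-year span of compulsory schooling is an assumption for illustration, not from the talk):

```latex
% Extra progress from one year with a +1 sd teacher:
0.4 - 0.3 = 0.1\ \text{sd}

% Accumulated progress over roughly 11 years at the average rate:
0.3\ \text{sd/year} \times 11\ \text{years} \approx 3.3\ \text{sd}

% A single year's teacher effect as a share of the accumulated total:
0.1 / 3.3 \approx 3\%
```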
24. Function · Quality · Format · Scope · Authority · Locus
25. Item formats “No assessment technique has been rubbished quite like multiple choice, unless it be graphology” (Wood, 1991, p. 32)
Myths about multiple-choice items
They are biased against females
They assess only candidates’ ability to spot or guess
They test only lower-order skills
26. Comparing like with like… Constructed-response items
Can be improved through guidance to markers
Can be developed cheaply, but are expensive to score
For a one-hour year-cohort assessment in England
Development: £5 000
Scoring: £1 000 000
Multiple-choice items
Cannot be improved through guidance to markers
Are expensive to develop well, but cheap to score
For a one-hour year-cohort assessment in England
Development: £1 000 000?
Scoring: £5 000
27. Mathematics 1 What is the median for the following data set?
38 74 22 44 96 22 19 53
(a) 22
(b) 38 and 44
(c) 41
(d) 46
(e) 77
(f) This data set has no median
Q4-61-05 [sic: slide shows Q8-61-05]
Key: C (sorted, the data read 19 22 22 38 44 53 74 96, so the median is (38 + 44)/2 = 41)
(Mis)conceptions targeted: that the median has to be a number in the data set; that the median cannot be calculated for an even number of values
Distractor note: options (b) and (f) stick out as longer than the others and would not be included under standard item-writing rules, yet neither is correct
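A quick check of the key, in a short Python snippet (mine, not part of the slides):

```python
from statistics import median

# Median of an even-sized data set: sort, then average the two middle values.
data = [38, 74, 22, 44, 96, 22, 19, 53]
print(sorted(data))  # [19, 22, 22, 38, 44, 53, 74, 96]
print(median(data))  # (38 + 44) / 2 = 41.0
```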
28. Mathematics 2 What can you say about the means of the following two data sets?
Set 1: 10 12 13 15
Set 2: 10 12 13 15 0
(a) The two sets have the same mean.
(b) The two sets have different means.
(c) It depends on whether you choose to count the zero.
Q4-67-02
Key: B (Set 1 has mean 12.5; Set 2 has mean 10)
(Mis)conception targeted: that adding a zero to a data set does not change the mean – options (a) and (c) are variations of the same misconception
29. Mathematics 3
[The item itself was shown as an image and is not reproduced here; from the notes, students had to identify which of several labelled lines drawn on shapes are diagonals.]
Q4-52-03
Key: B, C, D, E
(Mis)conception targeted: that any line at 45 degrees to the horizontal is a diagonal. What is missed is also important – (c) and (e) could not be diagonals because of what is NOT selected
MORE THAN ONE CORRECT ANSWER – not the standard number of answer choices
The combination of responses is also important – (b) is different from (c), (d) and (e)
30. Science The ball sitting on the table is not moving. It is not moving because:
(a) no forces are pushing or pulling on the ball.
(b) gravity is pulling down, but the table is in the way.
(c) the table pushes up with the same force that gravity pulls down.
(d) gravity is holding it onto the table.
(e) there is a force inside the ball keeping it from rolling off the table.
31. Science 2 You look outside and notice a very gentle rain. Suddenly, it starts raining harder. What happened?
(a) A cloud bumped into the cloud that was only making a little rain.
(b) A bigger hole opened in the cloud, releasing more rain.
(c) A different cloud, with more rain, moved into the area.
(d) The wind started to push more water out of the clouds.
Q4-42-03
Key: C
(Mis)conceptions targeted: that rain occurs when clouds bang into each other; that rain comes from holes in the clouds
Answer (d) is not a misconception but plausible logic a student might derive from high winds being associated with low-pressure areas, and hence with rain – but it is not a causal relationship
The answers are stated as a student might say them
32. Science 3 Jenna put a glass of cold water outside on a warm day. After a while, she could see small droplets on the outside of the glass. Why was this?
(a) The air molecules around the glass condensed to form droplets of liquid
(b) The water vapor in the air near the cold glass condensed to form droplets of liquid water
(c) Water soaked through invisible holes in the glass to form droplets of water on the outside of the glass
(d) The cold glass causes oxygen in the air to become water
Key: B
33. Science 4 How could you increase the temperature of boiling water?
(a) Add more heat.
(b) Stir it constantly.
(c) Add more water.
(d) You can’t increase the temperature of boiling water.
34. Science 5 What can we do to preserve the ozone layer?
(a) Reduce the amount of carbon dioxide produced by cars and factories
(b) Reduce the greenhouse effect
(c) Stop cutting down the rainforests
(d) Limit the number of cars that can be used when the level of ozone is high
(e) Properly dispose of air-conditioners and fridges
35. English Where would be the best place to begin a new paragraph?
[The passage students were to mark up was shown as an image and is not reproduced here.]
36. English 2 In a piece of persuasive writing, which of these would be the best thesis statement?
(a) The typical TV show has 9 violent incidents
(b) There is a lot of violence on TV
(c) The amount of violence on TV should be reduced
(d) Some programs are more violent than others
(e) Violence is included in programs to boost ratings
(f) Violence on TV is interesting
(g) I don’t like the violence on TV
(h) The essay I am going to write is about violence on TV
37. History Why are historians concerned with bias when analyzing sources?
(a) People can never be trusted to tell the truth
(b) People deliberately leave out important details
(c) People are only able to provide meaningful information if they experienced an event firsthand
(d) People interpret the same event in different ways, according to their experience
(e) People are unaware of the motivations for their actions
(f) People get confused about sequences of events
38. Function · Quality · Format · Scope · Authority · Locus
39. The Lake Wobegon effect revisited
40. Effects of narrow assessment Incentives to teach to the test
Focus on some subjects at the expense of others
Focus on some aspects of a subject at the expense of others
Focus on some students at the expense of others (“bubble” students)
Consequences
Learning that is
Narrow
Shallow
Transient
41. Function · Quality · Format · Scope · Authority · Locus
42. Authority Reliability requires random sampling from the domain of interest
Increasing reliability requires increasing the size of the sample
Using teacher assessment in certification is attractive:
Increases reliability (increased test time)
Increases validity (addresses aspects of construct under-representation)
But problematic
Lack of trust (“Fox guarding the hen house”)
Problems of biased inferences (construct-irrelevant variance)
Can introduce new kinds of construct under-representation
43. Function · Quality · Format · Scope · Authority · Locus
44. Locus Using external markers to mark student assessments involves spending more money in order to deny teachers professional learning opportunities
Getting teachers involved in “common assessment”
Is not assessment for learning, nor formative assessment
But it is valuable, perhaps even essential, professional development
45. Final reflections
46. The challenge To design an assessment system that is:
Distributed
So that evidence collection is not undertaken entirely at the end
Synoptic
So that learning has to accumulate
Extensive
So that all important aspects are covered (breadth and depth)
Manageable
So that costs are proportionate to benefits
Trusted
So that stakeholders have faith in the outcomes
47. Constraints and affordances Beliefs about what constitutes learning;
Beliefs in the reliability and validity of the results of various tools;
A preference for and trust in numerical data, with bias towards a single number;
Trust in the judgments and integrity of the teaching profession;
Belief in the value of competition between students;
Belief in the value of competition between schools;
Belief that test results measure school effectiveness;
Fear of national economic decline and education’s role in this;
Belief that the key to schools’ effectiveness is strong top-down management.
48. The minimal take-aways… No such thing as a summative assessment
No such thing as a reliable test
No such thing as a valid test
No such thing as a biased test
“Validity including reliability”