1. Scaling and Equating Joe Willhoft, Assistant Superintendent of Assessment and Student Information; Yoonsun Lee, Director of Assessment and Psychometrics; Office of Superintendent of Public Instruction
2. Overview Scaling
Definition
Purposes
Equating
Definition
Purposes
Designs
Procedures
Vertical Scale
3. What is Scaling? Scaling is the process of associating numbers with the performance of examinees
What does 400 mean in WASL? It is not a raw score but a scaled score.
4. Primary Score Scale Many educational tests use one primary score scale for reporting scores
Raw scores, scaled scores, percentiles
WASL and WLPT-II use scaled scores
5. Activity
Grade 3 Mathematics Items
6. G3 Math Items
7. Why Use a Scaled Score? Minimizing misinterpretations
e.g. Emmy got 30 points last year and met the standard. I got 31 points this year but did not meet the standard. Why?
The cut score last year was 30 points and the cut score this year is 32 points. Did you raise the standard?
8. Why Use a Scaled Score? Facilitate meaningful interpretation
Comparison of examinees’ performance on different forms
Tracking of trends in group performance over time
Comparison of examinees’ performance on different difficulty levels of a test
9. Raw Score and Scaled Score Linearly (monotonically) related
Based on the Item Response Theory ability scale
Each observed performance corresponds to an ability value (theta)
Scaled score = a + b *(theta)
10. Linear Transformation Simple linear transformation:
Scaled Score= a + b*(ability)
Two parameters are used to describe that relationship: a and b.
We obtain some sample data and find the values of a and b that best fit the data to the linear regression model.
11. WASL 400 = a + b*(theta 1)
375 = a + b*(theta 2)
Theta 1 and theta 2 are established by the standard setting committees.
a and b are determined by solving the equations above.
12. WLPT-II Min Scaled Score = 300
Max Scaled Score = 900
300 = a + b*(theta 1)
900 = a + b*(theta 2)
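As a minimal sketch of how a and b fall out of the two equations on the WASL and WLPT-II slides, the snippet below solves the pair of linear equations directly. The theta values are hypothetical placeholders chosen only for illustration (the real ones come from the standard setting committees), and the function name `solve_linear_scale` is an assumption, not part of any WASL software.

```python
def solve_linear_scale(theta_1, scale_1, theta_2, scale_2):
    """Return (a, b) so that scale = a + b * theta passes through both points."""
    if theta_1 == theta_2:
        raise ValueError("the two theta values must differ")
    b = (scale_1 - scale_2) / (theta_1 - theta_2)  # slope
    a = scale_1 - b * theta_1                      # intercept
    return a, b


# WASL-style anchoring: 400 and 375 are fixed; the thetas are HYPOTHETICAL.
wasl_a, wasl_b = solve_linear_scale(theta_1=0.5, scale_1=400,
                                    theta_2=-0.5, scale_2=375)

# WLPT-II-style anchoring: 300 and 900 are fixed; the thetas are HYPOTHETICAL.
wlpt_a, wlpt_b = solve_linear_scale(theta_1=-4.0, scale_1=300,
                                    theta_2=4.0, scale_2=900)

print(f"WASL:    scaled = {wasl_a} + {wasl_b} * theta")
print(f"WLPT-II: scaled = {wlpt_a} + {wlpt_b} * theta")
```

With two anchor points the line is determined exactly, so no fitting is involved in this sketch.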
13. WASL Scaling 375 is the cut between level 1 and level 2 for all grade levels and content areas
400 is the cut between level 2 and level 3 for all grade levels and content areas.
Each grade/content has a separate scale (WASL)
All grade levels are on the same scale (WLPT-II) - vertically linked
14. WASL
15. WLPT-II (Vertical Scale)
16. Equating
17. Purpose of Equating Large scale testing programs use multiple forms of the same test
Differences in item and test difficulties across forms must be controlled
Equating is used to ensure that scale scores are equivalent across tests
18. Requirements of Equating Four necessary conditions for equating
(Lord, 1980):
Ability - Equated tests must measure the same construct (ability)
Equity – After transformation, the conditional frequencies for each test are the same
Population invariance
Symmetry
19. Ability - Equated Tests Must Measure the Same Construct (Ability) Item and test specifications are based on definitions of the abilities to be assessed
Item specifications define how the abilities are shown
Test specifications ensure representation of all aspects of the construct
Tests to be equated should measure the same abilities in the same ways
20. Equity Scales on the tests to be equated should be strictly parallel after equating
Frequency distributions should be roughly equivalent after transformation
21. Population Invariance The outcome of the transformation must be the same regardless of which group is used as the anchor
If score Y1 on Y is equated to score X1 on X, the result should be the same as if score X1 is equated to score Y1
If a score of 10 on 2007 Mathematics is equivalent to a score of 11 on 2006 Mathematics (when 2006 is used as the anchor), then a score of 11 on 2006 Mathematics should be equivalent to a score of 10 on 2007 Mathematics (when 2007 is used as the anchor)
22. Symmetry The function used to transform the Y scale to the X scale is the inverse of the function used to transform the X scale to the Y scale
If the 2007 Mathematics scale is equated to the 2006 Mathematics scale, the function used to do the equating should be the inverse of the function used when the 2006 Mathematics scale is equated to the 2007 Mathematics scale
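The symmetry requirement can be illustrated with a small sketch: a transformation from X to Y followed by its inverse should return the original score. The slope and intercept below are hypothetical, and a plain linear function is used only as a stand-in for whatever equating function is actually applied.

```python
def make_linear_transform(slope, intercept):
    """Build an X -> Y linear transformation and its exact inverse (Y -> X)."""
    if slope == 0:
        raise ValueError("slope must be non-zero for the inverse to exist")

    def x_to_y(x):
        return slope * x + intercept

    def y_to_x(y):
        return (y - intercept) / slope

    return x_to_y, y_to_x


# HYPOTHETICAL coefficients, for illustration only.
x_to_y, y_to_x = make_linear_transform(slope=1.25, intercept=-2.0)

for score in (10, 11, 20):
    round_trip = y_to_x(x_to_y(score))
    print(score, "->", x_to_y(score), "->", round_trip)
```

If the round trip did not recover the starting score, the two directions would not be inverses of each other and the symmetry condition would fail.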
23. Equating Design Used in WASL Common-Item Nonequivalent Groups Design (Kolen & Brennan, 1995)
A set of items in common (anchor items)
Different groups of examinees
(in different years)
24. Equating Method Item Response Theory Equating uses a transformation from one scale to the other
to make score scales comparable
to make item parameters comparable
25. Equating of WASL The items on a WASL test differ from year-to-year (within grade and content area)
Some items on the WASL have appeared in earlier forms of the test, and item calibrations (“b” difficulty/step values) were established. These are called “Anchor Items”.
Each year’s WASL is equated to the previous year’s scale using these anchor items.
26. Equating Procedure (1) Identify anchor item difficulties from the bank.
(2) Calibrate all items on the current test form without fixing anchor item difficulties.
(3) Calculate the mean of the anchor items using bank difficulties.
(4) Calculate the mean of the anchor items using calibrated difficulties from the current test.
(5) Add a constant to the current test difficulties so that the anchor mean equals the mean of the bank values.
27. Equating Procedure (6) For each anchor item, subtract the current difficulty (after adding the constant) from the bank difficulty.
(7) Drop the item with the largest absolute difference, if that difference is greater than 0.3, from consideration as an anchor item.
(8) Repeat steps 3-7 using the remaining anchor items.
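The procedure on the two preceding slides can be sketched roughly as follows. This is an illustration under stated assumptions, not the operational WASL code: calibration is assumed to have been done already by separate software, the item identifiers and difficulty values are invented, and the function name `equate_by_anchor_mean` is hypothetical.

```python
def equate_by_anchor_mean(bank, current, threshold=0.3):
    """Mean-shift equating with iterative removal of unstable anchor items.

    bank, current: dicts mapping anchor item id -> difficulty.
    Returns (constant, retained_anchor_ids).
    """
    anchors = sorted(set(bank) & set(current))
    if not anchors:
        raise ValueError("at least one common anchor item is required")

    while True:
        # Means of the anchor items on the bank scale and the current scale.
        mean_bank = sum(bank[i] for i in anchors) / len(anchors)
        mean_current = sum(current[i] for i in anchors) / len(anchors)

        # Constant that moves the current anchor mean onto the bank mean.
        constant = mean_bank - mean_current

        # Bank difficulty minus shifted current difficulty, per anchor item.
        diffs = {i: bank[i] - (current[i] + constant) for i in anchors}
        worst = max(anchors, key=lambda i: abs(diffs[i]))

        # Stop when no remaining anchor differs by more than the threshold.
        if abs(diffs[worst]) <= threshold:
            return constant, anchors

        # Otherwise drop the single worst item and recompute.
        anchors.remove(worst)


# HYPOTHETICAL difficulties, invented for illustration only.
bank = {"A1": -1.0, "A2": -0.5, "A3": 0.0, "A4": 0.5, "A5": 1.0}
current = {"A1": -0.8, "A2": -0.3, "A3": 0.2, "A4": 0.7, "A5": 2.0}

constant, kept = equate_by_anchor_mean(bank, current)
print("constant:", round(constant, 3), "anchors kept:", kept)
```

In this made-up data, item A5 drifts far more than the others, so it is dropped on the first pass and the constant is recomputed from the four remaining anchors. The final constant would then be added to every item difficulty on the current form, not only to the anchors.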
28. Equating Example
29. Equating Example
30. Equating Example
31. Transformed Scores: Raw-to-Theta-to-Scale Procedures Calibration software provides a Raw-to-Theta look-up table.
Theta-to-Scale Score transformation is applied (derived from theta at 3 cut-points from the Standard Setting committee):
theta(L2) → 375
theta(L3) → 400
theta(L4) → SS, obtained by evaluating SS = m*theta + b at theta(L4), with m and b derived from theta(L2) and theta(L3)
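A rough sketch of the raw-to-theta-to-scale chain is shown below. The look-up table and the three cut-point thetas are hypothetical values invented for illustration, and rounding scale scores to whole numbers is an assumption that the slides do not state.

```python
# HYPOTHETICAL raw-to-theta look-up table (normally produced by calibration software).
raw_to_theta = {0: -3.0, 1: -2.0, 2: -1.0, 3: 0.0, 4: 1.0, 5: 2.0}

# HYPOTHETICAL thetas at the three cut-points from standard setting.
theta_l2, theta_l3, theta_l4 = -1.0, 0.0, 1.5

# Slope and intercept of SS = m * theta + b, fixed by theta(L2) -> 375 and theta(L3) -> 400.
m = (400 - 375) / (theta_l3 - theta_l2)
b = 375 - m * theta_l2


def theta_to_scale(theta):
    """Apply the linear theta-to-scale-score transformation."""
    return m * theta + b


# The Level 4 cut on the scale-score metric follows from the same line.
scale_l4 = theta_to_scale(theta_l4)

# Final raw-to-scale table (rounding to integers is an ASSUMPTION).
raw_to_scale = {raw: round(theta_to_scale(t)) for raw, t in raw_to_theta.items()}

print("m =", m, "b =", b, "Level 4 cut =", scale_l4)
print(raw_to_scale)
```

Only the Level 2 and Level 3 cuts are pinned to fixed scale values; the Level 4 cut is whatever the line gives at theta(L4), which is why it can differ across grades and content areas.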
32. Transformed Scores Example
33. Theta-to-SS Transformations
34. Transformed Scores
35. How to Determine Cut Score (Until 2006) If 400 is an attainable scaled score, the cut score is 400
If 400 is not attainable, the nearest attainable score becomes the cut score
e.g.
- 397, 400, 402: 400 is the cut score
- 398, 401, 403: 401 is the cut score
- 399, 402, 405: 399 is the cut score
36. How to Determine Cut Score (2007) If 400 is an attainable scaled score, the cut score is 400
If 400 is not attainable, the next lowest attainable score becomes the cut score
e.g.
- 397, 400, 402: 400 is the cut score
- 398, 401, 403: 398 is the cut score
- 399, 402, 405: 399 is the cut score
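The two cut-score rules (until 2006 and 2007) can be compared side by side with a short sketch. The function names are hypothetical, and the behavior when two scores are equally close to 400 is an assumption, since the slides do not say how ties were resolved.

```python
def cut_score_nearest(scores, target=400):
    """Rule used until 2006: the target if attainable, else the nearest score.

    Tie handling is an ASSUMPTION: the first of two equally near scores wins.
    """
    if target in scores:
        return target
    return min(scores, key=lambda s: abs(s - target))


def cut_score_next_lowest(scores, target=400):
    """Rule used in 2007: the target if attainable, else the highest score below it."""
    if target in scores:
        return target
    below = [s for s in scores if s < target]
    if not below:
        raise ValueError("no attainable score below the target")
    return max(below)


# The three examples from the slides.
for attainable in ([397, 400, 402], [398, 401, 403], [399, 402, 405]):
    print(attainable,
          "until 2006:", cut_score_nearest(attainable),
          "| 2007:", cut_score_next_lowest(attainable))
```

The rules only disagree in the second example, where 401 is nearest to 400 but 398 is the next lowest attainable score.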
37. Vertical Scaling
38. Vertical Scale Places examinee performance across grade levels on a single scale
Measures individual student growth
Locates all items across grade levels on a single scale
Maps proficiency standards from different grade levels onto a single scale
39. Vertical Scaling vs. Equating Equating: allows scores on different test forms to be used interchangeably within a grade level
Vertical scaling:
Performance across all grade levels on the same scale
Measure students’ growth
Not equating
40. Data Collection Design Common item design
Common items between adjacent grade levels
Select items at an appropriate level for each grade
Equivalent group design
Same examinees
Take on-grade test or off-grade test (usually lower grade test)
41. Common Item Design (WASL)
42. Previous Vertical Linking Study Math in Grades 3, 4, and 5
Purpose of the study
How much are students growing over time?
What is the precision of these estimates?
43. Data The data consists of items used in the pilot test for Grades 3 and 5 in 2004 and 2005
Operational data for Grade 4 in 2005
44. Linking Design Items across all forms in three grades
Each form within grade includes a common block of items
Common item non-equivalent groups design
45. Common Item Design (WASL)
46. Item Review (Item Means)
47. Item Review
48. Results Comparing the p-values for the linking items across grades suggests some instability
Growth is larger from grades 3 to 4 than grades 4 to 5
Pilot data vs. operational data
Motivation factor (G4 to G5)
Backward Equating
49. Future Plan Vertical linking study will be conducted in January 2008 using the 2007 reading WASL.
The results will be presented next year.