Scaling and Equating Joe Willhoft Assistant Superintendent of Assessment and Student Information Yoonsun Lee Direc

1. Scaling and Equating. Joe Willhoft, Assistant Superintendent of Assessment and Student Information; Yoonsun Lee, Director of Assessment and Psychometrics; Office of Superintendent of Public Instruction

2. Overview Scaling Definition Purposes Equating Definition Purposes Designs Procedures Vertical Scale

3. What is Scaling? Scaling is the process of associating numbers with the performance of examinees What does 400 mean in WASL? It is not a raw score but a scaled score.

4. Primary Score Scale Many educational tests use one primary score scale for reporting scores Raw scores, scaled scores, percentile WASL and WLPT-II use scaled scores

5. Activity Grade 3 Mathematics Items

6. G3 Math Items

7. Why Use a Scaled Score? Minimizing misinterpretations, e.g.: Emmy got 30 points last year and met the standard. I got 31 points this year but did not meet the standard. Why? The cut score last year was 30 points and the cut score this year is 32 points. Did you raise the standard?

8. Why Use a Scaled Score? Facilitate meaningful interpretation Comparison of examinees' performance on different forms Tracking of trends in group performance over time Comparison of examinees' performance on different difficulty levels of a test

9. Raw Score and Scaled Score Linearly (monotonically) related Based on the Item Response Theory ability scale Each observed performance corresponds to an ability value (theta) Scaled score = a + b*(theta)

10. Linear Transformation Simple linear transformation: Scaled Score = a + b*(ability) Two parameters are used to describe that relationship: a and b. We obtain some sample data and find the values of a and b that best fit the data to the linear regression model.

11. WASL 400 = a + b*(theta 1) 375 = a + b*(theta 2) Theta 1 and theta 2 are established by the standard setting committees. a and b are determined by solving the equations above.
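Solving the two equations above for a and b can be sketched as follows. This is a minimal illustration, not the operational WASL computation: the function name and the two theta values are hypothetical, since the actual theta cut points set by the standard setting committees are not given here.

```python
def linear_scale_constants(theta_1, score_1, theta_2, score_2):
    """Solve score = a + b * theta through two (theta, score) points."""
    if theta_1 == theta_2:
        raise ValueError("the two theta values must differ")
    # Subtracting the two equations eliminates a and leaves the slope b.
    b = (score_1 - score_2) / (theta_1 - theta_2)
    # Substitute b back into either equation to get the intercept a.
    a = score_1 - b * theta_1
    return a, b


# Hypothetical theta cut points (assumptions, NOT actual WASL values).
theta_1 = 0.5    # theta that should map to a scaled score of 400
theta_2 = -0.5   # theta that should map to a scaled score of 375

a, b = linear_scale_constants(theta_1, 400.0, theta_2, 375.0)
print(a, b)
```

The same function would apply to any scale pinned at two points, such as the minimum and maximum scaled scores, only with different anchor values.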

12. WLPT-II Min Scaled Score = 300 Max Scaled Score = 900 300 = a + b*(theta 1) 900 = a + b*(theta 2)

13. WASL Scaling 375 is the cut between level 1 and level 2 for all grade levels and content areas 400 is the cut between level 2 and level 3 for all grade levels and content areas. Each grade/content has a separate scale (WASL) All grade levels are in the same scale (WLPT-II) - vertically linked

14. WASL

15. WLPT-II (Vertical Scale)

16. Equating

17. Purpose of Equating Large scale testing programs use multiple forms of the same test Differences in item and test difficulties across forms must be controlled Equating is used to ensure that scale scores are equivalent across tests

18. Requirements of Equating Four necessary conditions for equating (Lord, 1980): Ability: equated tests must measure the same construct (ability) Equity: after transformation, the conditional frequencies for each test are the same Population invariance Symmetry

19. Ability - Equated Tests Must Measure the Same Construct (Ability) Item and test specifications are based on definitions of the abilities to be assessed Item specifications define how the abilities are shown Test specifications ensure representation of all aspects of the construct Tests to be equated should measure the same abilities in the same ways

20. Equity Scales on the tests to be equated should be strictly parallel after equating Frequency distributions should be roughly equivalent after transformation

21. Population Invariance The outcome of the transformation must be the same regardless of which group is used as the anchor If score Y1 on Y is equated to score X1 on X, the result should be the same as if score X1 is equated to score Y1 If a score of 10 on 2007 Mathematics is equivalent to a score of 11 on 2006 Mathematics (when 2006 is used as the anchor), then a score of 11 on 2006 Mathematics should be equivalent to a score of 10 on 2007 Mathematics (when 2007 is used as the anchor)

22. Symmetry The function used to transform the Y scale to the X scale is the inverse of the function used to transform the X scale to the Y scale If the 2007 Mathematics scale is equated to the 2006 Mathematics scale, the function used to do the equating should be the inverse of the function used when the 2006 Mathematics scale is equated to the 2007 Mathematics scale

23. Equating Design Used in WASL Common-Item Nonequivalent Groups Design (Kolen & Brennan, 1995) A set of items in common (anchor items) Different groups of examinees (in different years)

24. Equating Method Item Response Theory equating uses a transformation from one scale to the other to make score scales comparable and to make item parameters comparable

25. Equating of WASL The items on a WASL test differ from year to year (within grade and content area). Some items on the WASL have appeared in earlier forms of the test, and item calibrations ("b" difficulty/step values) were established. These are called "Anchor Items". Each year's WASL is equated to the previous year's scale using these anchor items.

26. Equating Procedure (1) Identify anchor item difficulties from bank. (2) Calibrate all items on current test form without fixing anchor item difficulties. (3) Calculate mean of anchor items using bank difficulties. (4) Calculate mean of anchor items using calibrated difficulties from current test. (5) Add constant to current test difficulties so the mean equals mean from bank values.

27. Equating Procedure (6) For each anchor item, subtract current difficulty (after adding the constant) from the bank difficulty. (7) If the largest absolute difference is greater than 0.3, drop that item from consideration as an anchor item. (8) Repeat steps 3-7 using remaining anchor items.
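The anchor-item procedure on the two slides above can be sketched in a few lines. This is an illustrative reading of the steps, not the operational program: the function name, the item labels, and all difficulty values are hypothetical, and the loop is assumed to stop once no remaining anchor item differs by more than 0.3.

```python
from statistics import mean


def equate_by_anchor_mean_shift(bank, current, tolerance=0.3):
    """Mean-shift equating with iterative anchor screening.

    bank, current: dicts mapping anchor item id -> difficulty
    (bank values vs. freely calibrated values on the current form).
    Returns (constant, retained_anchor_ids).
    """
    anchors = sorted(set(bank) & set(current))
    while anchors:
        # Steps 3-5: constant that makes the current anchor mean equal the bank mean.
        constant = mean(bank[i] for i in anchors) - mean(current[i] for i in anchors)
        # Step 6: bank difficulty minus shifted current difficulty, per anchor item.
        diffs = {i: bank[i] - (current[i] + constant) for i in anchors}
        # Step 7: find the item with the largest absolute difference.
        worst = max(anchors, key=lambda i: abs(diffs[i]))
        if abs(diffs[worst]) <= tolerance:
            return constant, anchors
        # Step 8: drop that item and repeat with the remaining anchors.
        anchors.remove(worst)
    raise ValueError("no anchor items remain")


# Hypothetical difficulties (assumptions for illustration only).
bank = {"A": -1.0, "B": 0.0, "C": 1.0, "D": 0.5}
current = {"A": -1.2, "B": -0.2, "C": 0.8, "D": 1.3}

constant, kept = equate_by_anchor_mean_shift(bank, current)
print(constant, kept)
```

In this made-up data, item D drifts far more than the others, so it is dropped on the first pass and the constant is recomputed from A, B, and C. The final constant would then be added to the difficulties of every item on the current form.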

28. Equating Example



31. Transformed Scores: Raw-to-Theta-to-Scale Procedures Calibration software provides a Raw-to-Theta look-up table. Theta-to-Scale Score transformation is applied (derived from Theta at 3 cut-points from Standard Setting committee): θ(L2) → 375; θ(L3) → 400; θ(L4) → SS, obtained by solving for θ(L4) in SS = m*θ + b derived from θ(L2) and θ(L3)
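The raw-to-theta-to-scale chain can be illustrated with a toy look-up table. Everything numeric here is an assumption made for the sketch: the six-point raw score range, the theta values in the table, and the three theta cut points are invented, and real calibration software would supply the table.

```python
# Hypothetical raw-to-theta look-up table (assumed values, not WASL data).
raw_to_theta = {0: -3.0, 1: -1.5, 2: -0.5, 3: 0.5, 4: 1.5, 5: 3.0}

# Hypothetical theta cut points for Levels 2, 3, and 4.
theta_l2, theta_l3, theta_l4 = -0.5, 0.5, 1.5

# Slope and intercept of SS = m*theta + b, pinned at theta(L2) -> 375 and theta(L3) -> 400.
m = (400.0 - 375.0) / (theta_l3 - theta_l2)
b = 400.0 - m * theta_l3


def scale_score(raw):
    """Convert a raw score to a scale score via the theta look-up table."""
    return m * raw_to_theta[raw] + b


# The Level 4 scale-score cut follows from the same line; it is not fixed in advance.
level4_cut = m * theta_l4 + b

for raw in sorted(raw_to_theta):
    print(raw, raw_to_theta[raw], scale_score(raw))
print("Level 4 cut:", level4_cut)
```

Because only a finite set of raw scores exists, only a finite set of scale scores is attainable on a given form, which is why a score of exactly 400 may or may not appear in the resulting table.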

32. Transformed Scores Example

33. Theta-to-SS Transformations

34. Transformed Scores

35. How to Determine Cut Score (Until 2006) If there is 400, the cut score is 400 If 400 does not exist, the nearest score becomes the cut score e.g. - 397, 400, 402: 400 is the cut score - 398, 401, 403: 401 is the cut score - 399, 402, 405: 399 is the cut score

36. How to Determine Cut Score (2007) If there is 400, the cut score is 400 If 400 does not exist, the next lowest score becomes the cut score e.g. - 397, 400, 402: 400 is the cut score - 398, 401, 403: 398 is the cut score - 399, 402, 405: 399 is the cut score
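The two cut-score rules (until 2006 and 2007) can be compared side by side. This is a sketch with hypothetical function names; the tie-breaking choice in the nearest-score rule (the lower score wins when two scores are equally close to 400) is an assumption, since the slides do not cover that case.

```python
def cut_score_until_2006(attainable, target=400):
    """Use the target if attainable, otherwise the nearest attainable score."""
    if target in attainable:
        return target
    # Sorting first means the lower score wins a tie (an assumption).
    return min(sorted(attainable), key=lambda s: abs(s - target))


def cut_score_2007(attainable, target=400):
    """Use the target if attainable, otherwise the next lowest attainable score."""
    if target in attainable:
        return target
    lower = [s for s in attainable if s < target]
    if not lower:
        raise ValueError("no attainable score below the target")
    return max(lower)


# The three examples from the slides.
for scores in ([397, 400, 402], [398, 401, 403], [399, 402, 405]):
    print(scores, cut_score_until_2006(scores), cut_score_2007(scores))
```

The rules differ only in the second example, where 401 is nearest to 400 but 398 is the next lowest attainable score.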

37. Vertical Scaling

38. Vertical Scale Examinee performance across grade levels on a single scale Measure individual student growth Locate all items across grade level on a single scale Proficiency standard from different grade levels to a single scale

39. Vertical Scaling vs. Equating Equating: scores on different test forms to be used interchangeably within grade level Vertical scaling: Performance across all grade levels on the same scale Measure students' growth Not equating

40. Data Collection Design Common item design Common items between adjacent grade levels Select appropriate level items to each grade Equivalent group design Same examinees Take on-grade test or off-grade test (usually lower grade test)

41. Common Item Design (WASL)

42. Previous Vertical Linking Study Math in Grades 3, 4, and 5 Purpose of the study How much are students growing over time? What is the precision of these estimates?

43. Data The data consists of items used in the pilot test for Grades 3 and 5 in 2004 and 2005 Operational data for Grade 4 in 2005

44. Linking Design Items across all forms in three grades Each form within grade includes a common block of items Common item non-equivalent groups design

45. Common Item Design (WASL)

46. Item Review (Item Means)

47. Item Review

48. Results Comparing the p-values for the linking items across grades suggests some instability Growth is larger from grades 3 to 4 than grades 4 to 5 Pilot data vs. operational data Motivation factor (G4 to G5) Backward Equating

49. Future Plan Vertical linking study will be conducted in January 2008 using the 2007 reading WASL. The results will be presented next year.
