
Adventures in Equating Land:



Presentation Transcript


  1. Adventures in Equating Land: Facing the Intra-Individual Consistency Index Monster* *Louis Roussos retains all rights to the title

  2. Overview of Equating Designs and Methods • Designs • Single Group • Random Groups • Common Item Nonequivalent Groups (CING) • Methods • Mean • Linear • Equipercentile • IRT True or Observed
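As a rough illustration of the mean and linear methods listed on slide 2, the sketch below applies the textbook definitions under a random groups design: mean equating shifts a Form X score by the difference in form means, and linear equating also matches the standard deviations. The function names and score arrays are hypothetical and do not come from the presentation.

```python
import numpy as np

def mean_equate(x, scores_x, scores_y):
    """Mean equating: shift x by the difference in form means."""
    return x - np.mean(scores_x) + np.mean(scores_y)

def linear_equate(x, scores_x, scores_y):
    """Linear equating: match both the mean and the standard deviation."""
    mu_x, mu_y = np.mean(scores_x), np.mean(scores_y)
    sd_x, sd_y = np.std(scores_x), np.std(scores_y)
    return sd_y / sd_x * (x - mu_x) + mu_y

# Hypothetical observed scores on two forms (illustrative only).
scores_x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
scores_y = np.array([13.0, 16.0, 19.0, 22.0, 25.0])

equated_mean = mean_equate(14.0, scores_x, scores_y)
equated_linear = linear_equate(18.0, scores_x, scores_y)
```

Equipercentile and IRT methods need score distributions or item parameters and are not covered by this sketch.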

  3. Guidelines for Selecting Common Items for Multiple-Choice (MC) Only Exams • Representative of the total test (Kolen & Brennan, 2004) • 20% of the total test • Same item positions • Similar average/spread of item difficulties (Dorans, Kubiak, & Melican, 1997) • Content representative (Klein & Jarjoura, 1985)

  4. Challenges in Equating Mixed-Format Tests (Kolen & Brennan, 2004; Muraki, Hombo, & Lee, 2000) • Constructed Response (CR) scored by raters • Small number of tasks • Inadequate sampling of construct • Changes in construct across forms • Common Items • Content/difficulty balance of common items • MC only may result in inadequate representation of groups/construct • IRT • Small number of tasks may result in unstable parameter estimates • Typically assume a single dimension underlies both item types • Format Effects

  5. Current Research • Number of CR Items • Smaller RMSD with larger numbers of items and/or score points (Li and Yin, 2008; Fitzpatrick and Yen, 2001) • Misclassification (Fitzpatrick and Yen, 2001) • With fewer than 12 items, more score points resulted in smaller error rates • With more than 12 items, error rates were less than 10% regardless of score points • Trend Scoring (Tate, 1999, 2000; Kim, Walker, & McHale, 2008) • Rescoring samples of CR items • Smaller bias and equating error

  6. Current Research (Cont.) • Format Effects (FE) • MC and CR measure similar constructs (Ercikan et al., 1993; Traub, 1993) • Males scored higher on MC; females higher on CR (DeMars, 1998; Garner & Engelhard, 1999) • Kim and Kolen, 2006 • Narrow-range tests (e.g., credentialing) • Wide-range tests (e.g., achievement) • Individual Consistency Index (Tatsuoka & Tatsuoka, 1982) • Detecting aberrant response patterns • Not specifically in the context of mixed-format tests

  7. Purpose and Research Questions • Purpose: Examine the impact of equating mixed-format tests when student subscores differ across item types. Specifically: • To what extent does the intra-individual consistency of examinee responses across item formats impact equating results? • How does the selection of common items differentially impact equating results with varying levels of intra-individual consistency?

  8. Data • “Old Form” (OL) treated as “truth” • Large-scale 6th grade testing program • Mathematics • 54 point test • 34 multiple choice (MC) • 5 short answer (SA) • 5 constructed response (CR) worth 4 points each • Approx. 70,000 examinees • “New Form” (NE) • Exactly the same items as OL • Samples of examinees from OL

  9. [Diagram] OL (old form): all examinees, 2006-07 Scoring Test, 39 items • NE (new form): samples of 3,000 examinees, 2006-07 Scoring Test, 39 items • Both OL and NE contain exactly the same items • The only difference between the forms is the examinees

  10. Intra-Individual Consistency • Consistency of student responses across formats • Regression of polytomous item subscores (CR) on dichotomous item subscores (MC and SA) • Standardized residuals • Range from approximately -4.00 to +8.00 • Example: Index of +2.00 • Student subscore on CR under-predicted by two standard deviations based on MC subscores
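One way the index on slide 10 could be computed is sketched below, assuming a simple least-squares regression of each examinee's CR subscore on the combined MC and SA subscore, with residuals divided by their standard deviation. The slides do not state the exact regression or standardization used, so the function name, the single-predictor model, and the simulated subscores are all assumptions.

```python
import numpy as np

def consistency_index(dichotomous, cr):
    """Standardized residuals from regressing CR subscores on MC+SA subscores."""
    dichotomous = np.asarray(dichotomous, dtype=float)
    cr = np.asarray(cr, dtype=float)
    # np.polyfit returns coefficients from highest degree down: slope, intercept.
    slope, intercept = np.polyfit(dichotomous, cr, deg=1)
    residuals = cr - (intercept + slope * dichotomous)
    # ddof=2 accounts for the two estimated regression coefficients.
    return residuals / residuals.std(ddof=2)

# Simulated subscores, for illustration only.
rng = np.random.default_rng(0)
dich = rng.integers(5, 40, size=500)          # hypothetical MC+SA subscores
cr = 0.4 * dich + rng.normal(0, 2, size=500)  # hypothetical CR subscores
index = consistency_index(dich, cr)
```

Under this convention a positive index means the observed CR subscore is higher than the regression predicts, which matches the +2.00 example on the slide.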

  11. Samples • Three groups of examinees based on intra-individual consistency index • Below -1.50 (NEG) • -1.50 to +1.50 (MID) • Above +1.50 (POS) • 3,000 examinees per sample • Sampled from each group based on percentages • Samples selected to have same quartiles and median as whole group of examinees

  12. Sampling Conditions • 60/20/20 • 60% sampled from one of the groups (i.e., NEG, MID, POS) • 20% sampled from each of the remaining groups • Repeated for each of the three groups • 40/30/30
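The grouping on slide 11 and the 60/20/20 and 40/30/30 conditions on slide 12 can be sketched as follows. This is a minimal illustration with hypothetical function names and simulated index values. It draws simple random samples within each group and does not reproduce the additional matching of quartiles and medians mentioned on slide 11.

```python
import numpy as np

def assign_group(index, cut=1.50):
    """Label each examinee NEG, MID, or POS from the consistency index."""
    index = np.asarray(index, dtype=float)
    return np.where(index < -cut, "NEG", np.where(index > cut, "POS", "MID"))

def draw_sample(groups, dominant, dominant_share=0.60, n=3000, seed=0):
    """Draw n examinee positions: dominant_share from one group, rest split evenly."""
    rng = np.random.default_rng(seed)
    other_share = (1.0 - dominant_share) / 2.0
    chosen = []
    for label in ("NEG", "MID", "POS"):
        share = dominant_share if label == dominant else other_share
        pool = np.flatnonzero(groups == label)
        chosen.append(rng.choice(pool, size=round(n * share), replace=False))
    return np.concatenate(chosen)

# Simulated index values for roughly 70,000 examinees (illustrative only).
index = np.random.default_rng(1).normal(size=70_000)
groups = assign_group(index)
sample = draw_sample(groups, dominant="POS")  # the 60% POS condition
```

Passing `dominant_share=0.40` gives the 40/30/30 condition for the chosen group.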

  13. Common Items • Six sets of common items • MC only (12 points) • CR only (12 points) • MC (4) and CR (8) • MC (8) and CR (4) • MC (4), CR (4), and SA (4) • MC (7), CR (4), and SA (1) • Representative of total test in terms of content, difficulty and length

  14. Equating • Common-item nonequivalent groups design • Item parameters calibrated using Parscale 4.1 • 3-parameter logistic model (3PL) for MC items • 2PL model for SA items • Graded Response Model for CR items • IRT scale transformation • Mean/mean, mean/sigma, Stocking-Lord, and Haebara • IRT true score equating
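For the two moment-based scale transformations named on slide 14, the sketch below uses the standard definitions: the slope A and intercept B are estimated from the common items' parameter estimates on the new and old scales, and new-form parameters are then rescaled as a/A and A·b + B. The function names and parameter values are hypothetical. For graded response items the difficulty vector would also need the category thresholds, which is not handled here, and the Stocking-Lord and Haebara characteristic-curve methods are not shown.

```python
import numpy as np

def mean_sigma(b_new, b_old):
    """Mean/sigma: slope from the ratio of difficulty standard deviations."""
    A = np.std(b_old) / np.std(b_new)
    B = np.mean(b_old) - A * np.mean(b_new)
    return A, B

def mean_mean(a_new, a_old, b_new, b_old):
    """Mean/mean: slope from the ratio of mean discriminations."""
    A = np.mean(a_new) / np.mean(a_old)
    B = np.mean(b_old) - A * np.mean(b_new)
    return A, B

def rescale(a_new, b_new, A, B):
    """Place new-form item parameters on the old-form scale."""
    return np.asarray(a_new) / A, A * np.asarray(b_new) + B

# Hypothetical common-item estimates; the old scale is built with A=1.2, B=0.5.
a_new = np.array([0.8, 1.0, 1.2, 1.5])
b_new = np.array([-1.0, -0.2, 0.4, 1.1])
a_old, b_old = a_new / 1.2, 1.2 * b_new + 0.5

A_ms, B_ms = mean_sigma(b_new, b_old)
A_mm, B_mm = mean_mean(a_new, a_old, b_new, b_old)
a_scaled, b_scaled = rescale(a_new, b_new, A_ms, B_ms)
```

With noise-free parameters both methods recover the same constants. With estimated parameters they generally differ, which is one reason to compare several methods.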

  15. Equating OL and NE • [Diagram] All items are shared in common between OL and NE • Equating conducted using only a selection of items treated as “common” items • “Truth” established by equating NE to OL using all items as common items

  16. Evaluation • Bias and RMSE • At each score point • Averaged over score points • Classification Consistency
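The bias and RMSE criteria on slide 16 can be illustrated with the usual definitions, assuming several replicated equatings per condition and a "true" equated score at each raw score point. The slides do not say how many replications were used, so the array layout, function name, and values below are assumptions.

```python
import numpy as np

def bias_and_rmse(equated, truth):
    """equated: (replications, score points); truth: (score points,)."""
    error = np.asarray(equated, dtype=float) - np.asarray(truth, dtype=float)
    bias = error.mean(axis=0)                  # signed error at each score point
    rmse = np.sqrt((error ** 2).mean(axis=0))  # root mean squared error
    return bias, rmse

# Hypothetical equated scores from two replications at four score points.
truth = np.array([0.0, 1.0, 2.0, 3.0])
equated = np.array([[0.5, 1.0, 2.5, 3.0],
                    [-0.5, 1.0, 1.5, 3.0]])

bias, rmse = bias_and_rmse(equated, truth)
average_bias = bias.mean()  # averaged over score points
average_rmse = rmse.mean()
```

In this toy example the errors cancel, so the bias is zero at every score point while the RMSE is not. That is why the two criteria are reported together.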

  17. Results: 60% Mid

  18. Results: 40% Mid

  19. In the extreme…

  20. Across the Score Scale: Average Bias

  21. Across the Score Scale: Average RMSE

  22. Across the Score Scale: Misclassification Rates

  23. Classification Consistency: Proficient

  24. Discussion • Different equating results based on sampling conditions • Differences more exaggerated when using common-item sets with mostly CR items • Mid 60 most similar to the full data; small differences across common-item selections

  25. Limitations and Implications • Limitations • Sampling conditions • Common item selections • Only one equating method • Implications for future research • Sampling conditions, common item selections, additional equating methods • Other content areas and grade levels • Other testing programs • Simulation studies

  26. Thanks! • Rob Keller • Mike, Louis, Won, Candy, and Jessalyn
