VERTICAL SCALING Mister Ibik Arizona State University Presentation to the TNE Assessment Committee, October 30, 2006
Scaling Definition: Scaling is a process in which raw scores on a test are transformed to a new scale with desired attributes (e.g., mean, SD)
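For instance (one common choice, not the only one), a linear scaling to a desired mean M and standard deviation S is

$$\text{scaled score} = M + S\,\frac{x - \bar{x}}{s_x},$$

where x is a raw score and x̄ and s_x are the raw-score mean and standard deviation; values such as M = 500 and S = 100 are typical reporting scales.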
Scaling Purposes: 1. Reporting scores on a convenient metric 2. Providing a common scale on which scores from different forms of a test can be reported (after equating or linking)
Scaling There are two distinct testing situations where scaling is needed
Scaling SITUATION 1 • Examinees take different forms of a test for security reasons or at different times of year • Forms are designed to the same specifications but may differ slightly in difficulty due to chance factors • Examinee groups taking the different forms are not expected to differ greatly in proficiency
Scaling SITUATION 2 • Test forms are intentionally designed to differ in difficulty • Examinee groups are expected to be of differing proficiency • EXAMPLE: test forms designed for different grade levels
EQUATING For SITUATION 1, we often refer to the scaling process as EQUATING. Equating is the process of mapping the scores on Test Y onto the scale of Test X so that we can say what the score of an examinee who took Test Y would have been had the examinee taken Test X (the scores are exchangeable)
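One classical method, linear equating, maps a score y on Test Y to the Test X scale by matching means and standard deviations:

$$x^{*} = \mu_X + \frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y),$$

where the means and standard deviations are estimated in comparable groups of examinees; equipercentile and IRT-based methods serve the same purpose.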
EQUATING This procedure is often called HORIZONTAL EQUATING
LINKING For SITUATION 2, we refer to the scaling process as LINKING, or scaling to achieve comparability. This process is sometimes called VERTICAL EQUATING, although equating is not strictly possible in this case
REQUIREMENTS FOR SCALING • In order to place the scores on two tests on a common scale, the tests must measure the same attribute; e.g., the scores on a reading test cannot be converted to the scale of a mathematics test
EQUATING DESIGNS FOR VERTICAL SCALING 1. COMMON PERSON DESIGN Tests to be equated are given to different groups of examinees with a common group taking both tests 2. COMMON ITEM (ANCHOR TEST) DESIGN Tests to be equated are given to different groups of examinees with all examinees taking a common subset of items (anchor items)
EQUATING DESIGNS FOR VERTICAL SCALING 3. EXTERNAL ANCHOR OR SCALING TEST DESIGN Different groups of examinees take different tests, but all take a common test in addition
Example of Vertical Scaling Design (Common Items) [Diagram: assignment of item blocks 1–4 to the Year 1, Year 2, and Year 3 test forms, with shared blocks serving as the common items linking adjacent years]
Problems with Vertical Scaling • If the construct or dimension being measured changes across grades/years/ forms, scores on different forms mean different things and we cannot reasonably place scores on a common scale • May be appropriate for a construct like reading; less appropriate for mathematics, science, social studies, etc.
Problems with Vertical Scaling • Both common-person and common-item designs have the practical problem that items may be too easy for one group and too hard for the other • Must ensure that examinees have had exposure to the content of the common items or the off-level test (in the common-person design, scores can be scaled down but not up)
Problems with Vertical Scaling • Scaled scores are not interpretable in terms of what a student knows or can do • Comparison of scores on scales that extend across several years is particularly risky
Example • For the multiple-choice model (Thissen & Steinberg, 1984), the probability of responding in response category k of item j as a function of proficiency (the item category response function) is:
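In a standard parameterization of this model, with slopes $a_{jk}$, intercepts $b_{jk}$, and "don't know" proportions $d_{jk}$ (the notation assumed here and below):

$$P(X_j = k \mid \theta) = \frac{\exp(a_{jk}\theta + b_{jk}) + d_{jk}\exp(a_{j0}\theta + b_{j0})}{\sum_{h=0}^{m_j}\exp(a_{jh}\theta + b_{jh})}, \qquad k = 1, \ldots, m_j,$$

where category 0 is a latent "don't know" category and the $d_{jk}$ (summing to 1 over k) are the proportions of "don't know" examinees who choose category k.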
A scale transformation • is computed using two sets of item parameter estimates for a group of common items. The two sets of parameter estimates are obtained from samples drawn from different populations. One set of item parameter estimates $(\hat{a}_{jk}, \hat{b}_{jk}, \hat{d}_{jk})$ is on the target scale, and the other set $(\tilde{a}_{jk}, \tilde{b}_{jk}, \tilde{d}_{jk})$ is on the current scale. • Assuming the two tests used to obtain the two sets of item parameter estimates for the common items are both calibrated under the multiple-choice model, item parameter estimates on the current θ scale are transformed to the target θ scale using the slope S and intercept I of the scale transformation:
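Writing the transformation from the current scale to the target scale as $\theta_{\text{target}} = S\,\theta_{\text{current}} + I$, and assuming the slope–intercept parameterization given above, the current-scale estimates transformed to the target scale are

$$\hat{a}^{*}_{jk} = \frac{\tilde{a}_{jk}}{S}, \qquad \hat{b}^{*}_{jk} = \tilde{b}_{jk} - \frac{I}{S}\,\tilde{a}_{jk}, \qquad \hat{d}^{*}_{jk} = \tilde{d}_{jk}.$$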
The parameters $d_{jk}$ are not affected by scale transformations. Note that because the two sets of parameter estimates contain sampling error, it will not be the case that $\hat{a}^{*}_{jk} = \hat{a}_{jk}$ and $\hat{b}^{*}_{jk} = \hat{b}_{jk}$ for all j and k, even if S and I are the slope and intercept of the true scale transformation from the current to the target scale.
The characteristic curve • method of estimating a scale transformation finds the values of S and I that minimize the squared differences between the item category response functions computed from the parameter estimates on the target scale and those computed from the current-scale parameter estimates transformed to the target scale:
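One way to write this criterion (a Haebara-type characteristic curve criterion, using the notation above and a set of proficiency points $\theta_i$ on the target scale, e.g., quadrature points) is

$$Q(S, I) = \sum_{i}\;\sum_{j \in \text{common}}\;\sum_{k=1}^{m_j}\Bigl[P_{jk}\bigl(\theta_i;\,\hat{a}_{jk},\hat{b}_{jk},\hat{d}_{jk}\bigr) - P_{jk}\bigl(\theta_i;\,\hat{a}^{*}_{jk},\hat{b}^{*}_{jk},\tilde{d}_{jk}\bigr)\Bigr]^{2},$$

minimized over S and I (cf. Haebara, 1980; Stocking & Lord, 1983).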
References
Baker, F. B. (1992). Equating tests under the graded response model. Applied Psychological Measurement, 16, 87–96.
Baker, F. B. (1993). Equating tests under the nominal response model. Applied Psychological Measurement, 17, 239–251.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
Dennis, J. E., & Schnabel, R. B. (1996). Numerical methods for unconstrained optimization and nonlinear equations. Philadelphia: Society for Industrial and Applied Mathematics.
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144–149.
Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. New York: Springer-Verlag.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17, 179–193.
Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14, 139–160.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201–210.
Thissen, D. (1991). MULTILOG user's guide: Multiple, categorical item analysis and test scoring using item response theory [Computer program]. Chicago: Scientific Software International.
Thissen, D., & Steinberg, L. (1984). A response model for multiple choice items. Psychometrika, 49, 501–519.
Thissen, D., & Steinberg, L. (1997). A response model for multiple-choice items. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 51–65). New York: Springer-Verlag.